Bioinformatics For Dummies, 2nd Edition

Published by
Wiley Publishing, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions.

Trademarks: Wiley, the Wiley Publishing logo, For Dummies, the Dummies Man logo, A Reference for the Rest of Us!, The Dummies Way, Dummies Daily, The Fun and Easy Way, Dummies.com, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services, please contact our Customer Care Department within the U.S. at 800-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

About the Authors

Jean-Michel Claverie is Professor of Medical Bioinformatics at the School of Medicine of the Université de la Méditerranée, and a consultant in genomics and bioinformatics. He is the founder and current head of the Structural & Genomic Information Laboratory, located in Marseilles, a sunny city on the Mediterranean coast of France. Using science as a pretext to travel, Jean-Michel has held positions in Paris (France) and Sherbrooke (PQ, Canada), and at the Salk Institute (La Jolla, CA), the Pasteur Institute (Paris), Incyte Pharmaceuticals (Palo Alto, CA), and the National Center for Biotechnology Information (Bethesda, MD). He has used computers in biology since the early days: his Ph.D. work involved modeling biochemical reactions by programming an 8K Honeywell 516 computer right from the console switches! Although he has no clear recollection of it, he has been credited with introducing the French word “bioinformatique” in the late eighties, before involuntarily coining the catchy “bioinformatics” by mistranslating it while giving a talk in English!

Jean-Michel’s current research interests are in microbial and structural genomics, and in the development of bioinformatic methods for the prediction of gene function. He is the author or coauthor of more than 150 scientific publications, and a member of numerous international review panels and scientific councils. In his spare time, he enjoys the relaxed pace of life in Marseilles, with his wife Chantal and their two sons, Nicholas and Raphael.

Cedric Notredame is a researcher at the French National Centre for Scientific Research. Cedric has used and abused the facilities offered by science to wander around Europe. After a Ph.D. at EMBL (Heidelberg, Germany) and at the European Bioinformatics Institute (Cambridge, UK) under the supervision of Des Higgins (yes, the ClustalW guy), Cedric did a post-doc at the National Institute for Medical Research (London, UK), in the lab of Willie Taylor and under the supervision of Jaap Heringa. He then did a post-doc in Lausanne (Switzerland) with Philipp Bucher, and remained involved with the Swiss Institute of Bioinformatics for several years. Having had his share of rain, snow, and wind, Cedric has finally settled in Marseilles, where the sun and the sea are simply warmer than in any other place he has lived.

Cedric dedicates most of his research to the multiple sequence alignment problem and its many applications in biology. His friends claim that his entire life (past, present, future) is somehow stuffed into the T-Coffee multiple-sequence alignment package. When he is not busy dismantling T-Coffee and brewing new sequences, Cedric enjoys life in the company of his wife, Marita.

Dedication

This is for my parents Monique and Jack, for keeping me in school, and for Chantal, for keeping me happy — in and out of the lab. It’s also for my daughter Vanessa, and my sons Nicholas and Raphael, for reminding me that not everything in life is scientific.

–– J-MC

This is for my wife Marita, my daughter Lina, my mother Marie and in memory of my grandparents, Simone and Louis.

–– CN

Authors’ Acknowledgments

The entire Wiley staff did a great job pulling together to publish this book on tight deadlines. We’d especially like to thank our tireless project editor, Paul Levesque, and Barry Childs-Helton, who did a great job copyediting a text full of obscure biochemical words.

We’d also like to thank Amey Godse, our technical editor. Amey nailed down major and minor inaccuracies alike. His many suggestions did much to improve the book.

We also have to thank the bioinformatics community for creating the many great Web resources that we describe in this book and for making them available for free over the Internet. We personally know a number of the folks who keep these sites up and running, and we salute all of them for their hard work, enthusiasm, and dedication. Topping this list are the staff members of the Swiss Institute of Bioinformatics, who run the ExPASy and Swiss EMBnet Web servers. They always went out of their way to answer any query regarding their sites. The NCBI folks have also been very helpful, and we thank them for that.

We also want to pat each other on the back for making the writing of this book great fun!

Finally, we’d like to thank our families and friends, who put up with missed dinners, extra child care, changing deadlines, late nights, and the many other demands of a project like this. We really appreciate their patience, and we promise that we won’t do another one . . . at least not anytime soon!

Publisher’s Acknowledgments

We’re proud of this book; please send us your comments through our online registration form located at www.dummies.com/register/.

Some of the people who helped bring this book to market include the following:

Acquisitions, Editorial, and Media Development

Project Editor: Paul Levesque

Acquisitions Editor: Melody Layne

Senior Copy Editor: Barry Childs-Helton

Technical Editor: Amey Godse

Editorial Manager: Leah Cameron

Media Development Specialists: Angela Denny, Kate Jenkins, Steven Kudirka, Kit Malone

Media Development Coordinator: Laura Atkinson

Media Project Supervisor: Laura Moss

Media Development Manager: Laura VanWinkle

Editorial Assistant: Amanda Foxworth

Sr. Editorial Assistant: Cherie Case

Cartoons: Rich Tennant (www.the5thwave.com)

Composition Services

Project Coordinator: Jennifer Theriot

Layout and Graphics: Carl Byers, Lavonne Cook, Barbara Moore, Shelley Norris, Barry Offringa, Laura Pence

Proofreaders: Susan Moritz, Charles Spencer, Rob Springer, Techbooks

Indexer: Techbooks

Anniversary Logo Design: Richard Pacifico

Publishing and Editorial for Technology Dummies

Richard Swadley, Vice President and Executive Group Publisher

Andy Cummings, Vice President and Publisher

Mary Bednarek, Executive Acquisitions Director

Mary C. Corder, Editorial Director

Publishing for Consumer Dummies

Diane Graves Steele, Vice President and Publisher

Joyce Pepple, Acquisitions Director

Composition Services

Gerry Fahey, Vice President of Production Services

Debbie Stailey, Director of Composition Services

Introduction

What This Book Does for You

Foolish Assumptions

How This Book Is Organized

Icons Used in This Book

Where to Go from Here

Part I: Getting Started in Bioinformatics

Chapter 1: Finding Out What Bioinformatics Can Do for You

What Is Bioinformatics?

Analyzing Protein Sequences

Analyzing DNA Sequences

Analyzing RNA Sequences

DNA Coding Regions: Pretending to Work with Protein Sequences

Working with Entire Genomes

Chapter 2: How Most People Use Bioinformatics

Becoming an Instant Expert with PubMed/Medline

Retrieving Protein Sequences

Retrieving DNA Sequences

Using BLAST to Compare My Protein Sequence to Other Protein Sequences

Making a Multiple Protein Sequence Alignment with ClustalW

Part II: A Survival Guide to Bioinformatics

Chapter 3: Using Nucleotide Sequence Databases

Reading into Genes and Genomes

Making Use (and Sense) of GenBank

Using a Gene-Centric Database

Working with Whole-Genome Databases

Exploring the Human Genome

Chapter 4: Using Protein and Specialized Sequence Databases

From Translated ORFs to Mature Proteins

Reading a Swiss-Prot Entry

Finding Out More about Your Protein

Chapter 5: Working with a Single DNA Sequence

Catching Errors Before It’s Too Late

Computing/Verifying a Restriction Map

Designing PCR Primers

Analyzing DNA Composition

Finding Protein-Coding Regions

Assembling Sequence Fragments

Beyond This Chapter

Chapter 6: Working with a Single Protein Sequence

Doing Biochemistry on a Computer

Doing Primary Structure Analysis

Predicting Post-Translational Modifications in Your Protein

Finding Known Domains in Your Protein

Discovering New Domains in Your Proteins

More Protein Analysis for Free over the Internet

Part III: Becoming a Pro in Sequence Analysis

Chapter 7: Similarity Searches on Sequence Databases

Understanding the Importance of Similarity

The Most Popular Data-Mining Tool Ever: BLAST

Controlling BLAST: Choosing the Right Parameters

Making BLAST Iterative with PSI-BLAST

Similarity Searches for Free over the Internet

Chapter 8: Comparing Two Sequences

Making Sure You Have the Right Sequences and the Right Methods

Making a Dot Plot

Making Local Alignments over the Internet

Making Global Alignments over the Internet

Using Lalign to Make a Global Alignment

Aligning Proteins and DNA

Free Pairwise Sequence Comparisons over the Internet

Chapter 9: Building a Multiple Sequence Alignment

Finding Out if a Multiple Sequence Alignment Can Help You

Choosing the Right Sequences

Choosing the Right Method of Multiple Sequence Alignment

Interpreting Your Multiple Sequence Alignment

Comparing Sequences That You Can’t Align

Internet Resources for Doing Multiple Sequence Comparisons

Chapter 10: Editing and Publishing Alignments

Getting Your Multiple Alignment in the Right Format

Using Jalview to Edit Your Multiple Alignment Online

Preparing Your Multiple Alignment for Publication

Editing and Analyzing Multiple Sequence Alignments for Free over the Internet

Part IV: Becoming a Specialist: Advanced Bioinformatics Techniques

Chapter 11: Working with Protein 3-D Structures

From Primary to Secondary Structures

From the Primary Structure to the 3-D Structure

Beyond This Chapter

Chapter 12: Working with RNA

Predicting, Modeling, and Drawing RNA Secondary Structures

Using Mfold

Searching Databases and Genomes for RNA Sequences

Finding the “New” RNAs: miRNAs and siRNAs

Doing RNA Analysis for Free over the Internet

Chapter 13: Building Phylogenetic Trees

Finding Out What Phylogenetic Trees Can Do for You

Preparing Your Phylogenetic Data

Building the Kind of Tree You Need

Doing Phylogeny for Free over the Internet

Part V: The Part of Tens

Chapter 14: The Ten (Okay, Twelve) Commandments for Using Servers

Keep in Mind: Your Data Is Never Secure on the Web

Remember the Server, the Database, and the Program Version You Used

Write Down the Sequence-Identification Numbers

Write Down the Program Parameters

Save Your Internet Results the Right Way

Use E-Values

Make Sure You Can Trust Your Alignments

Use Different Programs to Check Borderline Results

Stay Away from Unpublished Methods!

Databases Are Not Like Good Wine

Just Because It Looks Free Doesn’t Mean It Is Free . . .

Biting the Bullet at the Right Time

Chapter 15: Some Useful Bioinformatics Resources

Ten Major Databases

Ten Major Bioinformatics Software Programs

Ten Major Resource Locators

Some Places to Find Out What’s Really Going On

Introduction

Welcome to the second edition of Bioinformatics For Dummies!

In the first edition, we presented bioinformatics as a brand new discipline on the rise. How right we were! Since then, it has become so prominent that anybody with an interest in biology, biotechnology, modern medicine, or (for that matter) genetically engineered food or drugs simply cannot afford to remain ignorant about the topic. With this book, you’ve come to the right place to quickly learn the basics.

But wait — if you expect something complicated, you’re in for a (good or bad) surprise: Bioinformatics is nothing but good, sound, regular biology, appropriately dressed so it can fit into a computer.

Bioinformatics is about searching biological databases, comparing sequences, looking at protein structures, and (more generally) asking biological and biomedical questions with a computer. The bioinformatics we show you in this book can save you months of work in the lab at the minute cost of a few hours’ work with your computer.

Although you’ll find standard biological terms throughout, don’t look here for long equations and computer-geek gibberish. The purpose of this book is to show you quickly and plainly how to use the bioinformatics programs that you need to get your work done. On every page, we give you tricks and treats to get the most out of existing tools. If you didn’t know that you can use the most sophisticated programs for free over the Internet — and that you can do this (sometimes) without installing anything on your own computer — then stay tuned: You’re in for many more good surprises.

What This Book Does for You

This book is here to help you get things done. For every standard bioinformatics task you may want to undertake, you’ll find detailed steps that you can use to quickly produce the result you need.

To use most of the tools we describe in this book, you don’t need to install any program on your computer. Everything we show you here runs over the Internet via your Internet browser.

If you know what you want to do — or at least know the task by name — going through the Table of Contents is the best strategy for finding exactly what you need. If you have an idea of what you want to do but you’re not sure how to express it with words, Chapter 2 is here to help you decide which part of the book will suit your needs.

At the end of most chapters you’ll find a convenient “Doing It for Free over the Internet” section, where we list a few carefully chosen Web sites that are similar to those we describe in the rest of the chapter. Treat this information as a spare wheel! If the main site is down, this section probably lists a convenient replacement.

Foolish Assumptions

Putting a project’s assumptions right up front is just good policy. While writing this book, we have assumed that

You have a PC running Microsoft Windows.

You have an Internet connection (a fast one if possible, but not necessarily).

You likely have a background in molecular biology. If you don’t — or if you need to brush up on your molecular biology — Chapter 1 gives you a brief overview of the basics.

You know how to use an Internet browser but not much more about computers.

You don’t want to become a bioinformatics guru; you simply want to use the right tools for your problem and not spend days finding out about things you don’t need!

Most private biotech companies consider it unsafe to send data over the Internet. We assume here that the data you want to analyze over the Internet is not very confidential. Also, some of the “public” databases and services listed in this book require commercial users to enter into a license agreement.

How This Book Is Organized

Bioinformatics is a broad field, with many nooks and crannies, hills and dales, and other charming features. Rather than present the whole vast discipline in one fell swoop, we’ve divided our discussion into five (more manageable) parts.

Part I: Getting Started in Bioinformatics

If you have less than an hour to find out what bioinformatics can do for you, Part I is the right place for you! It tells you everything you need to know in order to actually do something with bioinformatics. In Part I, we also remind you of just those bits of molecular biology that you’ll need to know when you do sequence analysis. We show you here how to run the main bioinformatics tools so that you know what’s in store for you.

Part II: A Survival Guide to Bioinformatics

If you want to find out everything that’s ever been published on your sequence, this part is for you. It shows you how you can deal with the bioinformaticist’s bread and butter: DNA or protein sequences and their databases. Here we tell you where you can find all the available sequences, and how to find the one you really need among zillions of irrelevant others. We also show you how to gather everything that’s known in the universe about this special sequence that interests you so much (at least all of it that’s available online).

Part III: Becoming a Pro in Sequence Analysis

If you want to compare sequences, this is the part for you. Here we show you how to search databases for sequences that are similar to yours, as well as show you how to compare two or more sequences. This part also tells you how to gather hints about the function of a gene, through sequence comparisons. Finally, we give you pointers on how to produce, edit, and beautify your multiple sequence alignments so you can show them in presentations and publications.

Part IV: Becoming a Specialist: Advanced Bioinformatics Techniques

To take full advantage of this part, you should have a pretty good idea of what you’re looking for. Heavy stuff is going on here: how to predict a protein structure, how to predict an RNA structure, and how to do phylogenetic analysis. These are complicated subjects; it’s simply amazing what you can do with a simple PC, thanks to the Internet resources we describe in this part.

Part V: The Part of Tens

Welcome to our bazaar! If you haven’t found what you were looking for in the other parts, you’re now in the right place. The wealth of online resources that exist in bioinformatics is extraordinary — and almost overwhelming. With every student and his or her cousins putting semester reports online, finding exactly what you need with a simple keyword search can be a daunting task. In the Part of Tens, we give you a list of central resources that you can use as a starting point. Chances are that the program or server you’re looking for is only one or two clicks away. In this part, we also give you ten important pieces of advice to make sure that your lab work can safely depend on your Internet work.

Icons Used in This Book

Always eager to please, we’ve decided to use a series of icons in the margins of this book as a way to help you key in on important information. We came up with four, which seemed like a nice, round number.

Some particularly technoid information is coming up. You can skip it and nothing terrible will happen. Yet, if you want to be in full control of what you’re doing, reading this may help! Your call. . . .

This icon shows you something simple, or smart, or a cute shortcut. In any case, it’s something that can save you time and trouble.

There are many booby traps around when you use Internet servers. This icon warns you when some ambiguity surrounds what the server you’re using is up to — or when disaster is only one (wrong) mouse click away. Treat the Warning icon with respect — especially in a steps list!

This icon indicates something you should remember. It can be one of the few important principles that you need to know, or it can be a very special tip: the kind that can save you three days of work (or drive you nuts if you forget it). You may assume that the head of your institute/company got to the top by discovering and applying one or more of the pearls of wisdom in these very special tips!

Where to Go from Here

If you know nothing about bioinformatics, this book is here to reassure you. Bioinformatics is a much simpler subject than you ever thought possible. For most people new to this field, the main difficulty is finding out the kind of questions they can ask with these new tools. If you’re a biologist, don’t let the computer scare you; bioinformatics is nothing more than good, sound, regular biology hidden inside a computer.

The magic thing about bioinformatics is that, with a simple Internet connection, you can browse databases that contain the sum of our entire human biological knowledge — and you can do this with the most sophisticated tools ever developed by mankind. And how much is this going to cost you? Nothing!

If you do molecular biology, this is the equivalent of having an entire lab with expensive, state-of-the-art equipment and staffed by an army of post-docs who can go fetch anything you need any time you need it. The only difference is that you cannot set this lab on fire (even if you try very hard).

If you think about it, it is quite incredible that all this is right here, at your fingertips, one or two mouse clicks away! The Web is borderless; it is colorblind and unimpressed by wealth! Whether you come from a rich or a poor country, whether you’re a first-year student, a scientist, or a Nobel Prize winner, you have free access to the same high-quality information. No other scientific discipline has ever been so democratically widespread.

This book isn’t a textbook but a cookbook! And we take pride in this! It contains many recipes that colleagues showed us over the years or that we discovered ourselves. Accommodating and serving biological data is something very personal — and we’re sure that you’ll gradually find your own way to do it. In the meantime, if you need a quick fix, you can always use some of the off-the-shelf solutions that we provide here.

No discipline in science has benefited as much as biology from the “global village” phenomenon of the Internet. Whatever your question, whatever you want to do, starting on the Internet is the proper thing to do. Nonetheless, remember that the best and the worst appear online these days. Do as you do in real life — and trust only those sites or institutions that you know well.

This book is as up-to-date as we can make it, but the world doesn’t stand still right after we finish correcting the last galley proofs and send Bioinformatics For Dummies into the bookstores. For those of you who want up-to-date info on the growing field of bioinformatics (including lists of our favorite bioinformatics links) and don’t want to wait until the next edition, check out the Web site associated with this title at www.dummies.com/extras.

Sometimes browsing the Internet gives one the depressing feeling that everything has been done by others and that it’s all over. This may be true. Now that the whole world talks together, it’s clear that there’s a finite number of interesting questions to ask. That’s the bad news. The good news is that there are many more answers than there are questions! Never exclude the hypothesis that your answer may be the best in the universe (at least for a few days. . . .)!

Part I

Getting Started in Bioinformatics

In this part . . .

Bioinformatics is a new discipline, which means that nobody should feel ashamed if he or she doesn’t have a clue what the excitement’s all about. Don’t worry; after finishing this book, you’ll be speaking bioinformatics-speak with the best of them.

We start you off in Part I with a quick reminder of what you need to know about DNA and proteins to make sense of this book. We also give you an overview of the main bioinformatics tools available on the Internet.

We don’t give too many details here, but if all you need to know is which Internet page to open and which button to press, come on in, ’cuz we’ve got just what you need!