Bioinformatics For Dummies®, 2nd Edition
Published by
Wiley Publishing, Inc.
111 River St.
Hoboken, NJ 07030-5774
www.wiley.com
Copyright © 2007 by Wiley Publishing, Inc., Indianapolis, Indiana
Published by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, the Wiley Publishing logo, For Dummies, the Dummies Man logo, A Reference for the Rest of Us!, The Dummies Way, Dummies Daily, The Fun and Easy Way, Dummies.com, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 800-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002.
For technical support, please visit www.wiley.com/techsupport.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Library of Congress Control Number: 2006934844
ISBN13: 978-0-470-08985-9
ISBN10: 0-470-08985-7
Manufactured in the United States of America
Jean-Michel Claverie is Professor of Medical Bioinformatics at the School of Medicine of the Université de la Méditerranée, and a consultant in genomics and bioinformatics. He is the founder and current head of the Structural & Genomic Information Laboratory, located in Marseilles, a sunny city on the Mediterranean coast of France. Using science as a pretext to travel, Jean-Michel has held positions in Paris (France), Sherbrooke (PQ, Canada), the Salk Institute (La Jolla, CA), the Pasteur Institute (Paris), Incyte Pharmaceuticals (Palo Alto, CA), and the National Center for Biotechnology Information (Bethesda, MD). He has used computers in biology since the early days: his Ph.D. work involved modeling biochemical reactions by programming an 8K Honeywell 516 computer right from the console switches! Although he has no clear recollection of it, he has been credited with introducing the French word “bioinformatique” in the late eighties, before involuntarily coining the catchy “bioinformatics” by mistranslating it while giving a talk in English!
Jean-Michel’s current research interests are in microbial and structural genomics, and in the development of bioinformatic methods for the prediction of gene function. He is the author or coauthor of more than 150 scientific publications, and a member of numerous international review panels and scientific councils. In his spare time, he enjoys the relaxed pace of life in Marseilles, with his wife Chantal and their two sons, Nicholas and Raphael.
Cedric Notredame is a researcher at the French National Centre for Scientific Research. Cedric has used and abused the facilities offered by science to wander around Europe. After a Ph.D. at EMBL (Heidelberg, Germany) and at the European Bioinformatics Institute (Cambridge, UK) under the supervision of Des Higgins (yes, the ClustalW guy), Cedric did a post-doc at the National Institute for Medical Research (London, UK), in the lab of Willie Taylor and under the supervision of Jaap Heringa. He then did a post-doc in Lausanne (Switzerland) with Philipp Bucher, and remained involved with the Swiss Institute of Bioinformatics for several years. Having had his share of rain, snow, and wind, Cedric has finally settled in Marseilles, where the sun and the sea are simply warmer than in any other place he has lived.
Cedric dedicates most of his research to the multiple sequence alignment problem and its many applications in biology. His friends claim that his entire life (past, present, future) is somehow stuffed into the T-Coffee multiple-sequence alignment package. When he is not busy dismantling T-Coffee and brewing new sequences, Cedric enjoys life in the company of his wife, Marita.
This is for my parents Monique and Jack, for keeping me in school, and for Chantal, for keeping me happy — in and out of the lab. It’s also for my daughter Vanessa, and my sons Nicholas and Raphael, for reminding me that not everything in life is scientific.
–– J-MC
This is for my wife Marita, my daughter Lina, my mother Marie and in memory of my grandparents, Simone and Louis.
–– CN
The entire Wiley staff did a great job pulling together to publish this book on tight deadlines. We’d especially like to thank our tireless project editor, Paul Levesque, and Barry Childs-Helton, who did a great job copyediting a text full of obscure biochemical words.
We’d also like to thank Amey Godse, our technical editor. Amey nailed down major and minor inaccuracies alike. His many suggestions did much to improve the book.
We also have to thank the bioinformatics community for creating the many great Web resources that we describe in this book and for making them available for free over the Internet. We personally know a number of the folks who keep these sites up and running, and we salute all of them for their hard work, enthusiasm, and dedication. Topping this list are the staff members of the Swiss Institute of Bioinformatics, who run the ExPASy and Swiss EMBnet Web servers. They always went out of their way to answer any query regarding their sites. The NCBI folks have also been very helpful, and we thank them for that.
We also want to pat each other on the back for making the writing of this book great fun!
Finally, we’d like to thank our families and friends, who put up with missed dinners, extra child care, changing deadlines, late nights, and the many other demands of a project like this. We really appreciate their patience, and we promise that we won’t do another one . . . at least not anytime soon!
We’re proud of this book; please send us your comments through our online registration form located at www.dummies.com/register/.
Some of the people who helped bring this book to market include the following:
Acquisitions, Editorial, and Media Development
Project Editor: Paul Levesque
Acquisitions Editor: Melody Layne
Senior Copy Editor: Barry Childs-Helton
Technical Editor: Amey Godse
Editorial Manager: Leah Cameron
Media Development Specialists: Angela Denny, Kate Jenkins, Steven Kudirka, Kit Malone
Media Development Coordinator: Laura Atkinson
Media Project Supervisor: Laura Moss
Media Development Manager: Laura VanWinkle
Editorial Assistant: Amanda Foxworth
Sr. Editorial Assistant: Cherie Case
Cartoons: Rich Tennant (www.the5thwave.com)
Composition Services
Project Coordinator: Jennifer Theriot
Layout and Graphics: Carl Byers, Lavonne Cook, Barbara Moore, Shelley Norris, Barry Offringa, Laura Pence
Proofreaders: Susan Moritz, Charles Spencer, Rob Springer, Techbooks
Indexer: Techbooks
Anniversary Logo Design: Richard Pacifico
Publishing and Editorial for Technology Dummies
Richard Swadley, Vice President and Executive Group Publisher
Andy Cummings, Vice President and Publisher
Mary Bednarek, Executive Acquisitions Director
Mary C. Corder, Editorial Director
Publishing for Consumer Dummies
Diane Graves Steele, Vice President and Publisher
Joyce Pepple, Acquisitions Director
Composition Services
Gerry Fahey, Vice President of Production Services
Debbie Stailey, Director of Composition Services
Title
Introduction
What This Book Does for You
Foolish Assumptions
How This Book Is Organized
Icons Used in This Book
Where to Go from Here
Part I : Getting Started in Bioinformatics
Chapter 1: Finding Out What Bioinformatics Can Do for You
What Is Bioinformatics?
Analyzing Protein Sequences
Analyzing DNA Sequences
Analyzing RNA Sequences
DNA Coding Regions: Pretending to Work with Protein Sequences
Working with Entire Genomes
Chapter 2: How Most People Use Bioinformatics
Becoming an Instant Expert with PubMed/Medline
Retrieving Protein Sequences
Retrieving DNA Sequences
Using BLAST to Compare My Protein Sequence to Other Protein Sequences
Making a Multiple Protein Sequence Alignment with ClustalW
Part II : A Survival Guide to Bioinformatics
Chapter 3: Using Nucleotide Sequence Databases
Reading into Genes and Genomes
Making Use (and Sense) of GenBank
Using a Gene-Centric Database
Working with Whole-Genome Databases
Exploring the Human Genome
Chapter 4: Using Protein and Specialized Sequence Databases
From Translated ORFs to Mature Proteins
Reading a Swiss-Prot Entry
Finding Out More about Your Protein
Chapter 5: Working with a Single DNA Sequence
Catching Errors Before It’s Too Late
Computing/Verifying a Restriction Map
Designing PCR Primers
Analyzing DNA Composition
Finding Protein-Coding Regions
Assembling Sequence Fragments
Beyond This Chapter
Chapter 6: Working with a Single Protein Sequence
Doing Biochemistry on a Computer
Doing Primary Structure Analysis
Predicting Post-Translational Modifications in Your Protein
Finding Known Domains in Your Protein
Discovering New Domains in Your Proteins
More Protein Analysis for Free over the Internet
Part III : Becoming a Pro in Sequence Analysis
Chapter 7: Similarity Searches on Sequence Databases
Understanding the Importance of Similarity
The Most Popular Data-Mining Tool Ever: BLAST
Controlling BLAST: Choosing the Right Parameters
Making BLAST Iterative with PSI-BLAST
Similarity Searches for Free over the Internet
Chapter 8: Comparing Two Sequences
Making Sure You Have the Right Sequences and the Right Methods
Making a Dot Plot
Making Local Alignments over the Internet
Making Global Alignments over the Internet
Using Lalign to Make a Global Alignment
Aligning Proteins and DNA
Free Pairwise Sequence Comparisons over the Internet
Chapter 9: Building a Multiple Sequence Alignment
Finding Out if a Multiple Sequence Alignment Can Help You
Choosing the Right Sequences
Choosing the Right Method of Multiple Sequence Alignment
Interpreting Your Multiple Sequence Alignment
Comparing Sequences That You Can’t Align
Internet Resources for Doing Multiple Sequence Comparisons
Chapter 10: Editing and Publishing Alignments
Getting Your Multiple Alignment in the Right Format
Using Jalview to Edit Your Multiple Alignment Online
Preparing Your Multiple Alignment for Publication
Editing and Analyzing Multiple Sequence Alignments for Free over the Internet
Part IV : Becoming a Specialist: Advanced Bioinformatics Techniques
Chapter 11: Working with Protein 3-D Structures
From Primary to Secondary Structures
From the Primary Structure to the 3-D Structure
Beyond This Chapter
Chapter 12: Working with RNA
Predicting, Modeling, and Drawing RNA Secondary Structures
Using Mfold
Searching Databases and Genomes for RNA Sequences
Finding the “New” RNAs: miRNAs and siRNAs
Doing RNA Analysis for Free over the Internet
Chapter 13: Building Phylogenetic Trees
Finding Out What Phylogenetic Trees Can Do for You
Preparing Your Phylogenetic Data
Building the Kind of Tree You Need
Doing Phylogeny for Free over the Internet
Part V : The Part of Tens
Chapter 14: The Ten (Okay, Twelve) Commandments for Using Servers
Keep in Mind: Your Data Is Never Secure on the Web
Remember the Server, the Database, and the Program Version You Used
Write Down the Sequence-Identification Numbers
Write Down the Program Parameters
Save Your Internet Results the Right Way
Use E-Values
Make Sure You Can Trust Your Alignments
Use Different Programs to Check Borderline Results
Stay Away from Unpublished Methods!
Databases Are Not Like Good Wine
Just Because It Looks Free Doesn’t Mean It Is Free . . .
Biting the Bullet at the Right Time
Chapter 15: Some Useful Bioinformatics Resources
Ten Major Databases
Ten Major Bioinformatics Software Programs
Ten Major Resource Locators
Some Places to Find Out What’s Really Going On
Welcome to the second edition of Bioinformatics For Dummies!
In the first edition, we presented bioinformatics as a brand new discipline on the rise. How right we were! Since then, it has become so prominent that anybody with an interest in biology, biotechnology, modern medicine, or (for that matter) genetically engineered food or drugs simply cannot afford to remain ignorant about the topic. With this book, you’ve come to the right place to quickly learn the basics.
But wait — if you expect something complicated, you’re in for a (good or bad) surprise: Bioinformatics is nothing but good, sound, regular biology, appropriately dressed so it can fit into a computer.
Bioinformatics is about searching biological databases, comparing sequences, looking at protein structures, and (more generally) asking biological and biomedical questions with a computer. The bioinformatics we show you in this book can save you months of work in the lab at the minute cost of a few hours’ work with your computer.
Although you’ll find standard biological terms throughout, don’t look here for long equations and computer-geek gibberish. The purpose of this book is to show you quickly and plainly how to use the bioinformatics programs that you need to get your work done. On every page, we give you tricks and treats to get the most out of existing tools. If you didn’t know that you can use the most sophisticated programs for free over the Internet — and that you can do this (sometimes) without installing anything on your own computer — then stay tuned: You’re in for many more good surprises.
This book is here to help you get things done. For every standard bioinformatics task you may want to undertake, you’ll find detailed steps that you can use to quickly produce the result you need.
To use most of the tools we describe in this book, you don’t need to install any program on your computer. Everything we show you here runs over the Internet via your Internet browser.
If you know what you want to do — or at least know the task by name — going through the Table of Contents is the best strategy for finding exactly what you need. If you have an idea of what you want to do but you’re not sure how to express it with words, Chapter 2 is here to help you decide which part of the book will suit your needs.
At the end of most chapters you’ll find a convenient “Doing It for Free over the Internet” section, where we list a few carefully chosen Web sites that are similar to those we describe in the rest of the chapter. Treat this information as a spare wheel! If the main site is down, this section probably lists a convenient replacement.
Putting a project’s assumptions right up front is just good policy. While writing this book, we have assumed that
You have a PC running Microsoft Windows.
You have an Internet connection (a fast one if possible, but not necessarily).
You likely have a background in molecular biology. If you don’t — or if you need to brush up on your molecular biology — Chapter 1 gives you a brief overview of the basics.
You know how to use an Internet browser but not much more about computers.
You don’t want to become a bioinformatics guru; you simply want to use the right tools for your problem and not spend days finding out about things you don’t need!
Most private biotech companies consider it unsafe to send data over the Internet. We assume here that the data you want to analyze over the Internet is not very confidential. Also, some of the “public” databases and services listed in this book require commercial users to enter into a license agreement.
Bioinformatics is a broad field, with many nooks and crannies, hills and dales, and other charming features. Rather than present the whole vast discipline in one fell swoop, we’ve divided our discussion into five (more manageable) parts.
If you have less than an hour to find out what bioinformatics can do for you, Part I is the right place for you! It tells you everything you need to know in order to actually do something with bioinformatics. In Part I, we also remind you of just those bits of molecular biology that you’ll need to know when you do sequence analysis. We show you here how to run the main bioinformatics tools so that you know what’s in store for you.
If you want to find out everything that’s ever been published on your sequence, Part II is for you. It shows you how you can deal with the bioinformaticist’s bread and butter: DNA or protein sequences and their databases. Here we tell you where you can find all the available sequences, and how to find the one you really need among zillions of irrelevant others. We also show you how to gather everything that’s known in the universe about this special sequence that interests you so much (at least all of it that’s available online).
If you want to compare sequences, Part III is the part for you. Here we show you how to search databases for sequences that are similar to yours, and how to compare two or more sequences. This part also tells you how to gather hints about the function of a gene through sequence comparisons. Finally, we give you pointers on how to produce, edit, and beautify your multiple sequence alignments so you can show them in presentations and publications.
To take full advantage of Part IV, you should have a pretty good idea of what you’re looking for. Heavy stuff is going on here: how to predict a protein structure, how to predict an RNA structure, and how to do phylogenetic analysis. These are complicated subjects, yet it’s simply amazing what you can do with a simple PC, thanks to the Internet resources we describe in this part.
Welcome to our bazaar! If you haven’t found what you were looking for in the other parts, you’re now in the right place. The wealth of online resources that exist in bioinformatics is extraordinary — and almost overwhelming. With every student and his or her cousins putting semester reports online, finding exactly what you need with a simple keyword search can be a daunting task. In the Part of Tens, we give you a list of central resources that you can use as a starting point. Chances are that the program or server you’re looking for is only one or two clicks away. In this part, we also give you ten important pieces of advice to make sure that your lab work can safely depend on your Internet work.
Always eager to please, we’ve decided to use a series of icons in the margins of this book as a way to help you key in on important information. We came up with four, which seemed like a nice, round number.
If you know nothing about bioinformatics, this book is here to reassure you. Bioinformatics is a much simpler subject than you ever thought possible. For most people new to this field, the main difficulty is finding out the kind of questions they can ask with these new tools. If you’re a biologist, don’t let the computer scare you; bioinformatics is nothing more than good, sound, regular biology hidden inside a computer.
The magic thing about bioinformatics is that, with a simple Internet connection, you can browse databases that contain the sum of our entire human biological knowledge — and you can do this with the most sophisticated tools ever developed by mankind. And how much is this going to cost you? Nothing!
If you do molecular biology, this is the equivalent of having an entire lab with expensive, state-of-the-art equipment and staffed by an army of post-docs who can go fetch anything you need any time you need it. The only difference is that you cannot set this lab on fire (even if you try very hard).
If you think about it, it is quite incredible that all this is right here, at your fingertips, one or two mouse clicks away! The Web is borderless; it is colorblind and unimpressed by wealth! Whether you come from a rich or a poor country, whether you’re a first-year student, a scientist, or a Nobel Prize winner, you have access, for free, to the same high-quality information. No other scientific discipline has ever been so democratically widespread.
This book isn’t a textbook but a cookbook! And we take pride in this! It contains many recipes that colleagues showed us over the years or that we discovered ourselves. Accommodating and serving biological data is something very personal — and we’re sure that you’ll gradually find your own way to do it. In the meantime, if you need a quick fix, you can always use some of the off-the-shelf solutions that we provide here.
No discipline in science has benefited as much as biology from the “global village” phenomenon of the Internet. Whatever your question, whatever you want to do, starting on the Internet is the proper thing to do. Nonetheless, remember that the best and the worst appear online these days. Do as you do in real life — and trust only those sites or institutions that you know well.
Sometimes browsing the Internet gives one the depressing feeling that everything has been done by others and that it’s all over. This may be true. Now that the whole world talks together, it’s clear that there’s a finite number of interesting questions to ask. That’s the bad news. The good news is that there are many more answers than there are questions! Never exclude the hypothesis that your answer may be the best in the universe (at least for a few days. . . .)!
In this part . . .
Bioinformatics is a new discipline, which means that nobody should feel ashamed if he or she doesn’t have a clue what the excitement’s all about. Don’t worry; after finishing this book, you’ll be speaking bioinformatics-speak with the best of them.
We start you off in Part I with a quick reminder of what you need to know about DNA and proteins to make sense of this book. We also give you an overview of the main bioinformatics tools available on the Internet.
We don’t give too many details here, but if all you need to know is which Internet page to open and which button to press, come on in, ’cuz we’ve got just what you need!