Best Practices for Computational Biology

Dr. Balaji S. Srinivasan

Departments of Statistics and Computer Science, Stanford University, Stanford CA 94305


Introduction

NOTE: THIS PAGE IS OLD. I don't have the time to update it. Some of the stuff is still useful, but a few major notes:

1. I've been using Python/Ubuntu for a long time now -- Perl is faster for quick stuff, but definitely harmful for large projects.
2. In python, the numpy/scipy/matplotlib stack is extremely powerful; when combined with sage, rpy2, cython, and django it really can't be beat for scientific computing and quickly building webapps.
3. I'm a huge fan of Conkeror, which is the offspring of emacs + firefox. Very worthwhile to learn.
4. I recommend getting very good with elisp -- Robert Chassell's book (get the 2nd edition) is outstanding. Do 'chassell elisp filetype:pdf' on google to get it.
5. Git is MUCH better than svn. Having symbolic links in the repo is key. Interactive rebase even more key. And automated git bisect run = the key-iest.
6. Trac is excellent for working in large groups, though Redmine is even more featureful (though it's Ruby rather than Python based)
7. Use Fireworks CS4 and the symbol library for real mockups, use Omnigraffle Pro with the stencils for fast mockups/figs, and use Photoshop sparingly.
You should Google anything that I didn't link explicitly (e.g., 'what is git bisect run'). Good luck :)

BEGIN OLD STUFF

This document covers a number of tips, tricks, and heuristics that I've compiled while doing research in computational biology. Think of it as the sort of thing that is usually not formally taught, but which would have been great to know when beginning graduate school.

The content is broadly applicable to any data-analytic field, but comp bio will be my running example. I've structured the document as a chronologically ordered workflow, though in practice there will be significant feedback between the stages (e.g. data analysis will lead you to collect more data or read more papers, and so on). If you have any comments or further suggestions, please email me at balajis_at_stanford_dot_edu and I may incorporate them into a later version.

Get Comfortable

  1. Your very first task is to get comfortable. Get a good chair, an ergonomic keyboard, and a comfortable footrest. Make sure it is more comfortable to sit than it is to get up. If you've found a place to study - a library, a conference room, a restaurant - that you find unusually conducive to productivity, it is often because the seating is comfortable enough for you to focus on work. It may sound silly, but it is worth taking a tape measure to duplicate the seating dimensions of this "comfortable place" at your desk or office at work. Spending time on your working environment is much like learning to touch type or speed read: it pays for itself many times over in the long run.

  2. When it is time to work, end all interruptions. Turn off email notification sounds, cell phones, PDAs, and other gizmos. Keep food and drinks nearby so that you can refuel without getting up when your brain tires from exertion. If you work in an office with several other people in the vicinity, wearing headphones is a good way to block out sounds and politely signal that you are busy. They also give you an excuse to ignore anything short of someone physically tapping you on the shoulder. Another possibility is to do your "high concentration" work early in the morning or late at night, when other people aren't around.

Read Literature

  1. Finding papers
    1. Nature Reviews and Annual Reviews: great for browsing and generating ideas
    2. Faculty of 1000: set as your homepage.
    3. Pubmed, Citeseer and Scisearch: use for searching primary literature. Pubmed is only for biomedical papers, while Citeseer and Scisearch include CS and Statistics.
    4. My NCBI, complex queries tutorial, and Pubmed Tutorial.
    5. Saved Searches: keep up on your competition.
    6. ISI Impact Factors and Highwire Press citation maps
    7. Firefox with Tab Mix Plus and Biobar: browse dozens of papers simultaneously and run PubMed queries directly from your browser.

  2. Reading papers

    1. Overall comments
      1. Browse first, seriously read only if promising
      2. Read critically - many results are wrong.
      3. Read rapidly - there are thousands of papers out there.
      4. Use hyperlinks and Pubmed to find all the references related to your area, following references till saturation.
      5. For casual reading: read pdfs onscreen, keeping all pdfs related to given search in a directory.
      6. Don't manually retitle files; use tools which can search pdfs (e.g. Google Desktop or Spotlight) to find local pdfs.
      7. If a pdf is interesting, save it to Citeulike (see the "Referencing papers" section below).
      8. For serious reading: print out all relevant pdfs and get them spiral bound at Kinko's. This is much more efficient to read and more permanent than stapling.

    2. Specific tips
      1. Start with title, and then brutally filter on institution and journal (if feeling guilty, rationalize as Bayesian prior)
      2. Abstract should summarize the main point and be sufficiently compelling to get the full pdf.
      3. Given full pdf, ignore main text. Focus on figures, figure captions, and conclusions.
      4. If technical details seem iffy, skim materials and methods to look for flaws.
      5. If all seems promising, print article to read it in detail. Note: it is generally useless to read mathematical articles without pen and paper in hand. Very important to constantly write and rewrite the authors' equations. Good heuristic: ask yourself whether you could make an exam or homework problem out of the authors' algorithm or paper. Doing this will really make sure you understand their algorithmic details.
      6. For third pass: download & read supplementary info for guts of materials/methods.

  3. Ranking papers and authors

    1. Ranking is useful to get prior probabilities on significance, relevance, etc.
    2. Faculty of 1000: peer filtered literature, excellent first stop.
    3. Highwire Press: citation map for finding seminal papers in the field.
    4. Ranking journals: ISI Impact factors, useful for seeing journal trends. Example: Cell vs. Nature Biotechnology.
    5. Ranking individual scientists: h-factor for eminence.

  4. Referencing papers

    1. Do not try to manually update reference numbering or to use bookmarks to keep track of refs. This does not scale.
    2. Citeulike.org: outstanding web application for organizing resources. Use one click reference tool to organize your bibliography. Use pseudonym if security of reading list is an issue.
    3. Google Notebook can also be very handy for recording all kinds of links to things that are not papers. Can be used in conjunction with citeulike for non-paper references (e.g. class websites, academic ftp sites, etc.)
    4. Refworks is similar to CiteULike, and probably more useful if you work with MS Word documents rather than TeX documents. Access is free for Stanford researchers.
    5. TexMed: A bit quicker than CiteULike for putting together reference list from scratch.
    6. BibTeX: LaTeX reference management. (Cross platform: Windows, Mac, Linux).
    7. EndNote: Word reference management. Windows and Mac *only*.
    8. Jabref: Java based frontend for BibTeX, exports to RTF.

Obtain Data

  1. Where?

    1. Major public databases: NCBI, EBI/EMBL, DDBJ.
    2. Worth going through the NCBI tutorial to learn about all the features of NCBI's labyrinthine site.
    3. Specialized databases: UCSC genome browser, Hapmap (SNP data), Stanford Microarray Database (microarrays), KEGG (pathways), RegulonDB (transcription factors), etc.
    4. NAR Database issue: January each year. Comprehensive searchable list of databases.
    5. Supplementary Info: get on article by article basis. One example: Butland coli paper and supp info.
    6. Personal communication and websites: very iffy. Nag and phone if you don't get a response.

  2. How?

    1. Bulk FTP download (preferred). LFTP, ncFTP, FireFTP, etc.
    2. HTML download. If FTP unavailable, may use DownThemAll or similar Firefox plugins to speed up download process.
    3. wget is very useful for quick downloading at the command line; given appropriate command flags, it can also do fairly sophisticated things such as recursive mirrors of websites.
    4. Scripted HTML download (for recalcitrant sites). Use Perl LWP module for this. Be careful as scripting a server is often not looked upon favorably. Do it at night and limit your requests per second.
    5. It is often worth your time to explore all the subdirectories of an ftp site.
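The advice above about limiting requests per second when scripting a download can be sketched as follows. This is a minimal illustration in Python using the standard library's urllib rather than the Perl LWP module mentioned above; the function names (`fetch_url`, `polite_fetch`) and the 2-second default interval are assumptions chosen for the example, not a recommendation from any particular server.

```python
import time
import urllib.request


def fetch_url(url):
    """Download one URL and return its body as bytes (standard library only)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def polite_fetch(urls, fetch=fetch_url, min_interval=2.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Fetch URLs one at a time, leaving at least min_interval seconds between requests.

    fetch, clock, and sleep are parameters so the pacing logic can be
    exercised without touching the network (hypothetical design choice).
    """
    results = {}
    last_request = None
    for url in urls:
        if last_request is not None:
            # Sleep only for whatever part of the interval has not already elapsed.
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        results[url] = fetch(url)
    return results
```

Calling `polite_fetch(list_of_urls)` returns a dict mapping each URL to its downloaded bytes. Raising `min_interval` is the simplest way to be gentler on a server.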

  3. When?
    1. Download ALL data at beginning of project (locally mirror).
    2. Resist the urge to make incremental updates to downloaded data (e.g. an August update if you downloaded in June). The cost is usually not worth the benefit.

Programming, Scripting, and Formatting Data

  1. General comments on formats and organization
    1. Major file formats: tab-delimited text, XML, FASTA, GenBank, many others (ASN, Phylip, ClustalW, etc.)
    2. When in doubt, use tab delimited text if at all possible for all internal results. Tab delimited text files are easily human readable, easy to parse by computer, and the default inputs to many programs.
    3. Usually a good idea to put some thought into project directory organization at the beginning of a project. Separate raw data, code, results, and so on.
    4. Don't reinvent the wheel when handling data (e.g. XML). Use an established parser such as the ones from CPAN (see also below re: Perl).
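As one possible way to act on the directory-organization advice above, the sketch below creates a project skeleton that keeps raw data, code, and results apart. The specific subdirectory names are hypothetical; the text only asks that these categories be separated.

```python
from pathlib import Path

# Hypothetical layout; rename the subdirectories to suit the project.
SUBDIRS = ("data/raw", "data/processed", "code", "results", "doc")


def make_project(root):
    """Create the directory skeleton under root and return the created paths."""
    root = Path(root)
    created = []
    for sub in SUBDIRS:
        path = root / sub
        # exist_ok=True makes the function safe to rerun on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Running `make_project("myproject")` at the start of a project costs nothing and makes it harder to mix downloaded files with derived results later.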

  2. Programming, Scripting, and Formatting Data
    1. Subversion: better version of CVS, a repository to store all code which will have multiple revisions (R, Perl, Python, C++, etc.). Pays for itself in terms of self documentation, data backup, etc. Here is a quick Subversion tutorial.
    2. Linux: Find a machine on which you can install Linux from scratch and which has a decent amount (at least 100 GB) of hard drive space. Use Fedora Core if in doubt about distribution - no matter what kind of software you're looking for, you can be sure a binary version is available for Fedora Core. You might also want to consider Ubuntu, which is becoming more popular. The reason you want your own machine is that root access is simply necessary to do a lot of things that you will want to do (such as installing programs, running web servers and sql databases, and so on).
    3. Unix command line: crucial to learn OS for industrial strength data processing. Start by running the command learn on Stanford's Unix machines. Eventually go through Linux Cookbook and Unix Power Tools. Use Emacs shortcuts at command line. Set up aliases in /etc/bashrc to switch directories rapidly. Become familiar with process of Unix program installation: download tar.gz file, unzip it, unpack the tar archive, and then run configure; make; make install.
    4. GNU textutils: command line processing of text files. Crucial to learn for rapid text processing. See below for examples.
    5. Emacs: use for editing all kinds of code (bash, Perl, R, html, LaTeX, etc.). Start with the emacs tutorial. Use M-x term to run a terminal from within Emacs. Use emacs -nw rather than GUI emacs or xemacs for maximum responsiveness. Consider remapping Caps Lock to Control for maximum speed.
    6. If you aren't planning on using R or Matlab, you might also want to consider vim rather than emacs; it is a powerful text editor which a lot of C++ programmers favor. However, ESS and Matlab-mode make emacs generally better for most statisticians.
    7. Perl intro: use for more heavyweight parsing or downloading than can be accomplished at command line. Start with Beginning Perl for Bioinformatics, then Mastering Perl for Bioinformatics, and then the Perl Cookbook.
    8. Perl tips: use the debugger (perl -d yourscript.pl at command line). See Beginning Perl for Bioinformatics book for tutorial. Use the profiler (perl -d:DProf) if code is slow. Use CPAN to install new modules semiautomatically with perl -MCPAN -e shell. PDL is a bit clumsy but useful if you need to do some basic matrix math within Perl (though you should generally save this for R or Matlab as much as possible).
    9. Perl vs. Python: Perl is quicker and more versatile, Python is neater and more scalable. If you wish to use Python, see Zach Rahan's Python installation tutorial.
    10. Avoid C++/Java and compiled languages when doing string processing and table manipulation! These languages are not really meant for this purpose and will make your life much harder. Ask yourself whether your program is meant to interact with data or with humans. In the latter case, Java may be useful; in the former case it rarely is.
    11. C++ is sometimes useful when you have a specific step (e.g. computationally intense p-value calculation) which is very slow. In this case one can use XS to interface C++ with Perl. But don't try doing regex parsing or serious text processing in C++.
    12. SQL databases: May seem to be attractive at first, but they can REALLY slow you down and take quite a bit of programming overhead. Only use them if you can't fit the relevant tables into RAM; it is often worth manually writing the relevant joins and hashes, as hard-disk based databases operate orders of magnitude slower than hashes held in RAM. Use flat files of tab-delimited text if at all possible.
    13. screen is a must have Unix program which allows you to run multiple shells from within one window, in much the same way you can run multiple tabs within firefox, rather than opening up a bunch of different ssh connections. It's particularly valuable if you are using putty.exe (on Windows) or Terminal.app (on OS X) or a Linux box which isn't running a GUI. You can get an example .screenrc configuration file. There is a bit of a learning curve but it's worth spending the time to know this app.
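To make the point about writing joins by hand with hashes held in RAM concrete, here is a minimal sketch of an in-memory inner join over two tab-delimited tables. It is written in Python rather than Perl, and the table contents and column names (`gene`, `chrom`, `level`) are made up for illustration.

```python
import csv
import io


def read_table(handle):
    """Read tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def hash_join(left, right, key):
    """Inner-join two lists of dicts on key, using a dict (hash) built from right."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # Each lookup is a hash probe in RAM, so there is no disk access per row.
        for match in index.get(row[key], []):
            merged = dict(match)
            merged.update(row)
            joined.append(merged)
    return joined


# Made-up example tables; real ones would be opened with open(path, newline="").
genes = read_table(io.StringIO("gene\tchrom\nrecA\t1\nlacZ\t1\n"))
expr = read_table(io.StringIO("gene\tlevel\nrecA\t5.2\nrecA\t4.8\nthrA\t1.0\n"))
rows = hash_join(genes, expr, "gene")
```

Only keys present in both tables survive, which is the behaviour of an SQL inner join. The approach assumes the smaller table's index fits comfortably in memory.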

  3. Unix: a few more details
    1. Use --help, man, and info to find out what programs do. For example, cut --help, man cut, and info cut will give successively more detailed information about the cut program.
    2. Basic text processing and filtering: grep, cut, paste, head, tail, uniq, sort, nl, cat, wc, comm, split
    3. Calculations at the command line: bc, awk
    4. Advanced stuff where you can really see the true power of UNIX: find, xargs, tee, awk, sed
    5. A few examples of tasks I've done in the last few days, to give some idea.
      1. List the directories and files in the current directory such that they can be distinguished with a trailing '/' (-F flag), identify the directories by grepping for the trailing '/' and pull out only those rows, remove the trailing slash with sed, use awk to build a command line that will tar up the directory, and finally execute the whole shebang by piping it to bash.

        ls -F | grep '\/$' | sed 's/\/$//g' | awk '{print "tar -cf "$0".tar "$0;}' | bash

      2. The "proper" way to do this is with the while construct; however, the disadvantage of this is that you can't preview the command sequence quite as easily.

        ls -F | grep '\/$' | sed 's/\/$//g' | while read filename; do tar -cf "${filename}.tar" "${filename}"; done

      3. Find the files in the subdirectories below the current directory, grep for those that end with a trailing '.txt', take the top 15 lines of each of these files, and pipe the result to less so you can page through the results.

        find ./ | grep '\.txt$' | xargs head -15 | less

      4. Find all the rows where the fifth column does not equal -1 in the file demo.txt, cut out the resulting columns 1-5 among the resulting rows, sort the result by the 1st column with a reverse (-r) numeric (-n) sort, retain only those lines which are not adjacent to identical lines, and print out the resulting number of lines with wc -l.

        awk '{if ($5 != -1) { print;}}' demo.txt | cut -f1-5 | sort -k1 -r -n | uniq | wc -l

    6. These are just examples; for more of a flavor of what can be done see this tutorial on pipes in Unix and this list of awk commands.
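For comparison, the filter, cut, sort, uniq, and count pipeline from the last example can be sketched in Python. This is only an illustration under stated assumptions: columns 1 and 5 are taken to be numeric, the input is tab-delimited, and the sample rows stand in for demo.txt, whose real contents are not shown here.

```python
import io


def distinct_rows(handle):
    """Return distinct 5-column rows whose fifth column is not -1, sorted by column 1 descending."""
    rows = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Assumption: columns 1 and 5 are numeric; shorter lines are skipped.
        if len(fields) >= 5 and float(fields[4]) != -1:
            # Keeping only the first five fields mirrors cut -f1-5;
            # the set collapses duplicates, as uniq does on sorted input.
            rows.add(tuple(fields[:5]))
    # Reverse numeric sort on column 1, with the whole row as a tie-breaker.
    return sorted(rows, key=lambda row: (float(row[0]), row), reverse=True)


# Made-up stand-in for demo.txt; a real file would be opened with open("demo.txt").
demo = io.StringIO(
    "3\ta\tb\tc\t1\tx\n"
    "2\ta\tb\tc\t-1\ty\n"
    "3\ta\tb\tc\t1\tz\n"
    "10\ta\tb\tc\t0\tw\n"
)
result = distinct_rows(demo)
print(len(result))  # the count that wc -l reports at the end of the pipeline
```

The one-line shell version is quicker to write. A script like this becomes worthwhile once the filter logic outgrows what is comfortable inside a single awk expression.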

Analyze Data

  1. Given the tab delimited text files which you have so carefully built from downloaded data with Unix commands and Perl/Python, you need to analyze them. You will want to use R or possibly Matlab for this purpose.

  2. R for statistics
    1. Install R with all packages (including Bioconductor). Often useful to install from source with options like configure --enable-R-profiling --enable-R-shlib --enable-linux-lfs (see documentation for meanings).
    2. To install all new packages, from within R run the commands: v <- new.packages(); install.packages(v). To install all available packages, run: v <- available.packages(); packlist <- as.vector(v[,1]); install.packages(packlist). You can mix and match the relevant update commands to make sure that you have up to date versions of every package. Note that you shouldn't upgrade very frequently; new code often breaks old code in subtle ways!
    3. ESS: this is absolutely a must have for anyone who uses R. Allows you to control R from within Emacs and significantly boosts the speed of debugging and code development. See the ESS tutorial here.
    4. Useful R packages: while there are many useful packages, it is well worth your time to learn RGL and lattice graphics. Conditional plots and 3D visualization are crucial tools when visualizing high dimensional data sets.
    5. Many examples of R graphics are at the R Graphics repository to give you some ideas.
    6. Save all R code in your subversion repository, just as you do with your Perl code, Matlab code, etc.

  3. Matlab for matrix algebra

    1. In general, Matlab is more about design and synthesis than analysis (think sine waves, not scatterplots). It does not play well with missing data and is meant more for engineers than statisticians.
    2. For hardcore matrix algebra it is sometimes necessary, but statistical analysis is considerably more difficult in Matlab (stats toolbox notwithstanding).
    3. Matlab-mode: This is much like ESS and allows control of Matlab from within Emacs. A must have.
    4. Comment: An enterprising undergraduate or master's student might decide to knock off Matlab's open source matrix algebra routines and incorporate them into R. This would be a nontrivial undertaking but would significantly enhance our ability to analyze high dimensional data, which often requires both statistics and heavy linear algebra (e.g. eigenvalues of huge matrices for subsequent MDS or PCA, best done with Arnoldi-Lanczos iterative algorithm implemented in Matlab as eigs())
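To give a feel for what an iterative eigenvalue solver does, here is a minimal power-iteration sketch in plain Python. Power iteration is a much simpler relative of the Arnoldi-Lanczos methods mentioned above, not an implementation of them: it finds only the single dominant eigenpair of a symmetric matrix and converges more slowly. The 2x2 matrix is a made-up example.

```python
import math


def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, iterations=200, tol=1e-12):
    """Estimate the dominant eigenvalue and eigenvector of a symmetric matrix."""
    n = len(matrix)
    # Assumption: this start vector is not orthogonal to the dominant eigenvector.
    vector = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = matvec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # the matrix maps the vector to zero; nothing more to do
        new_vector = [x / norm for x in product]
        # Rayleigh quotient of the unit-length vector gives the eigenvalue estimate.
        new_eigenvalue = sum(
            x * y for x, y in zip(new_vector, matvec(matrix, new_vector))
        )
        converged = abs(new_eigenvalue - eigenvalue) < tol
        vector, eigenvalue = new_vector, new_eigenvalue
        if converged:
            break
    return eigenvalue, vector


# Made-up symmetric matrix; its largest eigenvalue is (7 + sqrt(5)) / 2.
top_value, top_vector = power_iteration([[4.0, 1.0], [1.0, 3.0]])
```

Each step needs only a matrix-vector product, which is why methods in this family scale to matrices far too large for a full decomposition. For real work, an established Lanczos or Arnoldi routine is the better choice.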

Write Paper

  1. Writing the Main Paper
    1. LaTeX: A way to write structured technical documents with complicated equations, figures, references, etc. Do not use for presentations or posters, though - use Keynote instead with LaTeXit (if on Mac) or Powerpoint (if on Windows).
    2. LyX: A graphical front end for LaTeX in Linux.
    3. TeXShop: Mac LaTeX distribution, highly recommended
    4. MiKTeX + WinEdt: Windows LaTeX distribution and editor, best of breed
    5. Sweave: use for embedding dynamic R code in LaTeX as one self-documenting file. Design decisions must be made during the generation of a Sweave document as some data sets are too big, some computations too long, and some figures too complicated to rebuild or recompute each time a document is generated. In general, though, Sweave will save you time.
    6. Emacs keybindings in Word (ctrl-f for office on that link): If you cannot use LaTeX for whatever reason, you may still want to have Emacs keybindings as they dramatically speed up writing (since you can keep your fingers in the touch typing position).
    7. On a related note, see here to enable global Emacs keybindings in OS X. Download a sample KeyBindings file.
    8. And see here for Emacs keybindings in Firefox.

  2. Generating Figures

    1. Figure generation is best done on a Mac. No other platform combines the ability to script with top-flight graphics programs like Omnigraffle, Illustrator, Preview, and Photoshop. Both Linux and Windows will frustrate you!

    2. For general Mac tips, I highly recommend OS X for Oceanographers and OS X for Physicists.

    3. Important note: try to export graphics as vector format like PDF or postscript whenever possible, rather than raster format like BMP, JPEG, PNG, etc. Vector formats allow unlimited zooming and rescaling, which is very useful for slideshow presentations.

    4. Key programs
      1. Omnigraffle: probably the single best program for technical drawing out there. OmniOutliner is also very useful.
      2. Keynote 2: superior poster and presentation generation. Make sure to get Keynote 2 in iWork 2006.
      3. LaTeXit: Allows easy incorporation of LaTeX formulas into any application via drag and drop of images. Frequently updated.
      4. Photoshop: Unbeatable set of features for raster image manipulation.
      5. Illustrator: Use for generation of vector graphics when Omnigraffle is insufficient. Make sure to check out LiveTrace if dealing with hand drawn sketches.
      6. iPhoto: use for organizing and searching for technical photos generated while coding.
      7. Adium: allows embedding of LaTeX equations in IM transcripts. Quite useful for long distance collaborations.

  3. Handling References
    1. See "Referencing papers" in the Read Literature section for more details.
    2. BibTeX: Probably the best way to organize technical references. Cross platform.
    3. Endnote: Windows and Mac application for reference management.

About this document ...

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split 0 best_practices.tex

The translation was initiated by System Administrator on 2007-03-15

