Best Practices for Computational Biology
Dr. Balaji S. Srinivasan
Departments of Statistics and Computer Science, Stanford University, Stanford CA 94305
NOTE: THIS PAGE IS OLD. I don't have the time to update it. Some of the stuff is still useful, but a few major notes:
1. I've been using Python/Ubuntu for a long time now -- Perl is faster for quick stuff, but definitely harmful for large projects.
2. In python, the numpy/scipy/matplotlib stack is extremely powerful; when combined with sage, rpy2, cython, and django it really can't be beat for scientific computing and quickly building webapps.
3. I'm a huge fan of Conkeror, which is the offspring of emacs + firefox. Very worthwhile to learn.
4. I recommend getting very good with elisp -- Robert Chassell's book (get the 2nd edition) is outstanding. Do 'chassell elisp filetype:pdf' on google to get it.
5. Git is MUCH better than svn. Having symbolic links in the repo is key. Interactive rebase even more key. And automated git bisect run = the key-iest.
6. Trac is excellent for working in large groups, though Redmine is even more featureful (it is Ruby-based rather than Python-based).
7. Use Fireworks CS4 and the symbol library for real mockups, use Omnigraffle Pro with the stencils for fast mockups/figs, and use Photoshop sparingly.
You should Google anything that I didn't link explicitly (viz., 'what is git bisect run'). Good luck :)
BEGIN OLD STUFF
This document covers a number of tips, tricks, and heuristics that
I've compiled while doing research in computational biology. Think of
it as the sort of thing that is usually not formally taught, but which
would have been great to know when beginning graduate school.
The content is quite broadly applicable to any kind of data-analytic
field, but comp bio will be my running example. I've structured the
document in terms of a chronologically ordered workflow, though in
practice there will be significant feedback between various stages
(i.e. data analysis will lead you to collect more data or read more
papers, and so on). If you have any comments or further suggestions,
please email me at balajis_at_stanford_dot_edu and I may
incorporate them into a later version.
- Your very first task is to get comfortable. Get a good
chair, an ergonomic keyboard, and a comfortable footrest. Make sure
it is more comfortable to sit than it is to get up. If you've found
a place to study - a library, a conference room, a restaurant -
that you find unusually conducive to productivity, it is often
because the seating is comfortable enough for you to focus on work.
It may sound silly, but it is worth taking a tape measure to
duplicate the seating dimensions of this "comfortable place" at
your desk or office at work. Spending time on your working
environment is much like learning to touch type or speed read: it
pays for itself many times over in the long run.
- When it is time to work, end all interruptions. Turn
off email notification sounds, cell phones, PDAs, and other gizmos.
Keep food and drinks nearby so that you can refuel without getting
up when your brain tires from exertion. If you work in an office
with several other people in the vicinity, wearing headphones is a
good way to block out sounds and politely signal that you are busy.
They also give you an excuse to ignore anything short of someone
physically tapping you on the shoulder. Another possibility is to
do your "high concentration" work early in the morning or late at
night, when other people aren't around.
- Finding papers
- Nature Reviews and Annual Reviews: great for browsing and generating ideas
- Faculty of 1000: set as your homepage.
- Pubmed, Citeseer and Scisearch: use for searching primary literature. Pubmed is only for biomedical papers, while Citeseer and Scisearch include CS and Statistics.
- My NCBI, complex queries tutorial, and Pubmed Tutorial.
- Saved Searches: keep up on your competition.
- ISI Impact Factors and Highwire Press citation maps
- Firefox with Tab Mix Plus and Biobar: browse dozens of papers simultaneously and run PubMed queries directly from your browser.
- Reading papers
- Overall comments
- Browse first, seriously read only if promising
- Read critically - many results are wrong.
- Read rapidly - there are thousands of papers out there.
- Use hyperlinks and Pubmed to find all the references related
to your area, following references till saturation.
- For casual reading: read pdfs onscreen, keeping all pdfs
related to given search in a directory.
- Don't manually retitle files, use tools which can search
pdfs (e.g. Google Desktop
or Spotlight) to find local pdfs.
- If pdf is interesting, save to Citeulike
(see "Referencing papers" section below).
- For serious reading: print out all relevant pdfs and get
them spiral bound at Kinko's. This is much more efficient to
read and more permanent than stapling.
- Specific tips
- Start with title, and then brutally filter on institution and journal (if feeling guilty, rationalize as Bayesian prior)
- Abstract should summarize the main point and be sufficiently compelling to get the full pdf.
- Given full pdf, ignore main text. Focus on figures, figure captions, and conclusions.
- If technical details seem iffy, skim materials and methods to look for flaws.
- If all seems promising, print article to read it in
detail. Note: it is generally useless to read mathematical
articles without pen and paper in hand. Very important to
constantly write and rewrite the authors' equations. Good
heuristic: ask yourself whether you could make an exam or
homework problem out of the authors' algorithm or paper. Doing
this will really make sure you understand their algorithmic
details.
- For third pass: download & read supplementary info for guts
of materials/methods.
- Ranking papers and authors
- Ranking is useful to get prior probabilities on significance, relevance, etc.
- Faculty of 1000: peer filtered literature, excellent first stop.
- Highwire Press: citation map for finding seminal papers in the field.
- Ranking journals: ISI Impact factors, useful for seeing journal trends. Example: Cell vs. Nature Biotechnology.
- Ranking individual scientists: h-index (h-factor) for eminence.
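The h-index mentioned above has a simple definition: it is the largest number h such that the scientist has h papers with at least h citations each. A minimal Python sketch of that calculation follows; the function name `h_index` and the citation counts are made up for illustration and are not taken from any real author.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts for five papers by one author.
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))  # prints 4
```

Four papers have at least four citations, but there are not five papers with at least five, so the value is 4. Like any single number, it is best treated as a rough prior rather than a verdict.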
- Referencing papers
- Do not try to manually update reference numbering or to use
bookmarks to keep track of refs. This does not scale.
- Citeulike.org: outstanding web application for organizing
resources. Use one click reference tool to organize your
bibliography. Use pseudonym if security of reading list is an
issue.
- Google Notebook can also be very handy for recording all kinds of links to things
that are not papers. Can be used in conjunction with citeulike for non-paper references
(e.g. class websites, academic ftp sites, etc.)
- Refworks is similar to CiteULike, and probably more useful if you work with MS Word documents rather than TeX documents. Access is free for Stanford researchers.
- TexMed: A bit quicker than CiteULike for putting together reference list from scratch.
- BibTeX: LaTeX reference management. (Cross platform: Windows, Mac, Linux).
- EndNote: Word reference management. Windows and Mac *only*.
- Jabref: Java based frontend for BibTeX, exports to RTF.
- Where?
- Major public databases: NCBI, EBI/EMBL, DDBJ.
- Worth going through NCBI tutorial
to learn about all features of NCBI's labyrinthine site.
- Specialized databases: UCSC genome browser, Hapmap (SNP data),
Stanford Microarray Database (microarrays), KEGG (pathways),
RegulonDB (transcription factors), etc.
- NAR Database issue: January each year. Comprehensive searchable list of databases.
- Supplementary Info: get on article by article basis. One example: Butland coli paper and supp info.
- Personal communication and websites: very iffy. Nag and phone if you don't get a response.
- How?
- Bulk FTP download (preferred). LFTP, ncFTP, FireFTP, etc.
- HTML download. If FTP unavailable, may use DownThemAll or similar Firefox plugins to speed up download process. wget is very useful for quick downloading at the command line; given appropriate command flags, it can also do fairly sophisticated things such as recursive mirrors of websites.
- Scripted HTML download (for recalcitrant sites). Use Perl LWP module for this. Be careful, as scripting a server is often not looked upon favorably. Do it at night and limit your requests per second.
- It is often worth your time to explore all the subdirectories of an ftp site.
- When?
- Download ALL data at beginning of project (locally mirror).
- Resist urge to make incremental updates to downloaded data
(i.e. August update if downloaded in June). Cost is usually not worth the benefit.
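The advice above about scripted downloads (limit your requests per second) can be sketched in a few lines. The text recommends Perl's LWP module; the version below swaps in Python's standard `urllib` purely for illustration. The function names `fetch_url` and `polite_download`, the one-second default delay, and the example URL in the usage note are all assumptions, not part of any site's published policy.

```python
import time
import urllib.request


def fetch_url(url, timeout=30):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def polite_download(urls, fetch=fetch_url, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests.

    `fetch` and `sleep` are parameters so the loop can be tried out
    without touching the network.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # throttle: wait before every request but the first
        results[url] = fetch(url)
    return results
```

A call might look like `polite_download(["http://example.org/data/page1.html"], delay_seconds=2.0)`, where the URL is a placeholder. Check the site's terms and robots.txt before scripting it, and prefer bulk FTP whenever it is offered.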
- General comments on formats and organization
- Major file formats: tab-delimited text, XML, FASTA, GenBank, many others (ASN, Phylip, ClustalW, etc.)
- When in doubt, use tab delimited text if at all possible for
all internal results. Tab delimited text files are easily human
readable, easy to parse by computer, and the default inputs to many
programs.
- Usually a good idea to put some thought into project directory organization
at the beginning of a project. Separate raw data, code, results, and so on.
- Don't reinvent the wheel when handling data (e.g. XML). Use an established parser
such as the ones from CPAN (see also below re: Perl).
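As an illustration of why tab-delimited text is easy to work with, and of using an established parser rather than hand-rolled splitting, here is a small sketch built on Python's standard `csv` module. The helper names `read_tsv` and `write_tsv` and the two-row gene table are hypothetical; a real project would read from a file in its raw data directory.

```python
import csv
import io


def read_tsv(handle):
    """Parse tab-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


def write_tsv(rows, fieldnames, handle):
    """Write a list of dicts back out as tab-delimited text."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical table; real input would come from something like open("genes.txt").
sample = io.StringIO("gene\tlength\nthrA\t2463\nthrB\t933\n")
rows = read_tsv(sample)
print(rows[1]["gene"], rows[1]["length"])  # values are strings, not numbers
```

Every value comes back as a string, so numeric columns need an explicit `int()` or `float()` before any arithmetic.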
- Programming, Scripting, and Formatting Data
- Subversion: better version of CVS, a repository to store all code which will have
multiple revisions (R, Perl, Python, C++, etc.). Pays for itself in terms
of self documentation, data backup, etc. Here is a quick Subversion tutorial.
- Linux: Find a machine on which you can install Linux from scratch and which has
a decent amount (at least 100 GB) of hard drive space. Use Fedora Core if in doubt
about distribution - no matter what kind of software you're looking for, you can be sure a binary version is available for Fedora Core. You might also want to consider Ubuntu, which is becoming more popular. The reason you want your own machine is that root access is simply necessary to do a lot of things that you will want to do (such as installing programs, running web servers and sql databases, and so on).
- Unix command line: crucial to learn OS for industrial
strength data processing. Start by running the command learn
on Stanford's Unix machines. Eventually go through Linux Cookbook
and Unix Power Tools. Use Emacs shortcuts at command line. Set up
aliases in /etc/bashrc to switch directories rapidly. Become
familiar with process of Unix program installation: download
tar.gz file, unzip it, unpack the tar archive, and then run
./configure; make; make install.
- GNU textutils: command line processing of text files. Crucial to learn for
rapid text processing. See below for examples.
- Emacs: use for editing all kinds of code (bash, Perl, R, html,
LaTeX, etc.). Start with the emacs tutorial. Use M-x term to
run a terminal from within Emacs. Use emacs -nw rather than
GUI emacs or xemacs for maximum responsiveness.
Consider remapping Caps Lock to Control for maximum speed.
- If you aren't planning on using R or Matlab, you might also want to consider vim
rather than emacs; it is a powerful text editor which a lot of C++ programmers favor. However, ESS and Matlab-mode make emacs generally better for most statisticians.
- Perl intro: use for more heavyweight parsing or downloading
than can be accomplished at command line. Start with Beginning Perl for Bioinformatics, then Mastering Perl for Bioinformatics, and then the Perl Cookbook.
- Perl tips: use the debugger (perl -d yourscript.pl at
command line). See Beginning Perl for Bioinformatics book for tutorial.
Use the profiler (perl -d:DProf) if code is slow. Use CPAN to install
new modules semiautomatically with perl -MCPAN -e shell. PDL
is a bit clumsy but useful if you need to do some basic matrix
math within Perl (though you should generally save this for R or
Matlab as much as possible).
- Perl vs. Python: Perl is quicker and more versatile, Python
is neater and more scalable. If you wish to use Python, see Zach Rahan's
Python installation tutorial.
- Avoid C++/Java and compiled languages when doing string
processing and table manipulation! These languages are not really
meant for this purpose and will make your life much harder. Ask yourself
whether your program is meant to interact with data or with humans. In the latter
case, Java may be useful; in the former case it rarely is.
- C++ is sometimes useful when you have a specific step (e.g.
computationally intense p-value calculation) which is very slow.
In this case one can use XS to interface C++ with Perl. But don't try
doing regex parsing or serious text processing in C++.
- SQL databases: May seem to be attractive at first, but they
can REALLY slow you down and take quite a bit of programming
overhead. Only use them if you can't fit the relevant tables
into RAM; it is often worth manually writing the relevant joins
and hashes, as hard-disk based databases operate orders of
magnitude slower than hashes held in RAM. Use flat files of tab-delimited text if at
all possible.
- screen is a must have Unix program which allows you to run multiple
shells from within one window, in much the same way you can run
multiple tabs within firefox, rather than opening up a bunch of
different ssh connections. It's particularly valuable if you are
using putty.exe (on Windows) or Terminal.app (on OS X) or a Linux box
which isn't running a GUI. You can get an example .screenrc
configuration file. There is a bit of a learning curve but it's worth
spending the time to know this app.
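The SQL bullet above suggests writing joins by hand with hashes held in RAM when the tables fit in memory. The sketch below shows one way to do that in Python (the text itself has Perl hashes in mind); the function name `hash_join` and both small tables are hypothetical examples, not data from any real database.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, left_key, right_key):
    """Inner-join two lists of tuples on the given column indexes.

    Builds an in-memory index (a hash) over the right table, then
    streams the left table through it.
    """
    index = defaultdict(list)
    for row in right_rows:
        index[row[right_key]].append(row)

    joined = []
    for row in left_rows:
        # .get avoids creating empty entries for keys with no match
        for match in index.get(row[left_key], []):
            joined.append(row + match)
    return joined


# Hypothetical tables: (gene, organism) and (gene, pathway).
genes = [("thrA", "E. coli"), ("lacZ", "E. coli"), ("xyz1", "unknown")]
pathways = [("thrA", "threonine biosynthesis"), ("lacZ", "lactose catabolism")]
for record in hash_join(genes, pathways, left_key=0, right_key=0):
    print("\t".join(record))
```

Rows without a partner (here the made-up gene xyz1) are dropped, as in an SQL inner join. Once the tables no longer fit in RAM this approach stops working, which is the point at which the text says a real database becomes worth its overhead.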
- Unix: a few more details
- Use --help, man, and info to find out what programs do.
For example, cut --help, man cut, and info cut will give successively
more detailed information about the cut program.
- Basic text processing and filtering: grep, cut, paste, head, tail, uniq, sort, nl, cat, wc, comm, split
- Calculations at the command line: bc, awk
- Advanced stuff where you can really see the true power of UNIX: find, xargs, tee, awk, sed
- A few examples of tasks I've done in the last few days, to give some idea.
- List the directories and files in the current directory such that
they can be distinguished with a trailing '/' (-F flag), identify the
directories by grepping for the trailing '/' and pull out only those
rows, remove the trailing slash with sed, use awk to build a command
line that will tar up the directory, and finally execute the whole
shebang by piping it to bash.
ls -F | grep '\/$' | sed 's/\/$//g' | awk '{print "tar -cf "$0".tar "$0;}' | bash
- The "proper" way to do this is with the
while construct; however, the disadvantage of this is that you can't preview the command sequence quite as easily.
ls -F | grep '\/$' | sed 's/\/$//g' | while read filename; do tar -cf "${filename}.tar" "${filename}"; done
- Find the files in the subdirectories below the current
directory, grep for those that end with a trailing '.txt', take the
top 15 lines of each of these files, and pipe the result to less so
you can page through the results.
find ./ | grep '\.txt$' | xargs head -15 | less
- Find all the rows where the fifth column does not equal -1 in
the file demo.txt, cut out the resulting columns 1-5 among the
resulting rows, sort the result by the 1st column with a reverse
(-r) numeric (-n) sort, retain only those lines which are not
adjacent to identical lines, and print out the resulting number of
lines with wc -l.
awk '{if ($5 != -1) { print;}}' demo.txt | cut -f1-5 | sort -k1 -r -n | uniq | wc -l
- These are just examples; for more of a flavor of what can be done see this tutorial on pipes in Unix
and this list of awk commands.
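For readers more at home in a scripting language, the awk | cut | sort | uniq | wc example above can be spelled out step by step. This is a rough Python equivalent rather than an exact one: it assumes tab-separated fields, numeric values in columns 1 and 5, and at least five columns per row (shorter rows are skipped). The function name `count_distinct_rows` and the sample rows are invented for illustration.

```python
import io
from itertools import groupby


def count_distinct_rows(handle):
    """Count distinct 5-column rows whose fifth column is not -1."""
    kept = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # assumption: skip short rows rather than guess
        if float(fields[4]) != -1:          # awk '$5 != -1'
            kept.append(tuple(fields[:5]))  # cut -f1-5
    # sort -k1 -r -n: numeric sort on the first column, descending;
    # the full row breaks ties so identical rows end up adjacent.
    kept.sort(key=lambda row: (float(row[0]), row), reverse=True)
    # uniq collapses runs of identical adjacent rows; wc -l counts them.
    return sum(1 for _ in groupby(kept))


# Hypothetical stand-in for demo.txt with six columns per row.
demo = io.StringIO(
    "3\ta\tb\tc\t1\tx\n"
    "1\ta\tb\tc\t-1\ty\n"
    "3\ta\tb\tc\t1\tz\n"
    "2\ta\tb\tc\t0\tq\n"
)
print(count_distinct_rows(demo))
```

The one-line pipeline is far shorter, which is the argument of this section; the longer form mainly helps when a step needs to be inspected or changed.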
- Given the tab delimited text files which you have so carefully built from downloaded
data with Unix commands and Perl/Python, you need to analyze them. You will
want to use R or possibly Matlab for this purpose.
- R for statistics
- Install R with all packages (including Bioconductor). Often useful to install
from source with options like
configure --enable-R-profiling --enable-R-shlib --enable-linux-lfs (see documentation for meanings).
- To install all new packages, from within R run the commands:
v <- new.packages(); install.packages(v).
To install all available packages, run: v <- available.packages(); packlist <- as.vector(v[,1]); install.packages(packlist).
You can mix and match the relevant update commands to make sure that you have up to date versions
of every package. Note that you shouldn't upgrade very frequently; new code often breaks old code in subtle ways!
- ESS: this is absolutely a must have for anyone who uses R.
Allows you to control R from within Emacs and
significantly boosts the speed of debugging and code development.
See the ESS tutorial here.
- Useful R packages: while there are many useful packages, it is
well worth your time to learn RGL and lattice graphics. Conditional
plots and 3D visualization are crucial tools when visualizing high
dimensional data sets.
- Many examples of R graphics are at the R Graphics repository
to give you some ideas.
- Save all R code in your subversion repository, just as you do with your Perl code,
Matlab code, etc.
- Matlab for matrix algebra
- In general, Matlab is more about design and synthesis than analysis (think sine waves, not scatterplots).
It does not play well with missing data and is meant more for engineers than statisticians.
- For hardcore matrix algebra it is sometimes necessary, but statistical analysis
is considerably more difficult in Matlab (stats toolbox notwithstanding).
- Matlab-mode: This is much like ESS and allows control of Matlab from within Emacs.
A must have.
- Comment: An enterprising undergraduate or master's student might decide to knock off Matlab's
open source matrix algebra routines and incorporate them into R. This would be a nontrivial
undertaking but would significantly enhance our ability to analyze high dimensional data, which
often requires both statistics and heavy linear algebra (e.g. eigenvalues of huge matrices for
subsequent MDS or PCA, best done with Arnoldi-Lanczos iterative algorithm implemented in Matlab
as eigs())
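To give a feel for why iterative eigenvalue methods suit huge matrices, here is a sketch of power iteration in plain Python. It is not the Arnoldi-Lanczos algorithm and not Matlab's eigs(); it is a much simpler relative that shares the key property of touching the matrix only through matrix-vector products. The function name `power_iteration`, the iteration count, and the 2x2 example matrix are assumptions chosen for illustration, and the sketch assumes a single dominant eigenvalue and a starting vector that is not orthogonal to its eigenvector.

```python
import math


def power_iteration(matrix, iterations=200):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix.

    `matrix` is a list of rows. Only matrix-vector products are needed,
    which is the same property Lanczos-type solvers exploit.
    """
    size = len(matrix)
    vector = [1.0] * size  # arbitrary nonzero starting vector
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(size)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]  # renormalize to unit length
    # Rayleigh quotient v^T A v (v already has unit length).
    product = [sum(row[j] * vector[j] for j in range(size)) for row in matrix]
    eigenvalue = sum(vector[i] * product[i] for i in range(size))
    return eigenvalue, vector


# Hypothetical 2x2 covariance-like matrix with eigenvalues 3 and 1.
covariance = [[2.0, 1.0], [1.0, 2.0]]
value, direction = power_iteration(covariance)
print(round(value, 6))
```

For PCA the returned vector would be the first principal direction. For real work a library routine is the better choice; in the Python stack, for example, `scipy.sparse.linalg.eigsh` wraps an ARPACK Lanczos solver.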
- Writing the Main Paper
- LaTeX: A way to write structured technical documents with complicated equations, figures, references, etc. Do not
use for presentations or posters, though - use Keynote instead with LaTeXit (if on Mac) or Powerpoint (if on Windows).
- LyX: A GUI interface for LaTeX in Linux.
- TeXShop: Mac LaTeX distribution, highly recommended
- MiKTeX + WinEdt: Windows LaTeX distribution and editor, best of breed
- Sweave: use for embedding dynamic R code in LaTeX as one self-documenting file. Design decisions must be made during
the generation of a Sweave document as some data sets are too big, some computations too long, and some figures too complicated
to rebuild or recompute each time a document is generated. In general, though, Sweave will save you time.
- Emacs keybindings in Word (ctrl-f for office on that link):
If you cannot use LaTeX for whatever reason, you may still want to have Emacs keybindings as they
dramatically speed up writing (since you can keep your fingers in the touch typing position).
- On a related note, see here to enable global Emacs keybindings in OS X. Download a sample KeyBindings file.
- And see here for Emacs keybindings in Firefox.
- Generating Figures
- Figure generation is best done on a Mac. No other platform
combines the ability to script with top-flight graphics programs
like Omnigraffle, Illustrator, Preview, and Photoshop. Both Linux and
Windows will frustrate you!
- For general Mac tips, I highly recommend OS X for Oceanographers
and OS X for Physicists.
- Important note: try to export graphics as vector format like PDF
or postscript whenever possible, rather than raster format like BMP,
JPEG, PNG, etc. Vector formats allow unlimited zooming and
rescaling, which is very useful for slideshow presentations.
- Key programs
- Omnigraffle: probably the single best program for technical drawing out there. OmniOutliner
is also very useful.
- Keynote 2: superior poster and presentation generation. Make sure to get Keynote 2 in
iWork 2006.
- LaTeXit: Allows easy incorporation of LaTeX formulas into any application via drag
and drop of images. Frequently updated.
- Photoshop: Unbeatable set of features for raster image manipulation.
- Illustrator: Use for generation of vector graphics when Omnigraffle is insufficient.
Make sure to check out LiveTrace
if dealing with hand drawn sketches.
- iPhoto: use for organizing and searching for technical photos generated while coding.
- Adium: allows embedding of LaTeX equations in IM transcripts. Quite useful for long
distance collaborations.
- Handling References
- See the section on Reading the Literature for more details.
- BibTeX: Probably the best way to organize technical references. Cross platform.
- Endnote: Windows and Mac application for reference management.
The translation was initiated by System Administrator on 2007-03-15