


Page history last edited by cariaso 12 years ago



This has largely been superseded by http://www.runblast.com




mpiBLAST in Amazon EC2

Author: Mike Cariaso


A project to run mpiblast in Amazon.com's EC2.




Users of the tool BLAST.




BLAST is one of the primary tools of bioinformatics, and can be one of the most computationally demanding. Few of its users have the luxury of large supercomputing facilities in house, or the experience and budget to create their own. As a result they must share the free NCBI service. NCBI strictly enforces rules to prevent a single user from monopolizing the system.


Amazon.com recently released EC2, a cluster of machines that can be rented by the hour at very reasonable rates ($0.10 per computer per hour). This architecture matches the design of mpiBLAST well.


The two together offer BLAST users an attractive alternative.







Currently works well, but needs to be launched from a Unix-like environment. Cygwin is fine. Specifically, we need command-line ssh and scp.








Creating the cluster


Complete the Getting Started Guide at http://developer.amazonwebservices.com/connect/entry.jspa?externalID=533&categoryID=87



Follow the steps of http://www.datawrangling.com/mpi-cluster-with-python-and-amazon-ec2-part-2-of-3.html

but when it asks you to edit EC2config.py, use IMAGE_ID = "ami-2e8c6947"





IMAGE_ID = "ami-2e8c6947"

KEYNAME = "gsg-keypair"

KEY_LOCATION = "~/id_rsa-gsg-keypair"





This command requests machines from EC2. You will be billed $0.10 per machine, per hour. Note that the name "ec2-start_cluster.py" may soon be changed to replace the underscore with a hyphen.
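Since billing is per machine per hour, it can help to estimate the cost of a session before starting one. The following is a minimal sketch using the $0.10 rate quoted above. The function name is hypothetical, and rounding partial hours up to a full hour is an assumption about how usage is billed.

```python
import math

# Price quoted in the text: $0.10 per machine per hour, kept in cents
# to avoid floating point rounding.
PRICE_CENTS_PER_MACHINE_HOUR = 10


def cluster_cost_cents(machines, hours):
    """Estimate the cost of running `machines` instances for `hours` hours.

    Assumption: a partial hour is billed as a full hour.
    """
    billed_hours = math.ceil(hours)
    return machines * billed_hours * PRICE_CENTS_PER_MACHINE_HOUR


# Example: a 5 machine cluster kept up for 14 hours.
cost = cluster_cost_cents(5, 14)
print(f"${cost / 100:.2f}")
```

Under these assumptions, five machines for 14 hours come to $7.00.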






Run this a few times to watch as the machines come online.




This command configures MPI for this collection of machines. It makes a list of all the machines, and sets up passwordless ssh within the cluster. Be aware that this weakens security. This step currently relies on scp and ssh being available as local commands, so it needs to be run from a Unix-like environment.
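To illustrate the "list of all the machines" part of this step, here is a small sketch that turns a list of hostnames into the text of a hosts file. It is not the code the cluster scripts use. It assumes the plain one-hostname-per-line layout, and the hostnames shown are placeholders.

```python
def hosts_file_text(hostnames):
    """Return hosts file text: one hostname per line, duplicates removed.

    Assumption: the file lists one hostname per line and nothing else.
    """
    seen = []
    for name in hostnames:
        name = name.strip()
        if name and name not in seen:
            seen.append(name)
    return "".join(f"{name}\n" for name in seen)


# Placeholder hostnames, not real EC2 addresses.
print(hosts_file_text(["node-a", "node-b", "node-a"]), end="")
```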





Once all the nodes are up, decide which one you'd like to use as the head node. Since we currently use the same image on master and workers, it doesn't matter which one you choose.






Your mpiblast cluster is running


Proving that will require more effort.


Log into the head node as root.




Blastable databases will live in the directory /mnt/dbs/mpiblastable/. This is controlled with the ~/.ncbirc file. Since standard NCBI BLAST can also read mpiBLAST format databases, you do not need to keep two sets.


-bash-3.1# cat ~/.ncbirc
[NCBI]
Data =/mnt/dbs/mpiblastable
[mpiBLAST]
Shared=/mnt/dbs/mpiblastable
Local=/mnt/dbs/local
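The file uses an INI-style layout, so its contents can be checked with a few lines of Python. This is only a sketch for inspecting the settings. BLAST and mpiBLAST read the file with their own parsers, and the sample text below simply repeats the values shown above.

```python
import configparser

# Same sections and values as the ~/.ncbirc shown in the text.
NCBIRC_TEXT = """\
[NCBI]
Data=/mnt/dbs/mpiblastable

[mpiBLAST]
Shared=/mnt/dbs/mpiblastable
Local=/mnt/dbs/local
"""


def read_ncbirc(text):
    """Parse INI-style text into {section: {key: value}}."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the original case of the keys
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


settings = read_ncbirc(NCBIRC_TEXT)
print(settings["mpiBLAST"]["Shared"])
```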



For now it's much easier to generate random data. There is a program called randomSeq.py in the current directory. It is a little script that makes random DNA sequences. We're going to use it to generate a database. Run each of these to get a feeling for how the tool works. It takes 3 arguments: 1. how-many-sequences 2. min-length 3. max-length. Try these examples.



./randomSeq.py 1 10 10

./randomSeq.py 2 20 20

./randomSeq.py 5 30 40
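The source of randomSeq.py is not shown here, so the following is only a guess at what such a script might do: produce a given number of FASTA records whose lengths fall between the minimum and maximum. The function name and the header format are assumptions, not taken from the real script.

```python
import random


def random_fasta(count, min_len, max_len, rng=None):
    """Yield `count` FASTA records of random DNA, each min_len..max_len bases."""
    rng = rng or random.Random()
    for i in range(1, count + 1):
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        # Hypothetical header; the real script's headers may differ.
        yield f">random_seq_{i}\n{seq}\n"


# Roughly equivalent to "./randomSeq.py 2 20 20" under the assumptions above.
for record in random_fasta(2, 20, 20):
    print(record, end="")
```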




mpiBLAST excels when the database is too large to fit into the memory of a single computer. For this walkthrough we will work with small databases to keep things fast.



./randomSeq.py 10000 1000 1000 | mpiformatdb -N 20 --skip-reorder -t random1 -n random1 -o T -p F -i stdin


That generates an mpiblastable database named random1. Note that only the head node is doing this work. Generating this database takes approximately 50 seconds.



The random1 database contains 10,000 DNA sequences. Each DNA sequence is 1000 nucleotides long. That would be a 10 megabyte file as FASTA formatted text. Before it ever hits the disk, we start converting it into the database format mpiBLAST reads. For this tiny database that's unnecessary, but for larger datasets it is very useful.
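The 10 megabyte figure follows from one byte per nucleotide. A short sketch of that arithmetic, ignoring header lines and newlines (an approximation, and the function name is hypothetical):

```python
def fasta_sequence_bytes(num_sequences, sequence_length):
    """Approximate FASTA size: one byte per nucleotide.

    Header lines and newline characters are ignored, so the real
    file would be slightly larger.
    """
    return num_sequences * sequence_length


random1_bytes = fasta_sequence_bytes(10_000, 1_000)
print(random1_bytes / 1_000_000, "MB")
```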




#It seems you will also need to run this command. This is a bug in mpiblast. I will try to fix it shortly

chmod 666 /mnt/dbs/mpiblastable/random1.mbf





At the moment, I've chosen to partition the database into 20 fragments. This doesn't mean that we need to run this on 20 nodes, just that 20 will be the maximum. I normally like 60 fragments, since they distribute evenly as I scale up. There is a performance perk to setting the number of db fragments equal to the number of nodes. Eventually, if you consistently use one size, set it here.
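One way to see why 60 scales more smoothly than 20 is to list the node counts that divide the fragment count exactly. This sketch assumes "evenly distribute" means every node holds the same number of fragments. The function name is hypothetical.

```python
def even_node_counts(fragments):
    """Return the node counts that split `fragments` with no remainder."""
    return [n for n in range(1, fragments + 1) if fragments % n == 0]


print(even_node_counts(20))  # [1, 2, 4, 5, 10, 20]
print(even_node_counts(60))  # [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]
```

With 60 fragments there are twelve cluster sizes that share the work evenly, compared with six for 20 fragments.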



You now have an mpiblastable database.


The last thing you're going to need is some sequence to search with. To start good habits, log in as lamuser to run your BLAST search.



# make a single random sequence

./randomSeq.py 1 1000 2000 > /home/lamuser/query


# login as a typical user

su - lamuser


# boot the cluster

mpdboot -n 5 -f mpd.hosts


# run mpiblast

mpirun -np 5 mpiblast -p blastn -i query -d random1 --copy-via=mpi -o blast.out
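If you change the cluster size often, it is easy to let the counts given to mpdboot and mpirun drift apart. The sketch below builds both command lines from one number, using only the flags shown above. The helper name is hypothetical, and treating the 3 node minimum mentioned later on this page as a minimum process count is an assumption.

```python
import shlex


def blast_commands(processes, query="query", database="random1", output="blast.out"):
    """Build the mpdboot and mpirun argument lists for one process count."""
    if processes < 3:
        # Assumption: the 3 node minimum applies to the process count.
        raise ValueError("mpiBLAST needs at least 3 processes")
    boot = ["mpdboot", "-n", str(processes), "-f", "mpd.hosts"]
    run = [
        "mpirun", "-np", str(processes), "mpiblast",
        "-p", "blastn", "-i", query, "-d", database,
        "--copy-via=mpi", "-o", output,
    ]
    return boot, run


boot_cmd, run_cmd = blast_commands(5)
print(shlex.join(boot_cmd))
print(shlex.join(run_cmd))
```

The lists can be passed to subprocess.run on the head node, or printed and pasted into the shell as shown.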



The results are in blast.out. After running many BLAST searches, you will eventually want to shut down the cluster. Do this from your local machine. Once the cluster is shut down, you are no longer being billed.






Larger Databases


mpiBLAST requires at least 3 nodes to run. Since each node in EC2 has 2G of RAM, we would like to keep the portion of the database assigned to each node below 2G. Here are some observations and estimates for the time to build different DBs. If you actually run these, please provide your observed times.
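These two constraints give a rough lower bound on cluster size. The sketch below assumes the database is split evenly across every node, which is a simplification, since it ignores any node that does not hold fragments and any memory needed beyond the database itself. The function name is hypothetical.

```python
def min_nodes(db_gb, ram_gb=2.0, required_minimum=3):
    """Smallest node count that keeps each node's share strictly below ram_gb.

    Assumption: the database is divided evenly across all nodes.
    """
    by_memory = int(db_gb // ram_gb) + 1
    return max(required_minimum, by_memory)


for size_gb in (1, 10, 16, 75):
    print(f"{size_gb}G database: at least {min_nodes(size_gb)} nodes")
```

Under these assumptions a 10G database needs at least 6 nodes, and a 1G database is limited only by the 3 node minimum.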



Creating the 10M database random10m takes 50s.

./randomSeq.py 10000 1000 1000 | mpiformatdb -N 20 --skip-reorder -t random10m -n random10m -o T -p F -i stdin

Creating the 100M database random100m takes 8m7s.

./randomSeq.py 100000 1000 1000 | mpiformatdb -N 20 --skip-reorder -t random100m -n random100m -o T -p F -i stdin

real 8m7.842s

user 3m39.010s

sys 0m0.450s

Creating the 1G database random1g takes 85m40s.

./randomSeq.py 1000000 1000 1000 | mpiformatdb -N 20 --skip-reorder -t random1g -n random1g -o T -p F -i stdin

real 85m39.516s

user 37m30.890s

sys 0m4.580s


Creating the 10G database random10g takes 14h.

./randomSeq.py 10000000 1000 1000 | mpiformatdb -N 20 --skip-reorder -t random10g -n random10g -o T -p F -i stdin

real 813m17.380s

user 363m6.090s

sys 0m34.510s

Creating the 100G database random100g is estimated to take 5d.

./randomSeq.py 100000000 1000 1000 | mpiformatdb -N 20 --skip-reorder -t random100g -n random100g -o T -p F -i stdin
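The 5 day estimate can be reproduced from the measured "real" times. The sketch below takes the average growth factor per tenfold increase in size across the four measurements, then applies it once more to the random10g time. Treating the growth factor as constant is an assumption, and the helper names are hypothetical.

```python
import re


def parse_real(text):
    """Convert a `time` style value such as '8m7.842s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock build times from the text, one entry per tenfold step in size.
OBSERVED = [
    ("random10m", 50.0),  # quoted in the text as 50s
    ("random100m", parse_real("8m7.842s")),
    ("random1g", parse_real("85m39.516s")),
    ("random10g", parse_real("813m17.380s")),
]


def growth_per_tenfold(times):
    """Geometric mean of the slowdown between consecutive measurements."""
    steps = len(times) - 1
    return (times[-1] / times[0]) ** (1.0 / steps)


times = [seconds for _, seconds in OBSERVED]
factor = growth_per_tenfold(times)
estimate_days = times[-1] * factor / 86400
print(f"growth per 10x: {factor:.2f}, random100g estimate: {estimate_days:.1f} days")
```

The growth factor comes out close to 10, so build time is roughly linear in database size, and the extrapolation lands between five and six days.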




Real databases


As of April 2007, the NCBI nt database is around 16G, and all of GenBank is closer to 75G. From 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months. In comparison, random10g seems a bit small, and random100g is quite large.
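The doubling rate makes it easy to project how long the test sizes stay representative. This sketch assumes the 18 month doubling continues unchanged and applies equally to nt, which the text does not claim. The function names are hypothetical.

```python
import math

DOUBLING_MONTHS = 18  # doubling period quoted in the text


def projected_size(size_now, months_ahead):
    """Size after `months_ahead` months of steady doubling."""
    return size_now * 2 ** (months_ahead / DOUBLING_MONTHS)


def months_until(size_now, target_size):
    """Months of steady doubling needed to grow from size_now to target_size."""
    return DOUBLING_MONTHS * math.log2(target_size / size_now)


print(projected_size(75, 36))              # GenBank, in G, three years on
print(round(months_until(16, 100), 1))     # months until nt reaches 100G
```

Under that assumption GenBank reaches about 300G in three years, and nt would reach the size of random100g in roughly four years.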


If you would like to experiment with a small but real dataset, you may wish to upload your own fasta file.


Alternatively, you might like to use an existing blastable database. We will use the est_mouse collection from NCBI.



mkdir -p /mnt/mpiblast/dbs/blastable

cd /mnt/mpiblast/dbs/blastable

wget ftp://ftp.ncbi.nih.gov/blast/db/est_mouse.tar.gz

tar zxvf est_mouse.tar.gz

export TMPDIR=/mnt/tmpdir

export MPIBLAST_SHARED=/mnt/dbs/mpiblastable

export MPIBLAST_LOCAL=/mnt/dbs/local


mkdir -p $TMPDIR




# Do it in one step to avoid touching the disk

/mnt/mpiblast/installs/ncbi/bin/fastacmd -D 1 -d /mnt/mpiblast/dbs/blastable/est_mouse | mpiformatdb -N 20 --skip-reorder -t est_mouse -n est_mouse -o T -p F -i stdin

real 36m39.028s

user 13m26.330s

sys 1m17.810s



This last step uses the fastacmd program from standard NCBI BLAST to dump the database contents as FASTA. We then feed that into mpiformatdb. The compressed download was 650MB. I'm not sure of its size in FASTA.






The times listed here are just to build the mpiblastable database. Distributing it to the workers will be a significant cost as well, although this price only needs to be paid occasionally. Each mpiBLAST run will take considerable additional time. While I am interested in collecting these numbers, doing so will begin to incur significant EC2 costs, so I will leave that analysis to the future.



If you'd like to compile your own version, you may find these AmazonEC2BuildNotes of use.



Future Directions


performance profiling

s3 storage of mpiblastable databases

web interface for job submission

using the GI filter to minimize the number of mpiblastable databases

