BioHPC Lab: User Guide



Commonly Used Genome Databases on BioHPC Lab Computers

The BioHPC Lab computers keep copies of some commonly used reference genomes. Unless otherwise noted, the reference genomes come from the databases maintained by Illumina, which include BWA and Bowtie 1 and 2 indexes as well as annotation files from UCSC, NCBI and Ensembl. The complete set of local reference genomes is in the directory /shared_data/genome_db. Because the /shared_data directory is mounted from the network file server, copy the files to /workdir before you use them.

To copy a directory from /shared_data to /workdir, use the “cp -r” command. For example, to run Tophat with the human NCBI genome databases, you need to copy the Bowtie2Index and WholeGenomeFasta directories and the GTF files to /workdir:

cp -r /shared_data/genome_db/Homo_sapiens/NCBI/build37.2/Sequence/Bowtie2Index/  /workdir/myUserName/

cp -r /shared_data/genome_db/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/ /workdir/myUserName/

cp /shared_data/genome_db/Homo_sapiens/NCBI/build37.2/Annotation/Genes/*gtf  /workdir/myUserName/
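If you copy several directories this way, a small helper can create the destination and check that the source exists before copying. The following is only a sketch: the function name stage_genome is hypothetical and not part of the BioHPC Lab software, and myUserName stands for your own directory under /workdir.

```shell
# stage_genome SRC_DIR DEST_DIR  (hypothetical helper, not a BioHPC Lab command)
# Copy one reference directory into DEST_DIR, creating DEST_DIR if needed.
stage_genome() {
    local src="$1"
    local dest="$2"
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    # cp -r places a copy of SRC_DIR inside DEST_DIR
    mkdir -p "$dest" && cp -r "$src" "$dest/"
}

# Example calls, using the paths from the commands above:
# stage_genome /shared_data/genome_db/Homo_sapiens/NCBI/build37.2/Sequence/Bowtie2Index /workdir/myUserName
# stage_genome /shared_data/genome_db/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta /workdir/myUserName
```

The check on the source directory makes a mistyped path fail with a clear message instead of a partial copy.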

For your convenience, we keep a set of the most commonly used genomes on the local drives of each BioHPC Lab computer, in the /local_data directory. You can use the files in /local_data directly; there is no need to copy them to /workdir. (For genomes with multiple annotation sources, we keep the UCSC annotation in /local_data; the Ensembl and NCBI annotations are in /shared_data/genome_db.)

Because of the size limit on /local_data, we keep only the latest version of each genome, indexed with the latest version of the software. When you run an analysis, make sure that you record the software version and the database version being used. If /local_data changes and you want to reproduce earlier results, you can copy the older version from the /shared_data directory.
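One way to keep such a record is to append the date, the database path and the software's version output to a text file next to your results. This is a minimal sketch under stated assumptions: the function name record_run_info and the log file name are hypothetical, and the example call assumes bowtie2 is on your PATH.

```shell
# record_run_info LOG_FILE DB_PATH VERSION_COMMAND...  (hypothetical helper)
# Append the date, the database path and the first line of the
# version command's output to LOG_FILE.
record_run_info() {
    local log_file="$1"
    local db_path="$2"
    shift 2
    {
        echo "date: $(date)"
        echo "database: $db_path"
        echo "version command: $*"
        "$@" 2>&1 | head -n 1
    } >> "$log_file"
}

# Example call (assumes bowtie2 is installed and on the PATH):
# record_run_info /workdir/myUserName/run_info.txt /local_data/Homo_sapiens_UCSC_hg19 bowtie2 --version
```

Because the function appends rather than overwrites, one log file can collect the records for every tool used in an analysis.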

There is also a local MySQL server for some common databases, including the blast2go annotation database. To use blast2go, please read the blast2go instructions. There is a readme file in each directory that gives the update time for all the databases.

Available on the local drives (/local_data)

  • Arabidopsis_thaliana_tair10**

  • Caenorhabditis_elegans_ce10

  • Drosophila_melanogaster_dm3

  • Homo_sapiens_UCSC_hg19

  • Mus_musculus_UCSC_mm10

  • Saccharomyces_cerevisiae_sacCer3  

  • Zea_mays_agpv3**  

Available on the network file server (/shared_data/genome_db)

  • NCBI BLAST database (nt, nr and others - see important note about copying them below ****)

  • interproscan***

  • Arabidopsis_thaliana**

  • Caenorhabditis_elegans

  • Drosophila_melanogaster

  • Homo_sapiens

  • Mus_musculus

  • Saccharomyces_cerevisiae

  • Zea_mays**

  • apple

  • grape

  • Taeniopygia_guttata (zebrafinch)

** The databases maintained by Illumina do not always use gene names commonly accepted by the community. In our system, the Arabidopsis reference genome is from TAIR. The maize reference genome agpv2 is from, the maize reference genome agpv3 is from Plant Ensembl.

*** Interproscan needs to be unpacked before use. Go to your directory under /workdir and execute "tar -xzf /shared_data/genome_db/interproscan-5.2-45.0-64-bit.tar.gz". Your copy of interproscan will be in the interproscan subdirectory of the directory where you ran the command.

**** The small databases pdbaa, pdbnt and swissprot are distributed as masks to the nt and nr databases. Therefore, if you need swissprot or pdbaa you also need to copy nr, and if you need pdbnt you also need to copy nt. To copy any single database, execute the command 'cp /shared_data/genome_db/BLAST_NCBI/NNN.* /workdir/myid/mydbdir', where NNN is the database name (e.g. nt, nr etc.).
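The dependency rule in the note above can be sketched as a small shell helper that copies a database together with the parent database it is a mask of. The function names blast_dbs_needed and copy_blast_db are hypothetical and not part of BLAST or the BioHPC Lab software; myid and mydbdir are placeholders as in the note.

```shell
# blast_dbs_needed NAME  (hypothetical helper)
# Print NAME, followed by the parent database it depends on, if any.
blast_dbs_needed() {
    case "$1" in
        swissprot|pdbaa) echo "$1 nr" ;;   # masks to nr
        pdbnt)           echo "$1 nt" ;;   # mask to nt
        *)               echo "$1" ;;      # no parent needed
    esac
}

# copy_blast_db NAME SRC_DIR DEST_DIR  (hypothetical helper)
# Copy NAME.* and, where required, the parent database files.
copy_blast_db() {
    local db
    mkdir -p "$3" || return 1
    for db in $(blast_dbs_needed "$1"); do
        cp "$2"/"$db".* "$3"/ || return 1
    done
}

# Example call, using the paths from the note above:
# copy_blast_db swissprot /shared_data/genome_db/BLAST_NCBI /workdir/myid/mydbdir
```

Keeping the rule in one case statement means a new mask database only needs one extra pattern.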



