Bioref Webpage
Welcome to the MSI bioref database webpage! The Minnesota Supercomputing Institute (MSI) houses several public biological reference (or “bioref”, for short) databases that are of broad use to researchers, such as NCBI BLAST and ENSEMBL. Our goal in hosting these databases is to save our users and groups valuable storage space and time.
These databases are maintained and regularly updated by the MSI Research Informatics (RI) group. BLAST is updated quarterly, and ENSEMBL is updated annually at the beginning of each calendar year. These databases can be found at this location on our system:
/common/bioref/
IMPORTANT NOTICE: risdb (/panfs/roc/risdb) and risdb_new (/panfs/roc/risdb_new) will continue to be accessible until the September MSI Maintenance day (9/6/2023). If you would like to keep certain genome assembly/annotation versions or any of the other biological sequence databases from risdb or risdb_new, please copy them to your group space BEFORE September 6, 2023.
BLAST databases
MSI maintains the full set of NCBI BLAST sequence databases, currently 34 separate databases. Table 1 below shows the names of these BLAST databases, a brief description of their contents, their database format version, the full path to their location on our system, and the versions of NCBI BLAST software that they are compatible with.
Latest BLAST databases are located within:
/common/bioref/blast/latest/
If you load BLAST software version 2.12.0 or later with “module load” on our system, the latest bioref BLAST databases are automatically made available in your environment, so you do not need to specify the full path to a database: the database name alone (column 1 of Table 1) is sufficient for the database argument. Please note that the BLAST databases in bioref are in dbV5 format and will not work with BLAST versions older than 2.12.0.
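As a minimal sketch of the point above, the Python snippet below builds a `blastn` argument list two ways: with the bare database name (relying on the environment set up by the BLAST module) and with the full bioref path. It also checks a version string against the 2.12.0 minimum. The helper names (`is_compatible`, `blastn_command`), the query and output file names, and the tabular output format are illustrative assumptions, not MSI tooling; `-query`, `-db`, `-out`, and `-outfmt` are standard BLAST+ options.

```python
import re

BIOREF_BLAST_LATEST = "/common/bioref/blast/latest"  # path given on this page
MIN_BLAST_VERSION = (2, 12, 0)  # oldest BLAST+ version this page lists as compatible


def is_compatible(version_text):
    """Return True if a version string such as '2.13.0+' is at least 2.12.0."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return tuple(int(part) for part in match.groups()) >= MIN_BLAST_VERSION


def blastn_command(query, db_name, out, use_full_path=False):
    """Build (but do not run) a blastn argument list for a bioref database."""
    db = f"{BIOREF_BLAST_LATEST}/{db_name}" if use_full_path else db_name
    return ["blastn", "-query", query, "-db", db, "-out", out, "-outfmt", "6"]


# Hypothetical query file; "nt" is a database name from Table 1.
print(" ".join(blastn_command("query.fasta", "nt", "hits.tsv")))
print(" ".join(blastn_command("query.fasta", "nt", "hits.tsv", use_full_path=True)))
print(is_compatible("2.12.0+"), is_compatible("2.9.0+"))
```

The resulting list can be passed to `subprocess.run` once the BLAST module is loaded. The full-path form is mainly useful when you need a specific database copy rather than whatever the module environment points to.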
Table 1. BLAST databases and their locations in bioref
| BLAST database name | db format | BLAST version compatibility | Full path to database | Contents* |
|---|---|---|---|---|
| 16S_ribosomal_RNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/16S_ribosomal_RNA | Microbial 16S RNA sequences from the RefSeq Targeted Loci project (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/). |
| 18S_fungal_sequences | dbV5 | 2.12.0+ | /common/bioref/blast/latest/18S_fungal_sequences | |
| 28S_fungal_sequences | dbV5 | 2.12.0+ | /common/bioref/blast/latest/28S_fungal_sequences | |
| Betacoronavirus | dbV5 | 2.12.0+ | /common/bioref/blast/latest/Betacoronavirus | |
| cdd_delta | dbV5 | 2.12.0+ | /common/bioref/blast/latest/cdd_delta | Condensed conserved domain database for use with deltablast protein searches. |
| env_nr | dbV5 | 2.12.0+ | /common/bioref/blast/latest/env_nr | Protein sequences from large environmental sequencing projects, e.g., Sargasso Sea, Acid Mine Drainage. Its entries are EXCLUDED from the nr database. |
| env_nt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/env_nt | Nucleotide sequences from large environmental sequencing projects, e.g., Sargasso Sea, Acid Mine Drainage. Its entries are EXCLUDED from the nt database. |
| human_genome | dbV5 | 2.12.0+ | /common/bioref/blast/latest/human_genome | Current RefSeq human genome assembly (GRCh) with various database masking. |
| ITS_eukaryote_sequences | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ITS_eukaryote_sequences | Database with a collection of eukaryotic Internal Transcribed Spacer sequences. |
| ITS_RefSeq_Fungi | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ITS_RefSeq_Fungi | Database with a collection of fungal Internal Transcribed Spacer sequences. |
| landmark | dbV5 | 2.12.0+ | /common/bioref/blast/latest/landmark | The landmark database includes complete proteomes from a few selected representative genomes spanning a wide taxonomic range; it is the main database used by the SmartBLAST service. |
| LSU_eukaryote_rRNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/LSU_eukaryote_rRNA | Database with large subunit rRNA sequences for eukaryotes. |
| LSU_prokaryote_rRNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/LSU_prokaryote_rRNA | Database with large subunit rRNA sequences for prokaryotes. |
| mito | dbV5 | 2.12.0+ | /common/bioref/blast/latest/mito | Protein sequences from the annotated mitochondrial genomes. |
| mouse_genome | dbV5 | 2.12.0+ | /common/bioref/blast/latest/mouse_genome | Current RefSeq mouse genome assembly (GRCm) with various database masking. |
| nr | dbV5 | 2.12.0+ | /common/bioref/blast/latest/nr | A collection of protein sequences with entries from GenPept, Swissprot, PDB, PRF, PIR and NCBI Reference Sequence (RefSeq) project. |
| nt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/nt | The nucleotide sequence database contains entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, con, and environmental sequences are excluded. RefSeq genomic entries are also excluded. |
| pataa | dbV5 | 2.12.0+ | /common/bioref/blast/latest/pataa | Protein sequences from patents as supplied by USPTO. These entries are EXCLUDED from the nr database. |
| patnt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/patnt | Nucleotide sequences from patents as supplied by USPTO to GenBank, or from EU/Japan Patent Agencies through EMBL/DDBJ. Entries are EXCLUDED from the nt database. |
| pdbaa | dbV5 | 2.12.0+ | /common/bioref/blast/latest/pdbaa | Protein sequences from PDB structure records’ protein components. |
| pdbnt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/pdbnt | Sequences for the nucleotide components of PDB structure records. |
| ref_euk_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_euk_rep_genomes | Eukaryotic representative genomes from NCBI RefSeq project. |
| ref_prok_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_prok_rep_genomes | Prokaryotic representative genomes from NCBI RefSeq project. |
| refseq_protein | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_protein | Protein sequences from NCBI RefSeq project. |
| refseq_rna | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_rna | RNA sequences from NCBI RefSeq project, also included in the nt database. |
| refseq_select_prot | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_select_prot | |
| refseq_select_rna | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_select_rna | |
| ref_viroids_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_viroids_rep_genomes | Viroid representative genomes from NCBI RefSeq project. |
| ref_viruses_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_viruses_rep_genomes | Virus representative genomes from NCBI RefSeq project. |
| SSU_eukaryote_rRNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/SSU_eukaryote_rRNA | Database with small subunit rRNA sequences from fungi and other eukaryotes. |
| swissprot | dbV5 | 2.12.0+ | /common/bioref/blast/latest/swissprot | Protein sequences from the Swiss-Prot sequence database (last major update). |
| taxdb | dbV5 | 2.12.0+ | /common/bioref/blast/latest/taxdb | A non-sequence database file containing taxonomic information for sequences in the preformatted databases, providing common and scientific names for each entry. |
| tsa_nr | dbV5 | 2.12.0+ | /common/bioref/blast/latest/tsa_nr | Protein sequences from the Transcriptome Shotgun Assembly. Its entries are EXCLUDED from the nr database. |
| tsa_nt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/tsa_nt | A database with earlier non-project based Transcriptome Shotgun Assembly (TSA) entries. Project-based TSA entries are NOT included. Entries are EXCLUDED from the nt database. |
* From: https://www.ncbi.nlm.nih.gov/books/NBK62345/#blast_ftp_site.The_blastdb_...
BLAST Update Schedule
| January | April | July | October |
The bioref BLAST databases are updated quarterly, in the months listed in the update schedule, and each version is kept for 2 years. If needed, you can find older versions of the BLAST databases within the /common/bioref/blast/ folder, named by the month and year in which they were downloaded (e.g. “blast_update_04_2023” refers to the April 2023 update).
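To make the naming scheme concrete, here is a small Python sketch that derives the snapshot folder names one would expect to find for a given date, based on the quarterly schedule and the 2-year retention period described above. The function names are hypothetical, and the way the retention window is counted (back to the same month two years earlier) is an assumption; whether a given folder actually exists depends on when bioref was created and how the retention policy is applied.

```python
from datetime import date

BLAST_ROOT = "/common/bioref/blast"  # path given on this page
UPDATE_MONTHS = (1, 4, 7, 10)        # January, April, July, October
RETENTION_YEARS = 2                  # retention period stated on this page


def snapshot_dir(year, month):
    """Folder name following the documented pattern, e.g. blast_update_04_2023."""
    return f"{BLAST_ROOT}/blast_update_{month:02d}_{year}"


def expected_snapshots(today):
    """List snapshot folders inside the assumed retention window, oldest first."""
    earliest = (today.year - RETENTION_YEARS, today.month)
    latest = (today.year, today.month)
    return [
        snapshot_dir(year, month)
        for year in range(today.year - RETENTION_YEARS, today.year + 1)
        for month in UPDATE_MONTHS
        if earliest <= (year, month) <= latest
    ]


for path in expected_snapshots(date(2023, 8, 15)):
    print(path)
```

Pinning an analysis to one of these dated folders, rather than to `latest`, keeps results reproducible across quarterly updates for as long as that snapshot is retained.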
ENSEMBL databases
The genomes and gene annotations for all organisms from ENSEMBL, except bacteria, are stored in /common/bioref/ensembl/. In addition to storing the genomes (FASTA format) and gene annotations (GFF and GTF formats), we also provide pre-built genome indices for BWA, HISAT2, Bowtie, Bowtie2, Samtools, NCBI_toolkit, and Picard. Note: Protein sequences and transcript sequences (FASTA format) corresponding to the genes in the ENSEMBL GTF/GFFs are not provided.
Latest ENSEMBL databases are located within:
/common/bioref/ensembl/
Organisms are organized into directories by kingdom (plants, metazoa, fungi, etc.), except for “main”, which contains model organisms and vertebrates, and “grch37”, which holds the GRCh37 build of the human genome.
ENSEMBL Directory Structure
| Directory Name | Contents |
|---|---|
| main | Model organisms and vertebrates |
| grch37 | GRCh37 build of the human genome |
| fungi | Fungi |
| metazoa | Invertebrates and other animals |
| plants | Green plants and algae |
| protists | Organisms colloquially known as “protists” |
Within each “genus_species” directory, you will find the following subdirectories:
| Directory Name | Contents |
|---|---|
| annotation | Gene model annotations in GFF and GTF formats |
| blast | NCBI BLAST+ index files for the genome (not peptide, not transcripts) |
| bowtie | Index files for read mapping with bowtie version 1 |
| bowtie2 | Index files for read mapping with bowtie version 2 |
| bwa | Index files for read mapping with BWA |
| hisat2 | Index files for splice-aware read mapping with HISAT2 |
| seq | FASTA sequence files for the genome |
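The layout described above can be explored programmatically. The sketch below, in Python, searches for a named subdirectory (such as `hisat2` or `seq`) beneath an organism directory inside one of the kingdom-level directories. The function name `find_subdirs` and the example organism name are hypothetical. Because this page does not spell out the exact directory name for each organism or whether extra levels (for example release or assembly folders) sit in between, the search is deliberately recursive rather than assuming a fixed path.

```python
from pathlib import Path

ENSEMBL_ROOT = Path("/common/bioref/ensembl")  # path given on this page


def find_subdirs(organism, subdir, division="main", root=ENSEMBL_ROOT):
    """Return directories named `subdir` anywhere below a matching organism directory."""
    division_dir = Path(root) / division
    matches = []
    # The trailing * allows for suffixes on the organism directory name, if any.
    for organism_dir in sorted(division_dir.glob(f"{organism}*")):
        if not organism_dir.is_dir():
            continue
        # rglob tolerates intermediate levels between organism and subdirectory.
        matches.extend(p for p in sorted(organism_dir.rglob(subdir)) if p.is_dir())
    return matches


# Hypothetical organism directory name; prints nothing if no match is found.
for path in find_subdirs("homo_sapiens", "hisat2"):
    print(path)
```

On a system without the bioref tree the loop simply prints nothing, so it is safe to try the function before relying on a hard-coded path in a job script.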
Software Versions/Commands used for genomic sequence indexing
| Software | Version | Command |
|---|---|---|
| BWA | 0.7.17 | bwa index |
| HISAT2 | 2.1.0 | hisat2-build |
| bowtie2 | 2.3.4.1 | bowtie2-build |
| bowtie | 1.1.2 | bowtie-build |
| picard | 2.3.0 | picard CreateSequenceDictionary |
| ncbi_toolkit | 25.2.0 | makeblastdb |
| samtools | 1.9 | samtools faidx |
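If you need to build comparable indices for a genome that is not in bioref, the commands in the table above can be assembled as in the following Python sketch. Only the command names come from this page (note that the Bowtie2 executable is `bowtie2-build`, with a hyphen); the arguments shown are common basic invocations and are assumptions, since the exact options MSI used are not stated here. The function name, file names, and the presence of a `picard` wrapper script on your PATH are likewise hypothetical.

```python
def index_commands(fasta, prefix):
    """Return one argument list per tool; nothing is executed here."""
    return {
        "bwa": ["bwa", "index", "-p", prefix, fasta],
        "hisat2": ["hisat2-build", fasta, prefix],
        "bowtie2": ["bowtie2-build", fasta, prefix],
        "bowtie": ["bowtie-build", fasta, prefix],
        # Legacy KEY=value argument style, as accepted by Picard 2.x.
        "picard": ["picard", "CreateSequenceDictionary", f"R={fasta}", f"O={prefix}.dict"],
        # -dbtype nucl because the bioref indices cover the genome, not peptides.
        "ncbi_toolkit": ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", prefix],
        "samtools": ["samtools", "faidx", fasta],
    }


# Hypothetical input FASTA and output prefix.
for tool, command in index_commands("genome.fa", "genome").items():
    print(f"{tool}: {' '.join(command)}")
```

Each list can be handed to `subprocess.run(command, check=True)` after loading the matching software module. Using the versions listed in the table helps keep your indices consistent with the pre-built ones.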
ENSEMBL Update Schedule
ENSEMBL databases are updated annually in January, and each version is kept for 4 years. If needed, you can find older versions of the ENSEMBL databases within the ENSEMBL folder, labeled by month and year (e.g. “ensembl_update_01_2023”).