
Bioref Webpage

 

 

Welcome to MSI’s bioref database webpage! The Minnesota Supercomputing Institute (MSI) houses several public biological reference (“bioref” for short) databases that are of broad use to researchers, such as NCBI BLAST and ENSEMBL. Our goal in hosting these databases is to save our users and groups valuable storage space and time.

 

These databases are maintained and regularly updated by the MSI Research Informatics (RI) group. BLAST is updated quarterly, and ENSEMBL is updated annually at the beginning of each calendar year. The databases can be found at this location on our system:

 

/common/bioref/
 

 

 

 

IMPORTANT NOTICE: risdb (/panfs/roc/risdb) and risdb_new (/panfs/roc/risdb_new) will continue to be accessible until the September MSI Maintenance day (9/6/2023). If you would like to keep certain genome assembly/annotation versions or any of the other biological sequence databases from risdb or risdb_new, please copy them to your group space BEFORE September 6, 2023.

 

 

BLAST databases

 

MSI maintains the entire set of NCBI BLAST sequence databases, which currently comprises 34 separate databases. Table 1 below shows the name of each BLAST database, a brief description of its contents, its database format version, the full path to its location on our system, and the versions of NCBI BLAST software it is compatible with.

 


Latest BLAST databases are located within:

/common/bioref/blast/latest/

 

 

If you load BLAST software version 2.12.0 or later using “module load” on our system, the latest bioref BLAST databases are automatically made available in your environment, so there is no need to specify the full path to a database: providing only the database name (column 1 of Table 1) as the database argument will suffice. Please note that the BLAST databases in bioref are in dbV5 format, so they will not work with versions of BLAST software older than 2.12.0.
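
As a rough illustration of the two ways to name a database, the following Python sketch builds a BLAST+ argument list either with the bare database name (relying on the environment prepared by “module load”) or with the full bioref path. The helper name build_blast_command and the query and output file names are hypothetical and not part of the MSI setup; -query, -db, and -out are standard BLAST+ options.

```python
import posixpath

# Location of the latest bioref BLAST databases, as stated on this page.
BLAST_LATEST = "/common/bioref/blast/latest"


def build_blast_command(program, query, db_name, out, use_full_path=False):
    """Return an argument list for a BLAST+ search (hypothetical helper).

    With use_full_path=False only the database name is passed, which relies
    on the environment set up when BLAST 2.12.0 or later is module-loaded.
    """
    db = posixpath.join(BLAST_LATEST, db_name) if use_full_path else db_name
    return [program, "-query", query, "-db", db, "-out", out]


if __name__ == "__main__":
    # Bare database name: enough once the BLAST module is loaded.
    print(" ".join(build_blast_command("blastn", "query.fa", "nt", "hits.txt")))
    # Full path: explicit alternative that does not depend on the environment.
    print(" ".join(build_blast_command("blastn", "query.fa", "nt", "hits.txt",
                                       use_full_path=True)))
```

The returned list can be passed to subprocess.run on a system where BLAST is installed; the sketch itself only constructs the command and does not run a search.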

 

 


Table 1. BLAST databases and their locations in bioref

| BLAST database name | db format | BLAST version compatibility | Full path to database | Contents* |
|---|---|---|---|---|
| 16S_ribosomal_RNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/16S_ribosomal_RNA | Microbial 16S RNA sequences from the RefSeq Targeted Loci project (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/). |
| 18S_fungal_sequences | dbV5 | 2.12.0+ | /common/bioref/blast/latest/18S_fungal_sequences | |
| 28S_fungal_sequences | dbV5 | 2.12.0+ | /common/bioref/blast/latest/28S_fungal_sequences | |
| Betacoronavirus | dbV5 | 2.12.0+ | /common/bioref/blast/latest/Betacoronavirus | |
| cdd_delta | dbV5 | 2.12.0+ | /common/bioref/blast/latest/cdd_delta | Condensed conserved domain database for use with deltablast protein searches. |
| env_nr | dbV5 | 2.12.0+ | /common/bioref/blast/latest/env_nr | Protein sequences from large environmental sequencing projects, e.g., Sargasso Sea, Acid Mine Drainage. Its entries are EXCLUDED from the nr database. |
| env_nt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/env_nt | Nucleotide sequences from large environmental sequencing projects, e.g., Sargasso Sea, Acid Mine Drainage. Its entries are EXCLUDED from the nt database. |
| human_genome | dbV5 | 2.12.0+ | /common/bioref/blast/latest/human_genome | Current RefSeq human genome assembly (GRCh) with various database masking. |
| ITS_eukaryote_sequences | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ITS_eukaryote_sequences | Collection of eukaryotic Internal Transcribed Spacer (ITS) sequences. |
| ITS_RefSeq_Fungi | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ITS_RefSeq_Fungi | Collection of fungal Internal Transcribed Spacer (ITS) sequences. |
| landmark | dbV5 | 2.12.0+ | /common/bioref/blast/latest/landmark | Complete proteomes from a few selected representative genomes spanning a wide taxonomic range; this is the main database used by the SmartBLAST services. |
| LSU_eukaryote_rRNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/LSU_eukaryote_rRNA | Large subunit rRNA sequences for eukaryotes. |
| LSU_prokaryote_rRNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/LSU_prokaryote_rRNA | Large subunit rRNA sequences for prokaryotes. |
| mito | dbV5 | 2.12.0+ | /common/bioref/blast/latest/mito | Proteins from the annotated mitochondrial genomes. |
| mouse_genome | dbV5 | 2.12.0+ | /common/bioref/blast/latest/mouse_genome | Current RefSeq mouse genome assembly (GRCm) with various database masking. |
| nr | dbV5 | 2.12.0+ | /common/bioref/blast/latest/nr | A collection of protein sequences with entries from GenPept, Swissprot, PDB, PRF, PIR and NCBI Reference Sequence (RefSeq) project. |
| nt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/nt | The nucleotide sequence database contains entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, con, and environmental sequences are excluded. RefSeq genomic entries are also excluded. |
| pataa | dbV5 | 2.12.0+ | /common/bioref/blast/latest/pataa | Protein sequences from patents as supplied by USPTO. These entries are EXCLUDED from the nr database. |
| patnt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/patnt | Nucleotide sequences from patents as supplied by USPTO to GenBank, or from EU/Japan Patent Agencies through EMBL/DDBJ. Entries are EXCLUDED from the nt database. |
| pdbaa | dbV5 | 2.12.0+ | /common/bioref/blast/latest/pdbaa | Protein sequences from PDB structure records’ protein components. |
| pdbnt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/pdbnt | Sequences for the nucleotide components of PDB structure records. |
| ref_euk_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_euk_rep_genomes | Eukaryotic representative genomes from the NCBI RefSeq project. |
| ref_prok_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_prok_rep_genomes | Prokaryotic representative genomes from the NCBI RefSeq project. |
| refseq_protein | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_protein | Protein sequences from the NCBI RefSeq project. |
| refseq_rna | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_rna | RNA sequences from the NCBI RefSeq project, also included in the nt database. |
| refseq_select_prot | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_select_prot | |
| refseq_select_rna | dbV5 | 2.12.0+ | /common/bioref/blast/latest/refseq_select_rna | |
| ref_viroids_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_viroids_rep_genomes | Viroid representative genomes from the NCBI RefSeq project. |
| ref_viruses_rep_genomes | dbV5 | 2.12.0+ | /common/bioref/blast/latest/ref_viruses_rep_genomes | Virus representative genomes from the NCBI RefSeq project. |
| SSU_eukaryote_rRNA | dbV5 | 2.12.0+ | /common/bioref/blast/latest/SSU_eukaryote_rRNA | Small subunit rRNA sequences from fungi and other eukaryotes. |
| swissprot | dbV5 | 2.12.0+ | /common/bioref/blast/latest/swissprot | Protein sequences from the Swiss-Prot sequence database (last major update). |
| taxdb | dbV5 | 2.12.0+ | /common/bioref/blast/latest/taxdb | A non-sequence database file containing taxonomic information for sequences in the preformatted databases, providing common and scientific names for each entry. |
| tsa_nr | dbV5 | 2.12.0+ | /common/bioref/blast/latest/tsa_nr | Protein sequences from the Transcriptome Shotgun Assembly (TSA). Its entries are EXCLUDED from the nr database. |
| tsa_nt | dbV5 | 2.12.0+ | /common/bioref/blast/latest/tsa_nt | A database with earlier non-project based Transcriptome Shotgun Assembly (TSA) entries. Project-based TSA entries are NOT included. Entries are EXCLUDED from the nt database. |

* from: https://www.ncbi.nlm.nih.gov/books/NBK62345/#blast_ftp_site.The_blastdb_...

 

 


BLAST Update Schedule

January
April
July
October

 

The bioref BLAST databases are updated quarterly, in the months listed in the update schedule, and each version is kept for 2 years. If needed, you can find older versions of the BLAST databases within the /common/bioref/blast/ folder, named by the month and year in which they were downloaded (e.g. “blast_update_04_2023” refers to the April 2023 update).
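
The naming convention for archived updates can be sketched in Python as follows. This is a minimal illustration and rests on assumptions: the function names are hypothetical, and it treats an update as available from the first day of its scheduled month, which this page does not guarantee. It also does not check the 2-year retention window.

```python
from datetime import date

# Parent folder of the archived BLAST updates, as stated on this page.
BLAST_ROOT = "/common/bioref/blast"
# Scheduled update months: January, April, July, October.
UPDATE_MONTHS = (1, 4, 7, 10)


def blast_archive_dir(year, month):
    """Return the archive folder path for one scheduled update."""
    if month not in UPDATE_MONTHS:
        raise ValueError(f"{month} is not a scheduled update month")
    # Folder names follow the pattern blast_update_MM_YYYY.
    return f"{BLAST_ROOT}/blast_update_{month:02d}_{year}"


def latest_scheduled_update(today):
    """Return (year, month) of the most recent scheduled update.

    Assumes the update exists from the first day of its month.
    """
    month = max(m for m in UPDATE_MONTHS if m <= today.month)
    return today.year, month


if __name__ == "__main__":
    year, month = latest_scheduled_update(date(2023, 5, 15))
    print(blast_archive_dir(year, month))
```

Because January is always a scheduled month, latest_scheduled_update never has to look back into the previous year.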

 

 

 

 



ENSEMBL databases

 

The genomes and gene annotations for all organisms from ENSEMBL, except bacteria, are stored in /common/bioref/ensembl/. In addition to storing the genomes (FASTA format) and gene annotations (GFF and GTF formats), we also provide pre-built genome indices for BWA, HISAT2, Bowtie, Bowtie2, Samtools, NCBI_toolkit, and Picard. Note: Protein sequences and transcript sequences (FASTA format) corresponding to the genes in the ENSEMBL GTF/GFFs are not provided.

 

 


Latest ENSEMBL databases are located within:

/common/bioref/ensembl/

 

 

Organisms are organized into directories by kingdom (plants, metazoa, fungi, etc.), except for “main”, which contains model organisms and vertebrates.

ENSEMBL Directory Structure

| Directory Name | Contents |
|---|---|
| main | Model organisms and vertebrates |
| grch37 | GRCh37 build of the human genome |
| fungi | Fungi |
| metazoa | Invertebrates and other animals |
| plants | Green plants and algae |
| protists | Organisms colloquially known as “protists” |

 


Within each “genus_species” directory, you will find the following subdirectories:

 

| Directory Name | Contents |
|---|---|
| annotation | Gene model annotations in GFF and GTF formats |
| blast | NCBI BLAST+ index files for the genome (not peptide, not transcripts) |
| bowtie | Index files for read mapping with bowtie version 1 |
| bowtie2 | Index files for read mapping with bowtie version 2 |
| bwa | Index files for read mapping with BWA |
| hisat2 | Index files for splice-aware read mapping with HISAT2 |
| seq | FASTA sequence files for the genome |
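
To show how these directory names combine, here is a small Python sketch that assembles the expected path to one resource directory. It assumes the layout is /common/bioref/ensembl/<division>/<genus_species>/<subdirectory> with no intermediate levels, which this page does not state explicitly, so verify the result with ls before relying on it. The function name and the example organism directory “homo_sapiens” are hypothetical.

```python
import posixpath

# Root of the ENSEMBL databases, as stated on this page.
ENSEMBL_ROOT = "/common/bioref/ensembl"
# Top-level directories and per-organism subdirectories listed on this page.
DIVISIONS = {"main", "grch37", "fungi", "metazoa", "plants", "protists"}
SUBDIRS = {"annotation", "blast", "bowtie", "bowtie2", "bwa", "hisat2", "seq"}


def ensembl_subdir(division, genus_species, subdir):
    """Build the expected path to one resource directory for an organism.

    Assumes <root>/<division>/<genus_species>/<subdir>; the real tree may
    contain additional levels, so treat the result as a starting point.
    """
    if division not in DIVISIONS:
        raise ValueError(f"unknown division: {division}")
    if subdir not in SUBDIRS:
        raise ValueError(f"unknown subdirectory: {subdir}")
    return posixpath.join(ENSEMBL_ROOT, division, genus_species, subdir)


if __name__ == "__main__":
    # "homo_sapiens" is a hypothetical directory name used for illustration.
    print(ensembl_subdir("main", "homo_sapiens", "bwa"))
```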

 

 

 


Software Versions/Commands used for genomic sequence indexing

| Software | Version | Command |
|---|---|---|
| BWA | 0.7.17 | bwa index |
| HISAT2 | 2.1.0 | hisat2-build |
| bowtie2 | 2.3.4.1 | bowtie2-build |
| bowtie | 1.1.2 | bowtie-build |
| picard | 2.3.0 | picard CreateSequenceDictionary |
| ncbi_toolkit | 25.2.0 | makeblastdb |
| samtools | 1.9 | samtools faidx |
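
For users who want to build comparable indices for a genome that is not in bioref, the commands above can be sketched in Python as argument lists. This is only an approximation: the exact options MSI used are not given on this page, so the arguments below are the basic documented forms of each tool, and the function name, FASTA file name, and index prefix are hypothetical. The Picard call uses the R=/O= argument style of older Picard releases such as 2.3.0.

```python
def index_commands(fasta, prefix):
    """Return one basic indexing command per tool as an argument list.

    The options are assumptions; MSI's actual invocations may differ.
    """
    return {
        "bwa": ["bwa", "index", fasta],
        "hisat2": ["hisat2-build", fasta, prefix],
        "bowtie2": ["bowtie2-build", fasta, prefix],
        "bowtie": ["bowtie-build", fasta, prefix],
        # Writes a sequence dictionary next to the other index files.
        "picard": ["picard", "CreateSequenceDictionary",
                   f"R={fasta}", f"O={prefix}.dict"],
        # Nucleotide BLAST database for the genome sequence.
        "ncbi_toolkit": ["makeblastdb", "-in", fasta,
                         "-dbtype", "nucl", "-out", prefix],
        "samtools": ["samtools", "faidx", fasta],
    }


if __name__ == "__main__":
    # Print the commands instead of running them.
    for tool, argv in index_commands("genome.fa", "genome").items():
        print(f"{tool}: {' '.join(argv)}")
```

Each list can be handed to subprocess.run once the matching software module is loaded. Using the same tool versions as in the table keeps the resulting indices consistent with the ones provided in bioref.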

 

 


ENSEMBL Update Schedule

ENSEMBL databases are updated annually in January, and each version is kept for 4 years. If needed, you can find older versions of the ENSEMBL databases within the ENSEMBL folder, labeled by month and year (e.g. “ensembl_update_01_2023”).