The program that converts FASTA files to the binary files used by BLAST is formatdb. The index file, sequence file and header file are the three files needed to extract sequences from the BLAST database. For protein databases these files end with the extensions ".pin", ".psq" and ".phr" respectively. For DNA databases the extensions are ".nin", ".nsq" and ".nhr" respectively. The index file contains information about the database, i.e. version number, database type, file offsets, etc. The sequence file contains residues for each of the sequences. Finally, the header file contains the header information for each of the sequences. This document describes the structure of the NCBI BLAST database version 4 (the current version as of this writing).
The NCBI C Toolkit documentation warns that the internal structure of the
BLAST databases can change with little or no notice. NCBI recommends using
the readdb API, which is part of the NCBI Toolkit, to extract data from
the BLAST databases.
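The extension convention described above can be captured in a small helper. This is only a sketch: the function name `database_files` and the `"protein"`/`"dna"` keys are illustrative and not part of any NCBI tool.

```python
from pathlib import Path

# Extensions in the order (index, sequence, header), as listed above.
EXTENSIONS = {
    "protein": (".pin", ".psq", ".phr"),
    "dna": (".nin", ".nsq", ".nhr"),
}


def database_files(base, kind):
    """Return the (index, sequence, header) paths for a database base name."""
    base = str(base)
    return tuple(Path(base + ext) for ext in EXTENSIONS[kind])
```

For example, `database_files("db/nr", "protein")` yields the paths ending in `nr.pin`, `nr.psq` and `nr.phr`.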
| BlastDbFormatV4.pdf | - | This document describes the NCBI BLAST database in a bit more detail. |
| dumpncbi_1.0.tgz | - | A simple recursive descent program that dumps the contents of a BLAST database. The program prints out basic database information from the index file. Then, for each sequence, the header information is printed, followed by the sequence data. This program is only meant to demonstrate how to parse an NCBI BLAST database. |
The integer fields of the index file are stored in big endian format, except for the residue count, which is stored in little endian. The Timestamp string might be padded with NUL characters so that the remaining integer fields are properly aligned for performance reasons. The offset tables always store one more entry than the number of sequences. This last entry points to the end of the file, so the size of any object can be calculated by subtracting its offset from the next offset, and no special code is needed for the last sequence.
| Name | Type | Description |
|---|---|---|
| Version | Int32 | Version Number. Note: This page describes only version 4. |
| Database type | Int32 | 0 - DNA; 1 - Protein. |
| Title length | Int32 | Length of the title string (T). |
| Title | Char[T] | Database title. Note: This string is not NUL terminated. |
| Timestamp length | Int32 | Length of the timestamp string (S). |
| Timestamp | Char[S] | Time of database creation. Note: The length of the timestamp S is increased to force 8 byte alignment of the next integer field. The timestamp is padded, if necessary, with NULs to achieve this alignment. |
| Number of sequences | Int32 | Number of sequences in the database (N). |
| Residue count | Int64 | Total number of residues in the database. Note: This field is stored in little endian. |
| Longest sequence | Int32 | Length of the longest sequence in the database. |
| Header offset table | Int32[N+1] | Offsets into the header file (*.phr, *.nhr). |
| Sequence offset table | Int32[N+1] | Offsets into the sequence file (*.psq, *.nsq). |
| Ambiguity offset table | Int32[N+1] | Offsets into the sequence file (*.nsq). Note: This table is only in DNA databases. The ambiguity table follows the 2 bit residue encoding. If the sequence does not have any ambiguity residues, then the offset points to the beginning of the next sequence. |
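The field layout in the table can be read with Python's `struct` module. The following is a minimal sketch, not a replacement for the readdb API: the function name `parse_index` and the dictionary keys are illustrative, and it assumes the whole index file has already been read into a `bytes` object.

```python
import struct


def parse_index(data):
    """Parse the bytes of a version 4 index file (.pin or .nin)."""
    pos = 0

    def read_int32():
        nonlocal pos
        (value,) = struct.unpack_from(">i", data, pos)  # big endian Int32
        pos += 4
        return value

    def read_string():
        nonlocal pos
        length = read_int32()  # length prefix, string is not NUL terminated
        raw = data[pos:pos + length]
        pos += length
        return raw

    info = {"version": read_int32(), "is_protein": read_int32() == 1}
    info["title"] = read_string().decode("ascii", "replace")
    # The stored timestamp length already includes any NUL padding.
    info["timestamp"] = read_string().rstrip(b"\0").decode("ascii", "replace")
    n = info["num_sequences"] = read_int32()
    # The residue count is the one little endian field.
    (info["residue_count"],) = struct.unpack_from("<q", data, pos)
    pos += 8
    info["longest_sequence"] = read_int32()

    def read_table():
        nonlocal pos
        table = struct.unpack_from(f">{n + 1}i", data, pos)  # N+1 offsets
        pos += 4 * (n + 1)
        return table

    info["header_offsets"] = read_table()
    info["sequence_offsets"] = read_table()
    if not info["is_protein"]:
        info["ambiguity_offsets"] = read_table()  # DNA databases only
    return info
```

With the tables loaded, the size of the kth header is `header_offsets[k + 1] - header_offsets[k]`, including for the last sequence.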
The protein sequence file encodes one residue per 8 bits. Sequences are separated by a NUL byte. The 8 bit encoding is below. The beginning of the kth sequence is found by indexing into the sequence offset table. The length of the kth sequence is calculated by subtracting the kth offset from the (k+1)th offset and then subtracting one (for the NUL byte). Since the database stores one more offset than sequences, no special code is needed for calculating the length of the last sequence.
| Amino acid | Value | Amino acid | Value |
|---|---|---|---|
| - | 0 | N | 13 |
| A | 1 | O | 26 |
| B | 2 | P | 14 |
| C | 3 | Q | 15 |
| D | 4 | R | 16 |
| E | 5 | S | 17 |
| F | 6 | T | 18 |
| G | 7 | U | 24 |
| H | 8 | V | 19 |
| I | 9 | W | 20 |
| J | 27 | X | 21 |
| K | 10 | Y | 22 |
| L | 11 | Z | 23 |
| M | 12 | * | 25 |
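The offset arithmetic and the 8 bit encoding can be combined into a short decoder. This is a sketch with illustrative names (`NCBISTDAA`, `protein_sequence`); it takes the contents of the sequence file and the sequence offset table from the index file, and reads value 3 as C (cysteine).

```python
# Position in the string = stored value, character = residue (values 0 to 27).
NCBISTDAA = "-ABCDEFGHIKLMNPQRSTVWXYZU*OJ"


def protein_sequence(seq_data, offsets, k):
    """Return the kth protein sequence (0-based) as a string."""
    start = offsets[k]
    end = offsets[k + 1] - 1  # minus one to drop the separating NUL byte
    return "".join(NCBISTDAA[value] for value in seq_data[start:end])
```

Because the offset table has N+1 entries, `offsets[k + 1]` is valid even for the last sequence.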
The DNA sequence file first encodes the sequence in a 2-bit encoding followed by an ambiguity table to correct any residues in the 2-bit encoding. Unlike the protein sequences, each sequence is not separated by a NUL byte.
The 2-bit section encodes one residue per 2 bits. The 2-bit encoded values are below. The beginning of the kth sequence is found by indexing into the sequence offset table. The number of bytes used in the 2-bit encoding is calculated by subtracting the sequence offset from the ambiguity offset. The last byte of the 2-bit encoding can code for zero to three residues. The two least significant bits of the last byte contain the number of residues (0 - 3) in the last byte.
| Nucleotide | Value | Binary |
|---|---|---|
| A | 0 | 00 |
| C | 1 | 01 |
| G | 2 | 10 |
| T or U | 3 | 11 |
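A sketch of unpacking the 2-bit section follows. The text does not state the order of residues within a byte; this example assumes the first residue occupies the two most significant bits. The function name `decode_2bit` is illustrative.

```python
def decode_2bit(seq_data, seq_offset, amb_offset):
    """Decode the 2-bit section stored in seq_data[seq_offset:amb_offset]."""
    packed = seq_data[seq_offset:amb_offset]
    if not packed:
        return ""
    bases = []
    # Every byte except the last holds four residues.
    for byte in packed[:-1]:
        for shift in (6, 4, 2, 0):  # assumption: first residue in the high bits
            bases.append("ACGT"[(byte >> shift) & 3])
    # The two least significant bits of the last byte give its residue count.
    last = packed[-1]
    for i in range(last & 3):
        bases.append("ACGT"[(last >> (6 - 2 * i)) & 3])
    return "".join(bases)
```

Degenerate residues come out as ordinary bases at this stage; they are corrected afterwards using the ambiguity table.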
To correct a sequence containing any degenerate residues, an ambiguity table follows the 2-bit encoded string. The start of the ambiguity table is pointed to by the ambiguity offset table. The first four bytes contain the number of 32 bit words in the correction table. Each entry contains three pieces of information: the actual encoded nucleotide, how many nucleotides are to be replaced, and the offset into the sequence at which to apply the correction.
For 32 bit entries, the four most significant bits encode the actual nucleotide. These values are below. The next four bits encode the repeat count. This is the count of the number of residues in the original sequence that are to be replaced. One is added to the count, giving it the range of 1 - 16. The final 24 bits are the offset into the sequence at which to start the replacement. The first residue starts at offset zero, the second at offset one, etc. Using a 24 bit offset, only sequences up to 16 million residues can be corrected.
For sequences longer than 16 million residues, 64 bit correction entries are used. To signal the use of 64 bit entries, the count at the beginning of the ambiguity table has its most significant bit set. Even when 64 bit entries are used, the remaining 31 bits of the count still give the number of 32 bit words in the table. The first four bits encode the actual nucleotide. The repeat count and replacement offset are widened to 12 and 48 bits respectively.
| Nucleotide | Value | Binary | Nucleotide | Value | Binary |
|---|---|---|---|---|---|
| - | 0 | 0000 | T | 8 | 1000 |
| A | 1 | 0001 | W (A|T) | 9 | 1001 |
| C | 2 | 0010 | Y (C|T) | 10 | 1010 |
| M (A|C) | 3 | 0011 | H (A|C|T) | 11 | 1011 |
| G | 4 | 0100 | K (G|T) | 12 | 1100 |
| R (A|G) | 5 | 0101 | D (A|G|T) | 13 | 1101 |
| S (C|G) | 6 | 0110 | B (C|G|T) | 14 | 1110 |
| V (A|C|G) | 7 | 0111 | N (A|C|G|T) | 15 | 1111 |
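The correction step can be sketched as below. Several points are assumptions rather than statements from the text: the count and the entries are read as big endian, a 64 bit entry is read as a single 64 bit value, and one is added to the repeat count in 64 bit entries as it is in 32 bit entries. The names `AMBIGUITY_CODES` and `apply_ambiguities` are illustrative.

```python
import struct

# Position in the string = 4-bit value from the table above.
AMBIGUITY_CODES = "-ACMGRSVTWYHKDBN"


def apply_ambiguities(sequence, seq_data, amb_offset, next_seq_offset):
    """Return `sequence` (already 2-bit decoded) with corrections applied."""
    if amb_offset == next_seq_offset:
        return sequence  # no ambiguity table for this sequence
    (count,) = struct.unpack_from(">I", seq_data, amb_offset)
    wide = bool(count & 0x80000000)  # top bit set: 64 bit entries
    words = count & 0x7FFFFFFF       # table size in 32 bit words
    residues = list(sequence)
    pos = amb_offset + 4
    end = pos + 4 * words
    while pos < end:
        if wide:
            (entry,) = struct.unpack_from(">Q", seq_data, pos)
            code = entry >> 60
            repeat = (entry >> 48) & 0xFFF
            start = entry & 0xFFFFFFFFFFFF
            pos += 8
        else:
            (entry,) = struct.unpack_from(">I", seq_data, pos)
            code = entry >> 28
            repeat = (entry >> 24) & 0xF
            start = entry & 0xFFFFFF
            pos += 4
        run = repeat + 1  # the stored count is one less than the run length
        residues[start:start + run] = AMBIGUITY_CODES[code] * run
    return "".join(residues)
```

For instance, a single 32 bit entry `0xF1000002` replaces the two residues starting at offset 2 with `N`.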
The header file contains the headers for each sequence, one after another. The headers are in a binary encoded ASN.1 format. The size of the kth header can be calculated by subtracting the kth header offset from the (k+1)th header offset. The five types making up the ASN.1 BLAST headers are:
| INTEGER | - | a variable length integer value. The first byte of an encoded integer is a hex 02. The next byte is the number of bytes used to encode the integer value. The remaining bytes are the actual value. The value is encoded most significant byte first. |
| VisibleString | - | a variable length string. The first byte of a visible string is a hex 1A. The next byte starts encoding the length of the string. If the most significant bit is off, then the lower seven bits encode the length of the string, i.e. the string has a length less than 128. If the most significant bit is on, then the lower seven bits is the number of bytes that hold the length of the string, then the bytes encoding the string length, most significant bytes first. Following the length are the actual string characters. The strings are not NUL terminated. |
| CHOICE | - | a union of one or more alternatives. The first two bytes indicate which alternative of the choice is selected. The selectors start with the hex value A080 for the first alternative, A180 for the second, etc. This is followed by the encoded data for that alternative. Finally, two NUL bytes mark the end of the choice. |
| SEQUENCE | - | an ordered collection of one or more types. The first two bytes are a hex 3080. This is followed by the encoded items of the sequence. Each item begins with two bytes that indicate which item of the sequence is encoded. This index starts with the hex value A080 for the first item, A180 for the second, etc. It is followed by the encoded item and finally by two NUL bytes, 0000, to indicate the end of that item. The next item in the sequence is then encoded. If an item is optional and is not defined, then none of it is encoded, including the index and NUL bytes. This is repeated until the entire sequence has been encoded. Two NUL bytes then mark the end of the sequence. |
| SEQUENCE OF | - | an ordered collection of zero or more occurrences of a given type. The first two bytes are a hex 3080. Then the lists of objects are encoded. Two NUL bytes encode the end of the list. |
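The two primitive types can be decoded with a few lines each. This is a sketch with illustrative function names; each function returns the decoded value together with the position of the next byte. The integer is read as two's complement, which is standard for ASN.1 BER but is not stated in the table.

```python
def read_integer(buf, pos):
    """Decode an INTEGER starting at buf[pos]; return (value, next_pos)."""
    if buf[pos] != 0x02:
        raise ValueError("expected INTEGER tag 0x02")
    length = buf[pos + 1]
    start = pos + 2
    # Most significant byte first; two's complement per ASN.1 BER.
    value = int.from_bytes(buf[start:start + length], "big", signed=True)
    return value, start + length


def read_visible_string(buf, pos):
    """Decode a VisibleString starting at buf[pos]; return (text, next_pos)."""
    if buf[pos] != 0x1A:
        raise ValueError("expected VisibleString tag 0x1A")
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:
        # Long form: the low seven bits give the number of length bytes.
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    text = buf[pos:pos + length].decode("ascii")  # not NUL terminated
    return text, pos + length
```

The returned position makes it easy to chain calls while walking through the items of a SEQUENCE.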
Below is the ASN.1 BLAST header definition. It was copied from two files, asn.all and fastadl.asn from the BLAST C Toolkit.
Blast-def-line-set ::= SEQUENCE OF Blast-def-line -- all deflines for an entry
Blast-def-line ::= SEQUENCE {
title VisibleString OPTIONAL, -- simple title
seqid SEQUENCE OF Seq-id OPTIONAL, -- Regular NCBI Seq-Id
taxid INTEGER OPTIONAL, -- taxonomy id
memberships SEQUENCE OF INTEGER OPTIONAL, -- bit arrays
links SEQUENCE OF INTEGER OPTIONAL, -- bit arrays
other-info SEQUENCE OF INTEGER OPTIONAL } -- future use
Seq-id ::= CHOICE {
local Object-id, -- local use
gibbsq INTEGER, -- Geninfo backbone seqid
gibbmt INTEGER, -- Geninfo backbone moltype
giim Giimport-id, -- Geninfo import id
genbank Textseq-id,
embl Textseq-id,
pir Textseq-id,
swissprot Textseq-id,
patent Patent-seq-id,
other Textseq-id, -- for historical reasons, 'other' = 'refseq'
general Dbtag, -- for other databases
gi INTEGER, -- GenInfo Integrated Database
ddbj Textseq-id, -- DDBJ
prf Textseq-id, -- PRF SEQDB
pdb PDB-seq-id, -- PDB sequence
tpg Textseq-id, -- Third Party Annot/Seq Genbank
tpe Textseq-id, -- Third Party Annot/Seq EMBL
tpd Textseq-id, -- Third Party Annot/Seq DDBJ
gpipe Textseq-id, -- Internal NCBI genome pipeline processing ID
named-annot-track Textseq-id } -- Internal named annotation tracking ID
Dbtag ::= SEQUENCE {
db VisibleString, -- name of database or system
tag Object-id } -- appropriate tag
-- Object-id can tag or name anything
Object-id ::= CHOICE {
id INTEGER,
str VisibleString }
Patent-seq-id ::= SEQUENCE {
seqid INTEGER, -- number of sequence in patent
cit Id-pat } -- patent citation
Textseq-id ::= SEQUENCE {
name VisibleString OPTIONAL,
accession VisibleString OPTIONAL,
release VisibleString OPTIONAL,
version INTEGER OPTIONAL }
Giimport-id ::= SEQUENCE {
id INTEGER, -- the id to use here
db VisibleString OPTIONAL, -- dbase used in
release VisibleString OPTIONAL } -- the release
PDB-seq-id ::= SEQUENCE {
mol PDB-mol-id, -- the molecule name
chain INTEGER DEFAULT 32, -- a single ASCII character, chain id
rel Date OPTIONAL } -- release date, month and year
PDB-mol-id ::= VisibleString -- name of mol, 4 chars
Id-pat ::= SEQUENCE { -- just to identify a patent
country VisibleString, -- Patent Document Country
id CHOICE {
number VisibleString, -- Patent Document Number
app-number VisibleString }, -- Patent Doc Appl Number
doc-type VisibleString OPTIONAL } -- Patent Doc Type
Date ::= CHOICE {
str VisibleString, -- for those unparsed dates
std Date-std } -- use this if you can
Date-std ::= SEQUENCE { -- NOTE: this is NOT a unix tm struct
year INTEGER, -- full year (including 1900)
month INTEGER OPTIONAL, -- month (1-12)
day INTEGER OPTIONAL, -- day of month (1-31)
season VisibleString OPTIONAL, -- for "spring", "may-june", etc
hour INTEGER OPTIONAL, -- hour of day (0-23)
minute INTEGER OPTIONAL, -- minute of hour (0-59)
second INTEGER OPTIONAL } -- second of minute (0-59)
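As a rough illustration of how the encoding rules and the definitions above fit together, the sketch below walks a header into a nested tree of `(tag, content)` pairs. It assumes single-byte tags and treats any item whose second byte is hex 80 as a container that runs until two NUL bytes. The name `parse_item` is illustrative, and mapping the tree onto the named fields of `Blast-def-line` is left out.

```python
def parse_item(buf, pos=0):
    """Parse one encoded item at buf[pos]; return ((tag, content), next_pos).

    Containers (3080, A080, A180, ...) yield a list of child items as
    content; primitives (INTEGER, VisibleString) yield their raw bytes.
    """
    tag = buf[pos]
    if buf[pos + 1] == 0x80:
        # Container: read children until the two NUL bytes that close it.
        children = []
        pos += 2
        while buf[pos:pos + 2] != b"\x00\x00":
            child, pos = parse_item(buf, pos)
            children.append(child)
        return (tag, children), pos + 2
    # Primitive with an explicit length (short or long form).
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    return (tag, bytes(buf[pos:pos + length])), pos + length
```

In the resulting tree for a `Blast-def-line`, a child tagged hex A0 holds the title and a child tagged hex A2 holds the taxid, following the item order in the definition.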