The program that converts FASTA files to the binary files used by BLAST is formatdb. The index file, sequence file and header file are the three files needed to extract sequences from the BLAST database. For protein databases these files end with the extensions ".pin", ".psq" and ".phr" respectively. For DNA databases the extensions are ".nin", ".nsq" and ".nhr" respectively. The index file contains information about the database, i.e. version number, database type, file offsets, etc. The sequence file contains residues for each of the sequences. Finally, the header file contains the header information for each of the sequences. This document describes the structure of the NCBI BLAST database version 4 (the current version as of this writing).

The NCBI C Toolkit warns that the internal structure of the BLAST databases can change with little or no notice. It recommends using the readdb API, which is part of the NCBI Toolkit, to extract data from BLAST databases.

BlastDbFormatV4.pdf - This document describes the NCBI BLAST database in a bit more detail.
dumpncbi_1.0.tgz - A simple recursive descent program that dumps the contents of a BLAST database. The program prints out basic database information from the index file. Then, for each sequence, the header information is printed followed by the sequence data. This program is only meant to demonstrate how to parse an NCBI BLAST database.

Index File Layout (*.pin, *.nin)

The integer fields are stored in big endian format, except for the residue count, which is stored in little endian. The Timestamp string might be padded with NUL characters to force the remaining integer fields to be properly aligned for performance reasons. The offset tables always store one more entry than the number of sequences. This last entry points to the end of the file. This allows the size of an object to be calculated by subtracting the current offset from the next offset. No special code is needed for the last sequence.

Name                    Type         Description
Version                 Int32        Version number. Note: This page describes only version 4.
Database type           Int32        0 - DNA; 1 - Protein.
Title length            Int32        Length of the title string (T).
Title                   Char[T]      Database title. Note: This string is not NUL terminated.
Timestamp length        Int32        Length of the timestamp string (S).
Timestamp               Char[S]      Time of database creation. Note: The length of the timestamp S is increased to force 8 byte alignment of the next integer field. The timestamp is padded, if necessary, with NULs to achieve this alignment.
Number of sequences     Int32        Number of sequences in the database (N).
Residue count           Int64        Total number of residues in the database. Note: This field is stored in little endian.
Longest sequence        Int32        Length of the longest sequence in the database.
Header offset table     Int32[N+1]   Offsets into the header file (*.phr, *.nhr).
Sequence offset table   Int32[N+1]   Offsets into the sequence file (*.psq, *.nsq).
Ambiguity offset table  Int32[N+1]   Offsets into the sequence file (*.nsq). Note: This table is only in DNA databases. The ambiguity table follows the 2 bit residue encoding. If the sequence does not have any ambiguity residues, then the offset points to the beginning of the next sequence.
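
The layout above can be read field by field. The following is a minimal sketch in Python, not the toolkit's own reader: the function name parse_index and the dictionary keys are hypothetical, and the integer fields are treated as unsigned, which is an assumption.

```python
import struct

def parse_index(data: bytes) -> dict:
    """Parse the contents of a version 4 index file (*.pin or *.nin)."""
    pos = 0

    def int32() -> int:
        nonlocal pos
        (value,) = struct.unpack_from(">I", data, pos)  # big endian
        pos += 4
        return value

    def string() -> bytes:
        nonlocal pos
        length = int32()              # length prefix, then raw characters
        raw = data[pos:pos + length]
        pos += length
        return raw

    def table(entries: int) -> list:
        nonlocal pos
        values = list(struct.unpack_from(f">{entries}I", data, pos))
        pos += 4 * entries
        return values

    info = {}
    info["version"] = int32()
    info["is_protein"] = int32() == 1          # 0 - DNA; 1 - Protein
    info["title"] = string().decode("latin-1")
    # The timestamp may carry trailing NUL padding for alignment.
    info["timestamp"] = string().rstrip(b"\x00").decode("latin-1")
    n = info["num_sequences"] = int32()
    (info["residue_count"],) = struct.unpack_from("<Q", data, pos)  # little endian
    pos += 8
    info["longest_sequence"] = int32()
    info["header_offsets"] = table(n + 1)
    info["sequence_offsets"] = table(n + 1)
    if not info["is_protein"]:
        # Only DNA databases carry the ambiguity offset table.
        info["ambiguity_offsets"] = table(n + 1)
    return info
```

With the file contents loaded by open(path, "rb").read(), the size of the kth header is then header_offsets[k + 1] - header_offsets[k].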

Protein Sequence File (*.psq)

The protein sequence file encodes one residue per 8 bits. Each sequence is separated by a NUL byte. The 8 bit encoding is below. The beginning of the kth sequence is found by indexing into the sequence offset table. The length of the kth sequence is calculated by subtracting the kth sequence offset from the (k+1)th offset and then subtracting one (for the NUL byte). Since the database stores one more offset than sequences, no special code is needed for calculating the length of the last sequence.

Amino acid  Value    Amino acid  Value
-           0        N           13
A           1        O           26
B           2        P           14
C           3        Q           15
D           4        R           16
E           5        S           17
F           6        T           18
G           7        U           24
H           8        V           19
I           9        W           20
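
As an illustration of the offset arithmetic, here is a small Python sketch that extracts the kth protein sequence. The function name protein_sequence is hypothetical. The alphabet string is the standard NCBIstdaa ordering, in which value 3 is C; it covers more letters than the table above lists, so treat the extra entries as an assumption taken from that standard rather than from this page.

```python
# NCBIstdaa alphabet: the position of each character is its encoded value.
# Values not listed in the table above come from the NCBIstdaa standard
# (an assumption as far as this page is concerned).
NCBISTDAA = "-ABCDEFGHIKLMNPQRSTVWXYZU*OJ"

def protein_sequence(psq: bytes, seq_offsets: list, k: int) -> str:
    """Return the kth sequence from the contents of a *.psq file."""
    start = seq_offsets[k]
    end = seq_offsets[k + 1] - 1      # drop the NUL separator byte
    return "".join(NCBISTDAA[value] for value in psq[start:end])
```

Because the offset table holds N+1 entries, the same expression works for the last sequence.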

DNA Sequence File (*.nsq)

The DNA sequence file first encodes the sequence in a 2-bit encoding, followed by an ambiguity table to correct any residues in the 2-bit encoding. Unlike the protein sequences, the DNA sequences are not separated by NUL bytes.

DNA 2-Bit Encoding

The DNA sequence file first encodes one residue per 2 bits. The 2-bit encoded values are below. The beginning of the kth sequence is found by indexing into the sequence offset table. The number of bytes used in the 2-bit encoding is calculated by subtracting the sequence offset from the ambiguity offset. The last byte of the 2-bit encoding can code for zero to three residues. The two least significant bits of the last byte contain the number of residues (0 - 3) in the last byte.

Nucleotide  Value  Binary
A           0      00
C           1      01
G           2      10
T or U      3      11
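
The unpacking described above can be sketched in Python as follows. The function name decode_2bit is hypothetical, and the sketch assumes that the first residue of each byte occupies the two most significant bits, which this page does not state explicitly.

```python
def decode_2bit(nsq: bytes, seq_offsets: list, amb_offsets: list, k: int) -> str:
    """Decode the 2-bit part of the kth sequence from a *.nsq file.

    Ambiguity corrections are not applied here.
    """
    packed = nsq[seq_offsets[k]:amb_offsets[k]]
    if not packed:
        return ""
    bases = []
    # Every byte except the last holds four residues.
    for byte in packed[:-1]:
        for shift in (6, 4, 2, 0):    # assumed order: high bits first
            bases.append("ACGT"[(byte >> shift) & 3])
    # The last byte holds 0 to 3 residues; its two low bits give the count.
    last = packed[-1]
    for i in range(last & 3):
        bases.append("ACGT"[(last >> (6 - 2 * i)) & 3])
    return "".join(bases)
```

A sequence whose length is a multiple of four therefore still ends with one extra byte whose count bits are zero.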

DNA Ambiguity Table

To correct a sequence containing any degenerate residues, an ambiguity table follows the 2-bit encoded string. The start of the ambiguity table is given by the ambiguity offset table in the index file. The first four bytes contain the number of 32-bit words in the correction table. Each entry contains three pieces of information: the actual encoded nucleotide, how many nucleotides are to be replaced, and the offset into the sequence at which to apply the correction.

For 32-bit entries, the four most significant bits encode the actual nucleotide. These values are below. The next four bits encode the repeat count. This is the number of residues in the original sequence that are to be replaced. One is added to the stored count, giving it the range 1 - 16. The final 24 bits are the offset into the sequence at which to start the replacement. The first residue is at offset zero, the second at offset one, etc. With a 24-bit offset, only sequences of up to 16 million residues can be corrected.

For sequences longer than 16 million residues, 64-bit correction entries are used. To signal the use of 64-bit entries, the count at the beginning of the ambiguity table has its most significant bit set. Even when 64-bit entries are used, the remaining 31 bits of the count still give the number of 32-bit words in the table. The first four bits of each entry encode the actual nucleotide. The repeat count and replacement offset are widened to 12 and 48 bits respectively.

Nucleotide    Value  Binary    Nucleotide     Value  Binary
-             0      0000      T              8      1000
A             1      0001      W (A|T)        9      1001
C             2      0010      Y (C|T)        10     1010
M (A|C)       3      0011      H (A|C|T)      11     1011
G             4      0100      K (G|T)        12     1100
R (A|G)       5      0101      D (A|G|T)      13     1101
S (C|G)       6      0110      B (C|G|T)      14     1110
V (A|C|G)     7      0111      N (A|C|G|T)    15     1111
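
The following Python sketch applies an ambiguity table to an already decoded sequence. The function name apply_ambiguity is hypothetical. Two points are assumptions not spelled out on this page: the count and the entries are read as big endian, and one is added to the repeat count in the 64-bit format just as in the 32-bit format.

```python
import struct

# The position of each character is its 4-bit value from the table above.
AMBIGUITY = "-ACMGRSVTWYHKDBN"

def apply_ambiguity(bases: str, table: bytes) -> str:
    """Overwrite residues in `bases` using the raw ambiguity table bytes."""
    if not table:
        return bases                          # no corrections for this sequence
    (count,) = struct.unpack_from(">I", table, 0)
    wide = bool(count & 0x80000000)           # top bit set: 64-bit entries
    words = count & 0x7FFFFFFF                # always counted in 32-bit words
    out = list(bases)
    entries = words // 2 if wide else words
    for i in range(entries):
        if wide:
            (entry,) = struct.unpack_from(">Q", table, 4 + 8 * i)
            code = entry >> 60                      # 4 bits: nucleotide
            run = ((entry >> 48) & 0xFFF) + 1       # 12 bits: repeat count
            start = entry & 0xFFFFFFFFFFFF          # 48 bits: offset
        else:
            (entry,) = struct.unpack_from(">I", table, 4 + 4 * i)
            code = entry >> 28                      # 4 bits: nucleotide
            run = ((entry >> 24) & 0xF) + 1         # 4 bits: repeat count
            start = entry & 0xFFFFFF                # 24 bits: offset
        end = min(start + run, len(out))            # stay inside the sequence
        out[start:end] = AMBIGUITY[code] * (end - start)
    return "".join(out)
```

For the kth sequence, the table bytes would be the slice of the *.nsq file from the kth ambiguity offset to the (k+1)th sequence offset.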

Header File (*.phr, *.nhr)

The header file contains the headers for each sequence, one after another. The headers are in a binary encoded ASN.1 format. The size of the kth header can be calculated by subtracting the kth header offset from the (k+1)th header offset. The five types making up the ASN.1 BLAST headers are

INTEGER - a variable length integer value.
The first byte of an encoded integer is a hex 02. The next byte is the number of bytes used to encode the integer value. The remaining bytes are the actual value. The value is encoded most significant byte first.
VisibleString - a variable length string.
The first byte of a visible string is a hex 1A. The next byte starts encoding the length of the string. If the most significant bit is off, then the lower seven bits encode the length of the string, i.e. the string has a length less than 128. If the most significant bit is on, then the lower seven bits give the number of bytes that hold the length of the string, and those bytes follow, most significant byte first. Following the length are the actual string characters. The strings are not NUL terminated.
CHOICE - a union of one or more alternatives.
The first two bytes indicate which alternative of the choice is selected. They are the hex value A080 for the first alternative, A180 for the second, etc. This is followed by the encoded data for that alternative. Finally, two NUL bytes mark the end of the choice.
SEQUENCE - an ordered collection of one or more types.
The first two bytes are a hex 3080. This header is followed by the encoded items of the sequence. For each item, the first two bytes indicate which item of the sequence is encoded. This index is the hex value A080 for the first item, A180 for the second, etc. It is followed by the encoded item and finally two NUL bytes, 0000, to indicate the end of that item. The next item in the sequence is then encoded. If an item is optional and is not defined, then none of it is encoded, including the index and NUL bytes. This is repeated until the entire sequence has been encoded. Two NUL bytes then mark the end of the sequence.
SEQUENCE OF - an ordered collection of zero or more occurrences of a given type.
The first two bytes are a hex 3080. Then the lists of objects are encoded. Two NUL bytes encode the end of the list.
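
The five encodings above are enough for a small generic reader. The sketch below is a minimal Python illustration, not the toolkit's parser: the function names read_length and parse are hypothetical, integers are decoded as two's complement (the usual ASN.1 BER convention, assumed here), and the same length rule is applied to integers and strings.

```python
def read_length(data: bytes, pos: int):
    """Read a length field at pos; return (length, position after it)."""
    first = data[pos]
    pos += 1
    if first < 0x80:                  # short form: the length itself
        return first, pos
    n = first & 0x7F                  # long form: n bytes hold the length
    return int.from_bytes(data[pos:pos + n], "big"), pos + n

def parse(data: bytes, pos: int = 0):
    """Parse one encoded item at pos; return (value, position after it)."""
    tag = data[pos]
    if tag == 0x02:                   # INTEGER
        length, pos = read_length(data, pos + 1)
        value = int.from_bytes(data[pos:pos + length], "big", signed=True)
        return value, pos + length
    if tag == 0x1A:                   # VisibleString
        length, pos = read_length(data, pos + 1)
        return data[pos:pos + length].decode("ascii"), pos + length
    if tag == 0x30 or 0xA0 <= tag <= 0xBF:
        if data[pos + 1] != 0x80:
            raise ValueError("expected 0x80 after tag at offset %d" % pos)
        pos += 2
        items = []
        while data[pos:pos + 2] != b"\x00\x00":   # two NUL bytes end the item
            item, pos = parse(data, pos)
            items.append(item)
        pos += 2
        if tag == 0x30:               # SEQUENCE or SEQUENCE OF: list of items
            return items, pos
        # A080, A180, ...: (index, wrapped value) for a field or alternative
        return (tag - 0xA0, items[0]), pos
    raise ValueError("unexpected tag 0x%02X at offset %d" % (tag, pos))
```

A SEQUENCE comes back as a list of (index, value) pairs, so a Blast-def-line with only a title and a taxid would parse to [(0, title), (2, taxid)].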

ASN.1 BLAST Header Definition

Below is the ASN.1 BLAST header definition. It was copied from two files, asn.all and fastadl.asn from the BLAST C Toolkit.

Blast-def-line-set ::= SEQUENCE OF Blast-def-line  -- all deflines for an entry

Blast-def-line ::= SEQUENCE {
    title VisibleString OPTIONAL,             -- simple title
    seqid SEQUENCE OF Seq-id OPTIONAL,        -- Regular NCBI Seq-Id
    taxid INTEGER OPTIONAL,                   -- taxonomy id
    memberships SEQUENCE OF INTEGER OPTIONAL, -- bit arrays
    links SEQUENCE OF INTEGER OPTIONAL,       -- bit arrays
    other-info SEQUENCE OF INTEGER OPTIONAL } -- future use

Seq-id ::= CHOICE {
    local Object-id,               -- local use
    gibbsq INTEGER,                -- Geninfo backbone seqid
    gibbmt INTEGER,                -- Geninfo backbone moltype
    giim Giimport-id,              -- Geninfo import id
    genbank Textseq-id,
    embl Textseq-id,
    pir Textseq-id,
    swissprot Textseq-id,
    patent Patent-seq-id,
    other Textseq-id,              -- for historical reasons, 'other' = 'refseq'
    general Dbtag,                 -- for other databases
    gi INTEGER,                    -- GenInfo Integrated Database
    ddbj Textseq-id,               -- DDBJ
    prf Textseq-id,                -- PRF SEQDB
    pdb PDB-seq-id,                -- PDB sequence
    tpg Textseq-id,                -- Third Party Annot/Seq Genbank
    tpe Textseq-id,                -- Third Party Annot/Seq EMBL
    tpd Textseq-id,                -- Third Party Annot/Seq DDBJ
    gpipe Textseq-id,              -- Internal NCBI genome pipeline processing ID
    named-annot-track Textseq-id } -- Internal named annotation tracking ID

Dbtag ::= SEQUENCE {
    db VisibleString,      -- name of database or system
    tag Object-id }        -- appropriate tag

-- Object-id can tag or name anything
Object-id ::= CHOICE {
    id INTEGER,
    str VisibleString }

Patent-seq-id ::= SEQUENCE {
    seqid INTEGER,         -- number of sequence in patent
    cit Id-pat }           -- patent citation

Textseq-id ::= SEQUENCE {
    name VisibleString OPTIONAL,
    accession VisibleString OPTIONAL,
    release VisibleString OPTIONAL,
    version INTEGER OPTIONAL }

Giimport-id ::= SEQUENCE {
    id INTEGER,                      -- the id to use here
    db VisibleString OPTIONAL,       -- dbase used in
    release VisibleString OPTIONAL } -- the release

PDB-seq-id ::= SEQUENCE {
    mol PDB-mol-id,           -- the molecule name
    chain INTEGER DEFAULT 32, -- a single ASCII character, chain id
    rel Date OPTIONAL }       -- release date, month and year

PDB-mol-id ::= VisibleString   -- name of mol, 4 chars

Id-pat ::= SEQUENCE {                  -- just to identify a patent
    country VisibleString,             -- Patent Document Country
    id CHOICE {
        number VisibleString,          -- Patent Document Number
        app-number VisibleString },    -- Patent Doc Appl Number
    doc-type VisibleString OPTIONAL }  -- Patent Doc Type

Date ::= CHOICE {
    str VisibleString,        -- for those unparsed dates
    std Date-std }            -- use this if you can

Date-std ::= SEQUENCE {             -- NOTE: this is NOT a unix tm struct
    year INTEGER,                   -- full year (including 1900)
    month INTEGER OPTIONAL,         -- month (1-12)
    day INTEGER OPTIONAL,           -- day of month (1-31)
    season VisibleString OPTIONAL,  -- for "spring", "may-june", etc
    hour INTEGER OPTIONAL,          -- hour of day (0-23)
    minute INTEGER OPTIONAL,        -- minute of hour (0-59)
    second INTEGER OPTIONAL }       -- second of minute (0-59)
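
Since a CHOICE is encoded as A080 for the first alternative, A180 for the second, and so on, the first byte of an encoded Seq-id identifies its kind by position in the definition above. A minimal Python sketch of that lookup follows; the names SEQ_ID_ALTERNATIVES and seq_id_alternative are hypothetical.

```python
# Alternatives of Seq-id, in the order they appear in the definition.
SEQ_ID_ALTERNATIVES = [
    "local", "gibbsq", "gibbmt", "giim", "genbank", "embl", "pir",
    "swissprot", "patent", "other", "general", "gi", "ddbj", "prf",
    "pdb", "tpg", "tpe", "tpd", "gpipe", "named-annot-track",
]

def seq_id_alternative(tag: int) -> str:
    """Map the first byte of an encoded Seq-id (A0, A1, ...) to its name."""
    index = tag - 0xA0
    if not 0 <= index < len(SEQ_ID_ALTERNATIVES):
        raise ValueError("not a Seq-id choice tag: 0x%02X" % tag)
    return SEQ_ID_ALTERNATIVES[index]
```

For example, a Seq-id that starts with the bytes AB80 is a gi, and its data is an INTEGER.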