Format of assembly files
========================

SHORT DESCRIPTIONS

0.  assembly.format: this file

1.  contigs.bases: fasta file for contig bases

2.  contigs.quals: fasta files for contig quality scores

3.  supercontigs: structure of supercontigs (scaffolds)

4.  reads.placed: provides the locations of reads which were placed in the
    assembly

5.  reads.unplaced: provides the names of reads which were not placed in the
    assembly, and reasons why

CONVENTIONS

(a) blank lines are ignored

(b) a "name" is a string from the alphabet "a-zA-Z0-9_.-".

LONG DESCRIPTIONS

1.  contigs.bases: fasta file for contig bases

Each contig begins with a line 
>name
where name defines the contig name.  The bases then follow (A, C, G, T).

2.  contigs.quals: fasta file for contig quality scores

The information in this file should coordinate exactly with the information
in contigs.bases, except that in place of bases, this file has scores, between
0 and 255, separated by white space.

3.  supercontigs: structure of supercontigs (scaffolds)

This file contains a description of the supercontigs (also called scaffolds).
They are ordered lists of contigs, with approximately known gaps between
them.  By assumption, all contigs in a supercontig are oriented in the
same direction (forward).

The file consists of lines, each starting with a keyword, which is supercontig,
contig, or gap:

(a) supercontig line format:
         supercontig [supercontig name]

(b) contig line format:
         contig [contig name]

(c) gap line format:
         gap [gap length] [gap length deviation] [link quality score] [link count]

Where:

    * contig name is consistent with the naming in contigs.bases and contigs.quals

    * gap length is the estimated gap length between contigs
      (negative if overlap predicted)

    * gap length deviation: estimated standard deviation for gap length value

    * link quality score: integer quality score assigned to link between contigs by
           assembly program, phred [log10] style
           e.g. 20 means that the link has a 1% chance of being wrong.

    * link count: total number of links crossing this gap

    Any of the four gap parameters can be replaced by * if unknown.

Example:

supercontig s1
contig c1
gap 200 * * 2
contig c7
gap 2235 * * 5
contig c3

supercontig s2
contig c2

supercontig s3
contig c4
gap 400 100 * *
contig c5

4.  reads.placed: provides the locations of reads which were placed in the
    assembly

This is a file with one line per read placed in the assembly.  Each line has
white-space-separated fields, as follows:

   (a) NCBI ti number for read (or *, if none known)
   (b) read name
   (c) start of trimmed read on original read
   (d) number of bases in trimmed read
   (e) orientation on contig (0 = forward, 1 = reverse)
   (f) contig name
   (g) supercontig name
   (h) approximate start of trimmed read on contig
   (i) approximate start of trimmed read on supercontig.

For c, h, and i, the first position is always 1 (not 0).  For h, the start
of a read on a contig is always the smallest position on the contig which the
read covers, regardless of its orientation.  This applies to i as well.  For i,
positions on supercontigs are measured so as to take account of gaps.

5.  reads.unplaced: provides the names of reads which were not placed in the
    assembly, and reasons why

For each read which is not placed in the assembly, reads.unplaced provides
an explanation for its exclusion.  These explanations are provided in both 
short and long forms.  Short forms are names.  For example, the short form might 
be "chimera", and the long form might be "suspected of being chimeric".  
Accordingly, reads.unplaced begins with a key which converts short forms to long 
forms.  A sample key entry would be:
     chimera: "suspected of being chimeric"
These entries may extend over multiple lines.

After the key, each remaining line has the form:
NCBI-ti-number-or-* read-name short-form-explanation