Formats for storing Hi-C pairs
.pairs
.pairs is a simple tabular format for storing DNA contacts detected in a Hi-C experiment. The detailed .pairs specification is defined by the 4DN Consortium.
The body of a .pairs file contains a table with a variable number of fields separated by a “\t” character (a horizontal tab). The .pairs specification fixes the content and the order of the first seven columns:
index | name | description |
---|---|---|
1 | read_id | the ID of the read as defined in fastq files |
2 | chrom1 | the chromosome of the alignment on side 1 |
3 | pos1 | the 1-based genomic position of the outer-most (5’) mapped bp on side 1 |
4 | chrom2 | the chromosome of the alignment on side 2 |
5 | pos2 | the 1-based genomic position of the outer-most (5’) mapped bp on side 2 |
6 | strand1 | the strand of the alignment on side 1 |
7 | strand2 | the strand of the alignment on side 2 |
A .pairs file starts with a header: an arbitrary number of lines, each starting with a “#” character. By convention, header lines have the format “#field_name: field_value”. The .pairs specification mandates a few standard header lines (e.g., column names, chromosome order and sorting order), all of which are filled in automatically by pairtools.
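The layout described above (a “#”-prefixed header followed by a tab-separated body) can be read with a few lines of Python. This is a minimal sketch, not pairtools' own reader: the function name `read_pairs` is hypothetical, and the sample text, including its header values, is made up for illustration.

```python
import io

def read_pairs(handle):
    """Split a .pairs stream into a header dict and a list of body rows."""
    header = {}
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Conventional header line: "#field_name: field_value".
            name, sep, value = line[1:].partition(":")
            # A field name may repeat, so every name maps to a list of values.
            header.setdefault(name.strip(), []).append(value.strip() if sep else "")
        else:
            fields = line.split("\t")
            # The first seven columns are fixed by the specification.
            read_id, chrom1, pos1, chrom2, pos2, strand1, strand2 = fields[:7]
            extra = fields[7:]  # anything beyond column 7 is tool-specific
            rows.append((read_id, chrom1, int(pos1), chrom2, int(pos2),
                         strand1, strand2, extra))
    return header, rows

# Synthetic sample, not the output of a real run.
example = io.StringIO(
    "#sorted: chr1-chr2-pos1-pos2\n"
    "#chromsize: chr1 1000\n"
    "#chromsize: chr2 800\n"
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2\n"
    "read1\tchr1\t10\tchr2\t20\t+\t-\n"
)
header, rows = read_pairs(example)
print(header["sorted"], rows[0][:7])
```

Keeping a list per header field is a deliberate choice here, because some fields can legitimately appear on several lines.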
The entries of a .pairs file can be flipped and sorted. “Flipping” means that sides 1 and 2 of a pair do not necessarily correspond to side 1 and side 2 in the sequencing data. Instead, side 1 is defined as the side whose alignment has the lower sorting index (using the lexicographic order for chromosome names, followed by the numeric order for positions and the lexicographic order for pair types). This particular order of flipping is called “upper-triangular flipping”, or “triu-flipping”. Finally, pairs are typically block-sorted, i.e. first lexicographically by chrom1 and chrom2, then numerically by pos1 and pos2.
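The flipping and block-sorting rules above can be sketched as follows. This is an illustration under simplifying assumptions: chromosome names are compared as plain strings, the pair type tie-break is ignored, and the helper names `flip_pair` and `block_sort_key` are hypothetical rather than part of pairtools.

```python
def flip_pair(chrom1, pos1, chrom2, pos2, strand1, strand2):
    """Return the pair with the lower-sorting side placed first."""
    # Compare by chromosome name (lexicographic), then by position (numeric).
    if (chrom2, pos2) < (chrom1, pos1):
        return (chrom2, pos2, chrom1, pos1, strand2, strand1)
    return (chrom1, pos1, chrom2, pos2, strand1, strand2)

def block_sort_key(pair):
    """Sort by chrom1 and chrom2 first, then by pos1 and pos2."""
    chrom1, pos1, chrom2, pos2 = pair[:4]
    return (chrom1, chrom2, pos1, pos2)

# Synthetic pairs in .pairs column order (without read_id).
pairs = [
    ("chr2", 500, "chr1", 100, "+", "-"),
    ("chr1", 900, "chr1", 50, "-", "+"),
    ("chr1", 10, "chr2", 30, "+", "+"),
]
flipped = sorted((flip_pair(*p) for p in pairs), key=block_sort_key)
for pair in flipped:
    print(pair)
```

Note that the strands travel with their sides when a pair is flipped, so strand1 always describes the alignment reported in chrom1 and pos1.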
Pairtools’ flavor of .pairs
.pairs files produced by pairtools extend the .pairs format in a few ways.
pairtools store null, unmapped, ambiguous (multiply mapped) and chimeric alignments (the latter unless they are parsed by parse2 or by parse with --walks-policy all) as chrom=‘!’, pos=0, strand=‘-’.
pairtools store the header of the source .sam files in the ‘#samheader:’ fields of the pairs header. When multiple .pairs files are merged, the respective ‘#samheader:’ fields are checked for consistency and merged.
Each pairtools command applied to a .pairs file leaves a record in the ‘#samheader:’ fields (using a @PG sam tag), thus preserving the full history of data processing.
pairtools append an extra column describing the type of a Hi-C pair:
index | name | description |
---|---|---|
8 | pair_type | the type of a Hi-C pair |
Pair types
pairtools use a simple two-character notation to define all possible pair types, according to the quality of alignment of the two sides. The type of a pair can be defined unambiguously using the table below. To use this table, identify which side has an alignment of a “poorer” quality (unmapped < multimapped < unique alignment) and which side has a “better” alignment and find the corresponding row in the table.
Pair type | Code | >2 alignments | Less informative alignment: Mapped | Less informative alignment: Unique | More informative alignment: Mapped | More informative alignment: Unique | Sidedness |
---|---|---|---|---|---|---|---|
walk-walk | WW | ✔ | ❌ | ❌ | ❌ | ❌ | 0 [1] |
null | NN | ❌ | ❌ | | ❌ | | 0 |
corrupt | XX | ❌ | ❌ | | ❌ | | 0 [2] |
null-multi | NM | ❌ | ❌ | | ✔ | ❌ | 0 |
null-rescued | NR | ✔ | ❌ | | ✔ | ✔ | 1 [3] |
null-unique | NU | ❌ | ❌ | | ✔ | ✔ | 1 |
multi-multi | MM | ❌ | ✔ | ❌ | ✔ | ❌ | 0 |
multi-rescued | MR | ✔ | ✔ | ❌ | ✔ | ✔ | 1 [3] |
multi-unique | MU | ❌ | ✔ | ❌ | ✔ | ✔ | 1 |
rescued-unique | RU | ✔ | ✔ | ✔ | ✔ | ✔ | 2 [3] |
unique-rescued | UR | ✔ | ✔ | ✔ | ✔ | ✔ | 2 [3] |
unique-unique | UU | ❌ | ✔ | ✔ | ✔ | ✔ | 2 |
duplicate | DD | ❌ | ✔ | ✔ | ✔ | ✔ | 2 [4] |
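The Code and Sidedness columns of the table above can be transcribed into a lookup, which is handy for a quick tally of how many pairs have zero, one or two usable sides. The sketch below only restates the table; the function name `count_by_sidedness` and the input list are hypothetical.

```python
# Sidedness per pair type code, copied from the table above.
SIDEDNESS = {
    "WW": 0, "NN": 0, "XX": 0, "NM": 0, "MM": 0,
    "NR": 1, "NU": 1, "MR": 1, "MU": 1,
    "RU": 2, "UR": 2, "UU": 2, "DD": 2,
}

def count_by_sidedness(pair_types):
    """Count pairs by the number of sides the table marks as informative."""
    counts = {0: 0, 1: 0, 2: 0}
    for code in pair_types:
        counts[SIDEDNESS[code]] += 1
    return counts

# Synthetic list of pair_type values (column 8).
counts = count_by_sidedness(["UU", "UU", "NU", "MM", "DD", "UR"])
print(counts)
```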
.pairsam
pairtools also define .pairsam, a valid extension of the .pairs format. On top of pairtools’ flavor of .pairs, the .pairsam format adds two extra columns containing the alignments from which the Hi-C pair was extracted:
index | name | description |
---|---|---|
9 | sam1 | the sam alignment(s) on side 1; separate supplemental alignments by NEXT_SAM |
10 | sam2 | the sam alignment(s) on side 2; separate supplemental alignments by NEXT_SAM |
Note that, normally, the fields of a sam alignment are separated by a horizontal tab character (\t), which we already use to separate .pairs columns. To avoid confusion, we replace the tab character in sam entries stored in sam1 and sam2 columns with a UNIT SEPARATOR character (\031).
Finally, sam1 and sam2 can store multiple .sam alignments, separated by the string ‘\031NEXT_SAM\031’.
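To recover ordinary sam lines from a sam1 or sam2 value, the two separators have to be undone in the right order: first the ‘\031NEXT_SAM\031’ string between alignments, then the single-character separator between fields. The sketch below assumes the separator is the Python string literal "\031", written exactly as in the text above; Python reads that as an octal escape, so check which byte your files actually contain before relying on it. The helper name and the sample alignments are hypothetical.

```python
# Assumption: separators are taken literally from the text above.
SAM_SEP = "\031"
INTER_SAM_SEP = SAM_SEP + "NEXT_SAM" + SAM_SEP

def split_sam_column(value):
    """Return a list of alignments, each as a list of sam fields."""
    # Split on the longer separator first, because it contains SAM_SEP.
    return [entry.split(SAM_SEP) for entry in value.split(INTER_SAM_SEP)]

# Build a synthetic sam1 value holding two truncated alignments.
first = ["read1", "0", "chr1", "100"]
second = ["read1", "2048", "chr2", "200"]
sam1 = SAM_SEP.join(first) + INTER_SAM_SEP + SAM_SEP.join(second)

alignments = split_sam_column(sam1)
# Restore the usual tab-separated sam layout.
sam_lines = ["\t".join(fields) for fields in alignments]
print(sam_lines)
```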
Extra columns
pairtools can operate on .pairs/.pairsam files with extra columns. Extra columns appear in the order in which the various tools added them. Column names can be checked in the header of the .pairs/.pairsam file. We provide pairtools header utilities for manipulating headers and verifying the compatibility of headers and their columns.
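Because the position of an extra column depends on which tools were run and in what order, it is safer to look a column up by name than by a fixed index. The sketch below assumes the column names are listed, space-separated, on a `#columns:` header line; the function name `column_index` and the sample header and row are made up for illustration.

```python
def column_index(header_lines, name):
    """Find the 0-based position of a named column from the header."""
    for line in header_lines:
        if line.startswith("#columns:"):
            names = line[len("#columns:"):].split()
            return names.index(name)  # raises ValueError if the column is absent
    raise ValueError("no #columns line found in the header")

# Synthetic header and body row with two extra columns.
header_lines = [
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2 pair_type mapq1 mapq2",
]
row = "read1\tchr1\t10\tchr2\t20\t+\t-\tUU\t60\t42".split("\t")

mapq2 = int(row[column_index(header_lines, "mapq2")])
print(mapq2)
```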
The list of additional columns used throughout pairtools modules:
extra column | generating module | format | how to add it | description |
---|---|---|---|---|
mapq1, mapq2 | parse/parse2 | number from 0 to 255 | pairtools parse --add-columns mapq | Mapping quality, as reported in .sam/.bam, $-10 log_{10}(P_{error})$ |
pos51, pos52 | parse/parse2 | genomic coordinate | pairtools parse --add-columns pos5 | 5’ position of alignment (closer to read start) |
pos31, pos32 | parse/parse2 | genomic coordinate | pairtools parse --add-columns pos3 | 3’ position of alignment (further from read start) |
cigar1, cigar2 | parse/parse2 | string | pairtools parse --add-columns cigar | CIGAR, or Compact Idiosyncratic Gapped Alignment Report of alignment, as reported in .sam/.bam |
read_len1, read_len2 | parse/parse2 | number | pairtools parse --add-columns read_len | read length |
matched_bp1, matched_bp2 | parse/parse2 | number | pairtools parse --add-columns matched_bp | number of alignment basepairs matched to the reference |
algn_ref_span1, algn_ref_span2 | parse/parse2 | number | pairtools parse --add-columns algn_ref_span | basepairs of reference covered by alignment |
algn_read_span1, algn_read_span2 | parse/parse2 | number | pairtools parse --add-columns algn_read_span | basepairs of read covered by alignment |
dist_to_51, dist_to_52 | parse/parse2 | number | pairtools parse --add-columns dist_to_5 | distance to 5’-end of read |
dist_to_31, dist_to_32 | parse/parse2 | number | pairtools parse --add-columns dist_to_3 | distance to 3’-end of read |
seq1, seq2 | parse/parse2 | string | pairtools parse --add-columns seq | sequence of alignment |
mismatches1, mismatches2 | parse/parse2 | string | pairtools parse --add-columns mismatches | comma-separated list of mismatches relative to the reference, “{ref_letter}:{mut_letter}:{phred}:{ref_position}:{read_position}” |
XB1/2, AS1/2, XS1/2 or any sam tag | parse/parse2 | | pairtools parse --add-columns XA,XB,NM | format depends on tag specification |
walk_pair_type | parse/parse2 | string | pairtools parse2 --add-pair-index | Type of the pair relative to R1 and R2 reads of paired-end sequencing, see parsing docs |
walk_pair_index | parse/parse2 | number | pairtools parse2 --add-pair-index | Order of the pair in the complex walk, starting from 5’-end of left read, see parsing docs |
phase | phase | 0, 1 or “.” | pairtools phase | Phase of alignment (haplotype 1, 2, or unphased), see phasing walkthrough |
rfrag1, rfrag2 | restrict | number | pairtools restrict | Unique index of the restriction fragment after annotating pairs positions, see restriction walkthrough |
rfrag_start1, rfrag_start2 | restrict | number | pairtools restrict | Coordinate of the start of restriction fragment |
rfrag_end1, rfrag_end2 | restrict | number | pairtools restrict | Coordinate of the end of restriction fragment |
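Two of the columns above need a little decoding before use. A mapq value Q relates to the error probability as Q = -10 log10(P_error), so P_error = 10^(-Q/10), and a mismatches value is a comma-separated list of colon-separated items. The sketch below is an illustration with hypothetical function names and synthetic values; it assumes both positions are integers and leaves phred as text, because the table does not fix its encoding.

```python
def mapq_to_error_probability(mapq):
    """Invert Q = -10 * log10(P_error)."""
    return 10 ** (-mapq / 10)

def parse_mismatches(value):
    """Parse 'ref:mut:phred:ref_position:read_position' items separated by commas."""
    mismatches = []
    if not value:
        return mismatches
    for item in value.split(","):
        ref_letter, mut_letter, phred, ref_position, read_position = item.split(":")
        mismatches.append({
            "ref_letter": ref_letter,
            "mut_letter": mut_letter,
            "phred": phred,  # kept as text (encoding not specified above)
            "ref_position": int(ref_position),    # assumed to be an integer
            "read_position": int(read_position),  # assumed to be an integer
        })
    return mismatches

p_error = mapq_to_error_probability(30)
# Synthetic mismatches value with two entries.
parsed = parse_mismatches("A:G:35:1001:12,C:T:20:1050:61")
print(p_error, parsed[0]["ref_position"], len(parsed))
```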