Formats for storing Hi-C pairs

.pairs

.pairs is a simple tabular format for storing DNA contacts detected in a Hi-C experiment. The detailed .pairs specification is defined by the 4DN Consortium.

The body of a .pairs file contains a table with a variable number of fields separated by a “\t” character (a horizontal tab). The .pairs specification fixes the content and the order of the first seven columns:

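===== ======= =======================================================================
index name    description
===== ======= =======================================================================
1     read_id the ID of the read as defined in fastq files
2     chrom1  the chromosome of the alignment on side 1
3     pos1    the 1-based genomic position of the outer-most (5’) mapped bp on side 1
4     chrom2  the chromosome of the alignment on side 2
5     pos2    the 1-based genomic position of the outer-most (5’) mapped bp on side 2
6     strand1 the strand of the alignment on side 1
7     strand2 the strand of the alignment on side 2
===== ======= =======================================================================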
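As an illustration of the column layout, the following minimal sketch splits one body line into the seven fixed fields and keeps any further columns as a list. The names Pair and parse_pairs_line are hypothetical helpers written for this example; they are not part of pairtools.

```python
from typing import List, NamedTuple


class Pair(NamedTuple):
    """One row of a .pairs body (hypothetical helper type, not a pairtools class)."""
    read_id: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    strand1: str
    strand2: str
    extra: List[str]  # any columns beyond the first seven


def parse_pairs_line(line: str) -> Pair:
    """Split a tab-separated body line into the seven fixed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 columns, got {len(fields)}")
    return Pair(
        read_id=fields[0],
        chrom1=fields[1],
        pos1=int(fields[2]),  # positions are 1-based integers
        chrom2=fields[3],
        pos2=int(fields[4]),
        strand1=fields[5],
        strand2=fields[6],
        extra=fields[7:],
    )


# Made-up example row with one extra column.
example = parse_pairs_line("read1\tchr1\t1000\tchr2\t5000\t+\t-\tUU\n")
print(example.chrom1, example.pos1, example.extra)
```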

A .pairs file starts with a header: an arbitrary number of lines, each starting with a “#” character. By convention, header lines have the format “#field_name: field_value”. The .pairs specification mandates a few standard header lines (e.g., column names, chromosome order, and sorting order), all of which are filled in automatically by pairtools.
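A minimal sketch of how the header can be separated from the body, assuming the “#field_name: field_value” convention described above. The function split_header is a hypothetical helper, and the header lines in the example are illustrative values rather than the output of a real run.

```python
from typing import Dict, Iterable, List, Tuple


def split_header(lines: Iterable[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Collect '#name: value' header fields and return them with the body lines.

    A field name may occur several times, so values are stored in lists.
    Header lines without a colon are kept verbatim under the empty key "".
    """
    header: Dict[str, List[str]] = {}
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            # Split on the first colon only: values may contain colons themselves.
            name, sep, value = line[1:].partition(":")
            if sep:
                header.setdefault(name.strip(), []).append(value.strip())
            else:
                header.setdefault("", []).append(line)
        else:
            body.append(line)
    return header, body


# Illustrative input; the field values are made up for this example.
example_lines = [
    "## pairs format v1.0\n",
    "#sorted: chr1-chr2-pos1-pos2\n",
    "#chromsize: chr1 1000\n",
    "#chromsize: chr2 800\n",
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2\n",
    "read1\tchr1\t10\tchr2\t20\t+\t-\n",
]
header, body = split_header(example_lines)
print(header["columns"][0].split())
```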

The entries of a .pairs file can be flipped and sorted. “Flipping” means that sides 1 and 2 of a pair do not necessarily correspond to side 1 and side 2 in the sequencing data. Instead, side 1 is defined as the side whose alignment has the lower sorting index (using the lexicographic order for chromosome names, followed by the numeric order for positions and the lexicographic order for pair types). This particular order of “flipping” is called “upper-triangular flipping”, or “triu-flipping”. Finally, pairs are typically block-sorted, i.e., first lexicographically by chrom1 and chrom2, then numerically by pos1 and pos2.
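The flipping and block-sorting rules can be sketched in a few lines. This is an illustration under simplifying assumptions, not the pairtools implementation: chromosomes are compared purely as strings, pair types are ignored, and each pair is a plain tuple of the seven fixed columns. The names flip_triu and block_sort_key are hypothetical.

```python
def flip_triu(pair):
    """Swap the two sides if side 2 sorts before side 1.

    pair = (read_id, chrom1, pos1, chrom2, pos2, strand1, strand2)
    Sides are compared by chromosome name (as strings), then by position.
    """
    read_id, c1, p1, c2, p2, s1, s2 = pair
    if (c1, p1) > (c2, p2):
        return (read_id, c2, p2, c1, p1, s2, s1)
    return pair


def block_sort_key(pair):
    """Sort key: chrom1 and chrom2 first, then pos1 and pos2."""
    _, c1, p1, c2, p2, _, _ = pair
    return (c1, c2, p1, p2)


pairs = [
    ("r1", "chr2", 10, "chr1", 500, "+", "-"),
    ("r2", "chr1", 300, "chr1", 100, "-", "+"),
    ("r3", "chr1", 50, "chr2", 5, "+", "+"),
]
flipped = [flip_triu(p) for p in pairs]
sorted_pairs = sorted(flipped, key=block_sort_key)
for p in sorted_pairs:
    print("\t".join(str(x) for x in p))
```

Note that a purely lexicographic comparison places “chr10” before “chr2”; this follows from comparing chromosome names as strings.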

Pairtools’ flavor of .pairs

.pairs files produced by pairtools extend the .pairs format in a few ways.

  1. pairtools store null, unmapped, ambiguous (multiply mapped) and chimeric alignments (the latter unless parsed by parse2 or by parse with --walks-policy all) as chrom=‘!’, pos=0, strand=‘-’.

  2. pairtools store the header of the source .sam files in the ‘#samheader:’ fields of the pairs header. When multiple .pairs files are merged, the respective ‘#samheader:’ fields are checked for consistency and merged.

  3. Each pairtools command applied to a .pairs file leaves a record in the ‘#samheader:’ fields (using a @PG sam tag), thus preserving the full history of data processing.

  4. pairtools append an extra column describing the type of a Hi-C pair:

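===== ========= =======================
index name      description
===== ========= =======================
8     pair_type the type of a Hi-C pair
===== ========= =======================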
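A short sketch of how the placeholder values from item 1 and the pair_type column from item 4 can be read back from a row. The helper names are hypothetical and the example row is made up. Because ‘-’ is also a valid strand for real alignments, the check below relies on the chromosome and position only.

```python
def side_has_position(chrom: str, pos: int) -> bool:
    """Return False for the placeholder side (chrom '!' and pos 0)."""
    return not (chrom == "!" and pos == 0)


def count_positioned_sides(line: str) -> int:
    """Count how many of the two sides carry a real genomic position."""
    fields = line.rstrip("\n").split("\t")
    side1 = side_has_position(fields[1], int(fields[2]))
    side2 = side_has_position(fields[3], int(fields[4]))
    return int(side1) + int(side2)


def get_pair_type(line: str) -> str:
    """Return the eighth column, which holds the pair type in pairtools output."""
    return line.rstrip("\n").split("\t")[7]


# Made-up row: side 1 is the placeholder, side 2 is a regular alignment.
row = "read7\t!\t0\tchr1\t1200\t-\t+\tNU"
print(count_positioned_sides(row), get_pair_type(row))
```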

Pair types

pairtools use a simple two-character notation to define all possible pair types, according to the quality of alignment of the two sides. The type of a pair can be defined unambiguously using the table below. To use this table, identify which side has an alignment of a “poorer” quality (unmapped < multimapped < unique alignment) and which side has a “better” alignment and find the corresponding row in the table.

============== ==== =========
Pair type      Code Sidedness
============== ==== =========
walk-walk      WW   0 [1]
null           NN   0
corrupt        XX   0 [2]
null-multi     NM   0
null-rescued   NR   1 [3]
null-unique    NU   1
multi-multi    MM   0
multi-rescued  MR   1 [3]
multi-unique   MU   1
rescued-unique RU   2 [3]
unique-rescued UR   2 [3]
unique-unique  UU   2
duplicate      DD   2 [4]
============== ==== =========

The per-side alignment columns of this table (less informative alignment: >2 alignments, mapped, unique; more informative alignment: mapped, unique) are not reproduced here.
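The codes and sidedness values listed above can be kept in a small lookup table, for example to summarize how many pairs have zero, one or two informative sides. This is a sketch: the dictionary is transcribed from the table, while SIDEDNESS and count_by_sidedness are hypothetical names rather than pairtools objects.

```python
from collections import Counter
from typing import Iterable

# Sidedness per pair type code, transcribed from the table above.
SIDEDNESS = {
    "WW": 0, "NN": 0, "XX": 0, "NM": 0, "MM": 0,
    "NR": 1, "NU": 1, "MR": 1, "MU": 1,
    "RU": 2, "UR": 2, "UU": 2, "DD": 2,
}


def count_by_sidedness(pair_types: Iterable[str]) -> Counter:
    """Tally pairs by sidedness; unknown codes raise KeyError."""
    return Counter(SIDEDNESS[code] for code in pair_types)


# Made-up list of pair type codes, as read from column 8.
observed = ["UU", "UU", "NU", "MM", "DD", "UR", "NN"]
summary = count_by_sidedness(observed)
print(dict(summary))
```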

.pairsam

pairtools also define .pairsam, a valid extension of the .pairs format. On top of the pairtools’ flavor of .pairs, .pairsam format adds two extra columns containing the alignments from which the Hi-C pair was extracted:

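===== ==== ============================================================================
index name description
===== ==== ============================================================================
9     sam1 the sam alignment(s) on side 1; separate supplemental alignments by NEXT_SAM
10    sam2 the sam alignment(s) on side 2; separate supplemental alignments by NEXT_SAM
===== ==== ============================================================================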

Note that, normally, the fields of a sam alignment are separated by a horizontal tab character (\t), which we already use to separate .pairs columns. To avoid confusion, we replace the tab character in sam entries stored in sam1 and sam2 columns with a UNIT SEPARATOR character (\031).

Finally, sam1 and sam2 can store multiple .sam alignments, separated by the string ‘\031NEXT_SAM\031’.
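The encoding of the sam1 and sam2 columns can be sketched as a pair of helper functions. This is an illustration, not the pairtools code, and the function names are hypothetical. It assumes the separator is the Python string literal "\031" exactly as written in the text. In Python this is an octal escape denoting character code 25, whereas the ASCII unit separator has code 31 ("\x1f"), so it is worth checking which byte your files actually contain; the separator is therefore a parameter.

```python
from typing import List

SAM_SEP = "\031"  # assumption: the literal as written in the text (octal, i.e. chr(25))


def encode_sam_column(sam_lines: List[str], sep: str = SAM_SEP) -> str:
    """Pack tab-separated sam records into a single .pairsam cell."""
    entry_sep = sep + "NEXT_SAM" + sep
    return entry_sep.join(line.replace("\t", sep) for line in sam_lines)


def decode_sam_column(value: str, sep: str = SAM_SEP) -> List[List[str]]:
    """Unpack a .pairsam cell into a list of sam records (lists of fields)."""
    entry_sep = sep + "NEXT_SAM" + sep
    # Split into records first, then split each record into its sam fields.
    return [entry.split(sep) for entry in value.split(entry_sep)]


# Shortened, made-up sam records (a real record has at least 11 fields).
records = [
    "read1\t0\tchr1\t100\t60\t50M",
    "read1\t2048\tchr2\t700\t60\t20M30S",
]
cell = encode_sam_column(records)
decoded = decode_sam_column(cell)
print(len(decoded), decoded[1][2])
```

Because the packed cell contains no tab characters, it can be stored as a single column of the tab-separated .pairsam body.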

Extra columns

pairtools can operate on .pairs/.pairsam files with extra columns. Extra columns appear in the order in which they were added by the various tools. Column names can be checked in the header of a .pairs/.pairsam file. We provide pairtools header utilities for manipulating headers and verifying the compatibility of headers and their columns.

The list of additional columns used throughout pairtools modules:

================================ ================= ==================== ============================================ ============================================================
extra column                     generating module format               how to add it                                description
================================ ================= ==================== ============================================ ============================================================
mapq1, mapq2                     parse/parse2      number from 0 to 255 pairtools parse --add-columns mapq           Mapping quality, as reported in .sam/.bam, $-10 \log_{10}(P_{error})$
pos51, pos52                     parse/parse2      genomic coordinate   pairtools parse --add-columns pos5           5’ position of alignment (closer to read start)
pos31, pos32                     parse/parse2      genomic coordinate   pairtools parse --add-columns pos3           3’ position of alignment (further from read start)
cigar1, cigar2                   parse/parse2      string               pairtools parse --add-columns cigar          CIGAR, or Compact Idiosyncratic Gapped Alignment Report of alignment, as reported in .sam/.bam
read_len1, read_len2             parse/parse2      number               pairtools parse --add-columns read_len       read length
matched_bp1, matched_bp2         parse/parse2      number               pairtools parse --add-columns matched_bp     number of alignment basepairs matched to the reference
algn_ref_span1, algn_ref_span2   parse/parse2      number               pairtools parse --add-columns algn_ref_span  basepairs of reference covered by alignment
algn_read_span1, algn_read_span2 parse/parse2      number               pairtools parse --add-columns algn_read_span basepairs of read covered by alignment
dist_to_51, dist_to_52           parse/parse2      number               pairtools parse --add-columns dist_to_5      distance to 5’-end of read
dist_to_31, dist_to_32           parse/parse2      number               pairtools parse --add-columns dist_to_3      distance to 3’-end of read
seq1, seq2                       parse/parse2      string               pairtools parse --add-columns seq            sequence of alignment
mismatches1, mismatches2         parse/parse2      string               pairtools parse --add-columns mismatches     comma-separated list of mismatches relative to the reference, “{ref_letter}:{mut_letter}:{phred}:{ref_position}:{read_position}”
XB1/2,AS1/2,XS1/2 or any sam tag parse/parse2                           pairtools parse --add-columns XA,XB,NM       format depends on tag specification
walk_pair_type                   parse/parse2      string               pairtools parse2 --add-pair-index            Type of the pair relative to R1 and R2 reads of paired-end sequencing, see parsing docs
walk_pair_index                  parse/parse2      number               pairtools parse2 --add-pair-index            Order of the pair in the complex walk, starting from 5’-end of left read, see parsing docs
phase                            phase             0, 1 or “.”          pairtools phase                              Phase of alignment (haplotype 1, 2, or unphased), see phasing walkthrough
rfrag1, rfrag2                   restrict          number               pairtools restrict                           Unique index of the restriction fragment after annotating pairs positions, see restriction walkthrough
rfrag_start1, rfrag_start2       restrict          number               pairtools restrict                           Coordinate of the start of restriction fragment
rfrag_end1, rfrag_end2           restrict          number               pairtools restrict                           Coordinate of the end of restriction fragment
================================ ================= ==================== ============================================ ============================================================
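Since extra columns are identified by name in the header, a script can look them up there instead of relying on fixed positions. The sketch below assumes a “#columns:” header line with space-separated column names and files that already contain mapq1 and mapq2. The function names and example rows are hypothetical. The second helper inverts the MAPQ definition given in the table, $P_{error} = 10^{-MAPQ/10}$.

```python
from typing import Iterable, List


def mapq_to_error_prob(mapq: int) -> float:
    """Invert MAPQ = -10 * log10(P_error)."""
    return 10 ** (-mapq / 10)


def filter_by_min_mapq(lines: Iterable[str], min_mapq: int) -> List[str]:
    """Keep header lines and the rows whose mapq1 and mapq2 reach min_mapq."""
    columns = None
    kept = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            if line.startswith("#columns:"):
                # Column names are assumed to be separated by spaces.
                columns = line[len("#columns:"):].split()
            kept.append(line)
            continue
        if columns is None:
            raise ValueError("no #columns: header line found before the body")
        row = dict(zip(columns, line.split("\t")))
        if int(row["mapq1"]) >= min_mapq and int(row["mapq2"]) >= min_mapq:
            kept.append(line)
    return kept


# Made-up input with two extra columns appended after pair_type.
example = [
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type mapq1 mapq2",
    "r1\tchr1\t100\tchr1\t900\t+\t-\tUU\t60\t60",
    "r2\tchr1\t200\tchr2\t300\t+\t+\tUU\t60\t3",
]
kept = filter_by_min_mapq(example, min_mapq=30)
print(len(kept) - 1, "row(s) kept; error probability at MAPQ 30:", mapq_to_error_prob(30))
```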