Overview¶
pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment. pairtools perform various operations on Hi-C pairs and occupy the middle position in a typical Hi-C data processing pipeline:

In a typical Hi-C pipeline, DNA sequences (reads) are aligned to the reference genome, converted into ligation junctions and binned, thus producing a Hi-C contact map.
pairtools aim to be an all-in-one tool for processing Hi-C pairs, and can perform following operations:
- detect ligation junctions (a.k.a. Hi-C pairs) in aligned paired-end sequences of Hi-C DNA molecules
- sort .pairs files for downstream analyses
- detect, tag and remove PCR/optical duplicates
- generate extensive statistics of Hi-C datasets
- select Hi-C pairs given flexibly defined criteria
- restore .sam alignments from Hi-C pairs
pairtools produce .pairs files compliant with the 4DN standard.
pairtools uses a two-character notation to define pair types (see table _section-pair-types)
The full list of available pairtools:
Pairtool | Description |
---|---|
dedup | Find and remove PCR/optical duplicates. |
filterbycov | Remove pairs from regions of high coverage. |
flip | Flip pairs to get an upper-triangular matrix. |
markasdup | Tag pairs as duplicates. |
merge | Merge sorted .pairs/.pairsam files. |
parse | Find ligation junctions in .sam, make .pairs. |
phase | Phase pairs mapped to a diploid genome. |
restrict | Assign restriction fragments to pairs. |
select | Select pairs according to some condition. |
sort | Sort a .pairs/.pairsam file. |
split | Split a .pairsam file into .pairs and .sam. |
stats | Calculate pairs statistics. |
Contents:
Quickstart¶
Install pairtools and all of its dependencies using the conda package manager and the bioconda channel for bioinformatics software.
$ conda install -c conda-forge -c bioconda pairtools
Setup a new test folder and download a small Hi-C dataset mapped to sacCer3 genome:
$ mkdir /tmp/test-pairtools
$ cd /tmp/test-pairtools
$ wget https://github.com/open2c/distiller-test-data/raw/master/bam/MATalpha_R1.bam
Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:
$ wget https://raw.githubusercontent.com/open2c/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes
With pairtools parse, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:
$ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam
Inspect the resulting table:
$ less MATalpha_R1.pairs.gz
Installation¶
Requirements¶
- Python 3.x
- Python packages numpy and click
- Command-line utilities sort (the Unix version), bgzip (shipped with samtools) and samtools. If available, pairtools can compress outputs with bgzip, pbgzip and lz4.
Install using conda¶
We highly recommend using the conda package manager to install pre-compiled pairtools together with all its dependencies. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.
With conda, you can install pre-compiled pairtools and all of its dependencies from the bioconda channel:
$ conda install -c conda-forge -c bioconda pairtools
Install using pip¶
Alternatively, compile and install pairtools and its Python dependencies from PyPI using pip:
$ pip install pairtools
Install the development version¶
Finally, you can install the latest development version of pairtools from github. First, make a local clone of the github repository:
$ git clone https://github.com/open2c/pairtools
Then, you can compile and install pairtools in the development mode, which installs the package without moving it to a system folder and thus allows immediate live-testing any changes in the python code. Please, make sure that you have cython installed!
$ cd pairtools
$ pip install -e ./
Parsing sequence alignments into Hi-C pairs¶
Overview¶
Hi-C experiments aim to measure the frequencies of contacts between all pairs of loci in the genome. In these experiments, the spacial structure of chromosomes is first fixed with formaldehyde crosslinks, after which DNA is partially digested with restriction enzymes and then re-ligated back. Then, DNA is shredded into smaller pieces, released from the nucleus, sequenced and aligned to the reference genome. The resulting sequence alignments reveal if DNA molecules were formed through ligations between DNA from different locations in the genome. These ligation events imply that ligated loci were close to each other when the ligation enzyme was active, i.e. they formed “a contact”.
pairtools parse
detects ligation events in the aligned sequences of
DNA molecules formed in Hi-C experiments and reports them in the .pairs/.pairsam
format.
Terminology¶
Throughout this document we will be using the same visual language to describe how DNA sequences (in the .fastq format) are transformed into sequence alignments (.sam/.bam) and into ligation events (.pairs).
Short-read sequencing determines the sequences of the both ends (or, sides) of DNA molecules (typically 50-300 bp), producing read pairs in .fastq format (shown in the first row on the figure above). In such reads, base pairs are reported from the tips inwards, which is also defined as the 5’->3’ direction (in accordance of the 5’->3’ direction of the DNA strand that sequence of the corresponding side of the read).
Alignment software maps both reads of a pair to the reference genome, producing alignments, i.e. segments of the reference genome with matching sequences. Typically, if the read length is not very large (< 150 bp), there will be only two alignments per read pair, one on each side. But, sometimes, the parts of one or both sides may map to different locations on the genome, producing more than two alignments per DNA molecule (see Multiple ligations (walks)).
pairtools parse
converts alignments into ligation events (aka
Hi-C pairs aka pairs). In the simplest case, when each side has only one
unique alignment (i.e. the whole side maps to a single unique segment of the
genome), for each side, we report the chromosome, the genomic position of the
outer-most (5’) aligned base pair and the strand of the reference genome that
the read aligns to. pairtools parse
assigns to such pairs the type UU
(unique-unique).
Unmapped/multimapped reads¶
Sometimes, one side or both sides of a read pair may not align to the reference genome:
In this case, pairtools parse
fills in the chromosome of the corresponding
side of Hi-C pair with !
, the position with 0
and the strand with -
.
Such pairs are reported as type NU
(null-unique, when the other side has
a unique alignment) or NN
(null-null, when both sides lack any alignment).
Similarly, when one or both sides map to many genome locations equally well (i.e.
have non-unique, or, multi-mapping alignments), pairtools parse
reports
the corresponding sides as (chromosome= !
, position= 0
, strand= -
) and
type MU
(multi-unique) or MM
(multi-multi) or NM
(null-multi),
depending on the type of the alignment on the other side.
pairtools parse
calls an alignment to be multi-mapping when its
MAPQ score
(which depends on the scoring gap between the two best candidate alignments for a segment)
is equal or greater than the value specied with the --min-mapq
flag (by default, 1).
Multiple ligations (walks)¶
If the read is long enough (e.g. larger than 150 bp), it may contain more than two alignments:
Molecules like these typically form via multiple ligation events and we call them
walks [1]. The mode of walks reporting is controlled by --walks-policy
parameter of pairtools parse
.
You can report all the alignments in the reads by using pairtools parse2
(see parse2).
A pair of sequential alignments on a single read is ligation junction. Ligation junctions are the Hi-C contacts
that have been directly observed in the experiment. However, traditional Hi-C pairs do not have direct evidence of ligation
because they arise from read pairs that do not necessarily contain ligation junction.
To filter out the molecules with complex walks, --walks-policy
can be set to:
mask
to tag these molecules as typeWW
(single ligations are rescued, see Rescuing single ligations),5any
to report the 5’-most alignment on each side,5unique
to report the 5’-most unique alignment on each side,3any
to report the 3’-most alignment on each side,3unique
to report the 3’-most unique alignment on each side,all
to report all sequential alignments (complex ligations are rescued, see Rescuing complex walks).
Parse modes for walks:
Rescuing single ligations¶
Importantly, some of DNA molecules containing only one ligation junction may still end up with three alignments:
A molecule formed via a single ligation gets three alignments when one of the two ligated DNA pieces is shorter than the read length, such that that read on the corresponding side sequences through the ligation junction and into the other piece [2]. The amount of such molecules depends on the type of the restriction enzyme, the typical size of DNA molecules in the Hi-C library and the read length, and sometimes can be considerable.
pairtools parse
detects such molecules and rescues them (i.e.
changes their type from a walk to a single-ligation molecule). It tests
walks with three aligments using three criteria:

The three criteria used to “rescue” three-alignment walks: cis, point towards each other, short distance
- On the side with two alignments (the chimeric side), the “inner” (or, 3’) alignment must be on the same chromosome as the alignment on the non-chimeric side.
- The “inner” alignment on the chimeric side and the alignment on the non-chimeric side must point toward each other.
- These two alignments must be within the distance specified with the
--max-molecule-size
flag (by default, 2000bp).
Sometimes, the “inner” alignment on the chimeric side can be non-unique or “null”
(i.e. when the unmapped segment is longer than --max-inter-align-gap
,
as described in Interpreting gaps between alignments). pairtools parse
ignores such alignments
altogether and thus rescues such walks as well.
Interpreting gaps between alignments¶
Reads that are only partially aligned to the genome can be interpreted in two different ways. One possibility is to assume that this molecule was formed via at least two ligations (i.e. it’s a walk) but the non-aligned part (a gap) was missing from the reference genome for one reason or another. Another possibility is to simply ignore this gap (for example, because it could be an insertion or a technical artifact), thus assuming that our molecule was formed via a single ligation and has to be reported:

A gap between alignments can interpeted as a legitimate segment without an alignment or simply ignored
Both options have their merits, depending on a dataset, quality of the reference
genome and sequencing. pairtools parse
ignores shorter gaps and keeps
longer ones as “null” alignments. The maximal size of ignored gaps is set by
the --max-inter-align-gap
flag (by default, 20bp).
Rescuing complex walks¶
We call the multi-fragment DNA molecule that is formed during Hi-C (or any other chromosome capture with sequencing) a walk. If the reads are long enough, the right (reverse) read might read through the left (forward) read. Thus, left read might span multiple ligation junctions of the right read. The pairs of contacts that overlap between left and right reads are intermolecular duplicates that should be removed.
If the walk has no more than two different fragments at one side of the read, this can be rescued with simple
pairtools parse --walks-policy mask
. However, in complex walks (two fragments on both reads or more than two fragments on any side)
you need specialized functionality that will report all the deduplicated pairs in the complex walks.
This is especially relevant if you have the reads length > 100 bp, since more than 20% or all restriction fragments in the genome are then shorter than the read length.
We put together some statistics about number of short restriction fragments for DpnII enzyme:
Genome | #rfrags <50 bp | <100 bp | <150 bp | <175 bp | <200 bp |
hg38 | 828538 (11.5%) | 1452918 (20.2%) | 2121479 (29.5%) | 2587250 (35.9%) | 2992757 (41.6%) |
mm10 | 863614 (12.9%) | 1554461 (23.3%) | 2236609 (33.5%) | 2526150 (37.9%) | 2780769 (41.7%) |
dm3 | 65327 (19.6%) | 108370 (32.5%) | 142662 (42.8%) | 156886 (47.1%) | 169339 (50.9%) |
Consider the read with overlapping left and right sides:
pairtools
can detect such molecules and parse them.
Briefly, we detects all the unique ligation junctions, and do not report the same junction as a pair multiple times.
To parse complex walks, you may use pairtools parse --walks-policy all
and parse2
, which have slightly different functionalities.
pairtools parse --walks-policy all
is used with regular paired-end Hi-C, when you want
all pairs in the walk to be reported as if they appeared in the sequencing data independently.
parse2
is used with single-end data or when you want to customize your reporting (orientation, position of alignments, or perform combinatorial expansion).
For example, parse2
defaults to reporting ligation junctions instead of outer ends of the alignments.
The complete guide through the reporting options of parse2
, orientation:
position:
Sometimes it is important to restore the sequence of ligation events (e.g., for MC-3C data). For that, you can add
special columns walk_pair_index
and walk_pair_type
by setting --add-pair-index
option of parse2
, that will keep the order and type of pair in the whole walk in the output .pairs file.
walk_pair_index
contains information on the order of the pair in the complex walk, starting from 5’-end of left readwalk_pair_type
describes the type of the pair relative to R1 and R2 reads of paired-end sequencing:- “R1-2” - unconfirmed pair, right and left alignments in the pair originate from different reads (left or right). This might be indirect ligation (mediated by other DNA fragments).
- “R1” - pair originates from the left read. This is direct ligation.
- “R2” - pair originated from the right read. Direct ligation.
- “R1&2” - pair was sequenced at both left and right read. Direct ligation.
With this information, the whole sequence of ligation events can be restored from the .pair file.
Combinatorial expansion is a way to increase the number of contacts in you data, which assumes that all DNA fragments
in the same molecule (read) are in contact. Use --expand
parameter for combinatorial expansion.
Note that expanded pairs have modified pair type, “E{separation}_{pair type}”, e.g.:
- “E1_R1” is a pair obtained by combining left alignment of some pair in R1 read and right alignment of the next pair in R1 sequence of the same read.
- “E2_R1” is a pair obtained by combining left alignment of some pair in R1 read and right alignment of the pair separated by 2 alignments in R1 sequence of the same read.
- “E2_R1&2” as above, both source pairs were sequenced on both R1 and R2.
- “E4_R1-2” is a pair obtained by combining left alignment of some pair in R1 read and right alignment of some pair in R1 sequence, separated by at least 4 alignments in between.
Note that “-” in the pair type means that pair is separated by unsequenced gap, which may contain other pairs.
[1] | Following the lead of C-walks |
[2] | This procedure was first introduced in HiC-Pro and the in Juicer . |
Sorting pairs¶
In order to enable efficient random access to Hi-C pairs, we flip and sort pairs. After sorting, interactions become arranged in the order of their genomic position, such that, for any given pair of regions, we easily find and extract all of their interactions. And, after flipping, all artificially duplicated molecules (either during PCR or in optical sequencing) end up in adjacent rows in sorted lists of interactions, such that we can easily identify and remove them.
Sorting¶
pairtools sort
arrange pairs in the order of (chrom1, chrom2, pos1, pos2).
This order is also known as block sorting, because all pairs between
any given pair of chromosomes become grouped into one continuous block.
Additionally, pairtools sort
also sorts pairs with identical positions by
pair_type. This does not really do much for mapped reads, but it nicely splits
unmapped reads into blocks of null-mapped and multi-mapped reads.
We note that there is an alternative to block sorting, called row sorting, where pairs are sorted by (chrom1, pos1, chrom2, pos2). In pairtools sort, we prefer block-sorting since it cleanly separates cis interactions from trans ones and thus is a more optimal solution for typical use cases.
Flipping¶
In a typical paired-end experiment, side1 and side2 of a DNA molecule are defined by the order in which they got sequenced. Since this order is essentially random, any given Hi-C pair, e.g. (chr1, 1.1Mb; chr2, 2.1Mb), may appear in a reversed orientation, i.e. (chr2, 2.1Mb; chr1, 1.1Mb). If we were to preserve this order of sides, interactions between same loci would appear in two different locations of the sorted pair list, which would complicate finding PCR/optical duplicates.
To ensure that Hi-C pairs with similar coordinates end up in the same location of the sorted list, we flip pairs, i.e. we choose side1 as the side with the lowest genomic coordinate. Thus, after flipping, for trans pairs (chrom1!=chrom2), order(chrom1)<order(chrom2); and for cis pairs (chrom1==chrom2), pos1<=pos2. In a matrix representation, flipping is equal to reflecting the lower triangle of the Hi-C matrix onto its upper triangle, such that the resulting matrix is upper-triangular.
In pairtools, flipping is done during parsing - that’s why pairtools parse
requires a .chromsizes file that specifies the order of chromosomes for flipping.
Importantly, pairtools parse
also flips one-sided pairs such that
side1 is always unmapped; and unmapped pairs such that side1 always has a “poorer”
mapping type (i.e. null-mapping<multi-mapping).
Chromosomal order for sorting and flipping¶
Importantly, the order of chromosomes for sorting and flipping can be different.
Specifically, pairtools sort
uses the lexicographic order for chromosomes
(chr1, chr10, chr11, …, chr2, chr21,…) instead of the “natural” order
(chr1, chr2, chr3, …); at the same time, flipping is done in
pairsamtools parse
using the chromosomal order specified by the user.
pairtools sort
uses the lexicographic order for sorting chromosomes.
This order is used universally to sorting strings in all languages and tools [1],
which makes it easy to design tools for indexing and searching in sorted pair lists.
At the same time, pairtools parse
uses a custom user-provided chromosomal
order to flip pairs. For performance considerations, for flipping, we recommend
ordering chromosomes in a way that will be used in plotting contact maps.
[1] | Unfortunately, many existing genomes use rather unconventional choices in chromosomal naming schemes. For example, in sacCer3, chromosomes are enumerated with Roman numerals; in dm3, big autosomes are split 4 different contigs each. Thus, it is impossible to design a universal algorithm that would order chromosomes in a “meaningful” way for all existing genomes. |
Formats for storing Hi-C pairs¶
.pairs¶
.pairs is a simple tabular format for storing DNA contacts detected in a Hi-C experiment. The detailed .pairs specification is defined by the 4DN Consortium.
The body of a .pairs contains a table with a variable number of fields separated by a “\t” character (a horizontal tab). The .pairs specification fixes the content and the order of the first seven columns:
index | name | description |
---|---|---|
1 | read_id | the ID of the read as defined in fastq files |
2 | chrom1 | the chromosome of the alignment on side 1 |
3 | pos1 | the 1-based genomic position of the outer-most (5’) mapped bp on side 1 |
4 | chrom2 | the chromosome of the alignment on side 2 |
5 | pos2 | the 1-based genomic position of the outer-most (5’) mapped bp on side 2 |
6 | strand1 | the strand of the alignment on side 1 |
7 | strand2 | the strand of the alignment on side 2 |
A .pairs file starts with a header, an arbitrary number of lines starting with a “#” character. By convention, the header lines have a format of “#field_name: field_value”. The .pairs specification mandates a few standard header lines (e.g., column names, chromosome order, sorting order, etc), all of which are automatically filled in by pairtools.
The entries of a .pairs file can be flipped and sorted. “Flipping” means that the sides 1 and 2 do not correspond to side1 and side2 in sequencing data. Instead, side1 is defined as the side with the alignment with a lower sorting index (using the lexographic order for chromosome names, followed by the numeric order for positions and the lexicographic order for pair types). This particular order of “flipping” is defined as “upper-triangular flipping”, or “triu-flipping”. Finally, pairs are typically block-sorted: i.e. first lexicographically by chrom1 and chrom2, then numerically by pos1 and pos2.
Pairtools’ flavor of .pairs¶
.pairs files produced by pairtools extend .pairs format in a few ways.
- pairtools store null, unmapped, ambiguous (multiply mapped) and chimeric (if not parsed by parse2 or –walks-policy all of parse) alignments as chrom=’!’, pos=0, strand=’-‘.
- pairtools store the header of the source .sam files in the ‘#samheader:’ fields of the pairs header. When multiple .pairs files are merged, the respective ‘#samheader:’ fields are checked for consistency and merged.
- Each pairtool applied to .pairs leaves a record in the ‘#samheader’ fields (using a @PG sam tag), thus preserving the full history of data processing.
- pairtools append an extra column describing the type of a Hi-C pair:
index | name | description |
---|---|---|
8 | pair_type | the type of a Hi-C pair |
Pair types¶
pairtools use a simple two-character notation to define all possible pair types, according to the quality of alignment of the two sides. The type of a pair can be defined unambiguously using the table below. To use this table, identify which side has an alignment of a “poorer” quality (unmapped < multimapped < unique alignment) and which side has a “better” alignment and find the corresponding row in the table.
. | . | . | Less informative alignment | More informative alignment | . | ||
Pair type | Code | >2 alignments | Mapped | Unique | Mapped | Unique | Sidedness |
walk-walk | WW | ✔ | ❌ | ❌ | ❌ | ❌ | 0 [1] |
null | NN | ❌ | ❌ | ❌ | 0 | ||
corrupt | XX | ❌ | ❌ | ❌ | 0 [2]_ | ||
null-multi | NM | ❌ | ❌ | ✔ | ❌ | 0 | |
null-rescued | NR | ✔ | ❌ | ✔ | ✔ | 1 [3] | |
null-unique | NU | ❌ | ❌ | ✔ | ✔ | 1 | |
multi-multi | MM | ❌ | ✔ | ❌ | ✔ | ❌ | 0 |
multi-rescued | MR | ✔ | ✔ | ❌ | ✔ | ✔ | 1 [3] |
multi-unique | MU | ❌ | ✔ | ❌ | ✔ | ✔ | 1 |
rescued-unique | RU | ✔ | ✔ | ✔ | ✔ | ✔ | 2 [3] |
unique-rescued | UR | ✔ | ✔ | ✔ | ✔ | ✔ | 2 [3] |
unique-unique | UU | ❌ | ✔ | ✔ | ✔ | ✔ | 2 |
duplicate | DD | ❌ | ✔ | ✔ | ✔ | ✔ | 2 [4]_ |
[1] | “walks”, or, C-walks are Hi-C molecules formed via multiple ligation events which cannot be reported as a single pair. |
[2] | “corrupt” pairs are those with technical issues - e.g. missing a FASTQ sequence/SAM entry from one side of the molecule. |
[2] | “rescued” pairs have two non-overlapping alignments on one of the sides (referred below as the chimeric side/read), but the inner (3’-) one extends the only alignment on the other side (referred as the non-chimeric side/read). Such pairs form when one of the two ligated DNA fragments is shorter than the read length. In this case, one of the reads contains this short fragment entirely, together with the ligation junction and a chunk of the other DNA fragment (thus, this read ends up having two non-overlapping alignments). Following the procedure introduced in HiC-Pro and Juicer, pairtools parse rescues such Hi-C molecules, reports the position of the 5’ alignment on the chimeric side, and tags them as “NU”, “MU”, “UR” or “RU” pair type, depending on the type of the 5’ alignment on the chimeric side. Such molecules can and should be used in downstream analysis. Read more on the rescue procedure in the section on parsing. |
[3] | (1, 2, 3, 4) pairtools dedup detects molecules that could be formed via PCR duplication and tags them as “DD” pair type. These pairs should be excluded from downstream analyses. |
.pairsam¶
pairtools also define .pairsam, a valid extension of the .pairs format. On top of the pairtools’ flavor of .pairs, .pairsam format adds two extra columns containing the alignments from which the Hi-C pair was extracted:
index | name | description |
---|---|---|
9 | sam1 | the sam alignment(s) on side 1; separate supplemental alignments by NEXT_SAM |
10 | sam2 | the sam alignment(s) on side 2; separate supplemental alignments by NEXT_SAM |
Note that, normally, the fields of a sam alignment are separated by a horizontal tab character (\t), which we already use to separate .pairs columns. To avoid confusion, we replace the tab character in sam entries stored in sam1 and sam2 columns with a UNIT SEPARATOR character (\031).
Finally, sam1 and sam2 can store multiple .sam alignments, separated by a string ‘\031NEXT_SAM\031’
Extra columns¶
pairtools can operate on .pairs/.pairsam with extra columns. Extra columns are specified in the order defined by the order their addition by various tools. Column names can be checked in the header of .pairs/.pairsam file. We provide pairtools header utilities for manipulating and verifying compatibility of headers and their columns.
The list of additional columns used throughout pairtools modules:
extra column | generating module | format | how to add it | description |
---|---|---|---|---|
mapq1, mapq2 | parse/parse2 | number from 0 to 255 | pairtools parse –add-columns mapq | Mapping quality, as reported in .sam/.bam, $-10 log_{10}(P_{error})$ |
pos51, pos52 | parse/parse2 | genomic coordinate | pairtools parse –add-columns pos5 | 5’ position of alignment (closer to read start) |
pos31, pos32 | parse/parse2 | genomic coordinate | pairtools parse –add-columns pos3 | 3’ position of alignment (further from read start) |
cigar1, cigar2 | parse/parse2 | string | pairtools parse –add-columns cigar | CIGAR, or Compact Idiosyncratic Gapped Alignment Report of alignment, as reported in .sam/.bam |
read_len1, read_len2 | parse/parse2 | number | pairtools parse –add-columns read_len | read length |
matched_bp1, matched_bp2 | parse/parse2 | number | pairtools parse –add-columns matched_bp | number of matched alignment basepairs to the reference |
algn_ref_span1, algn_ref_span2 | parse/parse2 | number | pairtools parse –add-columns algn_ref_span | basepairs of reference covered by alignment |
algn_read_span1, algn_read_span2 | parse/parse2 | number | pairtools parse –add-columns algn_read_span | basepairs of read covered by alignment |
dist_to_51, dist_to_52 | parse/parse2 | number | pairtools parse –add-columns dist_to_5 | distance to 5’-end of read |
dist_to_31, dist_to_32 | parse/parse2 | number | pairtools parse –add-columns dist_to_3 | distance to 3’-end of read |
seq1, seq2 | parse/parse2 | string | pairtools parse –add-columns seq | sequence of alignment |
mismatches1, mismatches2 | parse/parse2 | string | pairtools parse –add-columns mismatches | comma-separated list of mismatches relative to the reference, “{ref_letter}:{mut_letter}:{phred}:{ref_position}:{read_position}” |
XB1/2,AS1/2,XS1/2 or any sam tag | parse/parse2 | pairtools parse –add-columns XA,XB,NM | format depends on tag specification | |
walk_pair_type | parse/parse2 | string | pairtools parse2 –add-pair-index | Type of the pair relative to R1 and R2 reads of paired-end sequencing, see pasring docs |
walk_pair_index | parse/parse2 | number | pairtools parse2 –add-pair-index | Order of the pair in the complex walk, starting from 5’-end of left read, see pasring docs |
phase | phase | 0, 1 or “.” | pairtools phase | Phase of alignment (haplotype 1, 2, on unphased), see phasing walkthrough |
rfrag1, rfrag2 | restrict | number | pairtools restrict | Unique index of the restriction fragment after annotating pairs positions, see restriction walkthrough |
rfrag_start1, rfrag_start2 | restrict | number | pairtools restrict | Coordinate of the start of restriction fragment |
rfrag_end1, rfrag_end2 | restrict | number | pairtools restrict | Coordinate of the end of restriction fragment |
Technical notes¶
Designing scientific software and formats requires making a multitude of tantalizing technical decisions and compromises. Often, the reasons behind a certain decision are non-trivial and convoluted, involving many factors. Here, we collect the notes and observations made during the desing stage of pairtools and provide a justification for most non-trivial decisions. We hope that this document will elucidate the design of pairtools and may prove useful to developers in their future projects.
.pairs format¶
The motivation behind some of the technical decisions in the pairtools’ flavor of .pairs/.pairsam:
pairtools can store SAM entries together with the Hi-C pair information in .pairsam files. Storing pairs and alignments in the same row enables easy tagging and filtering of paired-end alignments based on their Hi-C information.
pairtools use the exclamation mark “!” instead of ‘.’ as ‘chrom’ of unmapped reads because it has the lowest lexicographic sorting order among all characters. The use of ‘0’ and ‘-’ in the ‘pos’ and ‘strand’ fields of unmapped reads allows us to keep the types of these fields as ‘unsigned int’ and enum{‘+’,’-‘}, respectively.
“rescued” pairs have two types “UR” and “RU” instead of just “RU”. We chose this design because rescued pairs are two-sided and thus are flipped based on (chrom, pos), and not based on the side types. With two pair types “RU” and “UR”, pairtools can keep track of which side of the pair was rescued.
in “rescued” pairs, the type “R” is assigned to the non-chimeric side. This may seem counter-intuitive at first, since it is the chimeric side that gets rescued, but this way pairtools can keep track of the type of the 5’ alignment on the chimeric side (the alignment on the non-chimeric side has to be unique for the pair to be rescued).
pairtools rely on a text format, .pairs, instead of hdf5/parquet-based tables or custom binaries. We went with a text format for a few reasons:
- text tables enable easy access to data from any language and any tool. This is especially important at the level of Hi-C pairs, the “rawest” format of information from a Hi-C experiment.
- hdf5 and parquet have a few shortcomings that hinder their immediate use in pairtools. Specifically, hdf5 cannot compress variable-length strings (which are, in turn, required to store sam alignments and some optional information on pairs) and parquet cannot append columns to existing files, modify datasets in place or store multiple tables in one file (which is required to keep table indices in the same file with pairs).
- text tables have a set of well-developed and highly-optimized tools for sorting (Unix sort), compression (bgzip/lz4) and random access (tabix).
- text formats enable easy streaming between individual command-line tools.
Having said that, text formats have many downsides - they are bulky when not compressed, compression and parsing requires extra computational resources, they cannot be modified in place and random access requires extra tools. In the future, we plan to develop a binary format based on existing container formats, which would mitigate these downsides.
CLI¶
- many pairtools perform multiple actions at once, which contradicts the “do one thing” philosophy of Unix command line. We packed multiple (albeit, related) functions into one tool to improve the performance of pairtools. Specifically, given the large size of Hi-C data, a significant fraction of time is spent on compression/decompression, parsing, loading data into memory and sending it over network (for cloud/clusters). Packing multiple functions into one tool cuts down the amount of such time consuming operations.
pairtools parse
requires a .chromsizes file to know the order of chromosomes and perform pair flipping.- pairtools use bgzip compression by default instead of gzip. Using bgzip allows us to create an index with pairix and get random access to data.
- paritools have an option to compress outputs with lz4. Lz4 is much faster and only slighly less efficient than gzip. This makes lz4 a better choice for passing data between individual pairtools before producing final result (which, in turn, requires bgzip compression).
Command-line API¶
pairtools¶
Flexible tools for Hi-C data processing.
All pairtools have a few common options, which should be typed _before_ the command name.
pairtools [OPTIONS] COMMAND [ARGS]...
Options
-
--post-mortem
¶
Post mortem debugging
-
--output-profile
<output_profile>
¶ Profile performance with Python cProfile and dump the statistics into a binary file
-
-v
,
--verbose
¶
Verbose logging.
-
-d
,
--debug
¶
On error, drop into the post-mortem debugger shell.
-
--version
¶
Show the version and exit.
dedup¶
Find and remove PCR/optical duplicates.
Find PCR/optical duplicates in an upper-triangular flipped sorted pairs/pairsam file. Allow for a +/-N bp mismatch at each side of duplicated molecules.
PAIRS_PATH : input triu-flipped sorted .pairs or .pairsam file. If the path ends with .gz/.lz4, the input is decompressed by bgzip/lz4c. By default, the input is read from stdin.
pairtools dedup [OPTIONS] [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output file for pairs after duplicate removal. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--output-dups
<output_dups>
¶ output file for duplicated pairs. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. If the path is the same as in –output or -, output duplicates together with deduped pairs. By default, duplicates are dropped.
-
--output-unmapped
<output_unmapped>
¶ output file for unmapped pairs. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. If the path is the same as in –output or -, output unmapped pairs together with deduped pairs. If the path is the same as –output-dups, output unmapped reads together with dups. By default, unmapped pairs are dropped.
-
--output-stats
<output_stats>
¶ output file for duplicate statistics. If file exists, it will be open in the append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, statistics are not printed.
-
--output-bytile-stats
<output_bytile_stats>
¶ output file for duplicate statistics. Note that the readID should be provided and contain tile information for this option. This analysis is possible when pairtools is run on a dataset with original Illumina-generated read IDs, because SRA does not store original read IDs from the sequencer. By default, by-tile duplicate statistics are not printed. If file exists, it will be open in the append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed.
-
--max-mismatch
<max_mismatch>
¶ Pairs with both sides mapped within this distance (bp) from each other are considered duplicates. [dedup option] [default: 3]
-
--method
<method>
¶ define the mismatch as either the max or the sum of the mismatches ofthe genomic locations of the both sides of the two compared molecules. [dedup option] [default: max]
-
--backend
<backend>
¶ What backend to use: scipy and sklearn are based on KD-trees, cython is online indexed list-based algorithm. With cython backend, duplication is not transitive with non-zero max mismatch (e.g. pairs A and B are duplicates, and B and C are duplicates, then A and C are not necessary duplicates of each other), while with scipy and sklearn it’s transitive (i.e. A and C are necessarily duplicates). Cython is the original version used in pairtools since its beginning. It is available for backwards compatibility and to allow specification of the column order. Now the default scipy backend is generally the fastest, and with chunksize below 1 mln has the lowest memory requirements. [dedup option]
-
--chunksize
<chunksize>
¶ Number of pairs in each chunk. Reduce for lower memory footprint. Below 10,000 performance starts suffering significantly and the algorithm might miss a few duplicates with non-zero –max-mismatch. Only works with ‘–backend scipy or sklearn’. [dedup option] [default: 100000]
-
--carryover
<carryover>
¶ Number of deduped pairs to carry over from previous chunk to the new chunk to avoid breaking duplicate clusters. Only works with ‘–backend scipy or sklearn’. [dedup option] [default: 100]
-
-p
,
--n-proc
<n_proc>
¶ Number of cores to use. Only applies with sklearn backend.Still needs testing whether it is ever useful. [dedup option]
-
--mark-dups
¶
If specified, duplicate pairs are marked as DD in “pair_type” and as a duplicate in the sam entries. [output format option]
-
--keep-parent-id
¶
If specified, duplicate pairs are marked with the readID of the retained deduped read in the ‘parent_readID’ field. [output format option]
-
--extra-col-pair
<extra_col_pair>
¶ Extra columns that also must match for two pairs to be marked as duplicates. Can be either provided as 0-based column indices or as column names (requires the “#columns” header field). The option can be provided multiple times if multiple column pairs must match. Example: –extra-col-pair “phase1” “phase2”. [output format option]
-
--sep
<sep>
¶ Separator (t, v, etc. characters are supported, pass them in quotes). [input format option]
-
--send-header-to
<send_header_to>
¶ Which of the outputs should receive header and comment lines. [input format option]
-
--c1
<c1>
¶ Chrom 1 column; default 1 Only works with ‘–backend cython’. [input format option]
-
--c2
<c2>
¶ Chrom 2 column; default 3 Only works with ‘–backend cython’. [input format option]
-
--p1
<p1>
¶ Position 1 column; default 2 Only works with ‘–backend cython’. [input format option]
-
--p2
<p2>
¶ Position 2 column; default 4 Only works with ‘–backend cython’. [input format option]
-
--s1
<s1>
¶ Strand 1 column; default 5 Only works with ‘–backend cython’. [input format option]
-
--s2
<s2>
¶ Strand 2 column; default 6 Only works with ‘–backend cython’. [input format option]
-
--unmapped-chrom
<unmapped_chrom>
¶ Placeholder for a chromosome on an unmapped side; default !
-
--yaml
,
--no-yaml
¶
Output stats in yaml format instead of table. [output stats format option]
-
--filter
<filter>
¶ Filter stats with condition to apply to the data (similar to pairtools select or pairtools stats). For non-YAML output only the first filter will be reported. [output stats filtering option] Note that this will not change the deduplicated output pairs. Example: pairtools dedup –yaml –filter ‘unique:(pair_type==”UU”)’ –filter ‘close:(pair_type==”UU”) and (abs(pos1-pos2)<10)’ –output-stats - test.pairs
-
--engine
<engine>
¶ Engine for regular expression parsing for stats filtering. Python will provide you regex functionality, while pandas does not accept custom funtctions and works faster. [output stats filtering option]
-
--chrom-subset
<chrom_subset>
¶ A path to a chromosomes file (tab-separated, 1st column contains chromosome names) containing a chromosome subset of interest for stats filter. If provided, additionally filter pairs with both sides originating from the provided subset of chromosomes. This operation modifies the #chromosomes: and #chromsize: header fields accordingly. Note that this will not change the deduplicated output pairs. [output stats filtering option]
-
--startup-code
<startup_code>
¶ An auxiliary code to execute before filteringfor stats. Use to define functions that can be evaluated in the CONDITION statement. [output stats filtering option]
-
-t
,
--type-cast
<type_cast>
¶ Cast a given column to a given type for stats filtering. By default, only pos and mapq are cast to int, other columns are kept as str. Provide as -t <column_name> <type>, e.g. -t read_len1 int. Multiple entries are allowed. [output stats filtering option]
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
PAIRS_PATH
¶
Optional argument
filterbycov¶
Remove pairs from regions of high coverage.
Find and remove pairs with >(MAX_COV-1) neighbouring pairs within a +/- MAX_DIST bp window around either side. Useful for single-cell Hi-C experiments, where coverage is naturally limited by the chromosome copy number.
PAIRS_PATH : input triu-flipped sorted .pairs or .pairsam file. If the path ends with .gz/.lz4, the input is decompressed by bgzip/lz4c. By default, the input is read from stdin.
pairtools filterbycov [OPTIONS] [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output file for pairs from low coverage regions. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--output-highcov
<output_highcov>
¶ output file for pairs from high coverage regions. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. If the path is the same as in –output or -, output duplicates together with deduped pairs. By default, duplicates are dropped.
-
--output-unmapped
<output_unmapped>
¶ output file for unmapped pairs. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. If the path is the same as in –output or -, output unmapped pairs together with deduped pairs. If the path is the same as –output-highcov, output unmapped reads together. By default, unmapped pairs are dropped.
-
--output-stats
<output_stats>
¶ output file for statistics of multiple interactors. If file exists, it will be open in the append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, statistics are not printed.
-
--max-cov
<max_cov>
¶ The maximum allowed coverage per region.
-
--max-dist
<max_dist>
¶ The resolution for calculating coverage. For each pair, the local coverage around each end is calculated as (1 + the number of neighbouring pairs within +/- max_dist bp)
-
--method
<method>
¶ calculate the number of neighbouring pairs as either the sum or the max of the number of neighbours on the two sides [default: max]
-
--sep
<sep>
¶ Separator (t, v, etc. characters are supported, pass them in quotes)
-
--comment-char
<comment_char>
¶ The first character of comment lines
-
--send-header-to
<send_header_to>
¶ Which of the outputs should receive header and comment lines
-
--c1
<c1>
¶ Chrom 1 column; default 1
-
--c2
<c2>
¶ Chrom 2 column; default 3
-
--p1
<p1>
¶ Position 1 column; default 2
-
--p2
<p2>
¶ Position 2 column; default 4
-
--s1
<s1>
¶ Strand 1 column; default 5
-
--s2
<s2>
¶ Strand 2 column; default 6
-
--unmapped-chrom
<unmapped_chrom>
¶ Placeholder for a chromosome on an unmapped side; default !
-
--mark-multi
¶
If specified, duplicate pairs are marked as FF in “pair_type” and as a duplicate in the sam entries.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
PAIRS_PATH
¶
Optional argument
flip¶
Flip pairs to get an upper-triangular matrix.
Change the order of side1 and side2 in pairs, such that (order(chrom1) < order(chrom2)
or (order(chrom1) == order(chrom2)) and (pos1 <=pos2))Equivalent to reflecting the lower triangle of a Hi-C matrix onto its upper triangle, resulting in an upper triangular matrix. The order of chromosomes must be provided via a .chromsizes file.
PAIRS_PATH : input .pairs/.pairsam file. If the path ends with .gz or .lz4, the input is decompressed by bgzip/lz4c. By default, the input is read from stdin.
pairtools flip [OPTIONS] [PAIRS_PATH]
Options
-
-c
,
--chroms-path
<chroms_path>
¶ Chromosome order used to flip interchromosomal mates: path to a chromosomes file (e.g. UCSC chrom.sizes or similar) whose first column lists scaffold names. Any scaffolds not listed will be ordered lexicographically following the names provided. [required]
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
PAIRS_PATH
¶
Optional argument
header¶
Manipulate the .pairs/.pairsam header
pairtools header [OPTIONS] COMMAND [ARGS]...
generate¶
Generate the header
pairtools header generate [OPTIONS] [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 1]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
-
--chroms-path
<chroms_path>
¶ Chromosome order used to flip interchromosomal mates: path to a chromosomes file (e.g. UCSC chrom.sizes or similar) whose first column lists scaffold names. Any scaffolds not listed will be ordered lexicographically following the names provided.
-
--sam-path
<sam_path>
¶ Input sam file to inherit the header. Either –sam or –chroms-path should be provided to store the chromosome sizes in the header.
-
--columns
<columns>
¶ Report columns describing alignments Can take multiple values as a comma-separated list.By default, assign standard .pairs columns: readID,chrom1,pos1,chrom2,pos2,strand1,strand2,pair_type,sam1,sam2,walk_pair_index,walk_pair_type
-
--extra-columns
<extra_columns>
¶ Report extra columns describing alignments Can take multiple values as a comma-separated list.
-
--assembly
<assembly>
¶ Name of genome assembly (e.g. hg19, mm10) to store in the pairs header.
-
--no-flip
¶
If specified, assume that the pairs are not filpped in genomic order and instead preserve the order in which they were sequenced.
-
--pairs
,
--pairsam
¶
If pairs, then the defult columns will be set to: readID,chrom1,pos1,chrom2,pos2,strand1,strand2,pair_type if pairsam, then to: readID,chrom1,pos1,chrom2,pos2,strand1,strand2,pair_type,sam1,sam2
Arguments
-
PAIRS_PATH
¶
Optional argument
set-columns¶
Add the columns to the .pairs/pairsam file
pairtools header set-columns [OPTIONS] [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 1]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
-
-c
,
--columns
<columns>
¶ Comma-separated list of columns to be added, e.g.: readID,chrom1,pos1,chrom2,pos2,strand1,strand2,pair_type,sam1,sam2,walk_pair_index,walk_pair_type [required]
Arguments
-
PAIRS_PATH
¶
Optional argument
transfer¶
Transfer the header from one pairs file to another
pairtools header transfer [OPTIONS] [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 1]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
-
-r
,
--reference-file
<reference_file>
¶ Header file for transfer [required]
Arguments
-
PAIRS_PATH
¶
Optional argument
validate-columns¶
Validate the columns of the .pairs/pairsam file against reference or within file. If the checks pass, then returns full pairs file. Otherwise throws an exception.
- If reference_file is provided, check:
- columns are the same between pairs and reference_file
- number of columns in the pairs body is the same as the number of columns
- If reference_columns are provided, check:
- pairs columns are the same as provided
- number of columns in the pairs body is the same as the number of columns
If no reference_file or columns, then check only the number of columns in the pairs body. Checks only the first line in the pairs stream!
pairtools header validate-columns [OPTIONS] [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 1]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
-
-r
,
--reference-file
<reference_file>
¶ Header file for comparison (optional)
-
-c
,
--reference-columns
<reference_columns>
¶ Comma-separated list of columns fro check (optional), e.g.: readID,chrom1,pos1,chrom2,pos2,strand1,strand2,pair_type,sam1,sam2,walk_pair_index,walk_pair_type
Arguments
-
PAIRS_PATH
¶
Optional argument
markasdup¶
Tag all pairs in the input file as duplicates.
Change the type of all pairs inside a .pairs/.pairsam file to DD. If sam entries are present, change the pair type in the Yt SAM tag to ‘Yt:Z:DD’.
PAIRSAM_PATH : input .pairs/.pairsam file. If the path ends with .gz, the input is gzip-decompressed. By default, the input is read from stdin.
pairtools markasdup [OPTIONS] [PAIRSAM_PATH]
Options
-
-o
,
--output
<output>
¶ output .pairsam file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
PAIRSAM_PATH
¶
Optional argument
merge¶
- Merge .pairs/.pairsam files.
By default, assumes that the files are sorted and maintains the sorting.
Merge triu-flipped sorted pairs/pairsam files. If present, the @SQ records of the SAM header must be identical; the sorting order of these lines is taken from the first file in the list. The ID fields of the @PG records of the SAM header are modified with a numeric suffix to produce unique records. The other unique SAM and non-SAM header lines are copied into the output header.
PAIRS_PATH : upper-triangular flipped sorted .pairs/.pairsam files to merge or a group/groups of .pairs/.pairsam files specified by a wildcard. For paths ending in .gz/.lz4, the files are decompressed by bgzip/lz4c.
pairtools merge [OPTIONS] [PAIRS_PATH]...
Options
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz/.lz4, the output is compressed by bgzip/lz4c. By default, the output is printed into stdout.
-
--max-nmerge
<max_nmerge>
¶ The maximal number of inputs merged at once. For more, store merged intermediates in temporary files. [default: 8]
-
--tmpdir
<tmpdir>
¶ Custom temporary folder for merged intermediates.
-
--memory
<memory>
¶ The amount of memory used by default. [default: 2G]
-
--compress-program
<compress_program>
¶ A binary to compress temporary merged chunks. Must decompress input when the flag -d is provided. Suggested alternatives: lz4c, gzip, lzop, snzip. NOTE: fails silently if the command syntax is wrong. [default: ]
-
--nproc
<nproc>
¶ Number of threads for merging. [default: 8]
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 1]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
-
--keep-first-header
,
--no-keep-first-header
¶
Keep the first header or merge the headers together. Default: merge headers. [default: False]
-
--concatenate
,
--no-concatenate
¶
Simple concatenate instead of merging sorted files. [default: False]
Arguments
-
PAIRS_PATH
¶
Optional argument(s)
parse¶
- Find ligation pairs in .sam data, make .pairs.
- SAM_PATH : an input .sam/.bam file with paired-end sequence alignments of Hi-C molecules. If the path ends with .bam, the input is decompressed from bam with samtools. By default, the input is read from stdin.
pairtools parse [OPTIONS] [SAM_PATH]
Options
-
-c
,
--chroms-path
<chroms_path>
¶ Chromosome order used to flip interchromosomal mates: path to a chromosomes file (e.g. UCSC chrom.sizes or similar) whose first column lists scaffold names. Any scaffolds not listed will be ordered lexicographically following the names provided. [required]
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz or .lz4, the output is bgzip-/lz4-compressed.By default, the output is printed into stdout.
-
--assembly
<assembly>
¶ Name of genome assembly (e.g. hg19, mm10) to store in the pairs header.
-
--min-mapq
<min_mapq>
¶ The minimal MAPQ score to consider a read as uniquely mapped [default: 1]
-
--max-molecule-size
<max_molecule_size>
¶ The maximal size of a Hi-C molecule; used to rescue single ligations(from molecules with three alignments) and to rescue complex ligations.The default is based on oriented P(s) at short ranges of multiple Hi-C.Not used with walks-policy all. [default: 750]
-
--drop-readid
¶
If specified, do not add read ids to the output
-
--drop-seq
¶
If specified, remove sequences and PHREDs from the sam fields
-
--drop-sam
¶
If specified, do not add sams to the output
-
--add-pair-index
¶
If specified, each pair will have pair index in the molecule
-
--add-columns
<add_columns>
¶ Report extra columns describing alignments Possible values (can take multiple values as a comma-separated list): a SAM tag (any pair of uppercase letters) or mapq, pos5, pos3, cigar, read_len, matched_bp, algn_ref_span, algn_read_span, dist_to_5, dist_to_3, seq, mismatches.
-
--output-parsed-alignments
<output_parsed_alignments>
¶ output file for all parsed alignments, including walks. Useful for debugging and rnalysis of walks. If file exists, it will be open in the append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4-compressed. By default, not used.
-
--output-stats
<output_stats>
¶ output file for various statistics of pairs file. By default, statistics is not generated.
-
--report-alignment-end
<report_alignment_end>
¶ specifies whether the 5’ or 3’ end of the alignment is reported as the position of the Hi-C read.
-
--max-inter-align-gap
<max_inter_align_gap>
¶ read segments that are not covered by any alignment and longer than the specified value are treated as “null” alignments. These null alignments convert otherwise linear alignments into walks, and affect how they get reported as a Hi-C pair (see –walks-policy). [default: 20]
-
--walks-policy
<walks_policy>
¶ the policy for reporting unrescuable walks (reads containing more than one alignment on one or both sides, that can not be explained by a single ligation between two mappable DNA fragments). “mask” - mask walks (chrom=”!”, pos=0, strand=”-“); “5any” - report the 5’-most alignment on each side; “5unique” - report the 5’-most unique alignment on each side, if present; “3any” - report the 3’-most alignment on each side; “3unique” - report the 3’-most unique alignment on each side, if present; “all” - report all available unique alignments on each side. [default: mask]
-
--readid-transform
<readid_transform>
¶ A Python expression to modify read IDs. Useful when read IDs differ between the two reads of a pair. Must be a valid Python expression that uses variables called readID and/or i (the 0-based index of the read pair in the bam file) and returns a new value, e.g. “readID[:-2]+’_’+str(i)”. Make sure that transformed readIDs remain unique!
-
--flip
,
--no-flip
¶
If specified, do not flip pairs in genomic order and instead preserve the order in which they were sequenced.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
SAM_PATH
¶
Optional argument
parse2¶
- Extracts pairs from .sam/.bam data with complex walks, make .pairs.
- SAM_PATH : an input .sam/.bam file with paired-end or single-end sequence alignments of Hi-C (or Hi-C-like) molecules. If the path ends with .bam, the input is decompressed from bam with samtools. By default, the input is read from stdin.
pairtools parse2 [OPTIONS] [SAM_PATH]
Options
-
-c
,
--chroms-path
<chroms_path>
¶ Chromosome order used to flip interchromosomal mates: path to a chromosomes file (e.g. UCSC chrom.sizes or similar) whose first column lists scaffold names. Any scaffolds not listed will be ordered lexicographically following the names provided. [required]
-
-o
,
--output
<output>
¶ output file with pairs. If the path ends with .gz or .lz4, the output is bgzip-/lz4-compressed.By default, the output is printed into stdout.
-
--report-position
<report_position>
¶ Reported position of alignments in pairs of complex walks (pos columns). Each alignment in .bam/.sam Hi-C-like data has two ends, and you can report one or another depending of the position of alignment on a read or in a pair.
“junction” - inner ends of sequential alignments in each pair, aka ligation junctions (complex walks default), “read” - 5’-end of alignments relative to R1 or R2 read coordinate system (as in traditional Hi-C), “walk” - 5’-end of alignments relative to the whole walk coordinate system, “outer” - outer ends of sequential alignments in each pair.
-
--report-orientation
<report_orientation>
¶ Reported orientataion of pairs in complex walk (strand columns). Each alignment in .bam/.sam Hi-C-like data has orientation, and you can report it relative to the read, pair or whole walk coordinate system.
“pair” - orientation as if each pair in complex walk was sequenced independently from the outer ends or molecule (as in traditional Hi-C, also complex walks default), “read” - orientation defined by the read (R1 or R2 read coordinate system), “walk” - orientation defined by the walk coordinate system, “junction” - reversed “pair” orientation, as if pair was sequenced in both directions starting from the junction
-
--assembly
<assembly>
¶ Name of genome assembly (e.g. hg19, mm10) to store in the pairs header.
-
--min-mapq
<min_mapq>
¶ The minimal MAPQ score to consider a read as uniquely mapped. [default: 1]
-
--max-inter-align-gap
<max_inter_align_gap>
¶ Read segments that are not covered by any alignment and longer than the specified value are treated as “null” alignments. These null alignments convert otherwise linear alignments into walks, and affect how they get reported as a Hi-C pair. [default: 20]
-
--max-insert-size
<max_insert_size>
¶ When searching for overlapping ends of left and right read (R1 and R2), this sets the minimal distance when two alignments on the same strand and chromosome are considered part of the same fragment (and thus reported as the same alignment and not a pair). For traditional Hi-C with long restriction fragments and shorter molecules after ligation+sonication, this can be the expected molecule size. For complex walks with short restriction fragments, this can be the expected restriction fragment size. Note that unsequenced insert is terra incognita and might contain unsequenced DNA (including ligations) in it. This parameter is ignored in –single-end mode. [default: 500]
-
--dedup-max-mismatch
<dedup_max_mismatch>
¶ Allowed mismatch between intramolecular alignments to detect readthrough duplicates. Pairs with both sides mapped within this distance (bp) from each other are considered duplicates. [default: 3]
-
--single-end
¶
If specified, the input is single-end. Never use this for paired-end data, because R1 read will be omitted. If single-end data is provided, but parameter is unset, the pairs will be generated, but may contain artificial UN pairs.
-
--expand
,
--no-expand
¶
If specified, perform combinatorial expansion on the pairs. Combinatorial expansion is a way to increase the number of contacts in you data, assuming that all DNA fragments in the same molecule (read) are in contact. Expanded pairs have modified pair type, ‘E{separation}_{pair type}’
-
--max-expansion-depth
<max_expansion_depth>
¶ Works in combination with –expand. Maximum number of segments separating pair. By default, expanding all possible pairs.Setting the number will limit the expansion depth and enforce contacts from the same side of the read.
-
--add-pair-index
¶
If specified, parse2 will report pair index in the walk as additional columns (R1, R2, R1&R2 or R1-R2). See documentation: https://pairtools.readthedocs.io/en/latest/parsing.html#rescuing-complex-walks For combinatorial expanded pairs, two numbers will be reported: original pair index of the left and right segments.
-
--flip
,
--no-flip
¶
If specified, flip pairs in genomic order and instead preserve the order in which they were sequenced. Note that no flip is recommended for analysis of walks because it will override the order of alignments in pairs. Flip is required for appropriate deduplication of sorted pairs. Flip is not required for cooler cload, which runs flipping internally.
-
--add-columns
<add_columns>
¶ Report extra columns describing alignments Possible values (can take multiple values as a comma-separated list): a SAM tag (any pair of uppercase letters) or mapq, pos5, pos3, cigar, read_len, matched_bp, algn_ref_span, algn_read_span, dist_to_5, dist_to_3, seq, mismatches.
-
--drop-readid
,
--keep-readid
¶
If specified, do not add read ids to the output. By default, keep read ids. Useful for long walks analysis.
-
--readid-transform
<readid_transform>
¶ A Python expression to modify read IDs. Useful when read IDs differ between the two reads of a pair. Must be a valid Python expression that uses variables called readID and/or i (the 0-based index of the read pair in the bam file) and returns a new value, e.g. “readID[:-2]+’_’+str(i)”. Make sure that transformed readIDs remain unique!
-
--drop-seq
,
--keep-seq
¶
Remove sequences and PHREDs from the sam fields by default. Kept otherwise.
-
--drop-sam
,
--keep-sam
¶
Do not add sams to the output by default. Kept otherwise.
-
--output-parsed-alignments
<output_parsed_alignments>
¶ output file with all parsed alignments (one alignment per line). Useful for debugging and analysis of walks. If file exists, it will be open in the append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4-compressed. By default, not used.
-
--output-stats
<output_stats>
¶ output file for various statistics of pairs file. By default, statistics is not generated.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
SAM_PATH
¶
Optional argument
phase¶
- Phase pairs mapped to a diploid genome.
Diploid genome is the genome with two set of the chromosome variants, where each chromosome has one of two suffixes (phase-suffixes) corresponding to the genome version (phase-suffixes).
By default, phasing adds two additional columns with phase 0, 1 or “.” (unpahsed).
Phasing is based on detection of chromosome origin of each mapped fragment. Three scores are considered: best alignment score (S1), suboptimal alignment (S2) and second suboptimal alignment (S3) scores. Each fragment can be: 1) uniquely mapped and phased (S1>S2>S3, first alignment is the best hit), 2) uniquely mapped but unphased (S1=S2>S3, cannot distinguish between chromosome variants), 3) multiply mapped (S1=S2=S3) or unmapped.
PAIRS_PATH : input .pairs/.pairsam file. If the path ends with .gz or .lz4, the input is decompressed by bgzip/lz4c. By default, the input is read from stdin.
pairtools phase [OPTIONS] [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--phase-suffixes
<phase_suffixes>
¶ Phase suffixes (of the chrom names), always a pair.
-
--clean-output
¶
Drop all columns besides the standard ones and phase1/2
-
--tag-mode
<tag_mode>
¶ Specifies the mode of bwa reporting. XA will parse ‘XA’, the input should be generated with: –add-columns XA,NM,AS,XS –min-mapq 0 XB will parse ‘XB’ tag, the input should be generated with: –add-columns XB,AS,XS –min-mapq 0 Note that XB tag is added by running bwa with -u tag, present in github version. Both modes report similar results: XB reports 0.002% contacts more for phased data, while XA can report ~1-2% more unphased contacts because its definition multiple mappers is more premissive.
-
--report-scores
,
--no-report-scores
¶
Report scores of optional, suboptimal and second suboptimal alignments. NM (edit distance) with –tag-mode XA and AS (alfn score) with –tag-mode XB
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
PAIRS_PATH
¶
Optional argument
restrict¶
Assign restriction fragments to pairs.
Identify the restriction fragments that got ligated into a Hi-C molecule.
Note: rfrags are 0-indexed
PAIRS_PATH : input .pairs/.pairsam file. If the path ends with .gz/.lz4, the input is decompressed by bgzip/lz4c. By default, the input is read from stdin.
pairtools restrict [OPTIONS] [PAIRS_PATH]
Options
-
-f
,
--frags
<frags>
¶ a tab-separated BED file with the positions of restriction fragments (chrom, start, end). Can be generated using cooler digest. [required]
-
-o
,
--output
<output>
¶ output .pairs/.pairsam file. If the path ends with .gz/.lz4, the output is compressed by bgzip/lz4c. By default, the output is printed into stdout.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
PAIRS_PATH
¶
Optional argument
sample¶
Select a random subset of pairs in a pairs file.
FRACTION: the fraction of the randomly selected pairs subset
PAIRS_PATH : input .pairs/.pairsam file. If the path ends with .gz or .lz4, the input is decompressed by bgzip/lz4c. By default, the input is read from stdin.
pairtools sample [OPTIONS] FRACTION [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
-s
,
--seed
<seed>
¶ the seed of the random number generator.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
FRACTION
¶
Required argument
-
PAIRS_PATH
¶
Optional argument
scaling¶
Calculate pairs scalings.
INPUT_PATH : by default, a .pairs/.pairsam file to calculate statistics. If not provided, the input is read from stdin.
The files with paths ending with .gz/.lz4 are decompressed by bgzip/lz4c.
Output is .tsv file with scaling stats (both cis scalings and trans levels).
pairtools scaling [OPTIONS] [INPUT_PATH]...
Options
-
-o
,
--output
<output>
¶ output .tsv file with summary.
-
--view
,
--regions
<view>
¶ Path to a BED file which defines which regions (viewframe) of the chromosomes to use. By default, this is parsed from .pairs header.
-
--chunksize
<chunksize>
¶ Number of pairs in each chunk. Reduce for lower memory footprint. [default: 100000]
-
--dist-range
<dist_range>
¶ Distance range. [default: 10, 1000000000]
-
--n-dist-bins
<n_dist_bins>
¶ Number of distance bins to split the distance range. [default: 128]
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
INPUT_PATH
¶
Optional argument(s)
select¶
Select pairs according to some condition.
CONDITION : A Python expression; if it returns True, select the read pair. Any column declared in the #columns line of the pairs header can be accessed by its name. If the header lacks the #columns line, the columns are assumed to follow the .pairs/.pairsam standard (readID, chrom1, chrom2, pos1, pos2, strand1, strand2, pair_type). Finally, CONDITION has access to COLS list which contains the string values of columns. In Bash, quote CONDITION with single quotes, and use double quotes for string variables inside CONDITION.
PAIRS_PATH : input .pairs/.pairsam file. If the path ends with .gz or .lz4, the input is decompressed by bgzip/lz4c. By default, the input is read from stdin.
The following functions can be used in CONDITION besides the standard Python functions:
- csv_match(x, csv) - True if variable x is contained in a list of
comma-separated values, e.g. csv_match(chrom1, ‘chr1,chr2’)
- wildcard_match(x, wildcard) - True if variable x matches a wildcard,
e.g. wildcard_match(pair_type, ‘C*’)
- regex_match(x, regex) - True if variable x matches a Python-flavor regex,
e.g. regex_match(chrom1, ‘chrd’)
Examples: pairtools select ‘(pair_type==”UU”) or (pair_type==”UR”) or (pair_type==”RU”)’ pairtools select ‘chrom1==chrom2’ pairtools select ‘COLS[1]==COLS[3]’ pairtools select ‘(chrom1==chrom2) and (abs(pos1 - pos2) < 1e6)’ pairtools select ‘(chrom1==”!”) and (chrom2!=”!”)’ pairtools select ‘regex_match(chrom1, “chrd+”) and regex_match(chrom2, “chrd+”)’
pairtools select ‘True’ –chrom-subset mm9.reduced.chromsizes
pairtools select [OPTIONS] CONDITION [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, the output is printed into stdout.
-
--output-rest
<output_rest>
¶ output file for pairs of other types. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, such pairs are dropped.
-
--chrom-subset
<chrom_subset>
¶ A path to a chromosomes file (tab-separated, 1st column contains chromosome names) containing a chromosome subset of interest. If provided, additionally filter pairs with both sides originating from the provided subset of chromosomes. This operation modifies the #chromosomes: and #chromsize: header fields accordingly.
-
--startup-code
<startup_code>
¶ An auxiliary code to execute before filtering. Use to define functions that can be evaluated in the CONDITION statement
-
-t
,
--type-cast
<type_cast>
¶ Cast a given column to a given type. By default, only pos and mapq are cast to int, other columns are kept as str. Provide as -t <column_name> <type>, e.g. -t read_len1 int. Multiple entries are allowed.
-
-r
,
--remove-columns
<remove_columns>
¶ Comma-separated list of columns to be removed, e.g.: readID,chrom1,pos1,chrom2,pos2,strand1,strand2,pair_type,sam1,sam2,walk_pair_index,walk_pair_type
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
CONDITION
¶
Required argument
-
PAIRS_PATH
¶
Optional argument
sort¶
Sort a .pairs/.pairsam file.
Sort pairs in the lexicographic order along chrom1 and chrom2, in the numeric order along pos1 and pos2 and in the lexicographic order along pair_type.
PAIRS_PATH : input .pairs/.pairsam file. If the path ends with .gz or .lz4, the input is decompressed by bgzip or lz4c, correspondingly. By default, the input is read as text from stdin.
pairtools sort [OPTIONS] [PAIRS_PATH]
Options
-
-o
,
--output
<output>
¶ output pairs file. If the path ends with .gz or .lz4, the output is compressed by bgzip or lz4, correspondingly. By default, the output is printed into stdout.
-
--nproc
<nproc>
¶ Number of processes to split the sorting work between. [default: 8]
-
--tmpdir
<tmpdir>
¶ Custom temporary folder for sorting intermediates.
-
--memory
<memory>
¶ The amount of memory used by default. [default: 2G]
-
--compress-program
<compress_program>
¶ A binary to compress temporary sorted chunks. Must decompress input when the flag -d is provided. Suggested alternatives: gzip, lzop, lz4c, snzip. If “auto”, then use lz4c if available, and gzip otherwise. [default: auto]
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
PAIRS_PATH
¶
Optional argument
split¶
Split a .pairsam file into .pairs and .sam.
Restore a .sam file from sam1 and sam2 fields of a .pairsam file. Create a .pairs file without sam1/sam2 fields.
PAIRSAM_PATH : input .pairsam file. If the path ends with .gz or .lz4, the input is decompressed by bgzip or lz4c. By default, the input is read from stdin.
pairtools split [OPTIONS] [PAIRSAM_PATH]
Options
-
--output-pairs
<output_pairs>
¶ output pairs file. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. If -, pairs are printed to stdout. If not specified, pairs are dropped.
-
--output-sam
<output_sam>
¶ output sam file. If the path ends with .bam, the output is compressed into a bam file. If -, sam entries are printed to stdout. If not specified, sam entries are dropped.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
PAIRSAM_PATH
¶
Optional argument
stats¶
Calculate pairs statistics.
INPUT_PATH : by default, a .pairs/.pairsam file to calculate statistics. If not provided, the input is read from stdin. If –merge is specified, then INPUT_PATH is interpreted as an arbitrary number of stats files to merge.
The files with paths ending with .gz/.lz4 are decompressed by bgzip/lz4c.
pairtools stats [OPTIONS] [INPUT_PATH]...
Options
-
-o
,
--output
<output>
¶ output stats tsv file.
-
--merge
¶
If specified, merge multiple input stats files instead of calculating statistics of a .pairs/.pairsam file. Merging is performed via summation of all overlapping statistics. Non-overlapping statistics are appended to the end of the file. Supported for tsv stats with single filter.
-
--with-chromsizes
,
--no-chromsizes
¶
If enabled, will store sizes of chromosomes from the header of the pairs file in the stats file.
-
--yaml
,
--no-yaml
¶
Output stats in yaml format instead of table.
-
--bytile-dups
,
--no-bytile-dups
¶
If enabled, will analyse by-tile duplication statistics to estimate library complexity more accurately. Requires parent_readID column to be saved by dedup (will be ignored otherwise) Saves by-tile stats into –output_bytile-stats stream, or regular output if –output_bytile-stats is not provided.
-
--output-bytile-stats
<output_bytile_stats>
¶ output file for tile duplicate statistics. If file exists, it will be open in the append mode. If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed. By default, by-tile duplicate statistics are not printed. Note that the readID and parent_readID should be provided and contain tile information for this option.
-
--filter
<filter>
¶ Filters with conditions to apply to the data (similar to pairtools select). For non-YAML output only the first filter will be reported. Example: pairtools stats –yaml –filter ‘unique:(pair_type==”UU”)’ –filter ‘close:(pair_type==”UU”) and (abs(pos1-pos2)<10)’ test.pairs
-
--engine
<engine>
¶ Engine for regular expression parsing. Python will provide you regex functionality, while pandas does not accept custom funtctions and works faster.
-
--chrom-subset
<chrom_subset>
¶ A path to a chromosomes file (tab-separated, 1st column contains chromosome names) containing a chromosome subset of interest. If provided, additionally filter pairs with both sides originating from the provided subset of chromosomes. This operation modifies the #chromosomes: and #chromsize: header fields accordingly.
-
--startup-code
<startup_code>
¶ An auxiliary code to execute before filtering. Use to define functions that can be evaluated in the CONDITION statement
-
-t
,
--type-cast
<type_cast>
¶ Cast a given column to a given type. By default, only pos and mapq are cast to int, other columns are kept as str. Provide as -t <column_name> <type>, e.g. -t read_len1 int. Multiple entries are allowed.
-
--nproc-in
<nproc_in>
¶ Number of processes used by the auto-guessed input decompressing command. [default: 3]
-
--nproc-out
<nproc_out>
¶ Number of processes used by the auto-guessed output compressing command. [default: 8]
-
--cmd-in
<cmd_in>
¶ A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin and pairtools parse. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3
-
--cmd-out
<cmd_out>
¶ A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8
Arguments
-
INPUT_PATH
¶
Optional argument(s)
Pairtools walkthrough¶
Welcome to the pairtools walkthrough.
Pairtools is a tool for extraction of pairwise contacts out of sequencing chromosomes conformation capture data, such as Hi-C, Micro-C or MC-3C. Pairtools is used for obtaining .cool files by distiller, and has many more applications (see single-cell walkthrough or phasing walkthrough).
Here, we will cover the basic steps from raw reads to .cool file with binned contacts.
Outline:
- Download raw data
- Install reference genome
- Map data with bwa mem
- Extract contacts
- MultiQC
- Load pairs to cooler
- Visualize cooler
Download raw data¶
“Raw” data, or .fastq files are generated by sequencing facilities or can be taken from public databases, such as SRA. We will take a sample from Rao et at al. 2017, human datasets. To reduce computateion time, take 5 mln reads instead of full sample:
[ ]:
! fastq-dump SRR13849430 --gzip --split-spot --split-3 --minSpotId 0 --maxSpotId 5000000
[ ]:
! ls SRR13849430*.fastq.gz
[1]:
%%bash
gzip -dc SRR13849430_1.fastq.gz | head -n 8
@SRR13849430.1 1 length=150
NTCTCAGCCTTTATAAGATAGAAGAGAGTTGGGACCTTGCTCTAAATTCTGCTTTAGCAAGGGACTTTTGTACCTGCTTTCTTCCTTTATCCAGATCTAAAAATAGTTTATATGCTGACAACTCCCTGATGTTATTCTTTGTAGTATTTG
+SRR13849430.1 1 length=150
#AAFFJJJJJJJJJJJAJAJJJJFJJJAFFFFFFA7A-FJ7JJJ-AJAJJF-<-JJFFJ7FJJF7FJJFJJ<JFAF<JJJJAJJJJA-AFJ-FJ-FF<7-A-7A-FFF<J-FJJJ<F7)A-F--7))))--77--7------A---7-7-
@SRR13849430.2 2 length=150
NAAAAGGTTTATCTCTGTGAGATGAAGCCACATAAAACCAAGCAGTTTCACAGAAACTTGTTTGTAGATTTTAAATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTGCGGCCCTTCCGTATTCCCGCATA
+SRR13849430.2 2 length=150
#AAFAJJFJJJJFAAFJA<JJJFJJFJJJJJFFJ-FF<FFJJJJJJ7AFFF<FJJ-<-F7<7<A-AF7AJAJ-A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA--))--))-<))-------)))7-
We next map the sequencing data to the reference genome.
Install reference genome¶
Although you may download and build bwa index for any genome manually, genompy
can do that for you automatically:
[ ]:
# Activate bwa plugin for genomepy:
! genomepy plugin enable bwa
[ ]:
# Install hg38 genome by genomepy:
! genomepy install hg38
[2]:
# Check the location of the genome:
! ls ~/.local/share/genomes/hg38/
bwa-mem2 hg38_DpnII.bed hg38.fa hg38.fa.sizes index
hg38.blacklist.bed.gz hg38.DpnII.bed hg38.fa.fai hg38.gaps.bed README.txt
Map data with bwa mem¶
Note that you may use bwa mem2, which is x2 times faster. It proved to produce results very similar to bwa mem.
[ ]:
# Map test data:
! bwa mem -t 5 -SP ~/.local/share/genomes/hg38/index/bwa/hg38.fa SRR13849430_1.fastq.gz SRR13849430_2.fastq.gz > test.bam
After mapping, you have .sam/.bam alignment file, which cannot be interpreted as pairs directly. You need to extract contacts from it:
[3]:
%%bash
samtools view test.bam | head -n 4
SRR13849430.1 121 chr12 78795720 60 53S97M = 78795720 0 CAAATACTACAAAGAATAACATCAGGGAGTTGTCAGCATATAAACTATTTTTAGATCTGGATAAAGGAAGAAAGCAGGTACAAAAGTCCCTTGCTAAAGCAGAATTTAGAGCAAGGTCCCAACTCTCTTCTATCTTATAAAGGCTGAGAN -7-7---A------7--77--))))7--F-A)7F<JJJF-J<FFF-A7-A-7<FF-JF-JFA-AJJJJAJJJJ<FAFJ<JJFJJF7FJJF7JFFJJ-<-FJJAJA-JJJ7JF-A7AFFFFFFAJJJFJJJJAJAJJJJJJJJJJJFFAA# NM:i:1 MD:Z:96G0 AS:i:96 XS:i:21
SRR13849430.1 181 chr12 78795720 0 * = 78795720 0 TTATATGCTGACGTCGTCCATTGTTGATTCTTTGTGGTCTGTGTTCCCTCTGAAGCTCTCTGTACTTGTATATACAATTAGGCACCAACAATTCAACATGGGTAGTTATTATACTGCTAAAGCGACCCCAGAGCGAACCGGGCTACCTTN ---7---7-7)))-7---7-----7-7<---7-)<)<-----)-)<77--7---)7-<7-7----A-F<<-AF7-A777--77-7-----7--<--7-F777-A7-7---<7--7<7-<--77-77--77--7-7A---7-7---FF<A# MC:Z:53S97M AS:i:0 XS:i:0
SRR13849430.2 73 chr8 43879758 60 75M75S = 43879758 0 NAAAAGGTTTATCTCTGTGAGATGAAGCCACATAAAACCAAGCAGTTTCACAGAAACTTGTTTGTAGATTTTAAATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTGCGGCCCTTCCGTATTCCCGCATA #AAFAJJFJJJJFAAFJA<JJJFJJFJJJJJFFJ-FF<FFJJJJJJ7AFFF<FJJ-<-F7<7<A-AF7AJAJ-A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA--))--))-<))-------)))7- NM:i:2 MD:Z:0A58C15 AS:i:69 XS:i:29 SA:Z:chr8,43877720,+,73S54M23S,44,1;
SRR13849430.2 2121 chr8 43877720 44 73H54M23H = 43877720 0 AATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTG A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA- NM:i:1 MD:Z:5A48 AS:i:49 XS:i:35 SA:Z:chr8,43879758,+,75M75S,60,2;
Contacts extraction¶
Challenges of contacts extraction solved by pairtools
:
- multiple walks in each Hi-C read
- alignments spaced by short genomic distance may be reported as single alignment with a gap by bwa
- PCR and optical duplicates
pairtools parse¶
The next step is to parse .sam/.bam into pairs. pairtools parse
automatically detect the file format.
Full parse documentation: https://github.com/open2c/pairtools/blob/master/doc/parsing.rst
[ ]:
%%bash
pairtools parse -o test.pairs.gz -c ~/.local/share/genomes/hg38/hg38.fa.sizes \
--drop-sam --drop-seq --output-stats test.stats \
--assembly hg38 --no-flip \
--add-columns mapq \
--walks-policy mask \
test.bam
Parse will result in .pairs file. .pairs is a simple tabular format for storing DNA contacts detected in a Hi-C experiment. The detailed .pairs specification is defined by the 4DN Consortium.
[4]:
%%bash
# Header
gzip -dc test.pairs.gz | head -n 5
## pairs format v1.0.0
#shape: whole matrix
#genome_assembly: hg38
#chromsize: chr1 248956422
#chromsize: chr10 133797422
[5]:
%%bash
# Last lines of the header contain the list of columns:
gzip -dc test.pairs.gz | grep "#" | tail -n 5
#samheader: @SQ SN:chrX LN:156040895
#samheader: @SQ SN:chrY LN:57227415
#samheader: @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:./bwa mem -t 5 -SP /home/agalicina/.local/share/genomes/hg38/index/bwa/hg38.fa SRR13849430_1.fastq.gz SRR13849430_2.fastq.gz
#samheader: @PG ID:pairtools_parse PN:pairtools_parse CL:./pairtools parse -o test.pairs.gz -c /home/agalicina/.local/share/genomes/hg38/hg38.fa.sizes --drop-sam --drop-seq --output-stats test.stats --assembly hg38 --no-flip --add-columns mapq --walks-policy mask test.bam PP:bwa VN:1.0.0-dev1
#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type mapq1 mapq2
[6]:
%%bash
# Body of the .pairs file
gzip -dc test.pairs.gz | grep -v "#" | head -n 5
SRR13849430.1 chr12 78795816 ! 0 - - UN 60 0
SRR13849430.2 ! 0 ! 0 - - WW 60 0
SRR13849430.3 chr2 72005391 ! 0 + - UN 60 0
SRR13849430.4 chr2 20530788 ! 0 + - UN 60 0
SRR13849430.5 ! 0 ! 0 - - WW 0 0
pairtools sort¶
After parsing each pair is listed in the orderthat it appeared in the reads file. Sorting makes it more organised, and also helps to perform deduplication which cannot work on non-sorted files.
[ ]:
%%bash
pairtools sort --nproc 5 -o test.sorted.pairs.gz test.pairs.gz
pairtools dedup¶
[ ]:
%%bash
pairtools dedup \
--max-mismatch 3 \
--mark-dups \
--output \
>( pairtools split \
--output-pairs test.nodups.pairs.gz \
--output-sam test.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs test.unmapped.pairs.gz \
--output-sam test.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs test.dups.pairs.gz \
--output-sam test.dups.bam \
) \
--output-stats test.dedup.stats \
test.pairs.gz
[7]:
%%bash
# Unique pairs:
gzip -dc test.nodups.pairs.gz | grep -v "#" | head -n 10
SRR13849430.513 chr20 23502312 chr20 23063544 + + RU 60 60
SRR13849430.1442 chr7 57224960 chrX 82818236 + + UU 60 30
SRR13849430.2378 chr5 115925933 chr21 24124840 + + UU 60 50
SRR13849430.2547 chr1 52097837 chr12 1888807 - - UU 60 60
SRR13849430.3015 chr17 74750879 chr11 117356318 + - UR 60 60
SRR13849430.3027 chr15 34977762 chr15 31897447 - + UR 11 60
SRR13849430.3406 chr11 1171960 chr9 121265592 + - UU 60 60
SRR13849430.3988 chr16 86824176 chr13 104521019 - + UU 60 17
SRR13849430.4030 chr17 73189645 chr4 49092470 - + UU 60 31
SRR13849430.4316 chr8 124329308 chr8 124336541 - - UU 60 60
[8]:
%%bash
# Only duplicated pairs:
gzip -dc test.dups.pairs.gz | grep -v "#" | head -n 10
SRR13849430.60371 chr2 44507613 chr7 116276932 - + DD 60 57
SRR13849430.67567 chr5 62425895 chr5 62425612 - + DD 60 60
SRR13849430.97623 chr3 162233323 chr3 162154449 - + DD 60 52
SRR13849430.108366 chr8 48691403 chr8 48872239 - - DD 60 60
SRR13849430.138622 chr16 8435050 chr16 6032751 + - DD 60 60
SRR13849430.146482 chr14 86385083 chr2 119648648 + + DD 60 60
SRR13849430.148232 chrX 21885792 chrX 21887418 + - DD 60 60
SRR13849430.149771 chr16 6646543 chr16 6648097 - - DD 60 60
SRR13849430.156983 chr4 55704089 chr4 76039070 + + DD 60 13
SRR13849430.157962 chr6 47656758 chr6 47748395 + - DD 60 35
pairtools select¶
Sometimes you may need certain types of pairs based on their properties, such as mapq, pair type, distance or orientation. For all these manipulations, there is pairtools select
which requires a file and pythonic condition as an input:
[ ]:
%%bash
pairtools select "mapq1>0 and mapq2>0" test.nodups.pairs.gz -o test.nodups.UU.pairs.gz
pairtools stats¶
Describe the types fo distance properties of pairs:
[ ]:
%%bash
pairtools stats test.pairs.gz -o test.stats
MultiQC¶
[ ]:
%%bash
multiqc test.stats
[9]:
from IPython.display import IFrame
IFrame(src='./multiqc_report.html', width=1200, height=700)
[9]:
Load pairs to cooler¶
Finally, when you obtained a list of appropriate pairs, you may create coolers with it:
[ ]:
%%bash
cooler cload pairs \
-c1 2 -p1 3 -c2 4 -p2 5 \
--assembly hg38 \
~/.local/share/genomes/hg38/hg38.fa.sizes:1000000 \
test.nodups.UU.pairs.gz \
test.hg38.1000000.cool
[ ]:
%%bash
cooler zoomify \
--nproc 5 \
--out test.hg38.1000000.mcool \
--resolutions 1000000,2000000 \
--balance \
test.hg38.1000000.cool
Visualize cooler¶
Based on open2c vis example
[36]:
import cooler
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import cooltools.lib.plotting
from matplotlib.colors import LogNorm
import seaborn as sns
import bioframe
import numpy as np
[11]:
file = "test.hg38.1000000.mcool::/resolutions/1000000"
[12]:
clr = cooler.Cooler(file)
[19]:
# Define chromosome starts
chromstarts = []
for i in clr.chromnames:
chromstarts.append(clr.extent(i)[0])
[20]:
from matplotlib.ticker import EngFormatter
bp_formatter = EngFormatter('b')
def format_ticks(ax, x=True, y=True, rotate=True):
if y:
ax.yaxis.set_major_formatter(bp_formatter)
if x:
ax.xaxis.set_major_formatter(bp_formatter)
ax.xaxis.tick_bottom()
if rotate:
ax.tick_params(axis='x',rotation=45)
[40]:
# Define the bounds of the continuous fragment of whole-genome interaction map
chrom_start, chrom_end = clr.chromnames.index('chr3'), clr.chromnames.index('chr6')
start, end = chromstarts[chrom_start], chromstarts[chrom_end]
[43]:
vmax = 15
norm = LogNorm(vmin=1, vmax=vmax)
f, axs = plt.subplots(
figsize=(13, 10),
nrows=2,
ncols=1,
sharex=False, sharey=False)
ax = axs[0]
ax.set_title('Interaction maps (chr1)')
im = ax.matshow(clr.matrix(balance=False).fetch('chr1'), vmax=vmax, cmap='fall');
plt.colorbar(im, ax=ax ,fraction=0.046, pad=0.04, label='chr1');
ax = axs[1]
ax.set_title('Chromosomes 3-5')
im = ax.matshow(clr.matrix(balance=False)[start:end, start:end], norm=norm, cmap='fall');
plt.colorbar(im, ax=ax ,fraction=0.046, pad=0.04, label='Whole-genome');
ax.set_xticks(np.array(chromstarts[chrom_start:chrom_end])-start, clr.chromnames[chrom_start:chrom_end], rotation=90);
ax.set_yticks(np.array(chromstarts[chrom_start:chrom_end])-start, clr.chromnames[chrom_start:chrom_end], rotation=90);
format_ticks(axs[0], rotate=False)
plt.tight_layout()

[ ]:
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker
import matplotlib.gridspec
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn-poster')
import pandas as pd
import pairtools
import bioframe
Pairtools: restriction walkthrough¶
The common approach to analyse Hi-C data is based to analyse the contacts of the restriction fragments. It is used in hiclib, Juicer, HiC-Pro.
Throughout this notebook, we will work with one of Rao et al. 2014 datasets for IMR90 cells [1].
[1] Rao, S. S., Huntley, M. H., Durand, N. C., Stamenova, E. K., Bochkov, I. D., Robinson, J. T., Sanborn, A. L., Machol, I., Omer, A. D., Lander, E. S., & Aiden, E. L. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7), 1665–1680. https://doi.org/10.1016/j.cell.2014.11.021
Download the data from 4DN portal¶
To download the data from 4DN, you may need to register, get key and secret and write a spceialized curl command for your user:
[61]:
!curl -O -L --user RG6CSRMC:xlii3stnkphfygmu https://data.4dnucleome.org/files-processed/4DNFIW2BKSNF/@@download/4DNFIW2BKSNF.pairs.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 330 100 330 0 0 931 0 --:--:-- --:--:-- --:--:-- 932
100 3395M 100 3395M 0 0 29.7M 0 0:01:54 0:01:54 --:--:-- 33.1M 0:01:48 0:00:12 0:01:36 32.8M
[ ]:
%%bash
# Get total number of contacts to assess how many reads you can read in the future:
pairtools stats 4DNFIW2BKSNF.pairs.gz | head -n 1
# This will produce around 173 M pairs
[ ]:
%%bash
# Sample the fraction of pairs that will produce ~ 1 M of pairs:
pairtools sample 0.007 4DNFIW2BKSNF.pairs.gz -o 4DNFIW2BKSNF.pairs.sampled.gz
Annotate restriction fragments¶
[ ]:
%%bash
# Digest the genome into restriction fragments:
cooler digest ../tests_chromap/hg38/hg38.fa.sizes ../tests_chromap/hg38/hg38.fa MboI > hg38/hg38.MboI.restricted.bed
[ ]:
%%bash
# Annotate restriction fragments in the sampled file:
pairtools restrict -f hg38/hg38.MboI.restricted.bed 4DNFIW2BKSNF.pairs.sampled.gz -o 4DNFIW2BKSNF.pairs.sampled.restricted.gz
Read the pairs and analyse them as dataframe¶
[3]:
from pairtools.lib import headerops, fileio
[4]:
pairs_file = '4DNFIW2BKSNF.pairs.sampled.restricted.gz'
[5]:
pairs_stream = fileio.auto_open(pairs_file, 'r')
header, pairs_stream = headerops.get_header(pairs_stream)
columns = headerops.get_colnames(header)
[6]:
df = pd.read_table(pairs_stream, comment="#", header=None)
df.columns = columns
[ ]:
[7]:
df.loc[:, 'dist_rfrag1_left'] = df.pos1 - df.rfrag_start1
df.loc[:, 'dist_rfrag1_right'] = df.rfrag_end1 - df.pos1
df.loc[:, 'dist_rfrag2_left'] = df.pos2 - df.rfrag_start2
df.loc[:, 'dist_rfrag2_right'] = df.rfrag_end2 - df.pos2
Many of the 5’-ends of reads are mapped to the restriction sites:
[8]:
xmin = 0
xmax = 2000
step = 20
sns.distplot(df.query('strand1=="+"').dist_rfrag1_left, bins=np.arange(xmin, xmax, step), label='Distance from the 5\' read end to the nearest upstream rsite, + mapped reads')
sns.distplot(df.query('strand1=="+"').dist_rfrag1_right, bins=np.arange(xmin, xmax, step), label='Distance from the 5\' read end to the nearest downstream rsite, + mapped reads')
plt.xlim(xmin, xmax)
plt.legend()
plt.tight_layout()

[9]:
xmin = 0
xmax = 200
step = 1
sns.distplot(df.query('strand1=="+"').dist_rfrag1_left, bins=np.arange(xmin, xmax, step), label='Distance from the 5\' read end to the nearest upstream rsite, + mapped reads')
sns.distplot(df.query('strand1=="+"').dist_rfrag1_right, bins=np.arange(xmin, xmax, step), label='Distance from the 5\' read end to the nearest downstream rsite, + mapped reads')
plt.xlim(xmin, xmax)
plt.legend()
plt.tight_layout()

However, if we select only the pairs that map to the restriction sites, there is no significant skew in scaling:
[10]:
hg38_chromsizes = bioframe.fetch_chromsizes('hg38',
as_bed=True)
hg38_cens = bioframe.fetch_centromeres('hg38')
hg38_arms = bioframe.make_chromarms(hg38_chromsizes,
dict(hg38_cens.set_index('chrom').mid),
cols_chroms=('chrom', 'start', 'end') )
# To fix pandas bug in some versions:
hg38_arms['start'] = hg38_arms['start'].astype(int)
hg38_arms['end'] = hg38_arms['end'].astype(int)
[11]:
import pairtools.lib.scaling as scaling
[12]:
def plot(cis_scalings, n, xlim=(1e1,1e9), label='' ):
strand_gb = cis_scalings.groupby(['strand1', 'strand2'])
for strands in ['+-', '-+', '++', '--']:
sc_strand = strand_gb.get_group(tuple(strands))
sc_agg = (sc_strand
.groupby(['min_dist','max_dist'])
.agg({'n_pairs':'sum', 'n_bp2':'sum'})
.reset_index())
dist_bin_mids = np.sqrt(sc_agg.min_dist * sc_agg.max_dist)
pair_frequencies = sc_agg.n_pairs / sc_agg.n_bp2
pair_frequencies = pair_frequencies/cis_scalings.n_pairs.sum()
mask = pair_frequencies>0
label_long = f'{strands[0]}{strands[1]} {label}'
if np.sum(mask)>0:
plt.loglog(
dist_bin_mids[mask],
pair_frequencies[mask],
label=label_long,
lw=2
)
plt.gca().xaxis.set_major_locator(matplotlib.ticker.LogLocator(base=10.0,numticks=20))
plt.gca().yaxis.set_major_locator(matplotlib.ticker.LogLocator(base=10.0,numticks=20))
plt.gca().set_aspect(1.0)
plt.xlim(xlim)
plt.grid(lw=0.5,color='gray')
plt.legend(loc=(1.1,0.4))
plt.ylabel('contact frequency, \nHi-C molecule per bp pair normalized by total')
plt.xlabel('distance, bp')
plt.tight_layout()
[13]:
# Get the pairs where R1 is far enough from site of restriction, but not too far
df_subset = df.query("(strand1=='+' and dist_rfrag1_left>5 and dist_rfrag1_left<=250)")
n_distant = len(df_subset)
cis_scalings_distant, trans_levels_distant = scaling.compute_scaling(
df_subset,
regions=hg38_arms,
chromsizes=hg38_arms,
dist_range=(10, 1e9),
n_dist_bins=128,
chunksize=int(1e7),
)
plot(cis_scalings_distant, n_distant, label="pairs, 5' distant from rsite")
# Get the pairs where R1 is too far enough from site of restriction
df_subset = df.query("(strand1=='+' and dist_rfrag1_left>550)")
n_toodistant = len(df_subset)
cis_scalings_toodistant, trans_levels_toodistant = scaling.compute_scaling(
df_subset,
regions=hg38_arms,
chromsizes=hg38_arms,
dist_range=(10, 1e9),
n_dist_bins=128,
chunksize=int(1e7),
)
plot(cis_scalings_toodistant, n_toodistant, label="pairs, 5' too far from rsite")
# Get the pairs where R1 is very close to the site of restriction
df_subset = df.query("(strand1=='+' and dist_rfrag1_left<5)")
n_tooclose = len(df_subset)
cis_scalings_tooclose, trans_levels_tooclose = scaling.compute_scaling(
df_subset,
regions=hg38_arms,
chromsizes=hg38_arms,
dist_range=(10, 1e9),
n_dist_bins=128,
chunksize=int(1e7),
)
plot(cis_scalings_tooclose, n_tooclose, label="pairs, 5' close to rsite")
# Try another replicate of replicate, maybe the last one

How many pairs we take if not strictly filtering by dangling ends and self-circles?¶
[14]:
df.loc[:, "type_rfrag"] = "Regular pair"
mask_neighboring_rfrags = (np.abs(df.rfrag1-df.rfrag2)<=1)
mask_DE = (df.strand1=="+") & (df.strand2=="-") & mask_neighboring_rfrags
df.loc[mask_DE, "type_rfrag"] = "DanglingEnd"
mask_SS = (df.strand1=="-") & (df.strand2=="+") & mask_neighboring_rfrags
df.loc[mask_SS, "type_rfrag"] = "SelfCircle"
mask_Err = (df.strand1==df.strand2) & mask_neighboring_rfrags
df.loc[mask_Err, "type_rfrag"] = "Mirror"
[15]:
df.sort_values("type_rfrag").groupby("type_rfrag").count()['readID']
[15]:
type_rfrag
DanglingEnd 76902
Mirror 3214
Regular pair 1132002
SelfCircle 3036
Name: readID, dtype: int64
[16]:
# Full scaling
n = len(df)
cis_scalings, trans_levels = scaling.compute_scaling(
df,
regions=hg38_arms,
chromsizes=hg38_arms,
dist_range=(10, 1e9),
n_dist_bins=128,
chunksize=int(1e7),
)
plot(cis_scalings, n, label="pairs")
# The point where the scalings by distance become balanced:
plt.axvline(2e3, ls='--', c='gray', label='Balancing point')
plt.savefig("./oriented_scalings.pdf")

[17]:
df.loc[:, "type_bydist"] = "Regular pair"
mask_ondiagonal = (np.abs(df.pos2-df.pos1)<=2e3)
mask_DE = (df.strand1=="+") & (df.strand2=="-") & mask_ondiagonal
df.loc[mask_DE, "type_bydist"] = "DanglingEnd"
mask_SS = (df.strand1=="-") & (df.strand2=="+") & mask_ondiagonal
df.loc[mask_SS, "type_bydist"] = "SelfCircle"
mask_Err = (df.strand1==df.strand2) & mask_ondiagonal
df.loc[mask_Err, "type_bydist"] = "Mirror"
[18]:
df.sort_values("type_bydist").groupby("type_bydist").count()['readID']
[18]:
type_bydist
DanglingEnd 135381
Mirror 18383
Regular pair 1053213
SelfCircle 8177
Name: readID, dtype: int64
[19]:
df.sort_values(["type_rfrag", "type_bydist"])\
.groupby(["type_rfrag", "type_bydist"])\
.count()[['readID']]\
.reset_index()\
.pivot(columns="type_bydist", index="type_rfrag")\
.fillna(0).astype(int)
[19]:
readID | ||||
---|---|---|---|---|
type_bydist | DanglingEnd | Mirror | Regular pair | SelfCircle |
type_rfrag | ||||
DanglingEnd | 76898 | 0 | 4 | 0 |
Mirror | 0 | 3176 | 38 | 0 |
Regular pair | 58483 | 15207 | 1052994 | 5318 |
SelfCircle | 0 | 0 | 177 | 2859 |
False Positives are in 3rd row, False Negatives are in 3rd column. Filtering by distance is, thus, nearly as effective as filtering by restriction fragment, but removes additional pairs that can be potential undercut by restriction enzyme.
Removing all contacts closer than 2 Kb will remove Hi-C artifacts.
Pairtools phase walkthrough¶
Welcome to the pairtools phase walkthrough!
Haplotype-resolved Hi-C is a popular technique that helps you to resolve contacts of homologous chromosomes. It relies on a simple idea tha homologous chromosomes have variations (e.g., SNPs) that are inherited together as haplotypes. DNA reads in Hi-C will have the SNVs from one of two haplotypes, which can be used to distinguish the contacts on the same chromosome (cis-homologous) and contacts connecting two homologs (trans-homologous).
The experimental challenge of the haplotype-resolved Hi-C is to increase the number of SNPs that distinguish reads from different chromosomes. This can be dome by mating highly diverged.
- Erceg et al. 2019 create highly heterozygous embryos of Drosophila [1]
- Collombet et al. 2020 create highly polymorphic F1 hybrid embryos obtained by crossing female Mus musculus domesticus (C57Bl/6J) with male Mus musculus castaneus CAST/EiJ) to resolve structures of individual chromosomes in the zygote and embryos [2]
- Tan et al. 2018 uses available heterozygous positions to infer the 3D structures of single chromosomes by single-cell variant of the protocol Dip-C [3]
- Duan et al. use dikaryonic nuclei of fungi with 0.7% heterozygosity [4]
In pairtools
we implement an approach to resolving haplotypes from Erceg et al. The outline of haplotype-resolved parsing of pairs:
Create the reference genome: create the concatenated reference genomes from two haplotypes.
Usually the SNVs are known and can be obtained in .vsf format. We will incorporate the SNVs by bcftools into the reference and create updated fasta files with haplotype-corrected sequences. For each homologue we will add the suffixes that identify the type of homologue (
_hap1
or_hap2
).Map the Hi-C data to the concatenated reference and parse allowing multimappers (mapq 0).
We will also need the mapper to report two suboptimal alignments (aka the second and the third hit). When the Hi-C read is mapped to some location in the genome, it will have the suffix of the homologue reported as part of chromosome name. However, the true resolved pairs are not yet known at this step.
See sections:
pairtools phase: phase the pairs based on the reported suboptimal alignments.
By checking the scores of two suboptimal alignments, we will distinguish the true multi-mappers from unresolved pairs (i.e. cases when the read aligns to the location with no distinguishing SNV). Phasing procedure will remove the haplotype suffixes from chromosome names and add extra fields to the .pairs file with:
‘.’ (non-resolved)
‘0’ (first haplotype) or
‘1’ (second haplotype).
Phasing schema:
Post-procesing. Do sorting, dedup and stats, as usual.
See sections:
[1] Erceg, J., AlHaj Abed, J., Goloborodko, A., Lajoie, B. R., Fudenberg, G., Abdennur, N., Imakaev, M., McCole, R. B., Nguyen, S. C., Saylor, W., Joyce, E. F., Senaratne, T. N., Hannan, M. A., Nir, G., Dekker, J., Mirny, L. A., & Wu, C. T. (2019). The genome-wide multi-layered architecture of chromosome pairing in early Drosophila embryos. Nature communications, 10(1), 4486. https://doi.org/10.1038/s41467-019-12211-8
[2] Collombet, S., Ranisavljevic, N., Nagano, T., Varnai, C., Shisode, T., Leung, W., Piolot, T., Galupa, R., Borensztein, M., Servant, N., Fraser, P., Ancelin, K., & Heard, E. (2020). Parental-to-embryo switch of chromosome organization in early embryogenesis. Nature, 580(7801), 142–146. https://doi.org/10.1038/s41586-020-2125-z
[3] Tan, L., Xing, D., Chang, C. H., Li, H., & Xie, X. S. (2018). Three-dimensional genome structures of single diploid human cells. Science (New York, N.Y.), 361(6405), 924–928. https://doi.org/10.1126/science.aat5641
[4] Duan, H., Jones, A. W., Hewitt, T., Mackenzie, A., Hu, Y., Sharp, A., Lewis, D., Mago, R., Upadhyaya, N. M., Rathjen, J. P., Stone, E. A., Schwessinger, B., Figueroa, M., Dodds, P. N., Periyannan, S., & Sperschneider, J. (2022). Physical separation of haplotypes in dikaryons allows benchmarking of phasing accuracy in Nanopore and HiFi assemblies with Hi-C data. Genome biology, 23(1), 84. https://doi.org/10.1186/s13059-022-02658-2
We will test on a sample from Collombet et al. 2019 [2], example of mouse single-cell Hi-C on embryos obtained from highly heterozygous parents. We will take some cell from the dataset, GSM3691125_2CSE_70. Note that becuase the procedure is not strictly Hi-C, the properties of this dataset may differ from what you may obtain on bulk data.
Create the reference genome¶
For phasing, map the data to the concatenated genome with two haplotypes. Obtaining such genome is not a simple task. You will need a reference genome, and one or two lists of mutations to instroduce to the reference.
Download reference genome¶
[ ]:
! wget ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
Download .vcf file with variants¶
[ ]:
! wget ftp://ftp-mouse.sanger.ac.uk/current_snps/strain_specific_vcfs/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
Index the variants¶
[ ]:
! bcftools index CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
Introduce the variants into the genome¶
Note that you may select the variants that are only SNPs but not SNVs (deletions/insertions) by using --include
parameter of bcftools consensus
(e.g. --include '(STRLEN(REF)=1) & (STRLEN(ALT[0])=1)'
). This will make sure that the genomic coorditates correspond between the haplotypes. Correspondence of coordinates is not a requirement, but might be important for downstream analysis.
[ ]:
%%bash
bcftools consensus --fasta-ref GRCm38_68.fa.gz \
--haplotype 1 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz |sed '/^>/ s/$/_hap1/' | bgzip -c > GRCm38_EiJ_snpsonly_hap1.fa.gz
bcftools consensus --fasta-ref GRCm38_68.fa.gz \
--haplotype 2 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz |sed '/^>/ s/$/_hap2/' | bgzip -c > GRCm38_EiJ_snpsonly_hap2.fa.gz
Create the index of concatenated haplotypes¶
Concatenate the genomes and index them together. Note that bwa-mem2 produces very similar results to bwa mem, while being x2-3 times faster. We highly recommend to use it instead of bwa!
[ ]:
%%bash
cat GRCm38_EiJ_snpsonly_hap1.fa.gz GRCm38_EiJ_snpsonly_hap2.fa.gz > GRCm38_EiJ_snpsonly.fa.gz
bwa index GRCm38_EiJ_snpsonly.fa.gz
Generate chromosome sizes file:
[ ]:
%%bash
faidx GRCm38_EiJ_snpsonly.fa.gz -i chromsizes > GRCm38_EiJ_snpsonly.chromsizes
Download data¶
Uncomment the --minSpotId
and --maxSpotId
if you want to run the small test instead of full run.
[ ]:
! fastq-dump SRR8811373 --gzip --split-spot --split-3 # --minSpotId 0 --maxSpotId 1000000
[ ]:
! ls SRR8811373*.fastq.gz
Map data with bwa mem to diploid genome¶
Note that you may use bwa mem2, which is x2 times faster. It proved to produce results very similar to bwa mem.
There are two modes to work with phasing.
Github mode with XB bwa tag. This is the most precise algorithm that operates based on alignment scores of optimal alignment (best hit), and two suboptimal ones.
Download and install bwa from GitHub. Map with:
./bwa/bwa mem -SPu -t 5 mm10_EiJ_snpsonly.fa.gz test.1.fastq.gz test.2.fastq.gz | samtools view -@ 8 -b > mapped.XB.bam
Regular mode with XA bwa tag.
This is simplified version that operates on number of mismatches for the suboptimal alignments.
bwa mem -SP -t 5 mm10_EiJ_snpsonly.fa.gz est.1.fastq.gz test.2.fastq.gz | samtools view -@ 8 -b > mapped.XA.bam
We will try the second option for the simplicity:
[ ]:
%%bash
bwa mem -SP -t 5 GRCm38_EiJ_snpsonly.fa.gz SRR8811373_1.fastq.gz SRR8811373_2.fastq.gz \
| samtools view -@ 8 -b > mapped.XA.bam
pairtools parse¶
For phasing, we need additional tags and no filtering by mapq.
--min-mapq
is 1 by default, which removes all multiply mapped sequences. However, we need this information for phasing to distinguish true multiply mapped pairs from pairs mapped to both haplotypes:
[ ]:
%%bash
pairtools parse --add-columns XA,NM,AS,XS --min-mapq 0 --drop-sam --walks-policy all \
-c GRCm38_EiJ_snpsonly.chromsizes mapped.XA.bam -o unphased.XA.pairs.gz
pairtools phase¶
Phasing will remove the tags “_1” and “_2” from chromosome names and add a separate field for the phase:
[ ]:
%%bash
pairtools phase --phase-suffixes _hap1 _hap2 --tag-mode XA --clean-output unphased.XA.pairs.gz -o phased.XA.pairs.gz
pairtools dedup¶
Sort prior to dedup:
[ ]:
%%bash
pairtools sort phased.XA.pairs.gz --nproc 10 -o phased.sorted.XA.pairs.gz
Deduplication now should take additional columns with phases into account:
[ ]:
%%bash
pairtools dedup --mark-dups --extra-col-pair phase1 phase2 \
--output-dups - --output-unmapped - --output-stats phased.XA.dedup.stats \
-o phased.sorted.XA.nodup.pairs.gz phased.sorted.XA.pairs.gz
Dedup might generate warning that phase columns now contain mixed data types (‘.’ alongside with 0 and 1). This warning is inherited from reading by reading the pairs file by pandas.
Stats¶
First, filter different types of reads:
[ ]:
%%bash
pairtools select '(phase1=="0") and (phase2=="0")' phased.sorted.XA.nodup.pairs.gz -o phased.XA.phase0.pairs.gz
pairtools select '(phase1=="1") and (phase2=="1")' phased.sorted.XA.nodup.pairs.gz -o phased.XA.phase1.pairs.gz
pairtools select '(phase1==".") or (phase2==".")' phased.sorted.XA.nodup.pairs.gz -o phased.XA.unphased.pairs.gz
pairtools select '(phase1!=phase2) and (phase1!=".") and (phase2!=".")' phased.sorted.XA.nodup.pairs.gz \
-o phased.XA.trans-phase.pairs.gz
Calculate stats for these different types:
[ ]:
%%bash
pairtools stats phased.XA.phase0.pairs.gz -o phased.XA.phase0.stats
pairtools stats phased.XA.phase1.pairs.gz -o phased.XA.phase1.stats
pairtools stats phased.XA.unphased.pairs.gz -o phased.XA.unphased.stats
pairtools stats phased.XA.trans-phase.pairs.gz -o phased.XA.trans-phase.stats
Visualize with multiQC:
[ ]:
%%bash
multiqc phased.XA.*phase*.stats -o multiqc_report_phasing
[1]:
from IPython.display import IFrame
IFrame(src='./multiqc_report_phasing/multiqc_report.html', width=1200, height=700)
[1]:
[ ]:
Pairtools benchmarking¶
Welcome to pairtools benchmarking. These are the instructions on how to test performance of different software for mapping Hi-C and Hi-C-like methods. Mapping usually results in the file with mapped pairs, which is then converted into binned matrix format. Pairs format is the “rawest” interpretable type of data after reads.
Reviewing the literature suggests that there are at least 6 methods to map Hi-C and Hi-C-like data. These include:
pairtools is a lightweight Python CLI that extracts and manipulates Hi-C contacts post-alignment. Aslignment can be done by:
- bwa mem
- bwa-mem2, ahn optimized version of bwa mem, which x2-2.5 improves speed over bwa
chromap is a fast alignment tool for chromatin profiles, not specialized for Hi-C but parameterized for a broad range of sequencing data including Hi-C short reads.
Does not require separate step of mapping.
HiC-Pro is a pipeline for Hi-C and DNase-C mapping, “optimized and flexible”.
It calls mapping within. By default, creates the output cooler files with binned data, but the script can be tinkered in order to stop the processing at the step of pairs.
Juicer is a platform for analysis of Hi-C data, which is already adapted to a wide range of cluster types.
It calls mapping within. Has an option to stop the data processing at the step of pairs, without further construction of binned matrices.
HiCExplorer is a broad-range set of tools for processing, normalization, analysis and visualization Hi-C and Hi-C-like methods.
FAN-C is a set of CLI tools that runs the mapping (bowtie or bwa mem), extracts and manipulates Hi-C contacts. It also has the tools for data visualization and downstream analysis.
We benchmark these programs on one million of representative reads. These reads are taken from random replicate from Rao SSP et al., “Cohesin Loss Eliminates All Loop Domains.”, Cell, 2017 Oct 5;171(2):305-320.e24 Generally, it is useful to assess how much computational time you need per million of reads. As long as you have this assessment, you may multiply the size of your experiment by the whole library size (in mlns of reads), because we expect linear growth of computational complexity of reads mapping with library size.
The benchmarking consists of four general steps. If you want to reproduce it, you need to run steps 1 and 2 manually in order to create the working environment, and then use snakemake script to run the benchmarks. You may use the commands form the “3. Run” section to get an understanding how each indiviaul framework works and what parameters can be changed. Note that you need separate run of juicer with single value of –ncores, because it does not support parallel launches (because it writes to the default output). Finally, there is a visualization section with a display of all the results that we calcualted on our machines.
1. Install software¶
We will use separate conda environments to install different utilities. Each utility will have its own environment and peth to the binaries.
[ ]:
%%bash
mkdir ./soft
pairtools¶
pairtools v1.0.0¶
[ ]:
%%bash
conda create -y --prefix soft/pairtools1.0.0 python=3.9 pip
conda activate soft/pairtools1.0.0
pip install cython
pip install git+https://github.com/open2c/pairtools.git@pre0.4.0
conda install -y -c bioconda "bwa>=0.7.17"
bwa-mem2¶
[ ]:
%%bash
conda activate soft/pairtools1.0.0
# bwa-mem2: compile from source (not recommended for general users)
# Get the source
git clone --recursive https://github.com/bwa-mem2/bwa-mem2 soft/bwa-mem2
cd soft/bwa-mem2
# Compile
make
# Exit compilation folder
cd ../../
chromap¶
[ ]:
%%bash
conda create -y --prefix soft/chromap
conda activate soft/chromap
conda install -y -c bioconda -c conda-forge chromap
HiC-Pro¶
HiC-Pro is a popular software for Hi-C mapping, its now part of nf-core Hi-C pipeline, supports both fragment-based analysis of Hi-C and fragement-free analysis of DNase-based Hi-C.
[ ]:
%%bash
git clone https://github.com/nservant/HiC-Pro.git soft/HiC-Pro_env/HiC-Pro
conda env create -f soft/HiC-Pro/environment.yml -p soft/HiC-Pro_env
### Working environment will be soft/HiC-Pro_env
conda activate soft/HiC-Pro_env
# Install dependencies
conda install -y -c bioconda bowtie2 samtools pysam numpy scipy bx-python
conda install -y -c r r r-rcolorbrewer r-ggplot2
# Copy prepared config:
cp configs/config-hicpro_install.txt soft/HiC-Pro_env/HiC-Pro/config-install.txt
# Configure and install:
cd soft/HiC-Pro_env/HiC-Pro
make configure
make install
cd ../../../
# Retain only data processing steps with no creating of maps:
sed -i "s/all : init mapping proc_hic merge_persample hic_qc build_raw_maps ice_norm/all : init mapping proc_hic merge_persample #hic_qc build_raw_maps ice_norm/" soft/HiC-Pro_env/HiC-Pro/scripts/Makefile
[ ]:
%%bash
# Note that the configs should be adjusted for your system:
cp configs/config-hicpro_install.txt soft/HiC-Pro_env/HiC-Pro/config-install.txt
cp configs/config-hicpro.txt soft/HiC-Pro_env/HiC-Pro/config-hicpro.txt
FAN-C¶
[ ]:
%%bash
conda create -y --prefix soft/fanc python=3.9 pip hdf5
conda activate soft/fanc
pip install fanc
conda install -y -c bioconda samtools
Juicer¶
[ ]:
%%bash
conda create -y --prefix soft/juicer
conda activate soft/juicer
conda install -y -c bioconda bwa java-jdk
conda install -y -c conda-forge coreutils
# Download the recommended stable version:
wget https://github.com/aidenlab/juicer/archive/refs/tags/1.6.zip
unzip 1.6.zip
rm 1.6.zip
mv juicer-1.6 soft/juicer-1.6
# Download compile jar files of the stable version:
wget http://hicfiles.tc4ga.com.s3.amazonaws.com/public/juicer/juicer_tools.1.6.2_jcuda.0.7.5.jar
mv juicer_tools.1.6.2_jcuda.0.7.5.jar soft/juicer-1.6/CPU/scripts/common/juicer_tools.jar
# Copy the scripts to some accessible location:
mkdir -p soft/juicer-1.6/CPU/scripts/
cp -r soft/juicer-1.6/CPU/[^s]* soft/juicer-1.6/CPU/scripts/
HiCExplorer¶
[ ]:
%%bash
conda create -y --prefix soft/hicexplorer python=3.9
conda activate soft/hicexplorer
conda install -y -c bioconda hicexplorer bwa
2. Download data and genome¶
[ ]:
%%bash
mkdir data
Download raw data¶
Test data from Rao et al. 2017, 1 mln pairs:
[ ]:
%%bash
fastq-dump -O data --gzip --split-files SRR6107789 --minSpotId 0 --maxSpotId 1000000
[ ]:
%%bash
# Put the data in accessible folder for juicer:
mkdir -p data/4juicer/fastq/
mkdir -p data/4juicer/splits/
cp data/SRR6107789_1.fastq.gz data/4juicer/fastq/SRR6107789_R1.fastq.gz
cp data/SRR6107789_2.fastq.gz data/4juicer/fastq/SRR6107789_R2.fastq.gz
cp data/4juicer/fastq/* data/4juicer/splits/
[ ]:
%%bash
# Put the data in accessible folder for HiC-Pro:
mkdir -p soft/HiC-Pro_env/HiC-Pro/rawdata/sample1
cp data/S*fastq.gz soft/HiC-Pro_env/HiC-Pro/rawdata/sample1/
Install genome¶
Genomepy installation¶
will install fasta, bwa and bowtie2 indexes:
[ ]:
%%bash
# Activate bwa plugin for genomepy:
! genomepy plugin enable bwa bowtie2
[ ]:
%%bash
# Install hg38 genome by genomepy:
! genomepy install hg38 -g data/
[ ]:
%%bash
# Restrict the genome:
! cooler digest data/hg38/hg38.fa.sizes data/hg38/hg38.fa DpnII --rel-ids 1 -o data/hg38/hg38.DpnII.bed
Build genome index: bwa-mem2¶
[ ]:
%%bash
mkdir data/hg38/index/bwa-mem2/
soft/bwa-mem2/bwa-mem2 index -p data/hg38/index/bwa-mem2/hg38 data/hg38/hg38.fa
Build genome index: chromap¶
[ ]:
%%bash
mkdir data/hg38/index/chromap
chromap -i -r data/hg38/hg38.fa -o data/hg38/index/chromap/hg38
3. Run¶
The banchmarking is usually cumbersome, but it can be simplified by snakemake. We provide a Snakemake pipeline that will allow you to benchmark different approaches.
The output of snakemake will consist of resulting Hi-C pairs/maps in output
folder and benchmarking files in benchmarks
folder. The file names have the information on parameters in their names:
[ ]:
%%bash
# Running
snakemake --cores 10
[ ]:
%%bash
# Cleanup
rm output/*; rm benchmarks/*
Manual run¶
You may also run them to test individual steps of the pipeline.
pairtools¶
[ ]:
%%bash
soft/bwa-mem2/bwa-mem2 mem -t 5 -SP data/hg38/index/bwa-mem2/hg38 data/SRR6107789_1.fastq.gz data/SRR6107789_2.fastq.gz | \
soft/pairtools1.0.0/bin/pairtools parse --nproc-in 5 --nproc-out 5 --drop-sam --drop-seq -c data/hg38/hg38.fa.sizes | \
soft/pairtools1.0.0/bin/pairtools sort --nproc 5 | \
soft/pairtools1.0.0/bin/pairtools dedup -p 5 --backend cython \
-o output/result.pairtools.pairs
chromap¶
[ ]:
%%bash
soft/chromap/bin/chromap --preset hic --low-mem \
-t 5 -x data/hg38/index/chromap/hg38 -r data/hg38/hg38.fa \
-1 data/SRR6107789_1.fastq.gz -2 data/SRR6107789_2.fastq.gz -o output/result.chromap.pairs
HiC-Pro¶
[ ]:
%%bash
cd soft/HiC-Pro_env/HiC-Pro
bin/HiC-Pro -i rawdata/ -o output -c config-hicpro.txt
cd ../../../
FAN-C¶
Based on CLI tutorial:
[ ]:
%%bash
fanc map -t 5 data/SRR6107789_1.fastq.gz data/hg38/index/bwa/hg38.fa output/fanc-output_1.bam
fanc map -t 5 data/SRR6107789_2.fastq.gz data/hg38/index/bwa/hg38.fa output/fanc-output_2.bam
samtools sort -@ 5 -n output/fanc-output_1.bam -o output/fanc-output_1.sorted.bam
samtools sort -@ 5 -n output/fanc-output_2.bam -o output/fanc-output_2.sorted.bam
fanc pairs output/fanc-output_1.sorted.bam output/fanc-output_2.sorted.bam output/fanc-output.pairs -g data/hg38/hg38.DpnII.bed
Juicer¶
[ ]:
%%bash
soft/juicer-1.6/CPU/juicer.sh -g hg38 -d data/4juicer/ -s DpnII -S early -p data/hg38/hg38.fa.sizes -y data/hg38/hg38.DpnII.bed -z data/hg38/index/bwa/hg38.fa -t 5 -D soft/juicer-1.6/CPU
HiCExplorer¶
Based on the example: https://hicexplorer.readthedocs.io/en/latest/content/example_usage.html
Note that it does not procude the pairs, but binned coolers.
[ ]:
%%bash
hicBuildMatrix --samFiles \
<(bwa mem -t 4 -A1 -B4 -E50 -L0 data/hg38/index/bwa/hg38.fa data/SRR6107789_1.fastq.gz | samtools view -Shb -) \
<(bwa mem -t 4 -A1 -B4 -E50 -L0 data/hg38/index/bwa/hg38.fa data/SRR6107789_2.fastq.gz | samtools view -Shb -) \
--restrictionSequence GATC \
--danglingSequence GATC \
--restrictionCutFile data/hg38/hg38.DpnII.bed \
--threads 4 \
--inputBufferSize 100000 \
--QCfolder hicexplorer_tmp \
-o hicexplorer_output.cool
4. Visualize benchmarks¶
[1]:
# Check the CPU properties:
import psutil
print(f"{psutil.cpu_count()} CPUs at {psutil.cpu_freq().current:.0f} GHz")
36 CPUs at 2715 GHz
[2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['font.family'] = "sans-serif"
figsize_A4 = np.array([11.69, 8.27])
plt.rcParams["figure.figsize"] = figsize_A4.T
plt.rcParams['figure.facecolor']='white'
plt.rcParams['font.size']=16
import glob
[9]:
## If you start from .csv. file:
timings = pd.read_csv('benchmarking_1mln.csv', index_col=0)
[4]:
# If you start from your own benchmarks:
files = glob.glob("benchmarks/*")
print(len(files))
24
[4]:
def get_params(filename):
split = filename.split('.')
util= split[1]
ncores = int(split[2])
return util, ncores
timings = []
for f in files:
t = pd.read_table(f)
t[['util', 'ncores']] = get_params(f)
timings.append(t)
timings = pd.concat(timings)
[10]:
timings.head()
[10]:
s | h:m:s | max_rss | max_vms | max_uss | max_pss | io_in | io_out | mean_load | cpu_time | util | ncores | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 171.2557 | 0:02:51 | 18999.14 | 20579.61 | 18990.64 | 18991.51 | 14942.02 | 61.25 | 81.02 | 140.18 | chromap | 1 |
1 | 161.8126 | 0:02:41 | 19010.55 | 20589.94 | 19002.19 | 19003.05 | 29831.66 | 88.35 | 72.64 | 122.87 | chromap | 1 |
2 | 147.2525 | 0:02:27 | 19251.87 | 20844.66 | 19244.98 | 19245.37 | 41893.51 | 176.68 | 89.18 | 139.78 | chromap | 1 |
3 | 160.3510 | 0:02:40 | 19010.16 | 20589.94 | 19004.14 | 19004.56 | 56673.84 | 265.02 | 74.24 | 132.19 | chromap | 1 |
4 | 152.5034 | 0:02:32 | 19141.59 | 20730.08 | 19134.09 | 19134.47 | 71454.04 | 353.35 | 83.52 | 144.57 | chromap | 1 |
[17]:
df = timings.sort_values(['ncores', 'util'])
df.loc[:, "max_rss_gb"] = df.loc[:, "max_rss"]/1024
df.loc[:, "min"] = df.loc[:, "s"]/60
[101]:
labels = ['chromap', 'pairtools_bwamem2', 'pairtools', 'juicer', 'hicpro', 'hicexplorer', 'fanc_bwa', 'fanc_bowtie2']
labels_mod = ['Chromap', 'bwa-mem2 + pairtools', 'bwa mem + pairtools', 'Juicer', 'HiC-Pro', 'HiCExplorer*', 'bwa mem + FANC', 'bowtie2 + FANC']
[102]:
df.query(f'ncores==2 and util=="chromap"')
[102]:
s | h:m:s | max_rss | max_vms | max_uss | max_pss | io_in | io_out | mean_load | cpu_time | util | ncores | max_rss_gb | min | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 94.1938 | 0:01:34 | 19068.07 | 20678.57 | 19047.60 | 19052.58 | 14768.99 | 0.02 | 119.14 | 113.62 | chromap | 2 | 18.621162 | 1.569897 |
1 | 77.7396 | 0:01:17 | 19045.91 | 20665.29 | 19025.18 | 19029.89 | 14768.99 | 88.35 | 112.30 | 92.39 | chromap | 2 | 18.599521 | 1.295660 |
2 | 77.1351 | 0:01:17 | 19045.99 | 20665.54 | 19025.28 | 19029.69 | 14768.99 | 176.68 | 113.29 | 98.83 | chromap | 2 | 18.599600 | 1.285585 |
3 | 77.6121 | 0:01:17 | 19046.05 | 20665.54 | 19025.38 | 19029.77 | 14769.00 | 265.02 | 113.53 | 105.77 | chromap | 2 | 18.599658 | 1.293535 |
4 | 77.5393 | 0:01:17 | 19045.89 | 20665.54 | 19025.11 | 19029.47 | 14769.00 | 353.35 | 112.50 | 111.44 | chromap | 2 | 18.599502 | 1.292322 |
[104]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)
cmap = ['#3E9ADE', '#EF242B', '#9FC741']
ncores_order = [4, 2, 1]
style_dict = dict(
orient='h',
palette=cmap,
#edgecolor="k",
linewidth=0.0,
errwidth=1.0,
errcolor='k',
capsize=0.07)
ax = axes[0]
# Plot barplot:
splot = sns.barplot(
x="min",
y="util",
data=df.sort_values('util'),
order=labels,
hue='ncores',
hue_order=ncores_order,
ax=ax,
**style_dict
)
# Add text, slowdown over chromap
for icore, ncores in enumerate(ncores_order):
for ilabels, util in enumerate(labels):
if util=="chromap":
continue
df_reference = df.query(f'ncores=={ncores} and util=="chromap"')
mean_reference = np.mean(df_reference['min'].values)
df_target = df.query(f'ncores=={ncores} and util=="{util}"')
mean_target = np.mean(df_target['min'].values)
slowdown = mean_target / mean_reference
w = splot.patches[0].get_height()
splot.text( s=f"x{slowdown:.1f}",
x=mean_target+4, y=ilabels + (icore-1)*w,
ha = 'left', va = 'center', fontsize=11, weight='bold')
ax.set_ylabel('')
ax.set_xlabel('Time (min)')
ax.set_yticklabels(labels_mod)
ax.set_axisbelow(True)
ax.grid(which='both', axis='x', color='k')
#ax.set_xscale('log')
ax.set_xlim([0, 95])
ax.set_xticks(np.arange(0, 96, 5), minor=True)
ax.grid(which='minor', axis='x', alpha=0.2, color='k')
ax.get_legend().remove()
ax = axes[1]
sns.barplot(
x="max_rss_gb",
y="util",
data=df.sort_values('util'),
order=labels,
hue='ncores',
hue_order=ncores_order,
ax=ax,
**style_dict)
ax.set_ylabel('')
ax.set_xlabel('Maximum Resident Set Size (GB)')
ax.set_yticklabels(labels_mod)
ax.set_axisbelow(True)
ax.grid(which='both', axis='x', color='k')
ax.set_xticks(np.arange(0, 26, 1), minor=True)
ax.grid(which='minor', axis='x', alpha=0.2, color='k')
fig.suptitle('Benchmark of different Hi-C mapping tools for 1 mln reads (5 iterations)', y=0.99)
# (x, y, width, height)
bb = (fig.subplotpars.left, fig.subplotpars.top+0.002, fig.subplotpars.right-fig.subplotpars.left, 0.2)
ax.legend(bbox_to_anchor=bb, title="Number of cores", loc="lower right", ncol=3, borderaxespad=0., bbox_transform=fig.transFigure, frameon=False)
plt.savefig("benchmarking_1mln.pdf")

[10]:
# df.to_csv('benchmarking_1mln.csv')