Pairtools walkthrough

Welcome to the pairtools walkthrough.

Pairtools is a tool for extraction of pairwise contacts out of sequencing chromosomes conformation capture data, such as Hi-C, Micro-C or MC-3C. Pairtools is used for obtaining .cool files by distiller, and has many more applications (see single-cell walkthrough or phasing walkthrough).

Here, we will cover the basic steps from raw reads to .cool file with binned contacts.

Outline:

Download raw data

“Raw” data, or .fastq files are generated by sequencing facilities or can be taken from public databases, such as SRA. We will take a sample from Rao et at al. 2017, human datasets. To reduce computateion time, take 5 mln reads instead of full sample:

[ ]:
! fastq-dump SRR13849430 --gzip --split-spot --split-3 --minSpotId 0 --maxSpotId 5000000
[1]:
! ls SRR13849430*.fastq.gz
SRR13849430_1.fastq.gz  SRR13849430_2.fastq.gz
[2]:
%%bash
gzip -dc SRR13849430_1.fastq.gz | head -n 8
@SRR13849430.1 1 length=150
NTCTCAGCCTTTATAAGATAGAAGAGAGTTGGGACCTTGCTCTAAATTCTGCTTTAGCAAGGGACTTTTGTACCTGCTTTCTTCCTTTATCCAGATCTAAAAATAGTTTATATGCTGACAACTCCCTGATGTTATTCTTTGTAGTATTTG
+SRR13849430.1 1 length=150
#AAFFJJJJJJJJJJJAJAJJJJFJJJAFFFFFFA7A-FJ7JJJ-AJAJJF-<-JJFFJ7FJJF7FJJFJJ<JFAF<JJJJAJJJJA-AFJ-FJ-FF<7-A-7A-FFF<J-FJJJ<F7)A-F--7))))--77--7------A---7-7-
@SRR13849430.2 2 length=150
NAAAAGGTTTATCTCTGTGAGATGAAGCCACATAAAACCAAGCAGTTTCACAGAAACTTGTTTGTAGATTTTAAATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTGCGGCCCTTCCGTATTCCCGCATA
+SRR13849430.2 2 length=150
#AAFAJJFJJJJFAAFJA<JJJFJJFJJJJJFFJ-FF<FFJJJJJJ7AFFF<FJJ-<-F7<7<A-AF7AJAJ-A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA--))--))-<))-------)))7-

We next map the sequencing data to the reference genome.

Install reference genome

Although you may download and build bwa index for any genome manually, genompy can do that for you automatically:

[ ]:
# Activate bwa plugin for genomepy:
! genomepy plugin enable bwa
[ ]:
# Install hg38 genome by genomepy, it might take some time to compile the bwa index for hg38:
! genomepy install hg38
[2]:
# Check the location of the genome:
! ls ~/.local/share/genomes/hg38/
bwa-mem2               hg38_DpnII.bed  hg38.fa      hg38.fa.sizes  index
hg38.blacklist.bed.gz  hg38.DpnII.bed  hg38.fa.fai  hg38.gaps.bed  README.txt

Map data with bwa mem

Note that you may use bwa mem2, which is x2 times faster. It proved to produce results very similar to bwa mem.

[ ]:
# Map test data:
! bwa mem -t 5 -SP ~/.local/share/genomes/hg38/index/bwa/hg38.fa SRR13849430_1.fastq.gz SRR13849430_2.fastq.gz > test.bam

After mapping, you have .sam/.bam alignment file, which cannot be interpreted as pairs directly. You need to extract contacts from it:

[2]:
%%bash
samtools view test.bam | head -n 4
SRR13849430.1   121     chr12   78795720        60      53S97M  =       78795720        0       CAAATACTACAAAGAATAACATCAGGGAGTTGTCAGCATATAAACTATTTTTAGATCTGGATAAAGGAAGAAAGCAGGTACAAAAGTCCCTTGCTAAAGCAGAATTTAGAGCAAGGTCCCAACTCTCTTCTATCTTATAAAGGCTGAGAN  -7-7---A------7--77--))))7--F-A)7F<JJJF-J<FFF-A7-A-7<FF-JF-JFA-AJJJJAJJJJ<FAFJ<JJFJJF7FJJF7JFFJJ-<-FJJAJA-JJJ7JF-A7AFFFFFFAJJJFJJJJAJAJJJJJJJJJJJFFAA#        NM:i:1  MD:Z:96G0       MQ:i:0  AS:i:96 XS:i:21
SRR13849430.1   181     chr12   78795720        0       *       =       78795720        0       TTATATGCTGACGTCGTCCATTGTTGATTCTTTGTGGTCTGTGTTCCCTCTGAAGCTCTCTGTACTTGTATATACAATTAGGCACCAACAATTCAACATGGGTAGTTATTATACTGCTAAAGCGACCCCAGAGCGAACCGGGCTACCTTN  ---7---7-7)))-7---7-----7-7<---7-)<)<-----)-)<77--7---)7-<7-7----A-F<<-AF7-A777--77-7-----7--<--7-F777-A7-7---<7--7<7-<--77-77--77--7-7A---7-7---FF<A#      MC:Z:53S97M     MQ:i:60 AS:i:0  XS:i:0
SRR13849430.2   73      chr8    43879758        60      75M75S  =       43879758        0       NAAAAGGTTTATCTCTGTGAGATGAAGCCACATAAAACCAAGCAGTTTCACAGAAACTTGTTTGTAGATTTTAAATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTGCGGCCCTTCCGTATTCCCGCATA  #AAFAJJFJJJJFAAFJA<JJJFJJFJJJJJFFJ-FF<FFJJJJJJ7AFFF<FJJ-<-F7<7<A-AF7AJAJ-A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA--))--))-<))-------)))7-  NM:i:2  MD:Z:0A58C15    MQ:i:0  AS:i:69 XS:i:29 SA:Z:chr8,43877720,+,73S54M23S,44,1;
SRR13849430.2   2121    chr8    43877720        44      73H54M23H       =       43877720        0       AATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTG  A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA-       NM:i:1  MD:Z:5A48       MQ:i:0  AS:i:49 XS:i:35 SA:Z:chr8,43879758,+,75M75S,60,2;

Contacts extraction

Challenges of contacts extraction solved by pairtools:

  • multiple walks in each Hi-C read

  • alignments spaced by short genomic distance may be reported as single alignment with a gap by bwa

  • PCR and optical duplicates

pairtools parse

The next step is to parse .sam/.bam into pairs. pairtools parse automatically detect the file format.

Full parse documentation: https://github.com/open2c/pairtools/blob/master/doc/parsing.rst

[ ]:
%%bash
pairtools parse -o test.pairs.gz -c ~/.local/share/genomes/hg38/hg38.fa.sizes \
  --drop-sam --drop-seq --output-stats test.stats \
  --assembly hg38 --no-flip \
  --add-columns mapq \
  --walks-policy mask \
  test.bam

Parse will result in .pairs file. .pairs is a simple tabular format for storing DNA contacts detected in a Hi-C experiment. The detailed .pairs specification is defined by the 4DN Consortium.

[3]:
%%bash
# Header
gzip -dc test.pairs.gz | head -n 5
## pairs format v1.0.0
#shape: whole matrix
#genome_assembly: hg38
#chromsize: chr1 248956422
#chromsize: chr10 133797422
[4]:
%%bash
# Last lines of the header contain the list of columns:
gzip -dc test.pairs.gz | grep "#" | tail -n 5
#samheader: @SQ SN:chrY_KI270740v1_random       LN:37240
#samheader: @HD VN:1.5  SO:unsorted     GO:query
#samheader: @PG ID:bwa  PN:bwa  VN:0.7.18-r1243-dirty   CL:bwa mem -t 5 -SP /home/agalicina/.local/share/genomes/hg38/index/bwa/hg38.fa SRR13849430_1.fastq.gz SRR13849430_2.fastq.gz
#samheader: @PG ID:pairtools_parse      PN:pairtools_parse      CL:/home/agalicina/anaconda3/envs/test/bin/pairtools parse -o test.pairs.gz -c /home/agalicina/.local/share/genomes/hg38/hg38.fa.sizes --drop-sam --drop-seq --output-stats test.stats --assembly hg38 --no-flip --add-columns mapq --walks-policy mask test.bam        PP:bwa  VN:1.1.3
#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type mapq1 mapq2
[5]:
%%bash
# Body of the .pairs file
gzip -dc test.pairs.gz | grep -v "#" | head -n 5
SRR13849430.1   chr12   78795816        !       0       -       -       UN      60      0
SRR13849430.2   !       0       !       0       -       -       WW      60      0
SRR13849430.3   chr2    72005391        !       0       +       -       UN      60      0
SRR13849430.4   chr2    20530788        !       0       +       -       UN      60      0
SRR13849430.5   !       0       !       0       -       -       WW      0       0

pairtools sort

After parsing each pair is listed in the orderthat it appeared in the reads file. Sorting makes it more organised, and also helps to perform deduplication which cannot work on non-sorted files.

[ ]:
%%bash
pairtools sort --nproc 5 -o test.sorted.pairs.gz test.pairs.gz

pairtools dedup

[ ]:
%%bash
pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --output \
        >( pairtools split \
            --output-pairs test.nodups.pairs.gz \
            --output-sam test.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs test.unmapped.pairs.gz \
            --output-sam test.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs test.dups.pairs.gz \
            --output-sam test.dups.bam \
         ) \
    --output-stats test.dedup.stats \
    test.sorted.pairs.gz
[6]:
%%bash
# Unique pairs:
gzip -dc test.nodups.pairs.gz | grep -v "#" | head -n 10
SRR13849430.4544778     chr1    136880  chr1    134881  +       +       UR      28      3
SRR13849430.4112679     chr1    136992  chr1    136650  -       +       UU      17      12
SRR13849430.1113517     chr1    599102  chr1    599228  +       -       UU      22      22
SRR13849430.1262300     chr1    656693  chr1    886234  +       -       UU      5       60
SRR13849430.2778343     chr1    760774  chr1    808155  -       +       UU      6       21
SRR13849430.3231567     chr1    777267  chr1    35948382        +       -       UU      32      60
SRR13849430.4836373     chr1    778257  chr1    777890  -       +       UU      8       48
SRR13849430.1256119     chr1    785664  chr1    954301  +       +       UU      60      60
SRR13849430.2390636     chr1    788677  chr1    22439216        -       +       UU      8       60
SRR13849430.3622660     chr1    788734  chr1    822206  +       -       RU      60      26
[7]:
%%bash
# Only duplicated  pairs:
gzip -dc test.dups.pairs.gz | grep -v "#" | head -n 10
SRR13849430.2781007     chr1    760774  chr1    808155  -       +       DD      6       20
SRR13849430.1103881     chr1    855594  chr1    856474  +       -       DD      42      47
SRR13849430.3637378     chr1    860367  chr1    863893  -       -       DD      50      42
SRR13849430.3708228     chr1    860367  chr1    863893  -       -       DD      50      43
SRR13849430.1681017     chr1    904370  chr1    952493  -       -       DD      60      60
SRR13849430.4712708     chr1    912376  chr1    913168  -       -       DD      60      60
SRR13849430.169202      chr1    920729  chr1    929007  -       +       DD      60      60
SRR13849430.1735365     chr1    929632  chr1    925420  +       +       DD      60      60
SRR13849430.1743119     chr1    929632  chr1    925420  +       +       DD      60      60
SRR13849430.1755818     chr1    929632  chr1    925420  +       +       DD      60      60

pairtools select

Sometimes you may need certain types of pairs based on their properties, such as mapq, pair type, distance or orientation. For all these manipulations, there is pairtools select which requires a file and pythonic condition as an input:

[ ]:
%%bash
pairtools select "mapq1>0 and mapq2>0" test.nodups.pairs.gz -o test.nodups.UU.pairs.gz

pairtools stats

Describe the types fo distance properties of pairs:

[ ]:
%%bash
pairtools stats test.pairs.gz -o test.stats

MultiQC

To make a nice HTML with pairs stats, install Open2C version of multiQC from: https://github.com/open2c/MultiQC

[ ]:
%%bash
multiqc test.stats
[10]:
from IPython.display import IFrame

IFrame(src='./multiqc_report.html', width=1200, height=700)

Load pairs to cooler

Finally, when you obtained a list of appropriate pairs, you may create coolers with it:

[ ]:
%%bash
cooler cload pairs \
    -c1 2 -p1 3 -c2 4 -p2 5 \
    --assembly hg38 \
    ~/.local/share/genomes/hg38/hg38.fa.sizes:1000000 \
    test.nodups.UU.pairs.gz \
    test.hg38.1000000.cool
[ ]:
%%bash
cooler zoomify \
    --nproc 5 \
    --out test.hg38.1000000.mcool \
    --resolutions 1000000,2000000 \
    --balance \
    test.hg38.1000000.cool

Visualize cooler

Based on open2c vis example

[11]:
import cooler
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import cooltools.lib.plotting
from matplotlib.colors import LogNorm
import seaborn as sns
import bioframe
import numpy as np
[12]:
file = "test.hg38.1000000.mcool::/resolutions/1000000"
[13]:
clr = cooler.Cooler(file)
[14]:
# Define chromosome starts
chromstarts = []
for i in clr.chromnames:
        chromstarts.append(clr.extent(i)[0])
[15]:
from matplotlib.ticker import EngFormatter
bp_formatter = EngFormatter('b')

def format_ticks(ax, x=True, y=True, rotate=True):
    if y:
        ax.yaxis.set_major_formatter(bp_formatter)
    if x:
        ax.xaxis.set_major_formatter(bp_formatter)
        ax.xaxis.tick_bottom()
    if rotate:
        ax.tick_params(axis='x',rotation=45)
[20]:
# Define the bounds of the continuous fragment of whole-genome interaction map
chrom_start, chrom_end = clr.chromnames.index('chr10'), clr.chromnames.index('chr11')
start, end = chromstarts[chrom_start], chromstarts[chrom_end]
[21]:
vmax = 15
norm = LogNorm(vmin=1, vmax=vmax)

f, axs = plt.subplots(
    figsize=(13, 10),
    nrows=2,
    ncols=1,
    sharex=False, sharey=False)

ax = axs[0]
ax.set_title('Interaction maps (chr1)')
im = ax.matshow(clr.matrix(balance=False).fetch('chr1'), vmax=vmax, cmap='fall');
plt.colorbar(im, ax=ax ,fraction=0.046, pad=0.04, label='chr1');

ax = axs[1]
ax.set_title('Chromosomes 10-11')
im = ax.matshow(clr.matrix(balance=False)[start:end, start:end], norm=norm, cmap='fall');
plt.colorbar(im, ax=ax ,fraction=0.046, pad=0.04, label='Whole-genome');
ax.set_xticks(np.array(chromstarts[chrom_start:chrom_end])-start, clr.chromnames[chrom_start:chrom_end], rotation=90);
ax.set_yticks(np.array(chromstarts[chrom_start:chrom_end])-start, clr.chromnames[chrom_start:chrom_end], rotation=90);

format_ticks(axs[0], rotate=False)

plt.tight_layout()
../_images/examples_pairtools_walkthrough_43_0.png
[ ]: