Pairtools walkthrough

Welcome to the pairtools walkthrough.

Pairtools is a tool for extraction of pairwise contacts out of sequencing chromosomes conformation capture data, such as Hi-C, Micro-C or MC-3C. Pairtools is used for obtaining .cool files by distiller, and has many more applications (see single-cell walkthrough or phasing walkthrough).

Here, we will cover the basic steps from raw reads to .cool file with binned contacts.

Outline:

Download raw data

“Raw” data, or .fastq files are generated by sequencing facilities or can be taken from public databases, such as SRA. We will take a sample from Rao et at al. 2017, human datasets. To reduce computateion time, take 5 mln reads instead of full sample:

[ ]:
! fastq-dump SRR13849430 --gzip --split-spot --split-3 --minSpotId 0 --maxSpotId 5000000
[ ]:
! ls SRR13849430*.fastq.gz
[1]:
%%bash
gzip -dc SRR13849430_1.fastq.gz | head -n 8
@SRR13849430.1 1 length=150
NTCTCAGCCTTTATAAGATAGAAGAGAGTTGGGACCTTGCTCTAAATTCTGCTTTAGCAAGGGACTTTTGTACCTGCTTTCTTCCTTTATCCAGATCTAAAAATAGTTTATATGCTGACAACTCCCTGATGTTATTCTTTGTAGTATTTG
+SRR13849430.1 1 length=150
#AAFFJJJJJJJJJJJAJAJJJJFJJJAFFFFFFA7A-FJ7JJJ-AJAJJF-<-JJFFJ7FJJF7FJJFJJ<JFAF<JJJJAJJJJA-AFJ-FJ-FF<7-A-7A-FFF<J-FJJJ<F7)A-F--7))))--77--7------A---7-7-
@SRR13849430.2 2 length=150
NAAAAGGTTTATCTCTGTGAGATGAAGCCACATAAAACCAAGCAGTTTCACAGAAACTTGTTTGTAGATTTTAAATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTGCGGCCCTTCCGTATTCCCGCATA
+SRR13849430.2 2 length=150
#AAFAJJFJJJJFAAFJA<JJJFJJFJJJJJFFJ-FF<FFJJJJJJ7AFFF<FJJ-<-F7<7<A-AF7AJAJ-A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA--))--))-<))-------)))7-

We next map the sequencing data to the reference genome.

Install reference genome

Although you may download and build bwa index for any genome manually, genompy can do that for you automatically:

[ ]:
# Activate bwa plugin for genomepy:
! genomepy plugin enable bwa
[ ]:
# Install hg38 genome by genomepy:
! genomepy install hg38
[2]:
# Check the location of the genome:
! ls ~/.local/share/genomes/hg38/
bwa-mem2               hg38_DpnII.bed  hg38.fa      hg38.fa.sizes  index
hg38.blacklist.bed.gz  hg38.DpnII.bed  hg38.fa.fai  hg38.gaps.bed  README.txt

Map data with bwa mem

Note that you may use bwa mem2, which is x2 times faster. It proved to produce results very similar to bwa mem.

[ ]:
# Map test data:
! bwa mem -t 5 -SP ~/.local/share/genomes/hg38/index/bwa/hg38.fa SRR13849430_1.fastq.gz SRR13849430_2.fastq.gz > test.bam

After mapping, you have .sam/.bam alignment file, which cannot be interpreted as pairs directly. You need to extract contacts from it:

[3]:
%%bash
samtools view test.bam | head -n 4
SRR13849430.1   121     chr12   78795720        60      53S97M  =       78795720        0       CAAATACTACAAAGAATAACATCAGGGAGTTGTCAGCATATAAACTATTTTTAGATCTGGATAAAGGAAGAAAGCAGGTACAAAAGTCCCTTGCTAAAGCAGAATTTAGAGCAAGGTCCCAACTCTCTTCTATCTTATAAAGGCTGAGAN  -7-7---A------7--77--))))7--F-A)7F<JJJF-J<FFF-A7-A-7<FF-JF-JFA-AJJJJAJJJJ<FAFJ<JJFJJF7FJJF7JFFJJ-<-FJJAJA-JJJ7JF-A7AFFFFFFAJJJFJJJJAJAJJJJJJJJJJJFFAA#        NM:i:1  MD:Z:96G0       AS:i:96 XS:i:21
SRR13849430.1   181     chr12   78795720        0       *       =       78795720        0       TTATATGCTGACGTCGTCCATTGTTGATTCTTTGTGGTCTGTGTTCCCTCTGAAGCTCTCTGTACTTGTATATACAATTAGGCACCAACAATTCAACATGGGTAGTTATTATACTGCTAAAGCGACCCCAGAGCGAACCGGGCTACCTTN  ---7---7-7)))-7---7-----7-7<---7-)<)<-----)-)<77--7---)7-<7-7----A-F<<-AF7-A777--77-7-----7--<--7-F777-A7-7---<7--7<7-<--77-77--77--7-7A---7-7---FF<A#      MC:Z:53S97M     AS:i:0  XS:i:0
SRR13849430.2   73      chr8    43879758        60      75M75S  =       43879758        0       NAAAAGGTTTATCTCTGTGAGATGAAGCCACATAAAACCAAGCAGTTTCACAGAAACTTGTTTGTAGATTTTAAATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTGCGGCCCTTCCGTATTCCCGCATA  #AAFAJJFJJJJFAAFJA<JJJFJJFJJJJJFFJ-FF<FFJJJJJJ7AFFF<FJJ-<-F7<7<A-AF7AJAJ-A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA--))--))-<))-------)))7-  NM:i:2  MD:Z:0A58C15    AS:i:69 XS:i:29 SA:Z:chr8,43877720,+,73S54M23S,44,1;
SRR13849430.2   2121    chr8    43877720        44      73H54M23H       =       43877720        0       AATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTG  A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA-       NM:i:1  MD:Z:5A48       AS:i:49 XS:i:35 SA:Z:chr8,43879758,+,75M75S,60,2;

Contacts extraction

Challenges of contacts extraction solved by pairtools:

  • multiple walks in each Hi-C read

  • alignments spaced by short genomic distance may be reported as single alignment with a gap by bwa

  • PCR and optical duplicates

pairtools parse

The next step is to parse .sam/.bam into pairs. pairtools parse automatically detect the file format.

Full parse documentation: https://github.com/open2c/pairtools/blob/master/doc/parsing.rst

[ ]:
%%bash
pairtools parse -o test.pairs.gz -c ~/.local/share/genomes/hg38/hg38.fa.sizes \
  --drop-sam --drop-seq --output-stats test.stats \
  --assembly hg38 --no-flip \
  --add-columns mapq \
  --walks-policy mask \
  test.bam

Parse will result in .pairs file. .pairs is a simple tabular format for storing DNA contacts detected in a Hi-C experiment. The detailed .pairs specification is defined by the 4DN Consortium.

[4]:
%%bash
# Header
gzip -dc test.pairs.gz | head -n 5
## pairs format v1.0.0
#shape: whole matrix
#genome_assembly: hg38
#chromsize: chr1 248956422
#chromsize: chr10 133797422
[5]:
%%bash
# Last lines of the header contain the list of columns:
gzip -dc test.pairs.gz | grep "#" | tail -n 5
#samheader: @SQ SN:chrX LN:156040895
#samheader: @SQ SN:chrY LN:57227415
#samheader: @PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:./bwa mem -t 5 -SP /home/agalicina/.local/share/genomes/hg38/index/bwa/hg38.fa SRR13849430_1.fastq.gz SRR13849430_2.fastq.gz
#samheader: @PG ID:pairtools_parse      PN:pairtools_parse      CL:./pairtools parse -o test.pairs.gz -c /home/agalicina/.local/share/genomes/hg38/hg38.fa.sizes --drop-sam --drop-seq --output-stats test.stats --assembly hg38 --no-flip --add-columns mapq --walks-policy mask test.bam      PP:bwa  VN:1.0.0-dev1
#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type mapq1 mapq2
[6]:
%%bash
# Body of the .pairs file
gzip -dc test.pairs.gz | grep -v "#" | head -n 5
SRR13849430.1   chr12   78795816        !       0       -       -       UN      60      0
SRR13849430.2   !       0       !       0       -       -       WW      60      0
SRR13849430.3   chr2    72005391        !       0       +       -       UN      60      0
SRR13849430.4   chr2    20530788        !       0       +       -       UN      60      0
SRR13849430.5   !       0       !       0       -       -       WW      0       0

pairtools sort

After parsing each pair is listed in the orderthat it appeared in the reads file. Sorting makes it more organised, and also helps to perform deduplication which cannot work on non-sorted files.

[ ]:
%%bash
pairtools sort --nproc 5 -o test.sorted.pairs.gz test.pairs.gz

pairtools dedup

[ ]:
%%bash
pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --output \
        >( pairtools split \
            --output-pairs test.nodups.pairs.gz \
            --output-sam test.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs test.unmapped.pairs.gz \
            --output-sam test.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs test.dups.pairs.gz \
            --output-sam test.dups.bam \
         ) \
    --output-stats test.dedup.stats \
    test.pairs.gz
[7]:
%%bash
# Unique pairs:
gzip -dc test.nodups.pairs.gz | grep -v "#" | head -n 10
SRR13849430.513 chr20   23502312        chr20   23063544        +       +       RU      60      60
SRR13849430.1442        chr7    57224960        chrX    82818236        +       +       UU      60      30
SRR13849430.2378        chr5    115925933       chr21   24124840        +       +       UU      60      50
SRR13849430.2547        chr1    52097837        chr12   1888807 -       -       UU      60      60
SRR13849430.3015        chr17   74750879        chr11   117356318       +       -       UR      60      60
SRR13849430.3027        chr15   34977762        chr15   31897447        -       +       UR      11      60
SRR13849430.3406        chr11   1171960 chr9    121265592       +       -       UU      60      60
SRR13849430.3988        chr16   86824176        chr13   104521019       -       +       UU      60      17
SRR13849430.4030        chr17   73189645        chr4    49092470        -       +       UU      60      31
SRR13849430.4316        chr8    124329308       chr8    124336541       -       -       UU      60      60
[8]:
%%bash
# Only duplicated  pairs:
gzip -dc test.dups.pairs.gz | grep -v "#" | head -n 10
SRR13849430.60371       chr2    44507613        chr7    116276932       -       +       DD      60      57
SRR13849430.67567       chr5    62425895        chr5    62425612        -       +       DD      60      60
SRR13849430.97623       chr3    162233323       chr3    162154449       -       +       DD      60      52
SRR13849430.108366      chr8    48691403        chr8    48872239        -       -       DD      60      60
SRR13849430.138622      chr16   8435050 chr16   6032751 +       -       DD      60      60
SRR13849430.146482      chr14   86385083        chr2    119648648       +       +       DD      60      60
SRR13849430.148232      chrX    21885792        chrX    21887418        +       -       DD      60      60
SRR13849430.149771      chr16   6646543 chr16   6648097 -       -       DD      60      60
SRR13849430.156983      chr4    55704089        chr4    76039070        +       +       DD      60      13
SRR13849430.157962      chr6    47656758        chr6    47748395        +       -       DD      60      35

pairtools select

Sometimes you may need certain types of pairs based on their properties, such as mapq, pair type, distance or orientation. For all these manipulations, there is pairtools select which requires a file and pythonic condition as an input:

[ ]:
%%bash
pairtools select "mapq1>0 and mapq2>0" test.nodups.pairs.gz -o test.nodups.UU.pairs.gz

pairtools stats

Describe the types fo distance properties of pairs:

[ ]:
%%bash
pairtools stats test.pairs.gz -o test.stats

MultiQC

[ ]:
%%bash
multiqc test.stats
[9]:
from IPython.display import IFrame

IFrame(src='./multiqc_report.html', width=1200, height=700)
[9]:

Load pairs to cooler

Finally, when you obtained a list of appropriate pairs, you may create coolers with it:

[ ]:
%%bash
cooler cload pairs \
    -c1 2 -p1 3 -c2 4 -p2 5 \
    --assembly hg38 \
    ~/.local/share/genomes/hg38/hg38.fa.sizes:1000000 \
    test.nodups.UU.pairs.gz \
    test.hg38.1000000.cool
[ ]:
%%bash
cooler zoomify \
    --nproc 5 \
    --out test.hg38.1000000.mcool \
    --resolutions 1000000,2000000 \
    --balance \
    test.hg38.1000000.cool

Visualize cooler

Based on open2c vis example

[36]:
import cooler
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import cooltools.lib.plotting
from matplotlib.colors import LogNorm
import seaborn as sns
import bioframe
import numpy as np
[11]:
file = "test.hg38.1000000.mcool::/resolutions/1000000"
[12]:
clr = cooler.Cooler(file)
[19]:
# Define chromosome starts
chromstarts = []
for i in clr.chromnames:
        chromstarts.append(clr.extent(i)[0])
[20]:
from matplotlib.ticker import EngFormatter
bp_formatter = EngFormatter('b')

def format_ticks(ax, x=True, y=True, rotate=True):
    if y:
        ax.yaxis.set_major_formatter(bp_formatter)
    if x:
        ax.xaxis.set_major_formatter(bp_formatter)
        ax.xaxis.tick_bottom()
    if rotate:
        ax.tick_params(axis='x',rotation=45)
[40]:
# Define the bounds of the continuous fragment of whole-genome interaction map
chrom_start, chrom_end = clr.chromnames.index('chr3'), clr.chromnames.index('chr6')
start, end = chromstarts[chrom_start], chromstarts[chrom_end]
[43]:
vmax = 15
norm = LogNorm(vmin=1, vmax=vmax)

f, axs = plt.subplots(
    figsize=(13, 10),
    nrows=2,
    ncols=1,
    sharex=False, sharey=False)

ax = axs[0]
ax.set_title('Interaction maps (chr1)')
im = ax.matshow(clr.matrix(balance=False).fetch('chr1'), vmax=vmax, cmap='fall');
plt.colorbar(im, ax=ax ,fraction=0.046, pad=0.04, label='chr1');

ax = axs[1]
ax.set_title('Chromosomes 3-5')
im = ax.matshow(clr.matrix(balance=False)[start:end, start:end], norm=norm, cmap='fall');
plt.colorbar(im, ax=ax ,fraction=0.046, pad=0.04, label='Whole-genome');
ax.set_xticks(np.array(chromstarts[chrom_start:chrom_end])-start, clr.chromnames[chrom_start:chrom_end], rotation=90);
ax.set_yticks(np.array(chromstarts[chrom_start:chrom_end])-start, clr.chromnames[chrom_start:chrom_end], rotation=90);

format_ticks(axs[0], rotate=False)

plt.tight_layout()
../_images/examples_pairtools_walkthrough_43_0.png
[ ]: