Pairtools walkthrough
Welcome to the pairtools walkthrough.
Pairtools is a tool for extraction of pairwise contacts out of sequencing chromosomes conformation capture data, such as Hi-C, Micro-C or MC-3C. Pairtools is used for obtaining .cool files by distiller, and has many more applications (see single-cell walkthrough or phasing walkthrough).
Here, we will cover the basic steps from raw reads to .cool file with binned contacts.
Outline:
Download raw data
“Raw” data, or .fastq files are generated by sequencing facilities or can be taken from public databases, such as SRA. We will take a sample from Rao et at al. 2017, human datasets. To reduce computateion time, take 5 mln reads instead of full sample:
[ ]:
! fastq-dump SRR13849430 --gzip --split-spot --split-3 --minSpotId 0 --maxSpotId 5000000
[ ]:
! ls SRR13849430*.fastq.gz
[1]:
%%bash
gzip -dc SRR13849430_1.fastq.gz | head -n 8
@SRR13849430.1 1 length=150
NTCTCAGCCTTTATAAGATAGAAGAGAGTTGGGACCTTGCTCTAAATTCTGCTTTAGCAAGGGACTTTTGTACCTGCTTTCTTCCTTTATCCAGATCTAAAAATAGTTTATATGCTGACAACTCCCTGATGTTATTCTTTGTAGTATTTG
+SRR13849430.1 1 length=150
#AAFFJJJJJJJJJJJAJAJJJJFJJJAFFFFFFA7A-FJ7JJJ-AJAJJF-<-JJFFJ7FJJF7FJJFJJ<JFAF<JJJJAJJJJA-AFJ-FJ-FF<7-A-7A-FFF<J-FJJJ<F7)A-F--7))))--77--7------A---7-7-
@SRR13849430.2 2 length=150
NAAAAGGTTTATCTCTGTGAGATGAAGCCACATAAAACCAAGCAGTTTCACAGAAACTTGTTTGTAGATTTTAAATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTGCGGCCCTTCCGTATTCCCGCATA
+SRR13849430.2 2 length=150
#AAFAJJFJJJJFAAFJA<JJJFJJFJJJJJFFJ-FF<FFJJJJJJ7AFFF<FJJ-<-F7<7<A-AF7AJAJ-A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA--))--))-<))-------)))7-
We next map the sequencing data to the reference genome.
Install reference genome
Although you may download and build bwa index for any genome manually, genompy
can do that for you automatically:
[ ]:
# Activate bwa plugin for genomepy:
! genomepy plugin enable bwa
[ ]:
# Install hg38 genome by genomepy:
! genomepy install hg38
[2]:
# Check the location of the genome:
! ls ~/.local/share/genomes/hg38/
bwa-mem2 hg38_DpnII.bed hg38.fa hg38.fa.sizes index
hg38.blacklist.bed.gz hg38.DpnII.bed hg38.fa.fai hg38.gaps.bed README.txt
Map data with bwa mem
Note that you may use bwa mem2, which is x2 times faster. It proved to produce results very similar to bwa mem.
[ ]:
# Map test data:
! bwa mem -t 5 -SP ~/.local/share/genomes/hg38/index/bwa/hg38.fa SRR13849430_1.fastq.gz SRR13849430_2.fastq.gz > test.bam
After mapping, you have .sam/.bam alignment file, which cannot be interpreted as pairs directly. You need to extract contacts from it:
[3]:
%%bash
samtools view test.bam | head -n 4
SRR13849430.1 121 chr12 78795720 60 53S97M = 78795720 0 CAAATACTACAAAGAATAACATCAGGGAGTTGTCAGCATATAAACTATTTTTAGATCTGGATAAAGGAAGAAAGCAGGTACAAAAGTCCCTTGCTAAAGCAGAATTTAGAGCAAGGTCCCAACTCTCTTCTATCTTATAAAGGCTGAGAN -7-7---A------7--77--))))7--F-A)7F<JJJF-J<FFF-A7-A-7<FF-JF-JFA-AJJJJAJJJJ<FAFJ<JJFJJF7FJJF7JFFJJ-<-FJJAJA-JJJ7JF-A7AFFFFFFAJJJFJJJJAJAJJJJJJJJJJJFFAA# NM:i:1 MD:Z:96G0 AS:i:96 XS:i:21
SRR13849430.1 181 chr12 78795720 0 * = 78795720 0 TTATATGCTGACGTCGTCCATTGTTGATTCTTTGTGGTCTGTGTTCCCTCTGAAGCTCTCTGTACTTGTATATACAATTAGGCACCAACAATTCAACATGGGTAGTTATTATACTGCTAAAGCGACCCCAGAGCGAACCGGGCTACCTTN ---7---7-7)))-7---7-----7-7<---7-)<)<-----)-)<77--7---)7-<7-7----A-F<<-AF7-A777--77-7-----7--<--7-F777-A7-7---<7--7<7-<--77-77--77--7-7A---7-7---FF<A# MC:Z:53S97M AS:i:0 XS:i:0
SRR13849430.2 73 chr8 43879758 60 75M75S = 43879758 0 NAAAAGGTTTATCTCTGTGAGATGAAGCCACATAAAACCAAGCAGTTTCACAGAAACTTGTTTGTAGATTTTAAATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTGCGGCCCTTCCGTATTCCCGCATA #AAFAJJFJJJJFAAFJA<JJJFJJFJJJJJFFJ-FF<FFJJJJJJ7AFFF<FJJ-<-F7<7<A-AF7AJAJ-A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA--))--))-<))-------)))7- NM:i:2 MD:Z:0A58C15 AS:i:69 XS:i:29 SA:Z:chr8,43877720,+,73S54M23S,44,1;
SRR13849430.2 2121 chr8 43877720 44 73H54M23H = 43877720 0 AATCCGCACATCACAAAGCCATTTCACAGACAGCTTCTTTCTACTTTTTATCTG A-7FF7JJJFJJFJJFFFFAJFJJ7-7-A-F--777FJFJF7---AAAF<AAA- NM:i:1 MD:Z:5A48 AS:i:49 XS:i:35 SA:Z:chr8,43879758,+,75M75S,60,2;
Contacts extraction
Challenges of contacts extraction solved by pairtools
:
multiple walks in each Hi-C read
alignments spaced by short genomic distance may be reported as single alignment with a gap by bwa
PCR and optical duplicates
pairtools parse
The next step is to parse .sam/.bam into pairs. pairtools parse
automatically detect the file format.
Full parse documentation: https://github.com/open2c/pairtools/blob/master/doc/parsing.rst
[ ]:
%%bash
pairtools parse -o test.pairs.gz -c ~/.local/share/genomes/hg38/hg38.fa.sizes \
--drop-sam --drop-seq --output-stats test.stats \
--assembly hg38 --no-flip \
--add-columns mapq \
--walks-policy mask \
test.bam
Parse will result in .pairs file. .pairs is a simple tabular format for storing DNA contacts detected in a Hi-C experiment. The detailed .pairs specification is defined by the 4DN Consortium.
[4]:
%%bash
# Header
gzip -dc test.pairs.gz | head -n 5
## pairs format v1.0.0
#shape: whole matrix
#genome_assembly: hg38
#chromsize: chr1 248956422
#chromsize: chr10 133797422
[5]:
%%bash
# Last lines of the header contain the list of columns:
gzip -dc test.pairs.gz | grep "#" | tail -n 5
#samheader: @SQ SN:chrX LN:156040895
#samheader: @SQ SN:chrY LN:57227415
#samheader: @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:./bwa mem -t 5 -SP /home/agalicina/.local/share/genomes/hg38/index/bwa/hg38.fa SRR13849430_1.fastq.gz SRR13849430_2.fastq.gz
#samheader: @PG ID:pairtools_parse PN:pairtools_parse CL:./pairtools parse -o test.pairs.gz -c /home/agalicina/.local/share/genomes/hg38/hg38.fa.sizes --drop-sam --drop-seq --output-stats test.stats --assembly hg38 --no-flip --add-columns mapq --walks-policy mask test.bam PP:bwa VN:1.0.0-dev1
#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type mapq1 mapq2
[6]:
%%bash
# Body of the .pairs file
gzip -dc test.pairs.gz | grep -v "#" | head -n 5
SRR13849430.1 chr12 78795816 ! 0 - - UN 60 0
SRR13849430.2 ! 0 ! 0 - - WW 60 0
SRR13849430.3 chr2 72005391 ! 0 + - UN 60 0
SRR13849430.4 chr2 20530788 ! 0 + - UN 60 0
SRR13849430.5 ! 0 ! 0 - - WW 0 0
pairtools sort
After parsing each pair is listed in the orderthat it appeared in the reads file. Sorting makes it more organised, and also helps to perform deduplication which cannot work on non-sorted files.
[ ]:
%%bash
pairtools sort --nproc 5 -o test.sorted.pairs.gz test.pairs.gz
pairtools dedup
[ ]:
%%bash
pairtools dedup \
--max-mismatch 3 \
--mark-dups \
--output \
>( pairtools split \
--output-pairs test.nodups.pairs.gz \
--output-sam test.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs test.unmapped.pairs.gz \
--output-sam test.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs test.dups.pairs.gz \
--output-sam test.dups.bam \
) \
--output-stats test.dedup.stats \
test.pairs.gz
[7]:
%%bash
# Unique pairs:
gzip -dc test.nodups.pairs.gz | grep -v "#" | head -n 10
SRR13849430.513 chr20 23502312 chr20 23063544 + + RU 60 60
SRR13849430.1442 chr7 57224960 chrX 82818236 + + UU 60 30
SRR13849430.2378 chr5 115925933 chr21 24124840 + + UU 60 50
SRR13849430.2547 chr1 52097837 chr12 1888807 - - UU 60 60
SRR13849430.3015 chr17 74750879 chr11 117356318 + - UR 60 60
SRR13849430.3027 chr15 34977762 chr15 31897447 - + UR 11 60
SRR13849430.3406 chr11 1171960 chr9 121265592 + - UU 60 60
SRR13849430.3988 chr16 86824176 chr13 104521019 - + UU 60 17
SRR13849430.4030 chr17 73189645 chr4 49092470 - + UU 60 31
SRR13849430.4316 chr8 124329308 chr8 124336541 - - UU 60 60
[8]:
%%bash
# Only duplicated pairs:
gzip -dc test.dups.pairs.gz | grep -v "#" | head -n 10
SRR13849430.60371 chr2 44507613 chr7 116276932 - + DD 60 57
SRR13849430.67567 chr5 62425895 chr5 62425612 - + DD 60 60
SRR13849430.97623 chr3 162233323 chr3 162154449 - + DD 60 52
SRR13849430.108366 chr8 48691403 chr8 48872239 - - DD 60 60
SRR13849430.138622 chr16 8435050 chr16 6032751 + - DD 60 60
SRR13849430.146482 chr14 86385083 chr2 119648648 + + DD 60 60
SRR13849430.148232 chrX 21885792 chrX 21887418 + - DD 60 60
SRR13849430.149771 chr16 6646543 chr16 6648097 - - DD 60 60
SRR13849430.156983 chr4 55704089 chr4 76039070 + + DD 60 13
SRR13849430.157962 chr6 47656758 chr6 47748395 + - DD 60 35
pairtools select
Sometimes you may need certain types of pairs based on their properties, such as mapq, pair type, distance or orientation. For all these manipulations, there is pairtools select
which requires a file and pythonic condition as an input:
[ ]:
%%bash
pairtools select "mapq1>0 and mapq2>0" test.nodups.pairs.gz -o test.nodups.UU.pairs.gz
pairtools stats
Describe the types fo distance properties of pairs:
[ ]:
%%bash
pairtools stats test.pairs.gz -o test.stats
MultiQC
[ ]:
%%bash
multiqc test.stats
[9]:
from IPython.display import IFrame
IFrame(src='./multiqc_report.html', width=1200, height=700)
[9]:
Load pairs to cooler
Finally, when you obtained a list of appropriate pairs, you may create coolers with it:
[ ]:
%%bash
cooler cload pairs \
-c1 2 -p1 3 -c2 4 -p2 5 \
--assembly hg38 \
~/.local/share/genomes/hg38/hg38.fa.sizes:1000000 \
test.nodups.UU.pairs.gz \
test.hg38.1000000.cool
[ ]:
%%bash
cooler zoomify \
--nproc 5 \
--out test.hg38.1000000.mcool \
--resolutions 1000000,2000000 \
--balance \
test.hg38.1000000.cool
Visualize cooler
Based on open2c vis example
[36]:
import cooler
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import cooltools.lib.plotting
from matplotlib.colors import LogNorm
import seaborn as sns
import bioframe
import numpy as np
[11]:
file = "test.hg38.1000000.mcool::/resolutions/1000000"
[12]:
clr = cooler.Cooler(file)
[19]:
# Define chromosome starts
chromstarts = []
for i in clr.chromnames:
chromstarts.append(clr.extent(i)[0])
[20]:
from matplotlib.ticker import EngFormatter
bp_formatter = EngFormatter('b')
def format_ticks(ax, x=True, y=True, rotate=True):
if y:
ax.yaxis.set_major_formatter(bp_formatter)
if x:
ax.xaxis.set_major_formatter(bp_formatter)
ax.xaxis.tick_bottom()
if rotate:
ax.tick_params(axis='x',rotation=45)
[40]:
# Define the bounds of the continuous fragment of whole-genome interaction map
chrom_start, chrom_end = clr.chromnames.index('chr3'), clr.chromnames.index('chr6')
start, end = chromstarts[chrom_start], chromstarts[chrom_end]
[43]:
vmax = 15
norm = LogNorm(vmin=1, vmax=vmax)
f, axs = plt.subplots(
figsize=(13, 10),
nrows=2,
ncols=1,
sharex=False, sharey=False)
ax = axs[0]
ax.set_title('Interaction maps (chr1)')
im = ax.matshow(clr.matrix(balance=False).fetch('chr1'), vmax=vmax, cmap='fall');
plt.colorbar(im, ax=ax ,fraction=0.046, pad=0.04, label='chr1');
ax = axs[1]
ax.set_title('Chromosomes 3-5')
im = ax.matshow(clr.matrix(balance=False)[start:end, start:end], norm=norm, cmap='fall');
plt.colorbar(im, ax=ax ,fraction=0.046, pad=0.04, label='Whole-genome');
ax.set_xticks(np.array(chromstarts[chrom_start:chrom_end])-start, clr.chromnames[chrom_start:chrom_end], rotation=90);
ax.set_yticks(np.array(chromstarts[chrom_start:chrom_end])-start, clr.chromnames[chrom_start:chrom_end], rotation=90);
format_ticks(axs[0], rotate=False)
plt.tight_layout()
[ ]: