Pairtools phase walkthrough

Welcome to the pairtools phase walkthrough!

Haplotype-resolved Hi-C is a popular technique that helps you to resolve contacts of homologous chromosomes. It relies on a simple idea tha homologous chromosomes have variations (e.g., SNPs) that are inherited together as haplotypes. DNA reads in Hi-C will have the SNVs from one of two haplotypes, which can be used to distinguish the contacts on the same chromosome (cis-homologous) and contacts connecting two homologs (trans-homologous).

The experimental challenge of the haplotype-resolved Hi-C is to increase the number of SNPs that distinguish reads from different chromosomes. This can be dome by mating highly diverged.

  • Erceg et al. 2019 create highly heterozygous embryos of Drosophila [1]
  • Collombet et al. 2020 create highly polymorphic F1 hybrid embryos obtained by crossing female Mus musculus domesticus (C57Bl/6J) with male Mus musculus castaneus CAST/EiJ) to resolve structures of individual chromosomes in the zygote and embryos [2]
  • Tan et al. 2018 uses available heterozygous positions to infer the 3D structures of single chromosomes by single-cell variant of the protocol Dip-C [3]
  • Duan et al. use dikaryonic nuclei of fungi with 0.7% heterozygosity [4]

In pairtools we implement an approach to resolving haplotypes from Erceg et al. The outline of haplotype-resolved parsing of pairs:

  1. Create the reference genome: create the concatenated reference genomes from two haplotypes.

    Usually the SNVs are known and can be obtained in .vsf format. We will incorporate the SNVs by bcftools into the reference and create updated fasta files with haplotype-corrected sequences. For each homologue we will add the suffixes that identify the type of homologue (_hap1 or _hap2).

  2. Map the Hi-C data to the concatenated reference and parse allowing multimappers (mapq 0).

    We will also need the mapper to report two suboptimal alignments (aka the second and the third hit). When the Hi-C read is mapped to some location in the genome, it will have the suffix of the homologue reported as part of chromosome name. However, the true resolved pairs are not yet known at this step.

    See sections:

    1. Download data
    2. Map data with bwa mem to diploid genome
    3. pairtools parse
  3. pairtools phase: phase the pairs based on the reported suboptimal alignments.

    By checking the scores of two suboptimal alignments, we will distinguish the true multi-mappers from unresolved pairs (i.e. cases when the read aligns to the location with no distinguishing SNV). Phasing procedure will remove the haplotype suffixes from chromosome names and add extra fields to the .pairs file with:

    ‘.’ (non-resolved)

    ‘0’ (first haplotype) or

    ‘1’ (second haplotype).

    Phasing schema:


  1. Post-procesing. Do sorting, dedup and stats, as usual.

    See sections:

    1. pairtools dedup
    2. Stats

[1] Erceg, J., AlHaj Abed, J., Goloborodko, A., Lajoie, B. R., Fudenberg, G., Abdennur, N., Imakaev, M., McCole, R. B., Nguyen, S. C., Saylor, W., Joyce, E. F., Senaratne, T. N., Hannan, M. A., Nir, G., Dekker, J., Mirny, L. A., & Wu, C. T. (2019). The genome-wide multi-layered architecture of chromosome pairing in early Drosophila embryos. Nature communications, 10(1), 4486.

[2] Collombet, S., Ranisavljevic, N., Nagano, T., Varnai, C., Shisode, T., Leung, W., Piolot, T., Galupa, R., Borensztein, M., Servant, N., Fraser, P., Ancelin, K., & Heard, E. (2020). Parental-to-embryo switch of chromosome organization in early embryogenesis. Nature, 580(7801), 142–146.

[3] Tan, L., Xing, D., Chang, C. H., Li, H., & Xie, X. S. (2018). Three-dimensional genome structures of single diploid human cells. Science (New York, N.Y.), 361(6405), 924–928.

[4] Duan, H., Jones, A. W., Hewitt, T., Mackenzie, A., Hu, Y., Sharp, A., Lewis, D., Mago, R., Upadhyaya, N. M., Rathjen, J. P., Stone, E. A., Schwessinger, B., Figueroa, M., Dodds, P. N., Periyannan, S., & Sperschneider, J. (2022). Physical separation of haplotypes in dikaryons allows benchmarking of phasing accuracy in Nanopore and HiFi assemblies with Hi-C data. Genome biology, 23(1), 84.

We will test on a sample from Collombet et al. 2019 [2], example of mouse single-cell Hi-C on embryos obtained from highly heterozygous parents. We will take some cell from the dataset, GSM3691125_2CSE_70. Note that becuase the procedure is not strictly Hi-C, the properties of this dataset may differ from what you may obtain on bulk data.

Create the reference genome

For phasing, map the data to the concatenated genome with two haplotypes. Obtaining such genome is not a simple task. You will need a reference genome, and one or two lists of mutations to instroduce to the reference.

Download reference genome

[ ]:
! wget

Download .vcf file with variants

[ ]:
! wget

Index the variants

[ ]:
! bcftools index CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz

Introduce the variants into the genome

Note that you may select the variants that are only SNPs but not SNVs (deletions/insertions) by using --include parameter of bcftools consensus (e.g. --include '(STRLEN(REF)=1) & (STRLEN(ALT[0])=1)'). This will make sure that the genomic coorditates correspond between the haplotypes. Correspondence of coordinates is not a requirement, but might be important for downstream analysis.

[ ]:
bcftools consensus --fasta-ref GRCm38_68.fa.gz \
  --haplotype 1 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz |sed '/^>/ s/$/_hap1/' | bgzip -c > GRCm38_EiJ_snpsonly_hap1.fa.gz

bcftools consensus --fasta-ref GRCm38_68.fa.gz \
  --haplotype 2 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz |sed '/^>/ s/$/_hap2/' | bgzip -c > GRCm38_EiJ_snpsonly_hap2.fa.gz

Create the index of concatenated haplotypes

Concatenate the genomes and index them together. Note that bwa-mem2 produces very similar results to bwa mem, while being x2-3 times faster. We highly recommend to use it instead of bwa!

[ ]:
cat GRCm38_EiJ_snpsonly_hap1.fa.gz GRCm38_EiJ_snpsonly_hap2.fa.gz > GRCm38_EiJ_snpsonly.fa.gz
bwa index GRCm38_EiJ_snpsonly.fa.gz

Generate chromosome sizes file:

[ ]:
faidx GRCm38_EiJ_snpsonly.fa.gz -i chromsizes > GRCm38_EiJ_snpsonly.chromsizes

Download data

Uncomment the --minSpotId and --maxSpotId if you want to run the small test instead of full run.

[ ]:
! fastq-dump SRR8811373 --gzip --split-spot --split-3 # --minSpotId 0 --maxSpotId 1000000
[ ]:
! ls SRR8811373*.fastq.gz

Map data with bwa mem to diploid genome

Note that you may use bwa mem2, which is x2 times faster. It proved to produce results very similar to bwa mem.

There are two modes to work with phasing.

  1. Github mode with XB bwa tag. This is the most precise algorithm that operates based on alignment scores of optimal alignment (best hit), and two suboptimal ones.

    Download and install bwa from GitHub. Map with:

    ./bwa/bwa mem -SPu -t 5 mm10_EiJ_snpsonly.fa.gz  test.1.fastq.gz test.2.fastq.gz | samtools view -@ 8 -b > mapped.XB.bam
  2. Regular mode with XA bwa tag.

    This is simplified version that operates on number of mismatches for the suboptimal alignments.

    bwa mem -SP -t 5 mm10_EiJ_snpsonly.fa.gz  est.1.fastq.gz test.2.fastq.gz | samtools view -@ 8 -b > mapped.XA.bam

We will try the second option for the simplicity:

[ ]:
bwa mem -SP -t 5 GRCm38_EiJ_snpsonly.fa.gz SRR8811373_1.fastq.gz SRR8811373_2.fastq.gz \
  | samtools view -@ 8 -b > mapped.XA.bam

pairtools parse

For phasing, we need additional tags and no filtering by mapq.

--min-mapq is 1 by default, which removes all multiply mapped sequences. However, we need this information for phasing to distinguish true multiply mapped pairs from pairs mapped to both haplotypes:

[ ]:
pairtools parse --add-columns XA,NM,AS,XS --min-mapq 0 --drop-sam --walks-policy all \
  -c GRCm38_EiJ_snpsonly.chromsizes mapped.XA.bam -o unphased.XA.pairs.gz

pairtools phase

Phasing will remove the tags “_1” and “_2” from chromosome names and add a separate field for the phase:

[ ]:
pairtools phase --phase-suffixes _hap1 _hap2 --tag-mode XA --clean-output unphased.XA.pairs.gz -o phased.XA.pairs.gz

pairtools dedup

Sort prior to dedup:

[ ]:
pairtools sort phased.XA.pairs.gz --nproc 10 -o phased.sorted.XA.pairs.gz

Deduplication now should take additional columns with phases into account:

[ ]:
pairtools dedup --mark-dups --extra-col-pair phase1 phase2 \
  --output-dups - --output-unmapped - --output-stats phased.XA.dedup.stats \
  -o phased.sorted.XA.nodup.pairs.gz phased.sorted.XA.pairs.gz

Dedup might generate warning that phase columns now contain mixed data types (‘.’ alongside with 0 and 1). This warning is inherited from reading by reading the pairs file by pandas.


First, filter different types of reads:

[ ]:
pairtools select '(phase1=="0") and (phase2=="0")' phased.sorted.XA.nodup.pairs.gz -o phased.XA.phase0.pairs.gz
pairtools select '(phase1=="1") and (phase2=="1")' phased.sorted.XA.nodup.pairs.gz -o phased.XA.phase1.pairs.gz
pairtools select '(phase1==".") or (phase2==".")' phased.sorted.XA.nodup.pairs.gz -o phased.XA.unphased.pairs.gz
pairtools select '(phase1!=phase2) and (phase1!=".") and (phase2!=".")' phased.sorted.XA.nodup.pairs.gz \
  -o phased.XA.trans-phase.pairs.gz

Calculate stats for these different types:

[ ]:
pairtools stats phased.XA.phase0.pairs.gz -o phased.XA.phase0.stats
pairtools stats phased.XA.phase1.pairs.gz -o phased.XA.phase1.stats
pairtools stats phased.XA.unphased.pairs.gz -o phased.XA.unphased.stats
pairtools stats phased.XA.trans-phase.pairs.gz -o phased.XA.trans-phase.stats

Visualize with multiQC:

[ ]:
multiqc phased.XA.*phase*.stats -o multiqc_report_phasing
from IPython.display import IFrame

IFrame(src='./multiqc_report_phasing/multiqc_report.html', width=1200, height=700)
[ ]: