Workflows and Parameters

This page provides guidance on using pairtools for the most common Hi-C protocols and helps users fine-tune the pipeline for different variations of the Hi-C protocol. It covers recommended parameters and best practices for processing Hi-C data using pairtools.

Typical Hi-C Workflow

A typical pairtools workflow for processing standard Hi-C data is outlined below. Note that this is a condensed version; for a detailed, reproducible example, see the Jupyter notebook “Pairtools Walkthrough”.

  1. Align sequences to the reference genome with bwa mem:

    bwa mem -SP index_file input.R1.fastq input.R2.fastq > input.sam
    
  2. Parse alignments into Hi-C pairs using pairtools parse:

    pairtools parse -c /path/to/chrom_sizes -o output.pairs.gz input.sam
    
  3. Sort pairs using pairtools sort:

    pairtools sort --nproc 8 -o output.sorted.pairs.gz output.pairs.gz
    
  4. Detect and remove duplicates using pairtools dedup and generate statistics:

    pairtools dedup \
    --output output.nodups.pairs.gz \
    --output-dups output.dups.pairs.gz \
    --output-unmapped output.unmapped.pairs.gz \
    --output-stats output.dedup.stats \
    output.sorted.pairs.gz
    
  5. Aggregate into a cooler file:

    cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 /path/to/chrom_sizes:1000 output.nodups.pairs.gz output.1000.cool
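
The statistics file written in step 4 can be checked quickly from the shell. The sketch below assumes the file is a two-column, tab-separated table containing total_mapped and total_dups keys, as pairtools statistics output normally does; verify the key names against your own file. The sample values are made up for illustration, and in practice stats_file would point at output.dedup.stats.

```shell
# Hypothetical sample standing in for output.dedup.stats
# (assumed layout: one "key<TAB>value" pair per line).
stats_file=$(mktemp)
printf 'total\t1000\ntotal_unmapped\t100\ntotal_mapped\t900\ntotal_dups\t90\ntotal_nodups\t810\n' > "$stats_file"

# Fraction of mapped pairs that were flagged as duplicates.
dup_rate=$(awk -F'\t' '
    $1 == "total_mapped" { mapped = $2 }
    $1 == "total_dups"   { dups = $2 }
    END { if (mapped > 0) printf "%.3f", dups / mapped }
' "$stats_file")

echo "duplication rate: $dup_rate"
rm -f "$stats_file"
```

A high duplication rate usually points at low library complexity, so it is worth checking before aggregating pairs into a cooler.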
    

Technical tips

  • Pipe between commands to save space and I/O throughput

    Use Unix pipes to connect the output of one command directly to the input of the next command in the pipeline. This eliminates the need to store intermediate files on disk, saving storage space and reducing I/O overhead. Specifically, mapping, parsing, sorting and deduplication can all be connected into a single pipeline:

    bwa mem -SP index input.R1.fastq input.R2.fastq | \
    pairtools parse -c chromsizes.txt | \
    pairtools sort | \
    pairtools dedup \
        --output output.nodups.pairs.gz \
        --output-dups output.dups.pairs.gz \
        --output-unmapped output.unmapped.pairs.gz \
        --output-stats output.dedup.stats
    
  • Use recommended compression for efficient storage and processing

    .sam, .pairs and .pairsam files are text-based formats that are inefficient to store and slow to process. Pairtools recognizes the .bam, .gz and .lz4 file extensions and automatically compresses and decompresses files on the fly. Compression saves space and reduces I/O overhead at a relatively minor CPU cost.

  • Parallelize tasks and manage resources effectively for faster execution

    Each pairtools command has the CLI flags --nproc-in and --nproc-out to control the number of cores dedicated to input decompression and output compression. Additionally, pairtools sort parallelizes sorting with --nproc.
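
The parallelization flags mentioned above can be combined in a single pairtools sort call. The sketch below is one possible arrangement, not a recommended setting: the core counts and the run_sort helper are hypothetical, and the file names simply reuse those from the typical workflow. The helper prints the command when called with echo, so the flags can be reviewed before anything is run.

```shell
# Hypothetical core budget; adjust to the machine at hand.
SORT_CORES=8   # cores used by the sorting step itself
IO_CORES=2     # cores for input decompression and output compression

run_sort() {
    # "$@" is placed in front of the command:
    # pass `echo` for a dry run, or nothing to execute it.
    "$@" pairtools sort \
        --nproc "$SORT_CORES" \
        --nproc-in "$IO_CORES" \
        --nproc-out "$IO_CORES" \
        -o output.sorted.pairs.gz \
        output.pairs.gz
}

# Dry run: show the full command line without executing pairtools.
dry_run=$(run_sort echo)
echo "$dry_run"
```

Calling run_sort with no arguments executes the command instead of printing it.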

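Compressed files stay easy to inspect with standard Unix tools, so keeping intermediate results in .gz form costs little convenience. The sketch below builds a tiny, hypothetical .pairs file (the header and records are illustrative, not real data), compresses it with gzip, and reads it back through a pipe without writing a decompressed copy to disk.

```shell
workdir=$(mktemp -d)
pairs="$workdir/example.pairs"

# Tiny hypothetical .pairs file: two header lines followed by two records.
{
    printf '## pairs format v1.0\n'
    printf '#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type\n'
    printf 'read1\tchr1\t100\tchr1\t5000\t+\t-\tUU\n'
    printf 'read2\tchr1\t200\tchr2\t300\t-\t+\tUU\n'
} > "$pairs"

# Compress to .gz, one of the extensions pairtools recognizes.
gzip -c "$pairs" > "$pairs.gz"

# Inspect the compressed file by streaming it through gzip -dc.
n_header=$(gzip -dc "$pairs.gz" | grep -c '^#')
n_records=$(gzip -dc "$pairs.gz" | grep -vc '^#')
echo "header lines: $n_header, records: $n_records"

rm -rf "$workdir"
```
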
Advanced Workflows

For more advanced workflows, please check the following projects:

  • Distiller-nf is a feature-rich Open2C Hi-C processing pipeline for the Nextflow workflow manager.

  • Distiller-sm is a similarly feature-rich and optimized pipeline implemented in Snakemake.