Pairtools benchmarking

Welcome to pairtools benchmarking. This page describes how to test the performance of different software for mapping Hi-C and Hi-C-like data. Mapping usually produces a file of mapped pairs, which is then converted into a binned matrix format. The pairs format is the “rawest” interpretable type of data after the reads themselves.
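To make the pairs format concrete, here is a minimal sketch of how one record of a .pairs file is laid out: tab-separated columns with read ID, both mapped positions, strands, and the pair type. The record below is made up for illustration, not taken from the benchmark data.

```python
# One (made-up) body line of a .pairs file: tab-separated columns.
line = "readid1\tchr1\t10000\tchr1\t20000\t+\t-\tUU"

fields = line.split("\t")
pair = {
    "read_id": fields[0],
    "chrom1": fields[1], "pos1": int(fields[2]),
    "chrom2": fields[3], "pos2": int(fields[4]),
    "strand1": fields[5], "strand2": fields[6],
    "pair_type": fields[7],  # UU = both sides uniquely mapped
}
print(pair["chrom1"], pair["pos1"], pair["pos2"])  # chr1 10000 20000
```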

Reviewing the literature suggests that there are at least 6 methods to map Hi-C and Hi-C-like data. These include: pairtools, Chromap, HiC-Pro, FAN-C, Juicer, and HiCExplorer.

We benchmark these programs on one million representative reads, taken from a random replicate of Rao SSP et al., “Cohesin Loss Eliminates All Loop Domains.”, Cell, 2017 Oct 5;171(2):305-320.e24. Generally, it is useful to estimate how much computational time you need per million reads. Once you have this estimate, you can multiply it by your library size (in millions of reads), because we expect the computational cost of read mapping to grow linearly with library size.
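The linear extrapolation above amounts to a one-line calculation. The 3-minute figure below is an assumed example, not a measured value from this benchmark:

```python
# Back-of-the-envelope extrapolation: if mapping scales linearly with the
# number of reads, total time ≈ (time per 1 mln reads) × (library size in mln).
minutes_per_mln = 3.0    # assumed time for 1 mln reads on your machine
library_size_mln = 400   # e.g. a 400-million-read Hi-C library

total_hours = minutes_per_mln * library_size_mln / 60
print(f"{total_hours:.0f} hours")  # 20 hours
```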

The benchmarking consists of four general steps. If you want to reproduce it, run steps 1 and 2 manually to create the working environment, then use the Snakemake script to run the benchmarks. You may use the commands from the “3. Run” section to understand how each individual framework works and which parameters can be changed. Note that Juicer needs a separate run with a single value of --ncores, because it does not support parallel launches (it writes to the default output location). Finally, there is a visualization section displaying all the results that we calculated on our machines.

  1. Install software
  2. Download data and genome.
  3. Run
  4. Visualize benchmarks

1. Install software

We will use separate conda environments to install the different utilities. Each utility will have its own environment and path to the binaries.

[ ]:
%%bash
mkdir ./soft

pairtools

pairtools v1.0.0

[ ]:
%%bash
conda create -y --prefix soft/pairtools1.0.0 python=3.9 pip
conda activate soft/pairtools1.0.0
pip install cython
pip install git+https://github.com/open2c/pairtools.git@pre0.4.0
conda install -y -c bioconda "bwa>=0.7.17"

bwa-mem2

[ ]:
%%bash
conda activate soft/pairtools1.0.0

# bwa-mem2: compile from source (not recommended for general users)

# Get the source
git clone --recursive https://github.com/bwa-mem2/bwa-mem2 soft/bwa-mem2
cd soft/bwa-mem2

# Compile
make

# Exit compilation folder
cd ../../

chromap

[ ]:
%%bash
conda create -y --prefix soft/chromap
conda activate soft/chromap
conda install -y -c bioconda -c conda-forge chromap

HiC-Pro

HiC-Pro is a popular software package for Hi-C mapping; it is now part of the nf-core Hi-C pipeline and supports both fragment-based analysis of Hi-C and fragment-free analysis of DNase-based Hi-C.

[ ]:
%%bash
git clone https://github.com/nservant/HiC-Pro.git soft/HiC-Pro_env/HiC-Pro
conda env create -f soft/HiC-Pro_env/HiC-Pro/environment.yml -p soft/HiC-Pro_env
### Working environment will be soft/HiC-Pro_env

conda activate soft/HiC-Pro_env

# Install dependencies
conda install -y -c bioconda bowtie2 samtools pysam numpy scipy bx-python
conda install -y -c r r r-rcolorbrewer r-ggplot2

# Copy prepared config:
cp configs/config-hicpro_install.txt soft/HiC-Pro_env/HiC-Pro/config-install.txt

# Configure and install:
cd soft/HiC-Pro_env/HiC-Pro
make configure
make install

cd ../../../

# Retain only data processing steps with no creating of maps:
sed -i "s/all : init mapping proc_hic merge_persample hic_qc build_raw_maps ice_norm/all : init mapping proc_hic merge_persample #hic_qc build_raw_maps ice_norm/" soft/HiC-Pro_env/HiC-Pro/scripts/Makefile
[ ]:
%%bash
# Note that the configs should be adjusted for your system:
cp configs/config-hicpro_install.txt soft/HiC-Pro_env/HiC-Pro/config-install.txt
cp configs/config-hicpro.txt soft/HiC-Pro_env/HiC-Pro/config-hicpro.txt

FAN-C

[ ]:
%%bash
conda create -y --prefix soft/fanc python=3.9 pip hdf5
conda activate soft/fanc
pip install fanc
conda install -y -c bioconda samtools

Juicer

[ ]:
%%bash

conda create -y --prefix soft/juicer
conda activate soft/juicer

conda install -y -c bioconda bwa java-jdk
conda install -y -c conda-forge coreutils

# Download the recommended stable version:
wget https://github.com/aidenlab/juicer/archive/refs/tags/1.6.zip
unzip 1.6.zip
rm 1.6.zip
mv juicer-1.6 soft/juicer-1.6

# Copy the scripts to an accessible location:
mkdir -p soft/juicer-1.6/CPU/scripts/common/
cp -r soft/juicer-1.6/CPU/[^s]* soft/juicer-1.6/CPU/scripts/

# Download the compiled jar file of the stable version:
wget http://hicfiles.tc4ga.com.s3.amazonaws.com/public/juicer/juicer_tools.1.6.2_jcuda.0.7.5.jar
mv juicer_tools.1.6.2_jcuda.0.7.5.jar soft/juicer-1.6/CPU/scripts/common/juicer_tools.jar

HiCExplorer

[ ]:
%%bash

conda create -y --prefix soft/hicexplorer python=3.9
conda activate soft/hicexplorer
conda install -y -c bioconda hicexplorer bwa

2. Download data and genome

[ ]:
%%bash

mkdir data

Download raw data

Test data from Rao et al. 2017, 1 million read pairs:

[ ]:
%%bash

# Subset the first 1 mln read pairs with fastq-dump (from sra-tools):
fastq-dump -O data --gzip --split-files SRR6107789 --minSpotId 0 --maxSpotId 1000000
[ ]:
%%bash

# Put the data in accessible folder for juicer:
mkdir -p data/4juicer/fastq/
mkdir -p data/4juicer/splits/
cp data/SRR6107789_1.fastq.gz data/4juicer/fastq/SRR6107789_R1.fastq.gz
cp data/SRR6107789_2.fastq.gz data/4juicer/fastq/SRR6107789_R2.fastq.gz
cp data/4juicer/fastq/* data/4juicer/splits/
[ ]:
%%bash

# Put the data in accessible folder for HiC-Pro:
mkdir -p soft/HiC-Pro_env/HiC-Pro/rawdata/sample1
cp data/S*fastq.gz soft/HiC-Pro_env/HiC-Pro/rawdata/sample1/

Install genome

Genomepy installation

Genomepy will install the genome FASTA and build bwa and bowtie2 indexes:

[ ]:
%%bash

# Activate bwa plugin for genomepy:
genomepy plugin enable bwa bowtie2
[ ]:
%%bash

# Install hg38 genome by genomepy:
genomepy install hg38 -g data/
[ ]:
%%bash

# Restrict the genome:
cooler digest data/hg38/hg38.fa.sizes data/hg38/hg38.fa DpnII --rel-ids 1 -o data/hg38/hg38.DpnII.bed
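Conceptually, `cooler digest` performs an in-silico digestion: it cuts the genome at every occurrence of the enzyme's recognition sequence (GATC for DpnII) to produce restriction fragments. A toy sketch of the idea, on a made-up sequence rather than the hg38 FASTA:

```python
import re

# Toy in-silico digestion: find every GATC site and derive fragment bounds.
seq = "AACGATCTTTTGATCAAGATC"
cut_sites = [m.start() for m in re.finditer("GATC", seq)]

# Fragment boundaries: sequence start, each cut site, sequence end.
bounds = [0] + cut_sites + [len(seq)]
fragments = list(zip(bounds[:-1], bounds[1:]))
print(cut_sites, fragments)
# [3, 11, 17] [(0, 3), (3, 11), (11, 17), (17, 21)]
```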

Build genome index: bwa-mem2

[ ]:
%%bash
mkdir data/hg38/index/bwa-mem2/
soft/bwa-mem2/bwa-mem2 index -p data/hg38/index/bwa-mem2/hg38 data/hg38/hg38.fa

Build genome index: chromap

[ ]:
%%bash
mkdir data/hg38/index/chromap
chromap -i -r data/hg38/hg38.fa -o data/hg38/index/chromap/hg38

3. Run

Benchmarking is usually cumbersome, but it can be simplified with Snakemake. We provide a Snakemake pipeline that allows you to benchmark the different approaches.

The output of snakemake will consist of the resulting Hi-C pairs/maps in the output folder and benchmarking files in the benchmarks folder. The file names encode the run parameters:
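For instance, a benchmark file name can be split on dots to recover the tool and core count; the name below is a hypothetical example (the actual pattern is defined in the Snakefile), but the field order matches the `get_params` helper used in the visualization section:

```python
# Hypothetical benchmark file name: dots separate the encoded parameters.
filename = "benchmarks/result.chromap.4.tsv"

_, util, ncores, _ = filename.split(".")
print(util, int(ncores))  # chromap 4
```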

[ ]:
%%bash

# Running
snakemake --cores 10
[ ]:
%%bash

# Cleanup
rm output/*; rm benchmarks/*

Manual run

You may also run the commands below to test individual steps of the pipeline.

pairtools

[ ]:
%%bash

soft/bwa-mem2/bwa-mem2 mem -t 5 -SP data/hg38/index/bwa-mem2/hg38 data/SRR6107789_1.fastq.gz data/SRR6107789_2.fastq.gz | \
    soft/pairtools1.0.0/bin/pairtools parse --nproc-in 5 --nproc-out 5 --drop-sam --drop-seq -c data/hg38/hg38.fa.sizes | \
    soft/pairtools1.0.0/bin/pairtools sort --nproc 5 | \
    soft/pairtools1.0.0/bin/pairtools dedup -p 5 --backend cython \
    -o output/result.pairtools.pairs
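The last step of this pipeline, `pairtools dedup`, removes PCR/optical duplicates: pairs whose two sides map to the same positions and strands are kept only once. A toy sketch of that idea on made-up pairs:

```python
# Toy duplicate removal: keep the first occurrence of each
# (chrom1, pos1, strand1, chrom2, pos2, strand2) combination.
pairs = [
    ("chr1", 100, "+", "chr1", 5000, "-"),
    ("chr1", 100, "+", "chr1", 5000, "-"),  # exact duplicate
    ("chr2", 300, "-", "chr2", 8000, "+"),
]

seen = set()
deduped = [p for p in pairs if not (p in seen or seen.add(p))]
print(len(deduped))  # 2 unique pairs
```

Note that the real tool also tolerates small positional offsets (its `--max-mismatch` option), which this sketch does not model.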

chromap

[ ]:
%%bash

soft/chromap/bin/chromap --preset hic --low-mem \
    -t 5 -x data/hg38/index/chromap/hg38 -r data/hg38/hg38.fa \
    -1 data/SRR6107789_1.fastq.gz -2 data/SRR6107789_2.fastq.gz -o output/result.chromap.pairs

HiC-Pro

[ ]:
%%bash

cd soft/HiC-Pro_env/HiC-Pro
bin/HiC-Pro -i rawdata/ -o output -c config-hicpro.txt

cd ../../../

FAN-C

Based on CLI tutorial:

[ ]:
%%bash

fanc map -t 5 data/SRR6107789_1.fastq.gz data/hg38/index/bwa/hg38.fa output/fanc-output_1.bam
fanc map -t 5 data/SRR6107789_2.fastq.gz data/hg38/index/bwa/hg38.fa output/fanc-output_2.bam
samtools sort -@ 5 -n output/fanc-output_1.bam -o output/fanc-output_1.sorted.bam
samtools sort -@ 5 -n output/fanc-output_2.bam -o output/fanc-output_2.sorted.bam
fanc pairs output/fanc-output_1.sorted.bam output/fanc-output_2.sorted.bam output/fanc-output.pairs -g data/hg38/hg38.DpnII.bed

Juicer

[ ]:
%%bash

soft/juicer-1.6/CPU/juicer.sh -g hg38 -d data/4juicer/ -s DpnII -S early -p data/hg38/hg38.fa.sizes -y data/hg38/hg38.DpnII.bed -z data/hg38/index/bwa/hg38.fa -t 5 -D soft/juicer-1.6/CPU

HiCExplorer

Based on the example: https://hicexplorer.readthedocs.io/en/latest/content/example_usage.html

Note that it does not produce pairs, but binned cooler files.
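The binning that such tools perform can be sketched as aggregating pairs directly into fixed-size genomic bins instead of emitting a pairs file. A toy example with a 1000-bp bin size on a single chromosome (positions made up):

```python
import collections

bin_size = 1000
pairs = [(1500, 3500), (1700, 3900), (200, 2100)]  # (pos1, pos2), made up

# Map each position to its bin index and count contacts per bin pair.
counts = collections.Counter(
    (p1 // bin_size, p2 // bin_size) for p1, p2 in pairs
)
print(dict(counts))  # {(1, 3): 2, (0, 2): 1}
```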

[ ]:
%%bash

hicBuildMatrix --samFiles \
    <(bwa mem -t 4 -A1 -B4 -E50 -L0 data/hg38/index/bwa/hg38.fa data/SRR6107789_1.fastq.gz | samtools view -Shb -) \
    <(bwa mem -t 4 -A1 -B4 -E50 -L0 data/hg38/index/bwa/hg38.fa data/SRR6107789_2.fastq.gz | samtools view -Shb -) \
     --restrictionSequence GATC \
     --danglingSequence GATC \
     --restrictionCutFile data/hg38/hg38.DpnII.bed \
     --threads 4 \
     --inputBufferSize 100000 \
     --QCfolder hicexplorer_tmp \
     -o hicexplorer_output.cool

4. Visualize benchmarks

[1]:
# Check the CPU properties:
import psutil
print(f"{psutil.cpu_count()} CPUs at {psutil.cpu_freq().current:.0f} MHz")
36 CPUs at 2715 MHz
[2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

mpl.rcParams['font.family'] = "sans-serif"
figsize_A4 = np.array([11.69, 8.27])
plt.rcParams["figure.figsize"] = figsize_A4.T
plt.rcParams['figure.facecolor']='white'
plt.rcParams['font.size']=16

import glob
[9]:
# If you start from the .csv file:
timings = pd.read_csv('benchmarking_1mln.csv', index_col=0)
[4]:
# If you start from your own benchmarks:
files = glob.glob("benchmarks/*")
print(len(files))
24
[4]:
def get_params(filename):
    split = filename.split('.')
    util= split[1]
    ncores = int(split[2])

    return util, ncores

timings = []
for f in files:
    t = pd.read_table(f)
    t[['util', 'ncores']] = get_params(f)
    timings.append(t)
timings = pd.concat(timings)
[10]:
timings.head()
[10]:
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time util ncores
0 171.2557 0:02:51 18999.14 20579.61 18990.64 18991.51 14942.02 61.25 81.02 140.18 chromap 1
1 161.8126 0:02:41 19010.55 20589.94 19002.19 19003.05 29831.66 88.35 72.64 122.87 chromap 1
2 147.2525 0:02:27 19251.87 20844.66 19244.98 19245.37 41893.51 176.68 89.18 139.78 chromap 1
3 160.3510 0:02:40 19010.16 20589.94 19004.14 19004.56 56673.84 265.02 74.24 132.19 chromap 1
4 152.5034 0:02:32 19141.59 20730.08 19134.09 19134.47 71454.04 353.35 83.52 144.57 chromap 1
[17]:
df = timings.sort_values(['ncores', 'util'])
df.loc[:, "max_rss_gb"] = df.loc[:, "max_rss"]/1024
df.loc[:, "min"] = df.loc[:, "s"]/60
[101]:
labels = ['chromap', 'pairtools_bwamem2', 'pairtools', 'juicer', 'hicpro', 'hicexplorer', 'fanc_bwa', 'fanc_bowtie2']
labels_mod = ['Chromap', 'bwa-mem2 + pairtools', 'bwa mem + pairtools', 'Juicer', 'HiC-Pro', 'HiCExplorer*', 'bwa mem + FANC', 'bowtie2 + FANC']
[102]:
df.query(f'ncores==2 and util=="chromap"')
[102]:
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time util ncores max_rss_gb min
0 94.1938 0:01:34 19068.07 20678.57 19047.60 19052.58 14768.99 0.02 119.14 113.62 chromap 2 18.621162 1.569897
1 77.7396 0:01:17 19045.91 20665.29 19025.18 19029.89 14768.99 88.35 112.30 92.39 chromap 2 18.599521 1.295660
2 77.1351 0:01:17 19045.99 20665.54 19025.28 19029.69 14768.99 176.68 113.29 98.83 chromap 2 18.599600 1.285585
3 77.6121 0:01:17 19046.05 20665.54 19025.38 19029.77 14769.00 265.02 113.53 105.77 chromap 2 18.599658 1.293535
4 77.5393 0:01:17 19045.89 20665.54 19025.11 19029.47 14769.00 353.35 112.50 111.44 chromap 2 18.599502 1.292322
[104]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)

cmap = ['#3E9ADE', '#EF242B', '#9FC741']
ncores_order = [4, 2, 1]

style_dict = dict(
        orient='h',
        palette=cmap,
        #edgecolor="k",
        linewidth=0.0,
        errwidth=1.0,
        errcolor='k',
        capsize=0.07)

ax = axes[0]

# Plot barplot:
splot = sns.barplot(
        x="min",
        y="util",
        data=df.sort_values('util'),
        order=labels,
        hue='ncores',
        hue_order=ncores_order,
        ax=ax,
        **style_dict
)

# Add text, slowdown over chromap
for icore, ncores in enumerate(ncores_order):
    for ilabels, util in enumerate(labels):
        if util=="chromap":
            continue

        df_reference = df.query(f'ncores=={ncores} and util=="chromap"')
        mean_reference = np.mean(df_reference['min'].values)
        df_target = df.query(f'ncores=={ncores} and util=="{util}"')
        mean_target = np.mean(df_target['min'].values)

        slowdown = mean_target / mean_reference

        w = splot.patches[0].get_height()

        splot.text( s=f"x{slowdown:.1f}",
                   x=mean_target+4, y=ilabels + (icore-1)*w,
                   ha = 'left', va = 'center', fontsize=11, weight='bold')

ax.set_ylabel('')
ax.set_xlabel('Time (min)')
ax.set_yticklabels(labels_mod)
ax.set_axisbelow(True)
ax.grid(which='both', axis='x', color='k')
#ax.set_xscale('log')
ax.set_xlim([0, 95])
ax.set_xticks(np.arange(0, 96, 5), minor=True)
ax.grid(which='minor', axis='x', alpha=0.2, color='k')
ax.get_legend().remove()

ax = axes[1]
sns.barplot(
        x="max_rss_gb",
        y="util",
        data=df.sort_values('util'),
        order=labels,
        hue='ncores',
        hue_order=ncores_order,
        ax=ax,
        **style_dict)

ax.set_ylabel('')
ax.set_xlabel('Maximum Resident Set Size (GB)')
ax.set_yticklabels(labels_mod)
ax.set_axisbelow(True)
ax.grid(which='both', axis='x', color='k')
ax.set_xticks(np.arange(0, 26, 1), minor=True)
ax.grid(which='minor', axis='x', alpha=0.2, color='k')

fig.suptitle('Benchmark of different Hi-C mapping tools for 1 mln reads (5 iterations)', y=0.99)

# (x, y, width, height)
bb = (fig.subplotpars.left, fig.subplotpars.top+0.002, fig.subplotpars.right-fig.subplotpars.left, 0.2)
ax.legend(bbox_to_anchor=bb, title="Number of cores", loc="lower right", ncol=3, borderaxespad=0., bbox_transform=fig.transFigure, frameon=False)

plt.savefig("benchmarking_1mln.pdf")
[10]:
# df.to_csv('benchmarking_1mln.csv')