Pairtools benchmarking

Welcome to pairtools benchmarking. This page describes how to test the performance of different software for mapping Hi-C and Hi-C-like data. Mapping usually produces a file of mapped pairs, which is then converted into a binned matrix format. The pairs format is the “rawest” interpretable type of data after the reads themselves.
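To make the pairs format concrete, here is a minimal sketch of how one record of a .pairs file is laid out: tab-separated columns with read ID, both mapped positions, strands, and the pair type. The record below is made up for illustration, not taken from the benchmark data.

```python
# One (made-up) body line of a .pairs file: tab-separated columns.
line = "readid1\tchr1\t10000\tchr1\t20000\t+\t-\tUU"

fields = line.split("\t")
pair = {
    "read_id": fields[0],
    "chrom1": fields[1], "pos1": int(fields[2]),
    "chrom2": fields[3], "pos2": int(fields[4]),
    "strand1": fields[5], "strand2": fields[6],
    "pair_type": fields[7],  # UU = both sides uniquely mapped
}
print(pair["chrom1"], pair["pos1"], pair["pos2"])  # chr1 10000 20000
```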

Reviewing the literature suggests that there are at least 6 methods to map Hi-C and Hi-C-like data. These include: pairtools, Chromap, HiC-Pro, FAN-C, Juicer, and HiCExplorer.

We benchmark these programs on one million representative reads, taken from a random replicate of Rao SSP et al., “Cohesin Loss Eliminates All Loop Domains.”, Cell, 2017 Oct 5;171(2):305-320.e24. Generally, it is useful to estimate how much computational time you need per million reads. Once you have this estimate, you can multiply it by your library size (in millions of reads), because we expect the computational cost of read mapping to grow linearly with library size.
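The linear extrapolation above amounts to a one-line calculation. The 3-minute figure below is an assumed example, not a measured value from this benchmark:

```python
# Back-of-the-envelope extrapolation: if mapping scales linearly with the
# number of reads, total time ≈ (time per 1 mln reads) × (library size in mln).
minutes_per_mln = 3.0    # assumed time for 1 mln reads on your machine
library_size_mln = 400   # e.g. a 400-million-read Hi-C library

total_hours = minutes_per_mln * library_size_mln / 60
print(f"{total_hours:.0f} hours")  # 20 hours
```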

The benchmarking consists of four general steps. If you want to reproduce it, run steps 1 and 2 manually to create the working environment, then use the Snakemake script to run the benchmarks. You may use the commands from the “3. Run” section to understand how each individual framework works and which parameters can be changed. Note that Juicer needs a separate run with a single value of --ncores, because it does not support parallel launches (it writes to the default output location). Finally, there is a visualization section displaying all the results that we calculated on our machines.

  1. Install software
  2. Download data and genome.
  3. Run
  4. Visualize benchmarks

1. Install software

We will use separate conda environments to install the different utilities. Each utility will have its own environment and path to the binaries.

[ ]:
%%bash
mkdir ./soft

pairtools

pairtools v1.0.0

[ ]:
%%bash
conda create -y --prefix soft/pairtools1.0.0 python=3.9 pip
conda activate soft/pairtools1.0.0
pip install cython
pip install git+https://github.com/open2c/pairtools.git@pre0.4.0
conda install -y -c bioconda "bwa>=0.7.17"

bwa-mem2

[ ]:
%%bash
conda activate soft/pairtools1.0.0

# bwa-mem2: compile from source (not recommended for general users)

# Get the source
git clone --recursive https://github.com/bwa-mem2/bwa-mem2 soft/bwa-mem2
cd soft/bwa-mem2

# Compile
make

# Exit compilation folder
cd ../../

chromap

[ ]:
%%bash
conda create -y --prefix soft/chromap
conda activate soft/chromap
conda install -y -c bioconda -c conda-forge chromap

HiC-Pro

HiC-Pro is a popular software package for Hi-C mapping; it is now part of the nf-core Hi-C pipeline and supports both fragment-based analysis of Hi-C and fragment-free analysis of DNase-based Hi-C.

[ ]:
%%bash
git clone https://github.com/nservant/HiC-Pro.git soft/HiC-Pro_env/HiC-Pro
conda env create -f soft/HiC-Pro_env/HiC-Pro/environment.yml -p soft/HiC-Pro_env
### Working environment will be soft/HiC-Pro_env

conda activate soft/HiC-Pro_env

# Install dependencies
conda install -y -c bioconda bowtie2 samtools pysam numpy scipy bx-python
conda install -y -c r r r-rcolorbrewer r-ggplot2

# Copy prepared config:
cp configs/config-hicpro_install.txt soft/HiC-Pro_env/HiC-Pro/config-install.txt

# Configure and install:
cd soft/HiC-Pro_env/HiC-Pro
make configure
make install

cd ../../../

# Retain only data processing steps with no creating of maps:
sed -i "s/all : init mapping proc_hic merge_persample hic_qc build_raw_maps ice_norm/all : init mapping proc_hic merge_persample #hic_qc build_raw_maps ice_norm/" soft/HiC-Pro_env/HiC-Pro/scripts/Makefile
[ ]:
%%bash
# Note that the configs should be adjusted for your system:
cp configs/config-hicpro_install.txt soft/HiC-Pro_env/HiC-Pro/config-install.txt
cp configs/config-hicpro.txt soft/HiC-Pro_env/HiC-Pro/config-hicpro.txt

FAN-C

[ ]:
%%bash
conda create -y --prefix soft/fanc python=3.9 pip hdf5
conda activate soft/fanc
pip install fanc
conda install -y -c bioconda samtools

Juicer

[ ]:
%%bash

conda create -y --prefix soft/juicer
conda activate soft/juicer

conda install -y -c bioconda bwa java-jdk
conda install -y -c conda-forge coreutils

# Download the recommended stable version:
wget https://github.com/aidenlab/juicer/archive/refs/tags/1.6.zip
unzip 1.6.zip
rm 1.6.zip
mv juicer-1.6 soft/juicer-1.6

# Copy the scripts to an accessible location:
mkdir -p soft/juicer-1.6/CPU/scripts/common/
cp -r soft/juicer-1.6/CPU/[^s]* soft/juicer-1.6/CPU/scripts/

# Download the compiled jar file of the stable version:
wget http://hicfiles.tc4ga.com.s3.amazonaws.com/public/juicer/juicer_tools.1.6.2_jcuda.0.7.5.jar
mv juicer_tools.1.6.2_jcuda.0.7.5.jar soft/juicer-1.6/CPU/scripts/common/juicer_tools.jar

HiCExplorer

[ ]:
%%bash

conda create -y --prefix soft/hicexplorer python=3.9
conda activate soft/hicexplorer
conda install -y -c bioconda hicexplorer bwa

2. Download data and genome

[ ]:
%%bash

mkdir data

Download raw data

Test data from Rao et al. 2017, 1 million read pairs:

[ ]:
%%bash

# Subset the first 1 mln read pairs with fastq-dump (from sra-tools):
fastq-dump -O data --gzip --split-files SRR6107789 --minSpotId 0 --maxSpotId 1000000
[ ]:
%%bash

# Put the data in accessible folder for juicer:
mkdir -p data/4juicer/fastq/
mkdir -p data/4juicer/splits/
cp data/SRR6107789_1.fastq.gz data/4juicer/fastq/SRR6107789_R1.fastq.gz
cp data/SRR6107789_2.fastq.gz data/4juicer/fastq/SRR6107789_R2.fastq.gz
cp data/4juicer/fastq/* data/4juicer/splits/
[ ]:
%%bash

# Put the data in accessible folder for HiC-Pro:
mkdir -p soft/HiC-Pro_env/HiC-Pro/rawdata/sample1
cp data/S*fastq.gz soft/HiC-Pro_env/HiC-Pro/rawdata/sample1/

Install genome

Genomepy installation

Genomepy will install the genome FASTA and build bwa and bowtie2 indexes:

[ ]:
%%bash

# Activate bwa plugin for genomepy:
genomepy plugin enable bwa bowtie2
[ ]:
%%bash

# Install hg38 genome by genomepy:
genomepy install hg38 -g data/
[ ]:
%%bash

# Restrict the genome:
cooler digest data/hg38/hg38.fa.sizes data/hg38/hg38.fa DpnII --rel-ids 1 -o data/hg38/hg38.DpnII.bed
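Conceptually, `cooler digest` performs an in-silico digestion: it cuts the genome at every occurrence of the enzyme's recognition sequence (GATC for DpnII) to produce restriction fragments. A toy sketch of the idea, on a made-up sequence rather than the hg38 FASTA:

```python
import re

# Toy in-silico digestion: find every GATC site and derive fragment bounds.
seq = "AACGATCTTTTGATCAAGATC"
cut_sites = [m.start() for m in re.finditer("GATC", seq)]

# Fragment boundaries: sequence start, each cut site, sequence end.
bounds = [0] + cut_sites + [len(seq)]
fragments = list(zip(bounds[:-1], bounds[1:]))
print(cut_sites, fragments)
# [3, 11, 17] [(0, 3), (3, 11), (11, 17), (17, 21)]
```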

Build genome index: bwa-mem2

[ ]:
%%bash
mkdir data/hg38/index/bwa-mem2/
soft/bwa-mem2/bwa-mem2 index -p data/hg38/index/bwa-mem2/hg38 data/hg38/hg38.fa

Build genome index: chromap

[ ]:
%%bash
mkdir data/hg38/index/chromap
chromap -i -r data/hg38/hg38.fa -o data/hg38/index/chromap/hg38

3. Run

Benchmarking is usually cumbersome, but it can be simplified with Snakemake. We provide a Snakemake pipeline that allows you to benchmark the different approaches.

The output of snakemake will consist of the resulting Hi-C pairs/maps in the output folder and benchmarking files in the benchmarks folder. The file names encode the run parameters:
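For instance, a benchmark file name can be split on dots to recover the tool and core count; the name below is a hypothetical example (the actual pattern is defined in the Snakefile), but the field order matches the `get_params` helper used in the visualization section:

```python
# Hypothetical benchmark file name: dots separate the encoded parameters.
filename = "benchmarks/result.chromap.4.tsv"

_, util, ncores, _ = filename.split(".")
print(util, int(ncores))  # chromap 4
```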

[ ]:
%%bash

# Running
snakemake --cores 10
[ ]:
%%bash

# Cleanup
rm output/*; rm benchmarks/*

Manual run

You may also run the commands below to test individual steps of the pipeline.

pairtools

[ ]:
%%bash

soft/bwa-mem2/bwa-mem2 mem -t 5 -SP data/hg38/index/bwa-mem2/hg38 data/SRR6107789_1.fastq.gz data/SRR6107789_2.fastq.gz | \
    soft/pairtools1.0.0/bin/pairtools parse --nproc-in 5 --nproc-out 5 --drop-sam --drop-seq -c data/hg38/hg38.fa.sizes | \
    soft/pairtools1.0.0/bin/pairtools sort --nproc 5 | \
    soft/pairtools1.0.0/bin/pairtools dedup -p 5 --backend cython \
    -o output/result.pairtools.pairs
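The last step of this pipeline, `pairtools dedup`, removes PCR/optical duplicates: pairs whose two sides map to the same positions and strands are kept only once. A toy sketch of that idea on made-up pairs:

```python
# Toy duplicate removal: keep the first occurrence of each
# (chrom1, pos1, strand1, chrom2, pos2, strand2) combination.
pairs = [
    ("chr1", 100, "+", "chr1", 5000, "-"),
    ("chr1", 100, "+", "chr1", 5000, "-"),  # exact duplicate
    ("chr2", 300, "-", "chr2", 8000, "+"),
]

seen = set()
deduped = [p for p in pairs if not (p in seen or seen.add(p))]
print(len(deduped))  # 2 unique pairs
```

Note that the real tool also tolerates small positional offsets (its `--max-mismatch` option), which this sketch does not model.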

chromap

[ ]:
%%bash

soft/chromap/bin/chromap --preset hic --low-mem \
    -t 5 -x data/hg38/index/chromap/hg38 -r data/hg38/hg38.fa \
    -1 data/SRR6107789_1.fastq.gz -2 data/SRR6107789_2.fastq.gz -o output/result.chromap.pairs

HiC-Pro

[ ]:
%%bash

cd soft/HiC-Pro_env/HiC-Pro
bin/HiC-Pro -i rawdata/ -o output -c config-hicpro.txt

cd ../../../

FAN-C

Based on CLI tutorial:

[ ]:
%%bash

fanc map -t 5 data/SRR6107789_1.fastq.gz data/hg38/index/bwa/hg38.fa output/fanc-output_1.bam
fanc map -t 5 data/SRR6107789_2.fastq.gz data/hg38/index/bwa/hg38.fa output/fanc-output_2.bam
samtools sort -@ 5 -n output/fanc-output_1.bam -o output/fanc-output_1.sorted.bam
samtools sort -@ 5 -n output/fanc-output_2.bam -o output/fanc-output_2.sorted.bam
fanc pairs output/fanc-output_1.sorted.bam output/fanc-output_2.sorted.bam output/fanc-output.pairs -g data/hg38/hg38.DpnII.bed

Juicer

[ ]:
%%bash

soft/juicer-1.6/CPU/juicer.sh -g hg38 -d data/4juicer/ -s DpnII -S early -p data/hg38/hg38.fa.sizes -y data/hg38/hg38.DpnII.bed -z data/hg38/index/bwa/hg38.fa -t 5 -D soft/juicer-1.6/CPU

HiCExplorer

Based on the example: https://hicexplorer.readthedocs.io/en/latest/content/example_usage.html

Note that it does not produce pairs, but binned cooler files.
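The binning that such tools perform can be sketched as aggregating pairs directly into fixed-size genomic bins instead of emitting a pairs file. A toy example with a 1000-bp bin size on a single chromosome (positions made up):

```python
import collections

bin_size = 1000
pairs = [(1500, 3500), (1700, 3900), (200, 2100)]  # (pos1, pos2), made up

# Map each position to its bin index and count contacts per bin pair.
counts = collections.Counter(
    (p1 // bin_size, p2 // bin_size) for p1, p2 in pairs
)
print(dict(counts))  # {(1, 3): 2, (0, 2): 1}
```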

[ ]:
%%bash

hicBuildMatrix --samFiles \
    <(bwa mem -t 4 -A1 -B4 -E50 -L0 data/hg38/index/bwa/hg38.fa data/SRR6107789_1.fastq.gz | samtools view -Shb -) \
    <(bwa mem -t 4 -A1 -B4 -E50 -L0 data/hg38/index/bwa/hg38.fa data/SRR6107789_2.fastq.gz | samtools view -Shb -) \
     --restrictionSequence GATC \
     --danglingSequence GATC \
     --restrictionCutFile data/hg38/hg38.DpnII.bed \
     --threads 4 \
     --inputBufferSize 100000 \
     --QCfolder hicexplorer_tmp \
     -o hicexplorer_output.cool

4. Visualize benchmarks

[1]:
# Check the CPU properties:
import psutil
print(f"{psutil.cpu_count()} CPUs at {psutil.cpu_freq().current:.0f} MHz")
36 CPUs at 2715 MHz
[2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

mpl.rcParams['font.family'] = "sans-serif"
figsize_A4 = np.array([11.69, 8.27])
plt.rcParams["figure.figsize"] = figsize_A4.T
plt.rcParams['figure.facecolor']='white'
plt.rcParams['font.size']=16

import glob
[9]:
# If you start from the .csv file:
timings = pd.read_csv('benchmarking_1mln.csv', index_col=0)
[4]:
# If you start from your own benchmarks:
files = glob.glob("benchmarks/*")
print(len(files))
24
[4]:
def get_params(filename):
    split = filename.split('.')
    util= split[1]
    ncores = int(split[2])

    return util, ncores

timings = []
for f in files:
    t = pd.read_table(f)
    t[['util', 'ncores']] = get_params(f)
    timings.append(t)
timings = pd.concat(timings)
[10]:
timings.head()
[10]:
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time util ncores
0 171.2557 0:02:51 18999.14 20579.61 18990.64 18991.51 14942.02 61.25 81.02 140.18 chromap 1
1 161.8126 0:02:41 19010.55 20589.94 19002.19 19003.05 29831.66 88.35 72.64 122.87 chromap 1
2 147.2525 0:02:27 19251.87 20844.66 19244.98 19245.37 41893.51 176.68 89.18 139.78 chromap 1
3 160.3510 0:02:40 19010.16 20589.94 19004.14 19004.56 56673.84 265.02 74.24 132.19 chromap 1
4 152.5034 0:02:32 19141.59 20730.08 19134.09 19134.47 71454.04 353.35 83.52 144.57 chromap 1
[17]:
df = timings.sort_values(['ncores', 'util'])
df.loc[:, "max_rss_gb"] = df.loc[:, "max_rss"]/1024
df.loc[:, "min"] = df.loc[:, "s"]/60
[101]:
labels = ['chromap', 'pairtools_bwamem2', 'pairtools', 'juicer', 'hicpro', 'hicexplorer', 'fanc_bwa', 'fanc_bowtie2']
labels_mod = ['Chromap', 'bwa-mem2 + pairtools', 'bwa mem + pairtools', 'Juicer', 'HiC-Pro', 'HiCExplorer*', 'bwa mem + FANC', 'bowtie2 + FANC']
[102]:
df.query(f'ncores==2 and util=="chromap"')
[102]:
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load cpu_time util ncores max_rss_gb min
0 94.1938 0:01:34 19068.07 20678.57 19047.60 19052.58 14768.99 0.02 119.14 113.62 chromap 2 18.621162 1.569897
1 77.7396 0:01:17 19045.91 20665.29 19025.18 19029.89 14768.99 88.35 112.30 92.39 chromap 2 18.599521 1.295660
2 77.1351 0:01:17 19045.99 20665.54 19025.28 19029.69 14768.99 176.68 113.29 98.83 chromap 2 18.599600 1.285585
3 77.6121 0:01:17 19046.05 20665.54 19025.38 19029.77 14769.00 265.02 113.53 105.77 chromap 2 18.599658 1.293535
4 77.5393 0:01:17 19045.89 20665.54 19025.11 19029.47 14769.00 353.35 112.50 111.44 chromap 2 18.599502 1.292322
[104]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True)

cmap = ['#3E9ADE', '#EF242B', '#9FC741']
ncores_order = [4, 2, 1]

style_dict = dict(
        orient='h',
        palette=cmap,
        #edgecolor="k",
        linewidth=0.0,
        errwidth=1.0,
        errcolor='k',
        capsize=0.07)

ax = axes[0]

# Plot barplot:
splot = sns.barplot(
        x="min",
        y="util",
        data=df.sort_values('util'),
        order=labels,
        hue='ncores',
        hue_order=ncores_order,
        ax=ax,
        **style_dict
)

# Add text, slowdown over chromap
for icore, ncores in enumerate(ncores_order):
    for ilabels, util in enumerate(labels):
        if util=="chromap":
            continue

        df_reference = df.query(f'ncores=={ncores} and util=="chromap"')
        mean_reference = np.mean(df_reference['min'].values)
        df_target = df.query(f'ncores=={ncores} and util=="{util}"')
        mean_target = np.mean(df_target['min'].values)

        slowdown = mean_target / mean_reference

        w = splot.patches[0].get_height()

        splot.text( s=f"x{slowdown:.1f}",
                   x=mean_target+4, y=ilabels + (icore-1)*w,
                   ha = 'left', va = 'center', fontsize=11, weight='bold')

ax.set_ylabel('')
ax.set_xlabel('Time (min)')
ax.set_yticklabels(labels_mod)
ax.set_axisbelow(True)
ax.grid(which='both', axis='x', color='k')
#ax.set_xscale('log')
ax.set_xlim([0, 95])
ax.set_xticks(np.arange(0, 96, 5), minor=True)
ax.grid(which='minor', axis='x', alpha=0.2, color='k')
ax.get_legend().remove()

ax = axes[1]
sns.barplot(
        x="max_rss_gb",
        y="util",
        data=df.sort_values('util'),
        order=labels,
        hue='ncores',
        hue_order=ncores_order,
        ax=ax,
        **style_dict)

ax.set_ylabel('')
ax.set_xlabel('Maximum Resident Set Size (GB)')
ax.set_yticklabels(labels_mod)
ax.set_axisbelow(True)
ax.grid(which='both', axis='x', color='k')
ax.set_xticks(np.arange(0, 26, 1), minor=True)
ax.grid(which='minor', axis='x', alpha=0.2, color='k')

fig.suptitle('Benchmark of different Hi-C mapping tools for 1 mln reads (5 iterations)', y=0.99)

# (x, y, width, height)
bb = (fig.subplotpars.left, fig.subplotpars.top+0.002, fig.subplotpars.right-fig.subplotpars.left, 0.2)
ax.legend(bbox_to_anchor=bb, title="Number of cores", loc="lower right", ncol=3, borderaxespad=0., bbox_transform=fig.transFigure, frameon=False)

plt.savefig("benchmarking_1mln.pdf")
[10]:
# df.to_csv('benchmarking_1mln.csv')