Replay analysis pipeline
We provide a Nextflow pipeline for running most of the analyses in our manuscript. A few analyses were run outside of this pipeline, including the passenger analysis and the POTP analysis.
Reproduce analysis
Install Nextflow using the following command:
$ curl -s https://get.nextflow.io | bash
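Optionally, verify the launcher and move it somewhere on your PATH (the destination directory below is just a common choice):
# the installer drops a nextflow executable in the current directory
$ ./nextflow -version
$ sudo mv nextflow /usr/local/bin/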
Download Docker; there are several distributions packaged for various Linux flavors:
$ curl -fsSL https://get.docker.com -o get-docker.sh && sudo sh get-docker.sh
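Optionally, confirm that Docker is working:
$ docker run --rm hello-world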
Note: the Dockerfile contains all the required dependencies. Add -profile docker to the example command line shown below to enable containerised execution.
Launch the pipeline execution with the following command:
$ git clone git@github.com:matsengrp/gcreplay.git && cd gcreplay
$ nextflow run main.nf -profile docker -resume
Note that this pipeline is computationally intensive; we run it on a SLURM cluster using the configuration in nextflow.config.
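If you are on a SLURM cluster yourself, Nextflow can submit each process as a SLURM job. A minimal sketch, assuming SLURM is available on your system (in practice, prefer the settings in nextflow.config):
# illustrative: overrides the process executor from the command line
$ nextflow run main.nf -process.executor slurm -resume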
Pipeline parameters
- --ngs_manifest - Path to the NGS manifest CSV file containing sequencing run information. Default: ngs_manifest.csv
- --gc_metadata - Path to the germinal center metadata CSV file. Default: gc_metadata.csv
- --reads_prefix - Directory containing the NGS read files. Default: data/NGS-gz
- --results - Directory where pipeline results will be stored. Default: results/
- --plate_barcodes - File containing plate barcode sequences. Default: data/barcodes/plateBC.txt
- --well_barcodes - File containing 96-well plate barcode sequences. Default: data/barcodes/96FBC.txt
- --partis_anno_dir - Directory containing partis annotation germline data. Default: data/partis_annotation/germlines
- --hdag_sub - File containing HDAG substitution model data. Default: data/mutability/MK_RS5NF_substitution.csv
- --hdag_mut - File containing HDAG mutability model data. Default: data/mutability/MK_RS5NF_mutability.csv
- --chigy_hc_mut_rates - File containing chimeric gamma heavy chain mutation rates. Default: data/mutability/chigy_hc_mutation_rates_nt.csv
- --chigy_lc_mut_rates - File containing chimeric gamma light chain mutation rates. Default: data/mutability/chigy_lc_mutation_rates_nt.csv
- --pdb - PDB structure file for antibody analysis. Default: data/AbCGG_structure/combined_ch2_eh2-coot_IMGT.pdb
- --dms_vscores - File containing deep mutational scanning variant scores. Default: data/dms/final_variant_scores.csv
- --dms_sites - File containing naive sites information. Default: data/dms/CGGnaive_sites.csv
- --heavy_chain_motif - DNA motif sequence used to identify heavy chain reads. Default: aGCgACgGGaGTtCAcagACTGCAACCGGTGTACATTCC
- --light_chain_motif - DNA motif sequence used to identify light chain reads. Default: aGCgACgGGaGTtCAcagGTATACATGTTGCTGTGGTTGTCTG
- --igk_idx - Index position for immunoglobulin kappa chain processing. Default: 336
- --bcr_count_thresh - Minimum count threshold for B-cell receptor sequences. Default: 5
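Any of these defaults can be overridden on the command line. For example, to raise the BCR count threshold and redirect the output (the values here are illustrative):
$ nextflow run main.nf -profile docker --bcr_count_thresh 10 --results my-results/ -resume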
Pipeline steps
1. Trim and combine paired-end files.
2. Demultiplex both plates and wells.
3. Split heavy and light chain reads per well.
4. Collapse identical sequences, retaining the abundance rank of each sequence per well.
5. Prune to keep the top N sequences observed in each well (see the sketch below).
6. Merge the results, formatting them for partis annotation.
7. Run partis annotation.
8. Curate, clean, and merge heavy and light chains.
9. Infer gctree lineages using HDAG.
10. Merge the results.
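Conceptually, steps 4 and 5 amount to counting identical reads in a well and keeping only the most abundant ones. A rough shell sketch, not the pipeline's actual implementation (the file name and top-N value are illustrative, and it assumes unwrapped single-line FASTA sequences):
# count identical sequences, rank by abundance, keep the top 5
$ grep -v '^>' well.fasta | sort | uniq -c | sort -k1,1nr | head -n 5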
Beast pipeline
beast.nf runs BEAST (v1) on a set of naive and observed BCR sequences from a single germinal center to infer time trees for each clonal family.
In more detail, the pipeline prepares the XML files using beastgen and a specified template (pre-configured templates can be found in the data/beast/beast_templates directory); by default, the pipeline uses the skyline histlog template. It then patches the XML to fix the naive sequence in time using the beast_template_root_fix.py script before running BEAST on the patched XML files.
Quick start
There is a test set of sequences in the data/beast/ directory. To run the beast pipeline on this data (with a small number of MCMC iterations for quick execution), use the following command:
$ nextflow run beast/main.nf --chain_length 1000 --log_every 100 -profile docker -resume
Pipeline parameters
- --seqs is a string parameter specifying a file path pattern for the FASTA files containing the naive and observed BCR sequences. The beast pipeline will be run on each of the files matching this pattern.
- --beast_template is a string parameter specifying the path to the beast template file. The pipeline will use this template to generate the XML files for the beast runs.
- --results is a string parameter specifying the path to the directory where the results of the beast runs will be stored.
- --chain_length is an integer parameter specifying the number of MCMC iterations to run beast for.
- --log_every is an integer parameter specifying the interval at which MCMC-step trees are recorded.
- --convert_to_ete is a boolean parameter specifying whether to convert the beast trees to ete trees.
- --dms_vscores is the URL of the DMS variant scores used to add phenotypes to the ete-converted trees.
- --dms_sites is the URL of the DMS sites used to add phenotypes to the ete-converted trees.
- --burn_frac is the fraction of the chain to discard as burn-in rather than convert to ete.
- --save_pkl_trees is a boolean parameter specifying whether to save the ete trees as pickle files. This can be very memory-intensive when there are many logged tree iterations for each tree.
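For example, to run the pipeline on your own FASTA files with a longer chain and a 10% burn-in (the file pattern and values below are illustrative):
$ nextflow run beast/main.nf --seqs 'my_gc/*.fasta' --chain_length 10000000 --log_every 1000 --burn_frac 0.1 -profile docker -resume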