Feedback Requested for GeneLab MethylSeq Pipeline Updates

Hi AWG members,

We have made a few updates to the MethylSeq pipeline since it was first drafted. Most of the updates are minor, but one bug fix affects the output file format, and we would like your feedback before continuing.

Specifically, we no longer use the coordinate-sorted BAM file as input for deduplication and methylation extraction. This fixes two issues:

  1. Deduplication fails on coordinate-sorted files for PE datasets.
  2. Bismark summary report generation is sensitive to file naming, and the "sorted" suffix broke generation of the summary report.

An updated version of the pipeline can be found on the DEV_Methyl-Seq branch of the GeneLab DP GitHub Repo. We tested the new version of the pipeline with the prior example data, and all values were the same. The only difference is the ordering of some of the methylation extraction output, specifically the methylation call files for the CpG, CHH, and CHG contexts. All values in these files match those generated from the coordinate-sorted file, and all other files are unaffected.
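
For anyone who wants to confirm this on their own data, here is a minimal sketch (not part of the pipeline) of an order-independent comparison between an old and a new methylation call file. The function name `same_records` and the `header_lines` parameter are hypothetical, and the sketch assumes the files are plain text with one record per line.

```python
from collections import Counter
from typing import Iterable


def same_records(lines_a: Iterable[str],
                 lines_b: Iterable[str],
                 header_lines: int = 0) -> bool:
    """Return True if both inputs hold the same records, ignoring order.

    header_lines is the number of leading lines to skip in each input
    (an assumption; set it to match your files).
    """
    def records(lines: Iterable[str]) -> Counter:
        body = list(lines)[header_lines:]
        # Count each non-empty line so duplicated records are compared too.
        return Counter(line.rstrip("\n") for line in body if line.strip())

    return records(lines_a) == records(lines_b)
```

To compare two files on disk, open both and pass the file handles, for example `same_records(open(old_path), open(new_path), header_lines=1)`, where `old_path` and `new_path` are placeholders for your own file locations. Gzipped output would need `gzip.open(path, "rt")` instead of `open`.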

For more information on the output file changes please see Methylation_pipeline_updates_to_output_files.xlsx. A full changelog is provided here: Updates_to_MethylSeq_Pipeline.docx. Example output files for both a single-end RRBS dataset (same as before) and a paired-end dataset can be found here: MethylSeq_updates. Both datasets were subsampled to 1 million reads prior to processing.

Can you please let us know by Wednesday, July 31st whether this change is acceptable, or whether the coordinate sorting of these methylation context files needs to be preserved?

Thanks,
Barbara

@pmadrigal @keith.siew @chm2042

Other AWG members who may be interested in reviewing: @MultiOmicsAWG @AnimalAWG @MicrobesAWG
