New Low Biomass Metagenomics Pipelines - AWG review needed

Hi @PPawg members (and @MicrobesAWG members, if you feel you have the relevant expertise to review and comment):

We have drafted two new pipelines for processing low biomass metagenomics datasets that are ready for your review. Please provide your feedback ASAP and no later than February 28th.

The Long-read (Nanopore) Low Biomass Metagenomics Pipeline is available here:

The Short-read (Illumina) Low Biomass Metagenomics Pipeline is available here:

@Alex @Haley_Sapers @gregcaporaso @lorna @Rettberg @jneufeld @barbara.novak @Stighe @stefan_green @gebresg @olabiyi @lguan @kjvvenkat @cdavis

7 Likes

@asaravia Would it be okay if I use your markdown style and content (not copying, just components) as a template for my HTGAA 2026 homework?

1 Like

For 1. Basecalling

Since Dorado v1.0.0, fast5 files are no longer supported, so it would be good to remove all references to them. Fast5 files first need to be converted to pod5 files with the pod5 tool: Tools — Pod5 File Format 0.1.21 documentation
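A hedged sketch of what the conversion and basecalling steps might look like (file names, the input directory, and the Dorado model alias are placeholders, not taken from the pipeline):

```shell
# Convert a directory of fast5 files into a single pod5 file with the
# pod5 tool (installable via `pip install pod5`); paths are placeholders.
pod5 convert fast5 ./fast5_dir/*.fast5 --output converted.pod5

# Basecall the converted pod5 with Dorado (the "hac" model alias is an
# example; pick the model matching your flow cell and chemistry).
dorado basecaller hac converted.pod5 > basecalled.bam
```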

This pipeline looks really great! Thank you so much for establishing this resource. :slight_smile:

@asaravia @olabiyi @barbara.novak

1 Like

Of course, @ccnaney !

1 Like

Thanks, @tmn2126 , that’s a good point. We’ll add a step for fast5 → pod5 file conversion.

2 Likes

You’ve put together a really great workflow for long and short reads! Congrats! Still, there are a few comments from my side:

  • QC) Is there a way to use fastp for QC of long reads as well? Fastp has great performance and more options.
  • Host removal) Human pangenomes have recently been used for the removal of host DNA. Perhaps a selection of different reference genomes would be useful? (e.g. hg19, GRCh38, T2T-chm13 etc.)
  • Taxonomic annotation) Given Kraken2’s tendency to produce false positives, perhaps a strict confidence threshold of 0.6 or higher would be useful? For performance reasons, a memory-mapping option for the Kraken2 database could be useful as well. Bracken for abundance estimation from Kraken2 outputs is also missing here.
  • Assembly) As far as I know, there is also a meta-sensitive mode for megahit.
  • Binning) MetaBat2 is widely used, but has been around for quite some time. There are now numerous more modern binning tools (Benchmarking metagenomic binning tools on real datasets across sequencing platforms and binning modes | Nature Communications), such as QuickBin, COMEBin, SemiBin2, GenomeFace, MetaBinner, TaxVAMB, etc. Perhaps it would be advisable to consider more than one binning tool in the workflow, followed by DAS Tool for bin refinement.
  • Bin quality) CheckM2 and GUNC are newer tools for estimating the quality of a bin.
  • Decontamination) It’s great that you have introduced a regular decontamination step to remove potential contaminants with the R tool decontam. However, in our experience, the default threshold of 0.1 is quite permissive. We usually use a threshold of 0.5 in our workflows to be more stringent. I’m looking forward to testing your workflow with our data sets!
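For the taxonomic annotation point above, a hedged sketch of what stricter Kraken2 settings plus Bracken re-estimation could look like (database path, sample file names, and the read length passed to `-r` are placeholders):

```shell
# Hypothetical sketch: Kraken2 with a strict confidence threshold and
# memory mapping (avoids loading the whole database into RAM), followed
# by Bracken to re-estimate species-level abundances from the report.
kraken2 --db /path/to/kraken2_db \
        --confidence 0.6 \
        --memory-mapping \
        --report sample.kreport \
        --output sample.kraken \
        --paired sample_R1.fastq.gz sample_R2.fastq.gz

# Bracken: -r is the sequencing read length, -l S = species level
bracken -d /path/to/kraken2_db \
        -i sample.kreport \
        -o sample.bracken \
        -r 150 -l S
```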

@asaravia

2 Likes

@Alex Thanks for your comments. Please see my response below:

  • QC) Is there a way to use fastp for QC of long reads as well? Fastp has great performance and more options. Fastp is not designed for long reads (its documentation says so), which is why we have used Filtlong and Porechop, which are designed for long reads. Fastplong (the long-read counterpart of fastp) is one that we may try.

  • Host removal) Human pangenomes have recently been used for the removal of host DNA. Perhaps a selection of different reference genomes would be useful? (e.g. hg19, GRCh38, T2T-chm13 etc.). Thanks for this wonderful suggestion. I believe we already do this internally.

  • Taxonomic annotation) Due to kraken2’s tendency to produce false positives, perhaps a strict confidence level of 0.6 or higher would be useful? For performance reasons, a memory-mapping option could be useful for kraken2 databases as well. I’m missing bracken for abundance estimation from kraken2 outputs here. Thanks for this. Yes, we acknowledge that Kraken2 tends to generate many false positives, so we filter out taxa with abundance less than 0.5% and also filter out unclassified reads. We found that this greatly reduces false positives and thus improves the results. We do not use Bracken (it is optional) for abundance estimation; instead we use a combination of Pavian and relative abundance estimation (number of reads assigned to a taxon in a sample / total number of reads in the sample). It will be interesting to see what difference Bracken makes. How do you perform memory mapping for Kraken2?

  • Assembly) As far as I know, there is also a meta-sensitive mode for megahit. Yes, we use this mode for assembling with megahit.

  • Binning) MetaBat2 is widely used, but has been around for quite some time. There are now numerous more modern binning tools (Benchmarking metagenomic binning tools on real datasets across sequencing platforms and binning modes | Nature Communications) such as QuickBin, COMEBin, SemiBin2, GenomeFace, MetaBinner, TaxVAMB, etc. Perhaps it would be advisable to consider more than one binning tool in the workflow followed by DAS Tool for bin refinement. Great idea! Thank you. We will look into this.

  • Bin quality) CheckM2 and GUNC are newer tools for estimating the quality of a bin. Thank you again. We will look into updating our workflow to use more recent tools. We have used tools that had previously been tested, trusted, and approved by the AWG.

  • Decontamination) It’s great that you have introduced a regular decontamination step to remove potential contaminants with the R tool decontam. However, in our experience, the default threshold of 0.1 is quite permissive. We usually use a threshold of 0.5 in our workflows to be more stringent. I’m looking forward to testing your workflow with our data sets! Exactly! We tested the 0.1 threshold and found that it is indeed too permissive. In fact, the 0.5 threshold isn’t perfect either, so the default in our workflow is the more stringent 0.5 threshold, just as you do.
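The manual relative-abundance estimate described above (reads assigned to a taxon divided by total reads in the sample, with a 0.5% cutoff) can be sketched on a simplified stand-in for a Kraken2 report (real reports are tab-delimited and indent taxon names; the underscores and space-delimited columns here just keep the example whitespace-safe):

```shell
# Simplified stand-in for a Kraken2 report.
# Columns: pct, clade_reads, direct_reads, rank, taxid, name
cat > sample.kreport <<'EOF'
60.00 600 600 U 0 unclassified
40.00 400 0 R 1 root
30.00 300 300 S 562 Escherichia_coli
9.70 97 97 S 1280 Staphylococcus_aureus
0.30 3 3 S 1613 Limosilactobacillus_fermentum
EOF

# Relative abundance = reads assigned to the taxon / total reads in the
# sample, keeping only species-level (S) taxa at or above the 0.5% cutoff.
total=1000
awk -v total="$total" '$4 == "S" {
    ra = 100 * $2 / total
    if (ra >= 0.5) printf "%s %.2f%%\n", $6, ra
}' sample.kreport
# prints:
#   Escherichia_coli 30.00%
#   Staphylococcus_aureus 9.70%
```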

Thanks once again for taking the time to provide these invaluable comments.

2 Likes

I can confirm that the latest version of our human-removal step uses a reference DB built from both hg38 and T2T-chm13 (the default when building the human DB in Kraken2). We’ll make that clearer in the pipeline documentation.
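A hedged sketch of how such a Kraken2-based host-removal step might be invoked for paired-end reads (database path and sample file names are placeholders): reads that classify against the human database are discarded, and unclassified reads are kept for downstream analysis.

```shell
# Hypothetical host-removal sketch with a human Kraken2 database.
# With --paired, the "#" in --unclassified-out is replaced by 1/2,
# producing nonhost_1.fastq and nonhost_2.fastq of non-human reads.
kraken2 --db /path/to/human_db \
        --threads 8 \
        --unclassified-out nonhost#.fastq \
        --paired sample_R1.fastq.gz sample_R2.fastq.gz \
        --output - \
        --report host_screen.kreport
```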

2 Likes