Updated AmpliconSeq pipeline - AWG review needed

barbara.novak · March 27, 2025, 11:15pm

Hi AWG members,

I’m excited to announce that the updated Illumina AmpliconSeq pipeline is ready for your review. Please click on the link to review the pipeline and provide your feedback ASAP and no later than Monday, 4/14/2025.

Pipeline document detailing each step of the pipeline:

github.com/nasa/GeneLab_Data_Processing

Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-C.md

DEV_Amplicon_Illumina_NF_conversion

# Bioinformatics pipeline for amplicon Illumina sequencing data  

> **This page holds an overview and instructions for how GeneLab processes Illumina amplicon sequencing datasets. Exact processing commands for specific datasets that have been released are available in the [GLDS_Processing_Scripts](../GLDS_Processing_Scripts) sub-directory and/or are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).**  

---

**Date:** March XX, 2025  
**Revision:** C  
**Document Number:** GL-DPPD-7104  

**Submitted by:**  
Olabiyi Obayomi, Alexis Torres, and Michael D. Lee (GeneLab Data Processing Team)

**Approved by:**  
Samrawit Gebre (OSDR Project Manager)  
Danielle Lopez (OSDR Deputy Project Manager)  
Jonathan Galazka (OSDR Project Scientist)  
Amanda Saravia-Butler (GeneLab Science Lead)  
Barbara Novak (GeneLab Data Processing Lead)


Example outputs from raw counts through differential abundance testing from OSD-487:

Changes from previous version (GL-DPPD-7104-B.md):

Software version updates
Added new processing steps in R to generate processed data outputs for alpha and beta diversity, taxonomic summary plots, and differential abundance using ANCOMBC 1 and 2 and DESeq2.
Updated DECIPHER reference files.

Specific questions that we need feedback on:

Should the “_number” suffix added to the taxonomy names by DECIPHER be removed (see the taxonomy_GLAmpSeq.tsv file for an example)?
Should the “passed sensitivity analysis” column be dropped in the ANCOMBC2 results table (see the ancombc2_differential_abundance_GLAmpSeq.csv file for an example)?
Note: The passed_ss column simply states whether the result for an ASV (differentially abundant or not) stays the same whether or not a pseudocount is added to zero counts. That is, passed_ss = TRUE means adding a pseudocount didn’t change the result, and FALSE means it did.
Review the parameters used to generate the differential abundance analysis results for each of the 3 tools and let us know if any should be changed.
Review all example output files and let us know if there is any additional information you want to see in those files or if anything is unnecessary that can be removed.
Review all output data files in bold in the pipeline document (these are the files we plan to publish in the repository), and let us know if there are any additional files you want us to publish, or if there are any files we are planning to publish that you will not need or find useful.

A big thank you to @olabiyi on the Data Processing Team for leading this effort!

@MicrobesAWG @MultiOmicsAWG @olabiyi @asaravia @jessica.a.lee @daniela.bezdan @ccnaney @emmanuel.gonzalez @nicholas.brereton @AstrobioMike @alexis.torres

emmanuel.gonzalez · March 28, 2025, 1:46am

Hey there,

Just reviewed your amplicon pipeline. Seriously impressive work! The multi-method DA approach is killer and gives users great flexibility.

I had a few thoughts about optimizations that might help:

Taxonomy visualization needs compositional transformation

Your current percentage calculation (function(x) x/sum(x)) * 100) doesn’t account for the compositional nature of microbiome data. This creates several issues: dependencies between taxa that percentages mask, skewed comparisons between samples with different sequencing depths, unaddressed variance-mean relationships, and potential spurious correlations. You could implement CLR transformation here.
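To make the two options concrete, here is a minimal sketch (in Python rather than the pipeline's R, purely for illustration) of the percentage calculation next to a CLR transform. The function names and the pseudocount of 0.5 are assumptions for this example, not the pipeline's settings.

```python
import math

def relative_abundance(counts):
    """Percent of the sample total for each taxon, i.e. x / sum(x) * 100."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]

def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log count minus the mean log count in the sample.

    The pseudocount is an assumed value used only to avoid log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical counts for four taxa in one sample
sample = [120, 30, 0, 850]
percent = relative_abundance(sample)   # values sum to 100
clr_values = clr(sample)               # values sum to 0 by construction
```

Percentages are bounded and sum to 100 within a sample, while CLR values are centered on the sample's geometric mean and sum to zero.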

Missing VST validation checkpoint

You’ve built 2 normalization options but there’s no diagnostic to validate whether VST is actually stabilizing variance. Microbiome data has unique variance-mean relationships that can break traditional approaches. Add meanSdPlot(assay(vsd)) to give users a critical quality metric. Adding a CLR normalization option might also help (it’s generally fast to run).

DESeq2 needs sparsity assessment

DESeq2 underperforms with sparse data (a known issue). Including the recommended diagnostic plot (sparsity plot) might help users determine if DESeq2 is appropriate:

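# Per-ASV total count across all samples
rs <- rowSums(counts(dds))
# Per-ASV maximum count in any single sample
rmx <- apply(counts(dds), 1, max)
# A ratio near 1 means one sample holds almost all of that ASV's counts
plot(rs+1, rmx/rs, log="x")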

Method selection needs a decision framework

You’ve built multiple statistical paths (ANCOMBC1, ANCOMBC2, DESeq2, PERMANOVA) without guiding users on method selection. Different methods make different assumptions about the data, and their suitability varies depending on overdispersion, sparsity, zero-inflation, and heteroscedasticity. Adding adaptive method selection or at least guidance on interpretation would maximize scientific impact and user success.

I hope these suggestions are helpful! Happy to discuss further if any of these points need clarification.

rtscott2001 · March 28, 2025, 1:51am

Hey @daniela.bezdan @jaume.puig — chance that @barbara.novak @olabiyi could attend the @MicrobesAWG mtg on April 2, 10am PT, to present this AmpliconSeq pipeline and get feedback?

@barbara.novak & @olabiyi (and anyone else on data processing @alexis.torres ) hope that time may work for it

Hail Mary — @AstrobioMike if you could be there too also @stefan_green ?

@chm2042 - would anyone from the Metasub world be interested/available to join this Microbes AWG meeting to provide feedback on the AmpliconSeq pipeline?

Mtg invite for Microbes AWG:

https://awg.osdr.space/t/microbes-awg-meeting/917

Cheers all

rtscott2001 · March 28, 2025, 1:52am

Thank you @emmanuel.gonzalez for the feedback, and the future users of the data thank you too

The time and effort is sincerely appreciated

AstrobioMike · March 28, 2025, 2:07am

I’ll be there if I can!

AstrobioMike · March 28, 2025, 2:21am

Heya, Emmanuel!

I’d push back a little that basic taxonomic visualizations need to be transformed in a way that accounts for the compositional nature of the data. So long as nothing statistical is being done with them, high-level summary visualizations with intuitive interpretations can be desirable for some folks.

Just to get that thought out there

Hope all is well in your world!

olabiyi · March 28, 2025, 5:31pm

Hi Emmanuel,

Thank you for your valuable feedback. Please see my response below:

Taxonomy visualization needs compositional transformation: The purpose of the taxonomy summary plots is to show the relative proportion of microbes within a sample or group, and since no comparisons or statistical analyses are being performed with these data at this step, accounting for the compositional nature of the data seems unnecessary. We can discuss this further at the Microbes AWG meeting next week.
Missing VST validation checkpoint: we can add this too.
DESeq2 needs sparsity assessment: This sounds good.
Method selection needs a decision framework: “Adding adaptive method selection or at least guidance on interpretation” - good point. We agree that no two datasets are the same, but there are many methods to analyze this type of data. In the current pipeline, we provide 3 options. We applied ANCOMBC 1 and 2 because microbiome data are compositional and both tools are composition-aware. As to why we use both: according to the authors, ANCOMBC 2 is an improvement that adds new features such as variance regularization, multigroup comparisons, and built-in pairwise comparisons between groups. Nonetheless, ANCOMBC 1 is still widely used and appears to be more stable than ANCOMBC 2. We added DESeq2 because it is a popular (and highly demanded) choice that assumes a negative binomial model for the dataset. Moreover, it has been shown to do a good job (low FDR and high power), particularly when the number of samples is low; only ANCOM and ANCOMBC performed better, according to the authors of ANCOMBC (Analysis of compositions of microbiomes with bias correction | Nature Communications). Please see figure 4 of the paper. We can add some guidance under each of these options to indicate which one is most appropriate given the data. Let’s discuss how to direct users during the Microbes AWG meeting next week.

If there is a strong consensus among the group that the analyses we provide are not optimal, we can discuss changing these for this pipeline release. However, if the group considers these options sufficient, we can move forward with these options for now and consider adding additional options in subsequent pipeline versions.

olabiyi · April 9, 2025, 6:44pm

Hi @emmanuel.gonzalez, we made the suggested VST and ASV sparsity QC plots but wonder how to interpret them. Could you please provide interpretation for the plots below? When is DESeq2 or VST transformation inappropriate or appropriate for a dataset based on the plots?

In addition, could you provide a little blurb of guidance on when to use the ancombc(), ancombc2(), or deseq2() differential abundance analyses based on these QC plots?

Thanks!

emmanuel.gonzalez · April 10, 2025, 1:09am

Hey folks,

Apologies for missing the meeting today, I’ve been down with the flu. Briefly:

The choice of transformation really depends on what you’re trying to do. For example, if you’re using DESeq2 for differential abundance, then raw counts are used directly, i.e. no transformation needed. But DESeq2’s negative binomial model does struggle with sparse data. Looking at your sparsity plot, most ASVs show dominance in only one sample (ratios near 1), which suggests high sparsity and could limit DESeq2’s reliability here.
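For anyone who wants to see what "ratios near 1" means numerically, here is a small sketch in Python (illustrative only; the pipeline itself is in R). It computes the same max-count over row-sum ratio as the sparsity plot. The toy counts and the 0.9 cutoff for calling an ASV "dominated by one sample" are assumptions for this example, not recommended thresholds.

```python
def sparsity_ratios(count_rows):
    """For each ASV (row), return (row_sum, max_count / row_sum).

    All-zero rows are skipped because the ratio is undefined for them.
    """
    out = []
    for row in count_rows:
        row_sum = sum(row)
        if row_sum == 0:
            continue
        out.append((row_sum, max(row) / row_sum))
    return out

# Hypothetical ASV-by-sample count table (3 ASVs, 4 samples)
counts = [
    [0, 0, 45, 0],    # present in one sample only: ratio is exactly 1.0
    [10, 12, 9, 11],  # spread evenly: ratio close to 1/4
    [0, 3, 0, 0],     # low-count singleton: ratio is exactly 1.0
]
ratios = sparsity_ratios(counts)

# Fraction of ASVs whose counts sit almost entirely in one sample
# (0.9 is an arbitrary cutoff chosen for this sketch)
dominated = sum(1 for _, r in ratios if r > 0.9) / len(ratios)
```

With n samples the ratio ranges from 1/n (perfectly even) to 1 (all counts in one sample), so a cloud of points at the top of the plot is what signals high sparsity.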

On the other hand, if you’re doing clustering with a method that assumes equal variance across features (e.g. PCA), you’d need a transformation that does that. The mean-SD plot should ideally show a flat trend if variance is stabilized. In your case, the increasing trend indicates that VST didn’t fully correct the mean-variance relationship, so caution is needed for downstream analyses that rely on homoscedasticity.
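One rough way to put a number on "flat versus increasing" is to rank-correlate each feature's mean with its standard deviation after transformation. The sketch below is in Python for illustration and is not part of meanSdPlot or the pipeline; the helper names and the toy matrix are assumptions.

```python
import statistics

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variation."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def mean_sd_trend(matrix):
    """Spearman correlation between per-feature mean and SD (rows = features)."""
    means = [statistics.fmean(row) for row in matrix]
    sds = [statistics.stdev(row) for row in matrix]
    return pearson(average_ranks(means), average_ranks(sds))

# Hypothetical transformed values where SD grows with the mean
transformed = [
    [1.0, 1.1, 0.9],
    [5.0, 6.0, 4.0],
    [10.0, 13.0, 7.0],
]
trend = mean_sd_trend(transformed)
```

A value near 0 is consistent with a flat mean-SD relationship, while a clearly positive value matches the increasing trend described above. It is a summary statistic only and does not replace looking at the plot.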

As for choosing between ancombc(2) and deseq2, in short: it depends on your design, sparsity, and what biases you’re trying to control.

Hope that helps! Happy to expand later if needed.

Emmanuel

nicholas.brereton · April 10, 2025, 8:59am

Thanks Emmanuel. Ideally, this would inform iterative pre-processing (particularly around sparsity filtering), but just providing it (fig) would already help users see the issue.

It was mentioned in the Microbes meeting last night that the team (understandably) only has scope to provide a pipeline and not interpret data. This type of critical bioinformatics issue (sparsity) is where that becomes tricky. The same is true in other omics, but it is particularly obvious here, as processing decision-making really needs to be biologically informed and ideally iterative. As discussed, strong guidance documentation and the decision framework suggested by Emmanuel seem the best solution.

For the DECIPHER suffix, the ASV could have a taxonomy unique ID based on the lowest annotated taxon (e.g. Corynebacterium_3), which could be in your best tax column or just another unique label next to the ASV column. This will ultimately be helpful for multiomics, WMS, and cross-experiment comparison and integration. The suffix can then be removed from all the taxonomy columns, allowing for easy collapse for reduced complexity figures etc.
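A minimal sketch of that idea, in Python for illustration: keep the suffixed lowest annotated taxon as a unique label and strip the “_number” suffix from the taxonomy columns. The rank names, the use of "NA" for missing ranks, and the function names are assumptions for this example, not the layout of taxonomy_GLAmpSeq.tsv.

```python
import re

# Assumed rank columns, ordered from highest to lowest
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# Matches a trailing "_<digits>" suffix such as the "_3" in "Corynebacterium_3"
SUFFIX = re.compile(r"_\d+$")

def lowest_taxon_label(taxonomy):
    """Return the lowest annotated taxon, suffix kept, as a unique label."""
    for rank in reversed(RANKS):
        name = taxonomy.get(rank)
        if name and name != "NA":
            return name
    return "Unclassified"

def strip_suffixes(taxonomy):
    """Return a copy of the taxonomy with the numeric suffix removed from every rank."""
    return {rank: SUFFIX.sub("", name) for rank, name in taxonomy.items()}

# Hypothetical taxonomy row for one ASV
asv = {
    "domain": "Bacteria",
    "family": "Corynebacteriaceae",
    "genus": "Corynebacterium_3",
    "species": "NA",
}
label = lowest_taxon_label(asv)   # "Corynebacterium_3"
cleaned = strip_suffixes(asv)     # genus becomes "Corynebacterium"
```

One caveat with a pattern like this: it would also strip a genuine taxon name that happens to end in an underscore followed by digits, so it would need checking against the reference taxonomy in use.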

It looks great. Thanks!

olabiyi · April 10, 2025, 6:25pm

Hi @emmanuel.gonzalez, thanks for your explanations. We’ll incorporate the VST plot interpretation at the top of the beta diversity and sparsity plot interpretation at the top of the differential abundance sections to give users some guidance. Thanks!

olabiyi · April 10, 2025, 6:29pm

Thanks for your feedback @nicholas.brereton. We’ll leave the numeric suffix in for the ASVs in the taxonomy table, but remove it for the taxonomies associated with each ASV in the differential abundance tables. Thanks!
