Prokaryotic (bulk) RNAseq pipeline - AWG review needed

asaravia · February 24, 2025, 8:51pm

Hi AWG members,
I’m excited to announce that we have created a (bulk) RNAseq pipeline for prokaryotic organisms, that is ready for your review. Please click on the link to review the pipeline and provide your feedback ASAP and no later than Monday, 3/10/2025.

Pipeline document detailing each step of the pipeline:
https://github.com/nasa/GeneLab_Data_Processing/blob/DEV_RNAseq_vG/RNAseq/Pipeline_GL-DPPD-7115_Versions/GL-DPPD-7115.md

Example outputs from raw counts through DGE from OSD-185:
GL-DPPD-7115_Prokaryotic_RNAseq_OSD-185_Example_Outputs

Differences from the eukaryotic pipeline, GL-DPPD-7101-G:

Bowtie2 is used for alignment instead of STAR
featureCounts is used for gene quantification instead of RSEM
rRNA genes are removed from featureCounts results on a dataset-wide basis and rRNA removal logs are all reported in the same file
Raw counts data are imported into R with the read.csv() function instead of tximport, and the dds object is created with the DESeqDataSetFromMatrix() function instead of the DESeqDataSetFromTximport() function

Specific questions, we would like your feedback on:

In step 4a, the Bowtie2 default --end-to-end mode is currently run such that only the primary alignment is reported and if multiple alignments have the same best score, Bowtie2 chooses one at random.
- Is this sufficient, or should we modify the parameters to report multiple alignments?
In step 4c/d, is it sufficient to only publish the sorted bam file, or do you also want the unsorted bam file?
In step 7c (counting with featureCounts):
- Should the -G and -J options be added to improve read counting for exon-exon junctions?
- If we modify the Bowtie2 command to return multiple alignments, answer the following questions:
  - Should the --primary option be specified to only count the primary alignment (and ignore secondary, tertiary, etc. alignments)?
  - Or should multi-mapped reads be considered when counting and if so, how?
    - Do not count multi-mapped reads
    - Randomly select which alignment to count
    - Count each alignment as 1
    - Count each alignment equally as a fraction of the total number of alignments (e.g. if there are 2 alignments, each would be counted as 0.5)

A big thank you to @alexis.torres and @crystal.han on the Data Processing Team for leading this effort!

@MicrobesAWG @MultiOmicsAWG @jessica.a.lee @daniela.bezdan

ccnaney · February 25, 2025, 1:19pm

Hi, this is awesome! However I am trying to run wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_RCP-G_2.0.0/NF_RCP-G_2.0.0.zip

from hyperion server home folder and when I use “–no-check-certificate” I get this error " WARNING: cannot verify github.com’s certificate, issued by ‘CN=Sectigo ECC Domain Validation Secure Server CA,O=Sectigo Limited,L=Salford,ST=Greater Manchester,C=GB’:

Unable to locally verify the issuer’s authority.

HTTP request sent, awaiting response… 404 Not Found

2025-02-25 05:11:08 ERROR 404: Not Found."

Please advise? @asaravia @alexis.torres @crystal.han

Best,
Charley

asaravia · February 25, 2025, 8:25pm

Hi @ccnaney,
The workflow is not ready yet, but should be in a couple weeks (once we receive feedback from you all). The workflow is just an implementation of the pipeline that’s detailed in the pipeline document: https://github.com/nasa/GeneLab_Data_Processing/blob/DEV_RNAseq_vG/RNAseq/Pipeline_GL-DPPD-7115_Versions/GL-DPPD-7115.md

Once we’ve proposed a pipeline, we ask the AWGs to give us feedback on the pipeline, including the outputs we plan to publish on the OSDR repo (which are the files listed in bold under “Output Data” in each step). That way we can integrate all feedback before finalizing the pipeline and then once the pipeline is baselined, we’ll update the workflow accordingly.

@alexis.torres, if you have a working version of the workflow files available (that will run the prokaryotic pipeline as it is currently detailed in the pipeline document), will you please upload it to the google drive so Charles can start playing with it?

Thanks,
Amanda

ccnaney · February 25, 2025, 11:12pm

Thank you, ma’am!

Best,
Charley

alexis.torres · February 26, 2025, 6:31pm

Hi @ccnaney, workflow draft is now in the workflow_code/ folder in the Google Drive. Note that validation and verification, md5sum generation, and publishing of the final outputs are still not implemented, but everything else should work fine. You can run the workflow using the --mode microbes parameter, setting an accession as --accession GLDS-# or OSD-#, and specifying local or slurm executor and docker or singularity for containerization:

nextflow run workflow_code/main.nf \
    --mode microbes --accession GLDS-185 \
    -profile [local|slurm],[docker|singularity] \
    -resume

You can also use this command to show the help menu:

nextflow run workflow_code/main.nf --help

If you need to pull the singularity images, you can use:

bash workflow_code/bin/prepull_singularity.sh workflow_code/conf/by_docker_image.config $NXF_SINGULARITY_CACHEDIR

ccnaney · February 26, 2025, 6:47pm

Thank you, sir!

Best,
Charley

ccnaney · February 26, 2025, 6:51pm

Hi Alexis,

Literally as was sending last message it finally clicked your name is Alexis not Alex. I apologize for the confusion!

Best,
Charley

ccnaney · February 27, 2025, 2:57am

Thanks again!

Best,
Charley

ccnaney · March 5, 2025, 6:37pm

Just confirming if I can post response, if works, will upload script name here – still reviewing my history to find it.

asaravia · March 5, 2025, 6:41pm

During the Microbes AWG monthly meeting on 3/5/2025, @nicholas.brereton and @daniela.bezdan, suggested to modify step 4a, the Bowtie2 alignment command, to include multi-mapped reads in the alignment file. Then to modify step 7c, the featureCounts command, to include the --primary option so that only the primary alignment is quantitated.

Are there any objections to this proposed modification?

emmanuel.gonzalez · March 6, 2025, 2:35am

Hi Amanda,

Thanks for sharing the pipeline! The overall structure looks solid, and here are some specific feedback on the technical questions you asked. Looking forward to seeing this pipeline in action!

QUESTION1 :
Multiple alignments should be reported to avoid underestimating gene expression. Reporting only the primary alignment can lead to incorrect quantification, particularly in bacterial genomes, where gene duplications, horizontal gene transfer, and operon structures affect mapping. It also misses transcripts from horizontal gene transfer and distorts quantification of polycistronic transcripts. You can use Bowtie2 with -k xxx or -a for better accuracy.

Note that reference-based metatranscriptomics is also possible with your pipeline using a selected set of reference genomes. In this context, allowing multiple alignments becomes even more important, as shared sequences across different species in the reference set can introduce alignment ambiguity.

Note: The -a option may be limited by available computational resources, especially in genomes with many repeated sequences…

QUESTION 2 :
In 95% of the cases, publishing only the sorted BAM file is sufficient. They’re more compact, easier to use, and essential for visualization. Most downstream analyses require sorted BAMs anyway. Unsorted files might help with debugging but aren’t necessary for publication.

QUESTION 3 :
Neither option is needed. Bacteria are not expected to have exon-exon junctions, so -J is irrelevant. The -G option isn’t necessary either, as your pipeline already specifies -a reference_annotation.gff, which handles bacterial gene structures adequately. You’re good.

QUESTION 4 :
I wouldn’t use –primary when handling multi-mapped reads in bacterial RNA-seq. This option will potentially discard valuable information about alternative mapping locations. For bacterial genomes with duplicated genes, transposons, and horizontally transferred elements, using –primary artificially reduces expression estimates in these regions. Remember that bacterial genomes are compact and efficient where many sequences serve multiple functions or exist in multiple copies. Constraining your analysis to primary alignments creates a biased representation of transcriptional activity.
Your approach depends on your analysis goals. For high-confidence expression quantification, counting only primary alignments makes sense. For repetitive regions or comprehensive coverage, consider all alignments.

How should multi-mapped reads be handled?
Ignoring multi-mapped reads causes underestimation. Random selection introduces stochastic errors. Counting each alignment as one inflates expression values. I would say that the best approach is counting fractionally (1/N per alignment), which prevents overestimation while accounting for all valid mapping locations.

Cheers!

nicholas.brereton · March 6, 2025, 9:36am

Great Emmanuel - so much clearer than my attempt to contribute at the awg last night.

We also discussed some choices that make large differences to statistical outcomes, including experiment-specific count thresholds (briefly) and structural zeros. This is probably overly-complicated or usually less relevant here but, alongside sparsity control, would be essential for a metatranscriptomics pipeline if there’s a genelab one in the future.

For this ref-based pipeline, it’s a shame to lose operon info. Is it complicated to add read-through transcription analysis (eg. Rockhopper) as an additional output? It’s really cool for the biological interpretation.

emmanuel.gonzalez · March 6, 2025, 3:58pm

Thanks Nick!
Totally, a tool like rockhopper would be a great add-on. Deciding what to include in a ready-made pipeline is always a challenge where every addition improves interpretability but also increases complexity. To complicate things further, there is often no single “right” answer to a problem. For example, while I recommended reporting multiple alignments above, it may not always be ideal. Multiple alignments can inflate expression values, introduce noise in differential expression, and lead to misassignments in taxonomically diverse datasets etc.

One general note after reading your post: depending on the statistical framework, sparsity and structural zeros may or may not be an issue. Some models handle them inherently, while others require additional filtering steps. It all depends on what you’re trying to explore and the statistical tools used to answer the question.

These are the kinds of decisions we deal with daily when designing pipelines: balancing accuracy, usability, and flexibility while ensuring users understand key decisions made during development.

I also edited my last post to clarify why I brought up metatranscriptomics in a reference-based pipeline. While this pipeline is reference-based, metatranscriptomics can still be performed with a curated set of reference genomes. However, reference selection is very critical against mapping bias. If the reference set is too limited, many reads will fail to align. If too broad, mapping ambiguity increases. Classic conundrum.

asaravia · March 7, 2025, 10:18pm

Thank you so much for the additional feedback, @emmanuel.gonzalez and @nicholas.brereton!

We have made the following proposed changes:

Added the -a option to step 4a, instructing Bowtie2 to report all alignments
Added the -M option to step 7a, instructing featureCounts to count multi-mapped reads
Added the --fraction option to step 7a, instructing featureCounts to assign fractional counts to multi-mapped reads

Please let me know if anyone disagrees with the changes made.

Although I agree that adding a tool like rockhopper to perform de novo assembly on the unmapped reads would be nice to have, I think this would take some additional development work and testing (and would likely also require deeper read depths than what studies on OSDR have). So, would it be okay with all of you if we move forward with the current pipeline (with the recent changes incorporated) for now and we can consider adding a step for de novo assembly of the unmapped reads in a future version of the pipeline?

In the meantime, the current pipeline does provide the unmapped reads in FASTQ format so if folks are interested, they can take those unmapped fastq files and run them through a tool like rockhopper.

nicholas.brereton · March 10, 2025, 11:14am

Thanks for the update!
Rockhopper has a ref-based mode using mapped reads (FASTQ, BAM + GTF) to infer operon structures based on intergenic distances and expression correlation (I think). Rockhopper User Guide - Emmanuel would know better than me (I’ve only done the interpretation before).
Maybe just something worth exploring as an optional step in a future version though - the pipeline looks great!

Topic		Replies	Views
Updated RNAseq pipeline - AWG review needed OSDR Feedback omics , rna-seq , pipeline-update	1	123	February 12, 2025
Updated AmpliconSeq pipeline - AWG review needed OSDR Feedback omics , pipeline-update , microbes , amplicon-seq	11	252	April 10, 2025
RNAseq - Arabidopsis DGE & functional analysis Plant AWG Topics plant	8	142	June 23, 2024
RNA Metrics Feedback Requested OSDR Feedback rna-seq , metrics , qc	0	71	October 23, 2024
OSDR Tutorials Are Now Available! 🎉 OSDR Feedback data-mining , visualization , tutorial , radlab , eda , osdr-study , nasa-guest-account	6	322	October 4, 2024

Prokaryotic (bulk) RNAseq pipeline - AWG review needed

Related topics