Microbial RNASeq alignment filter

Hi @MicrobesAWG,

As we process the 5 prioritized microbial RNASeq datasets (OSD-95, 138, 145, 185, 554), we’ve noticed that the alignment rate for some samples is quite poor. In the eukaryotic RNASeq pipeline, we apply a >=60% RSEM unalignable filter. Any samples that don’t pass the filter remain in the raw counts table, but are excluded from count normalization and differential expression analysis. We are curious whether we should apply a similar cutoff to microbial RNASeq data.

The closest equivalent to the eukaryotic pipeline threshold in the microbial pipeline appears to be Bowtie 2 “overall_alignment_rate” <= 40%. At this cutoff, only OSD-95 and OSD-554 would be affected. However, it’s not clear to us if that threshold is appropriate for microbial data since it is based on eukaryotic data.

Table of overall alignment rates for each dataset:

| Dataset | N | N with 1 - overall_alignment_rate >=50 | N with >=60 | N with >=70 | N with >=80 | N with >=90 | Min overall_alignment_rate | Median overall_alignment_rate | Mean overall_alignment_rate |
|---|---|---|---|---|---|---|---|---|---|
| OSD-95 | 21 | 7 | 6 | 4 | 4 | 3 | 2.42 | 70.43 | 60.08 |
| OSD-138 | 18 | 0 | 0 | 0 | 0 | 0 | 99.18 | 99.30 | 99.31 |
| OSD-145 | 26 | 0 | 0 | 0 | 0 | 0 | 99.16 | 99.31 | 99.33 |
| OSD-185 | 6 | 0 | 0 | 0 | 0 | 0 | 56.75 | 99.44 | 89.87 |
| OSD-554 | 66 | 46 | 38 | 26 | 13 | 7 | 2.73 | 36.25 | 38.70 |
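
For anyone who wants to reproduce these per-dataset counts, here is a minimal Python sketch. It assumes the rate is read from the last line of the Bowtie 2 alignment summary ("NN.NN% overall alignment rate"); the function names and the 40% default are illustrative only and are not part of the pipeline.

```python
import re
from statistics import mean, median

# Bowtie 2 ends its alignment summary with a line such as
# "36.25% overall alignment rate".
_RATE_RE = re.compile(r"([\d.]+)% overall alignment rate")


def overall_alignment_rate(summary_text):
    """Return the overall alignment rate (in percent) from a Bowtie 2 summary."""
    match = _RATE_RE.search(summary_text)
    if match is None:
        raise ValueError("no 'overall alignment rate' line found")
    return float(match.group(1))


def summarize(rates, unaligned_cutoffs=(50, 60, 70, 80, 90)):
    """Summarize one dataset given {sample_name: overall alignment rate in %}.

    For each cutoff, count the samples whose unaligned percentage
    (100 - rate) is at or above that cutoff.
    """
    values = list(rates.values())
    counts = {
        cutoff: sum(1 for rate in values if 100 - rate >= cutoff)
        for cutoff in unaligned_cutoffs
    }
    return {
        "n": len(values),
        "counts": counts,
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
    }


def flag_low_alignment(rates, min_rate=40.0):
    """Return the sorted names of samples with rate <= min_rate (hypothetical cutoff)."""
    return sorted(name for name, rate in rates.items() if rate <= min_rate)
```

Flagging samples this way, without dropping them, would keep them in the raw counts table while making it easy to exclude them from downstream normalization if a cutoff is adopted.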

For those of you with experience in differential gene expression in microbial RNASeq, are there thresholds typically applied for filtering samples based on alignment rate?

@nicholas.brereton @ben.sikes @daniela.bezdan @jaume.puig: iirc these were the datasets you asked us to process :blush:

Hi Barbara,

I would investigate this instead of using a cutoff (you could then maybe add some extra QC steps in future). Low mapping could mean contamination (rRNA, human, reagent, or other bacteria in the culture), but it could also mean the reference genome is further from the sequenced strain than we hoped.

To investigate, maybe try to annotate the unmapped reads (kraken2 or something else) as well as check for low complexity/rRNA. Assembling the unmapped reads would also be useful; then just BLAST the most abundant contigs.
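
As a rough illustration of the annotation step, here is a minimal Python sketch that lists the most abundant taxa among the unmapped reads. It assumes the unmapped reads have already been classified with Kraken 2 and a standard six-column report was written with its `--report` option (percent, clade reads, direct reads, rank code, NCBI taxid, indented name). The function name `top_taxa` is hypothetical, and reports with extra minimizer columns are not handled.

```python
def top_taxa(report_text, rank="G", n=10):
    """Return the n most abundant taxa at one rank from a Kraken 2 report.

    Each result is (name, taxid, clade_reads, percent). The rank codes
    follow the report's own convention, e.g. "G" for genus, "S" for species.
    """
    rows = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            # Skip blank lines and any non-standard report layout.
            continue
        percent, clade_reads, _direct_reads, rank_code, taxid, name = fields
        if rank_code.strip() == rank:
            rows.append(
                (name.strip(), int(taxid), int(clade_reads), float(percent))
            )
    # Sort by the number of reads assigned to the clade, largest first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:n]
```

A large share of reads assigned to an unexpected genus would point toward contamination, whereas mostly unclassified reads or hits to close relatives of the target organism would be more consistent with a divergent reference.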

I’m not really sure what mapping rate to expect exactly, but it should be high (>80%, ideally higher) - no @emmanuel.gonzalez?

Hope that’s helpful,
Nick

That is very helpful. I think we’ll leave filtering off for now and plan to investigate this as a possible pipeline update after we process more datasets and see how they behave.
