Hello All! I’m seeking help with analyzing a transcriptomics plant dataset retrieved from OSDR on the UseGalaxy.org platform. I am running into an error when running the fgsea tool, and it seems to be an issue with an input (either my ranked genes list or the ontology file), but I have been unable to pinpoint what is going on despite reaching out to the UseGalaxy help forum. If you run analyses using Galaxy or just want to help me troubleshoot, please let me know. Thanks!
Hi Jennifer
I tend to run fgsea through R Shiny applications or RStudio… But let’s see if we can troubleshoot the error you’re encountering in Galaxy. Here are a few things to check:
- Ranked Genes List Format: Double-check the format of your ranked genes list. fgsea expects a specific format, and I find I usually get errors due to incompatible gene IDs in the first column (does it require TAIR IDs, Entrez IDs, or another commonly used type?).
- Non-significant model: Is this error confined to a specific study or comparison? If so, it is possible that there is no significant enrichment (delivering an empty results folder). Is there a log file that shows any evidence of an actual error with the analysis?
- Ontology File Format: As with the ranked genes list, make sure your ontology file is in the format fgsea expects. Galaxy provides some pre-loaded ontologies, and this is where you might find evidence related to the gene ID requirements.
- Error Message: Can you share the specific error message you’re getting when running fgsea? The message might provide clues about the issue with your input files.
Happy to touch base to find solutions, and please pencil me in to support the GL4HS planty team(s) if there are any this summer.
Hi Richard,
Here are screenshots that show an excerpt of the ranked file list (to show the gene IDs) and a screenshot excerpt of the contents of the ontology file:
We referenced the protocols section in the OSDR to check for versions, etc, used in the pipelines and in the dataset. This particular dataset is OSD-218, BUT I have had the problem with other Arabidopsis sets too. I believe that the gene sets are in the same ID format as my ranked list (Entrez?)?
In Galaxy, the actual error reads:
In Sys.setlocale("LC_MESSAGES", "en_US.UTF-8") :
  OS reports request to set locale to "en_US.UTF-8" cannot be honored
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  scan() expected 'a real', got 'F'
Calls: read.table -> scan
Might you happen to have a set of files that you use as input for fgsea that I could test on Galaxy?
Jennifer
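The scan() message above is consistent with a text value turning up where read.table expected a number, for example in the statistic column of the ranked list. As a rough way to check this outside Galaxy, here is a minimal Python sketch that lists rows whose second column is not numeric. The function name, the tab separator, the header assumption, and the example rows (including the statistics) are all illustrative assumptions, not anything taken from OSD-218 or from the Galaxy wrapper.

```python
import io


def find_non_numeric_ranks(handle, sep="\t", has_header=True):
    """Return (line_number, gene_id, raw_value) for rows whose second column is not a number."""
    problems = []
    for line_number, line in enumerate(handle, start=1):
        if has_header and line_number == 1:
            continue  # skip the header row
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            # Row has no second column at all
            problems.append((line_number, fields[0], ""))
            continue
        try:
            float(fields[1])
        except ValueError:
            problems.append((line_number, fields[0], fields[1]))
    return problems


# Illustrative file with an extra symbol column sitting where the statistic should be.
example = io.StringIO(
    "id\tsymbol\tstat\n"
    "AT5G10140\tFLC\t3.2\n"
    "AT1G65480\tFT\t-1.4\n"
)
bad_rows = find_non_numeric_ranks(example)
for line_number, gene_id, raw_value in bad_rows:
    print(f"line {line_number}: {gene_id} has non-numeric value {raw_value!r}")
```

If every row is reported, the second column is probably an annotation column rather than the ranking statistic. To run it on a real file, pass an open file handle instead of the `io.StringIO` example.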
Hi Jennifer,
I’m not sure, but there may be a problem with the input data. The ranked genes input must be a two-column file containing a ranked list of genes. The first column must contain the gene identifiers and the second column the statistic used to rank them. Gene identifiers must be unique (not repeated) within the file and must be the same type as the identifiers in the Gene Sets file. For example: Symbol → VDR and Ranked Stat → 58.1.
The gene sets input can be a tabular file in Gene Matrix Transposed (GMT) format. In GMT format, each row represents a gene set, with the set name in the first column, a description in the second, and the identifiers of the genes in the set in the following columns. GMT files with any identifiers (e.g. Entrez IDs, Symbols) can be used, but the same type of identifiers must be present in the Ranked Genes file.
I am sharing the Galaxy Training RNA-seq genes to pathways tutorial, which I think might help.
best,
Zerrin
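The two requirements described above (unique identifiers in the ranked file, and the same identifier type in both files) can be checked before uploading anything. Below is a minimal Python sketch under stated assumptions: both files are tab-separated, the ranked file has a header row, and the function names, the gene set name `SET_A`, and the example rows are hypothetical.

```python
import io


def read_ranked_ids(handle, sep="\t", has_header=True):
    """Collect the first-column identifiers from a ranked genes file."""
    ids = []
    for index, line in enumerate(handle):
        if has_header and index == 0:
            continue
        line = line.rstrip("\n")
        if line:
            ids.append(line.split(sep)[0])
    return ids


def read_gmt(handle):
    """Parse GMT rows: set name, description, then one gene identifier per column."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a usable gene set row
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


def check_inputs(ranked_ids, gene_sets):
    """Report repeated ranked identifiers and how many identifiers both files share."""
    seen, duplicates = set(), set()
    for gene_id in ranked_ids:
        if gene_id in seen:
            duplicates.add(gene_id)
        seen.add(gene_id)
    gmt_ids = set().union(*gene_sets.values())
    return {
        "duplicates": sorted(duplicates),
        "n_ranked": len(seen),
        "n_shared": len(seen & gmt_ids),
    }


# Illustrative mismatch: symbols in the ranked list, TAIR-style IDs in the GMT.
ranked = io.StringIO("gene\tstat\nFLC\t3.2\nFT\t-1.4\nFLC\t0.5\n")
gmt = io.StringIO("SET_A\tdemo set\tAT5G10140\tAT1G65480\n")
report = check_inputs(read_ranked_ids(ranked), read_gmt(gmt))
print(report)
```

An `n_shared` of zero, or a value far below `n_ranked`, suggests the two files use different identifier types.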
The first screenshot shows gene names/symbols…
The second appears to show the Ensembl/TAIR IDs.
PCA doesn’t show any cause for concern…
There are many possible comparisons; these 4 linear DESeq2 models appear to produce reasonably sized DEG lists for analysis.
Here are the differential comparisons from DESeq2 for the 4 models; these are the lists I’d perform enrichment analysis on:
Col04daySpaceFlight-Col04dayGroundControl
Col08daySpaceFlight-Col08dayGroundControl
WS4daySpaceFlight-WS4dayGroundControl
WS8daySpaceFlight-WS8dayGroundControl
Hi Richard,
For our DGE, we had originally tested WS 8 day - SF vs GC and got a plot that looked like this:
[image]
When you mentioned this:
The first screenshot shows gene names/symbols…
The second appears to show the Ensembl/TAIR IDs.
Do you mean that perhaps my gene name/symbols aren’t using the Ensembl or TAIR ID in the correct format?
Hi Zerrin,
Thanks for the note and the link to the tutorial. We’re using the standard RNAseq pipeline, and we have had no issues analyzing mouse and Drosophila. The problem occurs only when we run the Arabidopsis sets, so our issue is somewhere within the files themselves, and that’s what we’re hoping to pinpoint.
Were there any extra annotation columns in the plant files that weren’t in the other organisms’ results? Say, TAIR ID?
I think the software may want the gene names provided using symbols as the primary identifier instead of the TAIR ID.
Perhaps you just need to remove the TAIR ID column?
@dr.richard.barker huge thanks for spotting the issue; that resolved the problem. Fixing the gene ID vs. symbol mismatch in the table helped!
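For anyone hitting the same error, here is a minimal Python sketch of the kind of fix discussed above: reducing a differential expression table with extra annotation columns to the two columns fgsea needs. The column names `TAIR`, `SYMBOL`, and `Stat`, the function name, and the example rows are assumptions for illustration, so adjust them to the actual table. Which identifier column to keep depends on what the gene sets file uses. Keeping the row with the largest absolute statistic when an identifier repeats is one possible choice, not a rule from fgsea or Galaxy.

```python
import csv
import io
import math


def make_ranked_table(handle, id_column, stat_column, sep="\t"):
    """Return (identifier, statistic) pairs, unique by identifier, sorted by statistic."""
    reader = csv.DictReader(handle, delimiter=sep)
    best = {}
    for row in reader:
        gene_id = (row.get(id_column) or "").strip()
        raw = (row.get(stat_column) or "").strip()
        try:
            stat = float(raw)
        except ValueError:
            continue  # skip rows such as NA that cannot be ranked
        if not gene_id or math.isnan(stat):
            continue
        # Assumption: for repeated identifiers keep the most extreme statistic.
        if gene_id not in best or abs(stat) > abs(best[gene_id]):
            best[gene_id] = stat
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Illustrative table with one annotation column too many for a ranked list.
table = io.StringIO(
    "TAIR\tSYMBOL\tStat\n"
    "AT5G10140\tFLC\t3.2\n"
    "AT1G65480\tFT\t-1.4\n"
    "AT1G01010\tNAC001\tNA\n"
)
ranked = make_ranked_table(table, id_column="SYMBOL", stat_column="Stat")
for gene_id, stat in ranked:
    print(f"{gene_id}\t{stat}")
```

Passing `id_column="TAIR"` instead would keep the TAIR-style identifiers, which is the right choice if the gene sets file is keyed on those.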