During the AI/ML AWG meeting, Lauren mentioned that there are currently some issues with augmenting transcriptomics data. I am writing a review about the minimal sample size needed to train machine learning tools and how we could use data augmentation for human omics data. Are there any experts who can explain more about the issues with GANs for mouse transcriptomics? Why are the synthetic samples not representative at the gene level, and what would be needed to solve this?
Hi Cecile - your review sounds very interesting and timely. I’m tagging James Casaletto here, who has done some work generating synthetic transcriptomics data and mentored a student over the summer who used generative methods for single-cell RNAseq data and did gene-level validation.
Hi Cecile
As Lauren mentioned, we’ve done some work using synthetic data (GAN and VAE) to augment transcriptomic datasets. We find that the overall variance is captured with high fidelity (e.g., as seen in PCA plots). But when we use tools like DESeq2 to find differentially expressed genes, there isn’t much fidelity between the results on real data and the results on synthetic data. And when we look at the distributions of individual gene counts, we see that the synthetic data generating process approximates the real, discrete counts with smooth continuous values. That’s an issue.
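If you want to run this kind of gene-level check yourself, here is a minimal sketch (not our actual pipeline): it compares each gene's real vs synthetic count distribution and flags non-integer synthetic values. The names `real_counts`, `synth_counts`, and `gene_names` are placeholders for your own arrays.

```python
import numpy as np
from scipy.stats import ks_2samp

def gene_level_report(real_counts, synth_counts, gene_names):
    """Flag genes whose synthetic count distribution drifts from the real one."""
    results = []
    for j, gene in enumerate(gene_names):
        real_g = real_counts[:, j]
        synth_g = synth_counts[:, j]
        # Two-sample KS statistic: 0 = identical empirical distributions.
        ks = ks_2samp(real_g, synth_g)
        # Fraction of synthetic values that are not whole numbers --
        # a direct symptom of smooth continuous output standing in for counts.
        frac_non_integer = float(np.mean(~np.isclose(synth_g, np.round(synth_g))))
        results.append((gene, ks.statistic, frac_non_integer))
    # Worst-matching genes first
    return sorted(results, key=lambda r: r[1], reverse=True)

# Toy usage with fake data (just to show the input shapes):
rng = np.random.default_rng(0)
real = rng.negative_binomial(5, 0.3, size=(100, 50))            # integer counts
synth = rng.normal(real.mean(0), real.std(0), size=(100, 50))   # "smooth" output
worst = gene_level_report(real, synth, [f"gene{i}" for i in range(50)])
print(worst[:5])
```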
We haven’t dug deeper to see what we might do to fix that. It may well have to do with how we’re measuring the loss during training. For the GAN on bulk RNA, I used a Wasserstein loss. For the VAE on single-cell data, our student used a KL divergence term in the objective (the latent space is a probability distribution).
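For concreteness, here is roughly what those two objectives look like. This is a generic PyTorch sketch for illustration, not our actual training code, and all function and argument names are placeholders.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Standard VAE objective: reconstruction term + KL(q(z|x) || N(0, I)).
    With an MSE/Gaussian reconstruction term the decoder is free to emit
    smooth continuous values rather than discrete counts."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def wgan_critic_loss(critic, real, fake):
    """Wasserstein critic objective (Lipschitz handling, e.g. gradient
    penalty, omitted): maximize E[critic(real)] - E[critic(fake)],
    implemented here as minimizing the negative."""
    return -(critic(real).mean() - critic(fake).mean())
```

One thing we haven’t tried, but which may be relevant to your “what would be needed to solve this” question: with an MSE/Gaussian reconstruction term the decoder has no reason to produce integer-like counts, so swapping it for a discrete count likelihood (e.g. negative binomial) is one direction worth exploring.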
cheers
-james
Linking some potentially useful work on a standard protocol for comparing the performance of generative/prediction methods; perhaps these ideas could be applied to your topic: https://github.com/SwRI-IDEA-Lab/sw-forecast-protocol