During the AI/ML AWG meeting, Lauren mentioned that there are currently some issues with augmenting transcriptomics data. I am writing a review about the minimal sample size needed to train machine learning tools and how we could use data augmentation for human omics data. Are there any experts who can explain more about the issues with GANs for mouse transcriptomics? Why are the synthetic samples not representative at the gene level, and what would be needed to solve this?
Hi Cecile - your review sounds very interesting and timely. I’m tagging James Casaletto here, who has done some work generating synthetic transcriptomics data and mentored a student over the summer who used generative methods for single-cell RNAseq data and did gene-level validation.
Hi Cecile
As Lauren mentioned, we’ve done some work using synthetic data (GAN and VAE) to augment transcriptomic datasets. We find that the overall variance is captured with high fidelity (e.g., as seen in PCA plots). But when we use tools like DESeq2 to find differentially expressed genes, there isn’t much fidelity between the results on real data and the results on synthetic data. And when we look at the distributions of individual gene counts, we see that the synthetic data generating process approximates the real, discrete counts with smooth continuous values. That’s an issue.
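If you want to run this kind of gene-level check yourself, here is a minimal sketch (not our actual pipeline): it compares each gene's real vs synthetic count distribution and flags non-integer synthetic values. The names `real_counts`, `synth_counts`, and `gene_names` are placeholders for your own arrays.

```python
import numpy as np
from scipy.stats import ks_2samp

def gene_level_report(real_counts, synth_counts, gene_names):
    """Flag genes whose synthetic count distribution drifts from the real one."""
    results = []
    for j, gene in enumerate(gene_names):
        real_g = real_counts[:, j]
        synth_g = synth_counts[:, j]
        # Two-sample KS statistic: 0 = identical empirical distributions.
        ks = ks_2samp(real_g, synth_g)
        # Fraction of synthetic values that are not whole numbers --
        # a direct symptom of smooth continuous output standing in for counts.
        frac_non_integer = float(np.mean(~np.isclose(synth_g, np.round(synth_g))))
        results.append((gene, ks.statistic, frac_non_integer))
    # Worst-matching genes first
    return sorted(results, key=lambda r: r[1], reverse=True)

# Toy usage with fake data (just to show the input shapes):
rng = np.random.default_rng(0)
real = rng.negative_binomial(5, 0.3, size=(100, 50))            # integer counts
synth = rng.normal(real.mean(0), real.std(0), size=(100, 50))   # "smooth" output
worst = gene_level_report(real, synth, [f"gene{i}" for i in range(50)])
print(worst[:5])
```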
We haven’t dug deeper to see what we might do to fix that. It may well have to do with how we’re measuring the loss during training. For the GAN on bulk RNA, I used a Wasserstein loss. For the VAE on single-cell data, our student used a KL divergence term in the objective (the latent space is a probability distribution).
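For concreteness, here is roughly what those two objectives look like. This is a generic PyTorch sketch for illustration, not our actual training code, and all function and argument names are placeholders.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Standard VAE objective: reconstruction term + KL(q(z|x) || N(0, I)).
    With an MSE/Gaussian reconstruction term the decoder is free to emit
    smooth continuous values rather than discrete counts."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def wgan_critic_loss(critic, real, fake):
    """Wasserstein critic objective (Lipschitz handling, e.g. gradient
    penalty, omitted): maximize E[critic(real)] - E[critic(fake)],
    implemented here as minimizing the negative."""
    return -(critic(real).mean() - critic(fake).mean())
```

One thing we haven’t tried, but which may be relevant to your “what would be needed to solve this” question: with an MSE/Gaussian reconstruction term the decoder has no reason to produce integer-like counts, so swapping it for a discrete count likelihood (e.g. negative binomial) is one direction worth exploring.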
cheers
-james
Linking some potentially useful work on a standard protocol for comparing the performance of generative/prediction methods; perhaps these ideas could be applied to your topic: https://github.com/SwRI-IDEA-Lab/sw-forecast-protocol