New AI/ML Subgroup for Genetic Perturbation Predictive Modeling (GPPM)

Genetic Perturbation Predictive Modeling

We are excited to introduce a new subgroup within the AI/ML AWG centered on predictive modeling using omics data generated by genetic perturbation. We already have a project that involves training ML algorithms on Perturb-Seq scRNA-Seq data to make predictions in unrelated single-cell and bulk RNA-seq datasets, namely those from spaceflight and spaceflight simulation. The premise is that, by training on the profile of transcriptional changes created by an upstream perturbation, the origins of other widespread transcriptional changes (such as those that occur in humans during spaceflight) can be traced to their sources. This approach is unique because it can identify an upstream source gene or cluster even if the source itself does not undergo a significant change in expression. This project already has a preprint on bioRxiv, but it still needs to be submitted to a journal, possibly BMC Bioinformatics, and will likely require some revision and expansion. We would like to see this published. From there we can explore ways to expand on the existing paradigm with new datasets and algorithms. Several possible directions are already on paper, but we are interested in any new avenues or ideas you might bring to the table.
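
For anyone new to the idea, here is a minimal toy sketch of the premise, not the model from the preprint. It assumes each perturbation leaves a characteristic profile of downstream expression changes, averages those profiles per perturbation, and ranks candidate sources for an unlabeled profile by cosine similarity. The gene names, the numbers, and the helper functions `fit_centroids` and `rank_sources` are all hypothetical and exist only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fit_centroids(profiles, labels):
    """Average the expression-change profiles observed for each perturbation."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], profile)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def rank_sources(centroids, query):
    """Rank candidate source perturbations by similarity to a query profile."""
    scores = {label: cosine(centroid, query) for label, centroid in centroids.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy training data (made up): log fold changes of four DOWNSTREAM genes.
# The perturbed gene itself is deliberately not among the features, which
# mirrors the point that the source need not change in expression.
train_profiles = [
    [1.0, 0.8, -0.1, 0.0],    # cells with hypothetical GENE_A perturbed
    [0.9, 1.1, 0.1, -0.1],
    [-0.1, 0.0, 1.2, -0.9],   # cells with hypothetical GENE_B perturbed
    [0.1, -0.2, 0.9, -1.1],
]
train_labels = ["GENE_A", "GENE_A", "GENE_B", "GENE_B"]

centroids = fit_centroids(train_profiles, train_labels)

# An unlabeled profile from an unrelated dataset (again, made-up numbers).
query = [0.1, -0.1, 1.0, -0.8]
ranking = rank_sources(centroids, query)
print("Most likely upstream source:", ranking[0][0])
```

A real model would be trained on many more genes and cells and would need normalization across datasets; the sketch only shows the shape of the problem, which is to learn signatures from labeled perturbations and then score unlabeled profiles against them.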

Active Project Pre-Print:

https://doi.org/10.1101/2024.11.28.625741

Interest Response Form:

We are looking for members to fill the following roles/areas of attention:

Computation – Data processing and model training

Output Analysis – Gene set enrichment analysis and literature validation.

Research – Searching for new datasets, algorithms to apply, and studies to validate predictions against. Involves extensive literature review.

Code Organization – Maintaining and synchronizing versions, GitHub and Hugging Face maintenance. Familiarity with Google Colab notebooks would be helpful, as we need to pivot towards that system.

Submission Experience – Expertise in navigating journal submissions.

Resources we are looking for:

Compute – The existing models were created using a limited dataset that fit within 164 GB of memory, but we will need a larger server configured for remote access via Colab. We have a 96 GB server that may be available for this as a last resort, but we would like to find a better solution.

OSDR human spaceflight RNA-seq datasets – The existing project was created with the recently compiled human datasets in mind, but it has never been run on them due to access limitations.

@AIMLawg @MultiOmicsAWG

@rachelcrivero and myself might want to set up a meeting with you all regarding some new datasets we have been working on… :slight_smile:

Hi all,

Congratulations on developing this framework!
I would be very interested in discussing how this can be expanded towards viral evolution, genomic epidemiology, and public health purposes.
Please let me know if this would be of interest, and when would be a good time to connect.
I’ve also registered to become part of your subgroup.

Many thanks,
Nidia

@liamfj17 @lauren.sanders

Hello everyone :folded_hands:

I’m Chalermchai from Saraburi, Thailand.

The project details of the AI/ML sub-team are close to what I have been imagining. I would like to present my idea in case any experts want to add to it :pushpin: :victory_hand: I named the project “Astro-Symbiote: AI-Powered Bio-Adaptive Habitat Management”.

* :pushpin: Concept: :backhand_index_pointing_right: Create an AI system that acts as a “symbiotic organism” (Symbiote) within a closed-loop ecosystem in a spacecraft or a base on another planet. The AI will learn and adapt in real time to maintain the balance of living things (plants, microorganisms, captive insects) and resources (water, air, nutrients).
* AI/ML uses:
  * Reinforcement Learning: The AI learns from experiments that adjust various factors (light, temperature, humidity, watering/nutrients) and receives feedback from biological sensors (plant growth rate, microbial health) to find the best “policy” for maintaining the balance.
  * Computer Vision & Anomaly Detection: The AI analyzes images from microscopes and regular cameras to detect plant diseases, pest/microbial outbreaks, or ecosystem abnormalities early on.
  * Predictive Modeling: Forecast future resource needs, oxygen/food production, and waste management so the system can adapt in advance.
* Novelty: It’s not just about controlling the environment, but about creating AI that “understands” and “nurtures” complex biological systems so they can grow and sustain themselves in limited environments.

Thank you :folded_hands:

Chalermchai

@Anatta

Just a reminder that the first meeting for the GPPM subgroup is today at 2:00 PM Pacific. Hope to see you there!

Video call link: https://meet.google.com/kgu-oxpk-hee

Did you all see this paper?

https://www.nature.com/articles/s41592-025-02772-6

@liamfj17 @lauren.sanders @nidiatrovao

Hello! I’d love to join this subgroup. I’ve just filled out the form linked above, looking forward to hearing from you all! ML models and scRNA-seq are two topics I’ve worked firsthand with, so I hope I can bring valuable insight and skill to this team!

@liamfj17 @lauren.sanders

Still a space for a data engineer? @liamfj17 @lauren.sanders

Topic→New PR for Perturbation Theory

URL: https://github.com/liamfj17/Perturb-Seq-Transfer-ML-for-Prediction/pull/1

Reviewers assigned→ @liamfj17 cced @lauren.sanders

Content: Improved README with best practices for pull requests and merges. (@liamfj17 you will need to block main for now to avoid automatic merges)

Main action item: Establish access to public datasets for local testing. Next Ticket…?

Dev: Felipe Pineda

Hiya - The GitHub repo is ready for contributions! For now the process is to:

A. Clone the repo (request access first). :rescue_worker_s_helmet:
B. Switch to the UAT branch using

git checkout UAT

C. Create a new branch from UAT with:

git checkout -b {featureName}FeatureDev

D. Once you have made all your changes and are happy with them, commit and push them to your branch and open a pull request to UAT! That's it. :satellite:

If you have any questions feel free to reach out. Let the improvements begin!

cc. @nidiatrovao @sriram.susarla @Anatta @liamfj17

AnnData_Dummy_Guide.pdf (2.9 KB)
Hi everyone,

Before we dive in, I wanted to share some context about why we are using AnnData and how it differs from working with Pandas or traditional data warehousing approaches (tables, views, and hierarchical structures used in BI or quarterly reporting). You will encounter these classic schemas in data-driven organisations, but not AnnData.

AnnData is particularly powerful because it extends a plain 2D table (cells × genes) with aligned annotations: per-cell and per-gene metadata, additional layers of the same matrix, and multidimensional arrays such as embeddings. While this is especially useful for genetic data, it can also be applied to other compressed values that benefit from additional metadata to enrich characterisation. This means ingesting subclassifications or alternative expressions of the data is simpler, which matters because eventually we will have to feed these multidimensional arrays into AI models.

I’ve attached a simple document to review before our next meeting. Understanding AnnData and its dimensional structure will be crucial for moving this project forward! Have a Good Friday everyone!

cc. @liamfj17

:rocket: Our very first Automated Pipeline is now LIVE in UAT!
Check it out here :backhand_index_pointing_right: experimentalPipeline.ipynb

The AWG Tutorial subfolder is designed to give new members a smooth, step-by-step intro: pulling genomics data, preparing a multilayer perceptron model, and finishing with clean UMAP visualizations (just a visualization trick, no physical dimensions here :wink:).

:bullseye: The goal of this short script is to help you grasp the bigger picture of what the project objective is all about.

:backhand_index_pointing_right: To get access, reach out to @liamfj17 to be added to the repo.

Next up on my plate: standardizing the URL pipelines so we can seamlessly pull in our own genomics datasets for this project. Stay tuned.

cc. @nilufarali @liamfj17

@AIMLawg

Hi Liam – I’m getting a 404 on the .ipynb link. Could you make sure that’s still valid?

cheers

-james

Thank you very much!

Me as well…Felipe or Liam could you please reshare??

Same with me, Sir. 404. Kindly help

Hi everyone, if you are interested in checking it out, please message me with your GitHub username and I will invite you. The 404 seems to be a side effect of it being a private repo for now (it is still under active construction and at the moment largely a static proof of concept). We are moving to develop this rather rapidly though, so stay tuned and check back as well!

@james.casaletto @Shankia1985 @nilufarali @Dhanalakshmi

I am grateful for this opportunity. My github name is: Shankia001

Thank you. @liamfj17

Dr. M. Dhanalakshmi

Hi @liamfj17, I’m unable to access the .ipynb link. Could you kindly help me with it? My GitHub username is MehreenAshraf332.