We will meet today to discuss progress made towards building a digital twin model of the retina. For the past few weeks, we have focused on assembling datasets and building a robust imputation pipeline. @vaishnavi.nagesh has made good progress on that front. We will continue the discussion and support her work.
Some changes:
We are making rapid progress, so we will switch to a weekly meeting format starting next week (every Thursday, 10 am PST). We aim to push this work into a publication quickly, and we can use all the help we can get from you. Some tasks in mind are: (a) literature review, (b) coding, (c) making figures/diagrams, (d) building an interactive dashboard & web interface.
We will organize a bi-weekly working session focused on paired/group coding, using VS Code's Live Share extension. This meeting will be every other Friday, 11 am PST. It is intended to get members up to speed with accessing and manipulating data, so that all of us feel comfortable using developer tools. Come code with us and ask questions.
Within the project, we will self-organize into two cohorts: a coding cohort and a data cohort. The coding cohort's job is to test new features & models. The data cohort's focus is on assembling more relevant datasets and putting everything together so that the data can be used by the coding cohort. You are welcome to join either or both. Our goal is to build this out into a foundation model of the retina. We will have some in-depth discussions on foundation models. I can tell you that this is really exciting!
Hi Jian,
Hope the meeting went well and I am terribly sorry for not being able to make it today. I’ll listen to the recording.
No drastic updates from my end, very minor ones:
The imputation code has been merged, and I have started documenting the methods and results from imputation. I'll upload the doc to Google Drive.
Started comparing our imputation approach to the reference GAN publication that was posted on this forum.
In general, I have started reading about GANs for next steps, but I'm happy to pivot to anything that needs more attention.
Joining the meeting and coding session today and tomorrow will be a little difficult, since I need to tend to a healthcare issue.
We had a short meeting today. I did not end up recording it, but will write an update here instead (faster this way).
Evaluation of MICE. We looked over the mechanism of MICE (Multiple Imputation by Chained Equations) and reasoned through the conditions under which it should be most effective. This exploration is summarized in these PowerPoint slides. Please take a look.
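For reference, here is a minimal runnable sketch of MICE-style imputation, assuming we use scikit-learn's IterativeImputer (which implements a MICE-like chained-equations scheme); the toy matrix is a stand-in for our actual data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy stand-in for our data: 100 samples x 10 features, ~20% missing at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[rng.random(X.shape) < 0.2] = np.nan

# sample_posterior=True draws each fill-in from the conditional model's
# posterior, which is closer to classical MICE than a deterministic prediction.
imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```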
Moving forward, I think we need to focus on evaluating the imputation, i.e., more cross-validation tests. One test that @jakubm had suggested in our last meeting is to artificially remove some observed values (since our data are limited, we need to be highly selective in doing this) and compare each imputed value with the real, held-out value; a sketch of this test follows below. There might also be other statistical measures; we need a good literature review and a table of all of these measures (help needed!).
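Here is a sketch of that hold-out test, under the assumption that our data live in a NumPy array with NaNs marking missing entries; `impute_fn` is a placeholder for whatever imputer we plug in (e.g., the MICE sketch above):

```python
import numpy as np

def mask_and_score(X, impute_fn, frac=0.05, seed=0):
    """Hide a small fraction of observed entries, impute, and score."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)
    # Be selective: hide only a small random subset of the observed entries.
    held_out = observed & (rng.random(X.shape) < frac)
    X_masked = X.copy()
    X_masked[held_out] = np.nan
    X_hat = impute_fn(X_masked)
    err = X_hat[held_out] - X[held_out]
    return {"rmse": float(np.sqrt(np.mean(err**2))),
            "mae": float(np.mean(np.abs(err))),
            "n_held_out": int(held_out.sum())}
```

Other measures (e.g., correlation between imputed and true values) can be added to the returned dict as we build out the comparison table.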
Think about sequential imputation: from our evaluation, we know that MICE currently uses a "random" imputation order, filling in the data in a random sequence. I think we can improve on this with a non-random, sequential approach: first evaluate the existing information content at each row, then fill in the rows that already carry the most information first. However, this is a hypothesis, and we can test it with a direct comparison; see the sketch below. @vaishnavi.nagesh, does the library currently support this?
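Partial answer, assuming the library in question is scikit-learn's IterativeImputer: it does expose an `imputation_order` parameter, though it orders features (columns) by missingness rather than rows; `"ascending"` starts with the columns that have the fewest missing values, i.e., the most information:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Baseline: random fill order (what we currently observe).
random_order = IterativeImputer(imputation_order="random", random_state=0)

# Sequential: fill the most-informative (least-missing) features first.
sequential_order = IterativeImputer(imputation_order="ascending", random_state=0)

# Direct comparison: pass each imputer's fit_transform as the `impute_fn`
# callable to mask_and_score above and compare the error metrics.
```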
See you all tomorrow at the coding session. I will go over the GitHub repo, and we can talk more about the dataset & the dashboard that is being developed.
A GitHub account. If you don't already have one, it might be a good idea to get one now. It takes a few steps and a few minutes to set up (involving authentication with your phone). Then, log into your GitHub account from within VS Code.
Get the Live Share extension within VS Code. This is easy: one click of a button.
Hi @vaishnavi.nagesh and all, I am still working on my action items from the last meeting, which I summarize below. I have completed Steps 1 and 2, so here is the link to a text file listing the genes that come from the phenotype-related gene sets. V, you could go ahead and impute these genes and see how the imputation performs!
Step 1. Identify a list of gene sets most closely related to each phenotype. See the list of gene sets here.
Step 2. Combine all the genes from the phenotype-related gene sets in a non-redundant manner.
Step 3. Use linear regression to identify which genes are most predictive of each phenotypic measurement.
Step 4. Identify how much overlap there is between the regression genes and the gene-set genes. If there is a lot of overlap (e.g., > 70%), use that list. If there isn't much overlap (e.g., < 50%), then impute both gene lists and see which imputation performs better. A code sketch of Steps 2-4 follows below.
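To make the steps concrete, here is a hypothetical sketch of Steps 2-4. The gene sets, expression matrix, and phenotype are toy placeholders, and since plain least squares does not by itself select genes, this uses Lasso (a sparse linear regression) as one way to pick the most predictive ones:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Step 2: combine genes from all phenotype-related gene sets, non-redundantly.
gene_sets = {"setA": ["RHO", "NRL"], "setB": ["NRL", "CRX"]}  # toy placeholder
gene_set_genes = set().union(*gene_sets.values())

# Step 3: find the genes most predictive of a phenotypic measurement.
# Lasso yields a sparse coefficient vector; nonzero coefficients = selected genes.
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(50, 3)), columns=["RHO", "NRL", "CRX"])
phenotype = 0.8 * expr["RHO"] + rng.normal(scale=0.1, size=50)
model = LassoCV(cv=5).fit(expr, phenotype)
regression_genes = set(expr.columns[model.coef_ != 0])

# Step 4: overlap between the regression genes and the gene-set genes.
overlap = len(regression_genes & gene_set_genes) / max(len(regression_genes), 1)
print(f"overlap: {overlap:.0%}")  # >70%: use the combined list;
# <50%: impute both lists and compare imputation performance.
```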