Hi everyone,
Better late than never: I'm thrilled to be diving in more here. As a follow-up, some of you may know me from the Knowhax event, specifically Team 60's work during the SPOKE challenge. I've been building a framework to stress-test the LLMs we use for multi-modal integration.
I've launched the v5.1 Non-Deterministic Gauntlet for OSD-679.
Why it matters for our group: I've implemented a weighted logic shuffle ($P(10,3)$) that randomizes the clinical hurdles per run. This stops "benchmark hacking" and forces the models to adjudicate the SANS Paradox (IOP vs. TRT) in real time.
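For anyone curious about the mechanics, here is a minimal Python sketch of a weighted shuffle that draws an ordered sample of 3 distinct hurdles from a pool of 10, i.e. one of $P(10,3) = 720$ possible sequences per run. The hurdle names and weights below are placeholders, not the actual Gauntlet configuration:

```python
import random

# Placeholder hurdle pool and weights -- illustrative only,
# not the real Gauntlet's clinical hurdles or scoring weights.
HURDLES = [f"hurdle_{i}" for i in range(10)]
WEIGHTS = [3, 3, 2, 2, 1, 1, 1, 1, 1, 1]

def draw_hurdle_sequence(rng: random.Random, k: int = 3) -> list:
    """Draw an ordered, weighted sample of k distinct hurdles
    (one of P(10, 3) = 720 possible sequences when k = 3)."""
    pool = list(HURDLES)
    weights = list(WEIGHTS)
    seq = []
    for _ in range(k):
        # Weighted draw without replacement: pick an index,
        # then remove it from both the pool and the weight list.
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        seq.append(pool.pop(idx))
        weights.pop(idx)
    return seq

# Seeding the RNG per run keeps each Gauntlet run reproducible
# while still varying the hurdle order across runs.
print(draw_hurdle_sequence(random.Random(42)))
```

Because the draw is without replacement, each run gets three distinct hurdles, and the weights let you bias harder hurdles to appear more often without making the order predictable.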
The Benchmarks: Gemini vs. Anthropic
I've been running some head-to-head audits between these two for now:
- Anthropic (Claude 4.x/Opus): Extremely proficient at the "Forensic Join," but currently struggling with the adversarial logic traps (hitting the 20% compliance floor).
- Google (Gemini 3 Pro/Flash): Showing higher resilience in scientific reasoning (~66-70%), but still prone to "polite" hallucinations when faced with conflicting terrestrial hypotheses.
I'm hoping to expand this "Gauntlet" to other OSDs from our SPOKE roadmap soon. If anyone in the AI/ML group has a specific dataset they want me to "harden," or if you want to collaborate on the scoring weights, let's talk!
Here's the leaderboard across the Gemini and Anthropic models (18 in total): SANS Multi-Modal Integration Challenge | Kaggle
If anyone has any other suggestions, I'm all ears. It's pretty cool to take the OSD I had last year and re-highlight it for this benchmark.
-Gaston D.
