🧬 AWG Update: Benchmarking AI Scientific Integrity (v5.0.1)

Hi everyone,

Better late than never: I’m thrilled to be diving in more here. Some of you may know me from the Knowhax event, specifically Team 60’s work during the SPOKE challenge. As a follow-up, I’ve been building a framework to stress-test the LLMs we use for multi-modal integration.

I’ve launched the v5.1 Non-Deterministic Gauntlet for OSD-679.

Why it matters for our group: I’ve implemented a weighted logic shuffle ($P(10,3)$) that randomizes clinical hurdles per run. This stops “benchmark hacking” and forces the models to adjudicate the SANS Paradox (IOP vs. TRT) in real time.
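
For anyone picturing the shuffle, here is a minimal sketch of how a weighted draw could work, assuming $P(10,3)$ denotes ordered selections of 3 hurdles from a bank of 10 (10 × 9 × 8 = 720 possible sequences). The hurdle names, the weights, and the `draw_hurdles` helper are placeholders for illustration only, not the actual Gauntlet code.

```python
import math
import random

# Hypothetical hurdle bank: names and weights are placeholders,
# not the real Gauntlet's logic bank.
HURDLES = {
    f"hurdle_{i}": weight
    for i, weight in enumerate([5, 4, 4, 3, 3, 2, 2, 1, 1, 1])
}


def draw_hurdles(bank, k=3, seed=None):
    """Draw k distinct hurdles in order, favoring higher-weighted ones.

    Each pick is a weighted choice among the hurdles not yet drawn,
    so the result is an ordered sequence without repeats.
    """
    rng = random.Random(seed)
    remaining = dict(bank)
    picked = []
    for _ in range(k):
        names = list(remaining)
        weights = [remaining[name] for name in names]
        choice = rng.choices(names, weights=weights, k=1)[0]
        picked.append(choice)
        del remaining[choice]
    return picked


# Number of distinct ordered sequences of 3 hurdles out of 10.
NUM_SEQUENCES = math.perm(10, 3)
```

Calling `draw_hurdles(HURDLES)` without a seed gives a fresh sequence per run, while passing a fixed `seed` reproduces a specific run for auditing.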

The Benchmarks: Gemini vs. Anthropic

I’ve been running head-to-head audits between these two for now:

  • Anthropic (Claude 4.x/Opus): Extremely proficient at the “Forensic Join,” but currently struggling with the adversarial logic traps (hitting the 20% compliance floor).

  • Google (Gemini 3 Pro/Flash): Showing higher resilience in scientific reasoning (~66-70%), but still prone to “polite” hallucinations when faced with conflicting terrestrial hypotheses.

I’m hoping to expand this “Gauntlet” to other OSDs from our SPOKE roadmap soon. If anyone in the AI/ML group has a specific dataset they want me to “harden” or if you want to collaborate on the scoring weights, let’s talk!

Here’s the leaderboard covering most, if not all, of the Gemini and Anthropic models (18): SANS Multi-Modal Integration Challenge | Kaggle

If anyone has any other suggestions, I’m all ears. It’s pretty cool to take the OSD I had last year and highlight it again for this benchmark.

-Gaston D.

@AIMLawg

5 Likes

Interesting, are @vaishnavi.nagesh or @lauren.sanders aware of this?

This may be the first OSDR data-AWG challenge on Kaggle?

-Ryan

3 Likes

Honestly, I could ask around on my end, but yeah, you might be on to something there @rtscott2001.

Looks like the Kaggle team featured my code as well, cool stuff. I’ve been killing it over there on the coding side all month!

2 Likes

Hi,

Nice work!

I’ve been experimenting with retinal age gap on OSD-679 for a while, as well as IOP estimation, and I know it pretty well.

If there’s anything I can help with, I’d be happy to.

2 Likes

Hey @AliReza-H, that’s awesome. If you’ve been deep in IOP estimation & retinal age gap, your perspective is exactly what I need. I’m zeroing in on this for the rest of the month before jumping to the next project, so I’m moving kind of fast.

I’d love for you to take a look at my Weighted Logic Bank in the code, specifically how I’m weighting the relationship between IOP shifts & retinal morphology. Since you know OSD-679 well, I’d value your take on whether my adversarial ‘traps’ are hitting the right clinical nuances. Recommendations in any other areas are also welcome!

1 Like