Hi everyone,
Better late than never: I'm thrilled to be diving in more here. As a follow-up, some of you may know me from the Knowhax event, specifically Team 60's work during the SPOKE challenge. I've been building a framework to stress-test the LLMs we use for multi-modal integration.
I've launched the v5.1 Non-Deterministic Gauntlet for OSD-679.
Why it matters for our group: I've implemented a weighted logic shuffle ($P(10,3)$) that randomizes the clinical hurdles per run. This stops "benchmark hacking" and forces the models to adjudicate the SANS Paradox (IOP vs. TRT) in real time.
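For anyone curious about the mechanics, here is a minimal Python sketch of a weighted shuffle that draws an ordered sample of 3 distinct hurdles from a pool of 10, i.e. one of $P(10,3) = 720$ possible sequences per run. The hurdle names and weights below are placeholders, not the actual Gauntlet configuration:

```python
import random

# Placeholder hurdle pool and weights -- illustrative only,
# not the real Gauntlet's clinical hurdles or scoring weights.
HURDLES = [f"hurdle_{i}" for i in range(10)]
WEIGHTS = [3, 3, 2, 2, 1, 1, 1, 1, 1, 1]

def draw_hurdle_sequence(rng: random.Random, k: int = 3) -> list:
    """Draw an ordered, weighted sample of k distinct hurdles
    (one of P(10, 3) = 720 possible sequences when k = 3)."""
    pool = list(HURDLES)
    weights = list(WEIGHTS)
    seq = []
    for _ in range(k):
        # Weighted draw without replacement: pick an index,
        # then remove it from both the pool and the weight list.
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        seq.append(pool.pop(idx))
        weights.pop(idx)
    return seq

# Seeding the RNG per run keeps each Gauntlet run reproducible
# while still varying the hurdle order across runs.
print(draw_hurdle_sequence(random.Random(42)))
```

Because the draw is without replacement, each run gets three distinct hurdles, and the weights let you bias harder hurdles to appear more often without making the order predictable.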
The Benchmarks: Gemini vs. Anthropic
I've been running some head-to-head audits between these two for now:
- Anthropic (Claude 4.x/Opus): Extremely proficient at the "Forensic Join," but currently struggling with the adversarial logic traps (hitting the 20% compliance floor).
- Google (Gemini 3 Pro/Flash): Showing higher resilience in scientific reasoning (~66-70%), but still prone to "polite" hallucinations when faced with conflicting terrestrial hypotheses.
I'm hoping to expand this "Gauntlet" to other OSDs from our SPOKE roadmap soon. If anyone in the AI/ML group has a specific dataset they want me to "harden," or if you want to collaborate on the scoring weights, let's talk!
Here's the leaderboard across the Gemini and Anthropic models (18 in total): SANS Multi-Modal Integration Challenge | Kaggle
If anyone has any other suggestions, I'm all ears. It's pretty cool to take the OSD I had last year and re-highlight it for this benchmark.
-Gaston D.
