Hi everyone,
I’ve been deep in the “engine room” for the last few weeks, focusing on the Space Sensor Honesty and Anomaly Response side of our work.
With the recent release of the reduced SAA dataset, I’ve submitted a proposal to the Brainwriting Doc for a Multi-LLM Benchmarking Framework. My focus is on moving beyond static numerical thresholds and establishing a standardized “Referee” system to evaluate how top-tier models (Gemini 1.5 Flash, GPT-4o, etc.) reason through radiation flux morphology.
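For anyone who wants a concrete picture before opening the notebook, here is a minimal Python sketch of the rubric-style “Referee” idea: instead of a single static flux threshold, each model’s free-text explanation of a telemetry window is graded against a small rubric. All names (FluxWindow, RUBRIC, referee_score) and the specific checks are illustrative assumptions on my part, not the actual logic in the Level 1 Referee notebook:

```python
# Hypothetical sketch of a rubric-based "Referee" scorer (not the notebook's real logic).
from dataclasses import dataclass

@dataclass
class FluxWindow:
    """One telemetry window handed to each candidate model."""
    timestamps: list[float]   # seconds since window start
    flux: list[float]         # counts/s from the detector
    in_saa: bool              # ground truth: window crosses the SAA

# Rubric: each check inspects the model's explanation, not a raw numeric threshold.
RUBRIC = {
    "flags_anomaly":    lambda text: "anomal" in text.lower() or "elevated" in text.lower(),
    "names_saa":        lambda text: "saa" in text.lower() or "south atlantic" in text.lower(),
    "cites_morphology": lambda text: any(k in text.lower() for k in ("rise", "peak", "decay", "shape")),
}

def referee_score(window: FluxWindow, model_answer: str) -> float:
    """Score one model's reasoning about one window, 0.0-1.0."""
    hits = sum(check(model_answer) for check in RUBRIC.values())
    score = hits / len(RUBRIC)
    # Penalise confident SAA claims on quiet (non-SAA) windows.
    if not window.in_saa and RUBRIC["names_saa"](model_answer):
        score -= 0.5
    return max(score, 0.0)

if __name__ == "__main__":
    window = FluxWindow(timestamps=[0, 60, 120], flux=[12.0, 85.0, 14.0], in_saa=True)
    answer = "Flux is anomalously elevated, with a sharp rise and decay consistent with an SAA crossing."
    print(f"Referee score: {referee_score(window, answer):.2f}")  # -> 1.00 for this toy example
```

The real checks in the notebook are richer, but the shape is the same: score the morphology reasoning, not just whether a number crossed a line.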
Current Status:
- Infrastructure: I’m currently collaborating with the Kaggle EAP team to resolve a platform-level environment bug (NbConvertApp) that is blocking the final task-indexing.
- Availability: While the “live” benchmark link is pending that fix, the dataset and the logic are ready for initial review.
The goal is to provide the AWG community with a “Leaderboard” to identify which AI architectures handle telemetry most reliably before the full official database goes live.
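In case it helps to picture the “Leaderboard” output, here is a tiny, hypothetical aggregation step on top of the referee_score sketch above: average each model’s per-window scores and rank. The model names and scores below are made up purely for illustration:

```python
# Hypothetical leaderboard aggregation: rank models by mean referee score.
from statistics import mean

def leaderboard(scores_by_model: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (model, mean score) pairs, highest mean first."""
    return sorted(
        ((model, mean(scores)) for model, scores in scores_by_model.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

if __name__ == "__main__":
    demo = {  # placeholder scores, not real benchmark results
        "gemini-1.5-flash": [0.9, 0.7, 1.0],
        "gpt-4o":           [0.8, 0.8, 0.9],
    }
    for rank, (model, score) in enumerate(leaderboard(demo), start=1):
        print(f"{rank}. {model}: {score:.2f}")
```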
Resources:
- Dataset: https://www.kaggle.com/datasets/gastondana/spacedos
- Referee Logic: https://www.kaggle.com/code/gastondana/squidsaa-ink-stinct-level-1-referee
Looking forward to catching up with the various workstreams and getting your eyes on the benchmarking logic!
All the Best,
Gaston D.