Relevant Article on Algorithm Biases, Open Data, and Reuse

rtscott2001 · August 28, 2024, 12:47pm

Was especially thinking of @evartsb @james.casaletto @asaravia @lauren.sanders @jgong @mboerrigter @brian.russell when reading this article:

Abstract:
Although open databases are an important resource in the current deep learning (DL) era, they are sometimes used “off label”: Data published for one task are used to train algorithms for a different one. This work aims to highlight that this common practice may lead to biased, overly optimistic results. We demonstrate this phenomenon for inverse problem solvers and show how their biased performance stems from hidden data-processing pipelines. We describe two processing pipelines typical of open-access databases and study their effects on three well-established algorithms developed for MRI reconstruction: compressed sensing, dictionary learning, and DL. Our results demonstrate that all these algorithms yield systematically biased results when they are naively trained on seemingly appropriate data: The normalized rms error improves consistently with the extent of data processing, showing an artificial improvement of 25 to 48% in some cases. Because this phenomenon is not widely known, biased results sometimes are published as state of the art; we refer to that as implicit “data crimes.” This work hence aims to raise awareness regarding naive off-label usage of big data and reveal the vulnerability of modern inverse problem solvers to the resulting bias.

Shimron, E., Tamir, J. I., Wang, K., & Lustig, M. (2022). Implicit data crimes: Machine learning bias arising from misuse of public data. Proceedings of the National Academy of Sciences, 119(13), e2117203119.

https://doi.org/10.1073/pnas.2117203119

PatS · August 30, 2024, 3:12am

This is great, thanks for sharing! It's MRI-focused, so I'm curious to see whether similar issues show up in other domains. The paper makes excellent points, but it's difficult to implement their recommendations, especially because getting raw data is definitely an issue with electronic health record data: most raw data is full of errors and inaccessible, and whatever is available is anonymized and aggregated.

rtscott2001 · August 30, 2024, 3:42am

Glad you liked it too! I felt it was a VERY important article connected to best practices for ML & computational biomedical analysis. Indeed, the implications go far beyond MRI for sure.
