Introduction
Hi everyone, I recently joined the Proteomics AWG Subgroup Project and I’m studying the pipeline so I can contribute by helping new members with the first steps. This post is my attempt to summarize how our data is generated, hosted in OSDR, and incorporated into the pipeline. I’d love feedback to check if I’ve captured it correctly, or if there are details I should adjust.
Proteomics Pipeline Overview
- Raw Files: .wiff, .raw, .d from TripleTOF, Orbitrap, timsTOF
- Conversion: ProteoWizard → .mzML
- Quality Control (QC): Check signal, retention time, MS/MS coverage
- Decision Point:
  - Include → move to analysis
  - Exclude → flagged or removed
- Downstream Analysis:
  - Search Engines: MSFragger, X!Tandem, MS-GF+
  - Identification: Peptides → Proteins
  - Quantification: Label-free or labeled
  - Statistical Analysis: Differential expression, pathway enrichment
Input Data Formats in OSDR
Scientific research instruments generate raw data in different proprietary formats:
- .wiff + .wiff.scan → native format from SCIEX QTOF instruments (e.g., NASA OSDR dataset OSD-581).
- .raw → native format from Thermo Hybrid Orbitrap instruments (often used in metabolomics).
- .d → native format from Bruker QTOF + ion mobility instruments.
These files are raw input data hosted in OSDR. They differ from clinical/medical instruments because research MS systems are designed to explore unknowns, discover new peptides/proteins, and map complex mixtures. They are more flexible and tunable, often with one or more collision cells for fragmentation.
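For new members sorting through a dataset, the format list above could be captured in a small lookup. This is only a sketch: the function name `describe_raw_file` and the table are illustrative, not part of any OSDR tooling, and it covers just the three formats named in this post.

```python
from pathlib import PurePath

# Illustrative lookup built only from the formats listed in this post.
RAW_FORMATS = {
    ".wiff": "SCIEX QTOF (needs companion .wiff.scan)",
    ".raw": "Thermo Orbitrap",
    ".d": "Bruker QTOF + ion mobility",
}


def describe_raw_file(filename: str) -> str:
    """Return the instrument family for a raw file name, or 'unknown'."""
    name = filename.lower()
    # .wiff.scan has a double extension, so check it before the generic suffix.
    if name.endswith(".wiff.scan"):
        return "SCIEX QTOF (spectral data companion of .wiff)"
    return RAW_FORMATS.get(PurePath(name).suffix, "unknown")


for example in ["sample.wiff", "sample.wiff.scan", "run01.raw", "run02.d", "run03.mzML"]:
    print(example, "->", describe_raw_file(example))
```

Anything already converted, such as a .mzML file, falls through to "unknown" here because the table only describes raw inputs.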
Collision Cells and Fragmentation
Mass spectrometers record data as m/z vs intensity — essentially a graph of ion mass-to-charge ratios against their signal strength. The way this data looks depends on whether the precursor ion is fragmented before detection:
- Without fragmentation → the detector mainly sees the intact precursor ion, giving limited information.
- With fragmentation → collision cells break the precursor into multiple fragment ions, each producing its own m/z peak. This greatly enriches the spectrum and makes peptide/protein identification possible.
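To make the "one precursor, many fragment peaks" idea concrete, here is a rough sketch that computes where fragment peaks would appear for a short peptide. It assumes the common b- and y-ion series produced by collisional fragmentation, singly charged ions, and no modifications. None of this is stated in the post itself, and the function name `fragment_mz` is made up for illustration.

```python
# Monoisotopic residue masses in Da (standard values; only a few residues listed).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of the charge-carrying proton


def fragment_mz(peptide: str) -> dict:
    """Singly charged precursor, b-ion and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b ions hold the first i residues; y ions hold the last i residues plus water.
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    precursor = sum(masses) + WATER + PROTON
    return {"precursor": precursor, "b": b_ions, "y": y_ions}


result = fragment_mz("GAS")
print("precursor:", round(result["precursor"], 4))
print("b ions:", [round(x, 4) for x in result["b"]])
print("y ions:", [round(x, 4) for x in result["y"]])
```

Even a three-residue peptide gives four fragment peaks in addition to the intact precursor, which is the extra information search engines rely on.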
Different instruments handle fragmentation differently:
- Mass spectrometers with a single collision cell: precursor ions are fragmented once before entering the analyzer. This produces a straightforward set of fragment ions that can be measured directly.
- Mass spectrometers with multiple collision cells: precursor ions can undergo sequential fragmentation. The first collision cell fragments the precursor ions, and subsequent cells can further fragment those product ions. This layered fragmentation provides richer structural information and allows different fragmentation styles to be applied.
Collision Cells and Fragmentation (Visual Examples)
MS with one collision cell:
- Precursor ion enters Q1
- Fragmented into several fragment ions
- Detector records m/z vs intensity for each fragment
Precursor ion → [Q1] → fragment | fragment | fragment → [Detector] (m/z vs intensity)
- This design has a single collision cell where precursor ions are fragmented before entering the analyzer.
- It relies on advanced ion optics to maximize transmission and fragmentation efficiency.
MS with two collision cells:
- Precursor ion enters Q1
- Fragmented further in Q2 (CID + HCD) into several fragment ions, depending on pathway
- Detector records m/z vs intensity with multiple fragmentation styles
MS with [Q1 + Q2] collision cells:
Precursor ion → [Q1] → fragment → [Q2] → fragment | fragment | fragment → [Detector] (m/z vs intensity)
- This design offers two fragmentation options: collision-induced dissociation (CID) in a linear ion trap and a higher-energy collisional dissociation (HCD) cell.
- This allows a choice between fragmentation styles depending on the experiment.
Importance of collision cells
- Flexibility in fragmentation: different cells can be optimized for different collision energies or fragmentation methods (CID, HCD, ETD, etc.).
- Parallel processing: some instruments allow simultaneous fragmentation of different ion populations.
- Improved sensitivity and resolution: by separating fragmentation stages, the instrument can better control ion transmission and reduce background noise.
- Traceability: each fragment spectrum stays linked to its precursor ion, so fragments can be matched back to the peptide they came from.
What is a .wiff file?
- Proprietary raw data format from SCIEX instruments (e.g., TripleTOF).
- Contains metadata: instrument settings, acquisition methods, partial structures.
- Acts as a container/header file that organizes the experiment.
What is a .wiff.scan file?
- Companion file storing the actual spectral data (m/z values and intensities).
- Must be used together with .wiff to reconstruct the dataset.
- Without .wiff.scan, you only have metadata; without .wiff, you can’t interpret the scans.
Other related files:
- .wiff.mtd → method metadata.
- .wiff.analysis → processing/analysis info.
Together, .wiff + .wiff.scan = the raw data package.
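Since both files are needed, a quick completeness check before conversion can save a failed run. The sketch below works on a plain list of file names; the function name `find_incomplete_wiff_pairs` is hypothetical and the check is case-sensitive.

```python
def find_incomplete_wiff_pairs(filenames):
    """Return the .wiff names for which either the .wiff or the .wiff.scan is missing."""
    names = set(filenames)
    wiff = {n for n in names if n.endswith(".wiff")}
    # Strip the trailing ".scan" so both sets hold comparable ".wiff" names.
    scan = {n[: -len(".scan")] for n in names if n.endswith(".wiff.scan")}
    # Symmetric difference: names present in one set but not the other.
    return sorted(wiff ^ scan)


files = ["a.wiff", "a.wiff.scan", "b.wiff", "c.wiff.scan"]
print(find_incomplete_wiff_pairs(files))
```

In this example `b.wiff` has no scan file and `c.wiff.scan` has no header file, so both would be reported.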
What is a .mzML file?
- An open, standardized XML-based format.
- Think of it as a well-organized digital notebook: each spectrum is a “page” with m/z values, intensities, and metadata.
- Unlike proprietary formats, .mzML can be read by many tools (OpenMS, MaxQuant, XCMS, MZmine, Skyline).
- It preserves both numbers and context, ensuring reproducibility.
Conversion Example (bash):
msconvert sample.wiff --mzML
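For batches of files, the same command can be driven from Python. This is a sketch under a few assumptions: `msconvert` is installed and on the PATH with vendor support available, and the helper names `build_msconvert_command` and `convert` are invented here. Only the `--mzML` and `-o` (output directory) options are used.

```python
import shutil
import subprocess


def build_msconvert_command(raw_path, out_dir="."):
    """Build the msconvert argument list for one raw file."""
    return ["msconvert", raw_path, "--mzML", "-o", out_dir]


def convert(raw_path, out_dir="."):
    """Run msconvert on one file, failing early if the tool is not installed."""
    if shutil.which("msconvert") is None:
        raise FileNotFoundError("msconvert not found on PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_msconvert_command(raw_path, out_dir), check=True)


print(" ".join(build_msconvert_command("sample.wiff", "mzml_out")))
```

Building the command separately from running it makes it easy to log or review what will be executed before touching any data.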
Raw Data Example (simplified XML; real .mzML files store these arrays base64-encoded):
<spectrum id="scan=101">
  <binaryDataArrayList count="2">
    <binaryDataArray>
      <cvParam name="m/z array"/>
      <binary>100.1 101.2 102.3 103.4 104.5</binary>
    </binaryDataArray>
    <binaryDataArray>
      <cvParam name="intensity array"/>
      <binary>5 10 2 50 3</binary>
    </binaryDataArray>
  </binaryDataArrayList>
</spectrum>
- Peak at 103.4 m/z (intensity 50) = real signal.
- Small intensities (2–3) = noise, still preserved in .mzML.
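In an actual .mzML file the two arrays are not stored as readable numbers but as base64-encoded binary. The sketch below round-trips the example values through that encoding and then separates signal from noise. It assumes uncompressed 64-bit little-endian floats (mzML also allows 32-bit values and zlib compression), and the intensity cutoff of 4 is a made-up value chosen only to match this toy spectrum.

```python
import base64
import struct


def encode_array(values):
    """Pack numbers as little-endian 64-bit floats and base64-encode them."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text):
    """Reverse encode_array: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


def peaks_above(mz, intensity, min_intensity=4.0):
    """Keep (m/z, intensity) pairs at or above a hypothetical noise cutoff."""
    return [(m, i) for m, i in zip(mz, intensity) if i >= min_intensity]


mz = decode_array(encode_array([100.1, 101.2, 102.3, 103.4, 104.5]))
intensity = decode_array(encode_array([5, 10, 2, 50, 3]))
print(peaks_above(mz, intensity))
```

In practice a reader library would handle the decoding, but seeing it once explains why the `<binary>` content of a real file looks like random text.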
Step-by-Step Summary
- Raw File Conversion → proprietary formats (.wiff, .raw, .d) → .mzML via ProteoWizard.
- Quality Control (QC) → assess signal intensity, peak shape, retention time, MS/MS coverage.
- Decision Point → include if sufficient quality, exclude if not.
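The include/exclude decision could be written down as an explicit rule so that every run is judged the same way. The sketch below is purely illustrative: the metric names, the `QCMetrics` class, and all three cutoffs are placeholders I made up, not the subgroup's actual criteria.

```python
from dataclasses import dataclass


@dataclass
class QCMetrics:
    median_intensity: float  # overall signal strength
    rt_shift_min: float      # retention time drift vs. a reference run, in minutes
    msms_fraction: float     # fraction of precursors that received an MS/MS scan


# Hypothetical cutoffs for illustration only.
MIN_MEDIAN_INTENSITY = 1e4
MAX_RT_SHIFT_MIN = 2.0
MIN_MSMS_FRACTION = 0.5


def qc_decision(metrics):
    """Return ('include' or 'exclude', list of reasons) for one run."""
    reasons = []
    if metrics.median_intensity < MIN_MEDIAN_INTENSITY:
        reasons.append("low signal intensity")
    if abs(metrics.rt_shift_min) > MAX_RT_SHIFT_MIN:
        reasons.append("retention time drift")
    if metrics.msms_fraction < MIN_MSMS_FRACTION:
        reasons.append("low MS/MS coverage")
    return ("exclude" if reasons else "include", reasons)


print(qc_decision(QCMetrics(5e4, 0.5, 0.8)))
print(qc_decision(QCMetrics(5e3, 3.0, 0.8)))
```

Returning the reasons alongside the decision means an excluded run can be flagged with an explanation rather than silently dropped.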
Why This Matters
- Ensures only high-quality, analyzable data enters the shared pipeline.
- Supports cross-lab reproducibility, especially vital for spaceflight experiments.
- Builds trust in results shared through OSDR and downstream publications.
Closing
This is my current understanding of the subgroup’s pipeline. Did I capture the main steps correctly? Are there details I should refine? I’d love your feedback so I can contribute more effectively and help new members with the first step.
If at any point this post is considered inappropriate, distracting, incorrect, or problematic for the group, please feel free to remove it — I’ll gladly adjust and learn from your guidance.
My goal is to learn and contribute by helping new members with the first step of the pipeline.