Understanding Input Data and First Steps in the Proteomics Pipeline

Maria · February 14, 2026, 12:33am

Introduction
Hi everyone, I recently joined the Proteomics AWG Subgroup Project and I’m studying the pipeline so I can contribute by helping new members with the first steps. This post is my attempt to summarize how our data is generated, hosted in OSDR, and incorporated into the pipeline. I’d love feedback to check if I’ve captured it correctly, or if there are details I should adjust.

Proteomics Pipeline Overview

Raw Files: .wiff, .raw, .d from TripleTOF, Orbitrap, timsTOF
Conversion: ProteoWizard → .mzML
Quality Control (QC): Check signal, retention time, MS/MS coverage
Decision Point:
- Include → move to analysis
- Exclude → flagged or removed
Downstream Analysis:
- Search Engines: MSFragger, X!Tandem, MSGF+
- Identification: Peptides → Proteins
- Quantification: Label-free or labeled
- Statistical Analysis: Differential expression, pathway enrichment

Input Data Formats in OSDR
Scientific research instruments generate raw data in different proprietary formats:

.wiff + .wiff.scan → native format from SCIEX QTOF instruments (e.g., NASA OSDR dataset OSD-581).
.raw → native format from Thermo Hybrid Orbitrap instruments (widely used in both proteomics and metabolomics).
.d → native format from Bruker QTOF + ion mobility instruments.
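
As a small illustration of the list above, here is a minimal sketch that labels a file by its extension. The function name `classify_raw_file` and the vendor labels are my own assumptions for illustration; they are not part of OSDR or any existing tool.

```python
from pathlib import Path

# Mapping taken from the list above; the labels are illustrative only.
RAW_FORMATS = {
    ".wiff": "SCIEX",
    ".wiff.scan": "SCIEX (companion scan data)",
    ".raw": "Thermo",
    ".d": "Bruker",
}


def classify_raw_file(name: str) -> str:
    """Return the vendor label for a raw file name, or 'unknown'."""
    lower = Path(name).name.lower()
    # Check longer suffixes first so '.wiff.scan' is matched as a whole.
    for suffix in sorted(RAW_FORMATS, key=len, reverse=True):
        if lower.endswith(suffix):
            return RAW_FORMATS[suffix]
    return "unknown"


for example in ["sample.wiff", "sample.wiff.scan", "run01.RAW", "plate.d", "notes.txt"]:
    print(example, "->", classify_raw_file(example))
```

Something like this could help a new member quickly sort a downloaded dataset folder before conversion.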

These files are the raw input data hosted in OSDR. The research instruments that produce them differ from clinical/medical instruments because research MS systems are designed to explore unknowns, discover new peptides/proteins, and map complex mixtures. They are more flexible and tunable, often with one or more collision cells for fragmentation.

Collision Cells and Fragmentation
Mass spectrometers record data as m/z vs intensity — essentially a graph of ion mass-to-charge ratios against their signal strength. The way this data looks depends on whether the precursor ion is fragmented before detection:

Without fragmentation → the detector mainly sees the intact precursor ion, giving limited information.
With fragmentation → collision cells break the precursor into multiple fragment ions, each producing its own m/z peak. This greatly enriches the spectrum and makes peptide/protein identification possible.
Different instruments handle fragmentation differently:

Mass spectrometers with a single collision cell:
Precursor ions are fragmented once before entering the analyzer. This produces a straightforward set of fragment ions that can be measured directly.
Mass spectrometers with multiple collision cells:
Precursor ions can undergo sequential fragmentation. The first collision cell fragments the precursor ions, and subsequent cells can further fragment those product ions into additional fragments. This layered fragmentation provides richer structural information and allows different fragmentation styles to be applied.
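
To make the idea of "one precursor, many fragment peaks" concrete, here is a rough sketch that computes where the fragment peaks of a peptide would appear. It assumes the common b/y ion series, singly charged fragments, and an unmodified peptide; the function names are mine, and the residue masses are standard monoisotopic values for a handful of amino acids only.

```python
# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "R": 156.10111,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # mass of a proton


def precursor_mz(peptide: str, charge: int = 1) -> float:
    """m/z of the intact peptide at the given charge state."""
    mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge


def fragment_ions(peptide: str) -> dict:
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    ions = {}
    for i in range(1, len(peptide)):
        # b ions contain the first i residues; y ions contain the rest plus water.
        b = sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON
        y = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON
        ions[f"b{i}"] = round(b, 4)
        ions[f"y{len(peptide) - i}"] = round(y, 4)
    return ions


print("precursor [M+H]+:", round(precursor_mz("PEPTIDE"), 4))
for name, mz in sorted(fragment_ions("PEPTIDE").items()):
    print(name, mz)
```

Without fragmentation the spectrum would show only the precursor m/z; with fragmentation, each b and y ion adds its own peak, which is what search engines match against.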

Collision Cells and Fragmentation (Visual Examples)
MS with one collision cell:

Precursor ion
Fragmented into several fragment ions
Detector records m/z vs intensity for each fragment

Precursor ion → [Q1] fragment | fragment | fragment → [Detector] → m/z vs intensity

(Figure: sketch of an m/z vs intensity spectrum with one peak per fragment.)

This design has a single collision cell where precursor ions are fragmented before entering the analyzer, and it uses advanced ion optics to maximize transmission and fragmentation efficiency.

MS with two collision cells [Q1 + Q2]:

Precursor ion enters Q1
Fragmented further in Q2 (CID + HCD) into additional fragment ions, depending on the pathway
Detector records m/z vs intensity with multiple fragmentation styles

Precursor ion → [Q1] fragment → [Q2] fragment | fragment | fragment → [Detector] → m/z vs intensity

(Figure: sketch of an m/z vs intensity spectrum with peaks from both fragmentation stages.)

This design has two collision cells: a linear ion trap (CID) and a higher-energy collisional dissociation (HCD) cell.
This allows a choice between fragmentation styles depending on the experiment.

Importance of collision cells

Flexibility in fragmentation: Different cells can be optimized for different collision energies or fragmentation methods (CID, HCD, ETD, etc.).
Parallel processing: Some instruments allow simultaneous fragmentation of different ion populations.
Improved sensitivity and resolution: By separating fragmentation stages, the instrument can better control ion transmission and reduce background noise.

Each fragment spectrum stays linked to its precursor ion, so fragments can be traced back to the peptide they came from.

What is a .wiff file?

Proprietary SCIEX raw data format (e.g., from TripleTOF instruments).
Contains metadata: instrument settings, acquisition methods, partial structures.
Acts as a container/header file that organizes the experiment.

What is a .wiff.scan file?

Companion file storing the actual spectral data (m/z values and intensities).
Must be used together with .wiff to reconstruct the dataset.
Without .wiff.scan, you only have metadata; without .wiff, you can’t interpret the scans.

Other related files:

.wiff.mtd → method metadata.
.wiff.analysis → processing/analysis info.

Together, .wiff + .wiff.scan = the raw data package.
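
Since the two files only work as a pair, a quick check before conversion may save some confusion. This is a minimal sketch; the function name `find_unpaired_wiff` is my own, and it assumes the companion file is named exactly `<name>.wiff.scan` (case-sensitive), as in the description above.

```python
from pathlib import Path


def find_unpaired_wiff(file_names):
    """Return (.wiff files lacking a .wiff.scan, .wiff.scan files lacking a .wiff)."""
    names = {Path(n).name for n in file_names}
    wiff = {n for n in names if n.lower().endswith(".wiff")}
    scans = {n for n in names if n.lower().endswith(".wiff.scan")}
    # A .wiff is complete only if '<name>.wiff.scan' sits next to it.
    missing_scan = sorted(w for w in wiff if w + ".scan" not in scans)
    # A .wiff.scan is usable only if its '<name>.wiff' is present.
    missing_wiff = sorted(s for s in scans if s[: -len(".scan")] not in wiff)
    return missing_scan, missing_wiff


listing = ["a.wiff", "a.wiff.scan", "b.wiff", "c.wiff.scan"]
print(find_unpaired_wiff(listing))
```

In practice the list of names could come from `Path(folder).iterdir()` on a downloaded dataset directory.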

What is a .mzML file?

An open, standardized XML-based format.
Think of it as a well-organized digital notebook: each spectrum is a “page” with m/z values, intensities, and metadata.
Unlike proprietary formats, .mzML can be read by many tools (OpenMS, MaxQuant, XCMS, MZmine, Skyline).
It preserves both numbers and context, ensuring reproducibility.

Conversion Example (bash):

msconvert sample.wiff --mzML
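
For batches of files, the same call can be wrapped in a small script. This is only a sketch: the helper names and the `mzml` output folder are my assumptions, and it relies on `msconvert` being installed and on the PATH. As far as I understand, reading vendor formats needs the vendor libraries, which usually means running ProteoWizard on Windows or in a container. It uses only the `--mzML` and `-o` (output directory) options.

```python
import shutil
import subprocess
from pathlib import Path


def build_msconvert_command(raw_file: str, out_dir: str = "mzml") -> list:
    """Build the msconvert call for one raw file (does not run it)."""
    return ["msconvert", raw_file, "--mzML", "-o", out_dir]


def convert(raw_file: str, out_dir: str = "mzml") -> Path:
    """Run msconvert and return the expected output path (default naming assumed)."""
    if shutil.which("msconvert") is None:
        raise RuntimeError("msconvert not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_msconvert_command(raw_file, out_dir), check=True)
    return Path(out_dir) / (Path(raw_file).stem + ".mzML")


print(" ".join(build_msconvert_command("sample.wiff")))
```

Keeping the command builder separate from the runner makes it easy to log or review the exact command before anything is executed.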

Simplified Data Example (XML, with arrays shown as plain numbers for readability):

<spectrum id="scan=101">
  <binaryDataArrayList count="2">
    <binaryDataArray>
      <cvParam name="m/z array"/>
      <binary>100.1 101.2 102.3 103.4 104.5</binary>
    </binaryDataArray>
    <binaryDataArray>
      <cvParam name="intensity array"/>
      <binary>5 10 2 50 3</binary>
    </binaryDataArray>
  </binaryDataArrayList>
</spectrum>

Peak at 103.4 m/z (intensity 50) = the strongest peak in this toy spectrum, most likely real signal.
Small intensities (2–3) = likely noise, still preserved in .mzML.
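
One detail worth knowing: in real .mzML files the `<binary>` element does not hold readable numbers. The arrays are stored base64-encoded (optionally compressed, as 32- or 64-bit floats). The sketch below assumes the simplest case, uncompressed little-endian 64-bit floats, and round-trips the toy values from the example using only the Python standard library. The helper names are mine.

```python
import base64
import struct


def encode_array(values):
    """Pack floats as little-endian 64-bit values and base64-encode them."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text):
    """Reverse of encode_array: base64 text -> list of floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Toy values from the example spectrum.
mz = decode_array(encode_array([100.1, 101.2, 102.3, 103.4, 104.5]))
intensity = decode_array(encode_array([5.0, 10.0, 2.0, 50.0, 3.0]))

# Base peak = the most intense peak in the spectrum.
base_mz, base_intensity = max(zip(mz, intensity), key=lambda pair: pair[1])
print("encoded m/z array:", encode_array(mz))
print("base peak:", base_mz, base_intensity)
```

For actual analysis, a dedicated mzML reader handles compression and precision automatically; this is only meant to show what sits inside the file.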

Step-by-Step Summary

Raw File Conversion → proprietary formats (.wiff, .raw, .d) → .mzML via ProteoWizard.
Quality Control (QC) → assess signal intensity, peak shape, retention time, MS/MS coverage.
Decision Point → include if sufficient quality, exclude if not.
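
The decision point could be sketched as a simple rule check like the one below. Everything here is hypothetical: the `RunQC` fields, the metric names, and especially the thresholds are placeholders I made up to show the shape of the logic, not the subgroup's actual QC criteria.

```python
from dataclasses import dataclass


@dataclass
class RunQC:
    name: str
    ms2_count: int        # number of MS/MS spectra acquired
    median_tic: float     # median total ion current (arbitrary units)
    rt_shift_min: float   # retention time shift vs. a reference run, in minutes


# Hypothetical thresholds, for illustration only.
MIN_MS2_COUNT = 1000
MIN_MEDIAN_TIC = 1e5
MAX_RT_SHIFT_MIN = 2.0


def qc_decision(run: RunQC):
    """Return ('include' or 'exclude', list of reasons for exclusion)."""
    reasons = []
    if run.ms2_count < MIN_MS2_COUNT:
        reasons.append("low MS/MS coverage")
    if run.median_tic < MIN_MEDIAN_TIC:
        reasons.append("weak signal")
    if abs(run.rt_shift_min) > MAX_RT_SHIFT_MIN:
        reasons.append("retention time drift")
    return ("exclude" if reasons else "include"), reasons


print(qc_decision(RunQC("good_run", 5000, 2e6, 0.3)))
print(qc_decision(RunQC("bad_run", 200, 2e6, 3.5)))
```

Returning the reasons alongside the decision keeps excluded runs traceable, which fits the "flagged or removed" step in the overview.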

Why This Matters

Ensures only high-quality, analyzable data enters the shared pipeline.
Supports cross-lab reproducibility, especially vital for spaceflight experiments.
Builds trust in results shared through OSDR and downstream publications.

Closing
This is my current understanding of the subgroup’s pipeline. Did I capture the main steps correctly? Are there details I should refine? I’d love your feedback so I can contribute more effectively and help new members with the first step.
If at any point this post is considered inappropriate, distracting, incorrect, or problematic for the group, please feel free to remove it; I’ll gladly adjust and learn from your guidance.

@joel.steele @asaravia
