Hi all,
I’m wondering if people have tips, or a standard bash script, for downloading files from OSDR?
Potentially something similar to SRA-tools’ fastq-dump.
Best,
Yen-Kai
Hi @yenkai.chen.id,
Datasets on OSDR can be accessed via our API:
Some example API calls can be found here:
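For example, here’s a minimal Python sketch against the files endpoint. The endpoint path and the response field names are my reading of the Data API, so double-check them against the examples linked above:

import requests

# Ask the OSDR Data API for the file listing of one study (OSD-87 here).
url = "https://osdr.nasa.gov/osdr/data/osd/files/87"
resp = requests.get(url, timeout=60)
resp.raise_for_status()
data = resp.json()

# The response nests per-study file records; the key names below are
# assumptions, so inspect `data` and adjust if your response differs.
for study, info in data.get("studies", {}).items():
    for f in info.get("study_files", []):
        print(study, f.get("file_name"), f.get("remote_url"))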
OSDR files can also be downloaded programmatically through our public S3 bucket.
Tutorial: Use the OSDR Public AWS S3 Bucket — OSDR Tutorials
S3 registry: NASA Space Biology Open Science Data Repository (OSDR) - Registry of Open Data on AWS
S3 bucket browser: http://nasa-osdr.s3-website-us-west-2.amazonaws.com/
If you’re unfamiliar with the AWS CLI, you can find out more here: Getting started with the AWS CLI - AWS Command Line Interface
Please note that when downloading files from our public S3 bucket, you will need to specify the OSD version number. The current version can be found to the right of the OSD # on each study page. If you are looking for a version of a study that is not available in the public S3 bucket, please let me know.
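For example, here’s a short boto3 sketch to discover which version folders exist under a study prefix before downloading. It uses anonymous access; the version-folder layout is assumed from the bucket structure described above, and OSD-87 is just an example:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client for the public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED), region_name="us-west-2")

# List the top-level "folders" under one study prefix; these should include
# the version directories you need to pick from.
resp = s3.list_objects_v2(Bucket="nasa-osdr", Prefix="OSD-87/", Delimiter="/")
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])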
Hi @asaravia,
Thank you for this.
I also found: Using genelab-utils to download GLDS data - HackMD from AstrobioMike, although it’s not very scalable for my needs.
I’ll check out the listed tutorial; hopefully no AWS sign-in is required.
Best,
Yen-Kai Chen
It’s been a bit since I’ve used any data, but here’s a script I’ve used before. It’s probably best to just use the API, though:
import boto3
from botocore import UNSIGNED
from botocore.config import Config
import os

# === USER SETTINGS ===
STUDY_CODE = "OSD-811"  # Put whatever study you want here
DATA_DIR = f"{STUDY_CODE}_s3_data"
os.makedirs(DATA_DIR, exist_ok=True)

# === INIT S3 CLIENT (anonymous access) ===
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED), region_name='us-west-2')
BUCKET = "nasa-osdr"
PREFIX = f"{STUDY_CODE}/"

# === LIST OBJECTS (paginated, in case a study has more than 1000 files) ===
print(f"🔍 Searching NASA OSDR bucket for prefix: {PREFIX}")
keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))
if not keys:
    raise ValueError(f"No files found in S3 for study '{STUDY_CODE}'.")

# === DOWNLOAD ALL FILES ===
for key in keys:
    fname = os.path.basename(key)
    if not fname:  # skip folder placeholder keys
        continue
    out_path = os.path.join(DATA_DIR, fname)
    if os.path.exists(out_path):
        print(f"✅ Already exists: {fname}")
        continue
    print(f"⬇️ Downloading: {fname}")
    s3.download_file(BUCKET, key, out_path)

print(f"\n✅ Download complete: {len(keys)} files saved to {DATA_DIR}")
Hi,
I tried installing aws-cli on my uni’s HPC, but I’m not allowed to use sudo in commands on the HPC.
Are there alternatives? Is there no way to use a conda environment instead of the AWS client?
Best,
Hi @yenkai.chen.id,
You should be able to install AWS CLI through conda: https://anaconda.org/conda-forge/awscli
Here’s the conda command:
conda install conda-forge::awscli
Although genelab-utils is a bit outdated and uses our old API, it should still work since our current API is backwards compatible.
That said, I am working on a script that lets you download files programmatically using our current API, but I’m not quite done with it yet. I should have it finished by the end of the day tomorrow. I’ll let you know when it’s ready to try.
Hi @yenkai.chen.id,
I created a Python script for this on GitHub here:
Let me know if anything is not clear or if there is more functionality you want me to add.
Also, please note that the --exclude-ext option is not currently working; the API needs an update to support it. I’ll let you know when that is functioning.
Hey @asaravia,
Thanks for this.
I gave it a try. I think it would be good to be able to filter by string instead of by file type.
A good example: I can only search for fastq.gz when what I want is to filter to just the raw files via raw.fastq.gz, even though there clearly are *raw.fastq.gz files among the fastq.gz results.
The other useful feature is the output from --print-only in genelab-utils: having the file names and corresponding download links would be quite useful for the task below. Perhaps the --pattern option from genelab-utils would cover the filtering mentioned above.
Really, what I would like is to download the sample files individually (with each pair of files downloaded together) instead of downloading the entire dataset in one go, since I would like to parallelise the downloads, which would also let me parallelise the data processing.
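For concreteness, here’s a rough sketch of the per-sample download I have in mind. It assumes anonymous boto3 access to the public bucket, that raw reads end in raw.fastq.gz, and that mates share a sample name before _R1/_R2 (those naming assumptions won’t hold for every study):

import os
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "nasa-osdr"
PREFIX = "OSD-811/"  # plus the version folder if required
OUT_DIR = "raw_reads"
os.makedirs(OUT_DIR, exist_ok=True)

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED), region_name="us-west-2")

# Group raw fastq keys by sample; the raw.fastq.gz suffix and _R1/_R2 naming
# are assumptions about this study's file names.
pairs = defaultdict(list)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        fname = os.path.basename(obj["Key"])
        if fname.endswith("raw.fastq.gz"):
            sample = re.sub(r"_R[12].*", "", fname)
            pairs[sample].append(obj["Key"])

def fetch_sample(sample, keys):
    # Both mates of one sample are downloaded together, so processing of this
    # sample can start while other samples are still downloading.
    for key in keys:
        s3.download_file(BUCKET, key, os.path.join(OUT_DIR, os.path.basename(key)))
    print(f"done: {sample}")

# Keep a handful of samples in flight at once (boto3 clients are thread-safe).
with ThreadPoolExecutor(max_workers=4) as pool:
    for sample, keys in pairs.items():
        pool.submit(fetch_sample, sample, keys)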
Best,