Hi all,
I’m wondering if people have tips, or a standard bash script, for downloading files from OSDR?
Potentially something similar to SRA-tools’ fastq-dump.
Best,
Yen-Kai
Hi @yenkai.chen.id,
Datasets on OSDR can be accessed via our API:
Some example API calls can be found here:
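For example, here’s a minimal Python sketch against the files endpoint. The endpoint path and the response field names are my reading of the Data API, so double-check them against the examples linked above:

import requests

# Ask the OSDR Data API for the file listing of one study (OSD-87 here).
url = "https://osdr.nasa.gov/osdr/data/osd/files/87"
resp = requests.get(url, timeout=60)
resp.raise_for_status()
data = resp.json()

# The response nests per-study file records; the key names below are
# assumptions, so inspect `data` and adjust if your response differs.
for study, info in data.get("studies", {}).items():
    for f in info.get("study_files", []):
        print(study, f.get("file_name"), f.get("remote_url"))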
OSDR files can also be downloaded programmatically through our public S3 bucket.
Tutorial: Use the OSDR Public AWS S3 Bucket — OSDR Tutorials
S3 registry: NASA Space Biology Open Science Data Repository (OSDR) - Registry of Open Data on AWS
S3 bucket browser: http://nasa-osdr.s3-website-us-west-2.amazonaws.com/
If you’re unfamiliar with the AWS CLI, you can find out more here: Getting started with the AWS CLI - AWS Command Line Interface
Please note that when downloading files from our public S3 bucket, you will need to specify the OSD version number. The current version can be found to the right of the OSD # on each study page. If you are looking for a version of a study that is not available in the public S3 bucket, please let me know.
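For example, here’s a short boto3 sketch to discover which version folders exist under a study prefix before downloading. It uses anonymous access; the version-folder layout is assumed from the bucket structure described above, and OSD-87 is just an example:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client for the public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED), region_name="us-west-2")

# List the top-level "folders" under one study prefix; these should include
# the version directories you need to pick from.
resp = s3.list_objects_v2(Bucket="nasa-osdr", Prefix="OSD-87/", Delimiter="/")
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])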
Hi @asaravia,
Thank you for this.
I also found: Using genelab-utils to download GLDS data - HackMD from AstrobioMike, although it’s not very scalable for my needs.
I’ll check out the listed tutorial; hopefully no AWS sign-in is required.
Best,
Yen-Kai Chen
It’s been a bit since I’ve used any data, but here’s a script I’ve used before. It’s probably best to just use the API, though:
import boto3
from botocore import UNSIGNED
from botocore.config import Config
import os

# === USER SETTINGS ===
STUDY_CODE = "OSD-811"  # Put whatever study you want here
DATA_DIR = f"{STUDY_CODE}_s3_data"
os.makedirs(DATA_DIR, exist_ok=True)

# === INIT S3 CLIENT (anonymous access) ===
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED), region_name='us-west-2')
BUCKET = "nasa-osdr"
PREFIX = f"{STUDY_CODE}/"

# === LIST OBJECTS (paginated, in case a study has more than 1000 files) ===
print(f"🔍 Searching NASA OSDR bucket for prefix: {PREFIX}")
keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))
if not keys:
    raise ValueError(f"No files found in S3 for study '{STUDY_CODE}'.")

# === DOWNLOAD ALL FILES ===
for key in keys:
    fname = os.path.basename(key)
    if not fname:  # skip folder placeholder keys
        continue
    out_path = os.path.join(DATA_DIR, fname)
    if os.path.exists(out_path):
        print(f"✅ Already exists: {fname}")
        continue
    print(f"⬇️ Downloading: {fname}")
    s3.download_file(BUCKET, key, out_path)

print(f"\n✅ Download complete: {len(keys)} files saved to {DATA_DIR}")
Hi,
I tried installing aws-cli on my uni’s HPC, but I’m not allowed to use sudo in commands on the HPC.
Are there alternatives? Is there no way to use a conda environment instead of the AWS client?
Best,
Hi @yenkai.chen.id,
You should be able to install AWS CLI through conda: https://anaconda.org/conda-forge/awscli
Here’s the conda command:
conda install conda-forge::awscli
Although genelab-utils is a bit outdated and uses our old API, it should still work since our current API is backwards compatible.
That said, I am working on a script that lets you download files programmatically using our current API, but I’m not quite done with it yet. I should have it finished by the end of the day tomorrow. I’ll let you know when it’s ready to try.
Hi @yenkai.chen.id,
I created a Python script for this on GitHub here:
Let me know if anything is not clear or if there is more functionality you want me to add.
Also, please note that the --exclude-ext option is not currently working; the API needs an update to support it. I’ll let you know when that is functioning.
Hey @asaravia,
Thanks for this.
I gave it a try. I think it would be good to be able to filter by string instead of by file type.
A good example: I can only search for fastq.gz when what I want is to filter to just the raw files via raw.fastq.gz, even though there clearly are *raw.fastq.gz files among the fastq.gz results.
The other useful feature is the output from --print-only in genelab-utils: having the file names and corresponding download links would be quite useful for the task below. Perhaps the --pattern option from genelab-utils would cover the filtering mentioned above.
Really, what I would like is to download the sample files individually (with each pair of files downloaded together) instead of downloading the entire dataset in one go, since I would like to parallelise the downloads, which would also let me parallelise the data processing.
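For concreteness, here’s a rough sketch of the per-sample download I have in mind. It assumes anonymous boto3 access to the public bucket, that raw reads end in raw.fastq.gz, and that mates share a sample name before _R1/_R2 (those naming assumptions won’t hold for every study):

import os
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "nasa-osdr"
PREFIX = "OSD-811/"  # plus the version folder if required
OUT_DIR = "raw_reads"
os.makedirs(OUT_DIR, exist_ok=True)

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED), region_name="us-west-2")

# Group raw fastq keys by sample; the raw.fastq.gz suffix and _R1/_R2 naming
# are assumptions about this study's file names.
pairs = defaultdict(list)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        fname = os.path.basename(obj["Key"])
        if fname.endswith("raw.fastq.gz"):
            sample = re.sub(r"_R[12].*", "", fname)
            pairs[sample].append(obj["Key"])

def fetch_sample(sample, keys):
    # Both mates of one sample are downloaded together, so processing of this
    # sample can start while other samples are still downloading.
    for key in keys:
        s3.download_file(BUCKET, key, os.path.join(OUT_DIR, os.path.basename(key)))
    print(f"done: {sample}")

# Keep a handful of samples in flight at once (boto3 clients are thread-safe).
with ThreadPoolExecutor(max_workers=4) as pool:
    for sample, keys in pairs.items():
        pool.submit(fetch_sample, sample, keys)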
Best,