
Research Computing

RRBS/WGBS HPC Genomic ETL Pipelines

A research-computing project that turns raw sequencing inputs into clean, downstream-ready outputs through reproducible, scalable HPC workflows for reduced representation bisulfite sequencing (RRBS) and whole-genome bisulfite sequencing (WGBS) data.

Project Lead | August 2024 - Present | Python / Bash / SLURM

What it does

This work covers the full genomic ETL chain: raw sequencing reads are trimmed, aligned, and written to BAM files, biological features are extracted, and the results are packaged as analysis-ready outputs for downstream biological analysis and machine learning.
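As an illustration only, a stage chain like this could be described as an ordered list of stages with a predictable output path per sample. This is a minimal sketch, not the project's actual code: the stage names, file suffixes, the `results` root, and both helper functions are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical stage names and output suffixes, listed in execution order.
STAGES = [
    ("trim", ".trimmed.fastq.gz"),
    ("align", ".bam"),
    ("extract", ".features.tsv"),
]


def plan_outputs(sample_id, root="results"):
    """Return ordered (stage, output path) pairs for one sample."""
    return [
        (stage, str(PurePosixPath(root) / stage / f"{sample_id}{suffix}"))
        for stage, suffix in STAGES
    ]


def pending_stages(sample_id, existing, root="results"):
    """Return the stages whose planned output path is not in `existing`."""
    return [
        stage
        for stage, path in plan_outputs(sample_id, root)
        if path not in existing
    ]
```

Deriving every path from the sample ID and stage name means a rerun can compare planned outputs against what already exists and skip finished stages, which is one common way to keep a multi-stage run resumable.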

The main engineering focus has been on scaling execution, preserving reproducibility across stages, and making the pipeline practical for research settings where sample counts and compute demands are large enough that ad hoc workflows stop being viable.

Core stack

  • Python orchestration and data handling
  • Bash-based workflow execution
  • SLURM scheduling on HPC infrastructure
  • Genomic preprocessing and transformation steps
  • QC and downstream-ready structured outputs

What I built

  • Pipeline structure for serialized and parallel HPC execution paths
  • Preprocessing and transformation logic for large-scale genomic inputs
  • Output conventions that make the handoff to downstream analysis cleaner
  • Workflow patterns designed to be reproducible and easy to extend over time
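To make the serialized and parallel execution paths concrete, the sketch below builds two kinds of `sbatch` command lines: a job array that fans one script out across samples, and a dependency-chained submission that runs one stage only after another succeeds. It is a hedged example under stated assumptions, not the pipeline's implementation: the function names and script names are hypothetical, and only standard `sbatch` options (`--parsable`, `--array`, `--dependency=afterok:`) are used. The commands are built as lists and not executed.

```python
def array_submit_cmd(script, n_samples, max_concurrent=None):
    """Build an sbatch command that runs `script` once per sample index.

    Indices run from 0 to n_samples - 1; the optional `%limit` suffix caps
    how many array tasks run at the same time.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    spec = f"0-{n_samples - 1}"
    if max_concurrent:
        spec += f"%{max_concurrent}"
    return ["sbatch", "--parsable", f"--array={spec}", script]


def chained_submit_cmd(script, after_job_id=None):
    """Build an sbatch command that optionally waits for a prior job.

    With `after_job_id` set, the job starts only if that job finished
    successfully (afterok).
    """
    cmd = ["sbatch", "--parsable"]
    if after_job_id is not None:
        cmd.append(f"--dependency=afterok:{after_job_id}")
    cmd.append(script)
    return cmd
```

Inside an array job, each task can read its index from the `SLURM_ARRAY_TASK_ID` environment variable and use it to pick a sample. Because `--parsable` makes `sbatch` print the job ID in a machine-readable form, the output of one submission can feed the `after_job_id` of the next.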

Availability

This work reflects internal research/lab engineering and is summarized here at a high level. It is included on this site as representative current work even though the underlying code is not posted publicly.