How do you set a bioinformatics speed record?
Jeremy Ginsberg
One of the least appreciated aspects of modern software infrastructure is the sheer volume of computation required to satisfy even the most ordinary search query. When Google published a high-level overview of its architecture 15 years ago this month, the design was revolutionary in many ways, including the unprecedented scale and design of its computer clusters, each containing “more than 15,000 commodity-class PCs.”
Thanks to infrastructure like Amazon Web Services (AWS) and Google Cloud, harnessing the power of 15,000 computers today is much easier — no need to build your own data centers or design computer clusters from scratch. Every major modern online service — Airbnb, Dropbox, Facebook, Netflix, Pinterest, Square, Stripe, Twitter, Uber, etc. — relies extensively on both shared and customized distributed compute infrastructure.
The healthcare industry isn’t yet taking full advantage of these trends. Legacy infrastructure and technology remain the norm in most US healthcare settings. For example, the most widely used electronic health record systems, apart from being costly and difficult to configure, aren’t designed around modern cloud-based infrastructure.
But bioinformatics is one of the applications best suited to modern software infrastructure: it involves churning through enormous volumes of data and relies on complex algorithms to find the most important variations in each person’s DNA. Quality, reliability, speed, and cost aren’t just “nice to have” optimizations; timely delivery of results matters to Color’s clients, and controlling costs helps us make genetics more accessible to all.
Before joining Color, senior engineers on our team collectively spent decades working on advanced distributed compute systems at companies like Google, Yahoo, Twitter, and Dropbox. As we developed our bioinformatics pipeline, we learned that standard off-the-shelf software packages often struggle to run quickly or effectively in a modern compute environment. From alignment to variant calling to variant interpretation, we found opportunities to pioneer new techniques and optimize how popular algorithms/packages like GATK and BWA can be deployed in a high-scale, high-throughput production system.
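To make that concrete, here is a minimal sketch of one such deployment detail, written in Python with hypothetical paths, sample names, and thread counts (not Color’s actual code): streaming BWA-MEM’s output directly into samtools sort, so the bulky intermediate SAM file never touches disk and a single sample’s alignment can saturate the cores it is given.

```python
import subprocess


def align_sample(ref_fasta, fastq_r1, fastq_r2, sample_id, out_bam, threads=16):
    """Align one sample's paired-end reads with BWA-MEM and coordinate-sort the result.

    A hedged sketch of a single pipeline stage; file layout, read-group fields,
    and thread counts are illustrative assumptions, not Color's configuration.
    """
    # BWA interprets the literal "\t" sequences in the -R argument as tabs.
    read_group = f"@RG\\tID:{sample_id}\\tSM:{sample_id}\\tPL:ILLUMINA"
    bwa_cmd = [
        "bwa", "mem",
        "-t", str(threads),   # alignment threads
        "-R", read_group,     # read group, needed by downstream variant callers
        ref_fasta, fastq_r1, fastq_r2,
    ]
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]

    # Stream BWA's SAM output straight into samtools sort: no intermediate
    # SAM file is written, which saves both time and disk on large samples.
    bwa = subprocess.Popen(bwa_cmd, stdout=subprocess.PIPE)
    subprocess.run(sort_cmd, stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    if bwa.wait() != 0:
        raise RuntimeError(f"bwa mem failed for sample {sample_id}")

    # Index the sorted BAM so callers can random-access specific regions.
    subprocess.run(["samtools", "index", out_bam], check=True)
```

Small choices like this (streaming between tools rather than staging files, matching thread counts to the instance) add up once they are multiplied across every sample in every batch.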
When Color launched in 2015 with a 19-gene hereditary cancer panel, our bioinformatics pipeline often took more than 4 days to process a small batch of samples, which isn’t unusual with next-generation sequencing data. And in recent years, our infrastructure challenge has only grown:
- We expanded our assay to sequence a broader set of hundreds of genes
- We expanded our sensitivity (ability to find rare/complex variants) by incorporating a number of industry-standard callers for copy number variants (CNVs) and other complex structural variants, and by designing novel variant callers/segmenters when appropriate (some of which we shared at ASHG 2017 here and here)
- We increased our supported batch size (how many samples we process at a time) by an order of magnitude, to support the latest sequencing hardware like Illumina’s NovaSeq (see the fan-out sketch just after this list)
- Our overall business volume and throughput have grown exponentially since 2015
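As referenced in the batch-size item above, here is a hedged sketch of the kind of fan-out that keeps a ten-times-larger batch from taking ten times longer: per-sample work is spread across a pool of workers, so the batch’s wall-clock time tracks the slowest sample rather than the sum of all samples. The process_sample placeholder and worker count are hypothetical, not Color’s actual pipeline code.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def process_sample(sample_id: str) -> str:
    """Placeholder for the per-sample pipeline (alignment, variant calling, QC)."""
    # ... the real per-sample stages would run here ...
    return sample_id


def process_batch(sample_ids, max_workers=96):
    """Run every sample in a batch concurrently and collect per-sample failures."""
    results, failures = [], []
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_sample, s): s for s in sample_ids}
        for future in as_completed(futures):
            sample_id = futures[future]
            try:
                results.append(future.result())
            except Exception:
                # A single bad sample shouldn't block the rest of the batch.
                failures.append(sample_id)
    return results, failures
```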
Any one of these challenges might have resulted in a substantially slower pipeline. Instead, we introduced optimizations that addressed them and achieved a consistent runtime of about 2 hours. Today we begin processing a batch’s sequencer data (uploaded from our lab to Amazon’s S3 cloud storage) immediately, before the sequencer hardware itself has finished its internal quality control. By the time the sequencer has finished certifying the run’s quality, which takes a few hours, we’ve typically completed the entire bioinformatics pipeline — effectively adding zero net latency to our processing/handling of client samples!
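The paragraph above doesn’t spell out the trigger mechanism, but one natural way to start work the moment data lands in S3 is an event notification. The sketch below assumes a hypothetical AWS Lambda function subscribed to the bucket’s ObjectCreated events; it enqueues each newly uploaded run file for the pipeline’s workers, so downstream processing can begin while the sequencer is still running its own quality control. The bucket prefix and queue are illustrative assumptions.

```python
import json
import os

import boto3

sqs = boto3.client("sqs")
# Hypothetical queue consumed by the pipeline's dispatch workers.
PIPELINE_QUEUE_URL = os.environ["PIPELINE_QUEUE_URL"]


def handler(event, context):
    """AWS Lambda entry point for S3 ObjectCreated notifications.

    The bucket layout ("sequencer-runs/" prefix) and queue are assumptions
    for illustration; the idea is simply that upload completion, not run
    completion, is what kicks off processing.
    """
    enqueued = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if not key.startswith("sequencer-runs/"):
            continue  # ignore objects unrelated to sequencer output
        sqs.send_message(
            QueueUrl=PIPELINE_QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
        enqueued += 1
    return {"enqueued": enqueued}
```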
During our internal “Hack Week” last December, the bioinformatics engineering team tried running the pipeline on whole-genome sequencing (WGS) data: massively broad, covering more than 98% of the human genome, with a read depth of 30x. Our pipeline had never been tested on such data before, so we expected long days of debugging/tuning/configuring before we could process a single WGS sample; after all, Color’s pipeline has been custom-engineered for high-throughput commercial panel testing. Instead, we were pleased to discover that we could not only analyze whole-genome data successfully, but could process a single 30x WGS sample in less than 19.5 minutes — notionally besting the current published world record with only a few days’ effort.
Measuring speed records like this might seem a bit silly, but in fact, we’re measuring the future. Looking ahead a few years, given the ongoing rapid decrease in sequencing costs, deep WGS may well become the standard for all genetic tests, which means Color and other leading genetics operations will need to process enormous volumes of WGS samples. We now know our bioinformatics pipeline will be up to the task!
Stay tuned, as next week one of our senior engineers will share more technical details about Color’s pipeline and some of the design decisions we made along the way.