IBM and CLC bio provide an accelerated genomics research platform to convert
sequencer data to usable genomic insight.
Imagine a world where medical diagnoses and treatment regimens are based on a
person’s specific genetic makeup, reducing side effects and improving patient
outcomes. That’s the promise of personalized medicine, which is rapidly becoming
a reality through advances in genomic sequencing and analysis.
APPLYING GENOMIC SEQUENCING TO THERAPEUTICS
Dr. Lukas Wartman has firsthand experience with the
power of genomic sequencing. A genetics researcher at Washington University in
St. Louis, Missouri, Dr. Wartman ended up contracting the very disease he was
studying: adult acute lymphoblastic leukemia. His condition deteriorated
rapidly, and there was no known treatment for the cancer.
His colleagues decided to fully sequence the genes of both his cancerous cells
and healthy cells using the High Performance Computing cluster housed in the
Genome Institute at Washington University. They discovered something completely
unexpected: one of Dr. Wartman’s normal genes, FLT3, was malfunctioning,
producing massive quantities of a protein that was feeding the cancer.
The team found a drug typically used to control the
overactive FLT3 gene in patients with kidney cancer. Dr. Wartman became the
first person to take this drug for leukemia, and his cancer is now in
remission. Dr. Wartman’s case demonstrates how genomic sequencing enables
researchers to understand the role of genes in fueling a specific cancer.
Consequently, cancer treatment could be customized with drugs that target a
gene rather than the tumor or tissue where the cancer first appears.
IBM Technical
Computing
ESTABLISHING HIGH-THROUGHPUT PERFORMANCE
Because each human genome comprises over three billion base pairs, whole-genome
sequencing requires tremendous processing power and storage capacity in order to
correlate the variants in the genome with the relevant patient symptoms. Facing
increased demand for sequencing, the industry is challenged to drive down cost
while speeding up the assembly, mapping and analysis involved in the sequencing
process.
To address these issues, IBM and CLC bio have undertaken a joint effort to
develop the IBM Application Ready Solution for CLC bio, a next-generation
sequencing (NGS) platform. The system was built for practitioners, requiring
little IT administration, yet it is scalable, flexible and extendable. This
end-to-end solution integrates a computing cluster built on advanced IBM
hardware and software, CLC Genomics Server software for high-throughput
sequencing, and CLC Genomics Workbench client/desktop software for analyzing and
visualizing NGS data.
The cluster compute nodes consist of IBM® Flex System™ x240 nodes powered by Intel® Xeon®
E5-2680v2 processors. These nodes are connected to an IBM Storwize® V7000
Unified network attached storage system that consolidates block and file
workloads. The Storwize V7000 Unified system also has a single, easy-to-use
management interface that supports both block and file storage, helping to
simplify administration.
The Storwize V7000 Unified system supports file data storage using the IBM General Parallel
File System (GPFS™). With its leading file system performance and its ability
to scale based on customer needs, GPFS is used in the world’s largest
high-performance computing (HPC) installations in addition to mainstream
technical computing environments. Plus, CLC bio software uses a shared-disk
file management solution that provides fast, reliable access to NGS data for
optimizing performance.
Life Sciences
To simplify the deployment and management of the cluster, IBM Platform™ HPC
provides a complete set of technical and high performance computing (HPC)
management capabilities in a single product. The rich set of out-of-the-box
features reduces the complexity and cost of managing and running an optimized
genomics sequencing cluster. Integrated workload management features have been
designed to help improve time-to-results and asset utilization.
PROVIDING A SCALABLE, TURNKEY SOLUTION
IBM Application Ready Solution for CLC bio has been
developed in partnership with CLC bio to deliver a scalable, high performance
genomics sequencing platform based on an IBM reference architecture. A turnkey
solution is available from IBM business partner Re-Store, LLC. It comes
pre-integrated with CLC Genomics Server and CLC Genomics Workbench and includes
global support and service. The solution is easy to deploy and use,
simplifying IT administration and boosting productivity. It has
also been designed to scale as workloads expand over time. The solution
provides up to 90 TB of effective storage capacity, and administrators can
easily add storage extensions and more compute nodes as necessary.
The three turnkey analytics solutions (Small, Medium and Large) have been
benchmarked for their mapping, variant calling and filtering performance.
CLC Genomics Workbench 6.5 and Platform HPC-enabled CLC Genomics Server 5.5 were
installed on an IBM server with Storwize V7000 Unified storage and GPFS. The
benchmark was executed using a 37x coverage human genome data set
(1,415,483,596 reads, 100 bp/read) and 150x coverage exome reads (NA12878) from
an Illumina Genome Analyzer II. The benchmark results for the Analytics
Solutions are shown in Figure 1.
Turnkey solution options:
| | Small Analytics Solution | Medium Analytics Solution | Large Analytics Solution |
| --- | --- | --- | --- |
| Workload size per week | 15 human genome (37x) or 120 human exome (150x) | 30 human genome (37x) or 240 human exome (150x) | 60 human genome (37x) or 480 human exome (150x) |
| Applications | CLC Genomics Server 5.5x, CLC Genomics Workbench: 9 static licenses | CLC Genomics Server 5.5x, CLC Genomics Workbench: 12 static licenses | CLC Genomics Server 5.5x, CLC Genomics Workbench: 15 static licenses |
| Application maintenance | Three years of full maintenance (support and all upgrades) on CLC bio software | Three years of full maintenance (support and all upgrades) on CLC bio software | Three years of full maintenance (support and all upgrades) on CLC bio software |
| Management software | IBM® Platform™ HPC | IBM Platform HPC | IBM Platform HPC |
| System rack | One 25U rack | One 25U rack | One 42U rack |
| System switch | Top-of-rack network switch | Top-of-rack network switch | Top-of-rack network switch |
| System management node | One IBM Flex System x240 with 16 CPU cores and 64 GB RAM | One IBM Flex System x240 with 16 CPU cores and 64 GB RAM | One IBM Flex System x240 with 16 CPU cores and 64 GB RAM |
| System compute nodes | Three IBM Flex System x240 with 60 CPU cores and 384 GB RAM | Six IBM Flex System x240 with 120 CPU cores and 768 GB RAM | Twelve IBM Flex System x240 with 240 CPU cores and 1536 GB RAM |
| CPU/compute node | 2 Intel Xeon 10C Processor Model E5-2680v2 115W 2.8GHz/1866MHz/25MB | 2 Intel Xeon 10C Processor Model E5-2680v2 115W 2.8GHz/1866MHz/25MB | 2 Intel Xeon 10C Processor Model E5-2680v2 115W 2.8GHz/1866MHz/25MB |
| Memory/compute node | 128 GB DDR3 | 128 GB DDR3 | 128 GB DDR3 |
| System internal storage | 6 TB, 7,200 rpm NL SAS | 6 TB, 7,200 rpm NL SAS | 6 TB, 7,200 rpm NL SAS |
| Storwize V7000 Unified | 20 TB effective storage capacity | 55 TB effective storage capacity | 90 TB effective storage capacity |
| System maintenance | 3 Year Onsite Repair 24x7, 4 Hour Response | 3 Year Onsite Repair 24x7, 4 Hour Response | 3 Year Onsite Repair 24x7, 4 Hour Response |
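The sizing figures in the turnkey options can be cross-checked with a little arithmetic. The sketch below is illustrative only: the class and constant names are hypothetical, and the numbers are copied from the options listed above (two 10-core processors and 128 GB of RAM per compute node). It recomputes the total cores and memory for each option and divides the quoted weekly workload by the number of compute nodes.

```python
from dataclasses import dataclass

# Per-node figures taken from the turnkey options listed above.
SOCKETS_PER_NODE = 2      # "2 Intel Xeon 10C Processor" per compute node
CORES_PER_SOCKET = 10     # 10-core E5-2680v2
RAM_GB_PER_NODE = 128     # "128 GB DDR3" per compute node


@dataclass(frozen=True)
class Solution:
    """Hypothetical container for one turnkey option."""
    name: str
    compute_nodes: int
    genomes_per_week: int   # 37x human genomes
    exomes_per_week: int    # 150x human exomes

    @property
    def total_cores(self) -> int:
        return self.compute_nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET

    @property
    def total_ram_gb(self) -> int:
        return self.compute_nodes * RAM_GB_PER_NODE


SOLUTIONS = [
    Solution("Small", 3, 15, 120),
    Solution("Medium", 6, 30, 240),
    Solution("Large", 12, 60, 480),
]

for s in SOLUTIONS:
    print(
        f"{s.name}: {s.total_cores} cores, {s.total_ram_gb} GB RAM, "
        f"{s.genomes_per_week / s.compute_nodes:g} genomes or "
        f"{s.exomes_per_week / s.compute_nodes:g} exomes per node per week"
    )
```

The recomputed totals match the quoted core and RAM figures, and the per-node workload is the same for all three options, which suggests the quoted weekly capacity is sized to scale linearly with the number of compute nodes.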
Benchmark run times (HH:MM:SS):
| Workflow stage | 37x Coverage WGS | 150x Coverage WEX |
| --- | --- | --- |
| Mapping | 5:56:04 | 0:42:05 |
| Variant calling | 16:33:33 | 1:27:21 |
| Filtering | 0:19:32 | 0:14:31 |
Figure 1. NGS workflow benchmark performance of 37x coverage whole human genome
reads and 150x coverage whole human exome reads on a single IBM compute node.
The workflow includes read mapping, variant calling, and filtering of variants
against a known database (common SNPs/INDELs database).
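To read the Figure 1 timings as an end-to-end run time, the three stage durations can be summed per data set. The following is a minimal sketch with hypothetical helper names. It assumes the stages run one after another on the single compute node, which the source does not state explicitly.

```python
# Stage durations (H:MM:SS) as reported for Figure 1.
STAGE_TIMES = {
    "37x WGS": {
        "mapping": "5:56:04",
        "variant calling": "16:33:33",
        "filtering": "0:19:32",
    },
    "150x WEX": {
        "mapping": "0:42:05",
        "variant calling": "1:27:21",
        "filtering": "0:14:31",
    },
}


def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total: int) -> str:
    """Convert a number of seconds back to 'H:MM:SS'."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def total_seconds(stages: dict) -> int:
    """Sum the durations of all stages, assuming sequential execution."""
    return sum(to_seconds(t) for t in stages.values())


for dataset, stages in STAGE_TIMES.items():
    total = total_seconds(stages)
    slowest = max(stages, key=lambda name: to_seconds(stages[name]))
    share = to_seconds(stages[slowest]) / total
    print(f"{dataset}: total {to_hms(total)}, "
          f"slowest stage: {slowest} ({share:.0%} of run time)")
```

Under that assumption the whole-genome workflow totals 22:49:09 and the exome workflow 2:23:57, with variant calling the dominant stage in both cases.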
PROVIDING A FOUNDATION FOR FULL-GENOME ANALYSIS
In the future, a person’s entire genome sequence will become part of his or her
electronic medical records. A full individual genome can be compared to a
reference human genome, a process of assembly, mapping and analysis that
previously could take weeks or months. But benchmarking shows that the
performance of IBM Application Ready Solution for CLC bio integrated with CLC
Genomics Server enables researchers to obtain this critical information in a
matter of days, even hours. The solution provides a scalable, flexible,
high-performance platform that helps accelerate genomic research and leads to a
deep understanding of the associations between genetic variations and diseases,
and of potential cures.