Recorded: 08 May 2012
Well, one of the immediate tasks now in cancer genomics is to gather enough data so that we can detect the complex molecular patterns that distinguish all the different subtypes and sub-subtypes of cancer. It turns out that the current classification of cancer, based upon tissue of origin, is wholly inadequate for understanding cancer; the true understanding will come from classifying the molecular changes that are occurring through mutational processes and possibly some epigenetic processes. That true classification of cancer will divide it into not one, not hundreds, but maybe thousands of different conditions. Now, you will not understand that level of complexity from just a few genomes; in fact, no single medical center could ever hope to collect enough patient data to start tackling these major cancers. As soon as you start looking at a sub-subtype of cancer, you cannot just depend on the special patients who happen to walk through your door. We have to work together, but the system is unfortunately set up right now in ways that favor data siloing and data hoarding. And that’s partially because, of course, we have to protect the data: there are legitimate rules on the books to ensure data privacy, and the simplest way to ensure that is to never let the data leave your medical campus. The other thing that’s happening is that these data are huge. Every cancer genome we get is three hundred gigabytes, three hundred billion bytes.
So the Cancer Genome Atlas will study twenty adult cancers with five hundred cases from each cancer, that’s a total of ten thousand cancer cases. For each case, they’re going to get a genome of the cancer from the biopsy of the tumor, and they’re going to get a germ-line genome from other tissue, so that we can understand what the normal genome of the patient looks like and what the patient’s cancer genome looks like; then we can understand what mutations specifically happened in the cancer cells and not in the normal cells. They’re also doing RNA-seq, which is a way of looking at the genes that are expressed. In cancer you can have utterly abnormal transcripts that are fusions of two different genes, so we want to be able to read all of the unusual genes that are being expressed, plus we are looking at methylation patterns and other data as well. So there is actually an enormous amount of data from each of these ten thousand individuals, potentially up to a terabyte, a trillion bytes, of data for each individual. That would be if you sequenced the whole genome at sixty-fold coverage on the tumor and thirty-fold coverage on the normal, did a full RNA-seq, and so forth; you would get up to about a trillion bytes of data. This is big-physics-scale data; the Large Hadron Collider and other projects of that type are already working at this data scale. Biology is not used to dealing with big data of this type.
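The per-case figure above can be checked with a back-of-envelope calculation. This is an illustrative sketch, not TCGA's actual accounting: the genome size, the two-bytes-per-base FASTQ cost, and the RNA-seq allowance are all assumptions chosen to show how coverage depth drives the total toward a terabyte.

```python
# Back-of-envelope estimate of raw data volume per cancer case.
# All constants are illustrative assumptions, not official TCGA figures.

GENOME_SIZE = 3.2e9     # approximate haploid human genome size, in base pairs
BYTES_PER_BASE = 2      # rough FASTQ cost: one base call plus one quality score

def wgs_bytes(coverage):
    """Raw sequence bytes for whole-genome sequencing at a given fold coverage."""
    return GENOME_SIZE * coverage * BYTES_PER_BASE

tumor = wgs_bytes(60)   # 60x coverage on the tumor genome
normal = wgs_bytes(30)  # 30x coverage on the matched normal genome
other = 5e10            # assumed allowance for RNA-seq, methylation, etc.

total = tumor + normal + other
print(f"~{total / 1e12:.2f} TB per case")
print(f"~{10_000 * total / 1e15:.1f} PB for 10,000 cases")
```

Under these assumptions the sequencing alone comes to several hundred gigabytes per case, and the full assay set approaches the trillion bytes quoted in the transcript; multiplied across ten thousand cases, the project sits firmly in petabyte territory.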
Computational biology is brand new, right out of the chute, and you already are being faced with these enormous problems. But luckily there are other fields that have approached these big data problems, so there’s a surge of interest now in genomics as a big data problem. One of the things we’ve done is create a database called the Cancer Genomics Hub, which we announced just a few days ago; it stores the large files that are created by genome sequencing for the National Cancer Institute’s genome sequencing projects. The flagship project is the Cancer Genome Atlas project, with ten thousand cases. There’s also a childhood cancer project called TARGET, sequencing thousands of cases from the five major childhood cancers, and then there’s another cancer genome project called CGCI, which is sequencing other types of cancers, including cancers from people who have HIV/AIDS. Together those may reach in excess of ten thousand cases, which I argue is the absolute minimum.
So, in May 2012, just a few days ago, we announced the Cancer Genomics Hub. And I’m very passionate about this because it represents an opportunity to collect a large amount of cancer genomics data in one database, rather than having it siloed off in all of the different medical centers. This will give us the statistical power that we need to find the complex, subtle patterns in cancer. You absolutely will not separate the signal from the noise, you won’t find those subtle repeating patterns that distinguish all of the different subtypes of cancer, unless you can gather together thousands of different cancer genomes and look at them in an integrated analysis. So only large-database engineering support of the type that we hope to get from the Cancer Genomics Hub will make that possible.
David Haussler (born 1953) is an American bioinformatician known for leading the team that assembled the first human genome sequence in the race to complete the Human Genome Project, and subsequently for comparative genome analysis that deepens understanding of the molecular function and evolution of the genome. He is a Howard Hughes Medical Institute Investigator, professor of biomolecular engineering and director of the Center for Biomolecular Science and Engineering at the University of California, Santa Cruz, director of the California Institute for Quantitative Biosciences (QB3) on the UC Santa Cruz campus, and a consulting professor at the Stanford University School of Medicine and the UC San Francisco Department of Biopharmaceutical Sciences.