Recorded: 08 May 2012
It became clear that the assembly problem itself was going to be a major problem.
I went to visit the National Center for Biotechnology Information, where one of the groups working on the assembly was based, and looked at their progress, and it was clear that they were having difficulty. I met Richa Agarwala, who was the lead on that, and she told me all about the issues that were coming up with this. I was also aware there was a team at EBI that was working on this, and it was our problem in the sense that we could not do the analysis of the gene sequences in the DNA sequences until there was some assembly. At this point, the pieces of DNA were far too small to hold an entire gene in one contiguous segment of sequence.
Ok, in the first few months of the year 2000, we worked very hard to prepare the software to analyze the genes but, at the same time, started thinking about the assembly problem. I actually brought in a postdoctoral student whom I had worked with before, who I knew to be very, very clever, Nick Littlestone, and he started using a method of [unintelligible] linear programming, a very complex program, to try to solve some of the difficult optimization problems that were encountered in assembling fragments of DNA. At that time, the team consisted of Nick working on this, Jim Kent working on mapping all of the EST information and all of the information about the genetic maps and radiation hybrid maps and so forth that were being produced, and David Kulp working on the gene-finding software that he had already designed and that had actually been successfully applied with Celera to the fly genome. So this program Genie… but it’s clear that the data, in order to come together, required the assembly. And I remember the key meetings at Cold Spring Harbor; actually, when we were talking about how we would denote the assembly, we invented the term ‘the golden path’ to describe the single path through all the fragments as we overlapped them and ordered and oriented them to make whole chromosomes.
And at that point, it was not clear that Santa Cruz would still have a role in the project, because we were supposed to analyze the genes and the major issue had become the assembly. So, I remember talking with Eric Lander and showing him some of the work that Nick Littlestone had done on the assembly and saying ‘Well, maybe there’s a different way of doing this’, but in fact the assembly method Nick had been working on was far too inefficient to actually do the entire genome, and the clock was ticking at that point. So, at the most critical time, Jim Kent jumped in and threw his hat in the ring, and I remember getting an email from him saying: ‘I think I have the requisite pieces from all of the different sources of information’ – there were actually thirteen sources of information that he now had computer analysis of, including the RNA sequences, the EST sequences, all of the raw fragments and so forth – and he said ‘I think I can use a greedy method to piece these pieces together, and no one is coming forth with an assembly that’s going to be competitive with what we think Celera is producing at this point, and so we better go ahead.’ And I wrote him back a message that said essentially: ‘Godspeed – go for it.’ And of course he did, and produced a really amazing, amazing piece of work. It’s true, I remember going to his house where he was working, and he literally had to ice his wrists because he was coding so fast and so continuously, day and night, during that period of time. He literally produced tens of thousands of lines of code during a four-week period in the April and May timeframe of the year 2000.
You know, you would think – it was a project that you could imagine assigning to a professional team of software engineers, maybe ten people. And I would have estimated it would have taken two years probably to get done, but with only a month, the only possible way is to take one computer genius and have them do it all by themselves. The complexity of the task is such that the communication would slow down the project too much if you had more than one programmer on it. So you had to have one person who could hold all of the information in their head at once and had this incredible talent to piece things together, and really, for the project it was unbelievable that Jim came along at the right time.
David Haussler (born 1953) is an American bioinformatician known for his work leading the team that assembled the first human genome sequence in the race to complete the Human Genome Project and subsequently for comparative genome analysis that deepens understanding of the molecular function and evolution of the genome. He is a Howard Hughes Medical Institute Investigator, professor of biomolecular engineering and director of the Center for Biomolecular Science and Engineering at the University of California, Santa Cruz, director of the California Institute for Quantitative Biosciences (QB3) on the UC Santa Cruz campus, and a consulting professor at Stanford University School of Medicine and UC San Francisco Biopharmaceutical Sciences Department.