Recorded: 31 May 2003
It’s funny sometimes; people say I assembled the human genome. I was one of three people that assembled the human genome.
So the first person, who in some ways, I think, had the hardest job, was Cari Soderlund, who assembled the genome from fingerprint bands, in a way. Her program was, let’s see, what is it called? FPC. I think she wrote it primarily when she was at Sanger. And now she’s at the University of Arizona, I believe. She’s working on the final frontiers of genomes, because the human genome was hard because it has a lot of repeats, but it’s easy compared to the plant genomes. That’s where she’s going; she’s a real sequencer.
And so, she had written the software to take data from… I don’t know, if I had to explain all the technical details we would be here for about two hours. But we grow things up in bacterial artificial chromosomes, which are called BACs, and I hear today they have human artificial chromosomes, which obviously would be HACs, but I don’t know. I guess eventually there will be genome hacking (HACing), but anyway, that’s not for today, thank goodness. But anyway, these BAC clones are about a hundred thousand [or] two hundred thousand bases each. And you can digest them with restriction enzymes and build a map of the relative order of those, but it’s kind of tricky. The data we have is fragmentary in more than one way. But it actually is quite helpful, and it was key to the project in two ways.

It helped to figure out which of these… I guess they had taken the whole human genome and shredded it up into pieces, these BACs, about two hundred thousand [bases] long, and it was completely random. One problem was to figure out which of these BACs to sequence. If you sequenced them all, you’d be sequencing the genome ten times, and that would be prohibitively expensive. So they had the problem of which ones to pick. And these things are growing up in real cells. I mean, at this point it’s not digital, perfect information, and the cells will recombine and delete pieces of the BACs. And so the fingerprint map, which Cari did the software for and which Bob Waterston and his group, the pathfinding group at Washington University, really applied, served two purposes. One was to let them know which BACs they could pick so that you would only sequence the genome… nobody would be perfect, so we wouldn’t sequence it just once, but hopefully we’d sequence it one and a quarter times instead of ten times.
So to pick things from that, and then also the fingerprint map was useful in detecting BACs that had basically kind of been bent around by the bacteria they were growing in. And you could detect when a BAC had undergone rearrangement. And a substantial number did, something like twenty percent of them. And you didn’t want to sequence those… in the fingerprint which would sequence that out.
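The clone-picking problem described above can be illustrated with a toy example. The following is a minimal sketch, not the method the pathfinding group actually used: it assumes each BAC has already been placed on the map as a hypothetical `(name, start, end)` interval (coordinates in kilobases, invented for illustration) and uses a standard greedy interval cover to choose a small set of overlapping clones. The function names `minimal_tiling_path` and `redundancy` are made up for this sketch.

```python
from typing import List, Tuple

# Hypothetical representation: (clone name, map start, map end), in kilobases.
Clone = Tuple[str, int, int]


def minimal_tiling_path(clones: List[Clone], region_start: int,
                        region_end: int) -> List[Clone]:
    """Greedy interval cover: among the clones that start at or before the
    point covered so far, repeatedly take the one that reaches farthest."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Consume every clone that overlaps (or abuts) the covered stretch.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path


def redundancy(path: List[Clone], region_start: int, region_end: int) -> float:
    """Total clone length selected, divided by the length of the region."""
    total = sum(end - start for _, start, end in path)
    return total / (region_end - region_start)


# Invented example: seven overlapping clones over a 500 kb region.
example_clones = [
    ("A", 0, 150), ("B", 20, 170), ("C", 100, 260), ("D", 140, 300),
    ("E", 250, 400), ("F", 280, 420), ("G", 390, 500),
]
example_path = minimal_tiling_path(example_clones, 0, 500)
print([name for name, _, _ in example_path])
print(redundancy(example_path, 0, 500))
```

In this invented data set, sequencing all seven clones would cover the region more than twice over, while the selected path keeps the redundancy close to one. A real pipeline would also have to drop clones flagged as rearranged before choosing a path, which this sketch does not attempt.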
So then I guess the overall stage of the project is that you’d take your BAC from the map, and then you’d sequence the BAC using shotgun methods just inside that one BAC, and then that gives you an assembly problem, too, which is relatively formidable, and which Phil Green was the main person to solve. And that program is called PHRAP, and it works very closely with a program called PHRED that he wrote. The two really worked very well together. People were all the time saying, oh, PHRAP isn’t so good because… and they just didn’t know how to run it, because PHRAP not only gives you what the bases are themselves, but it gives you a relative probability of these bases being correct from the experimental data. And that was really key. I guess yesterday Phil was talking about what he was doing, and he was saying it wasn’t rocket science, just a simple idea, but it was really key to getting it to work.
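The idea of attaching a probability of correctness to each base can be made concrete with the Phred quality scale, in which a quality value Q corresponds to an estimated error probability of 10^(-Q/10). The sketch below is an illustration under stated assumptions, not PHRAP's actual algorithm: the function `consensus_base` is a hypothetical helper that picks, for one aligned position, the base with the highest summed quality, to show how quality values can overrule a simple majority vote.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def phred_to_error_prob(q: float) -> float:
    """Phred scale: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse of phred_to_error_prob."""
    return -10 * math.log10(p)


def consensus_base(calls: List[Tuple[str, int]]) -> str:
    """Hypothetical helper: given (base, quality) calls from several reads at
    one aligned position, return the base with the highest summed quality."""
    totals: Dict[str, int] = defaultdict(int)
    for base, quality in calls:
        totals[base] += quality
    return max(totals, key=totals.get)


# Invented example: two low-quality reads say A, one high-quality read says G.
column = [("A", 10), ("A", 12), ("G", 40)]
print(phred_to_error_prob(20))   # quality 20 means roughly a 1 in 100 error chance
print(consensus_base(column))    # the single confident G outweighs the two weak A calls
```

A plain majority vote would call A at this position; weighting by quality calls G instead, which is the kind of difference that per-base probabilities make possible.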
So Cari, Phil, me, and then actually there is a fourth. There’s Richa Agarwala at NCBI. And, as I said a while back, there were actually initially two groups working on the assembly, one at Sanger and one at NCBI, and then at Santa Cruz, then we ended up with two ourselves. I was sort of the second one. And Richa programmed; it took her longer to develop it. But eventually she took over the assembly. People were working off my high-level assembly, which is again built on top of Cari’s and Phil’s. They were working with mine for probably about a year and a half, and then we switched over to Richa’s.

And part of it was that this was a continuously evolving problem, and this was one of the things that made it challenging. Well, first of all there was this sort of stone soup data aspect, where people would say, oh, could you use this little extra bit of data, and of course we could, but then it was more programs, more code to write to actually knit it together. But then also as the project went along, in a certain sense it got easier, because the sequence you started with was higher quality and had longer contigs, but it also changed because it got longer, it got bigger, and so you had to deal with bigger chunks all at once. And that was, I think it was about… I don’t know. There was at some point… most of the time I was sort of able to assemble things faster than Richa, which was one of the reasons that they used ours. And I stuck with the map a little bit more. I was embracing new data, and she was trying to do everything mostly from just the sequence itself and RNAs.
But the sequence itself, as I say, was getting better and better, and so her approach was getting better and better. And there was at one point, I guess, it was really… I mean, there was some competition in the group. I didn’t feel it so much, but I guess NCBI felt they were sort of a national center and they should be responsible for it. And David was very proud of our work. And I always kind of thought that I’m at an obscure state university. I have no business being in charge of the human genome. I don’t know, but I guess I was for about a year, in terms of sort of a central point of collecting all the data. But eventually the NCBI program… I guess, to make short of the story, after about a year and a half it got to the point where it was doing as good a job as ours. And they wanted it really badly, and I was sort of always… I was like the little boy with his finger in the dike, as far as I was concerned with this. It wasn’t what I planned to do. It needed doing, and they got it so that they were happy; give it to them.
Jim Kent is a research scientist at the University of California, Santa Cruz's Center for Biomolecular Science and Engineering. After a stint working in the computer animation industry, he entered the Molecular, Cell, and Developmental Biology Ph.D. program at Santa Cruz. While completing his degree, he became increasingly interested in bioinformatics. Concurrently, the human genome was being sequenced and was accumulating in the databases, and it was scheduled to be released in one month’s time; however, no technology was yet in place to assemble its many sequences. In one month, Jim Kent created a computer program called the GigAssembler and computationally compiled, for the first time, the entire human genome so that it could be released to the public by its intended deadline.
Jim Kent focuses on understanding the way in which genes are turned on and off to create varying outcomes.