James Kent on Involvement in Genomics: Developing Assembly Software
Recorded: 31 May 2003

Yeah, so the lab I was in was working on alternative splicing, which is an interesting phenomenon where you get multiple products out of a single gene. And the raw material for that really was that we had all this mRNA sequence in the database and we had all the genome sequence in the database, and really, to study the alternative splicing, you just needed to put them together and see where they lined up.

And there were some programs available that came out with the C. elegans genome, but they weren't really designed with alternative splicing in mind. They would sort of put all of the transcripts right on top of each other. What you look for in splicing is that you line up the RNA and you'll have two exons here and here in one form, and then you'll have another one here, here, and here in another form. But those programs would just put the two lines right on top of each other, so you couldn't see if there was alternative splicing. And so my first project was just to kind of unravel that, so we could see from the data if it was there. And that led to the first worm browser that I did.
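The unraveling is essentially a layout problem. Here is a minimal sketch of the idea in Python (an illustration, not Kent's browser code): whenever an aligned transcript overlaps one already drawn, push it to a new display row, so alternative isoforms sit side by side instead of stacked.

```python
def assign_rows(transcripts):
    """Assign each (start, end) transcript span to a display row.
    Returns row indices in order of sorted start position; overlapping
    spans land on different rows so both isoforms stay visible."""
    row_ends = []  # rightmost coordinate drawn so far in each row
    rows = []
    for start, end in sorted(transcripts):
        for i, row_end in enumerate(row_ends):
            if start > row_end:          # fits after everything in row i
                row_ends[i] = end
                rows.append(i)
                break
        else:                            # overlaps every existing row
            row_ends.append(end)
            rows.append(len(row_ends) - 1)
    return rows

# Two isoforms sharing the same locus overlap, so they get separate rows;
# a transcript further downstream reuses row 0.
print(assign_rows([(100, 500), (120, 480), (600, 900)]))  # [0, 1, 0]
```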

And then I wasn't quite satisfied with the way the mRNA:DNA alignments were done. So my next phase was to write my own program to do them just a little bit tighter, I thought. And it turned out that that is a similar problem to what was actually needed for the genome assembly, because when you're aligning RNA to DNA it's almost an exact match. And in the human assembly, the second step of the hierarchical assembly that was used for the draft also needed to align nearly exact matches very quickly.

And so David Haussler had gotten called in to do gene predictions on it, because his group had been involved with doing the gene predictions on the fly: the sort of first-tier gene predictor on Drosophila melanogaster was Genie, which was something that had come out of David's lab. And so I guess Eric Lander called him up to do human gene predictions. At that point David and I had been working together, I guess, for about four or five months. And it had always been really good, because we filled in each other's gaps just wonderfully. What I was not so good at, he was great at; what he was not so good at, I was great at. We liked each other, so it was great.

But, anyway, so he got called in and I guess realized that this working draft was going to be fragmented: the average piece of the genome was only going to be about maybe 10 kb, and the average human gene was closer to 50 kb. So how were we going to do gene predictions on this? We could do maybe gene-fragment predictions, but we couldn't do gene predictions. So it was very clear to him that some assembly was going to be necessary. And at the time there were two groups who seemed to be doing assembly: there was one at NCBI, and then there was one at Sanger.

From my point of view, I guess I was thinking, oh, you know, these are good people, so I'll keep working on my mRNA alignment, because that was already a hard problem. I'd written these programs to align every worm RNA against the worm genome, and it took about twelve days to run. At this point I had been in the biology lab long enough that twelve days didn't seem like so long. That's okay. I mean, computer scientists get very impatient, but next to wet-lab work, you know, twelve days is practically instant. That was all right. But then the problem was that the human genome was thirty times bigger and it had ten times more ESTs, so that twelve days would have been twelve years if I had used the same technique. So I thought, well, I need to do something about this, and I was busy writing stuff to make that go fast enough. I was pretty busy with that, I guess, for the first months of the year 2000.
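The speedup comes from indexing. As a rough Python illustration of that general idea (an assumed sketch of k-mer seeding, not Kent's actual code, with hypothetical function names): build a table of every k-mer position in the genome once, and then each RNA costs only a handful of table lookups instead of a full scan.

```python
from collections import defaultdict

K = 12  # seed length; longer seeds give fewer, more specific hits

def build_index(genome):
    """Map every K-mer in the genome to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(genome) - K + 1):
        index[genome[i:i + K]].append(i)
    return index

def seed_hits(index, rna):
    """Look up each K-mer of the RNA; return (rna_offset, genome_position)
    seed matches that a nearly-exact aligner would then extend and chain."""
    hits = []
    for j in range(len(rna) - K + 1):
        for pos in index.get(rna[j:j + K], ()):
            hits.append((j, pos))
    return hits

# Build the index once, then every RNA query is cheap.
genome = "ACGTACGTTTGACCAGGATTACAGATTACA" * 10
index = build_index(genome)
print(seed_hits(index, "GACCAGGATTACAG")[:3])  # [(0, 10), (0, 40), (0, 70)]
```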

And then David was worried about the assembly, and so he hired one of his old students to be sort of a third group working on it. I was initially just doing part of it, kind of setting things up for him, because I already had this piece, the piece that would do the alignments, but that's just part of the puzzle. It would have been enough if there were no repeats in the genome, if it was sort of more like prose and less like, I don't know, a song. The example I give when I'm explaining assembly to people is that it's sort of like you take a book and you put it through a shredder and then you try to paste it back together. And that's actually not such a big problem. But if you took a song like “Mary Had a Little Lamb” and put it through the shredder and tried to put it together, it would be much harder, because you'd have this “had a little lamb” and that “had a little lamb” and the other “had a little lamb”, and where does each “had a little lamb” go? And that is what the human genome was like. So I hadn't dealt with that at all. I just had the stuff that would say this “had a little lamb” goes with that “had a little lamb”, but not which verse. But that was needed; it was the first step in assembly, and so I had sort of built the directories as input for the other projects.
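The repeat problem is easy to see in miniature. Here is a toy Python illustration (mine, not anything from the actual assembler): shred a repeat-free sentence and the song into overlapping fragments, then count the joins where more than one successor fragment fits the overlap.

```python
def shred(words, n=3):
    """All overlapping n-word fragments, like pieces off a shredder."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ambiguous_joins(fragments):
    """Count fragments whose overlap matches more than one distinct
    successor: these are the joins assembly cannot resolve locally."""
    count = 0
    for f in set(fragments):
        nexts = {g for g in fragments if g[:2] == f[1:]}
        if len(nexts) > 1:
            count += 1
    return count

prose = "the quick brown fox jumps over a lazy dog near a cool river".split()
song = ("mary had a little lamb little lamb little lamb "
        "mary had a little lamb its fleece was white as snow").split()

print(ambiguous_joins(shred(prose)))  # 0: every join is unique
print(ambiguous_joins(shred(song)))   # 2: where does each "little lamb" go?
```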

And I guess then I went off and did my oral exams in May. At that time it was getting real close and nobody was ready. I guess from my point of view they were all trying to do something really fancy. And that's good, to do really fancy stuff, when you have time; but we wanted to do gene predictions, and at that point it was looking like even a very simple assembly, not totally state of the art, but just one that got it 85 percent right, would be a big improvement over having little 10 kb chunks. And so I decided that I would just write something quick and simple that would do it. And it did do it. It was pretty quick to come up, though it was harder than I thought at first. It took about a week from when I started to when the very first thing worked at all, and for me that's actually a long time. Usually when I write a program, the very first skeletal thing takes about a day or two, because I like to build it so that you've got the skeleton first, and then you kind of layer stuff on top of it. It's much easier to test a program if you always have a little something working, and then you add a little bit more to it, and a little bit more. So I always try to get the first thing kind of working, or close to it, very quickly. This one took longer than I thought.

And then we just kept adding stuff to it. I mean, I was working on it for a year. It started off using two inputs: the genome sequence itself and RNA, because one of our hopes was that we already had a fair bit of RNA in the database. And one of the reasons why we were sequencing the genome, rather than just the ESTs and the RNA, is because of the regulatory information in the genome. A gene has two parts, really: it has a part that says what the gene makes, and it has another part, the promoter or the regulatory elements, that says where it should make it. In your body, that's why your blood is red and the lens of your eye is clear, and you wouldn't want to mix those up. And it's fascinating how exactly this works: I guess now we're thinking there are about twenty-five thousand genes, and you've got maybe two hundred and fifty types of cells, all intricately arranged, and the switching mechanism has a lot to do with that.

Anyway, just tying little fragments of genome to the exons that we already had in the RNA libraries was already a big advance, because then you could basically see the regulatory regions associated with the genes. And that was something that I was very fascinated with. I'm kind of going off on tangents all over the place, but that's why so many of us were interested in it: why we wanted to sequence the genome, which is so vast, as opposed to just the coding regions, which are relatively small, maybe one or two percent of it.

So that was good. We had two inputs at first, the RNA and the genome. And then it was just sort of like stone soup; people kept on saying, we have this little bit of data that would be useful, and I'd say, oh, well, I'll see if I can add that to the program, and I did. For example, a lot of times with today's sequencing technology they can take two reads off of a plasmid, so basically you get two little chunks of sequence, each about five hundred bases, separated by a gap you don't know exactly, but you know it's roughly a thousand bases, or maybe two thousand. And when we were trying to put these fragments in order, having a paired read was invaluable; indeed, paired reads were the basis of Celera's assembly.
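The geometry behind that is simple. As a minimal sketch in Python (an illustration of the mate-pair idea, not GigAssembler code; the function name is hypothetical): if one read of a pair lands in a fragment that is already placed, the approximate insert size pins down roughly where the fragment holding its mate must start.

```python
def place_mate(frag_a_start, read_a_pos, insert_size, read_b_offset_in_b):
    """Estimate where fragment B starts on the genome, given that read A
    sits read_a_pos bases into fragment A (placed at frag_a_start) and
    its mate sits read_b_offset_in_b bases into fragment B."""
    read_a_genome = frag_a_start + read_a_pos
    mate_genome = read_a_genome + insert_size   # insert size is only approximate
    return mate_genome - read_b_offset_in_b

# Fragment A is placed at 10,000; read A sits 300 bases into it; the
# insert is roughly 2,000 bases; the mate sits 100 bases into fragment B.
print(place_mate(10_000, 300, 2_000, 100))  # B starts near 12,200
```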

And anyway, there was some of that information here and there. And there was a project for the single nucleotide polymorphisms where they had done their reads; the main purpose of the project was to figure out places where humans differ from each other, but it involved sequencing little random bits, and they had done this in pairs. So we had a bunch of pairing information. That wasn't in the first public assembly, but I think it was in the second one, and it made a big difference. At that point it was probably about maybe eighty percent ordered; it wasn't everything, but it was a big step forward.

At that point it was useful enough that you could actually start doing things with it. With the very first assemblies, basically what was added was regulatory information for known genes, and then you could sort of go fishing for exons. But with the second one, a lot of times you could actually start predicting some genes, genes that you didn't know, because it was relatively well ordered. And then we ended up having a very close collaboration with the University of Washington, who were doing the map.

Jim Kent is a research scientist at the University of California, Santa Cruz's Center for Biomolecular Science and Engineering. After a stint working in the computer animation industry, he entered the Molecular, Cell, and Developmental Biology Ph.D. program at Santa Cruz. While completing his degree, he became increasingly interested in bioinformatics. Concurrently, the human genome was being sequenced and accumulating in the databases, and the draft was scheduled to be released in one month's time; however, no software was yet in place to assemble its many sequences. In one month, Jim Kent created a computer program called GigAssembler and computationally assembled the entire human genome for the first time, so that it could be released to the public at its intended deadline.

Jim Kent focuses on understanding the way in which genes are turned on and off to create varying outcomes.