1000 Genomes Project
The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. To find the complete genomic sequence of one person with current sequencing platforms requires sequencing that person's DNA the equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.
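The relationship between average sequencing depth and the fraction of the genome covered can be illustrated with a simple Poisson model, under the idealized assumption that reads land uniformly at random (real coverage is less uniform, so this is only an illustration):

```python
import math

def fraction_covered(depth):
    """Expected fraction of the genome covered by at least one read,
    assuming reads are placed uniformly at random (Poisson model)."""
    return 1 - math.exp(-depth)

# Coverage rises quickly with depth: at 1X roughly a third of the
# genome is still missed, while 28X leaves essentially no gaps.
for depth in (1, 4, 28):
    print(f"{depth}X: {fraction_covered(depth):.4f} of genome covered")
```

At 1X this gives about 63% coverage, matching the text's point that much of the sequence is missed at low depth.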
AceDB is a genome database system developed since 1989 primarily by Jean Thierry-Mieg (CNRS, Montpellier) and Richard Durbin (Sanger Institute). It provides a custom database kernel, with a non-standard data model designed specifically for handling scientific data flexibly, and a graphical user interface with many specific displays and tools for genomic data. AceDB is used both for managing data within genome projects, and for making genomic data available to the wider scientific community.
AceDB was originally developed for the C. elegans genome project, from which its name was derived: A C. elegans DataBase. However, its tools have been generalized to be much more flexible, and the same software is now used for many different genomic databases, from bacteria to fungi to plants to man. It is also increasingly used for databases with non-biological content.
ACM Digital Library
The ACM Digital Library (DL) is the most comprehensive collection of full-text articles and bibliographic records in existence today covering the fields of computing and information technology. The full-text database includes the complete collection of ACM's publications, including journals, conference proceedings, magazines, newsletters, and multimedia titles and currently consists of:
- 407,367 Full-text articles
- 2.0+ Million Pages of full-text articles
- 18,000+ New full-text articles added each year
- 44+ High Impact Journals with 2-3 new journals being launched each year
- 275+ Conference Proceedings Titles added each year
- 2,000+ Proceedings Volumes
- 8 Magazines (including the flagship Communications of the ACM, the most heavily cited publication in the field of computing according to Thomson-Reuters)
- 37 Technical Newsletters from ACM's Special Interest Groups (SIGs)
- 6,500+ Video files
- 594 Audio files
In addition to the full-text database, the ACM Digital Library is tightly integrated with, and includes unrestricted access to, the Guide to Computing Literature bibliography.
The ACM Digital Library includes reference linking through CrossRef, integration with the ACM Computing Reviews database, index terms using ACM's 2012 Computing Classification Scheme (CCS), alerting and TOC services, and export formats including BibTeX, EndNote, and ACM Ref, as well as OpenURL compliance and COUNTER III- and SUSHI-compliant usage statistics.
Allen Brain Atlas
The Allen Brain Atlas resources are a growing collection of online public resources integrating extensive gene expression and neuroanatomical data, complete with a novel suite of search and viewing tools. This portal gives you access to each of these resources by clicking on the button for a particular project or by clicking the project from the banner tab or drop-down menu.
ARTS: Accurate Recognition of Transcription Starts in Human
(now at cBio@mskcc)
arXiv.org
Started in August 1991, arXiv.org (formerly xxx.lanl.gov) is a highly automated electronic archive and distribution server for research articles. Covered areas include physics, mathematics, computer science, nonlinear sciences, quantitative biology, and statistics. arXiv is maintained and operated by the Cornell University Library with guidance from the arXiv Scientific Advisory Board and the arXiv Member Advisory Board, and with the help of numerous subject moderators.
BAMS - an online resource for information about neural circuitry.
This rapidly expanding set of inference engines currently has 5 interrelated modules: Brain Parts (gray matter regions, major fiber tracts, and ventricles), Cell Types, Molecules, Connections (between regions and cell types), and Relations (between parts identified in different neuroanatomical atlases).
Berkeley Drosophila Genome Project (BDGP) at Lawrence Berkeley National Laboratory
The Berkeley Drosophila Genome Project (BDGP) is a consortium of the Drosophila Genome Center, funded by the National Human Genome Research Institute, National Cancer Institute, and Howard Hughes Medical Institute, through its support of work in the Gerald Rubin, Allan Spradling, Roger Hoskins, Hugo Bellen, Susan Celniker, and Gary Karpen laboratories.
The goals of the Drosophila Genome Center are to finish the sequence of the euchromatic genome of Drosophila melanogaster to high quality and to generate and maintain biological annotations of this sequence. In addition to genomic sequencing, the BDGP is 1) producing gene disruptions using P element-mediated mutagenesis on a scale unprecedented in metazoans; 2) characterizing the sequence and expression of cDNAs; and 3) developing informatics tools that support the experimental process, identify features of DNA sequence, and allow us to present up-to-date information about the annotated sequence to the research community.
BIOBASE Knowledge Library (BKL) Proteome
C. elegans Gene Expression Consortium
The objective of this project is to define the RNA expression profiles in specific tissues and cells, and developmental stages of C. elegans. Two complementary approaches are being applied: serial analysis of gene expression (SAGE), and the construction of promoter::GFP fusions for in vivo analysis of gene expression.
SAGE is a sensitive and specific method for obtaining qualitative and quantitative information on expressed RNAs. SAGE will also allow us to identify non-protein-coding genes, and provide insight into alternatively spliced mRNA isoforms and their relative abundance between tissues.
We are examining total mRNA populations in all developmental stages, both in whole worms and in specific cells and tissues. We have generated 17 SAGE libraries, which include all developmental stages, mutation-specific populations, and specific tissues and cells, totalling approximately 1.8 million observed tags. Tissue- and cell-specific libraries were generated from FACS-sorted cells marked by expression of specific promoter::GFP fusions. To date, we have SAGE libraries for purified embryonic muscle, gut, and a subset of neurons.
Monitoring in vivo expression of the fusion constructs in transgenic worms allows determination of the developmental stage, tissue, and in some cases the cells in which a particular gene is expressed. Our goal is to build promoter::GFP fusion constructs for C. elegans genes that have human orthologues. Of the over 5000 genes that fall into this category, 2000 are being targeted by the C. elegans Gene Knockout Consortium. Fusion constructs are being created for the same set of genes, with a focus on genes expressed in muscle and nerve tissues. When coupled with SAGE and knockout data, this will provide valuable and more complete expression profiles for cells, tissues, and developmental stages.
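SAGE quantifies expression by counting short tag sequences. A toy sketch of turning raw tag observations into a normalized abundance measure (tags per million; the function name is hypothetical and not from the consortium's actual pipeline):

```python
from collections import Counter

def tag_counts_to_tpm(tags):
    """Convert a list of observed SAGE tags into tags-per-million,
    a simple normalized abundance measure (illustrative sketch)."""
    counts = Counter(tags)
    total = len(tags)
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}

# A tag seen twice among four observations maps to 500,000 TPM.
abundances = tag_counts_to_tpm(["GATTCC", "GATTCC", "ACGTAA", "TTGCAT"])
```

Normalizing to a fixed denominator is what makes abundances comparable between libraries of different sizes, e.g. between the 17 SAGE libraries described above.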
Cancer Gene Census
All cancers arise as a result of the acquisition of a series of fixed DNA sequence abnormalities (mutations), many of which ultimately confer a growth advantage upon the cells in which they have occurred. There is a vast amount of information available in the published scientific literature about these changes. COSMIC is designed to store and display somatic mutation information and related details, and contains information relating to human cancers.
Types of data
There are two types of data in COSMIC: expert manual curation data and systematic screen data. It is useful to understand the differences between these data types and to use them appropriately.
Expert curation data
- Manually input from peer reviewed publications by COSMIC expert curators
- Consists of comprehensive literature curation of selected Census genes at release, followed by subsequent updates (Cancer Gene Census)
- Includes additional data points relevant to each disease and publication
- Provides accurate frequency data as mutation negative samples are specified
- Also called non-systematic or targeted screen data
Genome-wide screen data
- Uploaded from publications reporting large scale genome screening data or imported from other databases such as TCGA and ICGC
- Provides unbiased molecular profiling of diseases while covering the whole genome
- Provides objective frequency data by interpreting non-mutant genes across each genome
- Facilitates finding novel driver genes in cancer
Cancer Genome Atlas
There are at least 200 forms of cancer, and many more subtypes. Each of these is caused by errors in DNA that cause cells to grow uncontrollably. Identifying the changes in each cancer’s complete set of DNA – its genome – and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.
The Cancer Genome Atlas (TCGA) began as a three-year pilot in 2006 with an investment of $50 million each from the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI). The TCGA pilot project confirmed that an atlas of changes could be created for specific cancer types. It also showed that a national network of research and technology teams working on distinct but related projects could pool the results of their efforts, create an economy of scale and develop an infrastructure for making the data publicly accessible. Importantly, it proved that making the data freely available would enable researchers anywhere around the world to make and validate important discoveries. The success of the pilot led the National Institutes of Health to commit major resources to TCGA to collect and characterize more than 20 additional tumor types.
Each cancer will undergo comprehensive genomic characterization and analysis. The comprehensive data that have been generated by TCGA’s network approach are freely available and widely used by the cancer community through the TCGA Data Portal and the Cancer Genomics Hub (CGHub).
The components of the TCGA Research Network are described below:
Biospecimen Core Resource (BCR) – Tissue samples are carefully cataloged, processed, checked for quality and stored, complete with important medical information about the patient.
Genome Characterization Centers (GCCs) – Several technologies will be used to analyze genomic changes involved in cancer. The genomic changes that are identified will be further studied by the Genome Sequencing Centers.
Genome Sequencing Centers (GSCs) – High-throughput Genome Sequencing Centers will identify the changes in DNA sequences that are associated with specific types of cancer.
Proteome Characterization Centers (PCCs) – The centers, a component of NCI’s Clinical Proteomic Tumor Analysis Consortium, will ascertain and analyze the total proteomic content of a subset of TCGA samples.
Data Coordinating Center (DCC) – The information that is generated by TCGA will be centrally managed at the DCC and entered into the TCGA Data Portal and Cancer Genomics Hub as it becomes available. Centralization of data facilitates data transfer between the network and the research community, and makes data analysis more efficient. The DCC manages the TCGA Data Portal.
Cancer Genomics Hub (CGHub) – Lower level sequence data will be deposited into a secure repository. This database stores cancer genome sequences and alignments.
Genome Data Analysis Centers (GDACs) – Immense amounts of data from array and second-generation sequencing technologies must be integrated across thousands of samples. These centers will provide novel informatics tools to the entire research community to facilitate broader use of TCGA data.
Cancer Genome Project
Cancer Literature in PubMed
The macaque macroconnectivity database CoCoMac was initiated and built up by Prof. Rolf Kötter, first at the C. & O. Vogt Institute of Brain Research at the Heinrich Heine University in Düsseldorf, later at the Donders Institute for Brain, Cognition and Behaviour at the Radboud University Nijmegen, and finally, for a brief period, at the Jülich Research Institute. While working on a major overhaul of the database and connectivity mapping engine, Rolf was diagnosed with a tumor, which took his life after a three-year battle with the disease. He passed away on June 9th, 2010.
The ongoing work on the database has been adopted by the German INCF node (G-Node) and the Computational and Systems Neuroscience group of the Jülich Research Institute. The new database engine features an extensive search wizard and interactive browser. It also powers the Scalable Brain Atlas visual connectivity tool. A web-based data entry system is under development. The contact person for the current developments is Dr. Rembrandt Bakker.
Cold Spring Harbor Mammalian Promoter Database
In the post-genome era, characterization of gene regulation networks has become an important part of genomic research. To succeed in such studies in any organism, a high-quality and comprehensive database of genes and their promoters, transcription factor binding sites, and other cis-regulatory elements is much desired if not a must.
The Cold Spring Harbor Laboratory mammalian promoter database (CSHLmpd) uses all known transcripts, integrated with predicted transcripts, to construct gene sets for the human, mouse, and rat genomes. For promoter information, we collected known promoters from multiple resources, together with predicted ones. These promoters were mapped to the genome and linked to their related genes. We also compared promoters of orthologous gene groups to detect sequence conservation in promoter regions.
We expect CSHLmpd to be helpful for research on gene regulation networks by providing guidance for experimental studies such as DNA microarrays and chromatin IP. It will also facilitate building a foundation upon which we expand our insights into the structure of the mouse genome through continued data collection, intelligent data analysis, and integration.
Copy Number Variation Project
Genetic diseases are caused by mutations in DNA sequences. The Copy Number Variation (CNV) Project investigates the impact on human health of CNVs - gains and losses of large chunks of DNA sequence consisting of between ten thousand and five million letters. We already know that many inherited genetic diseases result from structural mutations or CNVs; we also know that there are Copy Number Variants that protect against HIV infection and malaria. The contribution of CNV to the common, complex diseases, such as diabetes and heart disease, is currently less well understood.
More pages on the CNV project:
Database of Genomic Variants
A curated catalogue of human genomic structural variation
The objective of the Database of Genomic Variants is to provide a comprehensive summary of structural variation in the human genome. We define structural variation as genomic alterations involving segments of DNA larger than 50 bp. The database represents only structural variation identified in healthy control samples.
The Database of Genomic Variants provides a useful catalog of control data for studies aiming to correlate genomic variation with phenotypic data. The database is continuously updated with new data from peer reviewed research studies. We always welcome suggestions and comments regarding the database from the research community.
For data sets where the variation calls are reported at a sample by sample level, we merge calls with similar boundaries across the sample set. Only variants of the same type (i.e. CNVs, inversions) are merged, and gains and losses are merged separately. In addition, if several different platforms/approaches are used within the same study, these datasets are merged separately. Sample level calls that overlap by >= 70% are merged in this process.
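A minimal sketch of this merging rule, under the assumption that overlap is measured relative to the shorter of the two calls (the database's exact criterion may differ, and real pipelines also track variant type and gain/loss status):

```python
def overlap_fraction(a, b):
    """Fraction of the shorter interval covered by the overlap.
    Intervals are (start, end) tuples in genomic coordinates."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / min(a[1] - a[0], b[1] - b[0])

def merge_calls(calls, threshold=0.7):
    """Greedily merge same-type sample-level calls whose overlap
    meets the threshold, keeping the union of their boundaries."""
    merged = []
    for start, end in sorted(calls):
        if merged and overlap_fraction(merged[-1], (start, end)) >= threshold:
            last = merged.pop()
            merged.append((min(last[0], start), max(last[1], end)))
        else:
            merged.append((start, end))
    return merged
```

For example, calls (100, 200) and (110, 210) overlap by 90% of the shorter call and are merged into (100, 210), while a distant call at (500, 600) remains separate.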
Database of Transcriptional Start Sites
To support transcriptional regulation studies, we have constructed DBTSS (DataBase of Transcriptional Start Sites), which represents the exact positions of transcriptional start sites (TSSs) in the genome, based on our unique experimentally validated TSS sequencing method, TSS-seq.
This database covers TSS data for a major part of human adult and embryonic tissues. DBTSS now contains 491 million TSS tag sequences collected from a total of 20 tissues and 7 cell cultures. We have also integrated our newly generated RNA-seq data of subcellular-fractionated RNAs and ChIP-seq data of histone modifications, RNA polymerase II, and several transcriptional regulatory factors in cultured cell lines. We have also included recently accumulated external epigenomic data, such as the chromatin map of the ENCODE project.
In this update, we further associated this TSS information with public and original SNV data in order to identify single nucleotide variations (SNVs) in regulatory regions.
It is believed that single nucleotide variations (SNVs) in transcriptional regulatory regions are responsible for many human diseases, including cancers. However, it remains difficult to distinguish functionally relevant SNVs from those with no explicit biological consequences. In this version of DBTSS, we attempt to associate SNVs with the omics information of the surrounding regions. We used SNVs identified in our genomic analyses of various types of cancers, including somatic mutations from 100 lung adenocarcinoma and small cell lung carcinoma samples. For germline variations, we used SNVs in dbSNP as well as our unique dataset of variations in 1000 Japanese individuals. We integrated this SNV information with our original datasets of TSS-seq, RNA-seq, ChIP-seq of representative histone modifications, and bisulfite sequencing of cytosine methylation of DNA. In particular, we present multi-omics data for 26 lung adenocarcinoma cell lines for which TSS-seq, RNA-seq, ChIP-seq, and BS-seq data, together with whole-genome sequences, were collected from the same materials. We further connected the multi-omics data of model organisms by genome-genome alignment. We provide a unique data resource for investigating which genomic features are observed at particular genomic coordinates in a wide variety of samples.
These data can be browsed in our new viewer, which also supports versatile user-defined search conditions. We believe the new DBTSS will be helpful for understanding the biological consequences of the massively identified TSSs and for identifying human genetic variations associated with disordered transcriptional regulation.
dbCASE
dbCASE (database of classified alternative splicing events) is derived from high-quality transcript/genome sequence alignments. "Transcripts" here means RefSeq, cDNA (or mRNA), and EST sequences. The alignments for each gene are then converted into a data structure called a directed acyclic graph (DAG), or splicing graph. This data structure is a natural representation of multiple transcripts, which are no longer linear. The splicing graph allows AS patterns and their supporting evidence to be detected efficiently. We have made several important extensions to the splicing graph, e.g., to estimate exon inclusion rates and transcript coverage.
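A minimal sketch of the splicing-graph idea (names and representation are hypothetical; the actual dbCASE implementation differs): exons become nodes, observed splice junctions become directed edges, and counting how many transcripts support each node yields a simple exon inclusion rate.

```python
from collections import defaultdict

def build_splicing_graph(transcripts):
    """Build a toy splicing graph from transcripts given as exon tuples.
    Nodes count transcripts containing each exon; edges count
    transcripts supporting each splice junction."""
    nodes = defaultdict(int)
    edges = defaultdict(int)
    for exons in transcripts:
        for exon in exons:
            nodes[exon] += 1
        for a, b in zip(exons, exons[1:]):
            edges[(a, b)] += 1
    return nodes, edges

def inclusion_rate(nodes, exon, total_transcripts):
    """Fraction of transcripts that include the given exon."""
    return nodes[exon] / total_transcripts

# Two transcripts of one gene: one includes exon e2, one skips it,
# so e2 has an inclusion rate of 0.5 (a cassette-exon AS pattern).
nodes, edges = build_splicing_graph([("e1", "e2", "e3"), ("e1", "e3")])
```

The exon-skipping event is visible directly in the graph: the edge ("e1", "e3") bypasses node "e2", and each path through the DAG corresponds to one transcript isoform.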