cons5way Conservation Multiz Alignment & Conservation (5 Species) Comparative Genomics Description This track shows a measure of evolutionary conservation in human, chimp, mouse, rat, and chicken based on a phylogenetic hidden Markov model (phylo-HMM). The following multiz alignments were used to generate the annotation: human July 2003 (NCBI34/hg16) (hg16) chimpanzee Nov. 2003 (panTro1) mouse Feb. 2003 (mm3) rat Jun. 2003 (rn3) chicken Feb. 2004 (galGal2) In "full" visibility mode, this track displays pairwise alignments of chimp, mouse, rat, and chicken, each aligned to the human genome. The pairwise alignments are displayed in the standard UCSC browser "dense" mode using a greyscale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display; however, this does not remove them from the conservation score display. When zoomed-in to the base-display level, the track shows the base composition of each alignment. The numbers and symbols on the Gaps line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. If there is sufficient space in the display, the size of the gap is shown; if not, and if the gap size is a multiple of 3, a "*" is displayed, otherwise "+" is shown. To view detailed information about the alignments at a specific position, zoom in the display to 30,000 or fewer bases, then click on the alignment. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. Methods Best-in-genome blastz pairwise alignments of human-mouse and human-rat were multiply aligned using a program called humor (HUman-MOuse-Rat), which is a special variant of the Multiz program. 
Multiz was used first to align the humor results with reciprocal best human-chimp alignments, and then to align the human-chimp-mouse-rat multiple alignment with best-in-genome blastz human-chicken alignments. The resulting human-chimp-mouse-rat-chicken multiple alignments were then assigned conservation scores by phylo-HMM. A phylo-HMM is a probabilistic model that describes both the process of DNA substitution at each site in a genome and the way this process changes from one site to the next (Felsenstein and Churchill 1996, Yang 1995, Siepel and Haussler 2003, Siepel and Haussler 2004). A phylo-HMM can be thought of as a machine that generates a multiple alignment, in the same way that an ordinary hidden Markov model (HMM) generates an individual sequence. While the states of an ordinary HMM are associated with simple multinomial probability distributions, the states of a phylo-HMM are associated with more complex distributions defined by probabilistic phylogenetic models. These distributions can capture differences in the rates and patterns of nucleotide substitution observed in different types of genomic regions (e.g., coding or noncoding regions, conserved or nonconserved regions). To compute a conservation score, we use a k-state phylo-HMM, whose k associated phylogenetic models differ only in overall evolutionary rate (Felsenstein and Churchill 1996, Yang 1995). The accompanying illustration shows k = 3 states (S1, S2, and S3), but in practice we use k = 10. A phylogenetic model is estimated globally, using the discrete gamma model for rate variation (Yang 1994), then a scaled version of the estimated model is associated with each state in the phylo-HMM. Each state i has its own "rate constant", r_i, by which all branch lengths in the globally estimated model are multiplied. The transition probabilities between states allow for autocorrelation of substitution rates, i.e., for adjacent sites to tend to exhibit similar overall substitution rates. 
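A minimal Python sketch of the two ideas just described: one scaled copy of the global model per state, and transitions that favor staying in the same rate category. The branch lengths, rate constants, and function names are hypothetical. The transition form (stay with probability `lam`, otherwise pick any state uniformly) is an assumption in the style of Felsenstein and Churchill (1996), not necessarily the exact parameterization used for this track.

```python
def scaled_models(branch_lengths, rate_constants):
    """Return one copy of the branch lengths per state, each scaled by that state's rate constant."""
    return [
        {branch: r * length for branch, length in branch_lengths.items()}
        for r in rate_constants
    ]


def transition_matrix(k, lam):
    """Assumed form: keep the current state with probability lam,
    otherwise draw the next state uniformly from all k states."""
    return [
        [lam * (1.0 if i == j else 0.0) + (1.0 - lam) / k for j in range(k)]
        for i in range(k)
    ]


# Hypothetical toy values (not the estimates used for the track).
branch_lengths = {"human": 0.01, "chimp": 0.01, "mouse": 0.08, "rat": 0.09, "chicken": 0.45}
rate_constants = [0.2, 1.0, 2.5]  # k = 3 for readability; the track uses k = 10

models = scaled_models(branch_lengths, rate_constants)
trans = transition_matrix(k=len(rate_constants), lam=0.9)
```

With `lam` close to 1, each row of `trans` puts most of its mass on the diagonal, which is what makes neighboring sites tend to share a rate category.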
A single parameter, lambda, describes the degree of autocorrelation and defines all transition probabilities. Here, we have estimated the rate constants from the data, similarly to Yang (1995) (Siepel and Haussler 2003), but have allowed lambda to be treated as a tuning parameter. For the conservation score, we use the posterior probability that each site was "generated" by the state having the smallest rate constant. Because of the way the rate categories are defined, the plotted values can be thought of as approximately representing the posterior probability that each site is among the 10% most conserved sites in the data set (allowing for autocorrelation of substitution rates). In this case, the general reversible (REV) substitution model was used in parameter estimation, and lambda was set to 0.9. Alignment gaps were treated as missing data, which sometimes has the effect of producing undesirably high posterior probabilities in gappy regions of the alignment. We are looking at several possible ways of improving the handling of alignment gaps. Credits This track was created at UCSC using the following programs: Blastz and multiz from Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. AxtBest, axtChain, chainNet, netSyntenic, and netClass developed by Jim Kent at UCSC. Tree estimation and phylo-HMM software by Adam Siepel at Cornell University. "Wiggle track" plotting software by Hiram Clawson at UCSC. The phylogenetic tree is based on Murphy et al. (2001) and general consensus in the vertebrate phylogeny community. References Phylo-HMMs and phastCons Felsenstein, J. and Churchill, G.A. A hidden Markov model approach to variation among sites in rate of evolution. Mol Biol Evol 13, 93-104 (1996). Siepel, A. and Haussler, D. Phylogenetic hidden Markov models. In R. Nielsen, ed., Statistical Methods in Molecular Evolution, pp. 325-351, Springer, New York (2005). 
Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R.K., Gibbs, R.A., Kent, W.J., Miller, W., and Haussler, D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034-1050 (2005). Yang, Z. A space-time process model for the evolution of DNA sequences. Genetics 139, 993-1005 (1995). Chain/Net: Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003). Multiz: Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F.A., Roskin, K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., Haussler, D., and Miller, W. Aligning Multiple Genomic Sequences with the Threaded Blockset Aligner. Genome Res. 14(4), 708-15 (2004). Blastz: Chiaromonte, F., Yap, V.B., and Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002, 115-26 (2002). Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res. 13(1), 103-7 (2003). Phylogenetic Tree: Murphy, W.J., et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294(5550), 2348-51 (2001). cons5wayViewalign Multiz Alignments Multiz Alignment & Conservation (5 Species) Comparative Genomics mzPt1Mm3Rn3Gg2_pHMM Multiz Align Multiz Alignments of 5 Species Comparative Genomics cons5wayViewphastcons Element Conservation (phastCons) Multiz Alignment & Conservation (5 Species) Comparative Genomics mzPt1Mm3Rn3Gg2_pHMM_wig 5 Species Cons 5 Species Conservation by PhastCons Comparative Genomics gap Gap Gap Locations Mapping and Sequencing Description This track depicts gaps in the assembly. Most of these gaps - with the exception of intractable heterochromatic, centromeric, telomeric, and short-arm gaps - have been closed during the finishing process, although a small number still remain. Gaps are represented as black boxes in this track. If the relative order and orientation of the contigs on either side of the gap are known, it is a bridged gap. In this case, a white line is drawn through the black box representing the gap and the gap is labeled "yes". This assembly contains the following types of gaps: Fragment - gaps between the contigs of a draft clone (size varies). 
(In this context, a contig is a set of overlapping sequence reads.) Clone - gaps between clones in the same map contig (size varies). Contig - gaps between map contigs (size varies). Heterochromatin - gaps from large blocks of heterochromatin (size varies). Centromere - gaps from centromeres (3,000,000 Ns). Short_arm - large gaps in the short (p) arm (size varies). Telomere - gaps from telomeres (size varies). mrna Human mRNAs Human mRNAs from GenBank mRNA and EST Description The mRNA track shows alignments between human mRNAs in GenBank and the genome. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the mRNA display. For example, to apply the filter to all mRNAs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only mRNAs that match all filter criteria will be highlighted. If "or" is selected, mRNAs that match any one of the filter criteria will be highlighted. 
Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display mRNAs that match the filter criteria. If "include" is selected, the browser will display only those mRNAs that match the filter criteria. This track may also be configured to display codon coloring, a feature that allows the user to quickly compare mRNAs against the genomic sequence. For more information about this option, go to the Codon and Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods GenBank human mRNAs were aligned against the genome using the blat program. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 hg17Kg Known Genes II UCSC Known Genes II (June 05, Based on hg17 UCSC Known Genes) Genes and Gene Predictions Description The Known Genes II track was built based on UCSC Known Genes data set of hg17 (Human May 2004 Assembly). Clicking the "Outside Link" entry above will bring you to the gene details page of hg17 (Human May 2004 Assembly). 
The original "Known Genes" track of hg16 (built in March, 2004) is somewhat outdated, but still available. Methods The hg17 UCSC Known Genes was built by a new process, KG II, as described below. UniProt protein sequences (including alternative splicing isoforms) and mRNA sequences from RefSeq and GenBank were aligned against the base genome using BLAT. RefSeq alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. GenBank mRNA alignments having a base identity level within 0.2% of the best and at least 97% base identity with the genomic sequence were kept. Protein alignments having a base identity level within 0.2% of the best and at least 80% base identity with the genomic sequence were kept. Then the genomic mRNA and protein alignments were compared, and protein-mRNA pairings were determined from their overlaps. mRNA CDS data were obtained from RefSeq and GenBank data and supplemented by CDS structures derived from UCSC protein-mRNA BLAT alignments. The initial set of UCSC Known Genes candidates consists of all protein-mRNA pairs with valid mRNA CDS structures. A gene-check program (similar to the one used for the Consensus CDS (CCDS) project) is used to remove questionable candidates, such as those with in-frame stop codons, missing start or stop codons, etc. From each group of gene candidates that share the same CDS structure, the protein-mRNA pair having the best ranking and protein-mRNA alignment score is selected as a UCSC Known Gene. The ranking of a gene candidate depends on its gene-check quality measures. When all else is equal, a preference is given to RefSeq mRNAs and next to MGC mRNAs. Similarly, preference is given to gene candidates represented by Swiss-Prot proteins. The protein-mRNA alignment score is calculated based on a protein-to-mRNA alignment using TBLASTN, plus weighted sub-scores according to the date and length of the mRNA. 
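The identity filters in the Methods above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Alignment` record, the source labels, and the query names are hypothetical, "within 0.1% of the best" is read as an absolute difference in percent identity, and "the best" is taken per query sequence. It is not the KG II pipeline code.

```python
from dataclasses import dataclass

# (allowed gap below the best identity, minimum identity), in percent, as quoted in Methods.
THRESHOLDS = {
    "refseq": (0.1, 96.0),
    "genbank_mrna": (0.2, 97.0),
    "protein": (0.2, 80.0),
}


@dataclass
class Alignment:
    query: str       # hypothetical sequence name
    source: str      # one of the THRESHOLDS keys
    identity: float  # percent base identity with the genome


def keep_alignments(alignments):
    """Keep alignments that are near the best hit for their query and above the floor."""
    best = {}
    for a in alignments:
        key = (a.source, a.query)
        best[key] = max(best.get(key, 0.0), a.identity)

    kept = []
    for a in alignments:
        near_best, minimum = THRESHOLDS[a.source]
        if a.identity >= minimum and best[(a.source, a.query)] - a.identity <= near_best:
            kept.append(a)
    return kept


alignments = [
    Alignment("refseqA", "refseq", 99.8),      # best hit for refseqA: kept
    Alignment("refseqA", "refseq", 99.5),      # 0.3 below the best: dropped
    Alignment("mrnaB", "genbank_mrna", 96.5),  # below the 97% floor: dropped
    Alignment("protC", "protein", 85.0),       # above the 80% floor: kept
]
kept = keep_alignments(alignments)
```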
Credits The UCSC Known Genes track was produced using protein data from UniProt and mRNA data from NCBI RefSeq and GenBank. Data Use Restrictions The UniProt data have the following terms of use, UniProt copyright(c) 2002 - 2004 UniProt consortium: For non-commercial use, all databases and documents in the UniProt FTP directory may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy. For commercial use, all databases and documents in the UniProt FTP directory except the files ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.xml.gz may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy. More information for commercial users can be found at the UniProt License & disclaimer page. From January 1, 2005, all databases and documents in the UniProt FTP directory may be copied and redistributed freely by all entities, without advance permission, provided that this copyright statement is reproduced with each copy. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. Kent WJ. BLAT - The BLAST-Like Alignment Tool. Genome Res. 2002 Apr;12(4):656-64. rmsk RepeatMasker Repeating Elements by RepeatMasker Variation and Repeats Description This track was created by using Arian Smit's RepeatMasker program, which screens DNA sequences for interspersed repeats and low complexity DNA sequences. The program outputs a detailed annotation of the repeats that are present in the query sequence (represented by this track), as well as a modified version of the query sequence in which all the annotated repeats have been masked (generally available on the Downloads page). 
RepeatMasker uses the Repbase Update library of repeats from the Genetic Information Research Institute (GIRI). Repbase Update is described in Jurka (2000) in the References section below. Some newer assemblies have been made with Dfam, not Repbase. You can find the details for how we make our database data here in our "makeDb/doc/" directory. Display Conventions and Configuration In full display mode, this track displays up to ten different classes of repeats: Short interspersed nuclear elements (SINE), which include ALUs Long interspersed nuclear elements (LINE) Long terminal repeat elements (LTR), which include retroposons DNA repeat elements (DNA) Simple repeats (micro-satellites) Low complexity repeats Satellite repeats RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA) Other repeats, which includes class RC (Rolling Circle) Unknown The level of color shading in the graphical display reflects the amount of base mismatch, base deletion, and base insertion associated with a repeat element. The higher the combined number of these, the lighter the shading. A "?" at the end of the "Family" or "Class" (for example, DNA?) signifies that the curator was unsure of the classification. At some point in the future, either the "?" will be removed or the classification will be changed. Methods Data are generated using the RepeatMasker -s flag. Additional flags may be used for certain organisms. Repeats are soft-masked. Alignments may extend through repeats, but are not permitted to initiate in them. See the FAQ for more information. Credits Thanks to Arian Smit, Robert Hubley and GIRI for providing the tools and repeat libraries used to generate this track. References Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. http://www.repeatmasker.org. 1996-2010. Repbase Update is described in: Jurka J. Repbase Update: a database and an electronic journal of repetitive elements. Trends Genet. 2000 Sep;16(9):418-420. 
PMID: 10973072 For a discussion of repeats in mammalian genomes, see: Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999 Dec;9(6):657-63. PMID: 10607616 Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996 Dec;6(6):743-8. PMID: 8994846 acembly Acembly Genes AceView Gene Models With Alt-Splicing Genes and Gene Predictions Description This track shows AceView gene models constructed from mRNA, EST and genomic evidence by Danielle and Jean Thierry-Mieg and Vahan Simonyan using the Acembly program. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. Gene models that fall into the "main" prediction class are displayed in purple; "putative" genes are displayed in pink. The track description page offers the following filter and configuration options: Gene Class filter: Select the main or putative option to filter the display by prediction class. Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison of gene predictions. Click the "Codon coloring help" link on the track description page for more information about this feature. Methods AceView attempts to find the best alignment of each mRNA/EST against the genome, and clusters the alignments into the least possible number of alternatively spliced transcripts. The reconstructed transcripts are then clustered into genes by simple transitive contact. To see the evidence that supports each transcript, click the "Outside Link" on an individual transcript's details page to access the NCBI AceView web site. Each AceView transcript model has a gene cluster designation (alternate name) that is categorized into a prediction class of either main or putative. 
Prediction Class: main Class of genes that includes the protein coding genes (defined here by CDS > 100 amino acids) and all genes with at least one well-defined standard intron, i.e., an intron with a GT-AG or GC-AG boundary, supported by at least one clone matching exactly, with no ambiguous bases, and the 8 bases on either side of the intron identical to the genome. Genes with a CDS smaller than 100 amino acids are included in this class if they meet one of the following conditions: they have a NCBI RefSeq sequence (NM_#) or an OMIM identifier, or they encode a protein with BlastP homology (< 1e-3) to a cDNA-supported nematode AceView protein. Prediction Class: putative Class of genes that have no standard intron and do not encode CDS of more than 100 amino acids, yet may be sufficiently useful to justify not disregarding them completely. Putative genes may be of two types: either those supported by more than six cDNA clones or those that encode a putative protein with an interesting annotation. Examples include a PFAM motif, a BlastP hit to a species other than itself (< 1e-3), a transmembrane domain or other rare and meaningful domains identified by Psort2, or a highly probable localization in a cell compartment (excluding cytoplasm and nucleus). Credits Thanks to Danielle and Jean Thierry-Mieg at NIH for providing this track. References Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol. 2006;7 Suppl 1:S12.1-14. blatFr1 Fugu Blat Takifugu rubripes (Aug. 2002/fr1) Translated Blat Alignments Comparative Genomics Description This track shows blat translated protein alignments of the Fugu (Aug. 2002 (JGI 3.0/fr1)) genome assembly to the human genome. The v3.0 Fugu whole genome shotgun assembly was provided by the US DOE Joint Genome Institute (JGI). The strand information (+/-) for this track is in two parts. 
The first + or - indicates the orientation of the query sequence whose translated protein produced the match. The second + or - indicates the orientation of the matching translated genomic sequence. Because the two orientations of a DNA sequence give different predicted protein sequences, there are four combinations. ++ is not the same as --; nor is +- the same as -+. Methods The alignments were made with blat in translated protein mode requiring two nearby 4-mer matches to trigger a detailed alignment. The human genome was masked with RepeatMasker and Tandem Repeat Finder before running blat. Credits The 3.0 draft from JGI was used in the UCSC Fugu blat alignments. These data were provided freely by the JGI for use in this publication only. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 refSeqComposite NCBI RefSeq RefSeq genes from NCBI Genes and Gene Predictions Description The NCBI RefSeq Genes composite track shows human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq). All subtracks use coordinates provided by RefSeq, except for the UCSC RefSeq track, which UCSC produces by realigning the RefSeq RNAs to the genome. This realignment may result in occasional differences between the annotation coordinates provided by UCSC and NCBI. For RNA-seq analysis, we advise using NCBI aligned tables like RefSeq All or RefSeq Curated. See the Methods section for more details about how the different tracks were created. Please visit NCBI's Feedback for Gene and Reference Sequences (RefSeq) page to make suggestions, submit additions and corrections, or ask for help concerning RefSeq records. For more information on the different gene tracks, see our Genes FAQ. Display Conventions and Configuration This track is a composite track that contains differing data sets. 
To show only a selected set of subtracks, uncheck the boxes next to the tracks that you wish to hide. Note: Not all subtracks are available on all assemblies. The possible subtracks include: RefSeq aligned annotations and UCSC alignment of RefSeq annotations RefSeq All – all curated and predicted annotations provided by RefSeq. RefSeq Curated – subset of RefSeq All that includes only those annotations whose accessions begin with NM, NR, NP or YP. (NP and YP are used only for protein-coding genes on the mitochondrion; YP is used for human only.) RefSeq Predicted – subset of RefSeq All that includes those annotations whose accessions begin with XM or XR. RefSeq Other – all other annotations produced by the RefSeq group that do not fit the requirements for inclusion in the RefSeq Curated or the RefSeq Predicted tracks, as they do not have a product and therefore no RefSeq accession. More than 90% are pseudogenes, T-cell receptor or immunoglobulin segments. The few remaining entries are gene clusters (e.g. protocadherin). RefSeq Alignments – alignments of RefSeq RNAs to the human genome provided by the RefSeq group, following the display conventions for PSL tracks. RefSeq Diffs – alignment differences between the human reference genome(s) and RefSeq transcripts. (Track not currently available for every assembly.) UCSC RefSeq – annotations generated from UCSC's realignment of RNAs with NM and NR accessions to the human genome. This track was previously known as the "RefSeq Genes" track. RefSeq Select+MANE (subset) – subset of RefSeq Curated containing transcripts marked as RefSeq Select or MANE Select. A single Select transcript is chosen as representative for each protein-coding gene. This track includes transcripts categorized as MANE, which are further agreed upon as representative by both NCBI RefSeq and Ensembl/GENCODE, and have a 100% identical match to a transcript in the Ensembl annotation. See NCBI RefSeq Select. 
Note that we provide a separate track, MANE (hg38), which contains only the MANE transcripts. RefSeq HGMD (subset) – subset of RefSeq Curated containing transcripts annotated by the Human Gene Mutation Database. This track is only available on the human genomes hg19 and hg38. It is the most restricted RefSeq subset, targeting clinical diagnostics. The RefSeq All, RefSeq Curated, RefSeq Predicted, RefSeq HGMD, RefSeq Select+MANE and UCSC RefSeq tracks follow the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), or reviewed (dark), as defined by RefSeq. The review levels are defined as follows. Reviewed: the RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information. Provisional: the RefSeq record has not yet been subject to individual review. The initial sequence-to-gene association has been established by outside collaborators or NCBI staff. Predicted: the RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted. The item labels and codon display properties for features within this track can be configured through the check-box controls at the top of the track description page. To adjust the settings for an individual subtrack, click the wrench icon next to the track name in the subtrack list. Label: By default, items are labeled by gene name. Click the appropriate Label option to display the accession name or OMIM identifier instead of the gene name, show all or a subset of these labels including the gene name, OMIM identifier and accession names, or turn off the label completely. Codon coloring: This track has an optional codon coloring feature that allows users to quickly validate and compare gene predictions.
To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. The RefSeq Diffs track contains five different types of inconsistency between the reference genome sequence and the RefSeq transcript sequences. The five types of differences are as follows: mismatch – aligned but mismatching bases, plus HGVS g. to show the genomic change required to match the transcript and HGVS c./n. to show the transcript change required to match the genome. short gap – genomic gaps that are too small to be introns (arbitrary cutoff of < 45 bp), most likely insertion/deletion variants or errors, with HGVS g. and c./n. showing differences. shift gap – shortGap items whose placement could be shifted left and/or right on the genome due to repetitive sequence, with HGVS c./n. position range of ambiguous region in transcript. Here, thin and thick lines are used: the thin line shows the span of the repetitive sequence, and the thick line shows the rightmost shifted gap. double gap – genomic gaps that are long enough to be introns but that skip over transcript sequence (invisible in default setting), with HGVS c./n. deletion. skipped – sequence at the beginning or end of a transcript that is not aligned to the genome (invisible in default setting), with HGVS c./n. deletion. HGVS Terminology (Human Genome Variation Society): g. = genomic sequence; c. = coding DNA sequence; n. = non-coding RNA reference sequence. When reporting HGVS with RefSeq sequences, to make sure that results from research articles can be mapped to the genome unambiguously, please specify the RefSeq annotation release displayed on the transcript's Genome Browser details page and also the RefSeq transcript ID with version (e.g. NM_012309.4 not NM_012309).
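The accession conventions described in this section (NM, NR, NP and YP for RefSeq Curated; XM and XR for RefSeq Predicted; a trailing version such as ".4" when reporting HGVS) can be checked mechanically. The following is a minimal Python sketch, not part of any UCSC or NCBI tool. The function name classify_accession, the regular expression, and the "other" and "unrecognized" labels are assumptions made for illustration, and XR_000001.1 is a made-up accession used only as sample input.

```python
import re

# Accession prefixes as listed in the subtrack descriptions above.
CURATED_PREFIXES = ("NM", "NR", "NP", "YP")
PREDICTED_PREFIXES = ("XM", "XR")

# Assumed pattern (illustration only): two-letter prefix, underscore,
# digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_accession(accession):
    """Return (subset, has_version) for a RefSeq-style accession string."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return ("unrecognized", False)
    prefix, _number, version = match.groups()
    if prefix in CURATED_PREFIXES:
        subset = "curated"
    elif prefix in PREDICTED_PREFIXES:
        subset = "predicted"
    else:
        # Hypothetical catch-all label; not a RefSeq category.
        subset = "other"
    return (subset, version is not None)


if __name__ == "__main__":
    # NM_012309.4 is the example from the text; XR_000001.1 is made up.
    for acc in ("NM_012309.4", "NM_012309", "XR_000001.1"):
        print(acc, classify_accession(acc))
```

A check like this could be run over a list of transcript IDs before publication to flag any that lack a version number.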
Methods Tracks contained in the RefSeq annotation and RefSeq RNA alignment tracks were created at UCSC using data from the NCBI RefSeq project. Data files were downloaded from RefSeq in GFF file format and converted to the genePred and PSL table formats for display in the Genome Browser. Information about the NCBI annotation pipeline can be found here. The RefSeq Diffs track is generated by UCSC using NCBI's RefSeq RNA alignments. The UCSC RefSeq Genes track is constructed using the same methods as previous RefSeq Genes tracks. RefSeq RNAs were aligned against the human genome using BLAT. Those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. Data Access The raw data for these tracks can be accessed in multiple ways. It can be explored interactively using the REST API, Table Browser or Data Integrator. The tables can also be accessed programmatically through our public MySQL server or downloaded from our downloads server for local processing. The previous track versions are available in the archives of our downloads server. You can also access any RefSeq table entries in JSON format through our JSON API. The data in the RefSeq Other and RefSeq Diffs tracks are organized in bigBed file format; more information about accessing the information in this bigBed file can be found below. The other subtracks are associated with database tables as follows: genePred format: RefSeq All - ncbiRefSeq RefSeq Curated - ncbiRefSeqCurated RefSeq Predicted - ncbiRefSeqPredicted RefSeq HGMD - ncbiRefSeqHgmd RefSeq Select+MANE - ncbiRefSeqSelect UCSC RefSeq - refGene PSL format: RefSeq Alignments - ncbiRefSeqPsl The first column of each of these tables is "bin". 
This column is designed to speed up access for display in the Genome Browser, but can be safely ignored in downstream analysis. You can read more about the bin indexing system here. The annotations in the RefSeqOther and RefSeqDiffs tracks are stored in bigBed files, which can be obtained from our downloads server here, ncbiRefSeqOther.bb and ncbiRefSeqDiffs.bb. Individual regions or the whole set of genome-wide annotations can be obtained using our tool bigBedToBed which can be compiled from the source code or downloaded as a precompiled binary for your system from the utilities directory linked below. For example, to extract only annotations in a given region, you could use the following command: bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg16/ncbiRefSeq/ncbiRefSeqOther.bb -chrom=chr16 -start=34990190 -end=36727467 stdout You can download a GTF format version of the RefSeq All table from the GTF downloads directory. The genePred format tracks can also be converted to GTF format using the genePredToGtf utility, available from the utilities directory on the UCSC downloads server. The utility can be run from the command line like so: genePredToGtf hg16 ncbiRefSeqPredicted ncbiRefSeqPredicted.gtf Note that using genePredToGtf in this manner accesses our public MySQL server, and you therefore must set up your hg.conf as described on the MySQL page linked near the beginning of the Data Access section. A file containing the RNA sequences in FASTA format for all items in the RefSeq All, RefSeq Curated, and RefSeq Predicted tracks can be found on our downloads server here. Please refer to our mailing list archives for questions. Previous versions of the ncbiRefSeq set of tracks can be found on our archive download server. Credits This track was produced at UCSC from data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. 
PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979 refGene UCSC RefSeq UCSC annotations of RefSeq RNAs (NM_* and NR_*) Genes and Gene Predictions Description The RefSeq Genes track shows known human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq). The data underlying this track are updated weekly. Please visit the Feedback for Gene and Reference Sequences (RefSeq) page to make suggestions, submit additions and corrections, or ask for help concerning RefSeq records. For more information on the different gene tracks, see our Genes FAQ. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), reviewed (dark). The item labels and display colors of features within this track can be configured through the controls at the top of the track description page. Label: By default, items are labeled by gene name. Click the appropriate Label option to display the accession name instead of the gene name, show both the gene and accession names, or turn off the label completely. Codon coloring: This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. 
For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. Hide non-coding genes: By default, both the protein-coding and non-protein-coding genes are displayed. If you wish to see only the coding genes, click this box. Methods RefSeq RNAs were aligned against the human genome using BLAT. Those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from RNA sequence data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979 snp SNPs Simple Nucleotide Polymorphisms (SNPs) Variation and Repeats Description This track consolidates all the Simple Nucleotide Polymorphisms (SNPs) into a single track. This represents data from dbSnp and commercially-available genotyping arrays. Please be aware that some mapping inconsistencies are known to exist in the dbSnp data set. If you encounter information that seems incorrect on the details page for a variant, we advise you to verify the record information on the dbSnp website using the provided link. 
In some known instances, the size of the variant does not match the size of its genomic location; UCSC is working with dbSnp to correct these errors in the data set. Interpreting and Configuring the Graphical Display Variants are shown as single tick marks at most zoom levels. When viewing the track at or near base-level resolution, the displayed width of the SNP corresponds to the width of the variant in the reference sequence. Insertions are indicated by a single tick mark displayed between two nucleotides, single nucleotide polymorphisms are displayed as the width of a single base, and multiple nucleotide variants are represented by a block that spans two or more bases. When the start coordinate for a SNP is shown as chromStart = chromEnd+1 on the SNP's details page, this is generally not an error; rather, it indicates that the variant is an insertion at this genomic position. In these instances, the location type will be set to "between". Note that insertions are represented as chromStart = chromEnd in the snp table accessible from the Table Browser or downloads server, due to the half-open zero-based representation of data in the underlying database. The colors of variants in the display may be changed to highlight their source, molecule type, variant class, validation status, or functional classification. Variants can be excluded from the display based on these same criteria or if they fall below the user-specified minimum average heterozygosity. The track configuration options are located at the top of the SNPs track description page. By default, variants are colored by functional classification, with SNPs likely to cause a phenotype in red (non-synonymous and splice site mutations). The configuration categories below follow the definitions in the document type definition (DTD) that describes the dbSnp XML format.
Source: Origin of this data dbSnp - From the current build of dbSnp Affymetrix Genotyping Array 10K - SNPs on the commercial array Affymetrix Genotyping Array 10K v2 - SNPs on the commercial array Affymetrix Genotyping Array 50K HindIII - SNPs on the commercial array Affymetrix Genotyping Array 50K XbaI - SNPs on the commercial array Molecule Type: Sample used to find this variant Unknown - sample type not known Genomic - variant discovered using a genomic template cDNA - variant discovered using a cDNA template Mitochondrial - variant discovered using a mitochondrial template Chloroplast - variant discovered using a chloroplast template Variant Class: Variant classification Unknown - no classification provided by data contributor Single Nucleotide Polymorphism - single nucleotide variation: alleles of length = 1 and from set of {A,T,C,G} Insertion/deletion - insertion/deletion variation: alleles of different length or include '-' character Heterozygous - heterozygous (undetermined) variation: allele contains string '(heterozygous)' Microsatellite - microsatellite variation: allele string contains numbers and '(motif)' pattern Named - insertion/deletion of named object (length unknown) No Variation - no variation asserted for sequence Mixed - mixed class Multiple Nucleotide Polymorphism - alleles of the same length, length > 1, and from set of {A,T,C,G} Validation Status: Method used to validate the variant (each variant may be validated by more than one method) Unknown - no validation has been reported for this refSNP Other Population - at least one ss in cluster was validated by independent assay By Frequency - at least one subsnp in cluster has frequency data submitted By Cluster - cluster has 2+ submissions, with 1+ submissions assayed with a non-computational method By 2 Hit/2 Allele - all alleles have been observed in 2+ chromosomes By HapMap - validated by HapMap project By Genotype - at least one genotype reported for this refSNP Function: Predicted 
functional role (each variant may have more than one functional role) Unknown - no known functional classification Locus Region - variation in region of gene, but not in transcript Coding - variation in coding region of gene, assigned if allele-specific class unknown Coding - Synonymous - no change in peptide for allele with respect to contig seq Coding - Non-Synonymous - change in peptide with respect to contig sequence mRNA/UTR - variation in transcript, but not in coding region interval Intron - variation in intron, but not in first two or last two bases of intron Splice Site - variation in first two or last two bases of intron Reference - allele observed in reference contig sequence Exception - variation in coding region with exception raised on alignment. This occurs when protein with gap in sequence is aligned back to contig sequence. Variations that are on the 3' side of the gap have undefined functional inference. Location Type: Describes how a segment of the reference assembly must be altered to represent the variant SNP allele Unknown - undefined or error Range - a range of two or more bases in the reference assembly must be altered. This occurs, for example, when the variant allele is a deletion of two or more bases relative to the allele represented by the reference assembly. Exact - one base in the reference assembly must be altered. This occurs when the variant allele is a single-base substitution relative to the reference genome or when the variant allele is a deletion of a single base. Between - no reference assembly bases must be altered. This occurs when the variant allele is an insertion of one or more bases relative to the allele represented by the reference assembly. Large Scale SNP Annotation at UCSF LS-SNP is a database of functional and structural SNP annotations with links to protein structure models. Annotations are based on a variety of features extracted from protein structure, sequence, and evolution. 
Currently only coding non-synonymous SNPs are included. LS-SNP at UCSF. Data Filtering The SNPs in this track include all known polymorphisms available in the current build of dbSnp that can be mapped against the current assembly. The version of dbSnp from which these data were obtained can be found in the SNP track entry in the Genome Browser release log. There are two reasons that some variants may not be mapped and/or annotated in this track: (i) Submissions are completely masked as repetitive elements. These are dropped from any further computations. This set of reference SNPs is found in chromosome "rs_chMasked" on the dbSNP ftp site. (ii) Submissions are defined in a cDNA context with extensive splicing. These SNPs are typically annotated on refSeq mRNAs through a separate annotation process. Effort is being made to reverse map these variations back to contig coordinates, but that has not been implemented. For now, you can find this set of variations in "rs_chNotOn" on the dbSNP ftp site. The heuristics for the non-SNP variations (i.e. named elements and short tandem repeats (STRs)) are quite conservative; therefore, some of these are probably lost. This approach was chosen to avoid false annotation of variation in inappropriate locations. Credits and Data Use Restrictions Thanks to the SNP Consortium and NIH for providing the public data, which are available from dbSnp at NCBI. Thanks to Affymetrix, Inc. for developing the genotyping arrays. Please see the Terms and Conditions page on the Affymetrix website for restrictions on the use of their data. For more details on the Affymetrix genotyping assay, see the supplemental information on the Affymetrix 10K SNP and Affymetrix Genotyping Array products. Additional information, including genotyping data, is available on those pages. Karchin, R., Diekhans, M., Kelly, L., Thomas, D.J., Pieper, U., Eswar, N., Haussler, D. and Sali, A.
LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 21:2814-2820; April 12, 2005. intronEst Spliced ESTs Human ESTs That Have Been Spliced mRNA and EST Description This track shows alignments between human expressed sequence tags (ESTs) in GenBank and the genome that show signs of splicing when aligned against the genome. ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. To be considered spliced, an EST must show evidence of at least one canonical intron (i.e., the genomic sequence between EST alignment blocks must be at least 32 bases in length and have GT/AG ends). By requiring splicing, the level of contamination in the EST databases is drastically reduced at the expense of eliminating many genuine 3' ESTs. For a display of all ESTs (including unspliced), see the human EST track. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, darker shading indicates a larger number of aligned ESTs. The strand information (+/-) indicates the direction of the match between the EST and the matching genomic sequence. It bears no relationship to the direction of transcription of the RNA with which it might be associated. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the EST display. For example, to apply the filter to all ESTs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. 
For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only ESTs that match all filter criteria will be highlighted. If "or" is selected, ESTs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display ESTs that match the filter criteria. If "include" is selected, the browser will display only those ESTs that match the filter criteria. This track may also be configured to display base labeling, a feature that allows the user to display all bases in the aligning sequence or only those that differ from the genomic sequence. For more information about this option, go to the Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods To make an EST, RNA is isolated from cells and reverse transcribed into cDNA. Typically, the cDNA is cloned into a plasmid vector and a read is taken from the 5' and/or 3' primer. For most — but not all — ESTs, the reverse transcription is primed by an oligo-dT, which hybridizes with the poly-A tail of mature mRNA. The reverse transcriptase may or may not make it to the 5' end of the mRNA, which may or may not be degraded. In general, the 3' ESTs mark the end of transcription reasonably well, but the 5' ESTs may end at any point within the transcript. Some of the newer cap-selected libraries cover transcription start reasonably well. Before the cap-selection techniques emerged, some projects used random rather than poly-A priming in an attempt to retrieve sequence distant from the 3' end. 
These projects were successful at this, but as a side effect also deposited sequences from unprocessed mRNA and perhaps even genomic sequences into the EST databases. Even outside of the random-primed projects, there is a degree of non-mRNA contamination. Because of this, a single unspliced EST should be viewed with considerable skepticism. To generate this track, human ESTs from GenBank were aligned against the genome using blat. Note that the maximum intron length allowed by blat is 750,000 bases, which may eliminate some ESTs with very long introns that might otherwise align. When a single EST aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence are displayed in this track. Credits This track was produced at UCSC from EST sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 stsMap STS Markers STS Markers on Genetic (blue) and Radiation Hybrid (black) Maps Mapping and Sequencing Description This track shows locations of Sequence Tagged Site (STS) markers along the draft assembly. These markers have been mapped using either genetic mapping (Genethon, Marshfield, and deCODE maps), radiation hybridization mapping (Stanford, Whitehead RH, and GeneMap99 maps) or YAC mapping (the Whitehead YAC map) techniques. 
Since August 2001, this track no longer displays fluorescent in situ hybridization (FISH) clones, which are now displayed in a separate track. Genetic map markers are shown in blue; radiation hybrid map markers are shown in black. When a marker maps to multiple positions in the genome, it is shown in a lighter color. Methods Positions of STS markers are determined using both full sequences and primer information. Full sequences are aligned using blat, while isPCR (Jim Kent) and ePCR are used to find locations using primer information. Both sets of placements are combined to give final positions. In nearly all cases, full sequence and primer-based locations are in agreement, but in cases of disagreement, full sequence positions are used. Sequence and primer information for the markers were obtained from the primary sites for each of the maps, and from NCBI UniSTS (now part of NCBI Probe). Using the Filter The track filter can be used to change the color or include/exclude a set of map data within the track. This is helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: In the pulldown menu, select the map whose data you would like to highlight or exclude in the display. By default, the "All Genetic" option is selected. Choose the color or display characteristic that will be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display data from the map selected in the pulldown list. If "include" is selected, the browser will display only data from the selected map. When you have finished configuring the filter, click the Submit button. Credits This track was designed and implemented by Terry Furey. Many thanks to the researchers who worked on these maps, and to Greg Schuler, Arek Kasprzyk, Wonhee Jang, and Sanja Rogic for helping process the data. 
Additional data on the individual maps can be found at the following links: Genethon map, Marshfield map, deCODE map, GeneMap99 GB4 and G3 maps, Stanford TNG (Center has closed), Whitehead YAC and RH maps. chimpDels Chimp Deletions Deletions in Chimp (Nov. 2003/panTro1) Relative to Human Comparative Genomics Description This track displays regions of the human genome assembly (hg16) that are deleted in the chimpanzee draft assembly (panTro1). Only regions of between 80 and 12000 bases are included. The name of each deletion is a unique pointer to that deletion followed by an underscore and then its length. A similar track, showing human deletions in the chimpanzee assembly, appears in the chimp Genome Browser. Methods The human/chimpanzee alignments were created at UCSC with blastz and blat, using a reciprocal best strategy with chaining and netting. The initial alignments were generated using blastz on repeatmasked sequence with the following matrix:
        A     C     G     T
  A   100  -300  -150  -300
  C  -300   100  -300  -150
  G  -150  -300   100  -300
  T  -300  -150  -300   100
O = 400, E = 30, K = 4500, L = 4500, M = 50
The overall score is the sum of the score over all pairs. The resulting alignments were processed by the axtChain program. To place additional chimp scaffolds that weren't initially aligned by blastz, a DNA blat of the unmasked sequence was performed. The resulting blat alignments were also chained, and then merged with the blastz-based chains produced in the previous step to produce "all chains", which were further processed by the chainNet and netSyntenic programs. Finally, a "reciprocal best" strategy was employed to minimize paralog fill-in for missing orthologous chimp sequence. Details of the alignment methods can be found in the descriptions of the Chimp Chain and Chimp Net tracks. Chimp deletions in human were determined from the collection of indels implied by these alignments.
The criteria for inclusion in the list of deletions were (i) within, not between, scaffolds; (ii) simple gaps only (no opposing, unmatched bases or double gaps); (iii) 80-12000 bp long; and (iv) not a missed overlap or incorrect gap size in assembly. These criteria aim to include plausible repeat insertions and exclude assembly and alignment artifacts. Credits The chimpanzee sequence used in this track was obtained from the 13 Nov. 2003 Arachne assembly. This sequence was provided by the National Human Genome Research Institute (NHGRI), the Eli & Edythe L. Broad Institute at MIT/Harvard, and Washington University School of Medicine. The BLASTZ program was created by Webb Miller of the Penn State Bioinformatics Group. Jim Kent at UCSC wrote the blat program, the chaining and netting programs, and the scripts for displaying the alignments in this browser. The list of mid-sized (80-12000 bp) chimp deletions relative to human was provided by Tarjei Mikkelsen at MIT. The UCSC alignments of complete chimpanzee scaffolds to the human genome assembly were used to generate this list. References ARACHNE: A Whole-Genome Shotgun Assembler. Serafim Batzoglou, David B. Jaffe, Ken Stanley, Jonathan Butler, Sante Gnerre, Evan Mauceli, Bonnie Berger, Jill P. Mesirov, and Eric S. Lander. Genome Research 2002 Jan;12:177-189. Whole-Genome Sequence Assembly for Mammalian Genomes: ARACHNE 2. David B. Jaffe, Jonathan Butler, Sante Gnerre, Evan Mauceli, Kerstin Lindblad-Toh, Jill P. Mesirov, Michael C. Zody, and Eric S. Lander. Genome Research 2003 Jan;13(1):91-96. Human-Mouse Alignments with BLASTZ. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, and Miller W. Genome Research 2003 Jan;13(1):103-7. Scoring pairwise genomic sequence alignments. Chiaromonte F, Yap VB, Miller W. Pac Symp Biocomput 2002;:115-26. chimpSimpleDiff Chimp Diff Chimp (Nov. 
2003 (CGSC 1.1/panTro1)) Simple Differences in Regions of High Quality Sequence Comparative Genomics Description This track shows simple differences between chimp alignments and the human assembly within regions of high quality chimp sequence. The chimp data was obtained from the 13 Nov. 2003 Arachne assembly. A total of 28,889,041 differences are displayed. The difference rate in coding sequence is approximately half that of the chimp genome as a whole. Methods For a difference to be included in this track, it had to meet the following criteria: (i) the difference must occur at a base of quality 30 or better; (ii) all bases within an 11-base window around this base must have a quality of 25 or better; (iii) the 11-base window must contain no more than two base differences; and (iv) no insertions or deletions may be present within the window. Only reciprocal best chimp alignments were considered for this track (see the Chimp Net description for more information about this alignment strategy). Credits This track was generated at UCSC by Jim Kent. The chimp sequence was obtained from the 13 Nov. 2003 Arachne assembly. We'd like to thank the National Human Genome Research Institute (NHGRI), the Broad Institute, and Washington University School of Medicine in St. Louis for providing this sequence. rBestChainPanTro1 Chimp Recip Chain Chimp (Nov. 2003 (CGSC 1.1/panTro1)) Reciprocal Best Chained Alignments Comparative Genomics Description This track shows "reciprocal best" alignments of chimp (panTro1, Nov. 2003 (CGSC 1.1/panTro1)) to the human genome. These alignments were generated using blastz and blat alignments of chimp genomic sequence from the Nov. 2003 Arachne draft assembly. Alignments were made using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chimp and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species.
The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chimp assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Methods The alignments were generated by blastz on repeatmasked sequence using the following human/chimp scoring matrix:

       A     C     G     T
A    100  -300  -150  -300
C   -300   100  -300  -150
G   -150  -300   100  -300
T   -300  -150  -300   100

K = 4500, L = 3000, Y = 3400, H = 2000

The resulting alignments were fed into axtChain, which organizes all alignments between a single chimp chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. Chains scoring below a threshold were discarded. To place additional chimp sequences not initially aligned by blastz, a DNA blat of the unmasked sequence was performed. The resulting blat alignments were also chained and then merged with the blastz-based chains from the previous step to produce a set of "all chains". Due to the draft nature of this initial genome assembly, the chain track and the companion net track were generated using a "reciprocal best" strategy. 
This strategy attempts to minimize paralog fill-in for missing orthologous chimp sequence by filtering out of the human net all sequences not on the chimp side of the net. First, the merged blastz and blat chains were used to generate an alignment net using the program chainNet (described on the Chimp Recip Net track description page). Next, the subset of chains in the chimp-reference net were extracted and used for an additional netting step. The resulting human-reference net was used to generate the reciprocal best Chimp Recip Net browser track. Non-syntenic sequences smaller than 50 bases were filtered out. Chains extracted from this net are displayed on the Chimp Recip Chain browser track. Credits The chimp sequence used in this track was obtained from the 13 Nov. 2003 Arachne assembly. We'd like to thank the National Human Genome Research Institute (NHGRI), the Broad Institute at MIT/Harvard, and Washington University St. Louis School of Medicine for providing this sequence. Blastz was developed at Pennsylvania State University by Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains were generated by Robert Baertsch and Jim Kent. References Chiaromonte, F., Yap, V.B., Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002, 115-26 (2002). Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003). Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res. 13(1), 103-7 (2003). 
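The chaining step described in the Methods for this track (a dynamic program that finds maximally scoring chains of gapless blocks) can be illustrated with a minimal Python sketch. It is not axtChain itself: the kd-tree is replaced by a plain quadratic scan, and the `Block` type, the `gap_cost` penalty, and the function names are hypothetical choices made only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    t_start: int  # start in the reference (human) sequence
    t_end: int    # end (exclusive) in the reference sequence
    q_start: int  # start in the query (chimp) sequence
    q_end: int    # end (exclusive) in the query sequence
    score: int    # score of the gapless block


def gap_cost(prev: Block, nxt: Block) -> int:
    """Hypothetical linear gap penalty; the real axtChain gap costs differ."""
    return (nxt.t_start - prev.t_end) + (nxt.q_start - prev.q_end)


def best_chain(blocks: List[Block]) -> Tuple[int, List[Block]]:
    """Return the score and the blocks of the maximally scoring chain."""
    if not blocks:
        return 0, []
    order = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    best = [b.score for b in order]  # best chain score ending at each block
    back = [-1] * len(order)         # predecessor index, -1 marks a chain start
    for j, cur in enumerate(order):
        for i in range(j):
            prev = order[i]
            # A predecessor must end before the current block in both sequences.
            if prev.t_end <= cur.t_start and prev.q_end <= cur.q_start:
                cand = best[i] + cur.score - gap_cost(prev, cur)
                if cand > best[j]:
                    best[j], back[j] = cand, i
    end = max(range(len(order)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:  # walk the predecessor links back to the chain start
        chain.append(order[k])
        k = back[k]
    return best[end], chain[::-1]
```

A score threshold, as mentioned in the Methods, could then be applied to the returned chain score before keeping the chain.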
rBestNetPanTro1 Chimp Recip Net Chimp (Nov. 2003/panTro1) Reciprocal Best Net Comparative Genomics Description This track shows the "reciprocal best" human/chimpanzee alignment net. It is useful for finding orthologous regions and for studying genome rearrangement. In the graphical display, the boxes represent ungapped alignments, while the lines represent gaps. In full display mode, the top-level (Level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chains. When possible, these are filled in by other chains displayed at Level 2. The gaps in Level 2 chains may be filled by Level 3 chains and so forth. Clicking on a box displays detailed information about the chain as a whole, while clicking on a line shows information on the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual track features are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on Level 1. Syn - aligns to the same chromosome as the gap in the level above it. Inv - aligns to the same chromosome as the gap above it, but in the opposite orientation. NonSyn - matches a different chromosome from the gap in the level above it. Methods These alignments were generated using blastz and blat alignments of chimpanzee genomic sequence from the 13 Nov. 2003 Arachne chimpanzee draft assembly. The initial alignments were generated using blastz on repeatmasked sequence using the following chimp/human scoring matrix:

       A     C     G     T
A    100  -300  -150  -300
C   -300   100  -300  -150
G   -150  -300   100  -300
T   -300  -150  -300   100

K = 4500, L = 3000, Y = 3400, H = 2000

The resulting alignments were processed by the axtChain program. AxtChain organizes all the alignments between a single chimp chromosome and a single human chromosome into a group and makes a kd-tree out of all the gapless subsections (blocks) of the alignments. 
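To make the scoring matrix given in the Methods above concrete, the following Python sketch stores it as a lookup table and sums the substitution scores over one gapless block. The helper name `block_score` is hypothetical, and the sketch assumes sequences containing only A, C, G, and T; it does not model the K, L, Y, and H parameters or any gap scoring.

```python
# Substitution scores from the matrix in the Methods section;
# rows and columns are in A, C, G, T order.
BASES = "ACGT"
MATRIX = [
    [100, -300, -150, -300],
    [-300, 100, -300, -150],
    [-150, -300, 100, -300],
    [-300, -150, -300, 100],
]
SCORE = {
    (a, b): MATRIX[i][j]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
}


def block_score(human: str, chimp: str) -> int:
    """Sum substitution scores over a gapless block (hypothetical helper)."""
    if len(human) != len(chimp):
        raise ValueError("a gapless block must pair sequences of equal length")
    return sum(SCORE[(h, c)] for h, c in zip(human.upper(), chimp.upper()))
```

In this matrix a match scores 100, a transition (A/G or C/T) scores -150, and a transversion scores -300, so a four-base block with one transition scores 3 * 100 - 150 = 150.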
The maximally-scoring chains of these blocks were found by running a dynamic program over the kd-tree. Chains scoring below a certain threshold were discarded. To place additional chimp scaffolds that weren't initially aligned by blastz, a DNA blat of the unmasked sequence was performed. The resulting blat alignments were also chained, and then merged with the blastz-based chains produced in the previous step to produce "all chains". These chains were sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. Due to the draft nature of this initial genome assembly, this net track (and the companion chain track) was generated using a "reciprocal best" strategy. This strategy attempts to minimize paralog fill-in for missing orthologous chimp sequence by filtering from the human net all sequences not found in the chimp side of the net. After generating the human alignment net, the subset of chains in the chimp-reference net was extracted and used for an additional netting step, which was then filtered for non-syntenic sequences smaller than 50 bases. Credits The chimp sequence used in this track was obtained from the 13 Nov. 2003 Arachne assembly. We'd like to thank the National Human Genome Research Institute (NHGRI), the Eli & Edythe L. Broad Institute at MIT/Harvard, and Washington University School of Medicine for providing this sequence. 
Blastz was developed at Pennsylvania State University by Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his program RepeatMasker. The browser display and database storage of the nets were made by Robert Baertsch and Jim Kent. References Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003). Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res. 13(1), 103-7 (2003). encodeEgaspFull EGASP Full ENCODE Gene Prediction Workshop (EGASP) All ENCODE Regions ENCODE Regions and Genes Description This track shows full sets of gene predictions covering all 44 ENCODE regions originally submitted for the ENCODE Gene Annotation Assessment Project (EGASP) Gene Prediction Workshop 2005. The following gene predictions are included: AceView DOGFISH-C Ensembl Exogean ExonHunter Fgenesh Pseudogenes Fgenesh++ GeneID-U12 GeneMark JIGSAW Pairagon/N-SCAN SGP2-U12 SPIDA Twinscan-MARS The EGASP Partial companion track shows original gene prediction submissions for a partial set of the 44 ENCODE regions; the EGASP Update track shows updated versions of the submitted predictions. These annotations were originally produced using the hg17 assembly. Display Conventions and Configuration Data for each gene prediction method within this composite annotation track are displayed in a separate subtrack. See the top of the track description page for configuration options allowing display of selected subsets of gene predictions. To remove a subtrack from the display, uncheck the appropriate box. The individual subtracks within this annotation follow the display conventions for gene prediction tracks. 
Display characteristics specific to individual subtracks are described in the Methods section. The track description page offers the option to color and label codons in a zoomed-in display of the subtracks to facilitate validation and comparison of gene predictions. To enable this feature, select the genomic codons option from the "Color track by codons" menu. Click the Help on codon coloring link for more information about this feature. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing the different gene prediction methods. Methods AceView These annotations were generated using AceView. All mRNAs and cDNAs available in GenBank, excluding NMs, were co-aligned on the Gencode sections. The results were then examined and filtered to resemble Havana. The very restrictive view of Havana on CDS was not reproduced, due to a lack of experimental data. DOGFISH-C Candidate splice sites and coding starts/stops were evaluated using DNA alignments between the human assembly and seven other vertebrate species (UCSC multiz alignments, adding the frog and removing the chimp). Genes (single transcripts only) were then predicted using dynamic programming. Ensembl The Ensembl annotation includes two types of predictions: protein-coding genes (the Ensembl Gene Predictions subtrack) and pseudogenes of protein-coding genes (the Ensembl Pseudogene Predictions subtrack). The Ensembl Pseudo track is not intended as a comprehensive annotation of pseudogenes, but rather an attempt to identify and label those gene predictions made by the Ensembl pipeline that have pseudogene characteristics. Exons that lie partially outside the ENCODE region are not included in the data set. The "Alternate Name" field on the subtrack details page shows the Ensembl ID for the selected gene or transcript. 
ExonHunter ExonHunter is a comprehensive gene-finder based on hidden Markov models (HMMs) allowing the use of a variety of additional sources of information (ESTs, proteins, genome-genome comparisons). Exogean Exogean annotates protein coding genes by combining mRNA and cross-species protein alignments in directed acyclic colored multigraphs where nodes and edges respectively represent biological objects and human expertise. Additional predictions and methods for this subtrack are available in the EGASP Updates track. Fgenesh Pseudogenes Fgenesh is an HMM gene structure prediction program. This data set shows predictions of potential pseudogenes. Fgenesh++ These gene predictions were generated by Fgenesh++, a gene-finding program that uses both HMMs and protein similarity to find genes in a completely automated manner. GeneID-U12 The GeneID-U12 gene prediction set, generated using a version of GeneID modified to detect U12-dependent introns (both GT-AG and AT-AC subtypes) when present, employs a single-genome ab initio method. This modified version of GeneID uses matrices for U12 donor, acceptor and branch sites constructed from examples of published U12 intron splice junctions (both experimentally confirmed and expressed-sequence-validated predictions). Two GeneID-U12 subtracks are included: GeneID Gene Predictions and GeneID U12 Intron Predictions. The U12 splice sites for features in the U12 Intron Predictions track are displayed on the track details pages. Additional predictions and methods for this subtrack are available in the EGASP Updates track. GeneMark The eukaryotic version of the GeneMark.hmm (release 2.2) gene prediction program utilizes the HMM statistical model with duration or hidden semi-Markov model (HSMM). The HMM includes hidden states for initial, internal and terminal exons, introns, intergenic regions and single exon genes. 
It also includes the "border" states, such as start site (initiation codon), stop site (termination codons), and donor and acceptor splice sites. Sequences of all protein-coding regions were modeled by three periodic inhomogeneous Markov chains; sequences of non-coding regions were modeled by homogeneous Markov chains. Nucleotide sequences corresponding to the site states were modeled by position-specific inhomogeneous Markov chains. Parameters of the gene models were derived from the set of genes obtained by cDNA mapping to genomic DNA. To reflect variations in G+C composition of the genome, the gene model parameters were estimated separately for the three G+C regions. JIGSAW JIGSAW uses the output from gene-finders, splice-site prediction programs and sequence alignments to predict gene models. Annotation data downloaded from the UCSC Genome Browser and TIGR gene-finder output was used as input for these predictions. JIGSAW predicts both partial and complete genes. Additional predictions and methods for this subtrack are available in the EGASP Updates track. Pairagon/N-SCAN The pairHMM-based alignment program, Pairagon, was used to align high-quality mRNA sequences to the ENCODE regions. These were supplemented with N-SCAN EST predictions which are displayed in the Pairgn/NSCAN-E subtrack, and extended further with additional transcripts from the Brent Lab to produce the predictions displayed as the Pairgn/NSCAN-E/+ subtrack. The NSCAN subtrack contains only predictions from the N-SCAN program. SGP2-U12 The SGP2-U12 gene prediction set, generated using a version of GeneID modified to detect U12-dependent introns (both AT-AC and GT-AG subtypes) when present, employs a dual-genome method (SGP2) that utilizes similarity (tblastx) to mouse genomic sequence syntenic to the ENCODE regions (Oct. 2004 MSA freeze). 
This modified version of GeneID uses matrices for U12 donor, acceptor and branch sites constructed from examples of published U12 intron splice junctions (both experimentally confirmed and expressed-sequence-validated predictions). Two SGP2-U12 subtracks are included: SGP2 Gene Predictions and SGP2 U12 Intron Predictions. The U12 splice sites for features in the U12 Intron Predictions track are displayed on the track details pages. Additional predictions and methods for this subtrack are available in the EGASP Updates track. SPIDA This exon-only prediction set was produced using SPIDA (Substitution Periodicity Index and Domain Analysis). Exons derived by mapping ESTs to the genome were validated by seeking periodic substitution patterns in the aligned informant DNA sequences. First, all available ESTs were mapped to the genome using Exonerate. The resulting transcript structures were "flattened" to remove redundancy. Each exon of the flattened transcripts was subjected to SPI analysis, which involves identifying periodicity in the pattern of mutations occurring between the human and an informant species DNA sequence (the informant sequences and their TBA alignments were provided by Elliott Margulies). SPI was calculated for all available human-informant pairs for whole exons and in a sliding 48 bp window. SPI analysis requires that a threshold level of periodicity be identified in at least two of the informant species if the exon is to be accepted. If accepted, SPI provides the correct frame for translation of the exon. This exon was used as a starting point for extending the ORF coding region of the flattened transcript from which it came. This gave a full or partial CDS; different exons may give different CDSs. The CDSs were translated and searched for domains using hmmpfam and Pfam_fs. Only transcripts with a domain hit with an E-value below 1.0 were retained. 
Heuristics were applied to the retained CDSs to identify problems with the transcript structure, particularly frame-shifts. Many transcripts may identify the same exon, but only a single instance of each exon has been retained. Twinscan-MARS This gene prediction set was produced by a version of Twinscan that employs multiple pairwise genome comparisons to identify protein-coding genes (including alternative splices) using nucleotide homology information. No expression or protein data were used. Credits The following individuals and institutions provided the data for the subtracks in this annotation: AceView: Danielle and Jean Thierry-Mieg, NCBI, National Institutes of Health. DOGFISH-C: David Carter, Informatics Dept., Wellcome Trust Sanger Institute. Ensembl: Stephen Searle, Wellcome Trust Sanger Institute (joint Sanger/EBI project). Exogean: Sarah Djebali, Dyogen Lab, Ecole Normale Supérieure (Paris, France). ExonHunter: Tomas Vinar, Waterloo Bioinformatics, School of Computer Science, University of Waterloo. Fgenesh, Fgenesh++: Victor Solovyev, Department of Computer Science, Royal Holloway, London University. GeneID-U12, SGP2-U12: Tyler Alioto, Grup de Recerca en Informàtica Biomèdica (GRIB) at the Institut Municipal d'Investigació Mèdica (IMIM), Barcelona. GeneMark: Mark Borodovsky, Alex Lomsadze and Alexander Lukashin, Department of Biology, Georgia Institute of Technology. JIGSAW: Jonathan Allen, Steven Salzberg group, The Institute for Genomic Research (TIGR) and the Center for Bioinformatics and Computational Biology (CBCB) at the University of Maryland, College Park. Pairagon/N-SCAN: Randall Brown, Laboratory for Computational Genomics, Washington University in St. Louis. SPIDA: Damian Keefe, Birney Group, EMBL-EBI. Twinscan: Paul Flicek, Brent Lab, Washington University in St. Louis. 
encodeEgaspSuper EGASP ENCODE Gene Prediction Workshop (EGASP) ENCODE Regions and Genes Overview This super-track combines related tracks from the ENCODE Gene Annotation Assessment Project (EGASP) 2005 Gene Prediction Workshop. The goal of the workshop was to evaluate automatic methods for gene annotation of the human genome, with a focus on protein-coding genes. Predictions were evaluated in terms of their ability to reproduce the high-quality manually assisted GENCODE gene annotations and to predict novel transcripts. The EGASP Full track shows gene predictions covering all 44 ENCODE regions submitted before the GENCODE annotations were released. The EGASP Partial track shows gene predictions that cover some of the ENCODE regions, submitted before the GENCODE release. The EGASP Update track shows gene predictions that cover all ENCODE regions, submitted after the GENCODE release. These annotations were originally produced using the hg17 assembly. The following gene predictions are included: ACEScan AceView DOGFISH-C Ensembl Exogean ExonHunter Fgenesh Pseudogenes Fgenesh++ GeneID-U12 GeneMark GeneZilla JIGSAW Pairagon/N-SCAN SAGA SGP2-U12 SPIDA Twinscan-MARS Yale pseudogenes Credits Click here for a complete list of people who participated in the GENCODE project. The following individuals and institutions provided the data for the subtracks in this annotation: AceView: Danielle and Jean Thierry-Mieg, NCBI, National Institutes of Health. DOGFISH-C: David Carter, Informatics Dept., Wellcome Trust Sanger Institute. Ensembl: Stephen Searle, Wellcome Trust Sanger Institute (joint Sanger/EBI project). Exogean: Sarah Djebali, Dyogen Lab, Ecole Normale Supérieure (Paris, France). ExonHunter: Tomas Vinar, Waterloo Bioinformatics, School of Computer Science, University of Waterloo. Fgenesh, Fgenesh++: Victor Solovyev, Department of Computer Science, Royal Holloway, London University. 
GeneID-U12, SGP2-U12: Tyler Alioto, Grup de Recerca en Informàtica Biomèdica (GRIB) at the Institut Municipal d'Investigació Mèdica (IMIM), Barcelona. GeneMark: Mark Borodovsky, Alex Lomsadze and Alexander Lukashin, Department of Biology, Georgia Institute of Technology. JIGSAW: Jonathan Allen, Steven Salzberg group, The Institute for Genomic Research (TIGR) and the Center for Bioinformatics and Computational Biology (CBCB) at the University of Maryland, College Park. Pairagon/N-SCAN: Randall Brown, Laboratory for Computational Genomics, Washington University in St. Louis. SPIDA: Damian Keefe, Birney Group, EMBL-EBI. Twinscan: Paul Flicek, Brent Lab, Washington University in St. Louis. ACEScan: Gene Yeo, Crick-Jacobs Center for Computational Biology, Salk Institute. Augustus: Mario Stanke, Department of Bioinformatics, University of Göttingen, Germany. GeneZilla: William Majoros, Dept. of Bioinformatics, The Institute for Genomic Research (TIGR). SAGA: Sourav Chatterji, Lior Pachter lab, Department of Mathematics, U.C. Berkeley. References Ashurst JL, Chen CK, Gilbert JG, Jekosch K, Keenan S, Meidl P, Searle SM, Stalker J, Storey R, Trevanion S et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D459-65. Guigo R, Dermitzakis ET, Agarwal P, Ponting CP, Parra G, Reymond A, Abril JF, Keibler E, Lyle R, Ucla C et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A. 2003 Feb 4;100(3):1140-5. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002 Dec 5;420(6915):520-62. Reymond A, Marigo V, Yaylaoglu MB, Leoni A, Ucla C, Scamuffa N, Caccioppoli C, Dermitzakis ET, Lyle R, Banfi S et al. Human chromosome 21 gene expression atlas in the mouse. Nature. 2002 Dec 5;420(6915):582-6. 
Reymond A, Camargo AA, Deutsch S, Stevenson BJ, Parmigiani RB, Ucla C, Bettoni F, Rossier C, Lyle R, Guipponi M et al. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics. 2002 Jun;79(6):824-32. Chatterji S, Pachter L. Multiple organism gene finding by collapsed Gibbs sampling. J Comput Biol. 2005 Jul-Aug;12(6):599-608. Siepel A, Haussler D. Computational identification of evolutionarily conserved exons. Proc. 8th Int'l Conf. on Research in Computational Molecular Biology. 2004;177-186. Augustus Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003;19(Suppl. 2):ii215-ii225. Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W309-12. FGenesh++ Solovyev VV. "Statistical approaches in Eukaryotic gene prediction". In Handbook of Statistical Genetics (eds. Balding D et al.) (John Wiley & Sons, Inc., 2001). p. 83-127. GeneID Blanco E, Parra G, Guigó R. "Using geneid to identify genes". In Current Protocols in Bioinformatics, Unit 4.3. (eds. Baxevanis AD.) (John Wiley & Sons, Inc., 2002). Guigó R. Assembling genes from predicted exons in linear time with dynamic programming. J Comput Biol. 1998 Winter;5(4):681-702. Guigó R, Knudsen S, Drake N, Smith T. Prediction of gene structure. J Mol Biol. 1992 Jul 5;226(1):141-57. Parra G, Blanco E, Guigó R. GeneID in Drosophila. Genome Res. 2000 Apr;10(4):511-5. JIGSAW Allen JE, Pertea M, Salzberg SL. Computational gene prediction using multiple sources of evidence. Genome Res. 2004 Jan;14(1):142-8. Allen JE, Salzberg SL. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics. 2005 Sep 15;21(18):3596-603. SGP2 Guigó R, Dermitzakis ET, Agarwal P, Ponting CP, Parra G, Reymond A, Abril JF, Keibler E, Lyle R, Ucla C et al. 
Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A. 2003 Feb 4;100(3):1140-5. Parra G, Agarwal P, Abril JF, Wiehe T, Fickett JW, Guigó R. Comparative gene prediction in human and mouse. Genome Res. 2003 Jan;13(1):108-17. encodeEgaspFullTwinscan Twinscan Twinscan Gene Predictions ENCODE Regions and Genes encodeEgaspFullSpida SPIDA Exons SPIDA Exon Predictions ENCODE Regions and Genes encodeEgaspFullSgp2U12 SGP2 U12 SGP2 U12 Intron Predictions ENCODE Regions and Genes encodeEgaspFullSgp2 SGP2 SGP2 Gene Predictions ENCODE Regions and Genes encodeEgaspFullPairagonMultiple NSCAN N-SCAN Gene Predictions ENCODE Regions and Genes encodeEgaspFullPairagonAny Pairgn/NSCAN-E/+ Pairagon/NSCAN Any Evidence Gene Predictions ENCODE Regions and Genes encodeEgaspFullPairagonMrna Pairgn/NSCAN-E Pairagon/NSCAN-EST Gene Predictions ENCODE Regions and Genes encodeEgaspFullJigsaw Jigsaw Jigsaw Gene Predictions ENCODE Regions and Genes encodeEgaspFullGenemark GeneMark GeneMark Gene Predictions ENCODE Regions and Genes encodeEgaspFullGeneIdU12 GeneID U12 GeneID U12 Intron Predictions ENCODE Regions and Genes encodeEgaspFullGeneId GeneID GeneID Gene Predictions ENCODE Regions and Genes encodeEgaspFullSoftberryPseudo Fgenesh Pseudo Fgenesh Pseudogene Predictions ENCODE Regions and Genes encodeEgaspFullFgenesh Fgenesh++ Fgenesh++ Gene Predictions ENCODE Regions and Genes encodeEgaspFullExonhunter ExonHunter ExonHunter Gene Predictions ENCODE Regions and Genes encodeEgaspFullExogean Exogean Exogean Gene Predictions ENCODE Regions and Genes encodeEgaspFullEnsemblPseudo Ensembl Pseudo Ensembl Pseudogene Predictions ENCODE Regions and Genes encodeEgaspFullEnsembl Ensembl Ensembl Gene Predictions ENCODE Regions and Genes encodeEgaspFullDogfish DOGFISH-C DOGFISH-C Gene Predictions ENCODE Regions and Genes encodeEgaspFullAceview AceView AceView Gene Predictions ENCODE Regions and Genes encodeEgaspPartial 
EGASP Partial ENCODE Gene Prediction Workshop (EGASP) for Partial ENCODE Regions ENCODE Regions and Genes Description This track shows gene predictions submitted for the ENCODE Gene Annotation Assessment Project (EGASP) Gene Prediction Workshop 2005 that cover only a partial set of the 44 ENCODE regions. The partial set excludes the 13 ENCODE regions for which high-quality annotations were released in late 2004. The following gene predictions are included: ACEScan Augustus GeneZilla SAGA The EGASP Full companion track shows original gene prediction submissions for the full set of 44 ENCODE regions using Gene Prediction algorithms other than those used here; the EGASP Update track shows updated versions of some of the submitted predictions. Display Conventions and Configuration Data for each gene prediction method within this composite annotation track is displayed in a separate subtrack. See the top of the track description page for a complete list of the subtracks available for this annotation. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The individual subtracks within this annotation follow the display conventions for gene prediction tracks. The track description page offers the option to color and label codons in a zoomed-in display of the subtracks to facilitate validation and comparison of gene predictions. To enable this feature, select the genomic codons option from the "Color track by codons" menu. Click the Help on codon coloring link for more information about this feature. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing the different gene prediction methods. Methods ACEScan ACEScan (Alternative Conserved Exons Scan) indicates alternative splicing that is evolutionarily conserved in human and mouse/rat. The Conserved Alternative Exon Predictions subtrack shows predicted alternative conserved exons. 
The Unconserved Alternative and Constitutive Exon Predictions subtrack shows exons that are predicted to be constitutive or may have species-specific alternative splicing. Augustus Augustus uses a generalized hidden Markov model (GHMM) that models coding and non-coding sequence, splice sites, the branch point region, translation start and end, and lengths of exons and introns. The track contains four different sets of predictions. Ab initio single genome predictions are based solely on the input sequence. EST and protein evidence predictions were generated using AGRIPPA hints based on alignments of human sequence from the dbEST and nr databases. Mouse homology gene predictions were produced using mouse genomic sequence only; BLAST, CHAOS, DIALIGN were used to generate the hints for Augustus. The combined EST/protein evidence and mouse homology gene predictions were created using human sequence from the dbEST and nr databases and mouse genomic sequence to generate hints for Augustus. Additional predictions and methods for this subtrack are available in the EGASP Updates track. GeneZilla GeneZilla is a program for the computational prediction of protein-coding genes in eukaryotic DNA, based on the generalized hidden Markov model (GHMM) framework. These predictions were generated using GeneZilla and IsoScan, which uses a four-state hidden Markov model to predict isochores (regions of homogeneous G+C content) in genomic DNA. SAGA SAGA is an ab initio multiple-species gene-finding program based on the Gibbs sampling-based method described in Chatterji et al. (2004). In addition to sampling parameters, SAGA also uses a phyloHMM based model to boost the scores, similar to the method described in Siepel et al. (2004). Credits The gene prediction data sets were submitted by the following individuals and institutions: ACEScan: Gene Yeo, Crick-Jacobs Center for Computational Biology, Salk Institute. 
Augustus: Mario Stanke, Department of Bioinformatics, University of Göttingen, Germany. GeneZilla: William Majoros, Dept. of Bioinformatics, The Institute for Genomic Research (TIGR). SAGA: Sourav Chatterji, Lior Pachter lab, Department of Mathematics, U.C. Berkeley. References Chatterji, S. and Pachter, L. Multiple organism gene finding by collapsed Gibbs sampling. Proc. 8th Int'l Conf. on Research in Computational Molecular Biology, 187-193 (2004). Siepel, A. and Haussler, D. Computational identification of evolutionarily conserved exons. Proc. 8th Int'l Conf. on Research in Computational Molecular Biology, 177-186 (2004). encodeEgaspPartSaga SAGA SAGA Gene Predictions ENCODE Regions and Genes encodeEgaspPartGenezilla GeneZilla GeneZilla Gene Predictions ENCODE Regions and Genes encodeEgaspPartAugustusAny Augustus/EST/Mouse Augustus + EST/Protein Evidence + Mouse Homology Gene Predictions ENCODE Regions and Genes encodeEgaspPartAugustusDual Augustus/Mouse Augustus + Mouse Homology Gene Predictions ENCODE Regions and Genes encodeEgaspPartAugustusEst Augustus/EST Augustus + EST/Protein Evidence Gene Predictions ENCODE Regions and Genes encodeEgaspPartAugustusAbinitio Augustus Augustus Ab Initio Gene Predictions ENCODE Regions and Genes encodeEgaspPartAceOther ACEScan Other ACEScan Unconserved Alternative and Constitutive Exon Predictions ENCODE Regions and Genes encodeEgaspPartAceCons ACEScan Cons Alt ACEScan Conserved Alternative Exon Predictions ENCODE Regions and Genes encodeEgaspUpdate EGASP Update ENCODE Gene Prediction Workshop (EGASP) Updates ENCODE Regions and Genes Description This track shows updated versions of gene predictions submitted for the ENCODE Gene Annotation Assessment Project (EGASP) Gene Prediction Workshop 2005. The following gene predictions are included: Augustus Exogean FGenesh++ GeneID-U12 Jigsaw SGP2-U12 Yale pseudogenes The original EGASP submissions are displayed in the companion tracks, EGASP Full and EGASP Partial. 
Display Conventions and Configuration Data for each gene prediction method within this composite annotation track are displayed in separate subtracks. See the top of the track description page for a complete list of the subtracks available for this annotation. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The individual subtracks within this annotation follow the display conventions for gene prediction tracks. Display characteristics specific to individual subtracks are described in the Methods section. The track description page offers the option to color and label codons in a zoomed-in display of the subtracks to facilitate validation and comparison of gene predictions. To enable this feature, select the genomic codons option from the "Color track by codons" menu. Click the Help on codon coloring link for more information about this feature. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing the different gene prediction methods. Methods Augustus Augustus uses a generalized hidden Markov model (GHMM) that models coding and non-coding sequence, splice sites, the branch point region, the translation start and end, and the lengths of exons and introns. This version has been trained on a set of 1284 human genes. The track contains four sets of predictions: ab initio, EST and protein-based, mouse homology-based, and those using EST/protein and mouse homology evidence as additional input to Augustus for the predictions. The EST and protein evidence was generated by aligning sequences from the dbEST and nr databases to the ENCODE region using wublastn and wublastx. The resulting alignments were used to generate hints about putative splice sites, exons, coding regions, introns, translation start and translation stop. The mouse homology evidence was generated by aligning pairs of human and mouse genomic sequences using the program DIALIGN. 
Regions conserved at the peptide level were used to generate hints about coding regions. Exogean Exogean produces alternative transcripts by combining mRNA and cross-species sequence alignments using heuristic rules. The program implements a generic framework based on directed acyclic colored multigraphs (DACMs). In Exogean, DACM nodes represent biological objects (mRNA or protein HSPs/transcripts) and multiple edges between nodes represent known relationships between these objects derived from human expertise. Exogean DACMs are successively built and reduced, leading to increasingly complex objects. This process enables the production of alternative transcripts from initial HSPs. FGenesh++ FGenesh++ predictions are based on hidden Markov models and protein similarity to the NR database. For more information, see the reference below. GeneID-U12 The GeneID program uses a hierarchical design to predict genes in anonymous genomic sequences. In the first step, splice sites, start codons and stop codons are predicted and scored along the sequence using position weight arrays (PWAs). Next, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites plus the log-likelihood ratio of a Markov model for coding DNA. Finally, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. The modified version of GeneID used to generate the predictions in this track incorporates models for U12-dependent splice signals in addition to U2 splice signals. The GeneID subtrack shows all GeneID genes. Only U12 introns and their flanking exons are displayed in the GeneID U12 subtrack. Exons flanking predicted U12-dependent introns are assigned a type attribute reflecting their splice sites, displayed on the details page of the GeneID U12 subtrack as the "Alternate Name" of the item composed of the intron plus flanking exons.
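The final GeneID step described above (assembling a gene structure that maximizes the summed exon score) can be illustrated with a simplified dynamic-programming sketch. This is not the GeneID implementation: the Exon type and the assemble_exons function are hypothetical names, exon scores are taken as given, and the reading-frame and gene-model compatibility rules that the real program enforces are ignored. The sketch only selects a non-overlapping chain of exons with maximal total score.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int    # 0-based, inclusive
    end: int      # exclusive
    score: float  # assumed precomputed (site scores + coding log-likelihood ratio)


def assemble_exons(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Choose non-overlapping exons whose scores sum to the maximum."""
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)    # best[i]: best total using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is used in best[i]
    prev = [0] * (n + 1)      # prev[i]: how many earlier exons are compatible
    for i, exon in enumerate(ordered, start=1):
        # Count earlier exons that end at or before this exon's start.
        p = bisect_right(ends, exon.start, 0, i - 1)
        prev[i] = p
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i], take[i] = with_exon, True
        else:
            best[i] = best[i - 1]
    # Trace back the chosen chain from the end of the table.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


candidates = [Exon(0, 10, 2.0), Exon(5, 15, 5.0), Exon(15, 25, 3.0), Exon(8, 20, 4.0)]
total, chain = assemble_exons(candidates)
print(total, chain)
```

Sorting by end coordinate lets each exon find its compatible predecessors with a binary search, so the sketch runs in O(n log n) time. Exons with negative scores are never selected because skipping them always scores at least as well.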
Jigsaw Jigsaw is a gene prediction program that determines genes based on target genomic sequence and output from a gene structure annotation database. Data downloaded from UCSC's annotation database is used as input and includes the following tracks of evidence: Known Genes, Ensembl, RefSeq, GeneID, Genscan, SGP, Twinscan, Human mRNAs, TIGR Gene Index, UniGene, Most Conserved Elements and Non-human RefSeq Genes. GlimmerHMM and GeneZilla, two open source ab initio gene-finding programs based on GHMMs, are also used. SGP2-U12 To predict genes in a genomic query, SGP2 combines GeneID predictions with tblastx comparisons of the genomic query against other genomic sequences. This modified version of SGP2 uses models for U12-dependent splice signals in addition to U2 splice signals. The reference genomic sequence for this data set is the Oct. 2004 release of mouse sequence syntenic to ENCODE regions. The SGP2 and SGP2 U12 tracks follow the same display conventions as the GeneID and GeneID U12 subtracks described above. Yale Pseudogenes For this analysis, pseudogenes were defined as genomic sequences similar to known human genes and with various disablements (premature stop codons or frameshifts) in their "putative" protein-coding regions. The protein sequences of known human genes (as annotated by ENSEMBL) were used to search for similar nongenic sequences in ENCODE regions. The matching sequences were assessed as disabled copies of genes based on the occurrences of premature stop codons or frameshifts. The intron-exon structure of the functional gene was further used to infer whether a pseudogene was duplicated or processed (a duplicated pseudogene keeps the intron-exon structure of its parent functional gene). Small pseudogene sequences were labeled as fragments or other types. All pseudogenes in this track were manually curated. In the browser, the track details page shows the pseudogene type. 
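The two kinds of disablement used in the Yale pseudogene definition above, premature stop codons and frameshifts, can be illustrated with a minimal sketch. It assumes a candidate nucleotide sequence that is already in the reading frame of its parent gene, plus a list of indel lengths taken from some alignment to the parent. The function names and input layout are hypothetical, and the sketch does not reproduce the protein-based search or the manual curation used to build the track.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(seq, frame=0):
    """Return 0-based start positions of in-frame stop codons,
    excluding a stop in the last complete codon (the normal terminator)."""
    seq = seq.upper()
    n_codons = (len(seq) - frame) // 3
    if n_codons <= 0:
        return []
    last = frame + 3 * (n_codons - 1)
    return [
        i
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS and i != last
    ]


def frameshifts(indel_lengths):
    """Return the indel lengths that are not multiples of 3 and so shift the frame."""
    return [n for n in indel_lengths if n % 3 != 0]


def is_disabled(seq, indel_lengths=(), frame=0):
    """True if the candidate shows a premature stop codon or a frameshift."""
    return bool(premature_stops(seq, frame) or frameshifts(indel_lengths))


print(premature_stops("ATGTAAGGGTGA"))   # stop codon TAA at position 3
print(is_disabled("ATGGGGTGA", [3]))     # in-frame indel, no premature stop
```

A real analysis would also need to handle ambiguous bases and partial matches. The sketch only makes the definition of "disablement" concrete.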
Credits Augustus was written by Mario Stanke at the Department of Bioinformatics of the University of Göttingen in Germany. Exogean was developed by Sarah Djebali and Hugues Roest Crollius from the Dyogen Lab, Ecole Normale Supérieure (Paris, France) and Franck Delaplace from the Laboratoire de Méthodes Informatiques (LaMI), (Evry, France). The FGenesh++ gene predictions were provided by Victor Solovyev of Softberry Inc. The GeneID-U12 and SGP2-U12 programs were developed by the Grup de Recerca en Informàtica Biomèdica (GRIB) at the Institut Municipal d'Investigació Mèdica (IMIM) in Barcelona. The version of GeneID on which GeneID-U12 is based (geneid_v1.2) was written by Enrique Blanco and Roderic Guigó. The parameter files were constructed by Genis Parra and Francisco Camara. Additional contributions were made by Josep F. Abril, Moises Burset and Xavier Messeguer. Modifications to GeneID that allow for the prediction of U12-dependent splice sites and incorporation of U12 introns into gene models were made by Tyler Alioto. Jigsaw was developed at The Institute for Genomic Research (TIGR) by Jonathan Allen and Steven Salzberg, with computational gene-finder contributions from Mihaela Pertea and William Majoros. Continued maintenance and development of Jigsaw will be provided by the Salzberg group at the Center for Bioinformatics and Computational Biology (CBCB) at the University of Maryland, College Park. The Yale Pseudogenes were generated by the pseudogene annotation group of Mark Gerstein at Yale University. References Augustus Stanke, M. Gene prediction with a hidden Markov model. Ph.D. thesis, Universität Göttingen, Germany (2004). Stanke, M. and Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19(Suppl. 2), ii215-ii225 (2003). Stanke, M., Steinkamp, R., Waack, S. and Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucl. Acids Res., 32, W309-W312 (2004). FGenesh++ Solovyev V.V. 
"Statistical approaches in Eukaryotic gene prediction". In Handbook of Statistical Genetics (eds. Balding D. et al.) (John Wiley & Sons, Inc., 2001). p. 83-127. GeneID Blanco, E., Parra, G. and Guigó, R. "Using geneid to identify genes". In Current Protocols in Bioinformatics, Unit 4.3. (ed. Baxevanis, A.D.) (John Wiley & Sons, Inc., 2002). Guigó, R. Assembling genes from predicted exons in linear time with dynamic programming. J Comput Biol. 5(4), 681-702 (1998). Guigó, R., Knudsen, S., Drake, N. and Smith, T. Prediction of gene structure. J Mol Biol. 226(1), 141-57 (1992). Parra, G., Blanco, E. and Guigó, R. GeneID in Drosophila. Genome Research 10(4), 511-515 (2000). Jigsaw Allen, J.E., Pertea, M. and Salzberg, S.L. Computational gene prediction using multiple sources of evidence. Genome Res., 14(1), 142-8 (2004). Allen, J.E. and Salzberg, S.L. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18), 3596-3603 (2005). SGP2 Guigó, R., Dermitzakis, E.T., Agarwal, P., Ponting, C.P., Parra, G., Reymond, A., Abril, J.F., Keibler, E., Lyle, R., Ucla, C. et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 100(3), 1140-5 (2003). Parra, G., Agarwal, P., Abril, J.F., Wiehe, T., Fickett, J.W. and Guigó, R. Comparative gene prediction in human and mouse. Genome Res. 13(1), 108-17 (2003). 
encodeEgaspUpdYalePseudo Yale Pseudo Upd Yale Pseudogene Predictions ENCODE Regions and Genes encodeEgaspUpdSgp2U12 SGP2 U12 Update SGP2 U12 Intron Predictions ENCODE Regions and Genes encodeEgaspUpdSgp2 SGP2 Update SGP2 Gene Predictions ENCODE Regions and Genes encodeEgaspUpdJigsaw Jigsaw Update Jigsaw Gene Predictions ENCODE Regions and Genes encodeEgaspUpdGeneIdU12 GeneID U12 Upd GeneID U12 Intron Predictions ENCODE Regions and Genes encodeEgaspUpdGeneId GeneID Update GeneID Gene Predictions ENCODE Regions and Genes encodeEgaspUpdExogean Exogean Update Exogean Gene Predictions ENCODE Regions and Genes encodeEgaspUpdAugustusAny August/EST/Ms Upd Augustus + EST/Protein Evidence + Mouse Homology Gene Predictions ENCODE Regions and Genes encodeEgaspUpdAugustusDual August/Mouse Upd Augustus + Mouse Homology Gene Predictions ENCODE Regions and Genes encodeEgaspUpdAugustusEst Augustus/EST Upd Augustus + EST/Protein Evidence Gene Predictions ENCODE Regions and Genes encodeEgaspUpdAugustusAbinitio Augustus Update Augustus Ab Initio Gene Predictions ENCODE Regions and Genes encodeYaleMASPlacRNATransMap Yale MAS RNA Yale Maskless Array Synthesizer, RNA Transcript Map ENCODE Transcript Levels Description This track shows the forward (+) and reverse (-) strand transcript map of intensity scores (estimating RNA abundance) for human NB4 cell total RNA, and human placental Poly(A)+ RNA, hybridized to the Yale MAS (Maskless Array Synthesizer) ENCODE oligonucleotide microarray, transcription mapping design #1. This array has 36-mer oligonucleotide probes approximately every 36 bp (i.e. end-to-end) covering all the non-repetitive DNA sequence of the ENCODE regions ENm001-ENm012. See NCBI GEO GPL2105 for details of this array design. This transcript map is a combined signal from three biological replicates, each with at least two technical replicates. Arrays were hybridized using either the standard Nimblegen protocol or the protocol described in Bertone et al. (2004). 
The label of each subtrack in this annotation indicates the specific protocol used for that particular data set. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods A score was assigned to each oligonucleotide probe position by combining two or more technical replicates and by using a sliding window approach. Within a sliding window of 160 bp (corresponding to 5 oligos), the hybridization intensities for all replicates of each oligonucleotide probe were compared to their respective array median score. Within the window and across all the replicates, the number of probes above and below their respective median was counted. Using the sign test, a one-sided P-value was then calculated and a score defined as score=-log(P-value) was assigned to the oligo in the center of the window. Three independent biological replicates were generated and each was hybridized to at least 2 different arrays (technical replicates). Verification Reasonable correlation coefficients between replicates were ensured. Additionally, transcribed regions (TARs/transfrags) were called and compared between technical and biological replicates to ensure significant overlap. Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. References Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M. et al. 
Global identification of human transcribed sequences with genome tiling arrays. Science 306(5705), 2242-6 (2004). Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308(5725), 1149-54 (2005). Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P. and Gingeras, T.R. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-9 (2002). Kluger, Y., Tuck, D.P., Chang, J.T., Nakayama, Y., Poddar, R., Kohya, N., Lian, Z., Ben Nasr, A., Halaban, H.R. et al. Lineage specificity of gene expression patterns. Proc Natl Acad Sci U S A 101(17), 6508-13 (2004). Rinn, J.L., Euskirchen, G., Bertone, P., Martone, R., Luscombe, N.M., Hartman, S., Harrison, P.M., Nelson, F.K., Miller, P. et al. The transcriptional activity of human Chromosome 22. Genes Dev 17(4), 529-40 (2003). encodeYaleRnaSuper Yale RNA Yale RNA (Neutrophil, Placenta and NB4 cells) ENCODE Transcript Levels Overview This super-track combines related tracks from Yale Transcript Map analysis. These tracks contain transcriptome data from different cell lines and biological samples as well as analysis of transcriptionally active regions (TARs). Experiments were performed with Yale MAS (Maskless Array Synthesizer) ENCODE oligonucleotide microarray (see NCBI GEO GPL2105 for details of this array design) as well as the Affymetrix ENCODE oligonucleotide microarray. Multiple biological samples were assayed, such as total RNA from human NB4 cells. Experiments also included chemical treatments such as retinoic acid (RA) treatments. Credits Yale MAS RNA, Yale MAS TAR These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. 
Yale RNA, Yale TAR These data were generated and analyzed by the Yale/Affymetrix collaboration among the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University and Tom Gingeras at Affymetrix. Yale RACE These data were generated and analyzed by the lab of Mark Gerstein at Yale University. References Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M et al. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004 Dec 24;306(5705):2242-6. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science. 2005 May 20;308(5725):1149-54. Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR. Large-scale transcriptional activity in chromosomes 21 and 22. Science. 2002 May 3;296(5569):916-9. Kluger Y, Tuck DP, Chang JT, Nakayama Y, Poddar R, Kohya N, Lian Z, Ben Nasr A, Halaban HR et al. Lineage specificity of gene expression patterns. Proc Natl Acad Sci U S A. 2004 Apr 27;101(17):6508-13. Rinn JL, Euskirchen G, Bertone P, Martone R, Luscombe NM, Hartman S, Harrison PM, Nelson FK, Miller P et al. The transcriptional activity of human Chromosome 22. Genes Dev. 2003 Feb 15;17(4):529-40. 
encodeYaleMASPlacRNATransMapRevMless36mer36bp Yale Plc BtR RNA Yale Placenta RNA Trans Map, MAS Array, Reverse Direction, Bertone Protocol ENCODE Transcript Levels encodeYaleMASPlacRNATransMapFwdMless36mer36bp Yale Plc BtF RNA Yale Placenta RNA TransMap, MAS array, Forward Direction, Bertone Protocol ENCODE Transcript Levels encodeYaleMASPlacRNANprotTMREVMless36mer36bp Yale Plc NgR RNA Yale Placenta RNA Trans Map, MAS Array, Reverse Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASPlacRNANprotTMFWDMless36mer36bp Yale Plc NgF RNA Yale Placenta RNA Trans Map, MAS Array, Forward Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASNB4RNANprotTMREVMless36mer36bp Yale NB4 NgR RNA Yale NB4 RNA Trans Map, MAS Array, Reverse Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASNB4RNANprotTMFWDMless36mer36bp Yale NB4 NgF RNA Yale NB4 RNA Trans Map, MAS Array, Forward Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASPlacRNATars Yale MAS TAR Yale Maskless Array Synthesizer, RNA Transcriptionally Active Regions ENCODE Transcript Levels Description This track shows the locations of forward (+) and reverse (-) strand transcriptionally-active regions (TARs)/transcribed fragments (transfrags), for human NB4 cell total RNA and for human placenta Poly(A)+ RNA, hybridized to the Yale Maskless Array Synthesizer (MAS) ENCODE oligonucleotide microarray, transcription mapping design #1. This array has 36-mer oligonucleotide probes approximately every 36 bp (i.e. end-to-end) covering all the non-repetitive DNA sequence of the ENCODE regions ENm001 - ENm012. See NCBI GEO accession GPL2105 for details of this array design. These TARs/transfrags are based on a transcript map combining hybridization intensities from three biological replicates, each with at least two technical replicates. Arrays were hybridized using either Nimblegen standard protocol, or the protocol described in Bertone et al. (2004). 
The label of each subtrack in this annotation indicates the specific protocol used for that particular data set. Methods A score was assigned to each oligonucleotide probe position by combining two or more technical replicates and by using a sliding window approach. Within a sliding window of 160 bp (corresponding to 5 oligos), the hybridization intensities for all replicates of each oligonucleotide probe were compared to their respective array median intensity. Within the window and across all the replicates, the number of probes above and below their respective median was counted. Using the sign test, a one-sided P-value was then calculated and a score defined as score=-log(P-value) was assigned to the oligo in the center of the window. Three independent biological replicates were generated, and each was hybridized to at least two different arrays (technical replicates). Transcribed regions (TARs/transfrags) were then identified using a score threshold at the 95th percentile as well as a maximum gap of 80 bp and a minimum run of 50 bp (between oligonucleotide positions), effectively allowing a gap of one oligo and requiring each TAR/transfrag to encompass at least 3 oligos. Verification Transcribed regions (TARs/transfrags), as determined by individual biological samples, were compared to ensure significant overlap. Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. References Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR, Large-scale transcriptional activity in chromosomes 21 and 22, Science. 2002 May 3;296(5569):916-9. Rinn JL, Euskirchen G, Bertone P, Martone R, Luscombe NM, Hartman S, Harrison PM, Nelson FK, Miller P, Gerstein M, Weissman S, Snyder M, The transcriptional activity of human Chromosome 22, Genes Dev, 2003 Feb 15;17(4):529-40. 
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M, Global identification of human transcribed sequences with genome tiling arrays, Science. 2004 Dec 24;306(5705):2242-6. Epub 2004 Nov 11. Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, Sementchenko V, Piccolboni A, Bekiranov S, Bailey DK, Ganesh M, Ghosh S, Bell I, Gerhard DS, Gingeras TR, Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution, Science. 2005 May 20;308(5725):1149-54. Epub 2005 Mar 24. encodeYaleMASPlacRNATarsRevMless36mer36bp Yale Plc BtR TAR Yale Placenta RNA TARs, MAS array, Reverse Direction, Bertone Protocol ENCODE Transcript Levels encodeYaleMASPlacRNATarsFwdMless36mer36bp Yale Plc BtF TAR Yale Placenta RNA TARs, MAS array, Forward Direction, Bertone Protocol ENCODE Transcript Levels encodeYaleMASPlacRNANprotTarsREVMless36mer36bp Yale Plc NgR TAR Yale Placenta RNA TARs, MAS array, Reverse Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASPlacRNANprotTarsFWDMless36mer36bp Yale Plc NgF TAR Yale Placenta RNA TARs, MAS array, Forward Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASNB4RNANProtTarsREVMless36mer36bp Yale NB4 NgR TAR Yale NB4 RNA TARs, MAS array, Reverse Direction, NimbleGen Protocol ENCODE Transcript Levels encodeYaleMASNB4RNANProtTarsFWDMless36mer36bp Yale NB4 NgF TAR Yale NB4 RNA TARs, MAS array, Forward Direction, NimbleGen Protocol ENCODE Transcript Levels encodeAffyEncode25bpProbes Affy 25bp Probes Affymetrix 25 bp Probe Locations ENCODE Transcript Levels Description This track shows the locations of the 25 bp probes on the Affymetrix ENCODE tiling array, which are spaced every 22 bp on average. The chip, which was produced using Affymetrix GeneChip technology, has 737,680 probes representing all the non-repetitive DNA sequence of the ENCODE regions. 
This chip was designed for high throughput experiments to explore the human transcriptome at high resolution. The probes represent all the transcribed regions, including mRNAs as well as non-coding RNAs that are used both structurally and in the regulation of gene expression. Disruption of these structures or changes in the levels of transcription or translation may play a role in disease pathogenesis; therefore, this array is a valuable tool for the discovery and elucidation of disease processes. Display Conventions and Configuration Probe locations are indicated by solid blocks in the graphical display. Methods Probe positions were provided by Affymetrix, and the sequence was verified upon mapping to the genome. The array can be utilized to study transcribed regions (see Affy RNA Signal and Affy Transfrags tracks), transcription factor binding sites (Affy pVal and Affy Sites tracks), sites of chromatin modification, sites for DNA methylation and chromosomal origins of replication. Credits This chip was generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and the Kevin Struhl group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Kapranov, P., Cawley, S. E., Drenkow, J., Bekiranov, S., Strausberg, R. L., Fodor, S. P., and Gingeras, T. R. 
Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-919 (2002). encodeLIChIP LI ChIP Various Ludwig Institute/UCSD ChIP-chip: Pol2 8WG16, TAF1, H3ac, H3K4me2, H3K27me3 antibodies ENCODE Chromatin Immunoprecipitation Description ENCODE region-wide location analyses were conducted of binding to the initiation-complex form of RNA polymerase II (Pol2), TATA-associated factor (TAF1), acetylated histone H3 (H3ac), lysine-4-dimethylated H3 (H3K4me2), suppressor of zeste 12 protein homolog (SUZ12), and lysine-27-tri-methylated H3 (H3K27me3). The analyses used chromatin extracted from IMR90 (lung fibroblast), HCT116 (colon epithelial carcinoma), HeLa (cervix epithelial adenocarcinoma), and THP1 (blood monocyte leukemia) cells. The initiation-complex form of Pol2 is associated with the transcription start site, as is TAF1. Both H3ac and H3K4me2 are associated with transcriptionally-active "open" chromatin. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. Data for each antibody/cell line pair is displayed in a separate subtrack. See the top of the track description page for a complete list of the subtracks available for this annotation. The subtracks may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by the list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Chromatin from each of the four cell lines was separately cross-linked, precipitated with antibody to one of the six proteins, sheared, amplified and hybridized to a PCR DNA tiling array produced at the Ren Lab at UC San Diego. The array was composed of 24,537 non-repetitive sequences within the 44 ENCODE regions. 
For each marker, there were three biological replicates. Each experiment was normalized using the median values. The P-value and R-value were calculated using the modified single array error model (Li, Z. et al., 2003). The P-value and R-value were then derived from the weighted average results of the replicates. The displayed values were scaled to 0 - 16, corresponding to negative log base 10 of the P-value. Verification Each of the experiments has three biological replicates. The array platform, the raw and normalized data for each experiment, and the image files have all been deposited at the NCBI GEO Microarray Database. Credits The data for this track were generated at the Ren Lab, Ludwig Institute for Cancer Research at UC San Diego. References Kim, T., Barrera, L.O., Qu, C., van Calcar, S., Trinklein, N., Cooper, S., Luna, R., Glass, C.K., Rosenfeld, M.G., Myers, R., Ren, B. Direct isolation and identification of promoters in the human genome. Genome Research 15(6), 830-839 (2005). Li, Z., Van Calcar, S., Qu, C., Cavenee, W.K., Zhang, M.Q., and Ren, B. A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells. Proc. Natl. Acad. Sci. 100(14), 8164-8169 (2003). Ren, B., Robert, F., Wyrick, J. J., Aparicio, O., Jennings, E. G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T. L., Wilson, C., Bell, S. P. and Young, R. A. Genome-wide location and function of DNA-associated proteins. Science 290(5500), 2306-2309 (2000). encodeUcsdChipSuper LI/UCSD ChIP Ludwig Institute/UC San Diego ChIP-chip ENCODE Chromatin Immunoprecipitation Overview This super-track combines related tracks of ChIP-chip data generated by the Ludwig Institute/UCSD ENCODE group. ChIP-chip, also known as genome-wide location analysis, is a technique for isolation and identification of DNA sequences bound by specific proteins in cells, including histones. 
Histone methylation and acetylation serve as a stable genomic imprint that regulates gene expression and other epigenetic phenomena. These histones are found in transcriptionally active domains called euchromatin. These tracks contain ChIP-chip data for components of the transcription initiation complex (such as Pol2 and TAF1) and for histones H3 and H4 in multiple cell lines, including HeLa (cervical carcinoma), IMR90 (human fibroblast), and HCT116 (colon epithelial carcinoma), with some experiments including interferon-gamma induction. Credits The data for this track were generated at the Ren Lab, Ludwig Institute for Cancer Research at UC San Diego. References Kim TH, Barrera LO, Qu C, Van Calcar S, Trinklein ND, Cooper SJ, Luna RM, Glass CK, Rosenfeld MG, Myers RM, Ren B. Direct isolation and identification of promoters in the human genome. Genome Res. 2005 Jun;15(6):830-9. Li Z, Van Calcar S, Qu C, Cavenee WK, Zhang MQ, Ren B. A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells. Proc Natl Acad Sci U S A. 2003 Jul 8;100(14):8164-9. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E et al. Genome-wide location and function of DNA-associated proteins. Science. 2000 Dec 22;290(5500):2306-9. Kim TH, Barrera LO, Zheng M, Qu C, Singer MA, Richmond TA, Wu Y, Green RD, Ren B. A high-resolution map of active promoters in the human genome. Nature. 2005 Aug 11;436(7052):876-80. 
encodeUcsdChipH3K27me3 LI H3K27me3 HeLa Ludwig Institute ChIP-chip: H3K27me3 ab, HeLa cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipH3K27me3Suz12 LI SUZ12 HeLa Ludwig Institute ChIP-chip: SUZ12 protein ab, HeLa cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipMeh3k4Imr90_f LI H3K4me2 IMR90 Ludwig Institute ChIP-chip: H3K4me2 ab, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipAch3Imr90_f LI H3ac IMR90 Ludwig Institute ChIP-chip: H3ac ab, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipTaf250Hct116_f LI TAF1 HCT116 Ludwig Institute ChIP-chip: TAF1 ab, HCT116 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipTaf250Imr90_f LI TAF1 IMR90 Ludwig Institute ChIP-chip: TAF1 ab, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipTaf250Thp1_f LI TAF1 THP1 Ludwig Institute ChIP-chip: TAF1 ab, THP1 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipTaf250Hela_f LI TAF1 HeLa Ludwig Institute ChIP-chip: TAF1 ab, HeLa cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipRnapHct116_f LI Pol2 HCT116 Ludwig Institute ChIP-chip: Pol2 8WG16 ab, HCT116 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipRnapImr90_f LI Pol2 IMR90 Ludwig Institute ChIP-chip: Pol2 8WG16 ab, IMR90 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipRnapThp1_f LI Pol2 THP1 Ludwig Institute ChIP-chip: Pol2 8WG16 ab, THP1 cells ENCODE Chromatin Immunoprecipitation encodeUcsdChipRnapHela_f LI Pol2 HeLa Ludwig Institute ChIP-chip: Pol2 8WG16 ab, HeLa cells ENCODE Chromatin Immunoprecipitation encodeLIChIPgIF LI gIF ChIP Ludwig Institute/UCSD ChIP-chip - Gamma Interferon Experiments ENCODE Chromatin Immunoprecipitation Description ENCODE region-wide location analysis of histones H3 and H4 with antibodies H3K4me2, H3K4me3, H3ac, H4ac, STAT1, RNA polymerase II and TAF1 was conducted with ChIP-chip, using chromatin extracted from HeLa cells induced for 30 min with interferon-gamma as well as uninduced cells. 
The H3K4me2, H3K4me3, H3ac form of histone H3, and H4ac form of histone H4 are associated with up-regulation of gene expression. STAT1 (signal transducer and activator of transcription) binds to DNA and activates transcription in response to various cytokines, including interferon-gamma. Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Chromatin from both induced and uninduced cells was separately cross-linked, precipitated with the antibodies, sheared, amplified and hybridized to a PCR DNA tiling array produced at the Ren Lab at UC San Diego. The array was composed of 24,537 non-repetitive sequences within the 44 ENCODE regions. Each state had three or more biological replicates. Each experiment was loess-normalized using R. The P-value and R-value were calculated using the modified single array error model (Li, Z. et al., 2003). The P-value and R-value were then derived from the weighted average results of the replicates. The displayed values were scaled to 0 - 16, corresponding to negative log base 10 of the P-value. Verification Each of the two experiments has three biological replicates. The array platform, the raw and normalized data for each experiment, and the image files have all been deposited at the NCBI GEO Microarray Database (pending approval). Credits The data for this track were generated at the Ren Lab, Ludwig Institute for Cancer Research at UC San Diego. 
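The display scaling described in the Methods above (values on a 0 - 16 scale corresponding to the negative log base 10 of the P-value) can be sketched as follows. The function name display_score is hypothetical, and the sketch assumes that the 0 - 16 range is obtained by clipping, which the text does not state explicitly. It does not reproduce the single array error model that produces the P-values.

```python
import math


def display_score(p_value, cap=16.0):
    """Map a P-value to -log10(P), limited to the range [0, cap].

    Assumption: out-of-range scores are clipped rather than rescaled.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value == 0.0:
        # -log10(0) is infinite; report the cap instead.
        return cap
    return min(cap, max(0.0, -math.log10(p_value)))


for p in (1.0, 0.05, 1e-3, 1e-20):
    print(p, round(display_score(p), 3))
```

Under this assumption, any P-value at or below 1e-16 is displayed at the maximum height of 16.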
References Kim, T., Barrera, L.O., Qu, C., van Calcar, S., Trinklein, N., Cooper, S., Luna, R., Glass, C.K., Rosenfeld, M.G., Myers, R., Ren, B. Direct isolation and identification of promoters in the human genome. Genome Research 15,830-839 (2005). Li, Z., Van Calcar, S., Qu, C., Cavenee, W.K., Zhang, M.Z., and Ren, B. A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells. Proc. Natl. Acad. Sci. 100(14), 8164-8169 (2003). Ren, B., Robert, F., Wyrick, J. W., Aparicio, O., Jennings, E. G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert , T. L., Wilson, C., Bell, S. P. and Young, R. A. Genome-wide location and function of DNA-associated proteins Science 290(5500), 2306-2309 (2000). encodeUcsdChipHeLaH3H4TAF250_p30 LI TAF1 +gIF Ludwig Institute ChIP-chip: TAF1, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4TAF250_p0 LI TAF1 -gIF Ludwig Institute ChIP-chip: TAF1, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4RNAP_p30 LI Pol2 +gIF Ludwig Institute ChIP-chip: RNA Pol2, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4RNAP_p0 LI Pol2 -gIF Ludwig Institute ChIP-chip: RNA Pol2, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4stat1_p30 LI STAT1 +gIF Ludwig Institute ChIP-chip: STAT1 ab, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4stat1_p0 LI STAT1 -gIF Ludwig Institute ChIP-chip: STAT1 ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4acH4_p30 LI H4ac +gIF Ludwig Institute ChIP-chip: H4ac ab, HeLa cells, 30 min. 
after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4acH4_p0 LI H4ac -gIF Ludwig Institute ChIP-chip: H4ac ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4acH3_p30 LI H3ac +gIF Ludwig Institute ChIP-chip: H3ac ab, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4acH3_p0 LI H3ac -gIF Ludwig Institute ChIP-chip: H3ac ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4tmH3K4_p30 LI H3K4me3 +gIF Ludwig Institute ChIP-chip: H3K4me3 ab, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4tmH3K4_p0 LI H3K4me3 -gIF Ludwig Institute ChIP-chip: H3K4me3 ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4dmH3K4_p30 LI H3K4me2 +gIF Ludwig Institute ChIP-chip: H3K4me2 ab, HeLa cells, 30 min. after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdChipHeLaH3H4dmH3K4_p0 LI H3K4me2 -gIF Ludwig Institute ChIP-chip: H3K4me2 ab, HeLa cells, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeStanfordChip Stanf ChIP Stanford ChIP-chip (HCT116, Jurkat, K562 cells; Sp1, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation Description This track displays regions bound by Sp1 and Sp3, in the following three cell lines, assayed by ChIP and microarray hybridization: Cell LineClassificationIsolated From HCT 116colorectal carcinomacolon Jurkat, Clone E6-1acute T cell leukemiaT lymphocyte K-562chronic myelogenous leukemia (CML)bone marrow Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. 
To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Chromatin IP was performed as described in Trinklein et al. (2004). Amplified and labeled ChIP DNA was hybridized to oligo tiling arrays produced by NimbleGen, along with a total genomic reference sample. The data for each array were median subtracted (log 2 ratios) and normalized (divided by the standard deviation). The value given for each probe is the transformed mean ratio of ChIP DNA:Total DNA. Verification Three biological replicates and two technical replicates were performed. The Myers lab is currently testing the specificity and sensitivity using real-time PCR. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). References Trinklein, N.D., Chen, W.C., Kingston, R.E. and Myers, R.M. The role of heat shock transcription factor 1 in the genome-wide regulation of the mammalian heat shock response. Mol. Biol. Cell 15(3), 1254-61 (2004). encodeStanfordChipSuper Stanf ChIP Stanford ChIP-chip ENCODE Chromatin Immunoprecipitation Overview This super-track combines related tracks of ChIP-chip data generated by the Stanford ENCODE group. ChIP-chip, also known as genome-wide location analysis, is a technique for isolation and identification of DNA sequences bound by specific proteins in cells. These tracks contain data for the Sp1 and Sp3 transcription factors in multiple cell lines, including HCT116 (colon epithelial carcinoma), Jurkat (T-cell lymphoblast), and K562 (myeloid leukemia). Credits The Sp1 and Sp3 data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). References Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP et al. 
Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007 Aug 2;448, 553-60. Trinklein ND, Murray JI, Hartman SJ, Botstein D, Myers RM. The role of heat shock transcription factor 1 in the genome-wide regulation of the mammalian heat shock response. Mol. Biol. Cell. 2004 Mar;15(3):1254-61. encodeStanfordChipK562Sp3 Stan K562 Sp3 Stanford ChIP-chip (K562 cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipK562Sp1 Stan K562 Sp1 Stanford ChIP-chip (K562 cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipJurkatSp3 Stan Jurkat Sp3 Stanford ChIP-chip (Jurkat cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipJurkatSp1 Stan Jurkat Sp1 Stanford ChIP-chip (Jurkat cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipHCT116Sp3 Stan HCT116 Sp3 Stanford ChIP-chip (HCT116 cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipHCT116Sp1 Stan HCT116 Sp1 Stanford ChIP-chip (HCT116 cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothed Stanf ChIP Score Stanford ChIP-chip Smoothed Score ENCODE Chromatin Immunoprecipitation Description This track displays smoothed (sliding-window mean) scores for regions bound by Sp1 and Sp3 in the following three cell lines, assayed by ChIP and microarray hybridization: Cell LineClassificationIsolated From HCT 116colorectal carcinomacolon Jurkat, Clone E6-1acute T cell leukemiaT lymphocyte K-562chronic myelogenous leukemia (CML)bone marrow Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. 
For more information about the graphical configuration options, click the Graph configuration help link. Methods Chromatin IP was performed as described in Trinklein et al. (2004). Amplified and labeled ChIP DNA was hybridized to oligo tiling arrays produced by NimbleGen along with a total genomic reference sample. The data for each array were median subtracted (log 2 ratios) and normalized (divided by the standard deviation). The transformed mean ratios of ChIP DNA:Total DNA for all probes were then smoothed by calculating a sliding-window mean. Windows of six neighboring probes (sliding two probes at a time) were used; within each window, the highest and lowest value were dropped, and the remaining 4 values were averaged. To increase the contrast between high and low values for visual display, the average was converted to a score by the formula: score = 8^(average) * 10. These scores are for visualization purposes; for all analyses, the raw ratios, which are available in the Stanf ChIP track, should be used. Verification Three biological replicates and two technical replicates were performed. The Myers lab is currently testing the specificity and sensitivity using real-time PCR. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). References Trinklein, N.D., Chen, W.C., Kingston, R.E. and Myers, R.M. The role of heat shock transcription factor 1 in the genome-wide regulation of the mammalian heat shock response. Mol. Biol. Cell 15(3), 1254-61 (2004). 
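The smoothing described in the Methods for the Stanf ChIP Score track can be sketched as follows. This is an illustrative reading of the text, not the Myers lab's code: the function name smoothed_scores is hypothetical, and the description does not say which genomic position each window's score is assigned to, so the sketch returns one score per window in order.

```python
def smoothed_scores(ratios, window=6, step=2):
    """Trimmed sliding-window mean converted to a display score.

    ratios: per-probe transformed mean ratios (ChIP DNA:Total DNA), in
    genomic order. For each window of `window` neighboring probes, moved
    `step` probes at a time, the highest and lowest values are dropped,
    the remaining values are averaged, and the average is converted to
    score = 8^(average) * 10.
    """
    scores = []
    for start in range(0, len(ratios) - window + 1, step):
        values = sorted(ratios[start:start + window])
        trimmed = values[1:-1]          # drop lowest and highest value
        average = sum(trimmed) / len(trimmed)
        scores.append(8 ** average * 10)
    return scores

if __name__ == "__main__":
    # Eight probes give two windows: probes 0-5 and probes 2-7.
    print(smoothed_scores([0.0, 0.2, 0.1, 2.5, 0.3, 0.0, 0.4, 0.1]))
```

Because the exponential transform exaggerates differences, an average of 0 maps to a score of 10 and an average of 1 maps to 80, which is why the raw ratios rather than these scores should be used for analysis.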
encodeStanfordChipSmoothedK562Sp3 Stan Sc K562 Sp3 Stanford ChIP-chip Smoothed Score (K562 cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedK562Sp1 Stan Sc K562 Sp1 Stanford ChIP-chip Smoothed Score (K562 cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedJurkatSp3 Stan Sc Jurkat Sp3 Stanford ChIP-chip Smoothed Score (Jurkat cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedJurkatSp1 Stan Sc Jurkat Sp1 Stanford ChIP-chip Smoothed Score (Jurkat cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedHCT116Sp3 Stan Sc HCT116 Sp3 Stanford ChIP-chip Smoothed Score (HCT116 cells, Sp3 ChIP) ENCODE Chromatin Immunoprecipitation encodeStanfordChipSmoothedHCT116Sp1 Stan Sc HCT116 Sp1 Stanford ChIP-chip Smoothed Score (HCT116 cells, Sp1 ChIP) ENCODE Chromatin Immunoprecipitation encodeUvaDnaRep UVa DNA Rep University of Virginia Temporal Profiling of DNA Replication ENCODE Chromosome, Chromatin and DNA Structure Description The five subtracks in this annotation correspond to five different time points relative to the start of the DNA synthesis phase (S-phase) of the cell cycle. Display Conventions and Configuration Regions that are replicated during the given time interval are shown in green. Varying shades of green are used to distinguish one subtrack from another. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods The experimental strategy adopted to map this profile involved isolation of replication products from HeLa cells synchronized at the G1-S boundary by thymidine-aphidicolin double block. Cells released from the block were labeled with BrdU at every two-hour interval of the 10 hours of S-phase and DNA was isolated from them. 
The heavy-light (H/L) DNA representing the pool of DNA replicated during each two-hour labeling period was separated from the unlabeled DNA by double cesium chloride density gradient centrifugation. The purified heavy-light DNA was then hybridized to a high-density genome-tiling Affymetrix array comprised of all unique probes within the ENCODE regions. The raw data generated by the microarray experiments were processed by computing the enrichment of signal in a particular part of the S-phase relative to the entirety of the S-phase (10 hours). High-confidence regions (P-value = 1E-04) of replication were mapped by applying the Wilcoxon Rank Sum test in a sliding window of size 10 kb using the standard Affymetrix data analysis tools and the April 2003 (hg15) version of the human genome assembly. These coordinates were then mapped to the July 2003 (hg16) assembly by UCSC using the liftOver tool. Verification The submitted data are from two biological experimental sets. Regions of significant enrichment were included from both of the biological replicates. Credits Data generation and analysis for this track were performed by the DNA replication group in the Dutta Lab at the University of Virginia: Neerja Karnani, Christopher Taylor, Hakkyun Kim, Louis Lim, Ankit Malhotra, Gabe Robins and Anindya Dutta. Neerja Karnani and Christopher Taylor prepared the data for presentation in the UCSC Genome Browser. References Jeon, Y., Bekiranov, S., Karnani, N., Kapranov, P., Ghosh, S., MacAlpine, D., Lee, C., Hwang, D.S., Gingeras, T.R. and Dutta, A. Temporal profile of replication of human chromosomes. Proc Natl Acad Sci U S A 102(18), 6419-24 (2005). encodeUvaDnaRepSuper UVa DNA Rep University of Virginia DNA Replication Timing and Origins ENCODE Chromosome, Chromatin and DNA Structure Overview This super-track combines related tracks of DNA replication data from the University of Virginia.
DNA replication is carefully coordinated, both across the genome and with respect to development. Earlier replication in S-phase is broadly correlated with gene density and transcriptional activity. These tracks contain temporal profiling of DNA replication and origin of DNA replication in multiple cell lines, such as HeLa cells (cervix carcinoma). Replication timing was measured by analyzing Brd-U-labeled fractions from synchronized cells on tiling arrays. Credits Data generation and analysis for this track were performed by the DNA replication group in the Dutta Lab at the University of Virginia: Neerja Karnani, Christopher Taylor, Hakkyun Kim, Louis Lim, Ankit Malhotra, Gabe Robins and Anindya Dutta. Neerja Karnani and Christopher Taylor prepared the data for presentation in the UCSC Genome Browser. References Giacca M, Pelizon C, Falaschi A. Mapping replication origins by quantifying relative abundance of nascent DNA strands using competitive polymerase chain reaction. Methods. 1997 Nov;13(3):301-12. Mesner LD, Crawford EL, Hamlin JL. Isolating apparently pure libraries of replication origins from complex genomes. Mol Cell. 2006 Mar 3;21(5):719-26. Jeon Y, Bekiranov S, Karnani N, Kapranov P, Ghosh S, MacAlpine D, Lee C, Hwang DS, Gingeras TR, Dutta A. Temporal profile of replication of human chromosomes. Proc Natl Acad Sci U S A. 2005 May 3;102(18):6419-24. 
encodeUvaDnaRep8 UVa DNA Rep 8h University of Virginia Temporal Profiling of DNA Replication (8-10 hrs) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRep6 UVa DNA Rep 6h University of Virginia Temporal Profiling of DNA Replication (6-8 hrs) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRep4 UVa DNA Rep 4h University of Virginia Temporal Profiling of DNA Replication (4-6 hrs) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRep2 UVa DNA Rep 2h University of Virginia Temporal Profiling of DNA Replication (2-4 hrs) ENCODE Chromosome, Chromatin and DNA Structure encodeUvaDnaRep0 UVa DNA Rep 0h University of Virginia Temporal Profiling of DNA Replication (0-2 hrs) ENCODE Chromosome, Chromatin and DNA Structure regPotential2X 2x Reg Potential 2-Way Regulatory Potential - Human (hg16), Mouse (Oct. 2003/mm4) Regulation Description This track displays the 2-way regulatory potential (RP) score computed from alignments of human (hg16, Jul. '03) and mouse (mm4, Oct. '03). A 3-way RP score computed from alignments of human, mouse and rat is also available on this browser. RP scores compare frequencies of short alignment patterns between known regulatory elements and neutral DNA. Preliminary results from a calibration study investigating sensitivity and specificity of 2-way RP scores on the hemoglobin beta gene cluster suggest the use of a threshold just above 0 for identifying new putative regulatory elements. The default viewing range for this track is from 0.00 to 0.01 (score values below the 0.00 default lower limit indicate resemblance to alignment patterns typical of neutral DNA, while score values above the 0.01 default upper limit indicate very marked resemblance to alignment patterns typical of regulatory elements in the training set). The range of RP scores from 0.00 to 0.01 contains the prediction threshold suggested by calibration studies, and provides an effective visualization of the score for most genomic loci. 
However, the user can specify different viewing ranges if desired. Note: Absence of a score value at a given location indicates lack of a 2-way alignment. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. Methods The comparison employs log-ratios of transition probabilities from two Markov models. Training the score entails selecting an appropriate alphabet (alignment column symbols) and order (length of the patterns = order + 1) for the Markov models, and estimating their transition probabilities, based on alignment data from known regulatory elements and ancestral repeats. The 2-way RP score uses a 5-symbol alphabet and order 5. In the track, score values are displayed using a system of overlapping windows of size 100 bp along aligned portions of the human sequence. Log-ratios are added over positions in a window, and the sum is normalized for length. Credits Work on RP scores is performed by members of the Comparative Genomics and Bioinformatics Center at Penn State University. More information on this research and the collection of known regulatory elements used in training the score can be found at this site. Mouse sequence data were provided by the Mouse Sequencing Consortium. The alignment data were created in collaboration with the UCSC Genome Bioinformatics group. References Elnitski L, Hardison R, Li J, Yang S, Kolbe D, Eswara P, O'Connor M, Schwartz S, Miller W, Chiaromonte F. Distinguishing regulatory DNA from neutral sites. Genome Res. 2003 Jan;13(1):64-72. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W. Human-Mouse Alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. regPotential3X 3x Reg Potential 3-Way Regulatory Potential - Human, Mouse (Feb.
2003/mm3), Rat (June 2003/rn3) Regulation Description This track displays the 3-way regulatory potential (RP) score computed from alignments of human (hg16, Jul. '03), mouse (mm3, Feb. '03) and rat (rn3, Jun. '03). A 2-way RP score, computed from alignments of human and mouse only, is also available on this browser. RP scores compare frequencies of short alignment patterns between known regulatory elements and neutral DNA. Preliminary results from a calibration study investigating sensitivity and specificity of 3-way RP scores on the hemoglobin beta gene cluster suggest the use of a threshold ~0.0006 for identifying new putative regulatory elements. The default viewing range for this track is from 0.00 to 0.01 (score values below the 0.00 default lower limit indicate resemblance to alignment patterns typical of neutral DNA, while score values above the 0.01 default upper limit indicate very marked resemblance to alignment patterns typical of regulatory elements in the training set). The range of RP scores from 0.00 to 0.01 contains the prediction threshold suggested by calibration studies, and provides an effective visualization of the score for most genomic loci. However, the user can specify different viewing ranges if desired. Note: Absence of a score value at a given location indicates lack of a 3-way alignment. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the Graph configuration help link for an explanation of the configuration options. Methods The comparison employs log-ratios of transition probabilities from two Markov models. Training the score entails selecting an appropriate alphabet (alignment column symbols) and order (length of the patterns = order + 1) for the Markov models, and estimating their transition probabilities, based on alignment data from known regulatory elements and ancestral repeats. The 3-way RP score uses a 10-symbol alphabet and order 2.
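As a rough illustration of the scoring idea (log-ratios of transition probabilities from a regulatory and a neutral Markov model, summed over a window and normalized for length), the following sketch may help. It is not the Penn State implementation: the function names, the add-one smoothing, the five-letter toy alphabet standing in for alignment-column symbols, and the tiny training strings are all assumptions made for the example.

```python
import math
from collections import defaultdict

def train_markov(sequences, alphabet, order, pseudocount=1.0):
    """Estimate P(symbol | preceding `order` symbols) from training strings.

    Returns a function prob(context, symbol). Pseudocounts (an assumption
    of this sketch) keep unseen transitions from having zero probability.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, symbol):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(alphabet)
        return (row.get(symbol, 0.0) + pseudocount) / total

    return prob

def window_score(window, reg_prob, neu_prob, order):
    """Sum of log-ratios over window positions, normalized for length."""
    n = len(window) - order
    if n <= 0:
        raise ValueError("window must be longer than the model order")
    total = 0.0
    for i in range(order, len(window)):
        context, symbol = window[i - order:i], window[i]
        total += math.log(reg_prob(context, symbol) / neu_prob(context, symbol))
    return total / n

if __name__ == "__main__":
    alphabet = "ABCDE"                      # toy alignment-column symbols
    reg = train_markov(["ABABABABAB"], alphabet, order=2)
    neu = train_markov(["CDCDCDCDCD"], alphabet, order=2)
    print(window_score("ABABAB", reg, neu, order=2))   # positive
    print(window_score("CDCDCD", reg, neu, order=2))   # negative
```

Positive scores indicate patterns that are more probable under the regulatory model, negative scores patterns that are more probable under the neutral model, which matches the interpretation of the viewing range given in the Description.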
In the track, score values are displayed using a system of overlapping windows of size 100 bp along aligned portions of the human sequence. Log-ratios are added over positions in a window, and the sum is normalized for length. Credits Work on RP scores is performed by members of the Comparative Genomics and Bioinformatics Center at Penn State University. More information on this research and the collection of known regulatory elements used in training the score can be found at this site. Mouse and rat sequence data were provided by the Mouse and Rat Sequencing Consortia. The alignment data were created in collaboration with the UCSC Genome Bioinformatics group. References Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W. Aligning Multiple Genomic Sequences with the Threaded Blockset Aligner. Genome Res. 2004 Apr;14(4):708-15. Kolbe D, Taylor J, Elnitski L, Eswara P, Li J, Miller W, Hardison RC, Chiaromonte F. Regulatory potential scores from genome-wide three-way alignments of human, mouse, and rat. Genome Res. 2004 Apr;14(4):700-7. affyHumanExon Affy All Exon Affymetrix All Exon Chips Expression Methods RNA (from a commercial source) from 11 tissues was hybridized to Affymetrix Human Exon 1.0 ST arrays. For each tissue, 3 replicate experiments were done for a total of 33 arrays. The arrays' raw signal intensity was normalized with a quantile normalization method, then run through the PLIER algorithm. The normalized data were then converted to log-ratios, which are displayed as green for negative log-ratios (underexpression), and red for positive (overexpression). The probe set for this microarray track can be displayed by turning on the Affy HuEx 1.0 track. Credits The data for this track were provided and analyzed by Chuck Sugnet at Affymetrix. Links Affymetrix Human Exon 1.0 ST array web page Affymetrix Human Exon data PLIER algorithm documentation (PDF).
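The quantile normalization step named in the Affy All Exon Methods can be sketched in a few lines. This shows the general technique only; it is not the Affymetrix pipeline, the function name quantile_normalize is hypothetical, ties are broken by probe order as a simplifying assumption, and the subsequent PLIER step is not reproduced.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities per
    array. The k-th smallest value in each array is replaced by the mean
    of the k-th smallest values across all arrays.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])   # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

if __name__ == "__main__":
    print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

After this step each array contains exactly the same set of values, so differences between arrays reflect only the rank of each probe within its array.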
affyGnf1h Affy GNF1H Alignments of Affymetrix Consensus/Exemplars from GNF1H Expression Description This track shows the location of the sequences used for the selection of probes on the Affymetrix GNF1H chips. This contains 11406 predicted genes that do not overlap with the Affy U133A chip. Methods The sequences were mapped to the genome using blat followed by pslReps with the parameters: -minCover=0.3 -minAli=0.95 -nearTop=0.005 Credits Thanks to the Genomics Institute of the Novartis Research Foundation (GNF) for the data underlying this track. References Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7. PMID: 15075390; PMC: PMC395923 affyHuEx1 Affy HuEx 1.0 Affymetrix Human Exon 1.0 Probe Sets Expression Description The Human Exon 1.0 ST GeneChip contains over 1.4 million probe sets designed to interrogate individual exons rather than the 3' ends of transcripts as in traditional GeneChips. Exons were derived from a variety of annotations that have been divided into the classes Core, Extended and Full. Core: RefSeq transcripts, full-length GenBank mRNAs Extended: dbEst alignments, Ensembl annotations, syntenic mRNA from rat and mouse, microRNA annotations, MITOMAP annotations, Vega genes, Vega pseudogenes Full: Geneid genes, Genscan genes, Genscan Subopt, Exoniphy, RNA genes, SGP genes, Twinscan genes Probe sets are colored by class with the Core probe sets being the darkest and the Full being the lightest color. Additionally, probe sets that do not overlap the exons of a transcript cluster, but fall inside of its introns, are considered bounded by that transcript cluster and are colored slightly lighter. Probe sets that overlap the coding portion of the Core class are colored slightly darker. 
The microarray track using this probe set can be displayed by turning on the Affy All Exon track. Credits and References The exons interrogated by the probe sets displayed in this track are from the Affymetrix Human Exon 1.0 GeneChip and were derived from a number of sources. In addition to the millions of cDNA sequences contributed to the GenBank, dbEst and RefSeq databases by individual labs and scientists, the following annotations were used: Ensembl: Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al.. The Ensembl genome database project. Nucleic Acids Research. 2002 Jan 1;30(1):38-41. Exoniphy: Siepel, A., Haussler, D. Computational identification of evolutionarily conserved exons. Proc. 8th Int'l Conf. on Research in Computational Molecular Biology, 177-186 (2004). Geneid Genes: Parra, G., Blanco, E., Guigo, R. Geneid in Drosophila. Genome Res. 10(4), 511-515 (2000). Genscan Genes: Burge, C., Karlin, S. Prediction of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268(1), 78-94 (1997). microRNA: Griffiths-Jones, S. The microRNA Registry. Nucl. Acids Res. 32, D109-D111 (2004). MITOMAP: Brandon, M. C., Lott, M. T., Nguyen, K. C., Spolim, S., Navathe, S. B., Baldi, P. & Wallace, D. C. MITOMAP: a human mitochondrial genome database--2004 update Nucl. Acids Res. 33(Database Issue):D611-613 (2005). RNA Genes: Lowe, T. M., Eddy, S. R. tRNAscan-SE: A Program for Improved Detection of Transfer RNA Genes in Genomic Sequence. Nucleic Acids Res., 25(5), 955-964 (1997). SGP Genes: Wiehe, T., Gebauer-Jung, S., Mitchell-Olds, T., Guigo, R. SGP-1: prediction and validation of homologous genes based on sequence alignments. Genome Res., 11(9), 1574-83 (2001). Twinscan Genes: Korf, I., Flicek, P., Duan, D., Brent, M.R. Integrating genomic homology into gene structure prediction. Bioinformatics 17, S140-148 (2001). Vega Genes and Pseudogenes: The HAVANA group, Wellcome Trust Sanger Institute. 
encodeAffyChIpHl60Pval Affy pVal Affymetrix ChIP/Chip (retinoic acid-treated HL-60 cells) P-Values ENCODE Chromatin Immunoprecipitation Description This track shows regions that co-precipitate with antibodies against each of ten factors in all ENCODE regions, in retinoic-acid stimulated HL-60 cells harvested after 0, 2, 8, and 32 hours. Median P-values are shown in separate subtracks for each of the ten antibodies: Brg1 - Brahma-related Gene 1 CEBPe - CCAAT-enhancer binding protein-epsilon CTCF - CCTC binding factor H3K27me3 (H3K27T) - Histone H3 tri-methylated lysine 27 H4Kac4 (HisH4) - Histone H4 tetra-acetylated lysine P300 - E1A-binding protein, 300-KD PU1 - Spleen focus forming virus proviral integration oncogene Pol2 - RNA Polymerase II (8WG16 ab against pre-initiation complex form) RARA (RARecA) - Retinoic Acid Receptor-Alpha SIRT1 - Sirtuin-1 Retinoic acid-stimulated HL-60 cells were harvested and whole cell extracts (control) were made. An antibody was used to immunoprecipitate bound chromatin fragments (treatment). DNA was purified from these samples and hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Only median P-values are displayed; data for all biological replicates can be downloaded from Affymetrix in wiggle, cel, and soft formats. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for finding the same antibody in different timepoint tracks. 
Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 1001 bp window centered on each probe, a signal estimator S = ln[max(PM - MM, 1)] (where PM is perfect match and MM is mismatch) was computed for each biological replicate treatment- and all replicate control-probe pairs. An estimate of the significance of the enrichment of treatment signal for each replicate over control signal in each window was given by the P-value computed using the Wilcoxon Rank Sum test over each biological replicate treatment and all control signal estimates in that window. The median of the log transformed P-value (-10 log[10] P) across processed replicate data is displayed. Several independent biological replicates (four each for Brg1, CEBPe, CTCF, PU1, and SIRT1; five each for H3K27me3, H4Kac4, P300, Pol2 and RARA) were generated and hybridized to duplicate arrays (two technical replicates). Reproducible enriched regions were generated from the signal by first applying a cutoff of 20 to the log transformed P-values, a maxGap and minRun of 500 and 0 basepairs respectively, to each biological replicate. Since each region or site may be comprised of more than one probe, a median based on the distribution of log transformed P-values was computed per site for each of the respective replicates. These seed sites were then ranked individually within each of the replicates. If a site was absent in a replicate, the maximum or worst rank of the distribution was assigned to it. 
The following three values were computed for each site by combining data from all biological replicates: average of all ranks computed among biological replicates sum of all pairwise differences in these ranks computed among biological replicates a combined P-value, using a chi square distribution, across all replicates The final sites were selected when all of the above three metrics were relatively low, where "low" corresponds to the top 25 percentile of the distribution. Verification Using the P-values from the biological replicates, all pairwise rank correlation coefficients were computed among biological replicates. Data sets showing both consistent pairwise correlation coefficients and at least weak positive correlation across all pairs were considered reproducible. Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and Kevin Struhl's group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). 
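Parts of the Affy pVal Methods above can be sketched in code: the signal estimator S = ln[max(PM - MM, 1)], the log transform -10 log10(P), and the grouping of probes that pass the cutoff into candidate regions. This is a sketch under stated assumptions, not the Affymetrix software. The function names are hypothetical, and the reading of maxGap (passing probes no more than that many base pairs apart are merged) and minRun (regions shorter than that are dropped) is an assumption, as is treating a score equal to the cutoff as passing. The Wilcoxon Rank Sum test itself is not reproduced.

```python
import math

def signal_estimate(pm, mm):
    """S = ln[max(PM - MM, 1)] for one perfect-match/mismatch probe pair."""
    return math.log(max(pm - mm, 1.0))

def log_transform_p(p_value):
    """Log-transformed P-value, -10 * log10(P); P = 0.01 gives 20."""
    return -10.0 * math.log10(p_value)

def call_regions(positions, scores, cutoff=20.0, max_gap=500, min_run=0):
    """Merge probes scoring at or above `cutoff` into (start, end) regions.

    positions must be sorted in ascending order. Passing probes separated
    by at most `max_gap` bp join the same region; regions spanning fewer
    than `min_run` bp are discarded (assumed semantics).
    """
    regions = []
    start = end = None
    for pos, score in zip(positions, scores):
        if score < cutoff:
            continue
        if start is not None and pos - end <= max_gap:
            end = pos                      # extend the current region
        else:
            if start is not None and end - start >= min_run:
                regions.append((start, end))
            start = end = pos              # open a new region
    if start is not None and end - start >= min_run:
        regions.append((start, end))
    return regions

if __name__ == "__main__":
    positions = [0, 22, 44, 1000, 1022]
    scores = [25.0, 30.0, 10.0, 21.0, 5.0]
    print(call_regions(positions, scores))
```

With the defaults shown, which mirror the cutoff of 20 and the maxGap and minRun of 500 and 0 base pairs quoted in the Methods, the example groups the first two probes into one region and leaves the probe at position 1000 as a separate single-probe region.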
encodeAffyChIpHl60PvalTfiibHr32 Affy TFIIB RA 32h Affymetrix ChIP/Chip (TFIIB retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalSirt1Hr32 Affy SIRT1 RA 32h Affymetrix ChIP/Chip (SIRT1 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalSirt1Hr08 Affy SIRT1 RA 8h Affymetrix ChIP/Chip (SIRT1 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalSirt1Hr02 Affy SIRT1 RA 2h Affymetrix ChIP/Chip (SIRT1 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalSirt1Hr00 Affy SIRT1 RA 0h Affymetrix ChIP/Chip (SIRT1 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRaraHr32 Affy RARA RA 32h Affymetrix ChIP/Chip (RARA retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRaraHr08 Affy RARA RA 8h Affymetrix ChIP/Chip (RARA retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRaraHr02 Affy RARA RA 2h Affymetrix ChIP/Chip (RARA retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRaraHr00 Affy RARA RA 0h Affymetrix ChIP/Chip (RARA retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRnapHr32 Affy Pol2 RA 32h Affymetrix ChIP/Chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRnapHr08 Affy Pol2 RA 8h Affymetrix ChIP/Chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRnapHr02 Affy Pol2 RA 2h Affymetrix ChIP/Chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalRnapHr00 Affy Pol2 RA 0h Affymetrix ChIP/Chip (Pol2 8WG16 
antibody, retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalPu1Hr32 Affy PU1 RA 32h Affymetrix ChIP/Chip (PU1 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalPu1Hr08 Affy PU1 RA 8h Affymetrix ChIP/Chip (PU1 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalPu1Hr02 Affy PU1 RA 2h Affymetrix ChIP/Chip (PU1 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalPu1Hr00 Affy PU1 RA 0h Affymetrix ChIP/Chip (PU1 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalP300Hr32 Affy P300 RA 32h Affymetrix ChIP/Chip (P300 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalP300Hr08 Affy P300 RA 8h Affymetrix ChIP/Chip (P300 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalP300Hr02 Affy P300 RA 2h Affymetrix ChIP/Chip (P300 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalP300Hr00 Affy P300 RA 0h Affymetrix ChIP/Chip (P300 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH4Kac4Hr32 Affy H4Kac4 RA 32h Affymetrix ChIP/Chip (H4Kac4 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH4Kac4Hr08 Affy H4Kac4 RA 8h Affymetrix ChIP/Chip (H4Kac4 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH4Kac4Hr02 Affy H4Kac4 RA 2h Affymetrix ChIP/Chip (H4Kac4 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH4Kac4Hr00 Affy H4Kac4 RA 0h Affymetrix ChIP/Chip (H4Kac4 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH3K27me3Hr32 Affy 
H3K27me3 RA 32h Affymetrix ChIP/Chip (H3K27me3 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH3K27me3Hr08 Affy H3K27me3 RA 8h Affymetrix ChIP/Chip (H3K27me3 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH3K27me3Hr02 Affy H3K27me3 RA 2h Affymetrix ChIP/Chip (H3K27me3 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalH3K27me3Hr00 Affy H3K27me3 RA 0h Affymetrix ChIP/Chip (H3K27me3 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCtcfHr32 Affy CTCF RA 32h Affymetrix ChIP/Chip (CTCF retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCtcfHr08 Affy CTCF RA 8h Affymetrix ChIP/Chip (CTCF retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCtcfHr02 Affy CTCF RA 2h Affymetrix ChIP/Chip (CTCF retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCtcfHr00 Affy CTCF RA 0h Affymetrix ChIP/Chip (CTCF retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCebpeHr32 Affy CEBPe RA 32h Affymetrix ChIP/Chip (CEBPe retinoic acid-treated HL-60, 32hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCebpeHr08 Affy CEBPe RA 8h Affymetrix ChIP/Chip (CEBPe retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCebpeHr02 Affy CEBPe RA 2h Affymetrix ChIP/Chip (CEBPe retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalCebpeHr00 Affy CEBPe RA 0h Affymetrix ChIP/Chip (CEBPe retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalBrg1Hr32 Affy Brg1 RA 32h Affymetrix ChIP/Chip (Brg1 retinoic acid-treated HL-60, 32hrs) P-Value ENCODE 
Chromatin Immunoprecipitation encodeAffyChIpHl60PvalBrg1Hr08 Affy Brg1 RA 8h Affymetrix ChIP/Chip (Brg1 retinoic acid-treated HL-60, 8hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalBrg1Hr02 Affy Brg1 RA 2h Affymetrix ChIP/Chip (Brg1 retinoic acid-treated HL-60, 2hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60PvalBrg1Hr00 Affy Brg1 RA 0h Affymetrix ChIP/Chip (Brg1 retinoic acid-treated HL-60, 0hrs) P-Value ENCODE Chromatin Immunoprecipitation encodeAffyRnaSignal Affy RNA Signal Affymetrix PolyA+ RNA Signal ENCODE Transcript Levels Description This track shows an estimate of RNA abundance (transcription) for all ENCODE regions. Retinoic acid-stimulated HL-60 cells were harvested after 0, 2, 8, and 32 hours. Purified cytosolic polyA+ RNA from unstimulated GM06990 and HeLa cells, as well as purified polyA+ RNA from the RA-stimulated HL-60 samples, was hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Composite signals are shown in separate subtracks for each cell type and for each of the four timepoints for RA-stimulated HL-60. Data for all biological replicates can be downloaded from Affymetrix in wiggle, cel, and soft formats. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing between the different cell types and timepoints. 
Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 101 bp window centered on each probe, an estimate of RNA abundance (signal) was found by calculating the median of all pairwise average PM-MM values, where PM is the perfect-match probe intensity and MM is the mismatch probe intensity. Both Kapranov et al. (2002) and Cawley et al. (2004) are good references for the experimental methods; Cawley et al. also describe the analytical methods. Verification Three independent biological replicates were generated and hybridized to duplicate arrays (two technical replicates). Transcribed regions were generated from the composite signal track by merging genomic positions to which probes are mapped. This merging was based on a 5% false positive rate cutoff in negative bacterial controls, a maximum gap (MaxGap) of 50 base-pairs and minimum run (MinRun) of 50 base-pairs (see the Affy TransFrags track for the merged regions). Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and the Kevin Struhl group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Kapranov, P., Cawley, S. E., Drenkow, J., Bekiranov, S., Strausberg, R. L., Fodor, S. P., and Gingeras, T. R.
Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-919 (2002). encodeAffyRnaHl60SignalHr32 Affy RNA RA 32h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 32hrs) Signal ENCODE Transcript Levels encodeAffyRnaHl60SignalHr08 Affy RNA RA 8h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 8hrs) Signal ENCODE Transcript Levels encodeAffyRnaHl60SignalHr02 Affy RNA RA 2h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 2hrs) Signal ENCODE Transcript Levels encodeAffyRnaHl60SignalHr00 Affy RNA RA 0h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 0hrs) Signal ENCODE Transcript Levels encodeAffyRnaHeLaSignal Affy RNA HeLa Affymetrix PolyA+ RNA (HeLaS3) Signal ENCODE Transcript Levels encodeAffyRnaGm06990Signal Affy RNA GM06990 Affymetrix PolyA+ RNA (GM06990) Signal ENCODE Transcript Levels encodeAffyChIpHl60Sites Affy Sites Affymetrix ChIP/Chip (retinoic acid-treated HL-60 cells) Sites ENCODE Chromatin Immunoprecipitation Description This track shows regions that co-precipitate with antibodies against each of ten factors in all ENCODE regions, in retinoic-acid stimulated HL-60 cells harvested after 0, 2, 8, and 32 hours. Clustered sites are shown in separate subtracks for each of the ten antibodies: Brg1 - Brahma-related Gene 1 CEBPe - CCAAT-enhancer binding protein-epsilon CTCF - CCTC binding factor H3K27me3 (H3K27T) - Histone H3 tri-methylated lysine 27 H4Kac4 (HisH4) - Histone H4 tetra-acetylated lysine P300 - E1A-binding protein, 300-KD PU1 - Spleen focus forming virus proviral integration oncogene Pol2 - RNA Polymerase II (8WG16 ab against pre-initiation complex form) RARA (RARecA) - Retinoic Acid Receptor-Alpha SIRT1 - Sirtuin-1 Retinoic acid-stimulated HL-60 cells were harvested and whole cell extracts (control) were made. An antibody was used to immunoprecipitate bound chromatin fragments (treatment). 
DNA was purified from these samples and hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Display Conventions and Configuration The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options for the subtracks are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for finding the same antibody in different timepoint tracks. Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 1001 bp window centered on each probe, a signal estimator S = ln[max(PM - MM, 1)] (where PM is perfect match and MM is mismatch) was computed for each biological replicate treatment- and all replicate control-probe pairs. An estimate of the significance of the enrichment of treatment signal for each replicate over control signal in each window was given by the P-value computed using the Wilcoxon Rank Sum test over each biological replicate treatment and all control signal estimates in that window. The median of the log transformed P-value (-10 log10 P) across processed replicate data is displayed. Several independent biological replicates (four each for Brg1, CEBPe, CTCF, PU1, and SIRT1; five each for H3K27me3, H4Kac4, P300, Pol2 and RARA) were generated and hybridized to duplicate arrays (two technical replicates). Reproducible enriched regions were generated from the signal by first applying a cutoff of 20 to the log transformed P-values, a maxGap and minRun of 500 and 0 basepairs respectively, to each biological replicate. 
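The per-window scoring described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, the caller is assumed to have already pooled the probe pairs that fall inside one 1001 bp window, the test is taken to be one-sided (treatment greater than control), and the P-value uses a normal approximation without tie correction.

```python
import math

def signal_estimator(pm, mm):
    """S = ln[max(PM - MM, 1)] for a single probe pair."""
    return math.log(max(pm - mm, 1.0))

def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_pvalue(treatment, control):
    """One-sided Wilcoxon rank sum P-value (assumption: treatment > control),
    using the normal approximation and ignoring the tie correction."""
    n, m = len(treatment), len(control)
    ranks = average_ranks(list(treatment) + list(control))
    w = sum(ranks[:n])                      # rank sum of the treatment group
    mean = n * (n + m + 1) / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (w - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def log_score(p):
    """Log transformed P-value, -10 log10 P."""
    return -10.0 * math.log10(p)
```

On this scale a P-value of 0.01 maps to a score of 20, which is the cutoff applied to each biological replicate.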
Since each region or site may comprise more than one probe, a median based on the distribution of log transformed P-values was computed per site for each of the respective replicates. These seed sites were then ranked individually within each of the replicates. If a site was absent in a replicate, the maximum or worst rank of the distribution was assigned to it. The following three values were computed for each site by combining data from all biological replicates: (1) the average of all ranks computed among biological replicates; (2) the sum of all pairwise differences in these ranks computed among biological replicates; and (3) a combined P-value, using a chi-square distribution, across all replicates. Final sites were those for which all three of these metrics were relatively low, where "low" corresponds to the top 25th percentile of the distribution. Verification Using the P-values from the biological replicates, all pairwise rank correlation coefficients were computed among biological replicates. Data sets showing both consistent pairwise correlation coefficients and at least weak positive correlation across all pairs were considered reproducible. Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and Kevin Struhl's group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs.
Cell 116(4), 499-509 (2004). encodeAffyChIpHl60SitesTfiibHr32 Affy TFIIB RA 32h Affymetrix ChIP/Chip (TFIIB retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesSirt1Hr32 Affy SIRT1 RA 32h Affymetrix ChIP/Chip (SIRT1 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesSirt1Hr08 Affy SIRT1 RA 8h Affymetrix ChIP/Chip (SIRT1 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesSirt1Hr02 Affy SIRT1 RA 2h Affymetrix ChIP/Chip (SIRT1 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesSirt1Hr00 Affy SIRT1 RA 0h Affymetrix ChIP/Chip (SIRT1 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRaraHr32 Affy RARA RA 32h Affymetrix ChIP/Chip (RARA retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRaraHr08 Affy RARA RA 8h Affymetrix ChIP/Chip (RARA retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRaraHr02 Affy RARA RA 2h Affymetrix ChIP/Chip (RARA retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRaraHr00 Affy RARA RA 0h Affymetrix ChIP/Chip (RARA retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRnapHr32 Affy Pol2 RA 32h Affymetrix ChIP/Chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRnapHr08 Affy Pol2 RA 8h Affymetrix ChIP/Chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRnapHr02 Affy Pol2 RA 2h Affymetrix ChIP/Chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesRnapHr00 Affy Pol2 RA 0h Affymetrix 
ChIP/Chip (Pol2 8WG16 antibody, retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesPu1Hr32 Affy PU1 RA 32h Affymetrix ChIP/Chip (PU1 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesPu1Hr08 Affy PU1 RA 8h Affymetrix ChIP/Chip (PU1 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesPu1Hr02 Affy PU1 RA 2h Affymetrix ChIP/Chip (PU1 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesPu1Hr00 Affy PU1 RA 0h Affymetrix ChIP/Chip (PU1 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesP300Hr32 Affy P300 RA 32h Affymetrix ChIP/Chip (P300 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesP300Hr08 Affy P300 RA 8h Affymetrix ChIP/Chip (P300 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesP300Hr02 Affy P300 RA 2h Affymetrix ChIP/Chip (P300 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesP300Hr00 Affy P300 RA 0h Affymetrix ChIP/Chip (P300 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH4Kac4Hr32 Affy H4Kac4 RA 32h Affymetrix ChIP/Chip (H4Kac4 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH4Kac4Hr08 Affy H4Kac4 RA 8h Affymetrix ChIP/Chip (H4Kac4 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH4Kac4Hr02 Affy H4Kac4 RA 2h Affymetrix ChIP/Chip (H4Kac4 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH4Kac4Hr00 Affy H4Kac4 RA 0h Affymetrix ChIP/Chip (H4Kac4 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH3K27me3Hr32 
Affy H3K27me3 RA 32h Affymetrix ChIP/Chip (H3K27me3 retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH3K27me3Hr08 Affy H3K27me3 RA 8h Affymetrix ChIP/Chip (H3K27me3 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH3K27me3Hr02 Affy H3K27me3 RA 2h Affymetrix ChIP/Chip (H3K27me3 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesH3K27me3Hr00 Affy H3K27me3 RA 0h Affymetrix ChIP/Chip (H3K27me3 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCtcfHr32 Affy CTCF RA 32h Affymetrix ChIP/Chip (CTCF retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCtcfHr08 Affy CTCF RA 8h Affymetrix ChIP/Chip (CTCF retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCtcfHr02 Affy CTCF RA 2h Affymetrix ChIP/Chip (CTCF retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCtcfHr00 Affy CTCF RA 0h Affymetrix ChIP/Chip (CTCF retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCebpeHr32 Affy CEBPe RA 32h Affymetrix ChIP/Chip (CEBPe retinoic acid-treated HL-60, 32hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCebpeHr08 Affy CEBPe RA 8h Affymetrix ChIP/Chip (CEBPe retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCebpeHr02 Affy CEBPe RA 2h Affymetrix ChIP/Chip (CEBPe retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesCebpeHr00 Affy CEBPe RA 0h Affymetrix ChIP/Chip (CEBPe retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesBrg1Hr32 Affy Brg1 RA 32h Affymetrix ChIP/Chip (Brg1 retinoic acid-treated HL-60, 32hrs) Sites ENCODE 
Chromatin Immunoprecipitation encodeAffyChIpHl60SitesBrg1Hr08 Affy Brg1 RA 8h Affymetrix ChIP/Chip (Brg1 retinoic acid-treated HL-60, 8hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesBrg1Hr02 Affy Brg1 RA 2h Affymetrix ChIP/Chip (Brg1 retinoic acid-treated HL-60, 2hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyChIpHl60SitesBrg1Hr00 Affy Brg1 RA 0h Affymetrix ChIP/Chip (Brg1 retinoic acid-treated HL-60, 0hrs) Sites ENCODE Chromatin Immunoprecipitation encodeAffyRnaTransfrags Affy Transfrags Affymetrix PolyA+ RNA Transfrags ENCODE Transcript Levels Description This track shows an estimate of RNA abundance (transcription) for all ENCODE regions. Retinoic acid-stimulated HL-60 cells were harvested after 0, 2, 8, and 32 hours. Purified cytosolic polyA+ RNA from unstimulated GM06990 and HeLa cells, as well as purified polyA+ RNA from the RA-stimulated HL-60 samples, was hybridized to Affymetrix ENCODE oligonucleotide tiling arrays, which have 25-mer probes tiled every 22 bp on average in the non-repetitive ENCODE regions. Clustered sites are shown in separate subtracks for each cell type and for each of the four timepoints for RA-stimulated HL-60. Display Conventions and Configuration To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing between the different cell types and timepoints. Methods The data from replicate arrays were quantile-normalized (Bolstad et al., 2003) and all arrays were scaled to a median array intensity of 22. Within a sliding 101 bp window centered on each probe, an estimate of RNA abundance (signal) was found by calculating the median of all pairwise average PM-MM values, where PM is the perfect-match probe intensity and MM is the mismatch probe intensity. Both Kapranov et al. (2002) and Cawley et al. (2004) are good references for the experimental methods; Cawley et al. also describe the analytical methods.
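As a rough illustration of the signal estimate described in the Methods above, the following Python sketch computes the median of all pairwise averages of PM-MM within a 101 bp window. The function name and arguments are hypothetical, and whether each probe is also paired with itself (as in the usual pseudomedian over Walsh averages) is an assumption made here, since the text does not say.

```python
from itertools import combinations_with_replacement
from statistics import median

def window_signal(positions, pm, mm, center, half_width=50):
    """Median of pairwise averages of (PM - MM) for probes whose position
    lies within half_width bp of center (a 101 bp window by default).

    Assumption: self-pairs are included, which gives the Hodges-Lehmann
    style pseudomedian. Returns None if no probe falls in the window.
    """
    diffs = [p - m
             for pos, p, m in zip(positions, pm, mm)
             if abs(pos - center) <= half_width]
    if not diffs:
        return None
    pair_averages = [(a + b) / 2.0
                     for a, b in combinations_with_replacement(diffs, 2)]
    return median(pair_averages)
```

Compared with a plain median of PM-MM, averaging over pairs tends to give a smoother estimate while remaining resistant to a single outlying probe.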
Verification Three independent biological replicates were generated and hybridized to duplicate arrays (two technical replicates). Transcribed regions (see the Affy RNA Signal track) were generated from the composite signal track by merging genomic positions to which probes are mapped. This merging was based on a 5% false positive rate cutoff in negative bacterial controls, a maximum gap (MaxGap) of 50 base-pairs and minimum run (MinRun) of 50 base-pairs. Credits These data were generated and analyzed by the Gingeras/Struhl collaboration with the Tom Gingeras group at Affymetrix and the Kevin Struhl group at Harvard Medical School. References Please see the Affymetrix Transcriptome site for a project overview and additional references to Affymetrix tiling array publications. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185-193 (2003). Cawley, S., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Kapranov, P., Cawley, S. E., Drenkow, J., Bekiranov, S., Strausberg, R. L., Fodor, S. P., and Gingeras, T. R. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-919 (2002). 
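The MaxGap/MinRun merging described in the Verification section above can be sketched as follows. This is a minimal Python illustration with hypothetical names, not the Affymetrix implementation. It assumes the signal threshold has already been derived from the 5% false positive rate in the negative bacterial controls, that the gap is measured from the end of the previous positive probe to the start of the next, and that probes are 25 bases long.

```python
def merge_positive_probes(positions, signals, threshold,
                          max_gap=50, min_run=50, probe_len=25):
    """Merge probes with signal >= threshold into transcribed regions.

    Positive probes separated by at most max_gap bp are joined into one
    region; regions shorter than min_run bp are discarded.
    Returns a list of (start, end) tuples, end exclusive.
    """
    regions = []
    start = end = None
    for pos, sig in sorted(zip(positions, signals)):
        if sig < threshold:
            continue                      # negative probes never seed or extend
        if start is not None and pos - end <= max_gap:
            end = max(end, pos + probe_len)
        else:
            if start is not None and end - start >= min_run:
                regions.append((start, end))
            start, end = pos, pos + probe_len
    if start is not None and end - start >= min_run:
        regions.append((start, end))
    return regions
```

With these defaults, a single isolated positive probe (25 bp) is dropped because it is shorter than MinRun, while a run of neighboring positive probes survives even if one probe in the middle falls below the threshold.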
encodeAffyRnaHl60SitesHr32 Affy RNA RA 32h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 32hrs) Sites ENCODE Transcript Levels encodeAffyRnaHl60SitesHr08 Affy RNA RA 8h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 8hrs) Sites ENCODE Transcript Levels encodeAffyRnaHl60SitesHr02 Affy RNA RA 2h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 2hrs) Sites ENCODE Transcript Levels encodeAffyRnaHl60SitesHr00 Affy RNA RA 0h Affymetrix PolyA+ RNA (retinoic acid-treated HL-60, 0hrs) Sites ENCODE Transcript Levels encodeAffyRnaHeLaSites Affy RNA HeLa Affymetrix PolyA+ RNA (HeLaS3) Sites ENCODE Transcript Levels encodeAffyRnaGm06990Sites Affy RNA GM06990 Affymetrix PolyA+ RNA (GM06990) Sites ENCODE Transcript Levels affyU133 Affy U133 Alignments of Affymetrix Consensus/Exemplars from HG-U133 Expression Description This track shows the location of the consensus and exemplar sequences used for the selection of probes on the Affymetrix HG-U133A and HG-U133B chips. Methods Consensus and exemplar sequences were downloaded from the Affymetrix Product Support and mapped to the genome using blat followed by pslReps with the parameters: -minCover=0.3 -minAli=0.95 -nearTop=0.005 Credits Thanks to Affymetrix for the data underlying this track. affyU95 Affy U95 Alignments of Affymetrix Consensus/Exemplars from HG-U95 Expression Description This track shows the location of the consensus and exemplar sequences used for the selection of probes on the Affymetrix HG-U95Av2 chip. For this chip, probes are predominantly designed from consensus sequences. Methods Consensus and exemplar sequences were downloaded from the Affymetrix Product Support and mapped to the genome using blat followed by pslReps with the parameters: -minCover=0.3 -minAli=0.95 -nearTop=0.005 Credits Thanks to Affymetrix for the data underlying this track. 
altGraphX Alt-Splicing Alternative Splicing from ESTs/mRNAs mRNA and EST Description This track summarizes alternative splicing shown in the mRNA and EST tracks. The blocks represent exons; lines indicate possible splice junctions. The graphical display is drawn such that no exons overlap, making alternative events easier to view when the track is in full display mode and the resolution is set to approximately gene-level. To help reduce the noise present in the EST libraries, exons and splice junctions are filtered based on orthologous mouse transcripts and the frequency with which an exon or intron appears in human transcript libraries. Only those exons and splice junctions that have an orthologous exon or splice junction in the mouse transcriptome or are present three or more times in the human transcriptome are kept. Transcripts labeled as mRNA in GenBank are weighted more heavily, reflecting their typically higher quality. This process is similar to that presented in Sugnet, C.W. et al., Transcriptome and genome conservation of alternative splicing events in humans and mice. Pacific Symposium on Biocomputing (PSB) 2004 Online Proceedings. Methods The splicing graphs for each genome were generated separately from their native EST and mRNA transcripts using the following process: The mRNAs and ESTs were aligned to the genomic sequence using blat. A near-best-in-genome filter was applied such that only alignments with 97% identity over 90% of the transcript and with a score no more than 0.5% lower than the best score were kept. The ESTs and mRNAs were oriented by examining the splice sites used in the genomic sequence. Only consensus splice sites, GT->AG and the less common GC->AG, were used to orient the transcripts. Alignments were clustered together by sequence overlap in exons. As new splice sites were discovered, they were entered into the graphs as vertices; the exons, introns, and splice junctions connecting them were recorded as edges. 
Each graph was considered to be a single locus, although they may be fragments of an actual gene structure. The supporting mRNA and EST accessions for each edge were also stored. Truncated transcripts were extended by overlap with other transcripts to the next consensus splice site. This prevented the retention of vertices in the graph that were not true splice sites. After the splicing graphs were constructed independently for both human and mouse, they were mapped to each other using the entire set of genome mouse net alignments (viewable on the browser as the Mouse Net track). Only those exons and splice junctions that were common to both or occurred three or more times in the human transcript were kept in the splicing graph. When counting the number of times an exon or splice junction was included in the human transcripts, those designated as mRNA were weighted more heavily than those designated as EST. References For more information on the mouse net alignments, see Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003). Credits This annotation was generated by Chuck Sugnet of the UCSC Genome Bioinformatics Group. gold Assembly Assembly from Fragments Mapping and Sequencing Description This track shows the draft assembly of the human genome. This assembly merges contigs from overlapping drafts and finished clones into longer sequence contigs. The sequence contigs are ordered and oriented when possible by mRNA, EST, paired plasmid reads (from the SNP Consortium) and BAC end sequence pairs. In dense mode, this track depicts the path through the draft and finished clones (aka the golden path) used to create the assembled sequence. Clone boundaries are distinguished by the use of alternating gold and brown coloration. Where gaps exist in the path, spaces are shown between the gold and brown blocks. 
If the relative order and orientation of the contigs between the two blocks are known, a line is drawn to bridge the blocks. Clone Type Key:
  F - Finished
  A - In Active Finishing
  D - Draft
  P - Pre-Draft
  O - Other Sequence
augustusGene AUGUSTUS AUGUSTUS ab initio gene predictions v3.1 Genes and Gene Predictions Description This track shows ab initio predictions from the program AUGUSTUS (version 3.1). The predictions are based on the genome sequence alone. For more information on the different gene tracks, see our Genes FAQ. Methods Statistical signal models were built for splice sites, branch-point patterns, translation start sites, and the poly-A signal. Furthermore, models were built for the sequence content of protein-coding and non-coding regions as well as for the length distributions of different exon and intron types. Detailed descriptions of most of these different models can be found in Mario Stanke's dissertation. This track shows the most likely gene structure according to a Semi-Markov Conditional Random Field model. Alternative splicing transcripts were obtained with a sampling algorithm (--alternatives-from-sampling=true --sample=100 --minexonintronprob=0.2 --minmeanexonintronprob=0.5 --maxtracks=3 --temperature=2). The different models used by Augustus were trained on a number of different species-specific gene sets, which included 1000-2000 training gene structures. The --species option allows one to choose the species used for training the models. Different training species were used for the --species option when generating these predictions for different groups of assemblies.
  Assembly Group                     Training Species
  Fish                               zebrafish
  Birds                              chicken
  Human and all other vertebrates    human
  Nematodes                          caenorhabditis
  Drosophila                         fly
  A. mellifera                       honeybee1
  A. gambiae                         culex
  S. cerevisiae                      saccharomyces
This table describes which training species was used for a particular group of assemblies. When available, the closest related training species was used.
Credits Thanks to the Stanke lab for providing the AUGUSTUS program. The training for the chicken version was done by Stefanie König and the training for the human and zebrafish versions was done by Mario Stanke. References Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008 Mar 1;24(5):637-44. PMID: 18218656 Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. PMID: 14534192 bacEndPairs BAC End Pairs BAC End Pairs Mapping and Sequencing Description Bacterial artificial chromosomes (BACs) are a key part of many large-scale sequencing projects. A BAC typically consists of 25 - 350 kb of DNA. During the early phase of a sequencing project, it is common to sequence a single read (approximately 500 bases) off each end of a large number of BACs. Later on in the project, these BAC end reads can be mapped to the genome sequence. This track shows these mappings in cases where both ends could be mapped. These BAC end pairs can be useful for validating the assembly over relatively long ranges. In some cases, the BACs are useful biological reagents. This track can also be used for determining which BAC contains a given gene, useful information for certain wet lab experiments. A valid pair of BAC end sequences must be at least 25 kb but no more than 350 kb away from each other. The orientation of the first BAC end sequence must be "+" and the orientation of the second BAC end sequence must be "-". The scoring scheme used for this annotation assigns 1000 to an alignment when the BAC end pair aligns to only one location in the genome (after filtering). When a BAC end pair or clone aligns to multiple locations, the score is calculated as 1500/(number of alignments). Methods BAC end sequences are placed on the assembled sequence using Jim Kent's blat program. 
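The pairing rule and scoring scheme described above for BAC end pairs can be illustrated with a short Python sketch. The class and function names are hypothetical. Two details are assumptions because the description does not specify them: the distance is measured from the start of the first end alignment to the end of the second, and both ends are required to lie on the same chromosome.

```python
from dataclasses import dataclass

@dataclass
class EndAlignment:
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"

def is_valid_pair(first, second, min_dist=25_000, max_dist=350_000):
    """True if two BAC end alignments form a valid pair: first end on "+",
    second end on "-", and 25 kb to 350 kb apart (assumed same chromosome)."""
    if first.chrom != second.chrom:
        return False
    if first.strand != "+" or second.strand != "-":
        return False
    distance = second.end - first.start
    return min_dist <= distance <= max_dist

def pair_score(n_alignments):
    """1000 for a uniquely placed pair, otherwise 1500 / (number of alignments)."""
    if n_alignments < 1:
        raise ValueError("n_alignments must be at least 1")
    return 1000.0 if n_alignments == 1 else 1500.0 / n_alignments
```

Under this scheme a pair that aligns to two locations scores 750 and one that aligns to three scores 500, so the score falls off quickly as placement becomes ambiguous.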
Credits Additional information about the clone, including how it can be obtained, may be found at the NCBI Clone Registry. To view the registry entry for a specific clone, open the details page for the clone and click on its name at the top of the page. Some BAC library clones (RPCI-11, and others) can be ordered from BACPAC Genomics, RIKEN, or from Thermofisher and possibly other companies. encodeBuFirstExon BU First Exon Boston University First Exon Activity ENCODE Transcript Levels Description This track displays expression levels of computationally identified first exons and a constitutive exon of genes in ENCODE regions, based on the real competitive Polymerase Chain Reaction (rcPCR) technique described in Ding et al. (2003). Expression levels are indicated by color, ranging from black (no expression) to red (high expression). Experiments were performed on total RNA samples of ten normal human tissues purchased from Clontech (Palo Alto, CA): cerebral cortex, colon, heart, kidney, liver, lung, skeletal muscle, spleen, stomach, and testis. The name for each alternative transcript starts with the gene name, followed by an identifier for the alternative first exon or the constitutive exon. For example, for gene CAV1, there are three alternative first exons (CAV1-E1A, CAV1-E1B, and CAV1-E1C) and the third exon is chosen as the constitutively expressed exon (CAV1-E3). Methods Alternative transcription start sites (TSS) for 20 ENCODE genes were predicted using PromoSer, an in-house computational tool. PromoSer computationally identifies the TSS by considering alignments of a large number of partial and full-length mRNA sequences and ESTs to genomic DNA, with provision for alternative promoters. 
In PromoSer, the treatment of alternative first exons (or the resulting TSSs) is as follows: (1) all transcripts (mRNA, full-length mRNA and EST) from the same gene cluster are examined; (2) individual ESTs are not considered for alternative TSSs; only the 5'-most positions from all ESTs in the cluster are considered a potential TSS; (3) if multiple 5'-end positions are more than 20 bp apart, they are reported as alternative TSSs. For each gene, all alternative first exons were identified based on manual selection of PromoSer predictions. An exon that is shared by all transcripts (called the constitutive exon) was also selected. The selection process involved visually examining the structure of the cluster, preferably using the latest data available on UCSC, to identify distinct first exons that were well formed (having multiple supporting sequences) and had no evidence (especially from newer sequences) of additional sequence that made them internal exons. After the first exon was identified, a subsequence (100-300 bases) was selected for use in the experiment. The selection process avoided repeat sequences as much as possible, and if two first exons partially overlapped, the non-overlapping region was selected. If those conditions caused the remaining sequence to be too short (or the first exon itself was too short), a junction with the second exon was used. A constitutive exon was also selected that was included in all (or most) of the alternative transcripts and suitable sequences were then extracted as above (no exon junctions are used). The absolute expression levels of all exons were individually quantified by rcPCR by designing four assays with PCR amplicons corresponding to each exon. Amplicons were designed according to transcript sequences and can span a large distance on the genomic sequence.
In addition, some amplicons were designed across the junctions between first exons and the constitutive second exons, and thus these amplicons may overlap with the amplicons that correspond to the constitutive second exons. The rcPCR technique combined competitive PCR and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) for gene expression analysis. To measure the expression level of a gene, an oligonucleotide standard (60-80 bases) of known concentration, complementary to the target sequence with a single base mismatch in the middle, was added as the competitor for PCR. The gene of interest and the oligonucleotide standard resembled two alleles of a heterozygous locus in an allele frequency analysis experiment, and thus could be quantified by the high-throughput MALDI-TOF MS based MassARRAY system (Sequenom Inc.). After PCR, a base extension reaction was carried out with an extension primer, ThermoSequenase, and a mixture of ddNTPs/dNTP (for example, a mixture of ddA, ddC, ddT, and dG). The extension primer annealed to the sequence immediately 5' upstream of the mismatch position. Depending on the nature of the mismatch and the mixture composition of ddNTPs/dNTP, one or two bases were added to the extension primer, producing two extension products with one base-length difference. These two extension products were then detected and quantified by MALDI-TOF MS. Expression ratios (e.g. CAV1-E1A/CAV1-E3, CAV1-E1B/CAV1-E3, CAV1-E1C/CAV1-E3) indicate the relative abundance of alternative first exons. 18S rRNA was used to normalize absolute exon expression among different tissues. Values shown on this track represent the relative abundance of the alternative first exons with respect to the 18S rRNA. The raw values have been log10 transformed and scaled to show graded colors on the browser. Verification One biological replicate was performed for each gene.
Two to four competitor concentrations were used to detect the expression level of each exon. Two to six technical replicates were performed for each competitor concentration. One more biological replicate will be performed in the future. Credits Data generation and analysis for this track were performed by ZLAB at Boston University. The following people contributed: Shengnan Jin, Anason Halees, Heather Burden, Yutao Fu, Ulas Karaoz, Yong Yu, Chunming Ding, Charles R. Cantor, and Zhiping Weng. References Ding, C. and Cantor, C.R. A high-throughput gene expression analysis technique using competitive PCR and matrix-assisted laser desorption ionization time-of-flight MS. Proc Natl Acad Sci U S A 100(6), 3059-64 (2003). Ding, C. and Cantor, C.R. Direct molecular haplotyping of long-range genomic DNA with M1-PCR. Proc Natl Acad Sci U S A 100(13), 7449-53 (2003). Halees, A.S., Leyfer, D. and Weng, Z. PromoSer: A large-scale mammalian promoter and transcription start site identification service. Nucleic Acids Res. 31(13), 3554-9 (2003). Halees, A.S. and Weng, Z. PromoSer: improvements to the algorithm, visualization and accessibility. Nucleic Acids Res., 32, W191-W194 (2004). encodeBuFirstExonTestis BU Testis Boston University First Exon Activity in Testis ENCODE Transcript Levels encodeBuFirstExonStomach BU Stomach Boston University First Exon Activity in Stomach ENCODE Transcript Levels encodeBuFirstExonSpleen BU Spleen Boston University First Exon Activity in Spleen ENCODE Transcript Levels encodeBuFirstExonSkMuscle BU Skel. 
Muscle Boston University First Exon Activity in Skeletal Muscle ENCODE Transcript Levels encodeBuFirstExonLung BU Lung Boston University First Exon Activity in Lung ENCODE Transcript Levels encodeBuFirstExonLiver BU Liver Boston University First Exon Activity in Liver ENCODE Transcript Levels encodeBuFirstExonKidney BU Kidney Boston University First Exon Activity in Kidney ENCODE Transcript Levels encodeBuFirstExonHeart BU Heart Boston University First Exon Activity in Heart ENCODE Transcript Levels encodeBuFirstExonColon BU Colon Boston University First Exon Activity in Colon ENCODE Transcript Levels encodeBuFirstExonCerebrum BU Cere. Cortex Boston University First Exon Activity in Cerebral Cortex ENCODE Transcript Levels encodeBu_ORChID1 BU ORChID Boston University ORChID (OH Radical Cleavage Intensity Database) ENCODE Chromosome, Chromatin and DNA Structure Description This track displays the predicted hydroxyl radical cleavage intensity on naked DNA for each nucleotide in the ENCODE regions. Because the hydroxyl radical cleavage intensity is proportional to the solvent accessible surface area of the deoxyribose hydrogen atoms (Balasubramanian et al., 1998), this track represents a structural profile of the DNA in the ENCODE regions. Please visit the ORChID website maintained by the Tullius group for access to experimental hydroxyl radical cleavage data, and to a server which can be used to predict the cleavage pattern for any input sequence. Display Conventions and Configuration This track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page. For more information, click the Graph configuration help link. Methods Hydroxyl radical cleavage intensity predictions were performed using an in-house sliding trimer window (STW) algorithm. 
This algorithm draws data from the ·OH Radical Cleavage Intensity Database (ORChID), which contains more than 150 experimentally determined cleavage patterns. These predictions are fairly accurate, with a Pearson coefficient of ~0.85 between the predicted and experimentally determined cleavage intensities. For more details on the hydroxyl radical cleavage method, see the References section below. Verification The STW algorithm has been cross-validated by removing each test sequence from the training set and performing a prediction. The mean correlation coefficient (between predicted and experimental cleavage patterns) from this study was 0.85. Credits These data were generated through the combined effort of Bo Pang at MIT and Jason Greenbaum, Steve Parker, and Tom Tullius of Boston University. References Balasubramanian, B., Pogozelski, W.K., and Tullius, T.D. DNA strand breaking by the hydroxyl radical is governed by the accessible surface areas of the hydrogen atoms of the DNA backbone. Proc. Natl. Acad. Sci. USA 95(17), 9738-9743 (1998). Price, M. A., and Tullius, T. D. Using the Hydroxyl Radical to Probe DNA Structure. Meth. Enzymol. 212, 194-219 (1992). Tullius, T. D. Probing DNA Structure with Hydroxyl Radicals. In Current Protocols in Nucleic Acid Chemistry, (eds. Beaucage, S.L., Bergstrom, D.E., Glick, G.D. and Jones, R.A.) (Wiley, 2001), pp. 6.7.1-6.7.8. cytoBand Chromosome Band Chromosome Bands Localized by FISH Mapping Clones Mapping and Sequencing Description The chromosome band track represents the approximate location of bands seen on Giemsa-stained chromosomes. Chromosomes are displayed in the browser with the short arm first. Cytologically identified bands on the chromosome are numbered outward from the centromere on the short (p) and long (q) arms. At low resolution, bands are classified using the nomenclature [chromosome][arm][band], where band is a single digit. Examples of bands on chromosome 3 include 3p2, 3p1, cen, 3q1, and 3q2. 
At a finer resolution, some of the bands are subdivided into sub-bands, adding a second digit to the band number, e.g. 3p26. This resolution produces about 500 bands. A final subdivision into a total of 862 sub-bands is made by adding a period and another digit to the band, resulting in 3p26.3, 3p26.2, etc. Methods A full description of the method by which the chromosome band locations are estimated can be found in Furey and Haussler, 2003. Barbara Trask, Vivian Cheung, Norma Nowak and others in the BAC Resource Consortium used fluorescent in-situ hybridization (FISH) to determine a cytogenetic location for large genomic clones on the chromosomes. The results from these experiments are the primary source of information used in estimating the chromosome band locations. For more information about the process, see Cheung et al., 2001, and the accompanying web site, Human BAC Resource. BAC clone placements in the human sequence are determined at UCSC using a combination of full BAC clone sequence, BAC end sequence, and STS marker information. Credits We would like to thank all the labs that have contributed to this resource: Fred Hutchinson Cancer Research Center (FHCRC), National Cancer Institute (NCI), Roswell Park Cancer Institute (RPCI), The Wellcome Trust Sanger Institute (SC), Cedars-Sinai Medical Center (CSMC), Los Alamos National Laboratory (LANL), and UC San Francisco Cancer Center (UCSF). References Cheung VG, Nowak N, Jang W, Kirsch IR, Zhao S, Chen XN, Furey TS, Kim UJ, Kuo WL, Olivier M et al. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature. 2001 Feb 15;409(6822):953-8. PMID: 11237021 Furey TS, Haussler D. Integration of the cytogenetic map with the draft human genome sequence. Hum Mol Genet. 2003 May 1;12(9):1037-44.
PMID: 12700172 cytoBandIdeo Chromosome Band (Ideogram) Chromosome Bands Localized by FISH Mapping Clones (for Ideogram) Mapping and Sequencing encodeAllElements Consens Elements NHGRI/PSU/UCSC/Stanford TBA and MLAGAN Consensus Conserved Elements ENCODE Comparative Genomics Description These tracks represent conserved elements detected by any (union) or all (intersection) combinations of elements produced by binCons, phastCons, and GERP conservation scoring methods applied to TBA and MLAGAN sequence alignments of 23 vertebrates in the ENCODE regions. For more information on the individual subtracks, see the description pages for the TBA Elements and MLAGAN Elements tracks. Display Conventions and Configuration The locations of conserved elements are indicated by blocks in the graphical display. The display may be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. To show only selected subtracks within this annotation, uncheck the boxes next to the tracks you wish to hide. Methods In these annotations, "non-coding" refers to those regions not overlapping with CDS regions in any of the following UCSC Genome Browser tables: refFlat, knownGene, mgcGenes, vegaGene, or ensGene. See the description pages for the TBA Elements and MLAGAN Elements for additional information about methods used to generate these data. Verification See the description pages for the TBA Elements and MLAGAN Elements for information about verification techniques used to generate these data. Credits BinCons and phastCons MCS data were contributed by Elliott Margulies in the Eric Green lab at NHGRI, with assistance from Adam Siepel of UCSC.
GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). The intersection and union data shown in these subtracks were contributed by Elliott Margulies. References See the TBA/MLAGAN Alignment and TBA/MLAGAN Cons tracks for references. encodeAllNcIntersectEl NC Intersect TBA and MLAGAN PhastCons/BinCons/GERP Intersection NonCoding Conserved Elements ENCODE Comparative Genomics encodeAllIntersectEl Intersect TBA and MLAGAN PhastCons/BinCons/GERP Intersection Conserved Elements ENCODE Comparative Genomics encodeAllNcUnionEl NC Union TBA and MLAGAN PhastCons/BinCons/GERP Union NonCoding Conserved Elements ENCODE Comparative Genomics encodeAllUnionEl Union TBA and MLAGAN PhastCons/BinCons/GERP Union Conserved Elements ENCODE Comparative Genomics encodeWorkshopSelections Consens Unann TF ENCODE Consensus Unannotated Transfrags ENCODE Analysis Description This annotation is a collection of selected non-coding Affy transcribed fragments (transfrags) and Yale transcriptionally-active regions (TARs) from the Affy Transfrags and Yale TAR annotation tracks, grouped into subsets based on their proximity to the transcript-coding and exon-coding regions of GenCode genes. Each subtrack shows the union of data from the eleven related individual subtracks in the Unann Transfrags track. Consensus Intronic Proximal: shows all transfrags/TARs within Gencode gene introns that are proximal to a Gencode gene exon, i.e. are within a Gencode gene transcript-coding region and are within 5000 bases of a Gencode gene exon-coding region. Consensus Intronic Distal: shows all transfrags/TARs within Gencode gene introns that are not near an exon, i.e. 
are within a Gencode gene transcript-coding region, but at least 5000 base pairs away from any Gencode gene exon-coding region. Consensus Intergenic Proximal: shows all transfrags/TARs in the ENCODE intergenic regions that are proximal to a Gencode gene exon, i.e. are outside the Gencode gene transcript-coding regions, but are within 5000 base pairs of a Gencode gene exon-coding region. Consensus Intergenic Distal: shows all transfrags/TARs in the ENCODE intergenic regions that are not near an exon, i.e. are outside the Gencode gene transcript-coding regions and at least 5000 base pairs away from any of the Gencode gene exon-coding regions. Display Conventions and Configuration This annotation track is comprised of subtracks that show different aspects of the displayed data. The complete list of subtracks is displayed near the top of the track description page. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods This annotation track was derived from data contained in the Yale TAR, Affy Transfrags, and Gencode Genes annotation tracks using the UCSC Table Browser. See the description pages associated with these annotation tracks for more information about the data sets. Credits This annotation was generated by Hiram Clawson of the UCSC Genome Bioinformatics Group. encodeAllIntergenicDistal Intergenic Dist Consensus Intergenic Distal ENCODE Analysis encodeAllIntergenicProximal Intergenic Prox Consensus Intergenic Proximal ENCODE Analysis encodeAllIntronsDistal Intronic Dist Consensus Intronic Distal ENCODE Analysis encodeAllIntronsProximal Intronic Prox Consensus Intronic Proximal ENCODE Analysis clonePos Coverage Clone Coverage Mapping and Sequencing Description In dense display mode, this track shows the coverage level of the genome. Finished regions are depicted in black. Draft regions are shown in various shades of gray that correspond to the level of coverage. 
In full display mode, this track shows the position of each clone that aligns to the genome sequence. Finished clones are depicted in black, and unfinished clones are colored gray. NOTE: Fragment positions in unfinished clones are no longer delineated. cpgIsland CpG Islands CpG Islands (Islands < 300 Bases are Light Green) Regulation Description CpG islands are associated with genes, particularly housekeeping genes, in vertebrates. CpG islands are typically common near transcription start sites, and may be associated with promoter regions. Normally a C (cytosine) base followed immediately by a G (guanine) base (a CpG) is rare in vertebrate DNA because the Cs in such an arrangement tend to be methylated. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time methylated Cs tend to turn into Ts because of spontaneous deamination. The result is that CpGs are relatively rare unless there is selective pressure to keep them or a region is not methylated for some reason, perhaps having to do with the regulation of gene expression. CpG islands are regions where CpGs are present at significantly higher levels than is typical for the genome as a whole. Methods CpG islands are predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment is then evaluated for the following criteria: GC content of roughly 50% or greater; length greater than 200 bases; ratio greater than 0.6 of the observed number of CG dinucleotides to the expected number on the basis of the GC content of the segment. The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. Credits This track was generated using a modification of a program developed by G. Micklem and L. Hillier.
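The three evaluation criteria and the two reported statistics from the CpG island Methods above can be illustrated with a short Python sketch. It is not the program used to build the track. In particular, the expected CG count is computed here as (number of C x number of G) / length, a common convention that is an assumption on our part, and the function names are hypothetical.

```python
# Minimal sketch of the CpG island criteria described above.
# Assumption: expected CG count = (#C * #G) / length; names are hypothetical.

def cpg_island_stats(seq):
    """Return length, GC content, CpG count, obs/exp ratio and percentage CpG."""
    seq = seq.upper()
    length = len(seq)
    num_c = seq.count("C")
    num_g = seq.count("G")
    cpg_count = seq.count("CG")          # number of CG dinucleotides
    expected = (num_c * num_g) / length if length else 0.0
    return {
        "length": length,
        "gc_content": (num_c + num_g) / length if length else 0.0,
        "cpg_count": cpg_count,
        "obs_exp": cpg_count / expected if expected else 0.0,
        # CpG nucleotide bases (twice the CpG count) relative to the length
        "perc_cpg": 100.0 * 2 * cpg_count / length if length else 0.0,
    }


def passes_island_criteria(seq):
    """Apply the GC content, length and obs/exp thresholds from the text."""
    stats = cpg_island_stats(seq)
    return (
        stats["gc_content"] >= 0.5
        and stats["length"] > 200
        and stats["obs_exp"] > 0.6
    )
```

For example, a 300-base segment consisting only of repeated CG passes all three tests, while a 300-base AT repeat fails on GC content.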
ECgene ECgene Genes ECgene Gene Predictions with Alt-Splicing Genes and Gene Predictions Description ECgene (gene prediction by EST clustering) predicts genes by combining genome-based EST clustering and transcript assembly methods. The EST clustering is based on genomic alignment of mRNA and ESTs similar to that of NCBI's UniGene for the human genome. The transcript assembly procedure yields gene models for each cluster that include alternative splicing variants. This algorithm was developed by Prof. Sanghyuk Lee's Lab of Bioinformatics at Ewha Womans University in Seoul, Korea. For more detailed information, see the ECgene website. Display Conventions This track follows the display conventions for gene prediction tracks. Methods The following is a brief summary of the ECgene algorithm: Genomic alignment of mRNA and ESTs: Input sequences are aligned against the genome using the Blat program developed by Jim Kent. Blat alignments are corrected for valid splice sites, and the SIM4 program is used for suspicious alignments if necessary. Sequences that share more than one splice site are clustered together. This produces the primary clusters without unspliced sequences (singletons). The genomic alignment of exons in each spliced sequence is represented as a directed acyclic graph (DAG), and all possible gene models are derived by the depth-first-search (DFS) method. Sequences compatible with each gene model are grouped together as sub-clusters. Gene models without sufficient evidence are discarded at this stage. Sensitive detection of polyA tails is achieved by analyzing genomic alignment of mRNA and EST sequences, and specifically used to determine the gene boundary. Finally, unspliced sequences are added so as not to change the splice sites of the existing gene model. Credits The predictions for this track were produced by Namshin Kim and Sanghyuk Lee at Ewha Womans University, Seoul, Korea.
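The DAG and depth-first-search step in the ECgene summary above can be illustrated with a toy example. The sketch below is not ECgene's implementation. It assumes exons are graph nodes and observed splice junctions are directed edges, and it enumerates every path from a first exon to a last exon as a candidate gene model. The function name and exon labels are hypothetical.

```python
# Toy sketch: enumerate candidate gene models as source-to-sink paths in a DAG.
# Assumption: nodes are exons, edges are observed splice junctions.

def enumerate_gene_models(junctions):
    """Return all exon paths from an exon with no incoming edge to one with no outgoing edge."""
    graph = {}
    nodes = set()
    has_incoming = set()
    for upstream, downstream in junctions:
        graph.setdefault(upstream, []).append(downstream)
        nodes.update((upstream, downstream))
        has_incoming.add(downstream)

    models = []

    def dfs(exon, path):
        path.append(exon)
        successors = graph.get(exon, [])
        if not successors:
            models.append(tuple(path))   # reached a terminal exon
        else:
            for next_exon in successors:
                dfs(next_exon, path)
        path.pop()                       # backtrack

    for first_exon in sorted(nodes - has_incoming):
        dfs(first_exon, [])
    return models


# Exon E1 splices to either E2 or E3, and both splice to E4.
models = enumerate_gene_models(
    [("E1", "E2"), ("E1", "E3"), ("E2", "E4"), ("E3", "E4")]
)
```

Here `models` holds two alternative transcripts, E1-E2-E4 and E1-E3-E4. A real pipeline would then discard models that lack sufficient supporting sequences, as the summary describes.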
encodeRegions ENCODE Regions Encyclopedia of DNA Elements (ENCODE) Regions ENCODE Regions and Genes Description This track depicts target regions for the NHGRI ENCODE project. The long-term goal of this project is to identify all functional elements in the human genome sequence to facilitate a better understanding of human biology and disease. During the pilot phase, 44 regions comprising 30 Mb — approximately 1% of the human genome — have been selected for intensive study to identify, locate and analyze functional elements within the regions. These targets are being studied by a diverse public research consortium to test and evaluate the efficacy of various methods, technologies, and strategies for locating genomic features. The outcome of this initial phase will form the basis for a larger-scale effort to analyze the entire human genome. See the NHGRI target selection process web page for a description of how the target regions were selected. To open a UCSC Genome Browser with a menu for selecting ENCODE regions on the human genome, use ENCODE Regions in the UCSC Browser. The UCSC resources provided for the ENCODE project are described on the UCSC ENCODE Portal. Credits Thanks to the NHGRI ENCODE project for providing this initial set of data. ensEstGene Ensembl EST Genes Ensembl EST Gene Predictions Genes and Gene Predictions Description Gene predictions from Ensembl based on ESTs. Methods ESTs were mapped onto the genome using a combination of Exonerate, Blast and Est_Genome, with a threshold defined as an overall percentage identity of 90% and at least one exon having a percentage identity of 97% or higher. The results were processed by merging the redundant ESTs and setting splice sites to the most common ends, resulting in alternative spliced forms. This evidence was processed by Genomewise, which finds the longest ORF and assigns 5' and 3' UTRs. 
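The EST mapping threshold described above combines two conditions, and a small Python sketch makes the logic explicit. It assumes that "an overall percentage identity of 90%" means 90% or higher. The function name and arguments are hypothetical and are not part of the Ensembl pipeline.

```python
# Minimal sketch of the EST alignment threshold described above.
# Assumption: the overall identity must be at least 90%; names are hypothetical.

def passes_est_threshold(overall_identity, exon_identities,
                         min_overall=90.0, min_best_exon=97.0):
    """Keep an EST alignment if it is >= 90% identical overall and has
    at least one exon with >= 97% identity."""
    has_good_exon = any(pid >= min_best_exon for pid in exon_identities)
    return overall_identity >= min_overall and has_good_exon
```

An alignment with 92% overall identity and exon identities of 95% and 98% would be kept, whereas the same alignment with exons at 95% and 96% would be rejected.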
Track Configuration This track has an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu at the top of the track description page. This page is accessed via the small button to the left of the track's graphical display or through the link on the track's control menu. Click here for more information about this feature. After you have made your configuration selections, click the Submit button to return to the tracks display page. Credits Thanks to Ensembl for providing this annotation. ensGene Ensembl Genes Ensembl Genes Genes and Gene Predictions Description These gene predictions were generated by Ensembl. For more information on the different gene tracks, see our Genes FAQ. Methods For a description of the methods used in Ensembl gene predictions, please refer to Hubbard et al. (2002), also listed in the References section below. Data access Ensembl Gene data can be explored interactively using the Table Browser or the Data Integrator. For local downloads, the genePred format files for hg16 are available in our downloads directory as ensGene.txt.gz or in our genes download directory in GTF format. For programmatic access, the data can be queried from the REST API or directly from our public MySQL servers. Instructions on this method are available on our MySQL help page and on our blog. Previous versions of this track can be found on our archive download server. Credits We would like to thank Ensembl for providing these gene annotations. For more information, please see Ensembl's genome annotation page. References Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al. The Ensembl genome database project. Nucleic Acids Res. 2002 Jan 1;30(1):38-41. 
PMID: 11752248; PMC: PMC99161 evofold EvoFold EvoFold Predictions of RNA Secondary Structure Genes and Gene Predictions Description This track shows RNA secondary structure predictions made with the EvoFold program, a comparative method that exploits the evolutionary signal of genomic multiple-sequence alignments for identifying conserved functional RNA structures. Display Conventions and Configuration Track elements are labeled using the convention ID_strand_score. When zoomed out beyond the base level, secondary structure prediction regions are indicated by blocks, with the stem-pairing regions shown in a darker shade than unpaired regions. Arrows indicate the predicted strand. When zoomed in to the base level, the specific secondary structure predictions are shown in parenthesis format. The confidence score for each position is indicated in grayscale, with darker shades corresponding to higher scores. The details page for each track element shows the predicted secondary structure (labeled SS anno), together with details of the multiple species alignments at that location. Substitutions relative to the human sequence are color-coded according to their compatibility with the predicted secondary structure (see the color legend on the details page). Each prediction is assigned an overall score and a sequence of position-specific scores. The overall score measures evidence for any functional RNA structures in the given region, while the position-specific scores (0 - 9) measure the confidence of the base-specific annotations. Base-pairing positions are annotated with the same pair symbol. The offsets are provided to ease visual navigation of the alignment in terms of the human sequence. The offset is calculated (in units of ten) from the start position of the element on the positive strand or from the end position when on the negative strand. The graphical display may be filtered to show only those track elements with scores that meet or exceed a certain threshold.
To set a threshold, type the minimum score into the text box at the top of the description page. Methods EvoFold makes use of phylogenetic stochastic context-free grammars (phylo-SCFGs), which are combined probabilistic models of RNA secondary structure and primary sequence evolution. The predictions consist of both a specific RNA secondary structure and an overall score. The overall score is essentially a log-odds score between a phylo-SCFG modeling the constrained evolution of stem-pairing regions and one which only models unpaired regions. The predictions for this track were based on the conserved elements of an 8-way vertebrate alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebrafish, and Fugu assemblies. NOTE: These predictions were originally computed on the hg17 (May 2004) human assembly, from which the hg16 (July 2003), hg18 (May 2006), and hg19 (Feb 2009) predictions were lifted. As a result, the multiple alignments shown on the track details pages may differ from the 8-way alignments used for their prediction. Additionally, some weak predictions have been eliminated from the set displayed on hg18 and hg19. The hg17 prediction set corresponds exactly to the set analyzed in the EvoFold paper referenced below. Credits The EvoFold program and browser track were developed by Jakob Skou Pedersen of the UCSC Genome Bioinformatics Group, now at Aarhus University, Denmark. The RNA secondary structure is rendered using the VARNA Java applet. References EvoFold Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006 Apr;2(4):e33. PMID: 16628248; PMC: PMC1440920 Phylo-SCFGs Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999 Jun;15(6):446-54.
PMID: 10383470 Pedersen JS, Meyer IM, Forsberg R, Simmonds P, Hein J. A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res. 2004;32(16):4925-36. PMID: 15448187; PMC: PMC519121 PhastCons Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. PMID: 16024819; PMC: PMC1182216 exoniphy Exoniphy Exoniphy Human/Mouse/Rat/Dog Genes and Gene Predictions Description The exoniphy program identifies evolutionarily conserved protein-coding exons in multiple, aligned sequences using a phylogenetic hidden Markov model (phylo-HMM), a kind of statistical model that simultaneously describes exon structure and exon evolution. This track shows exoniphy predictions for the human Jul. 2003 (hg16), mouse Feb. 2003 (mm3), and rat Jun. 2003 (rn3) genomes, as aligned by the multiz program. Methods Exoniphy is described in Siepel A and Haussler D (2004), "Computational identification of evolutionarily conserved exons," RECOMB '04. Multiz is described in Blanchette M et al. (2004), "Aligning multiple genomic sequences with the threaded blockset aligner," Genome Res. 14:708-715. softberryGene Fgenesh++ Genes Fgenesh++ Gene Predictions Genes and Gene Predictions Description Fgenesh++ predictions are based on Softberry's gene-finding software. Methods Fgenesh++ uses both hidden Markov models (HMMs) and protein similarity to find genes in a completely automated manner. For more information, see Solovyev, V.V. (2001) in the References section below. Credits The Fgenesh++ gene predictions were produced by Softberry Inc. Commercial use of these predictions is restricted to viewing in this browser. Please contact Softberry Inc. to make arrangements for further commercial access. References Solovyev, V.V.
"Statistical approaches in Eukaryotic gene prediction" in the Handbook of Statistical Genetics (ed. Balding, D. et al.), 83-127. John Wiley & Sons, Ltd. (2001). firstEF FirstEF FirstEF: First-Exon and Promoter Prediction Regulation Description This track shows predictions from the FirstEF (First Exon Finder) program. Three types of predictions are displayed: exon, promoter and CpG window. If two consecutive predictions are separated by less than 1000 bp, FirstEF treats them as one cluster of alternative first exons that may belong to same gene. The cluster number is displayed in the parentheses of each item. For example, "exon(405-)" represents the exon prediction in cluster number 405 on the minus strand. The exon, promoter and CpG-window are interconnected by this cluster number. Alternative predictions within the same cluster are denoted by "#N" where "N" is the serial number of an alternative prediction in the cluster. Each predicted exon is either CpG-related or non-CpG-related, based on a score of the frequency of CpG dinucleotides. An exon is classified as CpG-related if the CpG score is greater than a threshold value, and non-CpG-related if less than the threshold. If an exon is CpG-related, its associated CpG-window is displayed. The browser displays features with higher scores in darker shades of gray/black. Method FirstEF is a 5' terminal exon and promoter prediction program. It consists of different discriminant functions structured as a decision tree. The probabilistic models are optimized to find potential first donor sites and CpG-related and non-CpG-related promoter regions based on discriminant analysis. For every potential first donor site (GT) and an upstream promoter region, FirstEF decides whether or not the intermediate region can be a potential first exon, based on a set of quadratic discriminant functions. FirstEF calculates the a posteriori probabilities of exon, donor, and promoter for a given GT and an upstream window of length 570 bp. 
For a description of the FirstEF program and the underlying classification models, refer to Davuluri et al., 2001. Credits The predictions for this track are produced by Ramana V. Davuluri of Ohio State University and Ivo Grosse and Michael Q. Zhang of Cold Spring Harbor Lab. References Davuluri RV, Grosse I, Zhang MQ. Computational identification of promoters and first exons in the human genome. Nat Genet. 2001 Dec;29(4):412-7. fishClones FISH Clones Clones Placed on Cytogenetic Map Using FISH Mapping and Sequencing Description This track shows the location of fluorescent in situ hybridization (FISH)-mapped clones along the assembly sequence. The locations of these clones were obtained from the NCBI Human BAC Resource here. Earlier versions of this track obtained this information directly from the paper Cheung, et al. (2001). More information about the BAC clones, including how they may be obtained, can be found at the Human BAC Resource and the Clone Registry web sites hosted by NCBI. To view Clone Registry information for a clone, click on the clone name at the top of the details page for that item. Using the Filter This track has a filter that can be used to change the color or include/exclude the display of a dataset from an individual lab. This is helpful when many items are shown in the track display, especially when only some are relevant to the current task. The filter is located at the top of the track description page, which is accessed via the small button to the left of the track's graphical display or through the link on the track's control menu. To use the filter: In the pulldown menu, select the lab whose data you would like to highlight or exclude in the display. Choose the color or display characteristic that will be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display clones from the lab selected in the pulldown list. 
If "include" is selected, the browser will display clones only from the selected lab. When you have finished configuring the filter, click the Submit button. Credits We would like to thank all of the labs that have contributed to this resource: Fred Hutchinson Cancer Research Center (FHCRC) National Cancer Institute (NCI) Roswell Park Cancer Institute (RPCI) The Wellcome Trust Sanger Institute (SC) Cedars-Sinai Medical Center (CSMC) Los Alamos National Laboratory (LANL) UC San Francisco Cancer Center (UCSF) References Cheung VG, Nowak N, Jang W, Kirsch IR, Zhao S, Chen XN, Furey TS, Kim UJ, Kuo WL, Olivier M et al. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature. 2001 Feb 15;409(6822):953-8. PMID: 11237021 fosEndPairs Fosmid End Pairs Fosmid End Pairs Mapping and Sequencing Description A valid pair of fosmid end sequences must be at least 30 kb but no more than 50 kb away from each other. The orientation of the first fosmid end sequence must be "+" and the orientation of the second fosmid end sequence must be "-". Note: For hg19 and hg18 assemblies, the Fosmid End Pairs track is a main track under the "Mapping and Sequencing" track category. On the hg38 assembly, the FOSMID End Pairs track is a subtrack within the Clone Ends track under the "Mapping and Sequencing" track category. Under the list of subtracks on the Clone Ends Track Settings page, the FOSMID End Pairs track is now named "WIBR-2 Fosmid library." With the WIBR-2 Fosmid library track setting on full, individual clone end mapping items are listed in the browser; click into any item to see details from NCBI. Methods End sequences were trimmed at the NCBI using ssahaCLIP written by Jim Mullikin. Trimmed fosmid end sequences were placed on the assembled sequence using Jim Kent's blat program. Credits Sequencing of the fosmid ends was done at the Eli & Edythe L. Broad Institute of MIT and Harvard University. 
Clones are available through the BACPAC Resources Center at Children's Hospital Oakland Research Institute (CHORI). gc5Base GC Percent Percentage GC in 5-Base Windows Mapping and Sequencing Description The GC percent track shows the percentage of G (guanine) and C (cytosine) bases in 5-base windows. High GC content is typically associated with gene-rich areas. This track may be configured in a variety of ways to highlight different aspects of the displayed information. Click the "Graph configuration help" link for an explanation of the configuration options. Credits The data and presentation of this graph were prepared by Hiram Clawson. encodeGencodeGene Gencode Genes Gencode Gene Annotations ENCODE Regions and Genes Description The Gencode Gene track shows high-quality manual annotations in the ENCODE regions generated by the GENCODE project. A companion track, Gencode Introns, shows experimental gene structure validations for these annotations. The gene annotations are colored based on the Havana annotation type. Known and validated transcripts are colored dark green, putative and unconfirmed are light green, pseudogenes are blue, and artifacts are grey. The transcript types are defined in more detail in the accompanying table. The Gencode project recommends that the annotations with known and validated transcripts, i.e., the types Known, Novel_CDS, Novel_transcript_gencode_conf, and Putative_gencode_conf (which are colored dark green in the track display), be used as the reference annotation.
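The recommendation above amounts to a simple filter on the annotation type. The following is a minimal sketch of that filter, not part of any GENCODE or UCSC tool: the function name `reference_annotations` and the (name, type) tuple layout are assumptions made for illustration, and the transcript names in the usage note are hypothetical.

```python
# Types the Gencode project recommends as the reference annotation
# (all four are colored dark green in the track display).
REFERENCE_TYPES = {
    "Known",
    "Novel_CDS",
    "Novel_transcript_gencode_conf",
    "Putative_gencode_conf",
}


def reference_annotations(annotations):
    """Keep only (name, type) pairs whose type is in the recommended set.

    The (name, type) layout is an assumption for this sketch; real
    annotation records carry many more fields.
    """
    return [(name, kind) for name, kind in annotations if kind in REFERENCE_TYPES]
```

For example, given the hypothetical records `[("tx1", "Known"), ("tx2", "Putative"), ("tx3", "Novel_CDS")]`, only `tx1` and `tx3` are kept.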
Type | Color | Description
Known | dark green | Known protein coding genes (referenced in Entrez Gene, NCBI)
Novel_CDS | dark green | Novel protein coding genes annotated by Havana (not referenced in Entrez Gene, NCBI)
Novel_transcript_gencode_conf | dark green | Novel transcripts annotated by Havana (no ORF assigned) with at least one junction validated by RT-PCR
Putative_gencode_conf | dark green | Putative transcripts (similar to "novel transcripts", EST supported, short, no viable ORF) with at least one junction validated by RT-PCR
Novel_transcript | light green | Novel transcripts annotated by Havana (no ORF assigned) not validated by RT-PCR
Putative | light green | Putative transcripts (similar to "novel transcripts", EST supported, short, no viable ORF) not validated by RT-PCR
TEC | light green | Single exon objects (supported by multiple ESTs with polyA sites and signals) undergoing experimental validation/extension.
Processed_pseudogene | blue | Pseudogenes arising via retrotransposition (exon structure of parent gene lost)
Unprocessed_pseudogene | blue | Pseudogenes arising via gene duplication (exon structure of parent gene retained)
Artifact | grey | Transcript evidence and/or its translation equivocal
Methods The Human and Vertebrate Analysis and Annotation manual curation process (HAVANA) was used to produce these annotations. Finished genomic sequence was analyzed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases, as well as a series of ab initio gene predictions. Nucleotide sequence databases were searched with WUBLASTN and significant hits were realigned to the unmasked genomic sequence by EST2GENOME. WUBLASTX was used to search the Uniprot protein database, and the accession numbers of significant hits were retrieved from the Pfam database. Hidden Markov models for Pfam protein domains were aligned against the genomic sequence using Genewise to provide annotation of protein domains.
A number of ab initio prediction algorithms were also run: Genscan and Fgenesh for genes, tRNAscan to find tRNA genes, and Eponine TSS for transcription start site predictions. The annotators used the (AceDB-based) Otterlace interface to create and edit gene objects, which were then stored in a local database named Otter. In cases where predicted transcript structures from Ensembl are available, these can be viewed from within the Otterlace interface and may be used as starting templates for gene curation. Annotation in the Otter database is submitted to the EMBL/Genbank/DDBJ nucleotide database. Verification The gene objects selected for verification came from various computational prediction methods and HAVANA annotations. RT-PCR and RACE experiments were performed on them, using a variety of human tissues, to confirm their structure. Human cDNAs from 24 different tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta, skin, peripheral blood leucocytes, bone marrow, fetal brain, fetal liver, fetal kidney, fetal heart, fetal lung, thymus, pancreas, mammary gland, prostate) were synthesized using 12 poly(A)+ RNAs from Origene, eight from Clemente Associates/Quantum Magnetics and four from BD Biosciences as described in [Reymond et al., 2002a,b]. The relative amount of each cDNA was normalized by quantitative PCR using SYBR Green as intercalator and an ABI Prism 7700 Sequence Detection System. Predictions of human gene junctions were assayed experimentally by RT-PCR as previously described and modified [Reymond, 2002b; Mouse Genome Sequencing Consortium, 2002; Guigo, 2003]. Similar amounts of Homo sapiens cDNAs were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4 ng/ul primers (Sigma-Genosys) with a BioMek 2000 robot (Beckman).
The first ten cycles of PCR amplification were performed with touchdown annealing temperatures decreasing from 60 to 50°C; the next 30 cycles were carried out at an annealing temperature of 50°C. Amplimers were separated on "Ready to Run" precast gels (Pharmacia) and sequenced. RACE experiments were performed with the BD SMART RACE cDNA Amplification Kit following the manufacturer instructions (BD Biosciences). Credits Click here for a complete list of people who participated in the GENCODE project. References Ashurst, J.L. et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res 33 (Database Issue), D459-65 (2005). Guigo, R. et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 100(3), 1140-5 (2003). Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420(6915), 520-62 (2002). Reymond, A. et al. Human chromosome 21 gene expression atlas in the mouse. Nature 420(6915), 582-6 (2002). Reymond, A. et al. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics 79(6), 824-32 (2002). encodeGencodeIntron Gencode Introns Gencode Intron Validation ENCODE Regions and Genes Description The Gencode Intron Validation track shows gene structure validations generated by the GENCODE project. This track serves as a companion to the Gencode Genes track.
The items in this track are colored based on the validation status determined via RT-PCR of exons flanking the intron:
Status | Color | Validation Result
RT_positive | green | Intron validated (PCR product corresponds to expected junction)
RT_negative | red | Intron not validated (no PCR product was obtained)
RT_wrong_junction | gold | Intron not validated, but another junction exists between the two (PCR product does not correspond to the expected junction)
Methods Selected gene models from the Gencode Genes track were picked for RT-PCR and RACE verification experiments. RT-PCR and RACE experiments were performed on the objects, using a variety of human tissues, to confirm their structure. Human cDNAs from 24 different tissues (brain, heart, kidney, spleen, liver, colon, small intestine, muscle, lung, stomach, testis, placenta, skin, peripheral blood leucocytes, bone marrow, fetal brain, fetal liver, fetal kidney, fetal heart, fetal lung, thymus, pancreas, mammary gland, prostate) were synthesized using twelve poly(A)+ RNAs from Origene, eight from Clemente Associates/Quantum Magnetics and four from BD Biosciences as described in [Reymond et al., 2002a,b]. The relative amount of each cDNA was normalized with glyceraldehyde-3-phosphate dehydrogenase (GAPDH) by quantitative PCR using SYBR Green as intercalator and an ABI Prism 7700 Sequence Detection System. Predictions of human gene junctions were assayed experimentally by RT-PCR as previously described and modified [Reymond, 2002b; Mouse Genome Sequencing Consortium, 2002; Guigo, 2003]. Similar amounts of Homo sapiens cDNAs were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4 ng/ul primers (Sigma-Genosys) with a BioMek 2000 robot (Beckman). The first ten cycles of PCR amplification were performed with touchdown annealing temperatures decreasing from 60 to 50°C; the next 30 cycles were carried out at an annealing temperature of 50°C. Amplimers were separated on "Ready to Run" precast gels (Pharmacia) and sequenced.
RACE experiments were performed with the BD SMART RACE cDNA Amplification Kit following the manufacturer instructions (BD Biosciences). Credits Click here for a complete list of people who participated in the GENCODE project. References Ashurst, J.L. et al. The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res 33 (Database Issue), D459-65 (2005). Guigo, R. et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc Natl Acad Sci U S A 100(3), 1140-5 (2003). Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420(6915), 520-62 (2002). Reymond, A. et al. Human chromosome 21 gene expression atlas in the mouse. Nature 420(6915), 582-6 (2002). Reymond, A. et al. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics 79(6), 824-32 (2002). encodeGencodeRegions Gencode Unann Gencode Unannotated Region Classification ENCODE Analysis Description This track shows genomic regions grouped into subsets based on their locations relative to the transcript-coding and exon-coding regions of Gencode genes. The data are divided into four region types: Gencode distal intronic: regions between Gencode gene exons that are located more than 5000 base pairs away from any exon. Gencode proximal intronic: regions between Gencode gene exons that are located no more than 5000 base pairs from an exon. Gencode distal intergenic: regions outside of all Gencode gene extents that are located more than 5000 base pairs from a gene extent. Gencode proximal intergenic: regions outside of all Gencode gene extents that are located within 5000 base pairs of a gene extent. In this context, a gene extent is defined as the region between transcript start and transcript end. Display Conventions and Configuration This annotation track is comprised of subtracks that show different aspects of the displayed data. 
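The four region types defined above for the Gencode Unannotated Region Classification track reduce to two decisions: whether a position lies inside a gene extent, and whether it is more than 5000 base pairs from the nearest exon (for intronic regions) or gene extent (for intergenic regions). The sketch below illustrates that rule only; the function name `classify_region` and its arguments are assumptions, and it presumes the caller has already excluded exonic positions and computed the relevant distance.

```python
def classify_region(inside_gene_extent, distance_bp, threshold=5000):
    """Name the region type for a non-exonic position.

    inside_gene_extent: True if the position lies between a transcript
        start and transcript end (so the region is intronic).
    distance_bp: distance to the nearest exon for intronic positions, or
        to the nearest gene extent for intergenic positions.
    """
    kind = "intronic" if inside_gene_extent else "intergenic"
    # "More than 5000 bp" is distal; 5000 bp or less is proximal.
    zone = "distal" if distance_bp > threshold else "proximal"
    return f"Gencode {zone} {kind}"
```

A position exactly 5000 bp from an exon is classified as proximal, following the "no more than 5000 base pairs" wording.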
The complete list of subtracks is displayed near the top of the track description page. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods The Gencode reference gene set is comprised of the following gene types: Known, Novel_CDS, Novel_transcript_gencode_conf, and Putative_gencode_conf. The exons of this gene set were extracted using the UCSC Table Browser to define the set of Gencode intronic regions. Similarly, the transcript start and stop coordinates of this gene set were extracted to define the Gencode intergenic regions. There is some ambiguity associated with these classifications due to the presence of isoforms and overlapping transcripts. Credits This annotation was generated by Hiram Clawson of the UCSC Genome Bioinformatics Group. encodeGencodeIntergenicDistal Intergenic Dist Gencode Intergenic Distal Regions ENCODE Analysis encodeGencodeIntergenicProximal Intergenic Prox Gencode Intergenic Proximal Regions ENCODE Analysis encodeGencodeIntronsDistal Intronic Dist Gencode Intronic Distal Regions ENCODE Analysis encodeGencodeIntronsProximal Intronic Prox Gencode Intronic Proximal Regions ENCODE Analysis rnaCluster Gene Bounds Gene Boundaries as Defined by RNA and Spliced EST Clusters mRNA and EST Description This track shows the boundaries of genes and the direction of transcription as deduced from clustering spliced ESTs and mRNAs against the genome. When many spliced variants of the same gene exist, this track shows the variant that spans the greatest distance in the genome. Method ESTs and mRNAs from GenBank were aligned against the genome using BLAT. Alignments with less than 97.5% base identity within the aligning blocks were filtered out. When multiple alignments occurred, only those alignments with a percentage identity within 0.2% of the best alignment were kept. 
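The two identity filters just described for the Gene Bounds track can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the function name `filter_alignments` is hypothetical, each alignment is reduced to its percent identity, and "within 0.2%" is read as an absolute difference of at most 0.2 percentage points from the best alignment.

```python
def filter_alignments(identities, min_identity=97.5, near_best=0.2):
    """Filter the percent identities of one sequence's alignments.

    Step 1: drop alignments below min_identity percent.
    Step 2: keep only alignments within near_best percentage points
            of the best remaining alignment (an assumed interpretation).
    """
    passing = [pct for pct in identities if pct >= min_identity]
    if not passing:
        return []
    best = max(passing)
    return [pct for pct in passing if best - pct <= near_best]
```

Because the comparison uses floating-point subtraction, values that sit exactly on the 0.2 boundary may fall on either side; a production filter would likely work in integer units.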
The following alignments were also discarded: ESTs that aligned without any introns, blocks smaller than 10 bases, and blocks smaller than 130 bases that were not located next to an intron. The orientations of the ESTs and mRNAs were deduced from the GT/AG splice sites at the introns; ESTs and mRNAs with overlapping blocks on the same strand were merged into clusters. Only the extent and orientation of the clusters are shown in this track. Scores for individual gene boundaries were assigned based on the number of cDNA alignments used: 300 for a single cDNA alignment, 600 for two alignments, 900 for three alignments, and 1000 for four or more alignments. Credits This track, which was originally developed by Jim Kent, was generated at UCSC and uses data submitted to GenBank by scientists worldwide. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32:D23-6. Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. geneid Geneid Genes Geneid Gene Predictions Genes and Gene Predictions Description This track shows gene predictions from the geneid program developed by Roderic Guigó's Computational Biology of RNA Processing group which is part of the Centre de Regulació Genòmica (CRG) in Barcelona, Catalunya, Spain. Methods Geneid is a program to predict genes in anonymous genomic sequences designed with a hierarchical structure. In the first step, splice sites, start and stop codons are predicted and scored along the sequence using Position Weight Arrays (PWAs). Next, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov Model for coding DNA. Finally, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons. Credits Thanks to Computational Biology of RNA Processing for providing these data.
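The exon scoring and assembly steps described in the geneid Methods can be illustrated with a deliberately simplified sketch. It is not geneid's algorithm: it treats assembly as choosing a set of non-overlapping exons with the highest total score (a standard weighted-interval dynamic program) and ignores any further constraints the real program may apply, such as reading-frame compatibility. The names `exon_score` and `best_assembly`, the (start, end, score) tuples, and the integer inclusive coordinates are all assumptions.

```python
from bisect import bisect_right


def exon_score(site_scores, coding_llr):
    """Exon score: sum of the defining-site scores plus the coding
    log-likelihood ratio, as described in the Methods."""
    return sum(site_scores) + coding_llr


def best_assembly(exons):
    """Pick non-overlapping exons maximizing the summed score.

    exons: iterable of (start, end, score) with integer, inclusive coordinates.
    Returns (total_score, chosen_exons_in_genomic_order).
    """
    exons = sorted(exons, key=lambda e: e[1])      # order by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (total, chosen) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_right(ends, start - 1, 0, i)
        take_total = best[j][0] + score
        if take_total > best[i][0]:
            best.append((take_total, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])                   # skipping this exon is better
    return best[-1]
```

With four candidate exons where the highest-scoring one overlaps two others, the sketch prefers the compatible set whose scores sum higher, which is the behavior the "maximizing the sum of the scores" step describes.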
References Blanco E, Parra G, Guigó R. Using geneid to identify genes. Curr Protoc Bioinformatics. 2007 Jun;Chapter 4:Unit 4.3. PMID: 18428791 Parra G, Blanco E, Guigó R. GeneID in Drosophila. Genome Res. 2000 Apr;10(4):511-5. PMID: 10779490; PMC: PMC310871 genscan Genscan Genes Genscan Gene Predictions Genes and Gene Predictions Description This track shows predictions from the Genscan program written by Chris Burge. The predictions are based on transcriptional, translational and donor/acceptor splicing signals as well as the length and compositional distributions of exons, introns and intergenic regions. For more information on the different gene tracks, see our Genes FAQ. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The track description page offers the following filter and configuration options: Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison of gene predictions. Go to the Coloring Gene Predictions and Annotations by Codon page for more information about this feature. Methods For a description of the Genscan program and the model that underlies it, refer to Burge and Karlin (1997) in the References section below. The splice site models used are described in more detail in Burge (1998) below. Credits Thanks to Chris Burge for providing the Genscan program. References Burge C. Modeling Dependencies in Pre-mRNA Splicing Signals. In: Salzberg S, Searls D, Kasif S, editors. Computational Methods in Molecular Biology. Amsterdam: Elsevier Science; 1998. p. 127-163. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997 Apr 25;268(1):78-94. 
PMID: 9149143 encodeGisChipPet GIS p53 5FU HCT116 GIS ChIP-PET: p53 Ab on 5FU treated HCT116 cells ENCODE Chromatin Immunoprecipitation Description This track shows genome-wide p53 binding sites as determined by chromatin immunoprecipitation (ChIP) and paired-end di-tag (PET) sequencing. The p53 protein is a transcription factor involved in the control of cell growth that is often expressed at high levels in cancer cells. See the Methods section below for more information about ChIP and PET. The PET sequences in this track are derived from 65,572 individual p53 ChIP fragments of 5-fluorouracil (5FU) stimulated HCT116 cells. More datasets will be submitted in the future, including STAT1, TAF250, and E2F1. Display Conventions and Configuration In the graphical display, PET sequences are shown as two blocks, representing the ends of the pair, connected by a thin arrowed line. Overlapping PET clusters (PET fragments that overlap one another) originating from the ChIP enrichment process define the genomic loci that are potential transcription factor binding sites (TFBSs). PET singletons, from non-specific ChIP fragments that did not cluster, are not shown. In full and packed display modes, the arrowheads on the horizontal line represent the orientation of the PET sequence, and an ID of the format XXXXX-M is shown to the left of each PET, where X is the unique ID for each PET and M is the number of PET sequences at this location. The track coloring reflects the value of M: light gray indicates one or two sequences (score = 333), dark gray is used for three sequences (score = 800) and black indicates four or more PET sequences (score = 1000) at the location. Methods HCT116 cells were treated with 5FU for six hours. The cross-linked chromatin was sheared and precipitated with a high affinity antibody. The DNA fragments were end-polished and cloned into the plasmid vector, pGIS3. 
pGIS3 contains two MmeI recognition sites that flank the cloning site, which were used to produce a 36 bp PET from the original ChIP DNA fragments (18 bp from each of the 5' and 3' ends). Multiple 36 bp PETs were concatenated and cloned into pZero-1 for sequencing, where each sequence read can generate 10-15 PETs. The PET sequences were extracted from raw sequence reads and mapped to the genome, defining the boundaries of each ChIP DNA fragment. The following specific mapping criteria were used: both 5' and 3' signatures must be present on the same chromosome their 5' to 3' orientation must be correct a minimal 17 bp match must exist for each 18 bp 5' and 3' signature the tags must have genomic alignments within 4 Kb of each other Due to the known possibility of MmeI slippage (+/- 1 bp) that leads to ambiguities at the PET signature boundaries, a minimal 17 bp match was set for each 18 bp signature. The total count of PET sequences mapped to the same locus but with slight nucleotide differences may reflect the expression level of the transcripts. Only PETs with specific mapping (one location) to the genome were considered. PETs that mapped to multiple locations may represent low complexity or repetitive sequences, and therefore were not included for further analysis. Verification Statistical and experimental verification exercises have shown that the overlapping PET clusters result from ChIP enrichment events. Monte Carlo simulation using the p53 ChIP-PET data estimated that about 27% of PET-2 clusters (PET clusters with two overlapping members), 3% of the PET clusters with 3 overlapping members (PET-3 clusters), and less than 0.0001% of PET clusters with more than 3 overlapping members were due to random chance. This suggests that the PET clusters most likely represent the real enrichment events by ChIP and that a higher number of overlapping fragments correlates to a higher probability of a real ChIP enrichment event. 
Furthermore, based on goodness-of-fit analysis for assessing the reliability of PET clusters, it was estimated that less than 36% of the PET-2 clusters and over 99% of the PET-3+ clusters (clusters with three or more overlapping members) are true enrichment ChIP sites. Thus, the verification rate is nearly 100% for PET-3+ ChIP clusters, and the PET-2 clusters contain significant noise. In addition to these statistical analyses, 40 genomic locations identified by PET-3+ clusters were randomly selected and analyzed by quantitative real-time PCR. The relative enrichment of candidate regions compared to control GST ChIP DNA was determined and all 40 regions (100%) were confirmed to have significant enrichment of p53 ChIP clusters. Credits The p53 ChIP-PET library and sequence data were produced at the Genome Institute of Singapore. The data were mapped and analyzed by scientists from the Genome Institute of Singapore, the Bioinformatics Institute, Singapore, and Boston University. References Ng, P. et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nature Methods 2, 105-111 (2005). encodeGisRnaPet GIS-PET RNA Gene Identification Signature Paired-End Tags of PolyA+ RNA ENCODE Regions and Genes Description This track shows the starts and ends of mRNA transcripts determined by paired-end ditag (PET) sequencing. PETs are composed of 18 bases from either end of a cDNA; 36 bp PETs from many clones were concatenated together and cloned into pZero-1 for efficient sequencing. See the Methods and References sections below for more details on PET sequencing. The PET sequences in this track are full-length transcripts derived from two cell lines and mapped on whole genome: the log phase of MCF7 cells HCT116 cells treated with 5FU (5-fluorouracil) for 6 hours In total, 584,624 PETs were generated for MCF7 and 280,340 PETs were generated for HCT116. More than 80% of the PETs in each group were mapped to the genome. 
The 474,278 MCF7 PETs and 223,261 HCT116 PETs that mapped with single and multiple (up to ten) matches in the genome are shown in the two subtracks. In the graphical display, the ends are represented by blocks connected by a horizontal line. In full and packed display modes, the arrowheads on the horizontal line represent the direction of transcription, and an ID of the format XXXXX-N-M is shown to the left of each PET, where X is the unique ID for each PET, N indicates the number of mapping locations in the genome (1 for a single mapping location, 2 for two mapping locations, and so forth), and M is the number of PET sequences at this location. The total count of PET sequences mapped to the same locus but with slight nucleotide differences may reflect the expression level of the transcripts. PETs that mapped to multiple locations may represent low complexity or repetitive sequences. The graphical display also uses color coding to reflect the uniqueness and expression level of each PET:
Color | Mapping | PETs observed at location
dark blue | unique | 2 or more
light blue | unique | 1
medium brown | multiple | 2 or more
light brown | multiple | 1
Methods PolyA+ RNA was isolated from the cells. A full-length cDNA library was constructed and converted into a PET library for Gene Identification Signature analysis (Ng et al., 2005). Generation of PET sequences involved cloning of cDNA sequences into the plasmid vector, pGIS3. pGIS3 contains two MmeI recognition sites that flank the cloning site, which were used to produce a 36 bp PET. Each 36 bp PET sequence contains 18 bp from each of the 5' and 3' ends of the original full-length cDNA clone. The 18 bp 3' signature contains 16 bp 3'-specific nucleotides and an AA residual of the polyA tail to indicate the sequence orientation.
PET sequences were mapped to the genome using the following specific criteria: a minimal continuous 16 bp match must exist for the 5' signature; the 3' signature must have a minimal continuous 14 bp match both 5' and 3' signatures must be present on the same chromosome their 5' to 3' orientation must be correct the maximal genomic span of a PET genomic alignment must be less than one million bp Most of the PET sequences (more than 90%) were mapped to specific locations (single mapping loci). PETs mapping to 2 - 10 locations are also included and may represent duplicated genes or pseudogenes in the genome. Verification To assess overall PET quality and mapping specificity, the top ten most abundant PET clusters that mapped to well-characterized known genes were examined. Over 99% of the PETs represented full-length transcripts, and the majority fell within ten bp of the known 5' and 3' boundaries of these transcripts. The PET mapping was further verified by confirming the existence of physical cDNA clones represented by the ditags. PCR primers were designed based on the PET sequences and amplified the corresponding cDNA inserts from the parental GIS flcDNA library for sequencing analysis. In a set of 86 arbitrarily-selected PETs representing a wide range of annotation categories — including known genes (38 PETs), predicted genes (2 PETs), and novel transcripts (46 PETs) — 84 (97.7%) confirmed the existence of bona fide transcripts. Credits The GIS-PET libraries and sequence data for transcriptome analysis were produced at the Genome Institute of Singapore. The data were mapped and analyzed by scientists from the Genome Institute of Singapore and the Bioinformatics Institute of Singapore. References Ng, P. et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat. Methods 2(2), 105-11 (2005). 
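The GIS-PET RNA conventions described above (the XXXXX-N-M item label, the four-color scheme, and the mapping criteria) can be summarized in a short sketch. Everything here is illustrative: the function names, argument names, and the sample label `12345-1-3` are hypothetical and are not taken from the GIS pipeline or the browser code.

```python
def parse_pet_id(label):
    """Split an 'XXXXX-N-M' label into (pet_id, n_locations, n_pets)."""
    pet_id, n_locations, n_pets = label.rsplit("-", 2)
    return pet_id, int(n_locations), int(n_pets)


def pet_color(n_locations, n_pets):
    """Color by mapping uniqueness (N) and PET count at the location (M)."""
    unique = n_locations == 1
    if n_pets >= 2:
        shade = "dark" if unique else "medium"
    else:
        shade = "light"
    return f"{shade} {'blue' if unique else 'brown'}"


def passes_mapping_criteria(five_prime_match, three_prime_match,
                            same_chromosome, orientation_ok, span_bp):
    """Apply the listed criteria: contiguous matches of at least 16 bp (5')
    and 14 bp (3'), same chromosome, correct 5'-to-3' orientation, and a
    genomic span under one million bp."""
    return (five_prime_match >= 16 and three_prime_match >= 14
            and same_chromosome and orientation_ok
            and span_bp < 1_000_000)
```

For the hypothetical label `12345-1-3`, the PET maps to a single location and is supported by three sequences, so it would be drawn in dark blue.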
encodeGisRnaPetHCT116 GIS RNA HCT116 Gene Identification Signature Paired-End Tags of PolyA+ RNA (5FU-stim HCT116) ENCODE Regions and Genes encodeGisRnaPetMCF7 GIS RNA MCF7 Gene Identification Signature Paired-End Tags of PolyA+ RNA (log phase MCF7) ENCODE Regions and Genes gnfAtlas2 GNF Atlas 2 GNF Expression Atlas 2 Expression Description This track shows expression data from the GNF Gene Expression Atlas 2. This contains two replicates each of 79 human tissues run over Affymetrix microarrays. By default, averages of related tissues are shown. Display all tissues by selecting "All Arrays" from the "Combine arrays" menu on the track settings page. As is standard with microarray data red indicates overexpression in the tissue, and green indicates underexpression. You may want to view gene expression with the Gene Sorter as well as the Genome Browser. Credits Thanks to the Genomics Institute of the Novartis Research Foundation (GNF) for the data underlying this track. References Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7. PMID: 15075390; PMC: PMC395923 affyRatio GNF Ratio GNF Gene Expression Atlas Ratios Using Affymetrix GeneChips Expression Description This track shows expression data from GNF (The Genomics Institute of the Novartis Research Foundation) using Affymetrix GeneChips. The chip types, chip IDs or tissue averages associated with experiments can be displayed by selecting the appropriate option from the Experiment Display menu on the track description page. For more information, see the Track Configuration section. Methods For detailed information about the experiments, see Su et al. 2002 in the References section below. Alignments displayed on the track correspond to the target sequences used by Affymetrix to choose probes. 
In dense display mode, the track color denotes the average signal over all experiments on a log base 2 scale. Lighter colors correspond to lower signals and darker colors correspond to higher signals. In full display mode, the color of each item represents the log base 2 ratio of the signal of that particular experiment to the median signal of all experiments for that probe. More information about individual probes and probe sets is available on the Affymetrix website. Track Configuration This track may be configured to change the display mode and colors or vary the type of experiment information shown. The configuration controls are located at the top of the track description page, which is accessed via the small button to the left of the track's graphical display or the link on the track's control menu. Display mode: To change the display mode for the track, select the desired display setting from the Display Mode pulldown list. Combine Arrays: All arrays may be displayed with either the chip ID or the tissue type as the label. Replicate arrays may also be combined by expression medians. When you have finished making changes, click the Submit button to commit your changes and return to the Genome Browser tracks display. Credits Thanks to GNF for providing these data. References Su, A.I., Cooke, M.P., Ching, K.A., Hakak, Y., Walker, J.R., Wiltshire, T., Orth, A.P., Vega, R.G., Sapinoso, L.M., Moqrich, A. et al. Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci USA 99(7), 4465-70 (2002). HInvGeneMrna H-Inv H-Invitational Genes mRNA Alignments mRNA and EST Description This track shows alignments of full-length cDNAs that were used as the basis of the H-Invitational Gene Database (HInv-DB). The HInv-DB is a human gene database containing human-curated annotation of 41,118 full-length cDNA clones representing 21,037 cDNA clusters. The project was initiated in 2002 and the database became publicly available in April 2004. 
HInv-DB entries describe the following entities: gene structures; functions; novel alternative splicing isoforms; non-coding functional RNAs; functional domains; sub-cellular localizations; metabolic pathways; predictions of protein 3D structure; mapping of SNPs and microsatellite repeat motifs in relation to orphan diseases; gene expression profiling; and comparative results with mouse full-length cDNAs. Methods To cluster redundant cDNAs and alternative splicing variants within the H-Inv cDNAs, a total of 41,118 H-Inv cDNAs were mapped to the human genome using the mapping pipeline developed by the Japan Biological Information Research Center (JBIRC). The mapping yielded 40,140 cDNAs that were aligned against the genome using the stringent criteria of at least 95% identity and 90% length coverage. These 40,140 cDNAs were clustered to 20,190 loci, resulting in an average of 2.0 cDNAs per locus. For the remaining 978 unmapped cDNAs, cDNA-based clustering was applied, yielding 847 clusters. In total, 21,037 clusters (20,190 mapped and 847 unmapped) were identified and integrated into H-InvDB. H-Inv cluster IDs (e.g. HIX0000001) were assigned to these clusters. A representative sequence was selected from each cluster and used for further analyses and annotation. A full description of the construction of the HInv-DB is contained in the report by the H-Inv Consortium (see References section). Credits The H-InvDB is hosted at the JBIRC. The human-curated annotations were produced during invitational annotation meetings held in Japan during the summer of 2002, with a follow-up meeting in November 2004. Participants included 158 scientists representing 67 institutions from 12 countries. The full-length cDNA clones and sequences were produced by the Chinese National Human Genome Center (CHGC), the Deutsches Krebsforschungszentrum (DKFZ/MIPS), Helix Research Institute, Inc.
(HRI), the Institute of Medical Science in the University of Tokyo (IMSUT), the Kazusa DNA Research Institute (KDRI), the Mammalian Gene Collection (MGC/NIH) and the Full-Length Long Japan (FLJ) project. References Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M et al. Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2004 Jun;2(6):e162. PMID: 15103394; PMC: PMC393292 haplotype Haplotype Blocks Common Haplotype Blocks Variation and Repeats Description Haplotype blocks on chromosome 22 from The University of Oxford and The Wellcome Trust Sanger Institute, as described in Dawson, E. et al. A first-generation linkage disequilibrium map of human chromosome 22. Nature 418, 544-8 (2002). The location of each haplotype block is represented by a blue horizontal line with tall vertical blue bars at the first and last SNPs of the block. Blocks are displayed as starting at the first SNP and ending at the last SNP of the block. Individual SNPs are denoted by smaller black vertical bars. At multi-megabase resolution in dense display mode, clusters of tall blue bars may indicate hotspots for recombination. NOTE: Haplotype block annotations appear only on chromosome 22. Credits Thanks to The University of Oxford and the Sanger Institute for providing these data. encodeHapMapCov HapMap Coverage ENCODE HapMap (16c.1) Resequencing Coverage ENCODE Variation Description This track shows the depth of sequencing coverage for the four HapMap populations in the ten ENCODE regions that have been resequenced for variation.
The data for each population is shown in a separate subtrack: HapMap Allele Frequencies (CEU): Utah residents with ancestry from northern and western Europe HapMap Allele Frequencies (CHB): Han Chinese in Beijing, China HapMap Allele Frequencies (JPT): Japanese in Tokyo, Japan HapMap Allele Frequencies (YRI): Yoruba in Ibadan, Nigeria The ENCODE regions targeted in this annotation include: ENr112 (chr2) ENr131 (chr2) ENr113 (chr4) ENm010 (chr7) ENm013 (chr7) ENm014 (chr7) ENr321 (chr8) ENr232 (chr9) ENr123 (chr12) ENr213 (chr18) Display Conventions and Configuration The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Each data value represents the number of sequencing traces that covered the nucleotide. See the International HapMap Project website for information about how these data were collected and analyzed. Credits These data were obtained from HapMap public release 16c.1. Thanks to the International HapMap Project for making this information available. encodeHapMapCovYRI HapMap Cov YRI HapMap Resequencing Coverage Yoruban (YRI) ENCODE Variation encodeHapMapCovJPT HapMap Cov JPT HapMap Resequencing Coverage Japanese (JPT) ENCODE Variation encodeHapMapCovCHB HapMap Cov CHB HapMap Resequencing Coverage Chinese (CHB) ENCODE Variation encodeHapMapCovCEU HapMap Cov CEU HapMap Resequencing Coverage CEPH (CEU) ENCODE Variation hapmapLd HapMap LD HapMap Linkage Disequilibrium - Phase II Variation and Repeats Description Linkage disequilibrium (LD) is the association of alleles on chromosomes. 
It measures the difference between the observed frequency of a two-locus haplotype and its expected frequency under independence, which is the product of the two single-allele frequencies. When LD is low, the two loci tend to be inherited in a nearly random manner. This track shows three different measures of linkage disequilibrium (D', r2, and LOD, the log odds) between pairs of SNPs as genotyped by the HapMap consortium. LD is useful for understanding the associations between genetic variants throughout the genome, and can be helpful in selecting SNPs for genotyping. By default, LOD values are displayed in full mode. Each diagonal represents a different SNP, with each diamond representing a pairwise comparison between two SNPs. Shades are used to indicate linkage disequilibrium between the pair of SNPs, with darker shades indicating stronger LD. For the LOD values, additional colors are used in some cases: White diamonds indicate pairwise D' values less than 1 with no statistically significant evidence of LD (LOD < 2). Light blue diamonds indicate high D' values (>0.99) with low statistical significance (LOD < 2). Light pink diamonds are drawn when the statistical significance is high (LOD >= 2) but the D' value is low (less than 0.5). Methods Genotypes from HapMap Phase II release 19 were used with Haploview to infer phasing and calculate LD values for all SNP pairs within 250 kb. As the children in the trios are not independent samples, Haploview uses only the parents from those populations. The YRI and CEU tracks each use 60 unrelated individuals (parents from the trios); the CHB and JPT tracks use 45 unrelated individuals each. Haploview uses a two-marker EM (ignoring missing data) to estimate the maximum-likelihood values of the four gamete frequencies, from which the D', LOD, and r2 calculations derive. Haplotype phase is inferred using a standard EM algorithm with a partition-ligation approach for blocks with more than 10 markers.
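As an illustration of the D' and r2 measures described above, the following Python sketch computes them from haplotype and allele frequencies using the standard textbook definitions. It is not Haploview's code: the function name ld_measures and its arguments are hypothetical, the frequencies are assumed to be already estimated (for example by the EM step mentioned in the Methods), and the LOD score is not computed here.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float) -> tuple[float, float, float]:
    """Return (D, D', r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    (Hypothetical helper; names are not taken from Haploview or the browser.)
    """
    # D: observed haplotype frequency minus the product of the allele frequencies
    d = p_ab - p_a * p_b

    # D' rescales |D| by the largest value it could take given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2: squared correlation coefficient between the two loci
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Allele B occurs only on A-bearing haplotypes: D' is 1 but r^2 is well below 1
    print(ld_measures(p_ab=0.2, p_a=0.5, p_b=0.2))
```

The example call shows why the two measures are reported separately: a pair of SNPs with very different allele frequencies can have complete LD by D' while still being poor proxies for each other by r2.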
Display Conventions and Configuration Display Mode Full mode shows the pairwise LD values in a Haploview-style mountain plot. Dense mode shows the pairwise LD values in a single line for each population, where the intensity at each position is the average of all of the LD values between the SNP at that position and all other SNPs within 250 kb. LD Values: measures of linkage disequilibrium r2 displays the raw r2 value, or the square of the correlation coefficient for a given marker pair. SNPs that have not been separated by recombination have r2 = 1; in this case, these two markers are said to be redundant for genotyping, but may have different functional effects. Lower r2 values show a lower degree of LD, indicating that some recombination has occurred in this population. See Hill and Robertson (1966) for details. D' displays the raw D' value, which is the normalized covariance for a given marker pair. A D' value of 1 (complete LD) indicates that two SNPs have not been separated by recombination, while lower values indicate evidence of recombination in the history of the sample. Only D' values near 1 are a reliable measure of LD; lower values are difficult to interpret as the magnitude of D' depends strongly on sample size. See Lewontin (1988) for more details. LOD displays the log odds score for linkage disequilibrium between a given marker pair, and is shown by default. Track Geometry Trim to triangle shows the standard mountain plot (default); turning this option off will show LD values with SNPs outside the window. Inverting makes it easier to visually compare two adjacent populations. Colors LD Values can be drawn in a variety of colors, with red as default. The intensity of the color is proportional to the strength of the LD measure chosen above. Outlines can be drawn in contrasting colors or turned off. Outlines are automatically suppressed when the window is larger than 100,000 bp. 
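The dense-mode summary described under Display Mode above (one value per SNP, averaged over all partner SNPs within 250 kb) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the browser's drawing code: the function name dense_ld_profile, the input layout (a sorted position list plus a dictionary keyed by index pairs, each unordered pair listed once), and the inclusive distance cutoff are all hypothetical.

```python
def dense_ld_profile(positions, pair_ld, window=250_000):
    """Average the pairwise LD values for each SNP.

    positions : list of SNP coordinates (bp)
    pair_ld   : dict mapping (i, j) index pairs to an LD value (D', r^2 or LOD);
                each unordered pair is assumed to appear once
    window    : maximum distance in bp between the two SNPs of a pair
    Returns one average per SNP, or None for a SNP with no partner in range.
    """
    sums = [0.0] * len(positions)
    counts = [0] * len(positions)
    for (i, j), value in pair_ld.items():
        if i == j or abs(positions[i] - positions[j]) > window:
            continue  # ignore self-pairs and pairs farther apart than the window
        for k in (i, j):  # the pair contributes to both of its SNPs
            sums[k] += value
            counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


if __name__ == "__main__":
    positions = [0, 100_000, 200_000]
    pair_ld = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.0}
    print(dense_ld_profile(positions, pair_ld))
```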
Population Selection The HapMap populations can be individually displayed or hidden. YRI: Yoruba people in Ibadan, Nigeria (30 parent-and-adult-child trios) CEU: European samples from the Centre d'Etude du Polymorphisme Humain (CEPH) (30 trios) CHB: Han Chinese in Beijing (45 unrelated individuals) JPT: Japanese in Tokyo (45 unrelated individuals) Credits This track was created by Daryl Thomas at UCSC using data from the International HapMap Project, following the display style from Haploview. References HapMap Project The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299-1320 (2005). The International HapMap Consortium. The International HapMap Project. Nature 426, 789-96 (2003). HapMap Data Coordination Center Thorisson, G.A., Smith, A.V., Krishnan, L. and Stein, L.D. The International HapMap Project Web site. Genome Res 15, 1591-3 (2005). Haploview Barrett, J.C., Fry, B., Maller, J. and Daly, M.J. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21(2), 263-5 (2005). General references on Linkage Disequilibrium Lewontin, R.C. On measures of gametic disequilibrium. Genetics 120, 849-52 (1988). Hill, W. G. and Robertson, A. The effect of linkage on limits to artificial selection. Genet. Res., 8:269-294 (1966). hapmapLdJpt LD JPT Linkage Disequilibrium for the Japanese from Tokyo (JPT) Variation and Repeats hapmapLdChb LD CHB Linkage Disequilibrium for the Han Chinese (CHB) Variation and Repeats hapmapLdCeu LD CEU Linkage Disequilibrium for the CEPH (CEU) Variation and Repeats hapmapLdYri LD YRI Linkage Disequilibrium for the Yoruba (YRI) Variation and Repeats encodeHapMapAlleleFreq HapMap SNPs ENCODE HapMap (16c.1) Allele Frequencies ENCODE Variation Description This track shows allele frequencies for the four HapMap populations in the ten ENCODE regions that have been resequenced for variation. 
The data for each population is shown in a separate subtrack: HapMap Allele Frequencies (CEU): Utah residents with ancestry from northern and western Europe HapMap Allele Frequencies (CHB): Han Chinese in Beijing, China HapMap Allele Frequencies (JPT): Japanese in Tokyo, Japan HapMap Allele Frequencies (YRI): Yoruba in Ibadan, Nigeria The ENCODE regions targeted in this annotation include: ENr112 (chr2) ENr131 (chr2) ENr113 (chr4) ENm010 (chr7) ENm013 (chr7) ENm014 (chr7) ENr321 (chr8) ENr232 (chr9) ENr123 (chr12) ENr213 (chr18) See the Methods section for a discussion of the scoring method used in this annotation. The data set combines SNPs from the HapMap resequencing project, in addition to SNPs discovered previously. Display Conventions and Configuration The complete list of subtracks available in this annotation is shown at the top of the track description page. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Allele locations are indicated by tickmarks using a grayscale coloring scheme based on score, where darker shading indicates a higher score. A lower score indicates little or no variation; a higher score indicates a split between the reference and variant observations in the population. The track details page for an individual allele displays the variant and reference sequences, the allele frequencies, the origination of the data, and the total sample count. Methods See the International HapMap Project website for information about how these data were collected and analyzed. The score calculation in this annotation is a function of the minor allele frequency (maf), which varies from 0.0 to 0.5. The score has been normalized to a range of 500 to 1000 using the formula score = 500 + (maf * 1000). Thus, a score of 500 indicates no variation; a score of 1000 indicates an even split between reference and variant observations in the population. Credits These data were obtained from HapMap public release 16c.1. 
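The score formula given in the Methods section can be illustrated with a short Python sketch. The formula itself (score = 500 + maf * 1000) comes from the text; the function names, the derivation of maf from reference and variant observation counts, and the rounding to an integer are assumptions made for this example.

```python
def score_from_maf(maf: float) -> int:
    """Map a minor allele frequency in [0.0, 0.5] to a score in [500, 1000]."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie between 0.0 and 0.5")
    # Formula from the track description; rounding to an integer is an assumption
    return round(500 + maf * 1000)


def score_from_counts(ref_count: int, var_count: int) -> int:
    """Hypothetical helper: derive maf from observation counts, then score it."""
    total = ref_count + var_count
    if total <= 0:
        raise ValueError("at least one observation is required")
    maf = min(ref_count, var_count) / total
    return score_from_maf(maf)


if __name__ == "__main__":
    print(score_from_counts(120, 0))   # no variation
    print(score_from_counts(60, 60))   # even split between reference and variant
```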
Thanks to the International HapMap Project for making this information available. encodeHapMapAlleleFreqYRI Allele Freq YRI HapMap Minor Allele Frequencies Yoruban (YRI) ENCODE Variation encodeHapMapAlleleFreqJPT Allele Freq JPT HapMap Minor Allele Frequencies Japanese (JPT) ENCODE Variation encodeHapMapAlleleFreqCHB Allele Freq CHB HapMap Minor Allele Frequencies Chinese (CHB) ENCODE Variation encodeHapMapAlleleFreqCEU Allele Freq CEU HapMap Minor Allele Frequencies CEPH (CEU) ENCODE Variation est Human ESTs Human ESTs Including Unspliced mRNA and EST Description This track shows alignments between human expressed sequence tags (ESTs) in GenBank and the genome. ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. NOTE: As of April, 2007, we no longer include GenBank sequences that contain the following URL as part of the record: http://fulllength.invitrogen.com Some of these entries are the result of alignment to pseudogenes, followed by "correction" of the EST to match the genomic sequence. It is therefore not the sequence of the actual EST and makes it appear that the EST is transcribed. Invitrogen no longer sells the clones. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) indicates the direction of the match between the EST and the matching genomic sequence. It bears no relationship to the direction of transcription of the RNA with which it might be associated. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. 
To use the filter: Type a term in one or more of the text boxes to filter the EST display. For example, to apply the filter to all ESTs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only ESTs that match all filter criteria will be highlighted. If "or" is selected, ESTs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display ESTs that match the filter criteria. If "include" is selected, the browser will display only those ESTs that match the filter criteria. This track may also be configured to display base labeling, a feature that allows the user to display all bases in the aligning sequence or only those that differ from the genomic sequence. For more information about this option, click here. Several types of alignment gap may also be colored; for more information, click here. Methods To make an EST, RNA is isolated from cells and reverse transcribed into cDNA. Typically, the cDNA is cloned into a plasmid vector and a read is taken from the 5' and/or 3' primer. For most — but not all — ESTs, the reverse transcription is primed by an oligo-dT, which hybridizes with the poly-A tail of mature mRNA. The reverse transcriptase may or may not make it to the 5' end of the mRNA, which may or may not be degraded. 
In general, the 3' ESTs mark the end of transcription reasonably well, but the 5' ESTs may end at any point within the transcript. Some of the newer cap-selected libraries cover transcription start reasonably well. Before the cap-selection techniques emerged, some projects used random rather than poly-A priming in an attempt to retrieve sequence distant from the 3' end. These projects were successful at this, but as a side effect also deposited sequences from unprocessed mRNA and perhaps even genomic sequences into the EST databases. Even outside of the random-primed projects, there is a degree of non-mRNA contamination. Because of this, a single unspliced EST should be viewed with considerable skepticism. To generate this track, human ESTs from GenBank were aligned against the genome using blat. Note that the maximum intron length allowed by blat is 750,000 bases, which may eliminate some ESTs with very long introns that might otherwise align. When a single EST aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from EST sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. Kent WJ. BLAT - The BLAST-Like Alignment Tool. Genome Res. 2002 Apr;12(4):656-64. knownGene Known Genes Known Genes (March 04) Based on SWISS-PROT, TrEMBL, mRNA, and RefSeq Genes and Gene Predictions Description The UCSC Known Genes track shows known protein-coding genes based on protein data from SWISS-PROT, TrEMBL and TrEMBL-NEW and their corresponding mRNAs from GenBank. For more information on the different gene tracks, see our Genes FAQ. 
Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. Black coloring indicates features that have corresponding entries in the Protein Databank (PDB). Blue indicates features associated with mRNAs from NCBI RefSeq or (dark blue) items having associated proteins in the SWISS-PROT database. The variation in blue shading of RefSeq items corresponds to the level of review the RefSeq record has undergone: predicted (light), provisional (medium), or reviewed (dark). This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. Go to the Coloring Gene Predictions and Annotations by Codon page for more information about this feature. Methods mRNA sequences were aligned against the human genome using blat. When a single mRNA aligned in multiple places, only alignments having at least 98% base identity with the genomic sequence were kept. This set of mRNA alignments was further reduced by keeping only those mRNAs referenced by a protein in SWISS-PROT, TrEMBL or TrEMBL-NEW. Among multiple mRNAs referenced by a single protein, the best mRNA was selected, based on a quality score derived from its length, the level of the match between its translation and the protein sequence, and its release date. The resulting mRNA and protein pairs were further filtered by removing short invalid entries and consolidating entries with identical CDS regions. Finally, RefSeq entries derived from DNA sequences instead of mRNA sequences were added to produce the final data set shown in this track. Disease annotations were obtained from SWISS-PROT. Credits The Known Genes track was produced at UCSC based primarily on cross-references between proteins from SWISS-PROT (including TrEMBL and TrEMBL-NEW) and mRNAs from GenBank contributed by scientists worldwide. 
NCBI RefSeq data were also included in this track. Data Use Restrictions The UniProt data have the following terms of use, UniProt copyright(c) 2002 - 2004 UniProt consortium: For non-commercial use, all databases and documents in the UniProt FTP directory may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy. For commercial use, all databases and documents in the UniProt FTP directory except the files ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/uniprot_sprot.xml.gz may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy. More information for commercial users can be found at the UniProt License & disclaimer page. From January 1, 2005, all databases and documents in the UniProt FTP directory may be copied and redistributed freely by all entities, without advance permission, provided that this copyright statement is reproduced with each copy. References Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32:D23-6. Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC Known Genes. Bioinformatics. 2006 May 1;22(9):1036-46. Kent WJ. BLAT - The BLAST-Like Alignment Tool. Genome Res. 2002 Apr;12(4):656-64. encodeRna Known+Pred RNA Known and Predicted RNA Transcription in the ENCODE Regions ENCODE Regions and Genes Description This track shows the locations of known and predicted non-protein-coding RNA genes and pseudogenes that fall within the ENCODE regions. It contains all information in Sean Eddy's RNA Genes track for these regions, combined with computational predictions generated by Jakob Skou Pedersen's EvoFold algorithm. 
In addition to the fields contained in the RNA Genes track, this track also includes ENCODE-related fields describing overlap with transcribed regions and repeats. Feature types in this annotation include: tRNA: transfer RNA (or pseudogene) rRNA: ribosomal RNA (or pseudogene) scRNA: small cytoplasmic RNA (or pseudogene) snRNA: small nuclear RNA (or pseudogene) snoRNA: small nucleolar RNA (or pseudogene) miRNA: microRNA (or pseudogene) misc_RNA: miscellaneous other RNA, such as Xist (or pseudogene) "-": unknown RNA Display Conventions and Configuration The locations of the RNA genes and pseudogenes are represented by blocks in the graphical display, color-coded as follows: Black: region is Repeatmasked. Green: region is transcribed. Red: region is from the RNA Genes track and is not transcribed. Blue: region is an EvoFold prediction and is not transcribed. The display may be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Methods The RNA Genes track was supplemented with EvoFold predictions and filtered to include only those items that lie within the ENCODE regions. Regions that are at least 10 percent Repeatmasked are flagged because no transcriptional data are available for them. A region is considered transcribed if at least 10 percent of it overlaps with any Affymetrix transcribed fragment (transfrag), derived from six microarray experiments, or Yale transcriptionally-active region (TAR), derived from 15 microarray experiments. In these cases, each array from which the overlapped transfrags and TARs were derived is listed. EvoFold is a comparative method that exploits the evolutionary signal of genomic multiple-sequence alignments for identifying conserved functional RNA structures.
The method makes use of phylogenetic stochastic context-free grammars (phylo-SCFGs), which are combined probabilistic models of RNA secondary structure and primary sequence evolution. Each prediction consists of a specific RNA secondary structure and an overall score. The overall score is essentially a log-odds score comparing a phylo-SCFG that models the constrained evolution of stem-pairing regions with one that models only unpaired regions. Two sets of EvoFold predictions are included in this track. The first, labeled EvoFold, contains predictions based on the conserved elements of an 8-way vertebrate alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebrafish, and Fugu assemblies. The second set of predictions, TBA23_EvoFold, was based on the conserved elements of the 23-way TBA alignments present in the ENCODE regions. When a pair of these predictions overlaps, only the EvoFold prediction is shown. Credits These data were kindly provided by Sean Eddy at Washington University, Jakob Skou Pedersen at UC Santa Cruz, and the ENCODE Consortium. This annotation track was generated by Matt Weirauch. References Knudsen, B. and J.J. Hein. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15(6), 446-54 (1999). Pedersen, J.S., Bejerano, G. and Haussler, D. Identification and classification of conserved RNA secondary structures in the human genome. (In preparation).
encodeUcsdNgGif LI Ng gIF ChIP Ludwig Institute/UCSD ChIP/Chip NimbleGen - Gamma Interferon Experiments ENCODE Chromatin Immunoprecipitation Description This track displays results of the following ChIP-chip (NimbleGen) gamma interferon experiments on HeLa cells: anti-H3K4me3, no gamma interferon; anti-H3K4me3, 30 minutes after gamma interferon; anti-RNA Pol2 in initiation complex, no gamma interferon; anti-RNA Pol2 in initiation complex, 30 minutes after gamma interferon. ENCODE region-wide location analysis of trimethylated K4 histone H3 (H3K4me3, or triMeH3K4) and RNA polymerase II was conducted with ChIP-chip using chromatin extracted from HeLa cells induced for 30 minutes with gamma interferon as well as from uninduced cells. Methods Chromatin from both induced and uninduced HeLa cells was separately cross-linked, precipitated with different antibodies, sheared, amplified and hybridized to an oligonucleotide tiling array produced by NimbleGen Systems. The array includes non-repetitive sequences within the 44 ENCODE regions tiled from NCBI Build 35 (UCSC hg17) with 50-mer probes at 38 bp intervals. Resulting genomic coordinates were translated to NCBI Build 34 (UCSC hg16). Intensity values for biological replicate arrays were combined after quantile normalization using R. The averages of the quantile normalized intensity values for each probe were then median-scaled and loess-normalized using R to obtain the adjusted log R values. Verification Three biological replicates were used to generate the track for each factor at each time point with the exception of RNA Pol2 uninduced, for which only two biological replicates were used. Credits The data for this track were generated at the Ren Lab, Ludwig Institute for Cancer Research at UC San Diego.
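The replicate-combination steps named in the Methods above (quantile normalization, per-probe averaging, median scaling) can be sketched as follows. The original analysis was done in R; this pure-Python version is only an illustration, and its function names, its arbitrary handling of tied intensities, and its reading of "median-scaled" as subtracting the median on the log scale are all assumptions. The loess normalization step is not reproduced.

```python
from statistics import mean, median


def quantile_normalize(arrays):
    """Force every replicate array to share the same intensity distribution.

    arrays : list of equal-length lists, one list of probe intensities per replicate
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must contain the same probes")
    # Reference distribution: mean of the k-th smallest value across replicates
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        # Probe indices ordered by intensity (ties keep their original order)
        order = sorted(range(n), key=lambda idx: a[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def average_and_median_scale(normalized):
    """Average replicates per probe, then centre the averages on their median."""
    n = len(normalized[0])
    averages = [mean(a[i] for a in normalized) for i in range(n)]
    centre = median(averages)
    return [value - centre for value in averages]


if __name__ == "__main__":
    replicates = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
    print(average_and_median_scale(quantile_normalize(replicates)))
```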
encodeUcsdNgHeLaRnap_p30 LI Ng Pol2 +gIF Ludwig Institute/UCSD ChIP/Chip Ng: HeLa, Pol2, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaRnap_p0 LI Ng Pol2 -gIF Ludwig Institute/UCSD ChIP/Chip Ng: HeLa, Pol2, no gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaH3K4me3_p30 LI Ng H3K4m3 +gIF Ludwig Institute/UCSD ChIP/Chip Ng: HeLa, H3K4me3, 30 min after gamma interferon ENCODE Chromatin Immunoprecipitation encodeUcsdNgHeLaH3K4me3_p0 LI Ng H3K4m3 -gIF Ludwig Institute/UCSD ChIP/Chip Ng: HeLa, H3K4me3, no gamma interferon ENCODE Chromatin Immunoprecipitation ctgPos Map Contigs Physical Map Contigs Mapping and Sequencing Description This track shows the locations of human contigs on the physical map. The underlying data is derived from the NCBI seq_contig.md file that accompanies this assembly. All contigs are "+" oriented in the assembly. Methods For human genome reference sequences dated April 2003 and later, the individual chromosome sequencing centers are responsible for preparing the assembly of their chromosomes in AGP format. The files provided by these centers are checked and validated at NCBI, and form the basis for the seq_contig.md file that defines the physical map contigs. For more information on the human genome assembly process, see The NCBI Handbook. mgcFullMrna MGC Genes Mammalian Gene Collection Full ORF mRNAs Genes and Gene Predictions Description This track shows alignments of human mRNAs from the Mammalian Gene Collection (MGC) having full-length open reading frames (ORFs) to the genome. The goal of the Mammalian Gene Collection is to provide researchers with unrestricted access to sequence-validated full-length protein-coding cDNA clones for human, mouse, and rat genes. Display Conventions and Configuration The track follows the display conventions for gene prediction tracks. An optional codon coloring feature is available for quick validation and comparison of gene predictions. 
To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. Methods GenBank human MGC mRNAs identified as having full-length ORFs were aligned against the genome using blat. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 1% of the best and at least 95% base identity with the genomic sequence were kept. Credits The human MGC full-length mRNA track was produced at UCSC from mRNA sequence data submitted to GenBank by the Mammalian Gene Collection project. References Mammalian Gene Collection project references. Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 microsat Microsatellite Microsatellites - Di-nucleotide and Tri-nucleotide Repeats Variation and Repeats Description This track displays regions that are likely to be useful as microsatellite markers. These are sequences of at least 15 perfect di-nucleotide and tri-nucleotide repeats and tend to be highly polymorphic in the population. Methods The data shown in this track are a subset of the Simple Repeats track, selecting only those repeats of period 2 and 3, with 100% identity and no indels and with at least 15 copies of the repeat. The Simple Repeats track is created using the Tandem Repeats Finder. For more information about this program, see Benson (1999). Credits Tandem Repeats Finder was written by Gary Benson. References Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999 Jan 15;27(2):573-80. PMID: 9862982; PMC: PMC148217 vntr Microsatellites Perfect Microsatellites - VNTR Variation and Repeats Description This track contains all perfect 'microsatellite' repeats with between 2 and 10 bp repeat units and 10 or more perfect copies. 
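The selection criteria quoted above for perfect repeats (a repeat unit of a given length, a minimum number of uninterrupted copies, no mono-nucleotide runs) can be illustrated with a small regular-expression scan in Python. This sketch is neither Tandem Repeats Finder nor Tandyman; the function name find_perfect_repeats and its parameters are hypothetical, and the defaults follow the 2 to 10 bp, 10-copy description, while min_unit=2, max_unit=3, min_copies=15 would approximate the di- and tri-nucleotide criteria.

```python
import re


def find_perfect_repeats(seq, min_unit=2, max_unit=10, min_copies=10):
    """Return (start, end, unit, copies) for each perfect tandem repeat.

    Coordinates are 0-based, half-open. Units that are themselves built from a
    shorter motif (for example "AA" or "ACAC") are skipped, which also excludes
    mono-nucleotide runs.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_unit, max_unit + 1):
        # A k-base unit followed by at least (min_copies - 1) exact copies of it
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k)):
                continue  # unit is a repeat of a shorter motif
            copies = (m.end() - m.start()) // k
            hits.append((m.start(), m.end(), unit, copies))
    return sorted(hits)


if __name__ == "__main__":
    example = "GGT" + "AC" * 12 + "TTG" + "A" * 30
    print(find_perfect_repeats(example))
```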
Over 90% of the items will be multi-allelic polymorphisms. Click on an individual repeat element within the track for more information about that item. Methods This track was created by using three programs: Tandyman, display_VNTR and Primeleftright. Tandyman is a program for identifying perfectly identical tandem repeat sequences written by Robert Leach. It has been shown that the number of continuous perfect repeats in a microsatellite is perhaps the primary factor in generating polymorphism at that locus. display_VNTR is a wrapper for Tandyman which, among other things, creates a fasta delimited file suitable for automated primer design. This is available from Gerome Breen. Primeleftright is a simple program which uses a strict set of thermodynamic parameters to select primers giving the smallest possible PCR product. This is available from Leo Schalkwyk. These programs were used (via linking Perl scripts to reformat output and input files) to find all perfect 'microsatellite' repeats with between 2 and 10 bp repeat units and 10 or more perfect copies. Particular features include: The high probability (>90%) that elements of this track are polymorphic and may have multiple alleles. The exclusion of mono-nucleotide repeats. These are particularly common but very difficult to genotype. The presence of a "distance to next repeat" score in bp allowing users to filter overlapping repeats. The primer designs are a first-pass design which will improve in future versions of this data. The main problem with the design is the tendency of primers to end up in repeat regions near the repeat of interest. Users may want to carry out their own QC on the quality of the designs, and we expect a good proportion of the designs to be usable. Credits We'd like to thank Gerome Breen and Nik Ammar and the SGDP Centre at the Institute of Psychiatry for providing the data used to generate this track. If you wish to cite this data please cite Breen et al.
"Distributions of Polymorphic Microsatellites in Mammalian and Other Genomes." (in preparation). miRNA miRNA MicroRNAs from miRBase Genes and Gene Predictions Description The miRNA track shows microRNAs from miRBase. Display Conventions and Configuration Mature miRNAs (miRs) are represented by thick blocks. The predicted stem-loop portions of the primary transcripts are indicated by thinner blocks. miRNAs in the sense orientation are shown in black; those in the reverse orientation are colored grey. When a single precursor produces two mature miRs from its 5' and 3' parts, it is displayed twice with the two different positions of the mature miR. To display only those items that exceed a specific unnormalized score, enter a minimum score between 0 and 1000 in the text box at the top of the track description page. Methods Mature and precursor miRNAs from the miRNA Registry were aligned against the genome using blat. The extents of the precursor sequences were not generally known, and were predicted based on base-paired hairpin structure. miRBase is described in Griffiths-Jones, S. et al. (2006). The miRNA Registry is described in Griffiths-Jones, S. (2004) and Weber, M.J. (2005) in the References section below. Credits This track was created by Michel Weber of Laboratoire de Biologie Moléculaire Eucaryote, CNRS Université Paul Sabatier (Toulouse, France), Yves Quentin of Laboratoire de Microbiologie et Génétique Moléculaires (Toulouse, France) and Sam Griffiths-Jones of The Wellcome Trust Sanger Institute (Cambridge, UK). References When making use of these data, please cite: Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D140-4. PMID: 16381832; PMC: PMC1347474 Griffiths-Jones S. The microRNA Registry. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D109-11. PMID: 14681370; PMC: PMC308757 Weber MJ. 
New human and mouse microRNA genes found by homology search. FEBS J. 2005 Jan;272(1):59-73. PMID: 15634332 The following publication provides guidelines on miRNA annotation: Ambros V, Bartel B, Bartel DP, Burge CB, Carrington JC, Chen X, Dreyfuss G, Eddy SR, Griffiths-Jones S, Marshall M et al. A uniform system for microRNA annotation. RNA. 2003 Mar;9(3):277-9. PMID: 12592000; PMC: PMC1370393 For more information on blat, see Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 encodeMlaganAlign MLAGAN Alignment Stanford MLAGAN Alignments ENCODE Comparative Genomics Description This track displays human-centric multiple sequence alignments in the ENCODE regions for the 23 vertebrates included in the May 2005 ENCODE MSA freeze, based on comparative sequence data generated for the ENCODE project. The alignments in this track were generated using the LAGAN Alignment Toolkit. A complete list of the vertebrates included in the May 2005 freeze may be found at the top of the description page for this track. The Genome Browser companion tracks, MLAGAN Cons and MLAGAN Elements, display conservation scoring and conserved elements for these alignments based on various conservation methods. Display Conventions and Configuration In full display mode, this track shows pairwise alignments of each species aligned to the human genome. The alignments are shown in dense display mode using a gray-scale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display. When zoomed-in to the base-display level, the track shows the base composition of each alignment. The numbers and symbols on the "human gap" line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. 
If there is sufficient space in the display, the size of the gap is shown; if not, and if the gap size is a multiple of 3, a "*" is displayed, otherwise "+" is shown. To view detailed information about the alignments at a specific position, zoom in the display to 30,000 or fewer bases, then click on the alignment. Methods To create the alignments, the sequence of each non-human species was first "rearranged" to be orthologously collinear with respect to the human sequence. The rearrangements were generated using a suite of tools and algorithms based on Shuffle-LAGAN and SuperMap. For each pairing of human sequence with that of another species, Shuffle-LAGAN was used to find the best-scoring chain of local similarities according to a scoring scheme that penalized evolutionary rearrangements. SuperMap was then used to aggregate parts of the chain into a human-monotonic map of syntenic blocks. This mapping was used to undo the genomic rearrangements of the other sequence and convert it to a form that was directly alignable to the human sequence. A multiple global alignment was created for every region using MLAGAN. The alignments were then refined using MUSCLE, which processes small non-overlapping windows of an alignment and attempts to realign them in an iterative fashion, keeping the refined alignment if it has a better sum-of-pairs score than the original. Credits The MLAGAN alignments were generated by George Asimenos from Stanford's ENCODE group. Shuffle-LAGAN, SuperMap and MLAGAN were written by Mike Brudno. MUSCLE was authored by Bob Edgar. The phylogenetic tree is based on Murphy et al. (2001) and general consensus in the vertebrate phylogeny community. References Brudno M, Do C, Cooper G, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13(4):721-31. Brudno M, Malde S, Poliakov A, Do C, Couronne O, Dubchak I, Batzoglou S.
Glocal alignment: finding rearrangements during alignment. Bioinformatics. 2003;19(Suppl. 1):i54-i62. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl Acids Res. 2004;32(5):1792-7. Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001;294(5550):2348-51. encodeMlaganCons MLAGAN Cons Stanford MLAGAN Conservation ENCODE Comparative Genomics Description This track displays different measurements of conservation based on the MLAGAN multiple sequence alignments of ENCODE regions shown in the MLAGAN Alignment track. Three programs, binCons (binomial-based conservation method), phastCons (phylogenetic hidden-Markov model method), and GERP (Genomic Evolutionary Rate Profiling), generated the conservation scoring used to create this track. A related track, MLAGAN Elements, shows multi-species conserved sequences (MCSs) based on the conservation measurements displayed in this track. For details on the conservation scores generated by each program, refer to the individual Methods subsections. Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different conservation methods. See the Methods section for display information specific to each subtrack.
Methods The methods used to create the MLAGAN alignments in the ENCODE regions are described in the MLAGAN Alignment track description. BinCons The binCons score is based on the cumulative binomial probability of detecting the observed number of identical bases (or greater) in sliding 25 bp windows (moving one bp at a time) between the reference sequence and each other species, given the neutral rate at four-fold degenerate sites. Neutral rates are calculated separately at each targeted region. For targets with no gene annotations, the average percent identity across all alignable sequence was instead used to weight the individual species binomial scores (this latter weighting scheme was found to closely match 4D weights). The negative log of these P-values was then averaged across all human-referenced pairwise combinations, and the highest scoring overlapping 25 bp window for each base was the resulting score. This track shows the plotting of a ranked percentile score normalized between 0 and 1 across all ENCODE regions, such that the top 5% most conserved sequence across all ENCODE regions have a score of 0.95 or greater (top 10% have a score of 0.9 or greater, and so on). BinCons scores were normalized to represent a percentile to the power of 10. For example, scores representing the top 1 percent most conserved sequence, 99th percentile, have a score greater than or equal to 0.99^10 = 0.904. Transforming scores to the power of 10 was done for visual purposes only, in order to accentuate and distinguish the peaks of more highly conserved regions. More details on binCons can be found in Margulies et al. (2003), cited below. PhastCons The phastCons program predicts conserved elements and produces base-by-base conservation scores using a two-state phylogenetic hidden Markov model. The model consists of a state for conserved regions and a state for nonconserved regions, each of which is associated with a phylogenetic model.
These two models are identical except that the branch lengths of the conserved phylogeny are multiplied by a scaling parameter rho (0 < rho < 1). For determining the conservation for the ENCODE MLAGAN alignments, the nonconserved model was estimated from four-fold degenerate coding sites within the ENCODE regions using the program phyloFit. The parameter rho was then estimated by maximum likelihood, conditional on the nonconserved model, using the EM algorithm implemented in phastCons. Parameter estimation was based on a single large alignment, constructed by concatenating the alignments for all conserved regions. PhastCons was run with the options --expected-lengths 15 and --target-coverage 0.05 to obtain the desired level of "smoothing" and a final coverage by conserved elements of 5%. The conservation score at each base is the posterior probability that the base was generated by the conserved state of the phylo-HMM. It can be interpreted as the probability that the base is in a conserved element, given the assumptions of the model and the estimated parameters. Scores range from 0 to 1, with higher scores corresponding to higher levels of conservation. More details on phastCons can be found in Siepel et al. (2005), cited below. GERP The GERP score is the expected substitution rate divided by the observed substitution rate at a particular human base. Scores are estimated on a column-by-column basis using multiple sequence alignments of mammalian genomic DNA generated by MLAGAN. The scores range from 0 to 3; those greater than 3 are clipped to 3. The expected and observed rates are both calculated on a phylogenetic tree using the same fixed topology. The branch lengths of the expected tree are based on the average substitutions at neutral sites. The branch lengths of the observed tree, which is calculated separately for each human base, are based on the substitutions seen at the column of the multiple alignment at that base.
Species that have gaps at a particular column are not considered in the scoring for that column. Higher scores correspond to human bases in alignment columns with higher degrees of similarity, i.e. bases that have evolved slowly, some of which have been under purifying selection. The opposite holds true for swiftly evolving (low similarity) columns. Scores are deterministic, given a maximum-likelihood model of nucleotide substitution, species topology, neutral tree, and alignment. Credits BinCons was developed by Elliott Margulies of the Eric Green lab at NHGRI. PhastCons was developed by Adam Siepel in the Haussler lab at UCSC. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). The data for this track were generated by Elliott Margulies, with assistance from Adam Siepel. References Cooper, G.M., Stone, E.A., Asimenos, G., NISC Comparative Sequencing Program, Green, E.D., Batzoglou, S. and Sidow, A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res.. 15(7), 901-13 (2005). Margulies, E.H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D. and Green, E.D. Identification and characterization of multi-species conserved sequences. Genome Res. 13, 2507-18 (2003). Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034-50 (2005). 
encodeMlaganGerpCons MLAGAN GERP Cons MLAGAN GERP Conservation ENCODE Comparative Genomics encodeMlaganBinCons MLAGAN BinCons MLAGAN BinCons Conservation ENCODE Comparative Genomics encodeMlaganPhastCons MLAGAN PhastCons MLAGAN PhastCons Conservation ENCODE Comparative Genomics encodeMlaganElements MLAGAN Elements Stanford MLAGAN Conserved Elements ENCODE Comparative Genomics Description This track displays multi-species conserved sequences (MCSs) derived from binCons, phastCons, and genomic evolutionary rate profiling (GERP) conservation scoring of human ENCODE genomic DNA alignments to 22 other vertebrates using the MLAGAN alignment package. The combined-methods subtracks show the union/intersection of conserved elements produced by the three conservation methods. The multiple sequence alignments may be viewed in the MLAGAN Alignments track. Another related track, MLAGAN Cons, shows the conservation scoring. The descriptions accompanying these tracks detail the methods used to create the alignments and conservation. Display Conventions and Configuration The locations of conserved elements are indicated by blocks in the graphical display. This composite annotation track consists of several subtracks that show conserved elements derived by the three methods listed above, as well as both unions and intersections of the sets of coding and noncoding conserved elements. To show only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The display may also be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Display characteristics specific to certain subtracks are described in the respective Methods sections below.
Methods BinCons-based Elements For each ENCODE target, a conservation score threshold was picked to match the number of conserved bases predicted by phastCons, an alternative method for measuring conservation. This latter method has been found slightly more reliable for predicting the expected fraction of conserved sequence in each target. Clusters of bases that exceeded the given conservation score threshold were designated as MCSs. The minimum length of an MCS is 25 bases. Strict cutoffs were used: if even one base fell below the conservation score threshold, it separated an MCS into two distinct regions. PhastCons-based Elements The predicted MCSs are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM, i.e., maximal segments in which the maximum-likelihood (Viterbi) path remains in the conserved state. GERP-based Elements GERP constrained elements exhibit significant evidence of the effects of purifying selection. Elements are scored according to the inferred intensity of purifying selection and are measured as "rejected substitutions" (RSs). RSs capture the magnitude of difference between the number of "observed" substitutions (estimated using maximum likelihood) and the number that would be "expected" under a neutral model of evolution. The RS is displayed as part of the item name. Items with higher RSs are displayed in a darker shade of blue. The score shown on the details page, which has been scaled by 300 for display purposes, is generally not as accurate as the RS count that is part of the item name. "Constrained elements" are identified as those groups of consecutive human bases that have an observed rate of evolution that is smaller than the expected rate. These groups of columns are merged if they are less than a few nucleotides apart and are scored according to the sum of the site-by-site difference between observed and expected rates (RS). 
Permutations of the actual alignments were analyzed, and the "constrained elements" identified in these permuted alignments were treated as "false positives". Subsequently, an RS threshold was picked such that the total length of "false positive" constrained elements (identified in the permuted alignments) was less than 5% of the length of constrained elements identified in the actual alignment. Thus, all annotated constrained elements are significant at better than 95% confidence, and the total fraction of the ENCODE regions annotated as constrained is 5-7%. PhastCons/BinCons/GERP Union/Intersection of Coding/NonCoding Elements These subtracks were produced by creating unions and intersections of the constrained element data detected by binCons, phastCons, and GERP on MLAGAN alignments. In these annotations, "non-coding" is defined as those regions not overlapping with CDS regions in any of the following UCSC gene tables: refFlat, knownGene, mgcGenes, vegaGene, or ensGene. Credits BinCons and phastCons MCS data were contributed by Elliott Margulies in the Eric Green lab at NHGRI, with assistance from Adam Siepel of UCSC. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). References See the MLAGAN Alignment and MLAGAN Cons tracks for references. 
encodeMlaganNcIntersectEl MLAGAN NC Intersect MLAGAN PhastCons/BinCons/GERP Intersection NonCoding Conserved Elements ENCODE Comparative Genomics encodeMlaganIntersectEl MLAGAN Intersect MLAGAN PhastCons/BinCons/GERP Intersection Conserved Elements ENCODE Comparative Genomics encodeMlaganNcUnionEl MLAGAN NC Union MLAGAN PhastCons/BinCons/GERP Union NonCoding Conserved Elements ENCODE Comparative Genomics encodeMlaganUnionEl MLAGAN Union MLAGAN PhastCons/BinCons/GERP Union Conserved Elements ENCODE Comparative Genomics encodeMlaganGerpEl MLAGAN GERP MLAGAN GERP Conserved Elements ENCODE Comparative Genomics encodeMlaganBinConsEl MLAGAN BinCons MLAGAN BinCons Conserved Elements ENCODE Comparative Genomics encodeMlaganPhastConsEl MLAGAN PhastCons MLAGAN PhastCons Conserved Elements ENCODE Comparative Genomics nci60 NCI60 Microarray Experiments for NCI 60 Cell Lines Expression Description Expression data from "Systematic variation in gene expression patterns in human cancer cell lines", Ross et al., Nature Genetics 2000 Mar; 24(3):227-35. cDNA microarrays were used to explore the variation in expression of approximately 8,000 unique genes among the 60 cell lines used in the National Cancer Institute's screen for anti-cancer drugs. The authors have provided a web supplement where more data and experimental description can be obtained. cDNA probes were placed on the draft human genome using GenBank sequences referenced by the IMAGE clone IDs. The data are shown in a tabular format in which each column of colored boxes represents the variation in transcript levels for a given cDNA across all of the array experiments, and each row represents the measured transcript levels for all genes in a single sample. The variation in transcript levels for each gene is represented by a color scale, in which red indicates an increase in transcript levels, and green indicates a decrease in transcript levels, relative to the reference sample.
The saturation of the color corresponds to the magnitude of transcript variation. A black color indicates an undetectable change in expression, while a gray box indicates missing data. Display Options This track has filter options to customize tissue types presented and the color of the display. Combine Arrays: This option is only valid when the track is displayed in full. It determines how the experiments are displayed. The options are: Arrays Grouped by Tissue Median: (default) Displays the median of the log ratio scores of all cell lines from the different tissue types. Arrays Grouped by Tissue Mean: Displays the mean of the log ratio scores of all cell lines from the different tissue types. All Arrays (experiments): Displays the log ratio score for all cell line experiments. Color Scheme: Data are presented using two color false display. By default the Brown/Botstein colors of red -> positive log ratio, green -> negative log ratio are used. However, a yellow/blue option can be selected for those who are colorblind. Details Page On the details page, the probes presented correspond to those contained in the window range displayed on the Genome Browser. The exon probe and experiment selected are highlighted in blue. encodeIndels NHGRI DIPs NHGRI Deletion/Insertion Polymorphisms in ENCODE regions ENCODE Variation Description This track shows deletion/insertion polymorphisms (DIPs). In packed and full modes, the sequence variation is shown to the left of the DIP. The naming convention "-/sequence" is used for deletions; "sequence/-" is used for insertions. The details page shows the name of the trace used to define the polymorphism, the quality score, and the strand on which the trace aligns to the reference sequence. The quality score reflects the minimum PHRED quality value over the entire range of the DIP within the trace, plus 5 flanking bases. 
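As a rough illustration of the DIP quality score just described, the sketch below takes the minimum PHRED value over the DIP range within a trace, extended by flanking bases. The function names are hypothetical, and applying the 5 flanking bases on each side of the DIP is an assumption; the source does not say whether the flank is per side. The conversion helper uses the standard PHRED definition Q = -10 * log10(Pe).

```python
def dip_quality(quals, dip_start, dip_end, flank=5):
    """Minimum PHRED quality over trace positions [dip_start, dip_end),
    extended by `flank` bases on each side (assumption: flank is per side).

    `quals` is a list of per-base PHRED values for one trace (hypothetical input).
    """
    lo = max(0, dip_start - flank)
    hi = min(len(quals), dip_end + flank)
    return min(quals[lo:hi])


def phred_to_error_prob(q):
    """Invert Q = -10 * log10(Pe) to get the estimated error probability Pe."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    quals = [40] * 20
    quals[3] = 10   # low-quality base outside the window, ignored
    quals[8] = 25   # low-quality base inside the flank, sets the score
    q = dip_quality(quals, dip_start=10, dip_end=12)
    print(q, phred_to_error_prob(q))
```

Using the minimum rather than the mean means one poor base near the DIP is enough to lower the reported score.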
PHRED quality scores are expressed as log probabilities using the formula: Q = -10 * log10(Pe) where Pe is the estimated probability of an error at that base. PHRED quality scores typically vary from 0 to 40, where 0 indicates complete uncertainty about the base and 40 implies odds of 10,000 to 1 that the base is correct. Sometimes a PHRED value of 50 or higher is used to denote finished sequence. A color gradient is used to distinguish quality scores in the browser display: brighter shading indicates higher scores. The "Trace Pos" value on the details page indicates the 3' position of the DIP within the trace. The alleles are reported relative to the "+" strand of the reference sequence; however, the trace may actually align to the "-" strand. When viewing the chromatogram using the URL provided, if the trace aligned to the "-" strand, the DIP bases in the trace will be the reverse complement of the variant allele given. Methods All human trace data from NCBI's trace archive were aligned to hg17 with ssahaSNP, followed by ssahaDIP post-processing to detect deletion/insertion polymorphisms. DIPs within ENCODE regions were extracted. Verification For verification, 500k traces from the mouse whole genome shotgun (WGS) sequencing effort were compared to mm6 using ssahaSNP and ssahaDIP. Because mm6 and these traces are from the same mouse strain, C57BL/6J, the DIP rate should be very low. Applying a quality threshold of Q23, the detected DIP rate was one DIP per 140k Neighborhood Quality Standard (NQS) bases. This level was ten-fold lower than the SNP rate for the same data set using ssahaSNP, which has been validated as having a 5% false positive rate. The detected DIP rate for human traces against hg17 is one DIP per 12k NQS bases, indicating a false positive rate of 12k/140k, or about 8%. Further validation experiments are in progress. Credits All analyses were performed by Jim Mullikin using ssahaSNP and ssahaDIP.
The trace data were contributed to the trace archive by many sequencing centers. References Ning Z, Cox AJ, Mullikin JC. SSAHA: A fast search method for large DNA databases. Genome Res. 2001 Oct;11(10):1725-9. The International SNP Map Working Group. A map of human genome sequence variation containing 1.4 million single nucleotide polymorphisms. Nature. 2001 Feb 15;409(6822):928-33. nhgriDnaseHs NHGRI DNaseI-HS NHGRI DNaseI-Hypersensitive Sites Regulation Description This track displays DNaseI-hypersensitive sites in CD4+ T-cells. DNaseI-hypersensitive sites are associated with gene regulatory regions, particularly for upregulated genes. CD4+ T-cells, also known as helper or inducer T cells, are involved in generating an immune response. CD4+ T-cells are also one of the primary targets of HIV. Display Conventions and Configuration Gray and black blocks (which appear as vertical lines when the display is zoomed-out) represent probable hypersensitive sites. The darker the blocks, the more likely the site is to be hypersensitive. The display may be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Methods DNaseI-hypersensitive sites were cloned from primary human CD4+ T cells and sequenced using massively parallel signature sequencing (Brenner et al., 2000; Crawford et al., 2006). Only those clusters of multiple DNaseI library sequences that map within 500 bases of each other are displayed. Each cluster has a unique identifier, visible when the track is displayed in full or packed mode. The last digit of each identifier represents the number of sequences that map within that particular cluster. The sequence number is also reflected in the score, e.g. a cluster of two sequences scores 500, three sequences scores 750 and four or more sequences scores 1000.
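The clustering and scoring rule described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and grouping sequences by chaining neighbours that lie within 500 bases of each other (single-linkage on sorted positions) is an assumption about how "within 500 bases of each other" was applied. The score mapping (2 sequences: 500, 3 sequences: 750, 4 or more: 1000) is taken directly from the text.

```python
def cluster_positions(positions, max_gap=500):
    """Group mapped sequence positions into clusters.

    Assumption: consecutive sorted positions no more than `max_gap` bases
    apart belong to the same cluster. Singletons are dropped, since only
    clusters of multiple sequences are displayed.
    """
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= 2:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= 2:
        clusters.append(current)
    return clusters


def cluster_score(n_sequences):
    """Map the number of sequences in a cluster to the browser score."""
    if n_sequences < 2:
        raise ValueError("clusters contain at least two sequences")
    return {2: 500, 3: 750}.get(n_sequences, 1000)


if __name__ == "__main__":
    hits = [100, 400, 5000, 5200, 5600, 5900, 20000]
    for cluster in cluster_positions(hits):
        print(cluster, cluster_score(len(cluster)))
```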
Real-time PCR assay was used to verify valid DNaseI-hypersensitive sites. Approximately 50% of clusters of two sequences are valid. These clusters are shown in light gray. 80% of clusters of three sequences are valid, indicated by dark gray. 100% of clusters of four or more sequences are valid, shown in black. Credits These data were produced at the Collins Lab at NHGRI. Thanks to Gregory E. Crawford and Francis S. Collins for supplying the information for this track. References Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 2000 Jun;18(6):597-8. Crawford GE, Holt IE, Mullikin JC, Tai D, Blakesley R, Bouffard G, Young A, Masiello C, Green ED, Wolfsberg TG et al. Identifying gene regulatory elements by genome-wide recovery of DNase hypersensitive sites. Proc. Natl. Acad. Sci. USA. 2004 Jan 27;101(4):992-7. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006 Jan;16(1):123-31. (See also NHGRI's data site for the project.) McArthur M, Gerum S, Stamatoyannopoulos G. Quantification of DNaseI-sensitivity by real-time PCR: quantitative analysis of DNaseI-hypersensitivity of the mouse beta-globin LCR. J. Mol. Biol. 2001 Oct 12;313(1):27-34. encodeNhgriDnaseHs NHGRI DNaseI-HS NHGRI DNaseI-Hypersensitive Sites ENCODE Chromosome, Chromatin and DNA Structure Description This track displays DNaseI-hypersensitive sites in CD4+ T-cells before and after activation by anti-CD3 and anti-CD28 antibodies. DNaseI-hypersensitive sites are associated with gene regulatory regions, particularly for upregulated genes. CD4+ T-cells, also known as helper or inducer T cells, are involved in generating an immune response. 
CD4+ T-cells are also one of the primary targets of HIV. Display Conventions and Configuration The top subtrack of this annotation corresponds to unactivated T cells; the bottom subtrack shows activated T cells. Within the subtracks, the gray and black blocks (which appear as vertical lines when the display is zoomed-out) represent probable hypersensitive sites. The darker the blocks, the more likely the site is to be hypersensitive. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The display may also be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Methods Primary human CD4+ T cells were activated by incubation with anti-CD3 and anti-CD28 antibodies for 24 hours. DNaseI-hypersensitive sites were cloned from the cells before and after activation, and sequenced using massively parallel signature sequencing (Brenner et al., 2000; Crawford et al., 2006). Only those clusters of multiple DNaseI library sequences that map within 500 bases of each other are displayed. Each cluster has a unique identifier, visible when the track is displayed in full or packed mode. The last digit of each identifier represents the number of sequences that map within that particular cluster. The sequence number is also reflected in the score, e.g. a cluster of two sequences scores 500, three sequences scores 750 and four or more sequences scores 1000. Verification Real-time PCR assay was used to verify valid DNaseI-hypersensitive sites. Approximately 50% of clusters of two sequences are valid. These clusters are shown in light gray. 80% of clusters of three sequences are valid, indicated by dark gray. 100% of clusters of four or more sequences are valid, shown in black. This data set includes confirmed elements for 35 of the 44 ENCODE regions.
It is estimated that these data identify 10-20% of all hypersensitive sites within CD4+ T cells. Further sequencing will be required to identify additional sites. Credits These data were produced at the Collins Lab at NHGRI. Thanks to Gregory E. Crawford and Francis S. Collins for supplying the information for this track. References Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 2000 Jun;18(6):597-8. Crawford GE, Holt IE, Mullikin JC, Tai D, Blakesley R, Bouffard G, Young A, Masiello C, Green ED, Wolfsberg TG et al. Identifying gene regulatory elements by genome-wide recovery of DNase hypersensitive sites. Proc. Natl. Acad. Sci. USA. 2004 Jan 27;101(4):992-7. Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006 Jan;16(1):123-31. (See also NHGRI's data site for the project.) McArthur M, Gerum S, Stamatoyannopoulos G. Quantification of DNaseI-sensitivity by real-time PCR: quantitative analysis of DNaseI-hypersensitivity of the mouse beta-globin LCR. J. Mol. Biol. 2001 Oct 12;313(1):27-34. encodeNhgriDnaseHsAct DNase CD4 Activ. NHGRI DNaseI-Hypersensitive Sites (CD4+ T-Cells Activated) ENCODE Chromosome, Chromatin and DNA Structure encodeNhgriDnaseHsNonAct DNase CD4 Unact. NHGRI DNaseI-Hypersensitive Sites (CD4+ T-Cells Unactivated) ENCODE Chromosome, Chromatin and DNA Structure xenoEst Other ESTs Non-Human ESTs from GenBank mRNA and EST Description This track displays translated blat alignments of expressed sequence tags (ESTs) in GenBank from organisms other than human. ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. 
Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) for this track is in two parts. The first + or - indicates the orientation of the query sequence whose translated protein produced the match. The second + or - indicates the orientation of the matching translated genomic sequence. Because the two orientations of a DNA sequence give different predicted protein sequences, there are four combinations. ++ is not the same as --, nor is +- the same as -+. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the EST display. For example, to apply the filter to all ESTs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only ESTs that match all filter criteria will be highlighted. If "or" is selected, ESTs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display ESTs that match the filter criteria. 
If "include" is selected, the browser will display only those ESTs that match the filter criteria. This track may also be configured to display base labeling, a feature that allows the user to display all bases in the aligning sequence or only those that differ from the genomic sequence. For more information about this option, go to the Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods To generate this track, the ESTs were aligned against the genome using blat. When a single EST aligned in multiple places, the alignment having the highest base identity was found. Only alignments having a base identity level within 0.5% of the best and at least 96% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from EST sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 xenoMrna Other mRNAs Non-Human mRNAs from GenBank mRNA and EST Description This track displays translated blat alignments of vertebrate and invertebrate mRNA in GenBank from organisms other than human. Display Conventions and Configuration This track follows the display conventions for PSL alignment tracks. In dense display mode, the items that are more darkly shaded indicate matches of better quality. The strand information (+/-) for this track is in two parts. 
The first + indicates the orientation of the query sequence whose translated protein produced the match (here always 5' to 3', hence +). The second + or - indicates the orientation of the matching translated genomic sequence. Because the two orientations of a DNA sequence give different predicted protein sequences, there are four combinations. ++ is not the same as --, nor is +- the same as -+. The description page for this track has a filter that can be used to change the display mode, alter the color, and include/exclude a subset of items within the track. This may be helpful when many items are shown in the track display, especially when only some are relevant to the current task. To use the filter: Type a term in one or more of the text boxes to filter the mRNA display. For example, to apply the filter to all mRNAs expressed in a specific organ, type the name of the organ in the tissue box. To view the list of valid terms for each text box, consult the table in the Table Browser that corresponds to the factor on which you wish to filter. For example, the "tissue" table contains all the types of tissues that can be entered into the tissue text box. Multiple terms may be entered at once, separated by a space. Wildcards may also be used in the filter. If filtering on more than one value, choose the desired combination logic. If "and" is selected, only mRNAs that match all filter criteria will be highlighted. If "or" is selected, mRNAs that match any one of the filter criteria will be highlighted. Choose the color or display characteristic that should be used to highlight or include/exclude the filtered items. If "exclude" is chosen, the browser will not display mRNAs that match the filter criteria. If "include" is selected, the browser will display only those mRNAs that match the filter criteria. This track may also be configured to display codon coloring, a feature that allows the user to quickly compare mRNAs against the genomic sequence. 
For more information about this option, go to the Codon and Base Coloring for Alignment Tracks page. Several types of alignment gap may also be colored; for more information, go to the Alignment Insertion/Deletion Display Options page. Methods The mRNAs were aligned against the human genome using translated blat. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only those alignments having a base identity level within 1% of the best and at least 25% base identity with the genomic sequence were kept. Credits The mRNA track was produced at UCSC from mRNA sequence data submitted to the international public sequence databases by scientists worldwide. References Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2013 Jan;41(Database issue):D36-42. PMID: 23193287; PMC: PMC3531190 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779 Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 xenoRefGene Other RefSeq Non-Human RefSeq Genes Genes and Gene Predictions Description This track shows known protein-coding and non-protein-coding genes for organisms other than human, taken from the NCBI RNA reference sequences collection (RefSeq). The data underlying this track are updated weekly. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks. The color shading indicates the level of review the RefSeq record has undergone: predicted (light), provisional (medium), reviewed (dark). The item labels and display colors of features within this track can be configured through the controls at the top of the track description page. Label: By default, items are labeled by gene name. 
Click the appropriate Label option to display the accession name instead of the gene name, show both the gene and accession names, or turn off the label completely. Codon coloring: This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. For more information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page. Hide non-coding genes: By default, both the protein-coding and non-protein-coding genes are displayed. If you wish to see only the coding genes, click this box. Methods The RNAs were aligned against the human genome using blat; those with an alignment of less than 15% were discarded. When a single RNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 25% base identity with the genomic sequence were kept. Credits This track was produced at UCSC from RNA sequence data generated by scientists worldwide and curated by the NCBI RefSeq project. References Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518 Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018 Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. 
PMID: 15608248; PMC: PMC539979 perlegen Perlegen Haplotypes Perlegen Common High-Resolution Haplotype Blocks Variation and Repeats Description Haplotype blocks derived from common single nucleotide polymorphisms (SNPs) on chromosome 21 by Perlegen Sciences, as described in Patil, N. et al. Blocks of limited haplotype diversity revealed by high-resolution scanning. Science. 2001;294(5547):1719-1723. The location of each haplotype block is represented by a blue horizontal line with tall vertical blue bars at the first and last SNPs of the block. Blocks are displayed as starting at the first SNP and ending at the last SNP of the block. This is slightly different from the representation on the Perlegen web site in which blocks are stretched until they abut each other. The shade of the blue indicates the minimum number of SNPs required to discriminate between haplotype patterns that account for at least 80% of genotyped chromosomes. Darker colors indicate that fewer SNPs are necessary. Individual SNPs are denoted by smaller black vertical bars. At multi-megabase resolution in dense display mode, clusters of tall blue bars may indicate hotspots for recombination. For more information on a particular block, click "Outside Link" on the item's details page. NOTE: Perlegen annotations appear only on chromosome 21. Credits Thanks to Perlegen Sciences for making these data available. recombRate Recomb Rate Recombination Rate from deCODE, Marshfield, or Genethon Maps (deCODE default) Mapping and Sequencing Description The recombination rate track represents calculated sex-averaged rates of recombination based on either the deCODE, Marshfield, or Genethon genetic maps. By default, the deCODE map rates are displayed. Female- and male-specific recombination rates, as well as rates from the Marshfield and Genethon maps, can also be displayed by choosing the appropriate filter option on the track description page. 
Methods The deCODE genetic map was created at deCODE Genetics and is based on 5,136 microsatellite markers for 146 families with a total of 1,257 meiotic events. For more information on this map, see Kong, et al., 2002. The Marshfield genetic map was created at the Center for Medical Genetics and is based on 8,325 short tandem repeat polymorphisms (STRPs) for 8 CEPH families consisting of 134 individuals with 186 meioses. For more information on this map, see Broman et al., 1998. The Genethon genetic map was created at Genethon and is based on 5,264 microsatellites for 8 CEPH families consisting of 134 individuals with 186 meioses. For more information on this map, see Dib et al., 1996. Each base is assigned the recombination rate calculated by assuming a linear genetic distance across the immediately flanking genetic markers. The recombination rate assigned to each 1 Mb window is the average recombination rate of the bases contained within the window. Using the Filter This track has a filter that can be used to change the map or gender-specific rate displayed. The filter is located at the top of the track description page, which is accessed via the small button to the left of the track's graphical display or through the link on the track's control menu. To view a particular map or gender-specific rate, select the corresponding option from the "Map Distances" pulldown list. By default, the browser displays the deCODE sex-averaged distances. When you have finished configuring the filter, click the Submit button. Credits This track was produced at UCSC using data that are freely available for the Genethon, Marshfield, and deCODE genetic maps (see above links). Thanks to all who played a part in the creation of these maps. References Broman KW, Murray JC, Sheffield VC, White RL, Weber JL. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet. 1998 Sep;63(3):861-9. 
PMID: 9718341; PMC: PMC1377399 Dib C, Fauré S, Fizames C, Samson D, Drouot N, Vignal A, Millasseau P, Marc S, Hazan J, Seboun E et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature. 1996 Mar 14;380(6570):152-4. PMID: 8600387 Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G et al. A high-resolution recombination map of the human genome. Nat Genet. 2002 Jul;31(3):241-7. PMID: 12053178 encodeReseqRegions Reseq Regions ENCODE Resequencing Regions ENCODE Variation Description This track depicts the 10 ENCODE resequencing regions for the NHGRI ENCODE project. The long-term goal of this project is to identify all functional elements in the human genome sequence to facilitate a better understanding of human biology and disease. These regions were chosen out of the 44 total. The resequencing was done by Broad and Baylor. See the NHGRI target selection process web page for a description of how the target regions were selected. To open a UCSC Genome Browser with a menu for selecting ENCODE regions on the Build 34 human genome, use ENCODE Regions in the UCSC Browser. The UCSC resources provided for the ENCODE project are described on the UCSC ENCODE Portal. Credits Thanks to the NHGRI ENCODE project for providing this initial set of data. encodeRikenCage Riken CAGE Riken CAGE - Predicted Gene Start Sites ENCODE Transcript Levels Description This track shows the number of 5' cap analysis gene expression (CAGE) tags that map to the genome on the "plus" and "minus" strands at a specific location. For clarity, only the first 5' nucleotide in the tag (relative to the transcript direction) is considered. Areas in which many tags map to the same region may indicate a significant transcription start site. Display Conventions and Configuration The position of the first 5' nucleotide in the tag is represented by a solid block. 
The height of the block indicates the number of 5' cDNA starts that map at that location. This composite annotation track contains multiple subtracks that may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. For more information about the graphical configuration options, click the Graph configuration help link. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Methods The CAGE tags are sequenced from the 5' ends of full-length cDNAs produced using RIKEN full-length cDNA technology. To create the tag, a linker was attached to the 5' end of full-length cDNAs, which were selected by cap trapping. The first 20 bp of the cDNA were cleaved using class II restriction enzymes, followed by PCR amplification; concatemers of the resulting 32 bp tags were then formed for more efficient sequencing. For more information on CAGE analysis, see Shiraki et al. (2003) below. Refer to the RIKEN website for information about RIKEN full-length cDNA technologies. The mapping methodology employed in this annotation will be described in upcoming publications. Verification The techniques used to verify these data will be described in upcoming publications. Credits These data were contributed by the Functional Annotation of Mouse (FANTOM) Consortium, RIKEN Genome Science Laboratory and RIKEN Genome Exploration Research Group (Genome Network Project Core Group). FANTOM Consortium: P. Carninci, T. Kasukawa, S. Katayama, Gough, M. Frith, N. Maeda, R. Oyama, T. Ravasi, B. Lenhard, C. Wells, R. Kodzius, K. Shimokawa, V. B. Bajic, S. E. Brenner, S. Batalov, A. R. R. Forrest, M. Zavolan, M. J. Davis, L. G. Wilming, V. Aidinis, J. Allen, A. Ambesi-Impiombato, R. Apweiler, R. N. Aturaliya, T. L. Bailey, M. Bansal, K. W. Beisel, T. Bersano, H. Bono, A. M. Chalk, K. P. Chiu, V. Choudhary, A.
Christoffels, D. R. Clutterbuck, M. L. Crowe, E. Dalla, B. P. Dalrymple, B. de Bono, G. Della Gatta, D. di Bernardo, T. Down, P. Engstrom, M. Fagiolini, G. Faulkner, C. F. Fletcher, T. Fukushima, M. Furuno, S. Futaki, M. Gariboldi, P. Georgii-Hemming, T. R. Gingeras, T. Gojobori, R. E. Green, S. Gustincich, M. Harbers, V. Harokopos, Y. Hayashi, S. Henning, T. K. Hensch, N. Hirokawa, D. Hill, L. Huminiecki, M. Iacono, K. Ikeo, A. Iwama, T. Ishikawa, M. Jakt, A. Kanapin, M. Katoh, Y. Kawasawa, J. Kelso, H. Kitamura, H. Kitano, G. Kollias, S. P. T. Krishnan, A.F. Kruger, K. Kummerfeld, I. V. Kurochkin, L. F. Lareau, L. Lipovich, J. Liu, S. Liuni, S. McWilliam, M. Madan Babu, M. Madera, L. Marchionni, H. Matsuda, S. Matsuzawa, H. Miki, F. Mignone, S. Miyake, K. Morris, S. Mottagui-Tabar, N. Mulder, N. Nakano, H. Nakauchi, P. Ng, R. Nilsson, S. Nishiguchi, S. Nishikawa, F. Nori, O. Ohara, Y. Okazaki, V. Orlando, K. C. Pang, W. J. Pavan, G. Pavesi, G. Pesole, N. Petrovsky, S. Piazza, W. Qu, J. Reed, J. F. Reid, B. Z. Ring, M. Ringwald, B. Rost, Y. Ruan, S. Salzberg, A. Sandelin, C. Schneider, C. Schoenbach, K. Sekiguchi, C. A. M. Semple, S. Seno, L. Sessa, Y. Sheng, Y. Shibata, H. Shimada, K. Shimada, B. Sinclair, S. Sperling, E. Stupka, K. Sugiura, R. Sultana, Y. Takenaka, K. Taki, K. Tammoja, S. L. Tan, S. Tang, M. S. Taylor, J. Tegner, S. A. Teichmann, H. R. Ueda, E. van Nimwegene, R. Verardo, C. L. Wei, K. Yagi, H. Yamanishi, E. Zabarovsky, S. Zhu, A. Zimmer, W. Hide, C. Bult, S. M. Grimmond, R. D. Teasdale, E. T. Liu, V. Brusic, J. Quackenbush, C. Wahlestedt, J. Mattick, D. Hume. RIKEN Genome Exploration Research Group: C. Kai, D. Sasaki, Y. Tomaru, S. Fukuda, M. Kanamori-Katayama, M. Suzuki, J. Aoki, T. Arakawa, J. Iida, K. Imamura, M. Itoh, T. Kato, H. Kawaji, N. Kawagashira, T. Kawashima, M. Kojima, S. Kondo, H. Konno, K. Nakano, N. Ninomiya, T. Nishio, M. Okada, C. Plessy, K. Shibata, T. Shiraki, S. Suzuki, M. Tagami, K Waki, A. Watahiki, Y. Okamura-Oho, H. 
Suzuki, J. Kawai. General Organizer: Y. Hayashizaki References Shiraki, T., Kondo, S., Katayama, S., Waki, K., Kasukawa, T., Kawaji, H., Kodzius, R., Watahiki, A., Nakamura, M. et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A. 100(26), 15776-81 (2003). encodeRikenCageMinus Riken CAGE - Riken CAGE Minus Strand - Predicted Gene Start Sites ENCODE Transcript Levels encodeRikenCagePlus Riken CAGE + Riken CAGE Plus Strand - Predicted Gene Start Sites ENCODE Transcript Levels rnaGene RNA Genes Non-coding RNA Genes (dark) and Pseudogenes (light) Genes and Gene Predictions Description This track shows the location of non-protein coding RNA genes and pseudogenes. Feature types include: tRNA: Transfer RNA (or pseudogene) rRNA: Ribosomal RNA (or pseudogene) scRNA: Small cytoplasmic RNA (or pseudogene) snRNA: Small nuclear RNA (or pseudogene) snoRNA: Small nucleolar RNA (or pseudogene) miRNA: MicroRNA (or pseudogene) misc_RNA: Miscellaneous other RNA, such as Xist (or pseudogene) Methods Eddy-tRNAscanSE (tRNA genes, Sean Eddy): tRNAscan-SE 1.23 with default parameters. Score field contains tRNAscan-SE bit score; >20 is good, >50 is great. Eddy-BLAST-tRNAlib (tRNA pseudogenes, Sean Eddy): Wublast 2.0, with options "-kap wordmask=seg B=50000 W=8 cpus=1". Score field contains % identity in blast-aligned region. Used each of 602 tRNAs and pseudogenes predicted by tRNAscan-SE in the human oo27 assembly as queries. Kept all nonoverlapping regions that hit one or more of these with P Eddy-BLAST-snornalib (known snoRNAs and snoRNA pseudogenes, Steve Johnson): Wublastn 2.0, with options "-V=25 -hspmax=5000 -kap wordmask=seg B=5000 W=8 cpus=1". Score field contains blast score. Used each of 104 unique snoRNAs in snorna.lib as a query. Any hit >=95% full length and >=90% identity is annotated as a "true gene". 
Any other hit with P Eddy-BLAST-otherrnalib (non-tRNA, non-snoRNA noncoding RNAs with GenBank entries for the human gene.): Wublastn 2.0 [15 Apr 2002] with options: "-kap -cpus=1 -wordmask=seg -W=8 -E=0.01 -hspmax=0 -B=50000 -Z=3000000000". Exceptions to this are: Large ncRNAs (LSU & SSU rRNA, H19, Xist): change "-W=11"; addition "-maskextra=50". Xist contains repetitive elements and was masked with RepeatMasker, Library version 6.8. microRNAs: "-kap -cpus=1 -S=70 -hspmax=0 -B=100" replaces all above parameters. The score field contains the blastn score. 41 unique miRNAs and 29 other ncRNAs were used as queries. Any hit >=95% full length and >=95% identity is annotated as a "true gene". Any other hit with P = 65% identity is annotated as a "related sequence". There is an exception to this: all miRNAs consist of 16-26 bp sequences in GenBank and are annotated only if they are 100% full length and have 100% identity. The set of miRNAs used consists of Let-7 from Pasquinelli et al. (2000) and 40 miRNAs from Mourelatos et al. (2002), as mentioned in the references section below. Credits These data were kindly provided by Sean Eddy at Washington University. References Pasquinelli AE, Reinhart BJ, Slack F, Martindale MQ, Kuroda MI, Maller B, Hayward DC, Ball EE, Degnan B, Müller P, et al. Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. Nature. 2000 Nov 2;408(6808):86-9. Mourelatos Z, Dostie J, Paushkin S, Sharma A, Charroux B, Abel L, Rappsilber J, Mann M, Dreyfuss G. miRNPs: a novel class of ribonucleoproteins containing numerous microRNAs. Genes Dev. 2002 Mar 15;16(6):720-8. encodeSangerGenoExprAssociation Sanger Assoc Sanger Genotype-Expression Association ENCODE Variation Description This track displays associations among gene expression data from the 60 unrelated Centre d'Etude du Polymorphisme Humain (CEPH) individuals of the International HapMap Project with SNPs genotyped by HapMap. 
The CEPH population is composed of Utah residents with ancestry from northern and western Europe. The expression data were generated with the Illumina platform at the Wellcome Trust Sanger Institute. Display Conventions and Configuration In the graphical display, an association is displayed as a block drawn at the location of the associated SNP. In pack or full modes, the name of the associated gene is drawn to the left of the block. The shading of the block indicates the strength of the association: light gray indicates a (-log10) P-value close to 0 and black indicates a P-value of 2 or more. Methods An association analysis was performed for each ENCODE RefSeq gene with the genotypes of SNPs in the same ENCODE region (cis). Expression values were initially log2 transformed and subsequently normalized with quantile normalization to ensure homogeneous levels between arrays. Analysis of variance (ANOVA) was then performed with 1 or 2 degrees of freedom (depending on whether only two or all three genotypes in the population were available), using the genotype as a categorical variable and the normalized/transformed expression values as the response. The values presented here are the -log10 P-value. Verification There were six technical replicates for each sample; the average values from these were used for the ANOVA. Credits The following people contributed to this analysis: Barbara Stranger, Matthew Forrest, Panos Deloukas, and Manolis Dermitzakis from Wellcome Trust Sanger Institute and Simon Tavare from Cambridge University. References Dausset, J., Cann, H., Cohen, D., Lathrop, M., Lalouel, J.M. and White, R. Centre d'Etude du Polymorphisme Humain (CEPH): collaborative genetic mapping of the human genome. Genomics 6(3), 575-7 (1990). 
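The preprocessing and test statistic described in the Methods for this track (log2 transformation, quantile normalization across arrays, then a one-way ANOVA with genotype as the categorical variable) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Sanger pipeline: the function names are hypothetical, ties in the quantile step are broken arbitrarily, and the conversion of the F statistic to a P-value (which requires the F distribution from a statistics library) is not shown.

```python
import math


def log2_transform(matrix):
    """matrix[g][a] holds the (positive) expression of gene g on array a."""
    return [[math.log2(v) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Give every array (column) the same distribution of values.

    Each value is replaced by the mean, across arrays, of the values that
    share its rank within their own array.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(sc[r] for sc in sorted_cols) / n_cols for r in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda r: col[r])
        for rank, r in enumerate(order):
            result[r][c] = rank_means[rank]
    return result


def anova_f(values, genotypes):
    """One-way ANOVA F statistic for expression values grouped by genotype.

    Returns (F, df_between, df_within). With two or three observed genotype
    classes, df_between is 1 or 2, matching the description in the text.
    """
    groups = {}
    for v, g in zip(values, genotypes):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {g: sum(vs) / len(vs) for g, vs in groups.items()}
    ss_between = sum(len(vs) * (means[g] - grand_mean) ** 2 for g, vs in groups.items())
    ss_within = sum((v - means[g]) ** 2 for g, vs in groups.items() for v in vs)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

The value shown in the track would then be the -log10 of the P-value associated with the F statistic and its degrees of freedom.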
encodeSangerChipH3H4 Sanger ChIP Sanger ChIP/Chip (histones H3,H4 antibodies in GM06990, K562 cells) ENCODE Chromatin Immunoprecipitation Description ENCODE region-wide location analysis of H3 and H4 histones was conducted employing ChIP-chip using chromatin extracted from GM06990 (lymphoblastoid) and K562 (myeloid leukemia-derived) cells. Histone methylation and acetylation serve as a stable genomic imprint that regulates gene expression and other epigenetic phenomena. These histones are found in transcriptionally active domains called euchromatin.
Track | Cell Line/Type | Antibody | Epitope | Data FTP access | Histone | Array ID
1: SI H3K4m1 GM06990 | GM06990 | ab8895 | H3K4me1 | H3K4me1_GM06990_1 | H3 monomethyl lysine 4 | ENCODE3.1.1
2: SI H3K4m2 GM06990 | GM06990 | ab7766 | H3K4me2 | H3K4me2_GM06990_1 | H3 dimethyl lysine 4 | ENCODE3.1.1
3: SI H3K4m3 GM06990 | GM06990 | ab8580 | H3K4me3 | H3K4me3_GM06990_2 | H3 trimethyl lysine 4 | ENCODE3.1.1
4: SI H3ac GM06990 | GM06990 | 06-599 | H3ac | H3ac_GM06990_1 | H3 acetylated lysines 9 and 14 | ENCODE3.1.1
5: SI H4ac GM06990 | GM06990 | 06-866 | H4ac | H4ac_GM06990_1 | H4 acetylated lysines 5, 8, 12, 16 | ENCODE3.1.1
6: SI H3K4me2 K562 | K562 | ab7766 | H3K4me2 | H3K4me2_K562_1 | H3 K4 dimethylated | ENCODE3.1.1
7: SI H3K4me3 K562 | K562 | ab8580 | H3K4me3 | H3K4me3_K562_1 | H3 trimethyl lysine 4 | ENCODE3.1.1
8: SI H3ac K562 | K562 | 06-599 | H3ac | H3ac_K562_1 | H3 acetylated | ENCODE3.1.1
9: SI H4ac K562 | K562 | 06-866 | H4ac | H4ac_K562_1 | H4 acetylated | ENCODE3.1.1
Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link.
Methods Chromatin from the cell line was cross-linked with 1% formaldehyde, precipitated with antibody binding to the histone, and sheared and hybridized to a DNA array. DNA was not amplified prior to hybridization. The raw and transformed data files reflect fold enrichment over background, averaged over six replicates. Verification There are six replicates: two technical replicates (immunoprecipitations) for each of the three biological replicates (cell cultures). Raw and transformed (averaged) data can be downloaded from the Wellcome Trust Sanger Institute FTP site as indicated in the table above. Credits The data for this track were generated by the ENCODE investigators at the Wellcome Trust Sanger Institute, Hinxton, UK. encodeSangerChipH4acK562 SI H4ac K562 Sanger Institute ChIP/Chip (H4ac ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3acK562 SI H3ac K562 Sanger Institute ChIP/Chip (H3ac ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me3K562 SI H3K4me3 K562 Sanger Institute ChIP/Chip (H3K4me3 ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me2K562 SI H3K4me2 K562 Sanger Institute ChIP/Chip (H3K4me2 ab, K562 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH4ac SI H4ac GM06990 Sanger Institute ChIP/Chip (H4ac ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3ac SI H3ac GM06990 Sanger Institute ChIP/Chip (H3ac ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me3 SI H3K4m3 GM06990 Sanger Institute ChIP/Chip (H3K4me3 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me2 SI H3K4m2 GM06990 Sanger Institute ChIP/Chip (H3K4me2 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation encodeSangerChipH3K4me1 SI H3K4m1 GM06990 Sanger Institute ChIP/Chip (H3K4me1 ab, GM06990 cells) ENCODE Chromatin Immunoprecipitation genomicSuperDups Segmental Dups Duplications of >1000 Bases of Non-RepeatMasked Sequence 
Variation and Repeats Description This track shows regions detected as putative genomic duplications within the golden path. The following display conventions are used to distinguish levels of similarity: Light to dark gray: 90 - 98% similarity Light to dark yellow: 98 - 99% similarity Light to dark orange: greater than 99% similarity Red: duplications of greater than 98% similarity that lack sufficient Segmental Duplication Database evidence (most likely missed overlaps) For a region to be included in the track, at least 1 Kb of the total sequence (containing at least 500 bp of non-RepeatMasked sequence) had to align and a sequence identity of at least 90% was required. Methods Segmental duplications play an important role in both genomic disease and gene evolution. This track displays an analysis of the global organization of these long-range segments of identity in genomic sequence. Large recent duplications (>= 1 kb and >= 90% identity) were detected by identifying high-copy repeats, removing these repeats from the genomic sequence ("fuguization") and searching all sequence for similarity. The repeats were then reinserted into the pairwise alignments, the ends of alignments trimmed, and global alignments were generated. For a full description of the "fuguization" detection method, see Bailey et al., 2001. This method has become known as WGAC (whole-genome assembly comparison); for example, see Bailey et al., 2002. Credits These data were provided by Ginger Cheng, Xinwei She, Archana Raja, Tin Louie and Evan Eichler at the University of Washington. References Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE. Recent segmental duplications in the human genome. Science. 2002 Aug 9;297(5583):1003-7. PMID: 12169732 Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001 Jun;11(6):1005-17. 
PMID: 11381028; PMC: PMC311093 chainSelf Self Chain Chained Self Alignments Repeats Description This track shows alignments of the human genome with itself, using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. The system can also tolerate gaps in both sets of sequence simultaneously. After filtering out the "trivial" alignments produced when identical locations of the genome map to one another (e.g., chrN mapping to chrN), the remaining alignments point out areas of duplication within the human genome. The pseudoautosomal regions of chrX and chrY are an exception: in this assembly, these regions have been copied from chrX into chrY, resulting in a large number of self chains aligning in these positions on both chromosomes. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the query assembly or an insertion in the target assembly. Double lines represent more complex gaps that involve substantial sequence in both the query and target assemblies. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one of the assemblies. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Display Conventions and Configuration By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome.
To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in the box next to: Filter by chromosome. Methods The genome was aligned to itself using blastz. Trivial alignments were filtered out, and the remaining alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single target chromosome and a single query chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. Chains scoring below a threshold were discarded; the remaining chains are displayed in this track. Credits Blastz was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains were generated by Robert Baertsch and Jim Kent. References Chiaromonte, F., Yap, V.B., Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002, 115-26 (2002). Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20), 11484-11489 (2003). Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D., and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res. 13(1), 103-7 (2003).
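The chaining step described in the Self Chain Methods above (gapless blocks joined into maximally scoring chains by dynamic programming) can be illustrated with a small sketch. This is not axtChain itself: it uses a plain quadratic-time search instead of a kd-tree, and the `Block` type, the linear `gap_cost` function, and its `per_base` value are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A gapless aligned block: target and query intervals plus a score."""
    t_start: int
    t_end: int
    q_start: int
    q_end: int
    score: float


def gap_cost(prev, nxt, per_base=0.01):
    # Hypothetical linear penalty on the gap in target plus the gap in query.
    return per_base * ((nxt.t_start - prev.t_end) + (nxt.q_start - prev.q_end))


def best_chain(blocks):
    """Return (score, chain) for the highest-scoring chain of blocks.

    A block may follow another only if it starts at or after the end of the
    previous block in both target and query. Simple O(n^2) dynamic program.
    """
    ordered = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    if not ordered:
        return 0.0, []
    best = [b.score for b in ordered]   # best chain score ending at each block
    back = [-1] * len(ordered)          # predecessor index, -1 for none
    for j, b in enumerate(ordered):
        for i in range(j):
            a = ordered[i]
            if a.t_end <= b.t_start and a.q_end <= b.q_start:
                cand = best[i] + b.score - gap_cost(a, b)
                if cand > best[j]:
                    best[j], back[j] = cand, i
    end = max(range(len(ordered)), key=best.__getitem__)
    chain, k = [], end
    while k != -1:
        chain.append(ordered[k])
        k = back[k]
    chain.reverse()
    return best[end], chain
```

In a full pipeline, chains whose score falls below a threshold would then be discarded, as the Methods text states; the threshold value is not given in this description.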
sgpGene SGP Genes SGP Gene Predictions Using Mouse/Human Homology Genes and Gene Predictions Description This track shows gene predictions from the SGP2 homology-based gene prediction program developed by Roderic Guigó's "Computational Biology of RNA Processing" group, which is part of the Centre de Regulació Genòmica (CRG) in Barcelona, Catalunya, Spain. To predict genes in a genomic query, SGP2 combines geneid predictions with tblastx comparisons of the genome of the target species against genomic sequences of other species (reference genomes) deemed to be at an appropriate evolutionary distance from the target. Credits Thanks to the "Computational Biology of RNA Processing" group for providing these data. simpleRepeat Simple Repeats Simple Tandem Repeats by TRF Variation and Repeats Description This track displays simple tandem repeats (possibly imperfect repeats) located by Tandem Repeats Finder (TRF) which is specialized for this purpose. These repeats can occur within coding regions of genes and may be quite polymorphic. Repeat expansions are sometimes associated with specific diseases. Methods For more information about the TRF program, see Benson (1999). Credits TRF was written by Gary Benson. References Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999 Jan 15;27(2):573-80. PMID: 9862982; PMC: PMC148217 encodeRecombHotspot SNP Recomb Hots Oxford Recombination Hotspots from ENCODE resequencing data ENCODE Variation Description This track shows the location of recombination hotspots detected from patterns of genetic variation. 
It is based on the HapMap ENCODE data, in the ten ENCODE regions that have been resequenced: ENr112 (chr2) ENr131 (chr2) ENr113 (chr4) ENm010 (chr7) ENm013 (chr7) ENm014 (chr7) ENr321 (chr8) ENr232 (chr9) ENr123 (chr12) ENr213 (chr18) Observations from sperm studies (Jeffreys et al., 2001) and patterns of genetic variation (McVean et al., 2004; Crawford et al., 2004) show that recombination rates in the human genome vary extensively over kilobase scales and that much recombination occurs in recombination hotspots. This provides an explanation for the apparent block-like structure of linkage disequilibrium (Daly et al., 2001; Gabriel et al., 2002). Recombination hotspot estimates provide a new route to understanding the molecular mechanisms underlying human recombination. A better understanding of the genomic landscape of human recombination hotspots would facilitate the efficient design and analysis of disease association studies and greatly improve inferences from polymorphism data about selection and human demographic history. Methods Recombination hotspots are identified using the likelihood-ratio test described in McVean et al. (2004) and Winckler et al. (2005), referred to as LDhot. For successive intervals of 200 kb, the maximum likelihood of a model with a constant recombination rate is compared to the maximum likelihood of a model in which the central 2 kb is a recombination hotspot (likelihoods are approximated by the composite likelihood method of Hudson 2001). The observed difference in log composite likelihood is compared against the null distribution, which is obtained by simulations. Simulations are matched for sample size, SNP density, background recombination rate and an approximation to the ascertainment scheme (a panel of 12 individuals with a Poisson number of chromosomes, mean 1, sampled from this panel, using a single hit ascertainment scheme for dbSNP and resequencing of 16 individuals for the ten HapMap ENCODE regions).
Evidence for a hotspot was assessed in each analysis panel separately (YRI, CEU and combined CHB+JPT), and p-values were combined such that a hotspot requires that two of the three populations show some evidence of a hotspot (p < 0.05) and at least one population shows stronger evidence for a hotspot (p < 0.01). Hotspot centers were estimated at those locations where distinct recombination rate estimate peaks occurred with at least a factor of two separation between peaks, within the low p-value intervals. Validation This approach has been validated in three ways: by extensive simulation studies and by comparisons with independent estimates of recombination rates, both over large scales from the genetic map and over fine scales from sperm analysis. Full details of validation can be found in McVean et al. (2004) and Winckler et al. (2005). Credits The data are based on HapMap release 16a. The recombination hotspots were ascertained by Simon Myers from the Mathematical Genetics Group at the University of Oxford. References Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat Genet. 36(7), 700-6 (2004). Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J. and Lander, E.S. High-resolution haplotype structure in the human genome. Nat Genet. 29(2), 229-32 (2001). Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. The structure of haplotype blocks in the human genome. Science 296(5576), 2225-9 (2002). Hudson, R. R. Two-locus sampling distributions and their application. Genetics 159(4):1805-1817 (2001). Jeffreys, A.J., Kauppi, L. and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 29(2), 217-22 (2001).
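As a rough illustration of comparing an observed statistic against a simulated null distribution, the sketch below computes an empirical p-value and then applies the cross-panel rule given in the Methods text that follows (p < 0.05 in at least two of the three panels and p < 0.01 in at least one). It is not the LDhot implementation: the function names and the add-one correction in the p-value are assumptions made for this example.

```python
def empirical_p_value(observed, null_values):
    """Share of simulated null statistics at least as large as the observed one.

    The +1 in numerator and denominator is an assumed convention that keeps
    the estimate away from exactly zero; the source does not specify one.
    """
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)


def is_hotspot(p_by_panel, weak=0.05, strong=0.01):
    """Apply the panel rule: >= 2 panels below `weak`, >= 1 below `strong`."""
    weak_hits = sum(1 for p in p_by_panel.values() if p < weak)
    strong_hits = sum(1 for p in p_by_panel.values() if p < strong)
    return weak_hits >= 2 and strong_hits >= 1


# Example with made-up p-values for the three analysis panels.
panels = {"YRI": 0.005, "CEU": 0.03, "CHB+JPT": 0.40}
print(is_hotspot(panels))
```

Here the statistic would be the difference in log composite likelihood between the hotspot model and the constant-rate model for one interval.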
McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R. and Donnelly, P. The fine-scale structure of recombination rate variation in the human genome. Science 304(5670), 581-4 (2004). Winckler, W., Myers, S.R., Richter, D.J., Onofrio, R.C., McDonald, G.J., Bontrop, R.E., McVean, G.A., Gabriel, S.B., Reich, D., Donnelly, P. et al. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308(5718), 107-11 (2005). snpRecombHotspot SNP Recomb Hots Recombination Hotspots from SNP Genotyping Variation and Repeats Description This track shows the location of recombination hotspots detected from patterns of genetic variation. It is based on the HapMap Phase I data, release 16a, and Perlegen data (Hinds et al., 2005). Observations from sperm studies (Jeffreys et al., 2001) and patterns of genetic variation (McVean et al., 2004; Crawford et al., 2004) show that recombination rates in the human genome vary extensively over kilobase scales and that much recombination occurs in recombination hotspots. This provides an explanation for the apparent block-like structure of linkage disequilibrium (Daly et al., 2001; Gabriel et al., 2002). Recombination hotspot estimates provide a new route to understanding the molecular mechanisms underlying human recombination. A better understanding of the genomic landscape of human recombination hotspots would facilitate the efficient design and analysis of disease association studies and greatly improve inferences from polymorphism data about selection and human demographic history. Methods Recombination hotspots are identified using the likelihood-ratio test described in McVean et al. (2004) and Winckler et al. (2005), referred to as LDhot. 
For successive intervals of 200 kb, the maximum likelihood of a model with a constant recombination rate is compared to the maximum likelihood of a model in which the central 2 kb is a recombination hotspot (likelihoods are approximated by the composite likelihood method of Hudson 2001). The observed difference in log composite likelihood is compared against the null distribution, which is obtained by simulations. Simulations are matched for sample size, SNP density, background recombination rate and an approximation to the ascertainment scheme (a panel of 12 individuals with a Poisson number of chromosomes, mean 1, sampled from this panel, using a single hit ascertainment scheme for dbSNP and resequencing of 16 individuals for the ten HapMap ENCODE regions). Evidence for a hotspot was assessed in each analysis panel separately (YRI, CEU and combined CHB+JPT), and p-values were combined such that a hotspot requires that two of the three populations show some evidence of a hotspot (p < 0.05) and at least one population shows stronger evidence for a hotspot (p < 0.01). Hotspot centers were estimated at those locations where distinct recombination rate estimate peaks occurred with at least a factor of two separation between peaks, within the low p-value intervals. Validation This approach has been validated in three ways: by extensive simulation studies and by comparisons with independent estimates of recombination rates, both over large scales from the genetic map and over fine scales from sperm analysis. Full details of validation can be found in McVean et al. (2004) and Winckler et al. (2005). Credits The HapMap data are based on HapMap release 16a; the Perlegen data are from Hinds et al. (2005). The recombination hotspots were ascertained by Simon Myers from the Mathematical Genetics Group at the University of Oxford. References Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M.
Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat Genet. 36(7), 700-6 (2004). Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J. and Lander, E.S. High-resolution haplotype structure in the human genome. Nat Genet. 29(2), 229-32 (2001). Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. The structure of haplotype blocks in the human genome. Science 296(5576), 2225-9 (2002). Hudson, R. R. Two-locus sampling distributions and their application. Genetics 159(4):1805-1817 (2001). Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A., Cox, D.R. Whole-Genome Patterns of Common DNA Variation in Three Human Populations. Science 307(5712), 1072-1079 (2005). Jeffreys, A.J., Kauppi, L. and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 29(2), 217-22 (2001). McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R. and Donnelly, P. The fine-scale structure of recombination rate variation in the human genome. Science 304(5670), 581-4 (2004). Winckler, W., Myers, S.R., Richter, D.J., Onofrio, R.C., McDonald, G.J., Bontrop, R.E., McVean, G.A., Gabriel, S.B., Reich, D., Donnelly, P. et al. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308(5718), 107-11 (2005). snpRecombHotspotPerlegen Perlegen Oxford Recombination Hotspots from Perlegen Data Variation and Repeats snpRecombHotspotHapmap HapMap Oxford Recombination Hotspots from HapMap Phase I Release 16c.1 Variation and Repeats snpRecombRate SNP Recomb Rates Recombination Rates from SNP Genotyping Variation and Repeats Description This track shows recombination rates measured in centiMorgans per Megabase. It is based on the HapMap Phase I data, release 16a, and Perlegen data (Hinds et al., 2005).
Observations from sperm studies (Jeffreys et al., 2001) and patterns of genetic variation (McVean et al., 2004; Crawford et al., 2004) show that recombination rates in the human genome vary extensively over kilobase scales and that much recombination occurs in recombination hotspots. This provides an explanation for the apparent block-like structure of linkage disequilibrium (Daly et al., 2001; Gabriel et al., 2002). Fine-scale recombination rate estimates provide a new route to understanding the molecular mechanisms underlying human recombination. A better understanding of the genomic landscape of human recombination rate variation would facilitate the efficient design and analysis of disease association studies and greatly improve inferences from polymorphism data about selection and human demographic history. Display Conventions and Configuration This annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page. For more information, click the Graph configuration help link. Methods Fine-scale recombination rates are estimated using the reversible-jump Markov chain Monte Carlo (MCMC) method (McVean et al., 2004). This approach explores the posterior distribution of fine-scale recombination rate profiles, where the state-space considered is the distribution of piece-wise constant recombination maps. The Markov chain explores the distribution of both the number and location of change-points, in addition to the rates for each segment. A prior is set on the number of change-points that increases the smoothing effect of trans-dimensional MCMC, which is necessary because of the composite-likelihood scheme employed. This method is implemented in the package LDhat, which includes full details of installation and implementation. A block-penalty of five was used (calibrated by simulation and comparison to data from sperm-typing studies). 
Each region was analyzed as a single run with 10,000,000 iterations, sampling every 5000th iteration and discarding the first third of all samples as burn-in. The mean posterior rate for each SNP interval is the value reported. Because of the non-independence of the composite likelihood scheme, the quantiles of the sampling distribution do not reflect true uncertainty and are therefore not given. Estimates were generated separately from each of the four HapMap populations, and then combined to give a single figure. Differences between populations are not significant. Validation This approach has been validated in three ways: by extensive simulation studies and by comparisons with independent estimates of recombination rates, both over large scales from the genetic map and over fine scales from sperm analysis. Full details of validation can be found in McVean et al. (2004) and Winckler et al. (2005). Credits The HapMap data are based on HapMap release 16a; the Perlegen data are from Hinds et al. (2005). The recombination rates were ascertained by Simon Myers from the Mathematical Genetics Group at the University of Oxford. References Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat Genet. 36(7), 700-6 (2004). Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J. and Lander, E.S. High-resolution haplotype structure in the human genome. Nat Genet. 29(2), 229-32 (2001). Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. The structure of haplotype blocks in the human genome. Science 296(5576), 2225-9 (2002). Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A., Cox, D.R. Whole-Genome Patterns of Common DNA Variation in Three Human Populations. Science 307(5712), 1072-1079 (2005). 
Jeffreys, A.J., Kauppi, L. and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 29(2), 217-22 (2001). McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R. and Donnelly, P. The fine-scale structure of recombination rate variation in the human genome. Science 304(5670), 581-4 (2004). Winckler, W., Myers, S.R., Richter, D.J., Onofrio, R.C., McDonald, G.J., Bontrop, R.E., McVean, G.A., Gabriel, S.B., Reich, D., Donnelly, P. et al. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308(5718), 107-11 (2005). snpRecombRatePerlegen Perlegen Oxford Recombination Rates from Perlegen Data Variation and Repeats snpRecombRateHapmap HapMap Phase I Oxford Recombination Rates from HapMap Phase I Release 16c.1 Variation and Repeats encodeRecombRate SNP Recomb Rates Oxford Recombination Rates from ENCODE resequencing data ENCODE Variation Description This track shows recombination rates measured in centiMorgans/Megabase in ten ENCODE regions that have been resequenced: ENr112 (chr2) ENr131 (chr2) ENr113 (chr4) ENm010 (chr7) ENm013 (chr7) ENm014 (chr7) ENr321 (chr8) ENr232 (chr9) ENr123 (chr12) ENr213 (chr18) Observations from sperm studies (Jeffreys et al., 2001) and patterns of genetic variation (McVean et al., 2004; Crawford et al., 2004) show that recombination rates in the human genome vary extensively over kilobase scales and that much recombination occurs in recombination hotspots. This provides an explanation for the apparent block-like structure of linkage disequilibrium (Daly et al., 2001; Gabriel et al., 2002). Fine-scale recombination rate estimates provide a new route to understanding the molecular mechanisms underlying human recombination.
A better understanding of the genomic landscape of human recombination rate variation would facilitate the efficient design and analysis of disease association studies and greatly improve inferences from polymorphism data about selection and human demographic history. Display Conventions and Configuration This annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page. For more information, click the Graph configuration help link. Methods Fine-scale recombination rates are estimated using the reversible-jump Markov chain Monte Carlo method (McVean et al., 2004). This approach explores the posterior distribution of fine-scale recombination rate profiles, where the state-space considered is the distribution of piece-wise constant recombination maps. The Markov chain explores the distribution of both the number and location of change-points, in addition to the rates for each segment. A prior is set on the number of change-points that increases the smoothing effect of trans-dimensional MCMC, which is necessary because of the composite-likelihood scheme employed. This method is implemented in the package LDhat, which includes full details of installation and implementation. For the ENCODE regions, a block-penalty of 5 was used (calibrated by simulation and comparison to data from sperm-typing studies). Each region was analyzed as a single run with 10,000,000 iterations, sampling every 5000th iteration and discarding the first third of all samples as burn-in. The mean posterior rate for each SNP interval is the value reported. Because of the non-independence of the composite likelihood scheme, the quantiles of the sampling distribution do not reflect true uncertainty and are therefore not given. Estimates were generated separately from each of the four ENCODE resequencing populations, and then combined to give a single figure. 
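The thinning and burn-in arithmetic described above can be checked with a short sketch. It is not part of LDhat; the function names are hypothetical, and rounding the "first third" down to a whole number of samples is an assumption, since the text does not say how the fraction is rounded.

```python
def sample_counts(n_iterations, thin, burn_in_divisor=3):
    """Return (sampled, discarded, retained) counts for a thinned chain."""
    sampled = n_iterations // thin
    discarded = sampled // burn_in_divisor   # assumed: round down
    return sampled, discarded, sampled - discarded


def posterior_mean(chain, thin, burn_in_divisor=3):
    """Mean of the retained samples of one rate parameter.

    `chain` holds the value at every iteration; every `thin`-th iteration is
    kept, then the first 1/burn_in_divisor of the kept samples is dropped.
    """
    sampled = chain[thin - 1::thin]
    retained = sampled[len(sampled) // burn_in_divisor:]
    return sum(retained) / len(retained)


# With the run length and sampling interval stated in the Methods:
print(sample_counts(10_000_000, 5000))
```

Under these assumptions a single run yields 2000 sampled states, of which 666 are discarded as burn-in and 1334 contribute to the reported mean for each SNP interval.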
Differences between populations are not significant. Validation This approach has been validated in three ways: by extensive simulation studies and by comparisons with independent estimates of recombination rates, both over large scales from the genetic map and over fine scales from sperm analysis. Full details of validation can be found in McVean et al. (2004) and Winckler et al. (2005). Credits The data are based on HapMap release 16. The recombination rates were ascertained by Gil McVean from the Mathematical Genetics Group at the University of Oxford. References Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat Genet. 36(7), 700-6 (2004). Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J. and Lander, E.S. High-resolution haplotype structure in the human genome. Nat Genet. 29(2), 229-32 (2001). Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. The structure of haplotype blocks in the human genome. Science 296(5576), 2225-9 (2002). Jeffreys, A.J., Kauppi, L. and Neumann, R. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet. 29(2), 217-22 (2001). McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R. and Donnelly, P. The fine-scale structure of recombination rate variation in the human genome. Science 304(5670), 581-4 (2004). Winckler, W., Myers, S.R., Richter, D.J., Onofrio, R.C., McDonald, G.J., Bontrop, R.E., McVean, G.A., Gabriel, S.B., Reich, D., Donnelly, P. et al. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308(5718), 107-11 (2005).
encodeStanfordMeth Stanf Meth Stanford Methylation Digest: Be2C, CRL1690, HCT116, HT1080, HepG2, JEG3, Snu182, U87 ENCODE Chromosome, Chromatin and DNA Structure Description This track displays experimentally determined regions of unmethylated CpGs in the ENCODE regions. These experiments were performed in eight cell lines, each of which is displayed as a separate subtrack:
Cell Line    Classification              Isolated From
BE(2)-C      neuroblastoma               brain (metastatic, from bone marrow)
CRL-1690™    hybridoma                   B lymphocyte
HCT 116      colorectal carcinoma        colon
HT-1080      fibrosarcoma                connective tissue
HepG2        hepatocellular carcinoma    liver
JEG-3        choriocarcinoma             placenta
SNU-182      hepatocellular carcinoma    liver
U-87 MG      glioblastoma-astrocytoma    brain
Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods High molecular weight genomic DNA was prepared from each cell line. The genomic DNA was digested with a cocktail of six methyl-sensitive restriction enzymes (AciI, HhaI, BstUI, HpaII, HgaI, and HpyCH4IV) and size selected to deplete the genome of unmethylated regions. Digested and undigested DNA (control) were amplified, labeled, and hybridized to oligo tiling arrays produced by NimbleGen. The data for each array were median subtracted (log 2 ratios) and normalized (divided by the standard deviation). The value given for each array probe is the transformed mean ratio of undigested:digested genomic DNA.
Higher scores in this track indicate regions that are less methylated: unmethylated DNA is cut and depleted from the digested sample, which increases the difference between the undigested and digested hybridization signals. Verification Three biological replicates and two technical replicates were done for each of the eight cell lines. The Myers lab is currently testing the specificity and sensitivity using real-time PCR. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). Please contact David Johnson for further information regarding the methods and the data for this track. encodeStanfordMethU87 Stan Meth U87 Stanford Methylation Digest (U87 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSnu182 Stan Meth Snu182 Stanford Methylation Digest (Snu182 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethJEG3 Stan Meth JEG3 Stanford Methylation Digest (JEG3 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethHepG2 Stan Meth HepG2 Stanford Methylation Digest (HepG2 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethHT1080 Stan Meth HT1080 Stanford Methylation Digest (HT1080 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethHCT116 Stan Meth HCT116 Stanford Methylation Digest (HCT116 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethCRL1690 Stan Meth CRL1690 Stanford Methylation Digest (CRL1690 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethBe2C Stan Meth Be2C Stanford Methylation Digest (Be2C cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothed Stanf Meth Score Stanford Methylation Digest Smoothed Score ENCODE Chromosome, Chromatin and DNA Structure Description This track displays smoothed (sliding-window mean) scores for experimentally determined regions of unmethylated CpGs in the ENCODE regions.
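The per-array transformation described above (median subtraction of the log2 ratios, then division by the standard deviation) can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and the use of the sample standard deviation is a guess, since the text does not say which estimator was used.

```python
import statistics


def normalize_log_ratios(log2_ratios):
    """Center log2(undigested/digested) ratios on the array median and
    scale them by the array's standard deviation (sample SD assumed)."""
    med = statistics.median(log2_ratios)
    centered = [x - med for x in log2_ratios]
    sd = statistics.stdev(centered)
    return [c / sd for c in centered]


# Toy array of five probe log2 ratios.
print(normalize_log_ratios([0.0, 1.0, 2.0, 3.0, 4.0]))
```

After this step every array has a median of zero and unit spread, which makes probe values comparable across arrays and replicates.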
These experiments were performed in eight cell lines, each of which is displayed as a separate subtrack: Cell LineClassificationIsolated From BE(2)-Cneuroblastomabrain (metastatic, from bone marrow) CRL-1690™hybridomaB lymphocyte HCT 116colorectal carcinomacolon HT-1080fibrosarcomaconnective tissue HepG2hepatocellular carcinomaliver JEG-3choriocarcinomaplacenta SNU-182hepatocellular carcinomaliver U-87 MGglioblastoma-astrocytomabrain Display Conventions and Configuration This annotation follows the display conventions for composite tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods High molecular weight genomic DNA was prepared from each cell line. The genomic DNA was digested with a cocktail of six methyl-sensitive restriction enzymes (AciI, HhaI, BstUI, HpaII, HgaI, and HpyCH4IV) and size selected to deplete the genome of unmethylated regions. Digested and undigested DNA (control) were amplified, labeled, and hybridized to oligo tiling arrays produced by NimbleGen. The data for each array were median subtracted (log 2 ratios) and normalized (divided by the standard deviation). The transformed mean ratios of undigested:digested genomic DNA for all probes were then smoothed by calculating a sliding-window mean. Windows of six neighboring probes (sliding two probes at a time) were used; within each window, the highest and lowest value were dropped, and the remaining four values were averaged. 
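The trimmed sliding-window mean just described, together with the score conversion stated in the Methods text that follows (score = 8^(average) * 10), can be sketched as below. The function names are hypothetical, and the handling of array ends is an assumption: only complete six-probe windows are evaluated here, which the text does not specify.

```python
def trimmed_window_means(values, window=6, step=2):
    """Slide a window of `window` probes in steps of `step`; in each window
    drop the single highest and lowest value and average the rest."""
    means = []
    for start in range(0, len(values) - window + 1, step):
        ordered = sorted(values[start:start + window])
        kept = ordered[1:-1]              # drop one minimum and one maximum
        means.append(sum(kept) / len(kept))
    return means


def display_score(average):
    """Contrast-enhancing score used for display: 8 ** average * 10."""
    return 8 ** average * 10


# Ten toy probe values give three complete windows.
probes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print([display_score(m) for m in trimmed_window_means(probes)])
```

Dropping the extremes before averaging makes each window less sensitive to a single outlying probe than a plain mean would be.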
In order to increase the contrast between high and low values for visual display, the average was converted to a score by the formula: score = 8^(average) * 10 These scores are for visualization purposes; for all analyses, the raw ratios, which are available in the Stanf Meth track, should be used. Higher scores in this track indicate regions that are more strongly methylated, due to the greater difference between the undigested and digested hybridization signals. Verification Three biological replicates and two technical replicates were done for each of the eight cell lines. The Myers lab is currently testing the specificity and sensitivity using real-time PCR. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). Please contact David Johnson for further information regarding the methods and the data for this track. encodeStanfordMethSmoothedU87 Stan Meth Sc U87 Stanford Methylation Digest Smoothed Score (U87 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedSnu182 Stan Meth Sc Snu182 Stanford Methylation Digest Smoothed Score (Snu182 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedJEG3 Stan Meth Sc JEG3 Stanford Methylation Digest Smoothed Score (JEG3 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedHepG2 Stan Meth Sc HepG2 Stanford Methylation Digest Smoothed Score (HepG2 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedHT1080 Stan Meth Sc HT1080 Stanford Methylation Digest Smoothed Score (HT1080 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedHCT116 Stan Meth Sc HCT116 Stanford Methylation Digest Smoothed Score (HCT116 cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordMethSmoothedCRL1690 Stan Meth Sc CRL1690 Stanford Methylation Digest Smoothed Score (CRL1690 cells) ENCODE Chromosome, Chromatin and DNA 
Structure encodeStanfordMethSmoothedBe2C Stan Meth Sc Be2C Stanford Methylation Digest Smoothed Score (Be2C cells) ENCODE Chromosome, Chromatin and DNA Structure encodeStanfordPromoters Stanf Promoter Stanford Promoter Activity ENCODE Transcript Levels Description This track displays activity levels of 643 putative promoter fragments in the ENCODE regions, based on high-throughput transient transfection luciferase reporter assays. The activity of each putative promoter is indicated by color, ranging from black (no activity) to red (strong activity). Each of the fragments was tested in a panel of 16 cell lines: Cell LineClassificationIsolated From AGSgastric adenocarcinomastomach BE(2)-Cneuroblastomabrain (metastatic, from bone marrow) T98G (CRL-1690)glioblastomabrain G-402renal leiomyoblastomakidney HCT 116colorectal carcinomacolon HMCBmelanomaskin HT-1080fibrosarcomaconnective tissue SK-N-SH (HTB-11)neuroblastomabrain (metastatic, from bone marrow) HeLaadenocarcinomacervix HepG2hepatocellular carcinomaliver JEG-3choriocarcinomaplacenta MG-63osteosarcomabone MRC-5fibroblastlung PANC-1epithelioid carcinomapancreas (duct) SNU-182hepatocellular carcinomaliver U-87 MGglioblastoma-astrocytomabrain Methods Promoters in the ENCODE region were predicted using a variation on methods previously described (Trinklein et al., 2003, Trinklein et al., 2004). Using BLAT alignments of human cDNAs in Genbank to the genome, those with at least one bp of exon overlap were merged, generating gene models. The transcription start sites were predicted by assigning the 5' end of each gene model as one transcription start site and alternative 5' ends that were at least 500 bp downstream and supported by full-length cDNAs as other start sites. Promoters were defined as the regions approximately 600 bp upstream and 100 bp downstream of each transcription start site. Primer3 was used to pick primers yielding approximately 500 bp amplicons containing the predicted transcription start site. 
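The promoter definition above (about 600 bp upstream and 100 bp downstream of each transcription start site) depends on strand, which a short sketch makes explicit. The function name and the coordinate convention (integer positions, interval clipped at zero) are assumptions for illustration and are not taken from the source.

```python
def promoter_region(tss, strand, upstream=600, downstream=100):
    """Return (start, end) of the window around a transcription start site.

    On the minus strand, 'upstream' lies at higher genomic coordinates,
    so the two offsets swap sides.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end


print(promoter_region(10_000, "+"))
print(promoter_region(10_000, "-"))
```

The roughly 500 bp amplicons chosen with Primer3 would then be placed so that they contain the predicted start site within this window.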
Each fragment of DNA represented in this track was cloned into a luciferase reporter vector (pGL3-Basic, Promega) using the BD Clontech Infusion Cloning System. The Dual Luciferase system (Promega) was used to co-transfect the experimental DNA along with a control plasmid expressing Renilla (to control for variation in transfection efficiency) in 96-well format into one of the sixteen cell types using FuGENE Transfection Reagent (Roche). Each transfection was done in duplicate. Data are reported as normalized and log2 transformed averages of the Luciferase/Renilla ratio. This normalization was based on the activity of 102 random genomic fragments (negative controls) derived from exons and intergenic regions. Such a normalization allows for a meaningful comparison between cell types. The average log transformed Luciferase/Renilla ratio was scaled linearly to create a score where the maximum value is 1000 and the minimum value is 0. This score is arbitrary and for visualization purposes only; the raw ratio values should be used for all analyses. Verification Data were verified by repeating the preparation and measurement of 48 random fragments. No significant variation between the two preparations was detected. A spreadsheet containing the negative control data can be downloaded here. Credits This work was done in collaboration at the Myers Lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). The following people contributed: Sara J. Cooper, Nathan D. Trinklein, Elizabeth D. Anton, Loan Nguyen, and Richard M. Myers. References Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM. Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 2006 Jan;16(1):1-10. Epub 2005 Dec 12. Trinklein ND, Aldred SJ, Saldanha AJ, Myers RM. Identification and functional analysis of human transcriptional promoters. Genome Res. 2003 Feb;13(2):308-12.
Trinklein ND, Aldred SF, Hartman SJ, Schroeder DI, Otillar RP, Myers RM. An abundance of bidirectional promoters in the human genome. Genome Res. 2004 Jan;14(1):62-6. encodeStanfordPromotersAverage Stan Pro Average Stanford Promoter Activity (Average) ENCODE Transcript Levels encodeStanfordPromotersU87 Stan Pro U87 Stanford Promoter Activity (U87 cells) ENCODE Transcript Levels encodeStanfordPromotersSnu182 Stan Pro Snu182 Stanford Promoter Activity (Snu182 cells) ENCODE Transcript Levels encodeStanfordPromotersPanc1 Stan Pro Panc1 Stanford Promoter Activity (Panc1 cells) ENCODE Transcript Levels encodeStanfordPromotersMRC5 Stan Pro MRC5 Stanford Promoter Activity (MRC5 cells) ENCODE Transcript Levels encodeStanfordPromotersMG63 Stan Pro MG63 Stanford Promoter Activity (MG63 cells) ENCODE Transcript Levels encodeStanfordPromotersJEG3 Stan Pro JEG3 Stanford Promoter Activity (JEG3 cells) ENCODE Transcript Levels encodeStanfordPromotersHepG2 Stan Pro HepG2 Stanford Promoter Activity (HepG2 cells) ENCODE Transcript Levels encodeStanfordPromotersHela Stan Pro Hela Stanford Promoter Activity (HeLa cells) ENCODE Transcript Levels encodeStanfordPromotersHTB11 Stan Pro HTB11 Stanford Promoter Activity (HTB11 cells) ENCODE Transcript Levels encodeStanfordPromotersHT1080 Stan Pro HT1080 Stanford Promoter Activity (HT1080 cells) ENCODE Transcript Levels encodeStanfordPromotersHMCB Stan Pro HMCB Stanford Promoter Activity (HMCB cells) ENCODE Transcript Levels encodeStanfordPromotersHCT116 Stan Pro HCT116 Stanford Promoter Activity (HCT116 cells) ENCODE Transcript Levels encodeStanfordPromotersG402 Stan Pro G402 Stanford Promoter Activity (G402 cells) ENCODE Transcript Levels encodeStanfordPromotersCRL1690 Stan Pro CRL1690 Stanford Promoter Activity (CRL1690 cells) ENCODE Transcript Levels encodeStanfordPromotersBe2C Stan Pro Be2c Stanford Promoter Activity (Be2c cells) ENCODE Transcript Levels encodeStanfordPromotersAGS Stan Pro AGS Stanford Promoter Activity (AGS cells) 
ENCODE Transcript Levels encodeStanfordRtPcr Stanf RTPCR Stanford Endogenous Transcript Levels in HCT116 Cells ENCODE Transcript Levels Description This track displays absolute transcript copy numbers for 136 genes and 12 negative control intergenic regions, determined by RT-PCR in HCT116 cells. Display Conventions and Configuration The genomic regions are indicated by solid blocks. The shade of an item gives a rough indication of its count, ranging from light gray for zero to black for a count of 7000 or greater. To display only those items that exceed a specific unnormalized score, enter a minimum score between 0 and 1000 in the text box at the top of the track description page. Methods Total RNA was prepared in quadruplicate from HCT116 cells grown in culture. cDNA was prepared as described in Trinklein et al. (2004). Duplicate primer pairs were designed for each gene, and the absolute number of cDNA molecules containing each amplicon was determined by real-time PCR. The submitted data are the calculated number of molecules of each transcript containing the defined amplicon. Verification Four biological replicates were performed, and two primer pairs were used to measure the abundance of each transcript. Credits These data were generated in the Richard M. Myers lab at Stanford University (now at HudsonAlpha Institute for Biotechnology). References Trinklein, N.D., Chen, W.C., Kingston, R.E. and Myers, R.M. Transcriptional regulation and binding of HSF1 and HSF2 to 32 human heat shock genes during thermal stress and differentiation. Cell Stress Chaperones 9(1), 21-28 (2004). cnp Structural Var Structural Variation Variation and Repeats Description This annotation shows regions detected as putative copy number polymorphisms (CNP) and sites of detected intermediate-sized structural variation (ISV).
The CNPs and ISVs were determined by various methods, displayed in individual subtracks within the annotation:
BAC microarray analysis (Sharp): 160 putative CNP regions detected by BAC microarray analysis in a population of 47 individuals comprising 8 Chinese, 4 Japanese, 10 Czech, 2 Druze, 7 Biaka, 9 Mbuti, and 7 Amerindians.
BAC microarray analysis (Iafrate): 255 putative CNP regions detected by BAC microarray analysis in a population of 55 individuals, 16 of whom had previously characterized chromosomal abnormalities. The group consisted of 10 Caucasians, 4 Amerindians, 2 Chinese, 2 Indo-Pakistani, 2 Sub-Saharan African, and 35 of unknown ethnic origin.
Representational oligonucleotide microarray analysis (ROMA) (Sebat): 81 putative CNP regions detected by ROMA in a population of 20 normal individuals comprising 1 Biaka, 1 Mbuti, 1 Druze, 1 Melanesian, 4 French, 1 Venezuelan, 1 Cambodian, 1 Mayan and 9 of unknown ethnicity.
Fosmid mapping (Tuzun): 285 ISV sites detected by mapping paired-end sequences from a human fosmid DNA library.
Deletions from genotype analysis (McCarroll): 541 deletions detected by analysis of SNP genotypes, using the HapMap Phase I data, release 16a.
Deletions from genotype analysis (Conrad): 935 deletions detected by analysis of SNP genotypes, using the HapMap Phase I data, release 16c.1, CEU and YRI samples.
Deletions from haploid hybridization analysis (Hinds): 100 deletions from haploid hybridization analysis in 24 unrelated individuals from the Polymorphism Discovery Resource, selected for SNP LD study.
Display Conventions and Configuration CNP and ISV regions are indicated by solid blocks that are color-coded to indicate the type of variation detected:
Green: gain (duplications)
Red: loss (deletions)
Blue: gain and loss (both deletion and duplication)
Black: inversion
Sharp subtrack On the details pages for elements in this subtrack, the table shows value/threshold data for each individual in the population.
"Value" is defined as the log2 ratio of fluorescence intensity of test versus reference DNA. "Threshold" is defined as 2 standard deviations from the mean log2 ratio of all autosomal clones per hybridization. The "Disease Percent" value reflects the percent of the BAC that lies within a "rearrangement hotspot", as defined in Sharp et al. (2005) (the rationale used to choose BACs for the array construction). A rearrangement hotspot is defined by the presence of flanking intrachromosomal duplications >10 kb in length with >95% similarity and separated by 50 kb - 10 Mb of intervening sequence. Tuzun subtrack Items are labeled using the following naming convention: First letter: rearrangement type (D=deletion, I=insertion, V=inversion). Second letter: association with repeat or duplication (R=human-specific repeat, D=duplication, N=neither (unique)). Third letter: second haplotype support (N=variant site lacking support from the human genome reference, S=variant site with support from the human genome reference). Conrad subtrack The method used to identify these deletions approximates the breakpoints of each event; therefore, a set of minimal and maximal endpoints is associated with each deletion. Thick lines delineate the minimally deleted region; thin lines delineate the maximally deleted region. Methods Sharp BAC microarray analysis All hybridizations were performed in duplicate incorporating a dye-reversal using a custom array consisting of 2,194 end-sequence or FISH-confirmed BACs, targeted to regions of the genome flanked by segmental duplications. The false positive rate was estimated at ~3 clones per 4,000 tested. Iafrate BAC microarray analysis All hybridizations were performed in duplicate incorporating a dye-reversal using proprietary 1 Mb GenomeChip V1.2 Human BAC Arrays consisting of 2,632 BAC clones (Spectral Genomics, Houston, TX). The false positive rate was estimated at ~1 clone per 5,264 tested. 
Further information is available from the Database of Genome Variants website. Sebat ROMA Following digestion with BglII or HindIII, genomic DNA was hybridized to a custom array consisting of 85,000 oligonucleotide probes. The probes were selected to be free of common repeats and have unique homology within the human genome. The average resolution of the array was ~35 kb; however, only intervals in which three consecutive probes showed concordant signals were scored as CNPs. All hybridizations were performed in duplicate incorporating a dye-reversal, with the false positive rate estimated to be ~6%. Note that CNP intervals, as detailed by Sebat et al. (2004), were converted from the April 2003 human genome assembly (NCBI Build 33) to the July 2003 assembly (NCBI Build 34) using the UCSC liftOver tool. Tuzun fosmid mapping Paired-end sequences from a human fosmid DNA library were mapped to the assembly. The average resolution of this technique was ~8 kb, and included 56 sites of inversion not detectable by the array-based approaches. However, because of the physical constraints of fosmid insert size, this technique was unable to detect insertions greater than 40 kb in size. McCarroll genotype analysis A segregating deletion can leave "footprints" in SNP genotype data, including apparent deviations from Mendelian inheritance, apparent deviations from Hardy-Weinberg equilibrium and null genotypes. Using these clues to discover true variants is challenging, however, because the vast majority of such observations represent technical artifacts and genotyping errors. To determine whether a subset of "failed" SNP genotyping assays in the HapMap data might reflect structural variation, the authors examined whether such failures were physically clustered in a manner that is specific to individuals. 
Consistent with this hypothesis, the rate of Mendelian-inconsistent genotypes was elevated near other Mendelian-inconsistent genotypes in the same individual but was unrelated to Mendelian inconsistencies in other individuals. The authors systematically looked for regions of the genome in which the same failure profile appeared repeatedly at nearby markers in a manner that was statistically unexpected based on chance. A set of statistical thresholds was tailored to each mode of failure, genotyping center and genotyping platform used in the project. The same procedure could readily apply to dense SNP data from any platform or study. Conrad genotype analysis SNPs in regions that are hemizygous for a deletion are generally miscalled as homozygous for the allele that is present. Hence, when a deletion is transmitted from parent to child, the genotypes at SNPs within the deletion region will often appear to violate the rules of Mendelian transmission. The authors developed a simple algorithm for scanning trio data for unusual runs of consecutive SNPs that, in a single family, have genotype configurations consistent with the presence of a deletion. Hinds haploid hybridization analysis Approximately 600 Mb of genomic DNA from 24 unrelated individuals was obtained from the Polymorphism Discovery Resource. Haploid hybridization was used to identify genomic intervals showing a reduced hybridization signal in comparison to the reference assembly. PCR amplification was performed on 215 candidate deletions. 100 deletions were selected that were unambiguously confirmed. Validation McCarroll genotype analysis Four methods of validation were used: fluorescent in situ hybridization (FISH), two-color fluorescence intensity measurements, PCR amplification and quantitative PCR. The authors performed fluorescent in situ hybridization (FISH) for five candidate deletions large enough to span available FISH probes. 
In all five cases, FISH assays confirmed the deletions in the predicted individuals. The authors examined two-color allele-specific fluorescence data from SNP genotyping assays from a data subset available at the Broad Institute, looking for a reduction in fluorescence intensity in individuals predicted to carry a deletion. At most SNPs in the genome, fluorescence intensity measurements cluster into two or three discrete groups corresponding to homozygous and heterozygous genotypes. At 15 of 17 candidate deletion loci, fluorescence intensity data for one or more SNPs clustered into additional groups that corresponded to the predicted deletion genotypes. The authors used PCR amplification to query 60 loci for which the pattern of genotypes suggested multiple individuals with homozygous deletions. Variants were considered confirmed if the pattern of amplification success and failure matched prediction across a set of 12-24 individuals. The authors confirmed 51 of 60 candidate variants by this criterion. The authors performed quantitative PCR in all 269 HapMap DNA samples for 11 candidate deletions that overlapped the coding exons of genes and that were discovered in many individuals. At 10 of 11 loci, the authors observed three discrete clusters, identifying individuals with zero, one and two gene copies. All 60 trios displayed Mendelian inheritance for the ten deletions, as well as Hardy-Weinberg equilibrium in all four populations surveyed, and transmission rates close to 50%. This suggests that the deletions behave as stable, heritable genetic polymorphisms. Conrad genotype analysis The authors first tested 12 predicted deletions using quantitative PCR. For all 12 deletions they observed DNA concentrations consistent with transmission of a deletion from parent to child.
To provide more extensive validation by comparative genome hybridization (CGH), the authors designed a custom oligonucleotide microarray comprising 380,000 probes that tile across all 134 candidate deletions identified in nine HapMap offspring (8 YRI and 1 CEU). The results of this CGH analysis indicate that the majority (about 85%) of candidate deletions detected by the method are real. References Conrad, D., Andrews, T.D., Carter, N.P., Hurles, M.E., Pritchard, J.K. A high-resolution survey of deletion polymorphism in the human genome. Nature Genet 38(1), 75-81 (2006). Hinds, D., Kloek, A.P., Jen, M., Chen, X., Frazer, K.A. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nature Genet 38(1), 82-85 (2006). Iafrate, J.A., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W. and Lee, C. Detection of large-scale variation in the human genome. Nature Genet 36(9), 949-51 (2004). McCarroll, S.A., Hadnott, T.N., Perry, G.H., Sabeti, P.C., Zody, M.C., Barrett, J.C., Dallaire, S., Gabriel, S., Lee, C., Daly, M.J., Altshuler, D.M. Common deletion polymorphisms in the human genome. Nature Genet 38(1), 86-92 (2006). Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M. et al. Large-scale copy number polymorphism in the human genome. Science 305(5683), 525-8 (2004). Sharp, A.J., Locke, D.P., McGrath, S.D., Cheng, Z., Bailey, J.A., Samonte, R.V., Pertz, L.M., Clark, R.A., Schwartz, S., Segraves, R. et al. Segmental duplications and copy number variation in the human genome. Am J Hum Genet 77(1), 78-88 (2005). Tuzun, E., Sharp, A.J., Bailey, J.A., Kaul, R., Morrison, V.A., Pertz, L.M., Haugen, E., Hayden, H., Albertson, D., Pinkel, D. et al. Fine-scale structural variation of the human genome. Nature Genet 37(7), 727-32 (2005).
delHinds Hinds Dels Deletions from Haploid Hybridization Analysis (Hinds) Variation and Repeats delConrad Conrad Dels Deletions from Genotype Analysis (Conrad) Variation and Repeats delMccarroll McCarroll Dels Deletions from Genotype Analysis (McCarroll) Variation and Repeats cnpFosmid Tuzun Fosmids Structural Variation identified by Fosmids (Tuzun) Variation and Repeats cnpSebat Sebat CNPs Copy Number Polymorphisms from ROMA (Sebat) Variation and Repeats Description This track shows 81 regions detected as putative copy number polymorphisms by representational oligonucleotide microarray analysis (ROMA) in a population of 20 normal individuals. Methods Following digestion with BglII or HindIII, genomic DNA was hybridized to a custom array consisting of 85,000 oligonucleotide probes. The probes were selected to be free of common repeats and have unique homology within the human genome. The average resolution of the array is ~35 kb; however, only intervals in which three consecutive probes showed concordant signals were scored as CNPs. All hybridizations were performed in duplicate incorporating a dye-reversal, with the false positive rate estimated to be ~6%. Note that CNP intervals, as detailed by Sebat et al. (2004), were converted from the April 2003 assembly (Build 33) to the July 2003 assembly (Build 34) using liftOver. References Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M (2004) Large-Scale Copy Number Polymorphism in the Human Genome. Science 305:525-528 cnpIafrate Iafrate CNPs Copy Number Polymorphisms from BAC Microarray Analysis (Iafrate) Variation and Repeats Description This track shows 255 regions detected as putative copy number polymorphisms by BAC microarray analysis in a population of 55 individuals, 16 of whom had previously characterized chromosomal abnormalities.
Methods All hybridizations were performed in duplicate incorporating a dye-reversal using proprietary 1 Mb GenomeChip V1.2 Human BAC Arrays consisting of 2,632 BAC clones (Spectral Genomics, Houston, TX). The false positive rate was estimated at ~1 clone per 5,264 tested. Further information is available at http://projects.tcag.ca/variation. References Iafrate JA, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C (2004) Detection of large-scale variation in the human genome. Nature Genet 36:949-951 cnpSharp Sharp CNPs Copy Number Polymorphisms from BAC Microarray Analysis (Sharp) Variation and Repeats Description This track shows 160 regions detected as putative copy number polymorphisms by BAC microarray analysis in a population of 47 individuals, comprising 8 Chinese, 4 Japanese, 10 Czech, 2 Druze, 7 Biaka, 9 Mbuti, and 7 Amerindians. Methods All hybridizations were performed in duplicate incorporating a dye-reversal using a custom array consisting of 2,194 end-sequence or FISH-confirmed BACs, targeted to regions of the genome flanked by segmental duplications. The false positive rate was estimated at ~3 clones per 4,000 tested. References Sharp, A.J., Locke, D.P., McGrath, S.D., Cheng, Z., Bailey, J.A., Samonte, R.V., Pertz, L.M., Clark, R.A., Schwartz, S., Segraves, R., Oseroff, V.V., Albertson, D.G., Pinkel, D. and Eichler, E.E. Segmental duplications and copy number variation in the human genome. Am J Hum Genet 77(1), 78-88 (2005). tajD Tajima's D Tajima's D Variation and Repeats Description This track shows Tajima's D (Tajima, 1989), a measure of nucleotide diversity, estimated from the Perlegen data set (Hinds et al., 2005). Tajima's D is a statistic used to compare an observed nucleotide diversity against the expected diversity under the assumptions that all polymorphisms are selectively neutral and that population size is constant.
Methods Tajima's D was estimated in 100 kbp sliding windows across the autosomal genome, reporting the Tajima's D measure at the central 10 kbp of the window and stepping by 10 kbp. Thus, the Tajima's D for the window chr1:100,001-200,000 is reported at coordinates chr1:145,001-155,000, the Tajima's D for the window chr1:110,001-210,000 is reported at coordinates chr1:155,001-165,000, and so forth. The theoretical distribution of Tajima's D (95% c.i. between -2 and +2) assumes that polymorphism ascertainment is independent of allele frequency. High values of Tajima's D suggest an excess of common variation in a region, which can be consistent with balancing selection or population contraction. Negative values of Tajima's D, on the other hand, indicate an excess of rare variation, consistent with population growth or positive selection. In theory, population admixture can lead to either high or low Tajima's D values. Demographic parameters would be expected to affect the genome more evenly than selective pressures, so previous analyses have suggested that using the empirical distribution of Tajima's D from a collection of regions across the genome provides advantages in assessing whether selection or demography might explain an observed deviation from expectation. Because of the ascertainment bias toward common polymorphism in the Perlegen data set, positive Tajima's D values are difficult to interpret, and modeling ascertainment is difficult. However, given that the ascertainment bias raises the mean of the distribution, extreme negative values in extended regions can be useful in qualitatively identifying interesting regions for full resequencing and more rigorous theoretical analysis of nucleotide diversity. For further discussion, see Carlson et al. (2005).
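The window-to-reporting-interval mapping described above can be checked with a short Python sketch. The function names sliding_windows and report_interval are hypothetical, and coordinates are assumed to be 1-based and inclusive, matching the chr1 examples in the text; the sketch covers only the coordinate arithmetic, not the Tajima's D calculation itself.

```python
from typing import Iterator, Tuple


def sliding_windows(first: int, last: int,
                    window_size: int = 100_000,
                    step: int = 10_000) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for each full window that fits within [first, last]."""
    start = first
    while start + window_size - 1 <= last:
        yield start, start + window_size - 1
        start += step


def report_interval(window_start: int,
                    window_size: int = 100_000,
                    report_size: int = 10_000) -> Tuple[int, int]:
    """Return the central report_size bases of the window starting at window_start."""
    offset = (window_size - report_size) // 2
    report_start = window_start + offset
    return report_start, report_start + report_size - 1


if __name__ == "__main__":
    for w_start, w_end in sliding_windows(100_001, 220_000):
        r_start, r_end = report_interval(w_start)
        print(f"window {w_start}-{w_end} reported at {r_start}-{r_end}")
```

Because the step equals the reporting width, consecutive reporting intervals tile the genome without gaps or overlap even though the underlying windows overlap by 90 kbp.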
In full display mode, this track shows the nucleotide diversity across three human populations: 23 individuals of African American Descent (AD), 24 individuals of European Descent (ED) and 24 individuals of Chinese Descent (XD), as well as the polymorphic sites within each population used to estimate nucleotide diversity. Only SNPs observed to be polymorphic within each subpopulation were used in the Tajima's D calculation. Nucleotide diversity is shown in dense display mode using a grayscale density gradient, with light colors indicating low diversity. Credits This track was created at the University of Washington using gfetch from the Nickerson Laboratory and the R statistical software package. References Tajima, F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585-595 (1989). Carlson, C.S., Thomas, D.J., Eberle, M., Livingston, R., Rieder, M., Nickerson, D.A. Genomic regions exhibiting positive selection identified from dense genotype data. Genome Res 15, 1553-65 (2005). tajdXd Tajima's D XD Tajima's D from Chinese Descent Variation and Repeats tajdEd Tajima's D ED Tajima's D from European Descent Variation and Repeats tajdAd Tajima's D AD Tajima's D from African Descent Variation and Repeats tajdSnp Tajima's D SNPs Tajima's D SNPs Variation and Repeats Description This track shows the SNPs that were used in the calculation of Tajima's D (Tajima, 1989), a measure of nucleotide diversity, estimated from the Perlegen data set (Hinds et al., 2005). Tajima's D is a statistic used to compare an observed nucleotide diversity against the expected diversity under the assumptions that all polymorphisms are selectively neutral and that population size is constant. Methods See the Tajima's D track or Carlson et al. for more details on the use of this track. Credits This track was created at the University of Washington using gfetch from the Nickerson Laboratory and the R statistical software package. References Tajima, F.
Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585-595 (1989). Carlson, C.S., Thomas, D.J., Eberle, M., Livingston, R., Rieder, M., Nickerson, D.A. Genomic regions exhibiting positive selection identified from dense genotype data. Genome Res 15, 1553-65 (2005). tajdSnpXd SNPs XD SNPs from Chinese Descent Variation and Repeats tajdSnpEd SNPs ED SNPs from European Descent Variation and Repeats tajdSnpAd SNPs AD SNPs from African Descent Variation and Repeats encodeTbaAlign TBA Alignment NHGRI/PSU TBA Alignments ENCODE Comparative Genomics Description This track displays human-centric multiple sequence alignments in the ENCODE regions for the 23 vertebrates in the May 2005 ENCODE MSA freeze, based on comparative sequence data generated for the ENCODE project. The alignments in this track were generated using the Threaded Blockset Aligner (TBA). A complete list of the vertebrates included in the May 2005 freeze may be found at the top of the description page for this track. The Genome Browser companion tracks, TBA Cons and TBA Elements, display conservation scoring and conserved elements for these alignments based on various conservation methods. Display Conventions and Configuration In full display mode, this track shows pairwise alignments of each species aligned to the human genome. The alignments are shown in dense display mode using a gray-scale density gradient. The checkboxes in the track configuration section allow the exclusion of species from the pairwise display. When zoomed in to the base-display level, the track shows the base composition of each alignment. The numbers and symbols on the "human gap" line indicate the lengths of gaps in the human sequence at those alignment positions relative to the longest non-human sequence. If there is sufficient space in the display, the size of the gap is shown; if not, and if the gap size is a multiple of 3, a "*" is displayed, otherwise "+" is shown.
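The rule for the "human gap" line can be written as a small decision function. The sketch below is an illustration only: the name gap_label is hypothetical, and "sufficient space" is modeled as a character budget, which is an assumption; the browser's actual layout test is not described in this text.

```python
def gap_label(gap_size: int, chars_available: int) -> str:
    """Return the text shown on the gap line for a gap of gap_size bases.

    chars_available is an assumed stand-in for "sufficient space in the
    display": the number of characters that fit at this position.
    """
    if gap_size <= 0:
        return ""  # no gap at this position, nothing to draw
    text = str(gap_size)
    if len(text) <= chars_available:
        return text  # enough room to print the gap length itself
    # Not enough room: mark whether the gap length is a multiple of 3.
    return "*" if gap_size % 3 == 0 else "+"


if __name__ == "__main__":
    for size, room in [(7, 1), (12, 2), (12, 1), (10, 1)]:
        print(size, room, repr(gap_label(size, room)))
```

Flagging multiples of 3 is useful in coding regions, where a gap whose length is a multiple of 3 keeps the reading frame intact.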
To view detailed information about the alignments at a specific position, zoom in the display to 30,000 or fewer bases, then click on the alignment. Methods The TBA was used to align sequences in the May 2005 ENCODE sequence data freeze. Multiple alignments were seeded from a series of combinatorial pairwise blastz alignments (not referenced to any one species). The specific combinations were determined by the species guide tree. Additionally, a blastz.specs file was used to fine-tune the blastz parameters, based on the evolutionary distance of the species being compared. The resulting multiple alignments were projected onto the human reference sequence. Credits The TBA multiple alignments were created by Elliott Margulies of the Green Lab at NHGRI. The programs Blastz and TBA, which were used to generate the alignments, were provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. The phylogenetic tree is based on Murphy et al. (2001) and general consensus in the vertebrate phylogeny community. References Blanchette, M., Kent, W.J., Reimer, C., Elnitski, L., Smit, A., Roskin, K., Baertsch, R., Rosenbloom, K.R., Clawson, H. et al. Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner. Genome Res 14, 708-15 (2004). Chiaromonte, F., Yap, V.B., and Miller, W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 2002, 115-26 (2002). Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R., Haussler, D. and Miller, W. Human-Mouse Alignments with BLASTZ. Genome Res 13(1):103-7 (2003). Murphy, W.J., et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294(5550), 2348-51 (2001). 
encodeTbaCons TBA Cons NHGRI/PSU/UCSC TBA Conservation ENCODE Comparative Genomics Description This track displays different measurements of conservation based on the Threaded Blockset Aligner (TBA) multiple sequence alignments of ENCODE regions shown in the TBA Alignment track. Three programs, binCons (binomial-based conservation method), phastCons (phylogenetic hidden-Markov model method), and GERP (Genomic Evolutionary Rate Profiling), generated the conservation scoring used to create this track. A related track, TBA Elements, shows multi-species conserved sequences (MCSs) based on the conservation measurements displayed in this track. For details on the conservation scores generated by each program, refer to the individual Methods subsections. Display Conventions and Configuration The subtracks within this composite annotation track, which show data from the binCons, phastCons and GERP programs, may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of the subtracks. A subtrack may be hidden from view by checking the box to the left of the track name in the list. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different conservation scoring methods. See the Methods section for display information specific to each subtrack. Methods The methods used to create the TBA alignments in the ENCODE regions are described in the TBA Alignment track description. BinCons The binCons score is based on the cumulative binomial probability of detecting the observed number of identical bases (or greater) in sliding 25 bp windows (moving one bp at a time) between the reference sequence and each other species, given the neutral rate at four-fold degenerate sites.
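The cumulative binomial probability at the core of this description can be sketched as follows. This is a simplified illustration, not the binCons program: the function names are hypothetical, and treating the neutral rate as a per-base probability p_identity that two aligned bases match at neutrally evolving sites is an assumption made for the example.

```python
from math import comb
from typing import List


def window_identities(ref: str, other: str, width: int = 25) -> List[int]:
    """Count identical aligned bases in each sliding window (step of 1 bp)."""
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    matches = [int(a == b and a != "-") for a, b in zip(ref, other)]
    return [sum(matches[i:i + width]) for i in range(len(matches) - width + 1)]


def binomial_tail(k: int, n: int = 25, p_identity: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p_identity).

    This is the chance of seeing k or more identical bases in an n-bp window
    if the window were evolving neutrally (under the assumption above).
    """
    return sum(comb(n, i) * p_identity**i * (1 - p_identity)**(n - i)
               for i in range(k, n + 1))


if __name__ == "__main__":
    counts = window_identities("ACGTA", "ACCTA", width=3)
    print(counts)
    print(binomial_tail(24, 25, 0.6))
```

A small tail probability means the observed identity is unlikely under neutrality, which is what makes the window a candidate for conservation.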
Neutral rates are calculated separately at each targeted region. For targets with no gene annotations, the average percent identity across all alignable sequence was instead used to weight the individual species binomial scores (this latter weighting scheme was found to closely match 4D weights). The negative log of these P-values was then averaged across all human-referenced pairwise combinations, and the highest scoring overlapping 25 bp window for each base was the resulting score. This track shows the plotting of a ranked percentile score normalized between 0 and 1 across all ENCODE regions, such that the top 5% most conserved sequences across all ENCODE regions have a score of 0.95 or greater (the top 10% have a score of 0.9 or greater, and so on). BinCons scores were normalized to represent a percentile to the power of 10. For example, scores representing the top 1 percent most conserved sequence (the 99th percentile) have a score greater than or equal to 0.99^10 = 0.904. Transforming scores to the power of 10 was done for visual purposes only, in order to accentuate and distinguish the peaks of more highly conserved regions. More details on binCons can be found in Margulies et al. (2003), cited below. PhastCons The phastCons program predicts conserved elements and produces base-by-base conservation scores using a two-state phylogenetic hidden Markov model. The model consists of a state for conserved regions and a state for nonconserved regions, each of which is associated with a phylogenetic model. These two models are identical except that the branch lengths of the conserved phylogeny are multiplied by a scaling parameter rho (0 < rho < 1). For determining the conservation for the ENCODE TBA alignments, the nonconserved model was estimated from four-fold degenerate coding sites within the ENCODE regions using the program phyloFit.
The parameter rho was then estimated by maximum likelihood, conditional on the nonconserved model, using the EM algorithm implemented in phastCons. Parameter estimation was based on a single large alignment, constructed by concatenating the alignments for all conserved regions. PhastCons was run with the options --expected-lengths 15 and --target-coverage 0.05 to obtain the desired level of "smoothing" and a final coverage by conserved elements of 5%. The conservation score at each base is the posterior probability that the base was generated by the conserved state of the phylo-HMM. It can be interpreted as the probability that the base is in a conserved element, given the assumptions of the model and the estimated parameters. Scores range from 0 to 1, with higher scores corresponding to higher levels of conservation. More details on phastCons can be found in Siepel et al. (2005) cited below. GERP The GERP score is the expected substitution rate divided by the observed substitution rate at a particular human base. Scores are estimated on a column-by-column basis using multiple sequence alignments of mammalian genomic DNA generated by MLAGAN. The scores range from 0 to 3; those greater than 3 are clipped to 3. The expected and observed rates are both calculated on a phylogenetic tree using the same fixed topology. The branch lengths of the expected tree are based on the average substitutions at neutral sites. The branch lengths of the observed tree, which is calculated separately for each human base, are based on the substitutions seen at the column of the multiple alignment at that base. Species that have gaps at a particular column are not considered in the scoring for that column. Higher scores correspond to human bases in alignment columns with higher degrees of similarity, i.e., bases that have evolved slowly, some of which have been under purifying selection. The opposite holds true for swiftly evolving (low similarity) columns.
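The ratio and clipping described for the GERP score can be sketched as follows. This is only an illustration of the arithmetic stated above, not GERP's implementation: the function name is hypothetical, and returning the cap when the observed rate is zero is an assumption made to avoid division by zero.

```python
def gerp_display_score(expected_rate, observed_rate, cap=3.0):
    """Expected rate divided by observed rate, clipped to `cap`.

    Assumption for this sketch: a column with no observed substitutions
    (observed_rate == 0) is given the maximum score rather than raising
    a division error.
    """
    if observed_rate == 0:
        return cap
    return min(expected_rate / observed_rate, cap)

# Hypothetical columns: a slowly evolving one and a rapidly evolving one.
slow_column = gerp_display_score(expected_rate=2.0, observed_rate=0.5)   # clipped at the cap
fast_column = gerp_display_score(expected_rate=1.0, observed_rate=2.0)   # well below the cap
```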
Scores are deterministic, given a maximum-likelihood model of nucleotide substitution, species topology, neutral tree, and alignment. Credits BinCons was developed by Elliott Margulies of the Eric Green lab at NHGRI. PhastCons was developed by Adam Siepel in the Haussler lab at UCSC. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. The data for this track were generated by Elliott Margulies, with assistance from Adam Siepel. References Blanchette, M., Kent, W.J., Reimer, C., Elnitski, L., Smit, A., Roskin, K., Baertsch, R., Rosenbloom, K.R., Clawson, H. et al. Aligning Multiple Genomic Sequences With the Threaded Blockset Aligner. Genome Res 14, 708-15 (2004). Cooper, G.M., Stone, E.A., Asimenos, G., NISC Comparative Sequencing Program, Green, E.D., Batzoglou, S. and Sidow, A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901-13 (2005). Margulies, E.H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D. and Green, E.D. Identification and characterization of multi-species conserved sequences. Genome Res. 13, 2507-18 (2003). Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034-1050 (2005). 
encodeTbaGerpCons TBA GERP Cons TBA GERP Conservation ENCODE Comparative Genomics encodeTbaBinCons TBA BinCons TBA BinCons Conservation ENCODE Comparative Genomics encodeTbaPhastCons TBA PhastCons TBA PhastCons Conservation ENCODE Comparative Genomics encodeTbaElements TBA Elements NHGRI/PSU/UCSC TBA Conserved Elements ENCODE Comparative Genomics Description This track displays multi-species conserved sequences (MCSs) derived from binCons, phastCons, and genomic evolutionary rate profiling (GERP) conservation scoring of Threaded Blockset Aligner (TBA) multiple sequence alignments in the ENCODE regions. The combined-methods subtracks show the union/intersection of conserved elements produced by the three conservation methods. The multiple sequence alignments may be viewed in the TBA Alignment track. Another related track, TBA Cons, shows the conservation scoring. The descriptions accompanying these tracks detail the methods used to create the alignments and conservation scores. Display Conventions and Configuration The locations of conserved elements are indicated by blocks in the graphical display. This composite annotation track consists of several subtracks that show conserved elements derived by the three methods listed above, as well as both unions and intersections of the sets of conserved and non-coding conserved elements. To show only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The display may also be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Display characteristics specific to certain subtracks are described in the respective Methods sections below. Methods BinCons-based Elements For each ENCODE target, a conservation score threshold was picked to match the number of conserved bases predicted by phastCons, an alternative method for measuring conservation.
This latter method has been found slightly more reliable for predicting the expected fraction of conserved sequence in each target. Clusters of bases that exceeded the given conservation score threshold were designated as MCSs. The minimum length of an MCS is 25 bases. Strict cutoffs were used: if even one base fell below the conservation score threshold, it separated an MCS into two distinct regions. PhastCons-based Elements The predicted MCSs are segments of the alignment that are likely to have been "generated" by the conserved state of the phylo-HMM, i.e., maximal segments in which the maximum-likelihood (Viterbi) path remains in the conserved state. GERP-based Elements GERP elements are scored according to the inferred intensity of purifying selection and are measured as "rejected substitutions" (RSs). RSs capture the magnitude of difference between the number of "observed" substitutions (estimated using maximum likelihood) and the number that would be "expected" under a neutral model of evolution. The RS is displayed as part of the item name. Items with higher RSs are displayed in a darker shade of blue. The score shown on the details page, which has been scaled by 300 for display purposes, is generally not as accurate as the RS count that is part of the item name. "Constrained elements" are identified as those groups of consecutive human bases that have an observed rate of evolution that is smaller than the expected rate. These groups of columns are merged if they are less than a few nucleotides apart and are scored according to the sum of the site-by-site difference between observed and expected rates (RS). Permutations of the actual alignments were analyzed, and the "constrained elements" identified in these permuted alignments were treated as "false positives". 
Subsequently, an RS threshold was picked such that the total length of "false positive" constrained elements (identified in the permuted alignments) was less than 5% of the length of constrained elements identified in the actual alignment. Thus, all annotated constrained elements are significant at better than 95% confidence, and the total fraction of the ENCODE regions annotated as constrained is 5-7%. PhastCons/BinCons/GERP Union/Intersection of Conserved Elements These subtracks were produced by creating unions and intersections of the constrained element data detected by binCons, phastCons, and GERP on TBA alignments. In these annotations, "non-coding" is defined as those regions not overlapping with CDS regions in any of the following UCSC gene tables: refFlat, knownGene, mgcGenes, vegaGene, or ensGene. Credits BinCons and phastCons MCS data were contributed by Elliott Margulies in the Eric Green lab at NHGRI, with assistance from Adam Siepel of UCSC. GERP was developed primarily by Greg Cooper in the lab of Arend Sidow at Stanford University (Depts of Pathology and Genetics), in close collaboration with Eric Stone (Biostatistics, NC State), and George Asimenos and Eugene Davydov in the lab of Serafim Batzoglou (Dept. of Computer Science, Stanford). TBA was provided by Minmei Hou, Scott Schwartz and Webb Miller of the Penn State Bioinformatics Group. References See the TBA Alignment and TBA Cons tracks for references. 
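The BinCons-based element calling described in the Methods above (runs of bases over a score threshold, a 25-base minimum length, and a strict split whenever a single base falls below the threshold) can be sketched as follows. The function name and the half-open coordinate convention are assumptions for this example, as is treating a score equal to the threshold as passing.

```python
def call_mcs(scores, threshold, min_length=25):
    """Return half-open (start, end) intervals of consecutive bases whose
    scores all meet the threshold and whose length is at least min_length.

    Assumption for this sketch: a score equal to the threshold passes.
    """
    elements = []
    run_start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if run_start is None:
                run_start = i          # open a new run
        else:
            # A single sub-threshold base closes the current run (strict cutoff).
            if run_start is not None and i - run_start >= min_length:
                elements.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the region.
    if run_start is not None and len(scores) - run_start >= min_length:
        elements.append((run_start, len(scores)))
    return elements

# Hypothetical per-base scores: a 30-base run, a 10-base run, and a 25-base run,
# each separated by one low-scoring base.
example = [0.9] * 30 + [0.1] + [0.9] * 10 + [0.1] + [0.9] * 25
mcs = call_mcs(example, threshold=0.8)
```

In this example the 10-base run is discarded because it is shorter than the 25-base minimum.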
encodeTbaNcIntersectEl TBA NC Intersect TBA PhastCons/BinCons/GERP Intersection NonCoding Conserved Elements ENCODE Comparative Genomics encodeTbaIntersectEl TBA Intersect TBA PhastCons/BinCons/GERP Intersection Conserved Elements ENCODE Comparative Genomics encodeTbaNcUnionEl TBA NC Union TBA PhastCons/BinCons/GERP Union NonCoding Conserved Elements ENCODE Comparative Genomics encodeTbaUnionEl TBA Union TBA PhastCons/BinCons/GERP Union Conserved Elements ENCODE Comparative Genomics encodeTbaGerpEl TBA GERP TBA GERP Conserved Elements ENCODE Comparative Genomics encodeTbaBinConsEl TBA BinCons TBA BinCons Conserved Elements ENCODE Comparative Genomics encodeTbaPhastConsEl TBA PhastCons TBA PhastCons Conserved Elements ENCODE Comparative Genomics encode_tba23EvoFold TBA23 EvoFold EvoFold Predictions of RNA Secondary Structure Using TBA23 ENCODE Regions and Genes Description This track shows RNA secondary structure predictions made with the EvoFold program, a comparative method that exploits the evolutionary signal of genomic multiple-sequence alignments for identifying conserved functional RNA structures. Display Conventions and Configuration Track elements are labeled using the convention ID_strand_score. At the zoomed-out level, secondary structure prediction regions are indicated by blocks, with the stem-pairing regions shown in a darker shade than unpaired regions. Arrows indicate the predicted strand. When zoomed in to the base level, the specific secondary structure predictions are shown in parenthesis format. The confidence score for each position is indicated in grayscale, with darker shades corresponding to higher scores. The details page for each track element shows the predicted secondary structure (labeled SS anno), together with details of the multiple species alignments at that location. 
Substitutions relative to the human sequence are color-coded according to their compatibility with the predicted secondary structure (see the color legend on the details page). Each prediction is assigned an overall score and a sequence of position-specific scores. The overall score measures evidence for any functional RNA structures in the given region, while the position-specific scores (0 - 9) measure the confidence of the base-specific annotations. Base-pairing positions are annotated with the same pair symbol. The offsets are provided to ease visual navigation of the alignment in terms of the human sequence. The offset is calculated (in units of ten) from the start position of the element on the positive strand or from the end position when on the negative strand. The graphical display may be filtered to show only those track elements with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Methods EvoFold makes use of phylogenetic stochastic context-free grammars (phylo-SCFGs), which are combined probabilistic models of RNA secondary structure and primary sequence evolution. The predictions consist of both a specific RNA secondary structure and an overall score. The overall score is essentially a log-odds score between a phylo-SCFG modeling the constrained evolution of stem-pairing regions and one which only models unpaired regions. The predictions for this track were based on the conserved elements of the 23-way threaded blockset aligner (TBA) alignments present in the ENCODE regions (see the TBA Alignment track for more information). Credits The EvoFold program and browser track were developed by Jakob Skou Pedersen of the UCSC Genome Bioinformatics Group. The 23-way TBA multiple alignments were created by Elliott Margulies of the Green Lab at NHGRI. References Knudsen B, Hein J.
RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 1999 Jun;15(6):446-54. Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol. 2006 Apr;2(4):e33. Pedersen JS, Meyer IM, Forsberg R, Simmonds P, Hein J. A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucl Acids Res. 2004 Sep 24;32(16):4925-36. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005 Aug;15(8):1034-50. tfbsCons TFBS Conserved HMR Conserved Transcription Factor Binding Sites Regulation Description This track contains the location and score of transcription factor binding sites conserved in the human/mouse/rat alignment. A binding site is considered to be conserved across the alignment if its score meets the threshold score for that binding site in all 3 species. The score and threshold are computed with the Transfac Matrix Database (v4.0) created by Biobase. The data are purely computational, and as such not all binding sites listed here are biologically functional binding sites. In the graphical display, each box represents one conserved tfbs. The darker the box, the better the match of the binding site. Clicking on a box brings up detailed information on the binding site, namely its Transfac I.D., a link to its Transfac Matrix (free registration with Transfac required), its location in the human genome (chromosome, start, end, and strand), its length in bases, and its score. 
All binding factors that are known to bind to the particular binding site are listed along with their species, SwissProt ID, and a link to that factor's page on the UCSC Protein Browser if such an entry exists. Methods A binding site is considered to be conserved across the alignment if its score meets the threshold score for that binding site at exactly the same position in the alignment in all 3 species. If there is no orthologous sequence in the mouse or the rat, no prediction is made. The following is a brief discussion of the scoring and threshold system used for these data. The Transfac Matrix Database contains position-weight matrices for 336 transcription factor binding sites, as characterized through experimental results in the scientific literature. A typical (in this case fictitious) matrix will look something like:

      A   C   G   T
01   15  15  15  15   N
02   20  10  15  15   N
03    0   0  60   0   G
04   60   0   0   0   A
05    0   0   0  60   T

The above matrix specifies the results of 60 (the sum of each row) experiments. In the experiments, the first position of the binding site was A 15 times, C 15 times, G 15 times, and T 15 times (and so on for each position). The consensus sequence of the above binding site as characterized by the matrix is NNGAT. The format of the consensus sequence is the deduced consensus in the IUPAC 15-letter code. The score of a segment of DNA is computed in relation to a matrix as follows: score = SUM over each position in the matrix of matrix[position][nucleotide_in_segment_at_this_position]. For example, the sequence "CCGAT" would have a score of: 15 + 10 + 60 + 60 + 60 = 205 for the above matrix. A score in relation to a matrix of length n can be computed for every DNA segment of length n.
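The scoring rule above, together with the threshold and final-score formulas given in the paragraph that follows, can be sketched in a few lines. This is an illustration using the fictitious matrix from the text, not the tfloc implementation; the function names are hypothetical, and rounding the final score to the nearest integer is an assumption.

```python
# Fictitious matrix from the text: one dict per position, counts for A, C, G, T.
MATRIX = [
    {"A": 15, "C": 15, "G": 15, "T": 15},
    {"A": 20, "C": 10, "G": 15, "T": 15},
    {"A": 0, "C": 0, "G": 60, "T": 0},
    {"A": 60, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 60},
]

def segment_score(matrix, segment):
    """Sum matrix[position][nucleotide] over a segment of matching length."""
    if len(segment) != len(matrix):
        raise ValueError("segment length must equal matrix length")
    return sum(row[base] for row, base in zip(matrix, segment))

def score_bounds(matrix):
    """Return (Smin, Smax): the lowest and highest achievable scores."""
    smin = sum(min(row.values()) for row in matrix)
    smax = sum(max(row.values()) for row in matrix)
    return smin, smax

def threshold(matrix, cutoff):
    """St = Smin + (Smax - Smin) * C."""
    smin, smax = score_bounds(matrix)
    return smin + (smax - smin) * cutoff

def final_score(matrix, segment):
    """((Score - Smin) / (Smax - Smin)) * 1000, rounded (rounding is assumed)."""
    smin, smax = score_bounds(matrix)
    return round(1000 * (segment_score(matrix, segment) - smin) / (smax - smin))

score = segment_score(MATRIX, "CCGAT")                 # 205
is_hit = score >= threshold(MATRIX, cutoff=0.85)       # True
reported = final_score(MATRIX, "CCGAT")                # 947
```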
The threshold for a binding site is computed from its Transfac Matrix Database entry as follows: St = Smin + ((Smax - Smin) * C), where St is the target threshold score, Smin is the minimum possible score, Smax is the maximum possible score, and C is the cutoff value used by the scoring function. For example, the above matrix has a minimum score of 15 + 10 + 0 + 0 + 0 = 25 and a maximum score of 15 + 20 + 60 + 60 + 60 = 215. Using a cutoff value of 0.85 (the value used for this track), the threshold for the above matrix is: 25 + ((215 - 25) * 0.85) = 186.5. As such, the sequence "CCGAT" from above would be recorded as a hit with a cutoff value of 0.85, since its score (205) exceeds the threshold for this particular binding site (186.5). The final score reported is the minimum cutoff value at which the position would have been recorded as a hit, multiplied by 1000. The final score of the above example is therefore: ((Score - Smin) / (Smax - Smin)) * 1000 = ((205 - 25) / (215 - 25)) * 1000 = 0.947 * 1000 = 947. Although the scores of all three species in the alignment must exceed the threshold, the only final score that is reported for this track is the final score of the binding site in the human sequence. It should be noted that the positions of many of these conserved binding sites coincide with known exons and other highly conserved regions. Regions such as these are more likely to contain false positive matches, as the high sequence identity across the alignment increases the likelihood that a short motif resembling a binding site is conserved. Conversely, matches found in introns and intergenic regions are more likely to be real binding sites, since these regions are mostly poorly conserved. These data were obtained by running the program tfloc (Transcription Factor binding site LOCater) on multiz humor alignments of the Feb.
2003 mouse draft assembly (mm3) and the June 2003 rat assembly (rn3) to the July 2003 human genome assembly (hg16). Tfloc was run on the subset of the Transfac Matrix Database containing human, mouse, and rat related binding sites (164 total). Transcription factor information was culled from the Transfac Factor database. Credits These data were generated using the Transfac Matrix and Factor databases created by Biobase. The tfloc program was developed at The Pennsylvania State University by Matt Weirauch. This track was created by Matt Weirauch and Brian Raney at The University of California at Santa Cruz. tigrGeneIndex TIGR Gene Index Alignment of TIGR Gene Index TCs Against the Human Genome mRNA and EST Description This track displays alignments of the TIGR Gene Index (TGI) against the human genome. The TIGR Gene Index is based largely on assemblies of EST sequences in the public databases. See the following page for more information about TIGR Gene Indices. Credits Thanks to Foo Cheung and Razvan Sultana of The Institute for Genomic Research for converting these data into a track for the browser. twinscan Twinscan Twinscan Gene Predictions Using Mouse/Human Homology Genes and Gene Predictions Description The Twinscan program predicts genes in a manner similar to Genscan, except that Twinscan takes advantage of genome comparisons to improve gene prediction accuracy. In the version of Twinscan used to generate this track, intronless copies of known genes are masked out before gene prediction, reducing the number of non-processed pseudogenes in gene models. More information and a web server can be found at https://mblab.wustl.edu/. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks.
The track description page offers the following filter and configuration options: Color track by codons: Select the genomic codons option to color and label each codon in a zoomed-in display to facilitate validation and comparison of gene predictions. Click the Help on codon coloring link on the track description page for more information about this feature. Methods The Twinscan algorithm is described in Korf, I. et al. 2001 in the References section below. Credits Thanks to Michael Brent's Computational Genomics Group at Washington University St. Louis for providing these data. References Korf I, Flicek P, Duan D, Brent MR. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001 Jun 1;17(90001)S140-8. encodeUCDavisE2F1Median UCD Ng E2F1 UC Davis ChIP/Chip NimbleGen - E2F1 ab, HeLa Cells ENCODE Chromatin Immunoprecipitation Description ChIP analysis was performed using an antibody to E2F1 and HeLa cell chromatin. E2F1 is a transcription factor important in controlling cell division. Three independently crosslinked preparations of HeLa cells were used to provide three independent biological replicates. ChIP assays were performed (with minor modifications which can be provided upon request) using the protocol found at The Farnham Laboratory. Array hybridizations were performed using standard NimbleGen Systems Inc. conditions. Display Conventions and Configuration This track may be configured in a variety of ways to highlight different aspects of the displayed data. For more information about the graphical configuration options, click the Graph configuration help link. Methods Ratio intensity values (E2F1 vs. total) for each of three biological replicates were calculated and converted to log2. Each set of ratio values was then independently scaled by its Tukey bi-weight mean. The three replicates were then combined by taking the median scaled log2 ratio for each oligo. Verification Primers were chosen to correspond to 13 individual peaks. 
PCR reactions were performed for each of the 13 primer sets using amplicons derived from each of three biological samples (39 reactions). The PCR reactions confirmed that all of the 13 chosen peaks were bound by E2F1 in all three biological samples. Credits These data were contributed by Mike Singer, Kyle Munn, Nan Jiang, Todd Richmond and Roland Green of NimbleGen Systems, Inc., and Matt Oberley, David Inman, Mark Bieda, Shally Xu and Peggy Farnham of Farnham Lab. encodeNoncodingTransFrags Unann Transfrags Yale and Affymetrix Unannotated TransFrags ENCODE Analysis Description This track shows selected non-coding Affy transcribed fragments (transfrags) and Yale transcriptionally-active regions (TARs) from the Affy Transfrags and Yale TAR annotation tracks, grouped into subsets based on their proximity to the transcript-coding and exon-coding regions of Gencode genes. The transfrags/TARs are divided into four categories: Intergenic distal subtracks: Transfrags/TARs in the intergenic ENCODE areas, i.e., outside the transcript-coding region of any Gencode gene and at least 5000 base pairs distant from any Gencode gene exon-coding region. Intergenic proximal subtracks: Transfrags/TARs in the intergenic ENCODE regions that are close to Gencode genes, i.e., outside the transcript-coding region of any Gencode gene, but within 5000 bases of a Gencode gene exon-coding region. Intronic distal subtracks: Transfrags/TARs within Gencode gene introns, i.e., within the transcript-coding region of a Gencode gene, but at least 5000 base pairs distant from the exon-coding region of any Gencode gene. Intronic proximal subtracks: Transfrags/TARs within Gencode gene introns that are close to Gencode gene exons, i.e., within the transcript-coding region of a Gencode gene and also within 5000 bases of a Gencode gene exon-coding region. Display Conventions and Configuration This annotation track is composed of subtracks that show different aspects of the displayed data.
The complete list of subtracks is displayed near the top of the track description page. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The "Select subtracks" section allows you to select or unselect entire groups of subtracks based on a particular data characteristic, such as distance and region, cell type, data source, etc. To add or remove an entire subgroup of data from the graphical display, click the appropriate add or clear button. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing among the different data sets and subgroups. Methods This annotation track was derived from data contained in the Yale TAR and Affy Transfrags annotation tracks using the UCSC Table Browser. See the description pages associated with these annotation tracks for more information about the data sets. Credits This annotation was generated by Hiram Clawson of the UCSC Genome Bioinformatics Group. encodeYaleAffyPlacRNATarsIntergenicDistal Yale Ig Dst Plac Yale Intergenic Distal Placental TARs ENCODE Analysis encodeYaleAffyNeutRNATarsAllIntergenicDistal Yale Ig Dst Neu Yale Intergenic Distal Neutrophil TARs ENCODE Analysis encodeYaleAffyNB4UntrRNATarsIntergenicDistal Yale Ig Dst NB4 Un Yale Intergenic Distal Untreated NB4 TARs ENCODE Analysis encodeYaleAffyNB4TPARNATarsIntergenicDistal Yale Ig Dst NB4 TPA Yale Intergenic Distal TPA-Treated NB4 TARs ENCODE Analysis encodeYaleAffyNB4RARNATarsIntergenicDistal Yale Ig Dst NB4 RA Yale Intergenic Distal NB4 Retinoic TARs ENCODE Analysis encodeAffyRnaHl60SitesHr32IntergenicDistal Affy Ig Dst HL60 32h Affymetrix Intergenic Distal HL60 Retinoic 32hr Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr08IntergenicDistal Affy Ig Dst HL60 8h Affymetrix Intergenic Distal HL60 Retinoic 8hr Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr02IntergenicDistal Affy Ig Dst HL60 2h Affymetrix Intergenic Distal HL60 Retinoic 2hr Transfrags ENCODE Analysis 
encodeAffyRnaHl60SitesHr00IntergenicDistal Affy Ig Dst HL60 0h Affymetrix Intergenic Distal HL60 Transfrags ENCODE Analysis encodeAffyRnaHeLaSitesIntergenicDistal Affy Ig Dst HeLa Affymetrix Intergenic Distal HeLa Transfrags ENCODE Analysis encodeAffyRnaGm06990SitesIntergenicDistal Affy Ig Dst GM06990 Affymetrix Intergenic Distal GM06990 Transfrags ENCODE Analysis encodeYaleAffyPlacRNATarsIntergenicProximal Yale Ig Prx Plac Yale Intergenic Proximal Placental TARs ENCODE Analysis encodeYaleAffyNeutRNATarsAllIntergenicProximal Yale Ig Prx Neu Yale Intergenic Proximal Neutrophil TARs ENCODE Analysis encodeYaleAffyNB4UntrRNATarsIntergenicProximal Yale Ig Prx NB4 Un Yale Intergenic Proximal Untreated NB4 TARs ENCODE Analysis encodeYaleAffyNB4TPARNATarsIntergenicProximal Yale Ig Prx NB4 TPA Yale Intergenic Proximal NB4 TPA-Treated TARs ENCODE Analysis encodeYaleAffyNB4RARNATarsIntergenicProximal Yale Ig Prx NB4 RA Yale Intergenic Proximal NB4 Retinoic TARs ENCODE Analysis encodeAffyRnaHl60SitesHr32IntergenicProximal Affy Ig Prx HL60 32h Affymetrix Intergenic Proximal HL60 Retinoic 32hr Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr08IntergenicProximal Affy Ig Prx HL60 8h Affymetrix Intergenic Proximal HL60 Retinoic 8hr Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr02IntergenicProximal Affy Ig Prx HL60 2h Affymetrix Intergenic Proximal HL60 Retinoic 2hr Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr00IntergenicProximal Affy Ig Prx HL60 0h Affymetrix Intergenic Proximal HL60 Transfrags ENCODE Analysis encodeAffyRnaHeLaSitesIntergenicProximal Affy Ig Prx HeLa Affymetrix Intergenic Proximal Hela Transfrags ENCODE Analysis encodeAffyRnaGm06990SitesIntergenicProximal Affy Ig Prx GM06990 Affymetrix Intergenic Proximal GM06990 Transfrags ENCODE Analysis encodeYaleAffyPlacRNATarsIntronsDistal Yale In Dst Plac Yale Intronic Distal Placental TARs ENCODE Analysis encodeYaleAffyNeutRNATarsAllIntronsDistal Yale In Dst Neu Yale Intronic Distal Neutrophil TARs ENCODE 
Analysis encodeYaleAffyNB4UntrRNATarsIntronsDistal Yale In Dst NB4 Un Yale Intronic Distal Untreated NB4 TARs ENCODE Analysis encodeYaleAffyNB4TPARNATarsIntronsDistal Yale In Dst NB4 TPA Yale Intronic Distal NB4 TPA-Treated TARs ENCODE Analysis encodeYaleAffyNB4RARNATarsIntronsDistal Yale In Dst NB4 Yale Intronic Distal NB4 Retinoic TARs ENCODE Analysis encodeAffyRnaHl60SitesHr32IntronsDistal Affy In Dst HL60 32h Affy Intronic Distal Hl60 32hr Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr08IntronsDistal Affy In Dst HL60 8h Affy Intronic Distal Hl60 8hr Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr02IntronsDistal Affy In Dst HL60 2h Affy Intronic Distal Hl60 2hr Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr00IntronsDistal Affy In Dst HL60 Affy Intronic Distal Hl60 Transfrags ENCODE Analysis encodeAffyRnaHeLaSitesIntronsDistal Affy In Dst HeLa Affy Intronic Distal HeLa Transfrags ENCODE Analysis encodeAffyRnaGm06990SitesIntronsDistal Affy In Dst GM06990 Affy Intronic Distal GM06990 Transfrags ENCODE Analysis encodeYaleAffyPlacRNATarsIntronsProximal Yale In Prx Plac Yale Intronic Proximal Placental TARs ENCODE Analysis encodeYaleAffyNeutRNATarsAllIntronsProximal Yale In Prx Neu Yale Intronic Proximal Neutrophil TARs ENCODE Analysis encodeYaleAffyNB4UntrRNATarsIntronsProximal Yale In Prx NB4 Un Yale Intronic Proximal NB4 TARs ENCODE Analysis encodeYaleAffyNB4TPARNATarsIntronsProximal Yale In Prx NB4 TPA Yale Intronic Proximal TPA-Treated NB4 TARs ENCODE Analysis encodeYaleAffyNB4RARNATarsIntronsProximal Yale In Prx NB4 RA Yale Intronic Proximal NB4 Retinoic TARs ENCODE Analysis encodeAffyRnaHl60SitesHr32IntronsProximal Affy In Prx HL60 32h Affy Intronic Proximal HL60 Retinoic 32h Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr08IntronsProximal Affy In Prx HL60 8h Affy Intronic Proximal HL60 Retinoic 8h Transfrags ENCODE Analysis encodeAffyRnaHl60SitesHr02IntronsProximal Affy In PrxHL60 2h Affy Intronic Proximal HL60 Retinoic 2hr Transfrags ENCODE 
Analysis encodeAffyRnaHl60SitesHr00IntronsProximal Affy In Prx HL60 Affy Intronic Proximal HL60 Transfrags ENCODE Analysis encodeAffyRnaHeLaSitesIntronsProximal Affy In Prx HeLa Affy Intronic Proximal HeLa Transfrags ENCODE Analysis encodeAffyRnaGm06990SitesIntronsProximal Affy In Prx GM06990 Affy Intronic Proximal GM06990 Transfrags ENCODE Analysis uniGene_2 UniGene UniGene Hs 162 Alignments and SAGEmap Info mRNA and EST Description Serial analysis of gene expression (SAGE) is a quantitative measurement of gene expression. Data are presented for every cluster contained in the browser window and the selected cluster name is highlighted in the table. All data are from the repository at the SageMap project built on UniGene version Hs 162. Click on a UniGene cluster name on the track details page to display SageMap's page for that cluster. Please note that data are not available for every cluster. There is no data available for clusters that lie entirely within the bounds of larger clusters. Methods SAGE counts are produced by sequencing small "tags" of DNA believed to be associated with a gene. These tags were generated by attaching poly-A RNA to oligo-dT beads. After synthesis of double-stranded cDNA, transcripts were cleaved by an anchoring enzyme (usually NlaIII). Then, small tags were produced by ligation with a linker containing a type IIS restriction enzyme site and cleavage with the tagging enzyme (usually BsmFI). The tags were concatenated together and sequenced. The frequency of each tag was counted and used to infer expression level of transcripts that could be matched to that tag. Credits All SAGE data presented here were mapped to UniGene transcripts by the SageMap project at NCBI. encodeRegulomeAmplicon UW/Reg Amplicon ENCODE UW/Regulome Amplicon ENCODE Chromosome, Chromatin and DNA Structure Description This track shows a tiling path of PCR amplicons, along with their raw DNaseI sensitivity scores, across all ENCODE regions. 
It is one of a set of tracks that annotate continuous DNaseI sensitivity measurements and DNaseI hypersensitive sites (HSs) over the ENCODE regions. DNaseI has long been used to map general chromatin accessibility and the DNaseI "hyperaccessibility" or "hypersensitivity" that is a universal feature of active cis-regulatory sequences. The data were produced using quantitative chromatin profiling (QCP) (Dorschner et al., 2004). DNaseI-treated and untreated chromatin samples from the following cell lines/phenotypes were studied: GM06990 (lymphoblastoid; all regions) CaCo2 (intestinal; most regions) SK-N-SH (neural; most regions) HepG2 (hepatic; ENm003/Apolipoprotein A1 locus) Huh7 (hepatic; ENm003, ENr131, ENr213, ENr321) K562 (erythroid; ENm009/Beta-globin locus) ERY (CD34-derived primary adult erythroblasts; ENm009/Beta-globin locus) Display Conventions and Configuration The display is separated into "odd" and "even" amplicons, to provide a visually distinct appearance among amplicons, so that adjacent amplicons are always in different subtracks. The details page for each amplicon reveals its start/stop coordinates and its raw DNaseI sensitivity score. The score is calculated by the formula (copies in DNaseI-treated / copies in DNaseI-untreated) * 1000. The graphical display may be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Methods QCP was performed as described in Dorschner et al. PCR amplicons of ~250 bp in size were tiled end-to-end across the study regions. An amplicon tiling path has been computed over all regions and is available through UniSTS. Chromatin preparation and DNaseI treatment were performed on the cell types listed above as described in Dorschner et al. High-throughput real-time PCR was used to quantify DNaseI sensitivity at each amplicon by measuring copies remaining in DNaseI-treated vs. untreated samples.
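The raw score formula given under Display Conventions, (copies in DNaseI-treated / copies in DNaseI-untreated) * 1000, amounts to the following. The function name and the copy numbers are hypothetical, and rejecting a non-positive untreated count is an assumption of this sketch.

```python
def raw_dnase_score(copies_treated, copies_untreated):
    """(copies in DNaseI-treated / copies in DNaseI-untreated) * 1000."""
    if copies_untreated <= 0:
        # Assumption: an amplicon with no untreated copies cannot be scored.
        raise ValueError("untreated copy number must be positive")
    return copies_treated / copies_untreated * 1000

# Hypothetical amplicon: half of the copies remain after DNaseI treatment.
example_score = raw_dnase_score(copies_treated=50, copies_untreated=100)
```

Because the score is the fraction of copies that survive digestion, lower values indicate greater DNaseI sensitivity at that amplicon.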
The results were then analyzed with a statistical algorithm to compute the moving baseline of mean DNaseI sensitivity and to identify outliers that correspond with DNaseI hypersensitive sites. Verification QCP measurements were performed in replicate (6X) on pooled biological replicate samples. Validation of the results was carried out by conventional DNaseI hypersensitivity assays using end-labeling/Southern blotting. A total of 1.17 Mb have been evaluated by conventional assay. The specificity was defined as the number of true negative evaluable QCP amplicons divided by the sum of the true negatives plus false positives. Using 246.2 Kb from ENm002, the specificity was calculated to be 0.997. The sensitivity of the QCP assay was calculated as the true positives divided by the sum of the true positives plus false negatives. The sensitivity measured for ENm002 was 0.9487. Credits Data generation, analysis, and validation were performed jointly by groups at Regulome Corporation and the University of Washington (UW) in Seattle. Regulome Corp.: Michael O. Dorschner, Richard Humbert, Peter J. Sabo, Anthony Shafer, Jeff Goldy, Molly Weaver, Kristin Lee, Fidencio Neri, Brendan Henry, Mike Hawrylycz, Paul Tittel, Jim Wallace, Josh Mack, Janelle Kawamoto, John A. Stamatoyannopoulos. UW Medical Genetics: Patrick Navas, Man Yu, Hua Cao, Brent Johnson, Ericka Johnson, George Stamatoyannopoulos. UW Genome Sciences: Scott Kuehn, Robert Thurman, William S. Noble. References Dorschner, M.O., Hawrylycz, M., Humbert, R., Wallace, J.C., Shafer, A., Kawamoto, J., Mack, J., Hall, R., Goldy, J., Sabo, P.J. et al. High-throughput localization of functional elements by quantitative chromatin profiling. Nat Methods 1(3), 219-25 (2004). 
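The raw amplicon score and the specificity/sensitivity definitions given above for the UW/Reg Amplicon track can be written out as a minimal Python sketch. The function names and the counts in the usage note are illustrative assumptions, not part of the published QCP pipeline.

```python
def amplicon_score(copies_treated, copies_untreated):
    """Raw DNaseI sensitivity score: (treated / untreated) * 1000."""
    if copies_untreated <= 0:
        raise ValueError("copies_untreated must be positive")
    return copies_treated / copies_untreated * 1000


def specificity(true_negatives, false_positives):
    """True negatives divided by (true negatives + false positives)."""
    return true_negatives / (true_negatives + false_positives)


def sensitivity(true_positives, false_negatives):
    """True positives divided by (true positives + false negatives)."""
    return true_positives / (true_positives + false_negatives)
```

For example, with hypothetical counts of 37 true positives and 2 false negatives, sensitivity(37, 2) evaluates to roughly 0.9487; the actual counts behind the ENm002 figure are not given in this description.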
encodeRegulomeAmpliconEven Amplicon (Even) Amplicon (Even) ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeAmpliconOdd Amplicon (Odd) Amplicon (Odd) ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeProb UW/Reg DNaseI HSs ENCODE UW/Regulome DNaseI hypersensitive sites/scores ENCODE Chromosome, Chromatin and DNA Structure Description This track identifies amplicons overlying DNaseI hypersensitive sites (HSs) and provides an empirical P-value for each. The track is one of a set of tracks that annotate continuous DNaseI sensitivity measurements and DNaseI hypersensitive sites (HSs) over ENCODE regions. DNaseI has long been used to map general chromatin accessibility and the DNaseI "hyperaccessibility" or "hypersensitivity" that is a universal feature of active cis-regulatory sequences. The data were produced using quantitative chromatin profiling (QCP) (Dorschner et al., 2004). See the UW/Reg Amplicon track for a list of the cell lines/phenotypes studied in these experiments. Display Conventions and Configuration Data values are represented on a vertical axis as a score between 0 and 3, corresponding to -log10(P-value) (i.e., a score of 3 indicates a P-value of less than 0.001). Note that these are empirically determined P-values, not binomial/Gaussian P-values. Also, the HSs are called only in the context of plates with an acceptable (though conservative) plate quality score (see the Regulome Quality track). The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. 
Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different cell lines/phenotypes. Methods QCP was performed as described in Dorschner et al. See the UW/Reg Amplicon track description for more information. Verification See the UW/Reg Amplicon track description for verification information. Credits Data generation, analysis, and validation were performed jointly by groups at Regulome Corporation and the University of Washington (UW) in Seattle. Regulome Corp.: Michael O. Dorschner, Richard Humbert, Peter J. Sabo, Anthony Shafer, Jeff Goldy, Molly Weaver, Kristin Lee, Fidencio Neri, Brendan Henry, Mike Hawrylycz, Paul Tittel, Jim Wallace, Josh Mack, Janelle Kawamoto, John A. Stamatoyannopoulos. UW Medical Genetics: Patrick Navas, Man Yu, Hua Cao, Brent Johnson, Ericka Johnson, George Stamatoyannopoulos. UW Genome Sciences: Scott Kuehn, Robert Thurman, William S. Noble. References Dorschner, M.O., Hawrylycz, M., Humbert, R., Wallace, J.C., Shafer, A., Kawamoto, J., Mack, J., Hall, R., Goldy, J., Sabo, P.J. et al. High-throughput localization of functional elements by quantitative chromatin profiling. Nat Methods 1(3), 219-25 (2004). 
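The relationship between an empirical P-value and the displayed UW/Reg DNaseI HSs score can be illustrated with a short Python sketch. The function name hs_score and the clamping of the result to the 0 to 3 display range are assumptions made for illustration; the description states only that scores lie between 0 and 3 and correspond to -log10(P-value).

```python
import math


def hs_score(p_value, cap=3.0):
    """Convert an empirical P-value to a -log10 score limited to [0, cap]."""
    if not 0 < p_value <= 1:
        raise ValueError("p_value must be in (0, 1]")
    # Clamp to the displayed range; the cap is an assumption about display.
    return max(0.0, min(cap, -math.log10(p_value)))
```

Under this sketch a P-value of 0.01 maps to a score of about 2, and any P-value of 0.001 or smaller maps to the maximum score of 3.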
encodeRegulomeProbERY ERY Adult Erythroblast DNaseI HSs ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeProbK562 K562 K562 DNaseI HSs ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeProbHepG2 HepG2 HepG2 DNaseI HSs ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeProbHuh7 Huh7 Huh7 DNaseI HSs ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeProbSKNSH SKNSH SKNSH DNaseI HSs ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeProbGM06990 GM06990 GM06990 DNaseI HSs ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeProbCACO2 CACO2 CACO2 DNaseI HSs ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeBase UW/Reg DNaseI Sens ENCODE UW/Regulome Mean DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure Description This track shows the moving baseline of mean DNaseI sensitivity, computed over each PCR amplicon using a locally-weighted least squares (LOWESS)-based algorithm described in Dorschner et al (2004). The track is one of a set of tracks that annotate continuous DNaseI sensitivity measurements and DNaseI hypersensitive sites (HSs) over ENCODE regions. DNaseI has long been used to map general chromatin accessibility and the DNaseI "hyperaccessibility" or "hypersensitivity" that is a universal feature of active cis-regulatory sequences. The data were produced using quantitative chromatin profiling (QCP) (Dorschner et al.). See the UW/Reg Amplicon track for a list of the cell lines/phenotypes studied in these experiments. Display Conventions and Configuration The displayed values are calculated as (copies in DNaseI-untreated / copies in DNaseI-treated). Thus, increasing values represent increasing sensitivity. The subtracks within this composite annotation track may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. 
To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different cell lines/phenotypes. Methods QCP was performed as described in Dorschner et al. See the UW/Reg Amplicon track description for more information. Verification See the UW/Reg Amplicon track description for verification information. Credits Data generation, analysis, and validation were performed jointly by groups at Regulome Corporation and the University of Washington (UW) in Seattle. Regulome Corp.: Michael O. Dorschner, Richard Humbert, Peter J. Sabo, Anthony Shafer, Jeff Goldy, Molly Weaver, Kristin Lee, Fidencio Neri, Brendan Henry, Mike Hawrylycz, Paul Tittel, Jim Wallace, Josh Mack, Janelle Kawamoto, John A. Stamatoyannopoulos. UW Medical Genetics: Patrick Navas, Man Yu, Hua Cao, Brent Johnson, Ericka Johnson, George Stamatoyannopoulos. UW Genome Sciences: Scott Kuehn, Robert Thurman, William S. Noble. References Dorschner, M.O., Hawrylycz, M., Humbert, R., Wallace, J.C., Shafer, A., Kawamoto, J., Mack, J., Hall, R., Goldy, J., Sabo, P.J. et al. High-throughput localization of functional elements by quantitative chromatin profiling. Nat Methods 1(3), 219-25 (2004). 
encodeRegulomeBaseERY ERY Adult Erythroblast DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeBaseK562 K562 K562 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeBaseHepG2 HepG2 HepG2 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeBaseHuh7 Huh7 Huh7 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeBaseSKNSH SKNSH SKNSH DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeBaseGM06990 GM06990 GM06990 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeBaseCACO2 CACO2 CACO2 DNaseI Sensitivity ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeQuality UW/Reg Plate Q/A ENCODE UW/Regulome Plate Quality Score ENCODE Chromosome, Chromatin and DNA Structure Description This track provides a visual representation of data quality scores, which range from 0 to 1, for each plate in the UW/Regulome experiments. It is one of a set of tracks that annotate continuous DNaseI sensitivity measurements and DNaseI hypersensitive sites (HSs) over ENCODE regions. DNaseI has long been used to map general chromatin accessibility and the DNaseI "hyperaccessibility" or "hypersensitivity" that is a universal feature of active cis-regulatory sequences. The data were produced using quantitative chromatin profiling (QCP) (Dorschner et al., 2004). Quality scores are available on the following cell lines/phenotypes and chromosomes (Cell Line/Phenotype: Chromosomes): CACO2: 5, 7, 9, 11, 12, 16, X; GM06990: 2, 5, 7, 8, 9, 11, 12, 16, 18, X; SKNSH: 5, 7, 9, 11, 12, 16, X; Huh7: 2, 8, 11, 18; HepG2: 11; K562: 11; Adult Erythroblast: 11. See the UW/Reg Amplicon track for more information on the cell lines/phenotypes studied in these experiments. Display Conventions and Configuration Plates with scores greater than or equal to 0.5 were conservatively considered acceptable for reliable scoring of HSs. 
Scores are shown in greyscale, with darker colors indicating higher scores. This composite annotation track consists of several subtracks that show the quality scores for each cell line/phenotype. To show only selected subtracks, uncheck the boxes next to the tracks you wish to hide. The display may also be filtered to show only those items with unnormalized scores that meet or exceed a certain threshold. To set a threshold, type the minimum score into the text box at the top of the description page. Color differences among the subtracks are arbitrary; they provide a visual cue for distinguishing the different cell lines/phenotypes. Methods QCP was performed as described in Dorschner et al. See the UW/Reg Amplicon track description for more information. QCP assays were formatted into 384-well plates for high-throughput real-time PCR. Each plate was treated as a separate experiment. Plate quality scores were computed using a Support Vector Machine (SVM). Trained operators manually scored 500 plates, classifying each on a scale of 1 to 5 to rank the degree of experimental noise. The unified set was then used to train an SVM to classify and score "good" and "bad" plates. Good plates were conservatively assigned noise scores of 1 - 3; bad plates received scores of 4 - 5. By performing cross validation on a 90% subsample of the training set, the SVM achieved an ROC (receiver operating characteristic) score of 0.93. Verification See the UW/Reg Amplicon track description for verification information. Credits Data generation, analysis, and validation were performed jointly by groups at Regulome Corporation and the University of Washington (UW) in Seattle. Regulome Corp.: Michael O. Dorschner, Richard Humbert, Peter J. Sabo, Anthony Shafer, Jeff Goldy, Molly Weaver, Kristin Lee, Fidencio Neri, Brendan Henry, Mike Hawrylycz, Paul Tittel, Jim Wallace, Josh Mack, Janelle Kawamoto, John A. Stamatoyannopoulos. 
UW Medical Genetics: Patrick Navas, Man Yu, Hua Cao, Brent Johnson, Ericka Johnson, George Stamatoyannopoulos. UW Genome Sciences: Scott Kuehn, Robert Thurman, William S. Noble. References Dorschner, M.O., Hawrylycz, M., Humbert, R., Wallace, J.C., Shafer, A., Kawamoto, J., Mack, J., Hall, R., Goldy, J., Sabo, P.J. et al. High-throughput localization of functional elements by quantitative chromatin profiling. Nat Methods 1(3), 219-25 (2004). encodeRegulomeQualityERY ERY Adult Erythroblast Quality ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeQualityK562 K562 K562 Quality ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeQualityHepG2 HepG2 HepG2 Quality ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeQualityHuh7 Huh7 Huh7 Quality ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeQualitySKNSH SKNSH SKNSH Quality ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeQualityGM06990 GM06990 GM06990 Quality ENCODE Chromosome, Chromatin and DNA Structure encodeRegulomeQualityCACO2 CACO2 CACO2 Quality ENCODE Chromosome, Chromatin and DNA Structure vegaGene Vega Genes Vega Annotations Genes and Gene Predictions Description and Methods This track shows gene annotations from the Vertebrate Genome Annotation (Vega) database. The following information is excerpted from the Vertebrate Genome Annotation home page: "The Vega database is designed to be a central repository for high-quality, frequently updated manual annotation of different vertebrate finished genome sequence. Vega attempts to present consistent high-quality curation of the published chromosome sequences. Finished genomic sequence is analysed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases as well as a series of ab initio gene predictions (GENSCAN, Fgenes). The annotation is based on supporting evidence only." 
"In addition, comparative analysis using vertebrate datasets such as the Riken mouse cDNAs and Genoscope Tetraodon nigroviridis Ecores (Evolutionary Conserved Regions) are used for novel gene discovery." NOTE: VEGA annotations do not appear on every chromosome in this assembly. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks using the following color scheme to indicate the status of the gene annotation: Known (Dark blue): genes that are identical to known human complementary DNA or protein sequences and have an entry in the species-specific model organism database. Novel (Dark blue): genes that have an open reading frame (ORF) and are identical or homologous to human cDNAs, human ESTs, or proteins in all species. Novel transcripts are genes that fit the criteria of novel genes with the exception that an unambiguous ORF cannot be assigned. Putative (Medium blue): genes whose sequences are identical or homologous to human ESTs but do not contain an ORF. Predicted (Light blue): genes based on ab initio prediction and for which at least one exon is supported by biological data (unspliced ESTs, protein sequence similarity with mouse or tetraodon genomes, or expression data from Rosetta). Unclassified (Gray). The details pages show the only the Vega gene type and not the transcript type. A single gene can have more than one transcript which can belong to different classes, so the gene as a whole is classified according to the transcript with the "highest" level of classification. Transcript type (and other details) may be found by clicking on the transcript identifier which forms the outside link to the Vega transcript details page. Further information on the gene and transcript classification may be found here. Credits Thanks to Steve Searle at the Sanger Institute for providing the GTF and FASTA files for the Vega annotations. 
Vega gene annotations are generated by manual annotation from the following groups: Chromosome 6: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Mungall AJ et al., The DNA sequence and analysis of human chromosome 6. Nature. 2003 Oct 23;425:805-11. Chromosome 7: Hillier et al., The Genome Center at Washington University Relevant publication: Hillier LW et al., The DNA sequence of human chromosome 7. Nature. 2003 Jul 10;424:157-64. Chromosome 9: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Humphray SJ et al., The DNA sequence and analysis of human chromosome 9. Nature. 2004 May 27;429:369-74. Chromosome 10: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Deloukas P et al., The DNA sequence and comparative analysis of human chromosome 10. Nature. 2004 May 27;429:375-81. Chromosome 13: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Dunham A et al., The DNA sequence and analysis of human chromosome 13. Nature. 2004 Apr 1;428:522-8. Chromosome 14: Genoscope Relevant publication: Heilig R et al., The DNA sequence and analysis of human chromosome 14. Nature. 2003 Feb 6;421:601-7. Chromosome 20: The HAVANA Group, Wellcome Trust Sanger Institute Relevant publication: Deloukas P et al., The DNA sequence and comparative analysis of human chromosome 20. Nature. 2001 Dec 20;414:865-71. Chromosome 22: Chromosome 22 Group, Wellcome Trust Sanger Institute Relevant publications: - Collins JE et al., Reevaluating Human Gene Annotation: A Second-Generation Analysis of Chromosome 22. Genome Research. 2003 Jan;13(1):27-36. - Dawson E et al., A first-generation linkage disequilibrium map of human chromosome 22. Nature. 2002 Aug 1;418:544-8. - Dunham I, et al., The DNA sequence of human chromosome 22. Nature. 1999 Dec 2;402:489-95. 
Chromosome X: The HAVANA Group, Wellcome Trust Sanger Institute Relevant publication: Ross MT et al., The DNA sequence and comparative analysis of human chromosome X. Nature 2005 Mar 17;434:325-37. vegaPseudoGene Vega Pseudogenes Vega Annotated Pseudogenes and Immunoglobulin Segments Genes and Gene Predictions Description and Methods This track shows pseudogene annotations from the Vertebrate Genome Annotation (Vega) database. The following information is excerpted from the Vertebrate Genome Annotation home page: "The Vega database is designed to be a central repository for high-quality, frequently updated manual annotation of different vertebrate finished genome sequence. Vega attempts to present consistent high-quality curation of the published chromosome sequences. Finished genomic sequence is analysed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases as well as a series of ab initio gene predictions (GENSCAN, Fgenes). The annotation is based on supporting evidence only." "In addition, comparative analysis using vertebrate datasets such as the Riken mouse cDNAs and Genoscope Tetraodon nigroviridis Ecores (Evolutionary Conserved Regions) are used for novel gene discovery." NOTE: VEGA annotations do not appear on every chromosome in this assembly. Display Conventions and Configuration This track follows the display conventions for gene prediction tracks using the following color scheme to indicate the status of the gene annotation: Known (Dark blue): genes that are identical to known human complementary DNA or protein sequences and have an entry in the species-specific model organism database. Novel (Dark blue): genes that have an open reading frame (ORF) and are identical or homologous to human cDNAs, human ESTs, or proteins in all species. Novel transcripts are genes that fit the criteria of novel genes with the exception that an unambiguous ORF cannot be assigned. 
Putative (Medium blue): genes whose sequences are identical or homologous to human ESTs but do not contain an ORF. Predicted (Light blue): genes based on ab initio prediction and for which at least one exon is supported by biological data (unspliced ESTs, protein sequence similarity with mouse or tetraodon genomes, or expression data from Rosetta). Unclassified (Gray). The details pages show only the Vega gene type and not the transcript type. A single gene can have more than one transcript, and these transcripts can belong to different classes, so the gene as a whole is classified according to the transcript with the "highest" level of classification. Transcript type (and other details) may be found by clicking on the transcript identifier, which forms the outside link to the Vega transcript details page. Further information on the gene and transcript classification may be found here. Credits Thanks to Steve Searle at the Sanger Institute for providing the GTF and FASTA files for the Vega annotations. Vega gene annotations are generated by manual annotation from the following groups: Chromosome 6: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Mungall AJ et al., The DNA sequence and analysis of human chromosome 6. Nature. 2003 Oct 23;425:805-11. Chromosome 7: Hillier et al., The Genome Institute at Washington University Relevant publication: Hillier LW et al., The DNA sequence of human chromosome 7. Nature. 2003 Jul 10;424:157-64. Chromosome 9: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Humphray SJ et al., The DNA sequence and analysis of human chromosome 9. Nature. 2004 May 27;429:369-74. Chromosome 10: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Deloukas P et al., The DNA sequence and comparative analysis of human chromosome 10. Nature. 2004 May 27;429:375-81. 
Chromosome 13: The HAVANA group, Wellcome Trust Sanger Institute Relevant publication: Dunham A et al., The DNA sequence and analysis of human chromosome 13. Nature. 2004 Apr 1;428:522-8. Chromosome 14: Genoscope Relevant publication: Heilig R et al., The DNA sequence and analysis of human chromosome 14. Nature. 2003 Feb 6;421:601-7. Chromosome 20: The HAVANA Group, Wellcome Trust Sanger Institute Relevant publication: Deloukas P et al., The DNA sequence and comparative analysis of human chromosome 20. Nature. 2001 Dec 20;414:865-71. Chromosome 22: Chromosome 22 Group, Wellcome Trust Sanger Institute Relevant publications: - Collins JE et al., Reevaluating Human Gene Annotation: A Second-Generation Analysis of Chromosome 22. Genome Research. 2003 Jan;13(1):27-36. - Dawson E et al., A first-generation linkage disequilibrium map of human chromosome 22. Nature. 2002 Aug 1;418:544-8. - Dunham I, et al., The DNA sequence of human chromosome 22. Nature. 1999 Dec 2;402:489-95. Chromosome X: The HAVANA Group, Wellcome Trust Sanger Institute Relevant publication: Ross MT et al., The DNA sequence and comparative analysis of human chromosome X. Nature 2005 Mar 17;434:325-37. celeraCoverage WSSD Coverage Regions Assayed for SDD Mapping and Sequencing Description This track represents coverage of clones that were assayed for segmental duplications using high-depth Celera reads. Absent regions were not assessed by this version of the Segmental Duplication Database (SDD). For a description of the whole-genome shotgun sequence detection (WSSD) "fuguization" method, see Bailey, J.A. et al. (2001) in the References section below. Credits The data were provided by Xinwei She and Evan Eichler as part of their effort to map human paralogy at the University of Washington. References Bailey, J.A., et al., Recent segmental duplications in the human genome. Science 297(5583), 945-7 (2002). 
Bailey, J.A., et al., Segmental duplications: organization and impact within the current human genome project assembly, Genome Res. 11(6), 1005-17 (2001). She, X., et al., Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 431(7011), 927-30 (2004). celeraDupPositive WSSD Duplication Sequence Identified as Duplicate by High-Depth Celera Reads Mapping and Sequencing Description High-depth sequence reads from the Celera project were used to detect paralogy in the human genome reference sequence. This track shows confirmed segmental duplications, defined as having similarity to sequences in the Segmental Duplication Database (SDD) of greater than 90% over more than 250 bp of repeatmasked sequence. For a description of the whole-genome shotgun sequence detection (WSSD) "fuguization" method, see Bailey, J.A. et al. (2001) in the References section below. Credits The data were provided by Xinwei She and Evan Eichler as part of their efforts to map human paralogy at the University of Washington. References Bailey, J.A., et al., Recent segmental duplications in the human genome. Science 297(5583), 945-7 (2002). Bailey, J.A., et al., Segmental duplications: organization and impact within the current human genome project assembly, Genome Res. 11(6), 1005-17 (2001). She, X., et al., Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 431(7011), 927-30 (2004). celeraOverlay WSSD Overlay Celera WGS Assembly Overlay on Public Assembly Mapping and Sequencing Description This track shows regions detected as overlays of Celera whole-genome shotgun sequence assembly on the public human assembly. Credits The data were provided by Xinwei She and Evan Eichler as part of their effort to map human paralogy at the University of Washington. References Bailey, J.A., et al., Recent segmental duplications in the human genome. Science 297(5583), 945-7 (2002). 
Bailey, J.A., et al., Segmental duplications: organization and impact within the current human genome project assembly, Genome Res. 11(6), 1005-17 (2001). She, X., et al., Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 431(7011), 927-30 (2004). encodeYaleChIPSTAT1Pval Yale ChIP pVal Yale ChIP/Chip (STAT1 ab, Hela cells, P-Values) ENCODE Chromatin Immunoprecipitation Description This track shows probable sites of STAT1 binding in HeLa cells as determined by chromatin immunoprecipitation followed by microarray analysis. STAT1 (Signal Transducer and Activator of Transcription) is a transcription factor that moves to the nucleus and binds DNA only in response to a cytokine signal such as interferon-gamma. HeLa cells are a common cell line derived from a cervical cancer. Each of the four subtracks represents a different microarray platform. The track as a whole can be used to compare results across microarray platforms. The first three platforms are custom maskless photolithographic arrays with oligonucleotides tiling most of the non-repetitive DNA sequence of the ENCODE regions: Maskless design #1: 50-mer oligonucleotides tiled every 38 bps (overlapping by 12 nts) Maskless design #2: 36-mer oligonucleotides tiled end to end Maskless design #3: 50-mer oligonucleotides tiled end to end The fourth array platform is an ENCODE PCR Amplicon array manufactured by Bing Ren's lab at UCSD. The subtracks show the ratio of immunoprecipitated DNA from cytokine-stimulated cells vs. unstimulated cells in each of the four platforms. The ratio is calculated as -log10(p-value) in a 501-base window. The data shown is the combined result of multiple biological replicates: five for the first maskless array (50-mer every 38 bp), two for the second maskless array (36-mer every 36 bp), three for the third maskless array (50-mer every 50 bp) and six for the PCR Amplicon array. 
These data are available at NCBI GEO as GSE2714, which also provides additional information about the experimental protocols. Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods For all arrays, the STAT1 ChIP DNA was labeled with Cy5 and the control DNA was labeled with Cy3. Maskless photolithographic arrays The data from replicates were median-scaled and quantile-normalized to each other. After normalization, replicates were condensed to a single value. Using a 501 bp sliding window centered on each oligonucleotide probe, a signal map (estimating the fold enrichment [log2 scale] of ChIP DNA) was generated by computing the pseudomedian signal of all log2(Cy5/Cy3) ratios (median of pairwise averages) within the window (including replicates). Using the same procedure, a -log10(p-value) map (measuring significance of enrichment of oligonucleotide probes in the window) for all sliding windows was made by computing P-values using the Wilcoxon paired signed rank test comparing fluorescent intensity between Cy5 and Cy3 for each oligonucleotide probe (Cy5 and Cy3 signals from the same array). A binding site was determined by thresholding both on fold enrichment and -log10(p-value) and requiring a maximum gap and a minimum run between oligonucleotide positions. 
For the first maskless array (50-mer every 38 bp):    log2(Cy5/Cy3) >= 1.25, -log10(p-value) >=8.0, MaxGap <= 100 bp, MinRun >= 180 bp For the second maskless array (36-mer every 36 bp):    log2(Cy5/Cy3) >= 0.25, -log10(p-value) >=4.0, MaxGap <= 250 bp, MinRun >= 0 bp For the third maskless array (50-mer every 50 bp):    log2(Cy5/Cy3) >= 0.25, -log10(p-value) >=4.0, MaxGap <= 250 bp, MinRun >= 0 bp PCR Amplicon Arrays The Cy5 and Cy3 array data were loess-normalized between channels on the same slide and then between slides. A z-score was then determined for each PCR amplicon from the distribution of log(Cy5/Cy3) in a local log(Cy5*Cy3) intensity window (see Quackenbush, 2002 and the Express Yourself website for more details). From the z-score, a P-value was then associated with each PCR amplicon. Hits were determined using a 3 sigma threshold and requiring a spot to be present on three out of six arrays. Verification ChIP-chip binding sites were verified by comparing "hit lists" generated from combinations of different biological replicates. Only experiments that yielded a significant overlap (greater than 50 percent) were accepted. As an independent check (for maskless arrays), data on the microarray were randomized with respect to position and re-scored; significantly fewer hits (consistent with random noise) were generated this way. Credits This data was generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. The PCR Amplicon arrays were manufactured by Bing Ren's lab at UCSD. References Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P., Sekinger, E.A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J. et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Euskirchen, G., Royce, T.E., Bertone, P., Martone, R., Rinn, J.L., Nelson, F.K., Sayward, F., Luscombe, N.M., Miller, P. et al. 
CREB binds to multiple loci on human chromosome 22, Mol Cell Biol. 24(9), 3804-14 (2004). Luscombe, N.M., Royce, T.E., Bertone, P., Echols, N., Horak, C.E., Chang, J.T., Snyder, M. and Gerstein, M. ExpressYourself: A modular platform for processing and visualizing microarray data. Nucleic Acids Res. 31(13), 3477-82 (2003). Martone, R., Euskirchen, G., Bertone, P., Hartman, S., Royce, T.E., Luscombe, N.M., Rinn, J.L., Nelson, F.K., Miller, P. et al. Distribution of NF-kappaB-binding sites across human chromosome 22. Proc Natl Acad Sci U S A. 100(21), 12247-52 (2003). Quackenbush, J. Microarray data normalization and transformation, Nat Genet. 32(Suppl), 496-501 (2002). encodeYaleChIPSTAT1HeLaBingRenPval Yale LI PVal Yale ChIP/Chip (STAT1 ab, Hela cells) LI/UCSD PCR Amplicon, P-Values ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer50bpPval Yale 50-50 PVal Yale ChIP/Chip (STAT1 ab, Hela cells) Maskless 50-mer, 50bp Win, P-Values ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer38bpPval Yale 50-38 PVal Yale ChIP/Chip (STAT1 ab, Hela cells) Maskless 50-mer, 38bp Win, P-Values ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess36mer36bpPval Yale 36-36 PVal Yale ChIP/Chip (STAT1 ab, Hela cells) Maskless 36-mer, 36bp Win, P-Values ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1Sig Yale ChIP Sig Yale ChIP/Chip: STAT1 ab, Hela cells, Signal ENCODE Chromatin Immunoprecipitation Description Each of these four tracks shows the map of signal intensity (estimating the fold enrichment [log2 scale] of ChIP DNA vs unstimulated DNA) for STAT1 ChIP-chip using Human Hela S3 cells hybridized to four different array designs/platforms. 
The first three platforms are custom maskless photolithographic arrays with oligonucleotides tiling most of the non-repetitive DNA sequence of the ENCODE regions: Maskless design #1: 50-mer oligonucleotides tiled every 38 bps (overlapping by 12 nts) Maskless design #2: 36-mer oligonucleotides tiled end to end Maskless design #3: 50-mer oligonucleotides tiled end to end The fourth array platform is an ENCODE PCR Amplicon array manufactured by Bing Ren's lab at UCSD. Each track shows the combined results of multiple biological replicates: five for the first maskless array (50-mer every 38 bp), two for the second maskless array (36-mer every 36 bp), three for the third maskless array (50-mer every 50 bp) and six for the PCR Amplicon array. For all arrays, the STAT1 ChIP DNA was labeled with Cy5 and the control DNA was labeled with Cy3. These data are available at NCBI GEO as GSE2714, which also provides additional information about the experimental protocols. Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Methods Maskless photolithographic arrays The data from replicates were median-scaled and quantile-normalized to each other (both Cy3 and Cy5 channels). Using a 501 bp sliding window centered on each oligonucleotide probe, a signal map (estimating the fold enrichment [log2 scale] of ChIP DNA) was generated by computing the pseudomedian signal of all log2(Cy5/Cy3) ratios (median of pairwise averages) within the window, including replicates. 
Using the same procedure, a -log10(P-value) map (measuring significance of enrichment of oligonucleotide probes in the window) for all sliding windows was made by computing P-values using the Wilcoxon paired signed rank test comparing fluorescent intensity between Cy5 and Cy3 for each oligonucleotide probe (Cy5 and Cy3 signals from the same array). A binding site was determined by thresholding both on fold enrichment and -log10(P-value) and requiring a maximum gap and a minimum run between oligonucleotide positions. For the first maskless array (50-mer every 38 bp):    log2(Cy5/Cy3) >= 1.25, -log10(P-value) >= 8.0, MaxGap <= 100 bp, MinRun >= 180 bp For the second maskless array (36-mer every 36 bp):    log2(Cy5/Cy3) >= 0.25, -log10(P-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp For the third maskless array (50-mer every 50 bp):    log2(Cy5/Cy3) >= 0.25, -log10(P-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp PCR Amplicon Arrays The Cy5 and Cy3 array data were loess-normalized between channels on the same slide and then between slides. A z-score was then determined for each PCR amplicon from the distribution of log(Cy5/Cy3) in a local log(Cy5*Cy3) intensity window (see Quackenbush, 2002 and the Express Yourself website for more details). From the z-score, a P-value was then associated with each PCR amplicon. Hits were determined using a 3 sigma threshold and requiring a spot to be present on three out of six arrays. Verification ChIP-chip binding sites were verified by comparing "hit lists" generated from combinations of different biological replicates. Only experiments that yielded a significant overlap (greater than 50 percent) were accepted. As an independent check (for maskless arrays), data on the microarray were randomized with respect to position and re-scored; significantly fewer hits (consistent with random noise) were generated this way.
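The binding-site rule described under Methods (two thresholds plus MaxGap and MinRun) can be illustrated with a minimal Python sketch. It is not the pipeline used to build the track. The function name call_sites, the tuple layout, and the exact definitions of "gap" (distance between consecutive passing probe positions) and "run" (distance from the first to the last passing probe in a group) are assumptions made for illustration, since the description does not spell them out.

```python
from typing import List, Tuple

# Hypothetical probe record: (position in bp, log2(Cy5/Cy3), -log10(P-value))
Probe = Tuple[int, float, float]


def call_sites(probes: List[Probe], min_ratio: float, min_neglog_p: float,
               max_gap: int, min_run: int) -> List[Tuple[int, int]]:
    """Group probes that pass both thresholds into candidate sites.

    Assumption: a gap is the distance between consecutive passing probe
    positions, and a run is the span from the first to the last passing
    probe in a group.
    """
    passing = sorted(pos for pos, ratio, neglog_p in probes
                     if ratio >= min_ratio and neglog_p >= min_neglog_p)
    sites = []
    start = prev = None
    for pos in passing:
        if start is None:
            start = prev = pos
        elif pos - prev <= max_gap:
            prev = pos                      # extend the current group
        else:
            if prev - start >= min_run:     # close the group if long enough
                sites.append((start, prev))
            start = prev = pos
    if start is not None and prev - start >= min_run:
        sites.append((start, prev))
    return sites


# Toy data on a 38 bp tiling; the values are invented for illustration.
probes = [(0, 1.5, 9.0), (38, 1.4, 8.5), (76, 1.3, 8.0), (114, 1.6, 10.0),
          (152, 1.3, 9.0), (190, 1.25, 8.0), (228, 2.0, 12.0),
          (1000, 3.0, 20.0), (1038, 0.1, 1.0)]

# Thresholds quoted for the first maskless array (50-mer every 38 bp).
strict = call_sites(probes, 1.25, 8.0, max_gap=100, min_run=180)
# Thresholds quoted for the second and third maskless arrays.
loose = call_sites(probes, 0.25, 4.0, max_gap=250, min_run=0)
```

With MinRun >= 0 bp, an isolated passing probe is reported as a site of its own, whereas the first array's MinRun >= 180 bp requires several adjacent passing probes.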
Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. The PCR Amplicon arrays were manufactured by Bing Ren's lab at UCSD. References Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P., Sekinger, E.A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J. et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Euskirchen, G., Royce, T.E., Bertone, P., Martone, R., Rinn, J.L., Nelson, F.K., Sayward, F., Luscombe, N.M., Miller, P. et al. CREB binds to multiple loci on human chromosome 22. Mol Cell Biol. 24(9), 3804-14 (2004). Luscombe, N.M., Royce, T.E., Bertone, P., Echols, N., Horak, C.E., Chang, J.T., Snyder, M. and Gerstein, M. ExpressYourself: A modular platform for processing and visualizing microarray data. Nucleic Acids Res. 31(13), 3477-82 (2003). Martone, R., Euskirchen, G., Bertone, P., Hartman, S., Royce, T.E., Luscombe, N.M., Rinn, J.L., Nelson, F.K., Miller, P. et al. Distribution of NF-kappaB-binding sites across human chromosome 22. Proc Natl Acad Sci U S A. 100(21), 12247-52 (2003). Quackenbush, J. Microarray data normalization and transformation. Nat Genet. 32(Suppl), 496-501 (2002).
encodeYaleChIPSTAT1HeLaBingRenSig Yale LI Sig Yale ChIP/Chip (STAT1 ab, Hela cells) LI/UCSD PCR Amplicon, Signal ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer50bpSig Yale 50-50 Sig Yale ChIP/Chip (STAT1 ab, Hela cells) Maskless 50-mer, 50bp Win, Signal ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer38bpSig Yale 50-38 Sig Yale ChIP/Chip (STAT1 ab, Hela cells) Maskless 50-mer, 38bp Win, Signal ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess36mer36bpSig Yale 36-36 Sig Yale ChIP/Chip (STAT1 ab, Hela cells) Maskless 36-mer, 36bp Win, Signal ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1Sites Yale ChIP Sites Yale ChIP/Chip (STAT1 ab, Hela cells, Binding Sites) ENCODE Chromatin Immunoprecipitation Description Each of these four tracks shows the binding sites for STAT1 ChIP-chip using Human Hela S3 cells hybridized to four different array designs/platforms. The first three platforms are custom maskless photolithographic arrays with oligonucleotides tiling most of the non-repetitive DNA sequence of the ENCODE regions: Maskless design #1: 50mer oligonucleotides tiled every 38 bps (overlapping by 12 nts) Maskless design #2: 36mer oligonucleotides tiled end to end Maskless design #3: 50mer oligonucleotides tiled end to end The fourth array platform is an ENCODE PCR Amplicon array manufactured by Bing Ren's lab at UCSD. Each track shows the combined results of multiple biological replicates: five for the first maskless array (50-mer every 38 bp), two for the second maskless array (36-mer every 36 bp), three for the third maskless array (50-mer every 50 bp) and six for the PCR Amplicon array. For all arrays, the STAT1 ChIP DNA was labeled with Cy5 and the control DNA was labeled with Cy3. See NCBI GEO GSE2714 for details of the experimental protocols. 
Methods Maskless photolithographic arrays The data from replicates were median-scaled and quantile-normalized to each other (both Cy3 and Cy5 channels). Using a 501 bp sliding window centered on each oligonucleotide probe, a signal map (estimating the fold enrichment [log2 scale] of ChIP DNA) was generated by computing the pseudomedian signal of all log2(Cy5/Cy3) ratios (median of pairwise averages) within the window, including replicates. Using the same procedure, a -log10(P-value) map (measuring significance of enrichment of oligonucleotide probes in the window) for all sliding windows was made by computing P-values using the Wilcoxon paired signed rank test comparing fluorescent intensity between Cy5 and Cy3 for each oligonucleotide probe (Cy5 and Cy3 signals from the same array). A binding site was determined by thresholding both on fold enrichment and -log10(P-value) and requiring a maximum gap and a minimum run between oligonucleotide positions. For the first maskless array (50-mer every 38 bp):    log2(Cy5/Cy3) >= 1.25, -log10(P-value) >= 8.0, MaxGap <= 100 bp, MinRun >= 180 bp For the second maskless array (36-mer every 36 bp):    log2(Cy5/Cy3) >= 0.25, -log10(P-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp For the third maskless array (50-mer every 50 bp):    log2(Cy5/Cy3) >= 0.25, -log10(P-value) >= 4.0, MaxGap <= 250 bp, MinRun >= 0 bp PCR Amplicon Arrays The Cy5 and Cy3 array data were loess-normalized between channels on the same slide and then between slides. A z-score was then determined for each PCR amplicon from the distribution of log(Cy5/Cy3) in a local log(Cy5*Cy3) intensity window (see Quackenbush, 2002 and the Express Yourself website for more details). From the z-score, a P-value was then associated with each PCR amplicon. Hits were determined using a 3 sigma threshold and requiring a spot to be present on three out of six arrays.
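The PCR amplicon hit rule above (z-score, P-value, 3 sigma, three of six arrays) can be sketched as follows. This is an illustration under stated assumptions, not the Yale or ExpressYourself code: the function names z_to_p and is_hit are hypothetical, the P-value is taken as a one-sided upper tail of a standard normal distribution, and "present on three out of six arrays" is read as the spot exceeding the threshold on at least three arrays. The per-array z-scores are assumed to have been computed already from the local intensity window.

```python
import math
from typing import Sequence


def z_to_p(z: float) -> float:
    """One-sided upper-tail P-value for a z-score (standard normal assumed)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def is_hit(z_scores: Sequence[float], sigma: float = 3.0,
           min_arrays: int = 3) -> bool:
    """Call an amplicon a hit if it clears `sigma` on at least `min_arrays` arrays.

    Assumption: one z-score per replicate array (six in the text above).
    """
    return sum(1 for z in z_scores if z >= sigma) >= min_arrays


# Invented z-scores for two amplicons across six replicate arrays.
amplicon_a = [3.2, 3.5, 4.0, 1.0, 0.0, 2.0]   # clears 3 sigma on three arrays
amplicon_b = [3.2, 3.5, 2.9, 1.0, 0.0, 2.0]   # clears 3 sigma on only two
```

Under the normal assumption, a 3 sigma threshold corresponds to a one-sided P-value of roughly 0.00135.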
Verification ChIP-chip binding sites were verified by comparing "hit lists" generated from combinations of different biological replicates. Only experiments that yielded a significant overlap (greater than 50 percent) were accepted. As an independent check (for maskless arrays), data on the microarray were randomized with respect to position and re-scored; significantly fewer hits (consistent with random noise) were generated this way. Credits These data were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University. The PCR Amplicon arrays were manufactured by Bing Ren's lab at UCSD. References Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P., Sekinger, E.A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J. et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4), 499-509 (2004). Euskirchen, G., Royce, T.E., Bertone, P., Martone, R., Rinn, J.L., Nelson, F.K., Sayward, F., Luscombe, N.M., Miller, P. et al. CREB binds to multiple loci on human chromosome 22. Mol Cell Biol. 24(9), 3804-14 (2004). Luscombe, N.M., Royce, T.E., Bertone, P., Echols, N., Horak, C.E., Chang, J.T., Snyder, M. and Gerstein, M. ExpressYourself: A modular platform for processing and visualizing microarray data. Nucleic Acids Res. 31(13), 3477-82 (2003). Martone, R., Euskirchen, G., Bertone, P., Hartman, S., Royce, T.E., Luscombe, N.M., Rinn, J.L., Nelson, F.K., Miller, P. et al. Distribution of NF-kappaB-binding sites across human chromosome 22. Proc Natl Acad Sci U S A. 100(21), 12247-52 (2003). Quackenbush, J. Microarray data normalization and transformation. Nat Genet. 32(Suppl), 496-501 (2002).
encodeYaleChIPSTAT1HeLaBingRenSites Yale LI Sites Yale ChIP/Chip (STAT1 ab, Hela cells) LI/UCSD PCR Amplicon, Binding Sites ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer50bpSite Yale 50-50 Sites Yale ChIP/Chip (STAT1 ab, Hela cells) Maskless 50-mer, 50bp Win, Binding Sites ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess50mer38bpSite Yale 50-38 Sites Yale ChIP/Chip (STAT1 ab, Hela cells) Maskless 50-mer, 38bp Win, Binding Sites ENCODE Chromatin Immunoprecipitation encodeYaleChIPSTAT1HeLaMaskLess36mer36bpSite Yale 36-36 Sites Yale ChIP/Chip (STAT1 ab, Hela cells) Maskless 36-mer, 36bp Win, Binding Sites ENCODE Chromatin Immunoprecipitation pseudoYale Yale Pseudo Yale Pseudogenes. Genes and Gene Predictions Description This track shows identified pseudogenes as recorded in the Yale Pseudogene Database. For information on how these pseudogenes were identified and access to the database, see http://www.pseudogene.org. encodeYaleAffyRNATransMap Yale RNA Yale RNA Transcript Map (Neutrophil, Placenta and NB4 cells) ENCODE Transcript Levels Description This track shows the transcript map of signal intensity (estimating RNA abundance) for the following, hybridized to the Affymetrix ENCODE oligonucleotide microarray: human neutrophil (PMN) total RNA (10 biological samples from different individuals) human placental Poly(A)+ RNA (3 biological replicates) total RNA from human NB4 cells (4 biological replicates), each sample divided into three parts and treated as follows: untreated, treated with retinoic acid (RA), and treated with 12-O-tetradecanoylphorbol-13 acetate (TPA) (three out of the four original samples). Total RNA was extracted from each treated sample and applied to arrays in duplicate (2 technical replicates). The human NB4 cell can be made to differentiate towards either monocytes (by treatment with TPA) or neutrophils (by treatment with RA). 
See Kluger et al., 2004 in the References section for more details about the differentiation of hematopoietic cells. This array has 25-mer oligonucleotide probes tiled approximately every 22 bp, covering all the non-repetitive DNA sequence of the ENCODE regions. The transcript map is a combined signal for both strands of DNA. This is derived from the number of different biological samples indicated above, each with at least two technical replicates. See the following NCBI GEO accessions for details of experimental protocols: GSE2678 GSE2671 GSE2679 Display Conventions and Configuration This annotation follows the display conventions for composite "wiggle" tracks. The subtracks within this annotation may be configured in a variety of ways to highlight different aspects of the displayed data. The graphical configuration options are shown at the top of the track description page, followed by a list of subtracks. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. For more information about the graphical configuration options, click the Graph configuration help link. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing between the different data samples. Methods The data from technical replicates were median-scaled and quantile-normalized to each other. Using a 101 bp sliding window centered on each oligonucleotide probe, a signal map (estimating RNA abundance) was generated by computing the pseudomedian signal of all PM-MM pairs (median of pairwise PM-MM averages) within the window, including replicates. Biological replicate signal maps were combined by quantile-normalizing them between replicates and computing the median signal at each oligonucleotide probe location. Verification Independent biological replicates (as indicated above) were generated, and each was hybridized to at least two different arrays (technical replicates). 
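The pseudomedian named in the Methods section above (the median of pairwise averages) can be written as a short Python sketch. It is illustrative only. The function names are hypothetical, and the choice to include each value paired with itself (the usual Hodges-Lehmann convention) is an assumption, since the description only says "median of pairwise PM-MM averages".

```python
from itertools import combinations_with_replacement
from statistics import median
from typing import List, Sequence, Tuple


def pseudomedian(values: Sequence[float]) -> float:
    """Median of all pairwise averages, including each value with itself."""
    averages = [(a + b) / 2.0
                for a, b in combinations_with_replacement(values, 2)]
    return median(averages)


def window_signal(probes: List[Tuple[int, float]], center: int,
                  half_width: int = 50) -> float:
    """Pseudomedian of PM-MM values for probes within the window.

    half_width=50 gives the 101 bp window quoted above; `probes` holds
    hypothetical (position, PM-MM) pairs pooled across replicates.
    """
    in_window = [v for pos, v in probes if abs(pos - center) <= half_width]
    return pseudomedian(in_window)


# Invented PM-MM values on a 22 bp tiling; the probe at 200 bp lies outside
# the window centered at 22 bp and does not contribute.
toy = [(0, 1.0), (22, 2.0), (44, 10.0), (200, 100.0)]
signal = window_signal(toy, center=22)
```

Compared with a plain mean, the pseudomedian is less sensitive to a single outlying probe in the window.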
Transcribed regions were then identified using a signal threshold at the 90th percentile of signal intensities, as well as a maximum gap of 50 bp and a minimum run of 50 bp (between oligonucleotide positions). Transcribed regions, as determined by individual biological samples, were compared to ensure significant overlap. Credits These data were generated and analyzed by the Yale/Affymetrix collaboration between the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University and Tom Gingeras at Affymetrix. References Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M. et al. Global identification of human transcribed sequences with genome tiling arrays. Science 306(5705), 2242-6 (2004). Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308(5725), 1149-54 (2005). Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P. and Gingeras, T.R. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-9 (2002). Kluger, Y., Tuck, D.P., Chang, J.T., Nakayama, Y., Poddar, R., Kohya, N., Lian, Z., Ben Nasr, A., Halaban, H.R. et al. Lineage specificity of gene expression patterns. Proc Natl Acad Sci U S A 101(17), 6508-13 (2004). Rinn, J.L., Euskirchen, G., Bertone, P., Martone, R., Luscombe, N.M., Hartman, S., Harrison, P.M., Nelson, F.K., Miller, P. et al. The transcriptional activity of human Chromosome 22. Genes Dev 17(4), 529-40 (2003).
encodeYaleAffyNB4UntrRNATransMap Yale RNA NB4 Un Yale NB4 RNA Transcript Map, Untreated ENCODE Transcript Levels encodeYaleAffyNB4TPARNATransMap Yale RNA NB4 TPA Yale NB4 RNA Transcript Map, Treated with 12-O-tetradecanoylphorbol-13 Acetate (TPA) ENCODE Transcript Levels encodeYaleAffyNB4RARNATransMap Yale RNA NB4 RA Yale NB4 RNA Transcript Map, Treated with Retinoic Acid ENCODE Transcript Levels encodeYaleAffyPlacRNATransMap Yale RNA Plcnta Yale Placenta RNA Transcript Map ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap10 Yale RNA Neu 10 Yale Neutrophil RNA Transcript Map, Sample 10 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap09 Yale RNA Neu 9 Yale Neutrophil RNA Transcript Map, Sample 9 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap08 Yale RNA Neu 8 Yale Neutrophil RNA Transcript Map, Sample 8 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap07 Yale RNA Neu 7 Yale Neutrophil RNA Transcript Map, Sample 7 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap06 Yale RNA Neu 6 Yale Neutrophil RNA Transcript Map, Sample 6 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap05 Yale RNA Neu 5 Yale Neutrophil RNA Transcript Map, Sample 5 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap04 Yale RNA Neu 4 Yale Neutrophil RNA Transcript Map, Sample 4 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap03 Yale RNA Neu 3 Yale Neutrophil RNA Transcript Map, Sample 3 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap02 Yale RNA Neu 2 Yale Neutrophil RNA Transcript Map, Sample 2 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMap01 Yale RNA Neu 1 Yale Neutrophil RNA Transcript Map, Sample 1 ENCODE Transcript Levels encodeYaleAffyNeutRNATransMapAll Yale RNA Neu Sum Yale Neutrophil RNA Transcript Map, Summary ENCODE Transcript Levels encodeYaleAffyRNATars Yale TAR Yale RNA Transcriptionally Active Regions (TARs) ENCODE Transcript Levels Description This track shows the locations of transcriptionally active regions (TARs)/transcribed 
fragments (transfrags) for the following, hybridized to the Affymetrix ENCODE oligonucleotide microarray: human neutrophil (PMN) total RNA (10 biological samples from different individuals) human placental Poly(A)+ RNA (3 biological replicates) total RNA from human NB4 cells (4 biological replicates), each sample divided into three parts and treated as follows: untreated, treated with retinoic acid (RA), and treated with 12-O-tetradecanoylphorbol-13 acetate (TPA) (three out of the four original samples). Total RNA was extracted from each treated sample and applied to arrays in duplicate (2 technical replicates). The human NB4 cell can be made to differentiate towards either monocytes (by treatment with TPA) or neutrophils (by treatment with RA). See Kluger et al., 2004 in the References section for more details about the differentiation of hematopoietic cells. This array has 25-mer oligonucleotide probes tiled approximately every 22 bp, covering all the non-repetitive DNA sequence of the ENCODE regions. The transcript map is a combined signal for both strands of DNA. This is derived from the number of different biological samples indicated above, each with at least two technical replicates. See the following NCBI GEO accessions for details of experimental protocols: GSE2678 GSE2671 GSE2679 Display Conventions and Configuration TARs are represented by blocks in the graphical display. This composite annotation track consists of several subtracks that are listed at the top of the track description page. To display only selected subtracks, uncheck the boxes next to the tracks you wish to hide. Color differences among the subtracks are arbitrary. They provide a visual cue for distinguishing between the different data samples. Methods The data from technical replicates were median-scaled and quantile-normalized to each other. 
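The quantile-normalization step mentioned above can be sketched in plain Python as follows. This is a generic textbook version, not the code used to produce the track. The function name is hypothetical, ties are broken by input order (an assumption), all replicates are assumed to have the same number of probes, and the median-scaling step is not shown.

```python
from typing import List, Sequence


def quantile_normalize(replicates: Sequence[Sequence[float]]) -> List[List[float]]:
    """Give every replicate the same empirical distribution.

    Each value is replaced by the mean, across replicates, of the values
    holding the same rank. Ties are broken by input order (an assumption).
    """
    n = len(replicates[0])
    if any(len(rep) != n for rep in replicates):
        raise ValueError("all replicates must have the same number of probes")
    sorted_reps = [sorted(rep) for rep in replicates]
    rank_means = [sum(rep[i] for rep in sorted_reps) / len(sorted_reps)
                  for i in range(n)]
    normalized = []
    for rep in replicates:
        order = sorted(range(n), key=lambda i: rep[i])   # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two invented technical replicates with three probes each.
rep_a = [5.0, 2.0, 3.0]
rep_b = [4.0, 1.0, 6.0]
norm_a, norm_b = quantile_normalize([rep_a, rep_b])
```

After normalization both replicates contain the same set of values, so downstream per-probe medians compare like with like.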
Using a 101 bp sliding window centered on each oligonucleotide probe, a signal map estimating RNA abundance was generated by computing the pseudomedian signal of all PM-MM pairs (median of pairwise PM-MM averages) within the window, including replicates. Biological replicate signal maps were combined by quantile-normalizing them between replicates and computing the median signal at each oligonucleotide probe location. Independent biological replicates (as described above) were generated, and each was hybridized to at least two different arrays (technical replicates). Transcribed regions (TARs/transfrags) were then identified using a signal threshold at the 90th percentile of signal intensities, as well as a maximum gap of 50 bp and a minimum run of 50 bp (between oligonucleotide positions). Verification Transcribed regions (TARs/transfrags), as determined by individual biological samples, were compared to ensure significant overlap. Credits These data were generated and analyzed by the Yale/Affymetrix collaboration between the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University and Tom Gingeras at Affymetrix. References Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M. et al. Global identification of human transcribed sequences with genome tiling arrays. Science 306(5705), 2242-6 (2004). Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H. et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308(5725), 1149-54 (2005). Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P. and Gingeras, T.R. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296(5569), 916-9 (2002). Kluger, Y., Tuck, D.P., Chang, J.T., Nakayama, Y., Poddar, R., Kohya, N., Lian, Z., Ben Nasr, A., Halaban, H.R. et al. Lineage specificity of gene expression patterns.
Proc Natl Acad Sci U S A 101(17), 6508-13 (2004). Rinn, J.L., Euskirchen, G., Bertone, P., Martone, R., Luscombe, N.M., Hartman, S., Harrison, P.M., Nelson, F.K., Miller, P. et al. The transcriptional activity of human Chromosome 22. Genes Dev 17(4), 529-40 (2003). encodeYaleAffyNB4UntrRNATars Yale TAR NB4 Un Yale NB4 RNA, TAR, Untreated ENCODE Transcript Levels encodeYaleAffyNB4TPARNATars Yale TAR NB4 TPA Yale NB4 RNA, TAR, Treated with 12-O-tetradecanoylphorbol-13 Acetate (TPA) ENCODE Transcript Levels encodeYaleAffyNB4RARNATars Yale TAR NB4 RA Yale NB4 RNA, TAR, Treated with Retinoic Acid ENCODE Transcript Levels encodeYaleAffyPlacRNATars Yale RNA Plcnta Yale Placenta RNA Transcriptionally Active Region ENCODE Transcript Levels encodeYaleAffyNeutRNATars10 Yale TAR Neu 10 Yale Neutrophil RNA Transcriptionally Active Region, Sample 10 ENCODE Transcript Levels encodeYaleAffyNeutRNATars09 Yale TAR Neu 9 Yale Neutrophil RNA Transcriptionally Active Region, Sample 9 ENCODE Transcript Levels encodeYaleAffyNeutRNATars08 Yale TAR Neu 8 Yale Neutrophil RNA Transcriptionally Active Region, Sample 8 ENCODE Transcript Levels encodeYaleAffyNeutRNATars07 Yale TAR Neu 7 Yale Neutrophil RNA Transcriptionally Active Region, Sample 7 ENCODE Transcript Levels encodeYaleAffyNeutRNATars06 Yale TAR Neu 6 Yale Neutrophil RNA Transcriptionally Active Region, Sample 6 ENCODE Transcript Levels encodeYaleAffyNeutRNATars05 Yale TAR Neu 5 Yale Neutrophil RNA Transcriptionally Active Region, Sample 5 ENCODE Transcript Levels encodeYaleAffyNeutRNATars04 Yale TAR Neu 4 Yale Neutrophil RNA Transcriptionally Active Region, Sample 4 ENCODE Transcript Levels encodeYaleAffyNeutRNATars03 Yale TAR Neu 3 Yale Neutrophil RNA Transcriptionally Active Region, Sample 3 ENCODE Transcript Levels encodeYaleAffyNeutRNATars02 Yale TAR Neu 2 Yale Neutrophil RNA Transcriptionally Active Region, Sample 2 ENCODE Transcript Levels encodeYaleAffyNeutRNATars01 Yale TAR Neu 1 Yale Neutrophil RNA Transcriptionally 
Active Region, Sample 1 ENCODE Transcript Levels encodeYaleAffyNeutRNATarsAll Yale TAR Neu Sum Yale Neutrophil RNA Transcriptionally Active Region (TAR), Summary ENCODE Transcript Levels chainNetRn3 Rat Chain/Net Rat (June 2003 (Baylor 3.1/rn3)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of rat (June 2003 (Baylor 3.1/rn3)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both rat and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the rat assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best rat/human chain for every part of the human genome. 
It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The rat sequence used in this annotation is from the June 2003 (Baylor 3.1/rn3) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the rat/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. 
The axt alignments were fed into axtChain, which organizes all alignments between a single rat chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91

Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetRn3Viewnet Net Rat (June 2003 (Baylor 3.1/rn3)), Chain and Net Alignments Comparative Genomics netRn3 Rat Net Rat (June 2003 (Baylor 3.1/rn3)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of rat (June 2003 (Baylor 3.1/rn3)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both rat and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. 
The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the rat assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best rat/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The rat sequence used in this annotation is from the June 2003 (Baylor 3.1/rn3) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. 
In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the rat/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single rat chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track.
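As an illustration of how the substitution matrix quoted above is applied, the following minimal Python sketch sums matrix scores over one ungapped block and checks a total against the minimum score of 1000. The function names block_score and passes_threshold are hypothetical and are not part of lastz or axtChain; gap costs are ignored here.

```python
# Substitution scores as listed in the Methods text (rows and columns: A, C, G, T).
BASES = "ACGT"
MATRIX = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]
MIN_CHAIN_SCORE = 1000  # threshold quoted in the text for this track


def block_score(query: str, target: str) -> int:
    """Sum matrix scores over an ungapped, equal-length pair of sequences."""
    if len(query) != len(target):
        raise ValueError("an ungapped block needs sequences of equal length")
    total = 0
    for q, t in zip(query.upper(), target.upper()):
        total += MATRIX[BASES.index(q)][BASES.index(t)]
    return total


def passes_threshold(score: int, minimum: int = MIN_CHAIN_SCORE) -> bool:
    """Hypothetical helper: keep a chain only if it reaches the minimum score."""
    return score >= minimum
```

With these values, an A/G or C/T mismatch (a transition) costs 31, while the other mismatches cost between 114 and 125, so a block needs roughly ten or more matching bases to outweigh a single transversion.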
The linear gap matrix used with axtChain:
-linearGap=loose
tablesize 11
smallSize 111
position  1    2    3    11   111  2111  12111  32111  72111  152111  252111
qGap      325  360  400  450  600  1100  3600   7600   15600  31600   56600
tGap      325  360  400  450  600  1100  3600   7600   15600  31600   56600
bothGap   625  660  700  750  900  1400  4000   8000   16000  32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetRn3Viewchain Chain Rat (June 2003 (Baylor 3.1/rn3)), Chain and Net Alignments Comparative Genomics chainRn3 Rat Chain Rat (June 2003 (Baylor 3.1/rn3)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of rat (June 2003 (Baylor 3.1/rn3)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both rat and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the rat assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best rat/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The rat sequence used in this annotation is from the June 2003 (Baylor 3.1/rn3) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g., chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1.
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the rat/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single rat chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91
Chains scoring below a minimum score of "1000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
-linearGap=loose
tablesize 11
smallSize 111
position  1    2    3    11   111  2111  12111  32111  72111  152111  252111
qGap      325  360  400  450  600  1100  3600   7600   15600  31600   56600
tGap      325  360  400  450  600  1100  3600   7600   15600  31600   56600
bothGap   625  660  700  750  900  1400  4000   8000   16000  32000   57000
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
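The best-first placement described for chainNet can be illustrated with a simplified Python sketch. It is not the chainNet implementation: it works on one reference sequence only, represents each chain as a hypothetical (name, score, blocks) tuple with half-open intervals, and approximates "fills a gap in a higher-scoring chain" by "lies within the span of an already placed chain".

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference genome


def subtract(piece: Interval, covered: List[Interval]) -> List[Interval]:
    """Return the parts of `piece` not overlapped by any interval in `covered`."""
    remaining = [piece]
    for c_start, c_end in covered:
        next_remaining = []
        for start, end in remaining:
            if c_end <= start or c_start >= end:  # no overlap, keep as is
                next_remaining.append((start, end))
                continue
            if start < c_start:
                next_remaining.append((start, c_start))
            if c_end < end:
                next_remaining.append((c_end, end))
        remaining = next_remaining
    return remaining


def greedy_net(chains):
    """Place chains best-first, trimming each to bases that are still uncovered.

    `chains` is a list of (name, score, blocks); returns (name, level, kept_blocks).
    """
    covered: List[Interval] = []
    placed = []  # (name, level, span, kept_blocks)
    for name, _score, blocks in sorted(chains, key=lambda c: -c[1]):
        kept = []
        for block in blocks:
            kept.extend(subtract(block, covered))
        if not kept:
            continue  # fully hidden by higher-scoring chains
        span = (min(s for s, _ in kept), max(e for _, e in kept))
        # Level is one more than the deepest placed chain whose span contains this one.
        level = 1 + max(
            (lvl for _, lvl, (s, e), _ in placed if s <= span[0] and span[1] <= e),
            default=0,
        )
        placed.append((name, level, span, kept))
        covered.extend(kept)
    return [(name, level, kept) for name, level, _, kept in placed]
```

For example, a chain with blocks (0, 100) and (200, 300) placed first leaves the gap from 100 to 200 open; a lower-scoring chain covering (90, 180) is trimmed to (100, 180) and reported at level 2.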
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetMm6 Mouse Chain/Net Mouse (Mar. 2005 (NCBI34/mm6)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). 
The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of mouse (Mar. 2005 (NCBI34/mm6)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both mouse and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the mouse assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best mouse/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The mouse sequence used in this annotation is from the Mar. 2005 (NCBI34/mm6) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. 
To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g., chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the mouse/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single mouse chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks.
The following matrix was used:
        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91
Chains scoring below a minimum score of "3000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
-linearGap=medium
tableSize 11
smallSize 111
position  1    2    3    11    111   2111  12111  32111  72111   152111  252111
qGap      350  425  450  600   900   2900  22900  57900  117900  217900  317900
tGap      350  425  450  600   900   2900  22900  57900  117900  217900  317900
bothGap   750  825  850  1000  1300  3300  23300  58300  118300  218300  318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler.
The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetMm6Viewnet Net Mouse (Mar. 2005 (NCBI34/mm6)), Chain and Net Alignments Comparative Genomics netMm6 Mouse Net Mouse (Mar. 2005 (NCBI34/mm6)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of mouse (Mar. 2005 (NCBI34/mm6)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both mouse and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the mouse assembly or an insertion in the human assembly. 
Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best mouse/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The mouse sequence used in this annotation is from the Mar. 2005 (NCBI34/mm6) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g., chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap.
The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the mouse/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single mouse chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91
Chains scoring below a minimum score of "3000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
-linearGap=medium
tableSize 11
smallSize 111
position  1    2    3    11    111   2111  12111  32111  72111   152111  252111
qGap      350  425  450  600   900   2900  22900  57900  117900  217900  317900
tGap      350  425  450  600   900   2900  22900  57900  117900  217900  317900
bothGap   750  825  850  1000  1300  3300  23300  58300  118300  218300  318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first.
The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. 
Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetMm6Viewchain Chain Mouse (Mar. 2005 (NCBI34/mm6)), Chain and Net Alignments Comparative Genomics chainMm6 Mouse Chain Mouse (Mar. 2005 (NCBI34/mm6)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of mouse (Mar. 2005 (NCBI34/mm6)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both mouse and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the mouse assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best mouse/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. 
The mouse sequence used in this annotation is from the Mar. 2005 (NCBI34/mm6) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g., chr4) in the box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases, gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower-level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the mouse/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program.
The axt alignments were fed into axtChain, which organizes all alignments between a single mouse chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:
        A     C     G     T
  A    91  -114   -31  -123
  C  -114   100  -125   -31
  G   -31  -125   100  -114
  T  -123   -31  -114    91
Chains scoring below a minimum score of "3000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain:
-linearGap=medium
tableSize 11
smallSize 111
position  1    2    3    11    111   2111  12111  32111  72111   152111  252111
qGap      350  425  450  600   900   2900  22900  57900  117900  217900  317900
tGap      350  425  450  600   900   2900  22900  57900  117900  217900  317900
bothGap   750  825  850  1000  1300  3300  23300  58300  118300  218300  318300
Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged.
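To make the -linearGap=medium table in the Methods above more concrete, the following Python sketch turns it into a gap cost function. The table values are copied from the text; everything else is an assumption made for illustration: costs are interpolated linearly between the tabulated positions, gaps longer than the last position continue along the final segment's slope, and a gap present in both sequences is looked up in the bothGap row by its combined length. The names interpolate and gap_cost are hypothetical and are not axtChain functions.

```python
from bisect import bisect_right

# Values copied from the -linearGap=medium table in the text.
POSITIONS = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]
Q_GAP = [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900]
BOTH_GAP = [750, 825, 850, 1000, 1300, 3300, 23300, 58300, 118300, 218300, 318300]


def interpolate(size: int, costs) -> float:
    """Piecewise-linear cost for a gap of `size` bases (assumed behaviour)."""
    if size <= 0:
        return 0.0
    if size <= POSITIONS[0]:
        return float(costs[0])
    # Index of the table segment containing `size`; clamp to the last segment
    # so that longer gaps continue along its slope (an assumption).
    i = min(bisect_right(POSITIONS, size) - 1, len(POSITIONS) - 2)
    x0, x1 = POSITIONS[i], POSITIONS[i + 1]
    y0, y1 = costs[i], costs[i + 1]
    return y0 + (y1 - y0) * (size - x0) / (x1 - x0)


def gap_cost(q_gap: int, t_gap: int) -> float:
    """Cost of a gap with `q_gap` query bases and `t_gap` target bases."""
    if q_gap > 0 and t_gap > 0:
        return interpolate(q_gap + t_gap, BOTH_GAP)
    return interpolate(max(q_gap, t_gap), Q_GAP)  # qGap and tGap rows are identical
```

Under these assumptions a 7-base one-sided gap falls halfway between the entries for positions 3 and 11 and costs 525, while the per-base cost beyond position 2111 rises to 2 and then falls back to 1, which is what lets long gaps stay affordable within a chain.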
Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetGalGal2 Chicken Chain/Net Chicken (Feb. 2004 (WUGSC 1.0/galGal2)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of chicken (Feb. 2004 (WUGSC 1.0/galGal2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chicken and human simultaneously. 
These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chicken assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best chicken/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The chicken sequence used in this annotation is from the Feb. 2004 (WUGSC 1.0/galGal2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. 
The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the chicken/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single chicken chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91   -90   -25  -100
  C   -90   100  -100   -25
  G   -25  -100   100   -90
  T  -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track.
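As an illustration of how the substitution matrix and the minimum score are applied, the sketch below scores one gapless block and filters a list of chain scores. The function names and the dictionary layout are hypothetical and are not taken from axtChain; a real chain score also includes gap costs between blocks, which are omitted here.

```python
BASES = "ACGT"

# Substitution matrix quoted in the text (rows and columns in A, C, G, T order).
MATRIX = [
    [91, -90, -25, -100],
    [-90, 100, -100, -25],
    [-25, -100, 100, -90],
    [-100, -25, -90, 91],
]

# Lookup table keyed by (target base, query base).
SCORE = {
    (a, b): MATRIX[i][j]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
}


def block_score(target, query):
    """Sum matrix scores over one gapless block (equal-length sequences)."""
    if len(target) != len(query):
        raise ValueError("a gapless block needs sequences of equal length")
    return sum(SCORE[(t, q)] for t, q in zip(target.upper(), query.upper()))


def keep_chains(chain_scores, min_score=5000):
    """Discard chains scoring below the minimum score."""
    return [s for s in chain_scores if s >= min_score]


if __name__ == "__main__":
    print(block_score("ACGT", "ACGT"))      # identical block
    print(block_score("ACGT", "AGGT"))      # one C/G mismatch
    print(keep_chains([4999, 5000, 12000]))
```

Because the matrix is symmetric, swapping target and query gives the same block score.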
The linear gap matrix used with axtChain: -linearGap=loose

  tablesize 11
  smallSize 111
  position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
  qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetGalGal2Viewnet Net Chicken (Feb. 2004 (WUGSC 1.0/galGal2)), Chain and Net Alignments Comparative Genomics netGalGal2 Chicken Net Chicken (Feb. 2004 (WUGSC 1.0/galGal2)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of chicken (Feb. 2004 (WUGSC 1.0/galGal2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chicken and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chicken assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. 
In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best chicken/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The chicken sequence used in this annotation is from the Feb. 2004 (WUGSC 1.0/galGal2) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. 
Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the chicken/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single chicken chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91   -90   -25  -100
  C   -90   100  -100   -25
  G   -25  -100   100   -90
  T  -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=loose

  tablesize 11
  smallSize 111
  position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
  qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain.
The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetGalGal2Viewchain Chain Chicken (Feb. 2004 (WUGSC 1.0/galGal2)), Chain and Net Alignments Comparative Genomics chainGalGal2 Chicken Chain Chicken (Feb. 
2004 (WUGSC 1.0/galGal2)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of chicken (Feb. 2004 (WUGSC 1.0/galGal2)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both chicken and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the chicken assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best chicken/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The chicken sequence used in this annotation is from the Feb. 2004 (WUGSC 1.0/galGal2) assembly. 
Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the chicken/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. 
The axt alignments were fed into axtChain, which organizes all alignments between a single chicken chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91   -90   -25  -100
  C   -90   100  -100   -25
  G   -25  -100   100   -90
  T  -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=loose

  tablesize 11
  smallSize 111
  position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
  qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetFr1 Fugu Chain/Net Fugu (Aug. 2002 (JGI 3.0/fr1)), Chain and Net Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of fugu (Aug. 2002 (JGI 3.0/fr1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both fugu and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. 
Single lines indicate gaps that are largely due to a deletion in the fugu assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best fugu/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The fugu sequence used in this annotation is from the Aug. 2002 (JGI 3.0/fr1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. 
Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the fugu/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single fugu chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91   -90   -25  -100
  C   -90   100  -100   -25
  G   -25  -100   100   -90
  T  -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=loose

  tablesize 11
  smallSize 111
  position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
  qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first.
The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. 
Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetFr1Viewnet Net Fugu (Aug. 2002 (JGI 3.0/fr1)), Chain and Net Alignments Comparative Genomics netFr1 Fugu Net Fugu (Aug. 2002 (JGI 3.0/fr1)) Alignment Net Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of fugu (Aug. 2002 (JGI 3.0/fr1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both fugu and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the fugu assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best fugu/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. 
The fugu sequence used in this annotation is from the Aug. 2002 (JGI 3.0/fr1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the fugu/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. 
The axt alignments were fed into axtChain, which organizes all alignments between a single fugu chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

        A     C     G     T
  A    91   -90   -25  -100
  C   -90   100  -100   -25
  G   -25  -100   100   -90
  T  -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track. The linear gap matrix used with axtChain: -linearGap=loose

  tablesize 11
  smallSize 111
  position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
  qGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  tGap 325 360 400 450 600 1100 3600 7600 15600 31600 56600
  bothGap 625 660 700 750 900 1400 4000 8000 16000 32000 57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison.
Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D. Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 chainNetFr1Viewchain Chain Fugu (Aug. 2002 (JGI 3.0/fr1)), Chain and Net Alignments Comparative Genomics chainFr1 Fugu Chain Fugu (Aug. 2002 (JGI 3.0/fr1)) Chained Alignments Comparative Genomics Description This track shows regions of the genome that are alignable to other genomes ("chain" subtracks) or in synteny ("net" subtracks). The alignable parts are shown with thick blocks that look like exons. Non-alignable parts between these are shown like introns. Chain Track The chain track shows alignments of fugu (Aug. 2002 (JGI 3.0/fr1)) to the human genome using a gap scoring system that allows longer gaps than traditional affine gap scoring systems. It can also tolerate gaps in both fugu and human simultaneously. These "double-sided" gaps can be caused by local inversions and overlapping deletions in both species. 
The chain track displays boxes joined together by either single or double lines. The boxes represent aligning regions. Single lines indicate gaps that are largely due to a deletion in the fugu assembly or an insertion in the human assembly. Double lines represent more complex gaps that involve substantial sequence in both species. This may result from inversions, overlapping deletions, an abundance of local mutation, or an unsequenced gap in one species. In cases where multiple chains align over a particular region of the human genome, the chains with single-lined gaps are often due to processed pseudogenes, while chains with double-lined gaps are more often due to paralogs and unprocessed pseudogenes. In the "pack" and "full" display modes, the individual feature names indicate the chromosome, strand, and location (in thousands) of the match for each matching alignment. Net Track The net track shows the best fugu/human chain for every part of the human genome. It is useful for finding syntenic regions, possibly orthologs, and for studying genome rearrangement. The fugu sequence used in this annotation is from the Aug. 2002 (JGI 3.0/fr1) assembly. Display Conventions and Configuration Chain Track By default, the chains to chromosome-based assemblies are colored based on which chromosome they map to in the aligning organism. To turn off the coloring, check the "off" button next to: Color track based on chromosome. To display only the chains of one chromosome in the aligning organism, enter the name of that chromosome (e.g. chr4) in box next to: Filter by chromosome. Net Track In full display mode, the top-level (level 1) chains are the largest, highest-scoring chains that span this region. In many cases gaps exist in the top-level chain. When possible, these are filled in by other chains that are displayed at level 2. The gaps in level 2 chains may be filled by level 3 chains and so forth. 
In the graphical display, the boxes represent ungapped alignments; the lines represent gaps. Click on a box to view detailed information about the chain as a whole; click on a line to display information about the gap. The detailed information is useful in determining the cause of the gap or, for lower level chains, the genomic rearrangement. Individual items in the display are categorized as one of four types (other than gap): Top - the best, longest match. Displayed on level 1. Syn - line-ups on the same chromosome as the gap in the level above it. Inv - a line-up on the same chromosome as the gap above it, but in the opposite orientation. NonSyn - a match to a chromosome different from the gap in the level above. Methods Chain track Transposons that have been inserted since the fugu/human split were removed from the assemblies. The abbreviated genomes were aligned with lastz, and the transposons were added back in. The resulting alignments were converted into axt format using the lavToAxt program. The axt alignments were fed into axtChain, which organizes all alignments between a single fugu chromosome and a single human chromosome into a group and creates a kd-tree out of the gapless subsections (blocks) of the alignments. A dynamic program was then run over the kd-trees to find the maximally scoring chains of these blocks. The following matrix was used:

       A     C     G     T
A     91   -90   -25  -100
C    -90   100  -100   -25
G    -25  -100   100   -90
T   -100   -25   -90    91

Chains scoring below a minimum score of "5000" were discarded; the remaining chains are displayed in this track.
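To make the scoring step concrete, the following Python sketch scores a single gapless block with the substitution matrix given above and applies the minimum chain score of 5000. It is only an illustration under stated assumptions: the function names block_score and keep_chains are hypothetical, only the bases A, C, G, and T are handled, and gap penalties (which also contribute to a real chain score) are left out.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"

# Substitution matrix from the track description (rows and columns in A, C, G, T order).
MATRIX_ROWS = [
    [91, -90, -25, -100],
    [-90, 100, -100, -25],
    [-25, -100, 100, -90],
    [-100, -25, -90, 91],
]

# Lookup table keyed by (reference base, aligning base).
SCORE: Dict[Tuple[str, str], int] = {
    (a, b): MATRIX_ROWS[i][j]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
}

# Minimum chain score stated in the track description.
MIN_CHAIN_SCORE = 5000


def block_score(t_seq: str, q_seq: str) -> int:
    """Sum the per-column matrix scores of one gapless block (A/C/G/T only)."""
    if len(t_seq) != len(q_seq):
        raise ValueError("a gapless block must have equal-length sequences")
    return sum(SCORE[(t.upper(), q.upper())] for t, q in zip(t_seq, q_seq))


def keep_chains(chains: List[Tuple[str, int]],
                min_score: int = MIN_CHAIN_SCORE) -> List[Tuple[str, int]]:
    """Drop (name, score) pairs whose score is below the minimum."""
    return [chain for chain in chains if chain[1] >= min_score]
```

For example, block_score("ACGT", "ACGT") is 91 + 100 + 100 + 91 = 382, while a single T-to-A mismatch in the last column, block_score("ACGT", "ACGA"), gives 191. A chain needs many such matching columns, net of gap costs, to reach the 5000 threshold.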
The linear gap matrix used with axtChain:

-linearGap=loose
tablesize 11
smallSize 111
position      1    2    3   11  111  2111  12111  32111  72111  152111  252111
qGap        325  360  400  450  600  1100   3600   7600  15600   31600   56600
tGap        325  360  400  450  600  1100   3600   7600  15600   31600   56600
bothGap     625  660  700  750  900  1400   4000   8000  16000   32000   57000

Net track Chains were derived from lastz alignments, using the methods described on the chain tracks description pages, and sorted with the highest-scoring chains in the genome ranked first. The program chainNet was then used to place the chains one at a time, trimming them as necessary to fit into sections not already covered by a higher-scoring chain. During this process, a natural hierarchy emerged in which a chain that filled a gap in a higher-scoring chain was placed underneath that chain. The program netSyntenic was used to fill in information about the relationship between higher- and lower-level chains, such as whether a lower-level chain was syntenic or inverted relative to the higher-level chain. The program netClass was then used to fill in how much of the gaps and chains contained Ns (sequencing gaps) in one or both species and how much was filled with transposons inserted before and after the two organisms diverged. Credits Lastz (previously known as blastz) was developed at Pennsylvania State University by Minmei Hou, Scott Schwartz, Zheng Zhang, and Webb Miller with advice from Ross Hardison. Lineage-specific repeats were identified by Arian Smit and his RepeatMasker program. The axtChain program was developed at the University of California at Santa Cruz by Jim Kent with advice from Webb Miller and David Haussler. The browser display and database storage of the chains and nets were created by Robert Baertsch and Jim Kent. The chainNet, netSyntenic, and netClass programs were developed at the University of California Santa Cruz by Jim Kent. References Harris, R.S. (2007) Improved pairwise alignment of genomic DNA Ph.D.
Thesis, The Pennsylvania State University Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence alignments. Pac Symp Biocomput. 2002:115-26. PMID: 11928468 Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep 30;100(20):11484-9. PMID: 14500911; PMC: PMC208784 Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003 Jan;13(1):103-7. PMID: 12529312; PMC: PMC430961 ecoresFr1 Fugu Ecores Human/Fugu (Aug. 2002/fr1) Evolutionary Conserved Regions Comparative Genomics Description This track shows Evolutionary Conserved Regions computed by the Exofish program at Genoscope. Each singleton block corresponds to an "ecore"; two blocks connected by a thin line correspond to an "ecotig", a set of colinear ecores in a syntenic region. Methods Genome-wide sequence comparisons were done at the protein-coding level between the genome sequences of human, Homo sapiens, and Fugu (Japanese pufferfish), Takifugu rubripes, to detect evolutionarily conserved regions (ECORES). The sequence versions used in the comparison were Human (July 2003) and Fugu (August 2002). Credits Thanks to Olivier Jaillon at Genoscope for contributing the data. ecoresTetraodon Tetraodon Ecores Human/Tetraodon Evolutionary Conserved Regions Comparative Genomics Description This track shows Evolutionary Conserved Regions computed by the Exofish program at Genoscope. Each singleton block corresponds to an "ecore"; two blocks connected by a thin line correspond to an "ecotig", a set of colinear ecores in a syntenic region. Methods Genome-wide sequence comparisons were done at the protein-coding level between the genome sequences of human, Homo sapiens, and Tetraodon, Tetraodon nigroviridis, to detect evolutionarily conserved regions (ECORES). 
The sequence versions used in the comparison were Human (July 2003) and Tetraodon (March 2004). Credits Thanks to Olivier Jaillon at Genoscope for contributing the data.