Common Pathema Terms

      

Common Pathema Annotation Terms

  • All vs. All data

  • Parameters used for determining blastp All vs. All results:
    • expect=.001 - the E-value cutoff; the allowed range is 0.001-1000 (default is 10). Lower numbers are more stringent.
    • cutoff=120 - the cutoff score for reporting HSPs; the allowed range is 50-120 (the default value is calculated from the EXPECT value). Higher numbers are more stringent.
    • alignments=500 - gives us at most 500 different accessions in the btab file (default is 250)
    • descriptions=500 - gives us at most 500 different accessions in the alignment file (default is 500)
    • filter=seg+xnu - added filters. These filters mask off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993), or segments consisting of short-periodicity internal repeats, as determined by the XNU program of Claverie & States (Computers and Chemistry, 1993). Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
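
    As an illustration only, the settings listed above can be collected in one place and checked against the ranges quoted for expect and cutoff. This is a minimal sketch: the names PARAMS, RANGES, check_params and as_key_value_string are hypothetical, and the key=value rendering simply mirrors the list above rather than the command line of any particular BLAST program.

```python
# Parameter values copied from the list above.
PARAMS = {
    "expect": 0.001,
    "cutoff": 120,
    "alignments": 500,
    "descriptions": 500,
    "filter": "seg+xnu",
}

# Allowed ranges as stated in the list (inclusive bounds are an assumption).
RANGES = {
    "expect": (0.001, 1000),
    "cutoff": (50, 120),
}


def check_params(params):
    """Return a list of messages for values outside the documented ranges."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = params[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    return problems


def as_key_value_string(params):
    """Render the parameters in the key=value form used in the list above."""
    return " ".join(f"{key}={value}" for key, value in params.items())
```

    Calling check_params(PARAMS) returns an empty list for the values shown, while a value such as cutoff=130 would be reported as out of range.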

  • All v/s All Protein Blast Searches

  • All of the proteins are routinely blasted against one another. We call these the "All v/s All" searches. When a new genome is added to the database, the All v/s All searches are re-run. The information from the All v/s All blast searches is displayed throughout Pathema.

  • BER (Blast-Extend-Repraze) Searches

  • The Smith-Waterman alignment (what we call Blast-Extend-Repraze or BER) is computed dynamically on the website when the user selects the results to see in the alignment. Here is a description of the BER/Smith-Waterman process we use to generate the alignments for the website on the fly:

    The reference protein is the protein for which the blast results are being shown. The user then selects the comparison proteins for which they wish to see the alignments (the proteins in the table for the above example); these proteins make up the mini-database of comparison BLAST hits. A modified Smith-Waterman alignment (Smith, 1981) is then performed on the reference protein against this mini-database. In order to identify potential frameshifts or point mutations in the sequence, the nucleotide sequence of the reference protein's gene is extended 300 nucleotides upstream and downstream of the predicted coding region. If significant homology to a comparison protein exists and extends into a different frame from that predicted, or extends through a stop codon, the program will continue the alignment past the boundaries of the predicted coding region.
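
    Two pieces of this process can be sketched in a few lines. The code below is not the modified algorithm BER uses: it is the plain textbook Smith-Waterman local alignment with a linear gap penalty, applied to two protein strings, plus a helper that widens a coding region by 300 nucleotides on each side. The function names, the 0-based half-open coordinates, and the match/mismatch/gap scores are all assumptions made for illustration.

```python
def extend_region(start, end, sequence_length, flank=300):
    """Widen a region by `flank` positions on each side, clipped to the sequence.

    Coordinates are assumed to be 0-based and half-open: [start, end).
    """
    return max(0, start - flank), min(sequence_length, end + flank)


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Textbook Smith-Waterman local alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; local alignment never lets a cell go below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

    In this sketch, the reference sequence would be aligned against each member of the mini-database in turn by calling smith_waterman once per comparison protein. Handling frameshifts and read-through of stop codons, as described above, requires aligning against the extended nucleotide sequence and is beyond this simplified version.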

  • NCBI COGs

  • COGs or Clusters of Orthologous Groups of proteins are phylogenetic classifications of proteins encoded in complete genomes. COGs were delineated by comparing protein sequences encoded in 21 complete genomes, representing 17 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. For more information, visit the COG Home Page at NCBI.

  • Enzyme Commission Number

  • Enzyme Commission Numbers or EC#s are numbers assigned to enzymes that reflect their function. For more information and a complete list of all EC#s, visit the Enzyme Nomenclature Page.
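
    An EC number is written as four dot-separated fields (class, subclass, sub-subclass, serial number), for example EC 1.1.1.1, and a dash stands in for a field that has not been assigned. The sketch below splits such a string into its four fields. The function name parse_ec and the decision to return None for a dash are assumptions made for illustration, and the sketch does not handle preliminary serial numbers that are not plain integers.

```python
def parse_ec(ec):
    """Split an EC number string into a tuple of four fields.

    Unassigned fields written as "-" are returned as None.
    """
    text = ec.strip()
    # Drop an optional "EC" prefix and any separator that follows it.
    if text.upper().startswith("EC"):
        text = text[2:].lstrip(" #:")
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return tuple(None if field == "-" else int(field) for field in fields)
```

    For example, parse_ec("EC 1.1.1.1") gives (1, 1, 1, 1), and a partial number such as "2.7.-.-" gives (2, 7, None, None).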

  • MUMmer

  • MUMmer or the Whole Genome Alignment Tool is a system for aligning whole genome sequences. Using an efficient data structure called a suffix tree, the system is able to rapidly align sequences containing millions of nucleotides. It is fully described in: A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes. Nucleic Acids Research, 27:11 (1999), 2369-2376.

  • Paralogous Gene Families

  • Paralogous gene families are groups of genes that arose by duplication within a particular organism during evolution. Not all genomes in the Omniome database have paralogous gene families assigned.

  • Pfam

  • Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. For more information on Pfam, visit the Pfam Home Page.

  • PROSITE

  • Prosite is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. To get more information on Prosite, visit the Prosite Home Page.



    Terms associated with TIGRFAMs


  • Domain

  • A region of sequence homology among sets of proteins that are not all full-length homologs. Homology domains often, but not always, correspond to recognizable protein folding domains.

  • Equivalog

  • Equivalogs describe members of a set of homologous proteins that are conserved with respect to function since their last common ancestor. Related proteins are grouped into equivalog families where possible, and otherwise into protein families with other hierarchically defined homology types.

  • HMM

  • A Hidden Markov Model, or HMM, is a statistical model for any system that can be represented as a succession of transitions between discrete states. In this case, the discrete states correspond to the successive columns of a protein multiple sequence alignment. In principle, HMMs can be developed from unaligned sequences by successive rounds of optimization, but in practice, protein profile HMMs are simply built from curated multiple sequence alignments. HMM searches resemble later-round PSI-BLAST searches (although based on curated alignments), with position-specific scoring for each amino acid, insertion, and deletion over the length of the sequence. Scores are reported both in bits of information and as an E-value.

  • Motif

  • Generally, a small region of sequence similarity (not necessarily homology) characterized by distinct patterns of amino acids at specific positions. An example of a motif is the N-glycosylation site motif N{P}[ST] (Asn, anything but Pro, choice of Ser or Thr).
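
    The N-glycosylation pattern above can be searched for with an ordinary regular expression: braces mean "anything but" and square brackets mean "any one of". The sketch below converts that small subset of the pattern notation to Python regular-expression syntax and reports every match position. The function names are hypothetical, and the conversion covers only the constructs used in the example (plus "x" for any residue and "-" as a separator), not the full PROSITE syntax.

```python
import re


def pattern_to_regex(pattern):
    """Convert a simple motif pattern such as N{P}[ST] to a regular expression.

    Only {...} (excluded residues), [...] (allowed residues), x (any residue)
    and - (separator) are handled.
    """
    return (
        pattern.replace("-", "")
        .replace("{", "[^")
        .replace("}", "]")
        .replace("x", ".")
    )


def find_motif(sequence, pattern):
    """Return (0-based start, matched text) for every occurrence, overlaps included."""
    # A lookahead with a capturing group lets overlapping matches be reported.
    regex = re.compile(f"(?=({pattern_to_regex(pattern)}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]
```

    With the sequence "ANGSNPTNKT", find_motif reports NGS at position 1 and NKT at position 7; the NPT at position 4 is skipped because the second residue is Pro.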

  • Noise Cutoff

  • The HMM score below which hits to the HMM are considered uninteresting.

  • Orthologs

  • Proteins related to each other by descent from a common ancestral sequence by speciation. Orthologs may differ in function.

  • Superfamily

  • The complete set of proteins having sequence homology over essentially their full length.

  • TIGRFAMs

  • TIGRFAMs are a collection of protein families featuring curated multiple sequence alignments, Hidden Markov Models (HMMs) and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family (see Equivalog above), where achievable, complements classification by orthologs, superfamily, domain or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large scale genome sequencing projects. To download or get more information on TIGRFAMs, go to the TIGRFAMs Home Page.

  • Trusted Cutoff

  • The HMM score above which there should be no false positive hits.
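
    The trusted and noise cutoffs together divide HMM hits into three groups. The sketch below shows one way to apply them to a bit score. The function name, the three labels, and the choice to treat a score exactly equal to the trusted cutoff as trusted are assumptions made for illustration, not definitions taken from TIGRFAMs.

```python
def classify_hit(score, trusted_cutoff, noise_cutoff):
    """Label an HMM hit by comparing its score with the two cutoffs.

    "trusted":   score at or above the trusted cutoff (assumed inclusive)
    "noise":     score below the noise cutoff
    "uncertain": anything in between
    """
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:
        return "trusted"
    if score < noise_cutoff:
        return "noise"
    return "uncertain"
```

    For a hypothetical model with a trusted cutoff of 150 and a noise cutoff of 80, a hit scoring 200 would be labeled trusted, one scoring 100 uncertain, and one scoring 50 noise.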


©1999-2009 The J. Craig Venter Institute