Common Pathema Annotation Terms
All vs. All data
Parameters used for determining blastp All vs. All results:
- expect=.001 - the E-value cutoff; the range is 0.001-1000
(default is 10). Lower numbers are more stringent.
- cutoff=120 - the cutoff score for reporting HSPs; the
range is 50-120 (default value is calculated from the EXPECT
value). Higher numbers are more stringent.
- alignments=500 - gives us at most 500 different
accessions in the btab file (default is 250)
- descriptions=500 - gives us at most 500 different
accessions in the alignment file (default is 500)
- filter=seg+xnu - added filters. These filters mask off
segments of the query sequence that have low compositional
complexity, as determined by the SEG program of Wootton &
Federhen (Computers and Chemistry, 1993), or segments
consisting of short-periodicity internal repeats, as
determined by the XNU program of Claverie & States (Computers
and Chemistry, 1993). Filtering can eliminate statistically
significant but biologically uninteresting reports from the
blast output (e.g., hits against common acidic-, basic- or
proline-rich regions), leaving the more biologically
interesting regions of the query sequence available for
specific matching against database sequences.
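As a rough illustration of how the two stringency settings above interact, the sketch below records the listed values in a dictionary and checks whether a hit would clear both thresholds. The dictionary layout and the function name `is_reported` are hypothetical and are not part of BLAST or of the Pathema pipeline; the actual filtering is done by blastp itself.

```python
# Values copied from the parameter list above; the layout is illustrative only.
ALL_VS_ALL_PARAMS = {
    "expect": 0.001,        # E-value cutoff (lower is more stringent)
    "cutoff": 120,          # HSP score cutoff (higher is more stringent)
    "alignments": 500,
    "descriptions": 500,
    "filter": "seg+xnu",
}


def is_reported(evalue, score, params=ALL_VS_ALL_PARAMS):
    """Return True if a hit clears both the E-value and the score cutoff.

    This is a simplification: it treats the two settings as independent
    thresholds applied to a single hit.
    """
    return evalue <= params["expect"] and score >= params["cutoff"]


print(is_reported(1e-5, 150))   # clears both thresholds
print(is_reported(0.01, 150))   # E-value is too high
```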
All v/s All Protein Blast Searches
All of the proteins are routinely blasted against
one another. We call these the "All v/s All" searches. When a
new genome is added to the database the all v/s all searches
are re-run. The information from the All v/s All blast
searches is displayed throughout the Pathema site.
BER (Blast-Extend-Repraze) Searches
The Smith-Waterman alignment (what we call Blast-Extend-Repraze
or BER) is computed dynamically on the website when the user
selects the results to see in the alignment. Here is a
description of the BER/Smith-Waterman process we use to generate
the alignments for the website on the fly:
The reference protein is the protein for which the blast results are being
shown. The user then selects the comparison proteins for which they wish to see the
alignments (the proteins in the table for the above example); these proteins
make up the mini-database of comparison BLAST hits. Then a modified
Smith-Waterman alignment (Smith, 1981) is performed on the reference protein
against the mini-database of comparison BLAST hits. In order to identify
potential frameshifts or point mutations in the sequence, the nucleotide gene
of the reference protein is extended 300 nucleotides upstream and downstream of
the predicted coding region. If significant homology to a comparison protein
exists and extends into a different frame from that predicted, or extends
through a stop codon, the program will continue the alignment past the boundaries of the predicted coding region.
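The two mechanical pieces of this process can be sketched as follows. This is a minimal illustration only: it implements the standard Smith-Waterman local alignment score with simple match, mismatch and gap values chosen for the example, not the modified, frameshift-aware version used by BER, and the function names and scoring values are assumptions. The second function shows the 300-nucleotide extension of the predicted coding region, clamped to the ends of the sequence.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Standard Smith-Waterman with a linear gap penalty; scoring values
    are illustrative, not those used by BER.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0, which makes the alignment local.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def extended_region(start, end, seq_len, flank=300):
    """Extend a 0-based [start, end) region by `flank` nucleotides each way."""
    return max(0, start - flank), min(seq_len, end + flank)


print(smith_waterman_score("XXACDEXX", "YYACDEYY"))
print(extended_region(500, 800, 5000))
```

In this toy version the reference sequence would be scored against each protein in the mini-database in turn; handling of frame changes and stop codons is deliberately left out.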
NCBI COGs
COGs, or Clusters of Orthologous Groups of proteins, are phylogenetic
classifications of proteins encoded in complete genomes.
COGs were delineated by comparing protein sequences encoded
in 21 complete genomes, representing 17 major phylogenetic
lineages. Each COG consists of individual proteins or groups
of paralogs from at least 3 lineages and thus corresponds to
an ancient conserved domain. For more information, visit the
COG Home Page at
NCBI.
Enzyme Commission Number
Enzyme Commission Numbers or EC#s are numbers assigned to
enzymes that reflect their function. For more information and
a complete list of all EC#s, visit the Enzyme Nomenclature Page.
MUMmer
MUMmer or the Whole Genome Alignment Tool is a system for
aligning whole genome sequences. Using an efficient data
structure called a suffix tree, the system can rapidly
align sequences containing millions of nucleotides. It is
fully described in: A.L. Delcher, S. Kasif, R.D. Fleischmann,
J. Peterson, O. White, and S.L. Salzberg. Alignment of whole
genomes. Nucleic Acids Research, 27:11 (1999), 2369-2376.
Paralogous Gene Families
Paralogous gene families are groups of genes that arose by
duplication within a particular organism during evolution. Not all
genomes in the Omniome database have paralogous gene families
assigned.
Pfam
Pfam is a large collection of multiple sequence alignments
and hidden Markov models covering many common protein
domains. For more information on Pfam, visit the Pfam Home Page.
PROSITE
Prosite is a database of protein families and domains. It
consists of biologically significant sites, patterns and
profiles that help to reliably identify to which known
protein family (if any) a new sequence belongs. To get more
information on Prosite, visit the Prosite Home Page.
Terms associated with TIGRFAMs
Domain
A region of sequence homology among sets of proteins that are not all full-length homologs. Homology domains often,
but not always, correspond to recognizable protein folding domains.
Equivalog
Equivalogs describe members of a set of homologous proteins
that are conserved with respect to function since their last
common ancestor. Related proteins are grouped into equivalog
families where possible, and otherwise into protein families
with other hierarchically defined homology types.
HMM
A Hidden Markov Model, or HMM, is a statistical model for
any system that can be represented as a succession of
transitions between discrete states. In this case, the
discrete states correspond to the successive columns of a
protein multiple sequence alignment. In principle, HMMs can
be developed from unaligned sequences by successive rounds
of optimization, but in practice, protein profile HMMs are
simply built from curated multiple sequence alignments. HMM
searches resemble later-round PSI-BLAST searches (although
based on curated alignments), with position-specific scoring
for amino acid matches, insertions, and deletions over the
length of the sequence. Scores are reported both in bits of
information and as an E-value.
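To make the idea of a position-specific score in bits more concrete, here is a heavily simplified sketch. It scores an ungapped sequence against a toy profile by summing per-column log-odds values (log base 2 of the column probability over a background probability). The profile, the uniform background of 0.05, and the floor probability are all made-up assumptions, and insertion and deletion states are ignored, so this is not how a real profile HMM search is computed.

```python
import math

BACKGROUND = 0.05   # assumed uniform background frequency (20 amino acids)
FLOOR = 0.0125      # assumed probability for residues not listed in a column


def bit_score(sequence, columns):
    """Sum per-column log-odds scores, in bits, for an ungapped match."""
    if len(sequence) != len(columns):
        raise ValueError("sequence and profile must have the same length")
    total = 0.0
    for residue, column in zip(sequence, columns):
        p = column.get(residue, FLOOR)
        total += math.log2(p / BACKGROUND)
    return total


# Hypothetical three-column profile: residue -> probability in that column.
toy_profile = [{"A": 0.4, "G": 0.2}, {"C": 0.8}, {"D": 0.2, "E": 0.2}]

print(round(bit_score("ACD", toy_profile), 3))
print(round(bit_score("GCW", toy_profile), 3))
```

A residue that is common in a column contributes a positive number of bits, and one that is rarer than background contributes a negative number.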
Motif
Generally, a small region of sequence similarity (not
necessarily homology) characterized by distinct patterns of
amino acids at specific positions. An example of a motif is
the N-glycosylation site motif N{P}[ST] (Asn, anything but
Pro, choice of Ser or Thr).
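The motif above translates directly into a regular expression. The sketch below uses Python's `re` module with a lookahead so that overlapping occurrences are also found; the function name and the test sequences are made up for illustration.

```python
import re

# N, then any residue except P, then S or T. The lookahead (?=...) matches
# without consuming characters, so overlapping sites are all reported.
N_GLYC = re.compile(r"(?=N[^P][ST])")


def find_motif_sites(protein):
    """Return 0-based start positions of N{P}[ST] in a protein sequence."""
    return [m.start() for m in N_GLYC.finditer(protein)]


print(find_motif_sites("MNASQNPTANGT"))  # [1, 9]: NAS and NGT match, NPT does not
print(find_motif_sites("NNTS"))          # [0, 1]: overlapping NNT and NTS
```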
Noise Cutoff
The HMM score below which hits to the HMM are considered
uninteresting.
Orthologs
Proteins related to each other by descent from a common
ancestral sequence by speciation. Orthologs may differ in
function.
Superfamily
The complete set of proteins having sequence homology over
essentially their full length.
TIGRFAMs
TIGRFAMs are a collection of protein families featuring
curated multiple sequence alignments, Hidden Markov Models
(HMMs) and associated information designed to support the
automated functional identification of proteins by sequence
homology. Classification by equivalog family (see Equivalog above),
where achievable, complements classification by orthologs,
superfamily, domain or motif. It provides the information
best suited for automatic assignment of specific functions
to proteins from large scale genome sequencing projects. To
download or get more information on TIGRFAMs, go to the
TIGRFAMs Home Page.
Trusted Cutoff
The HMM score above which there should be no false positive hits.
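The noise and trusted cutoffs defined in this glossary divide HMM hits into three bands. A minimal sketch of that classification is shown below. The function name and the example cutoff values are hypothetical, and treating a score exactly equal to the trusted cutoff as trusted is an assumption of this sketch.

```python
def classify_hit(score, trusted_cutoff, noise_cutoff):
    """Place an HMM score relative to a model's trusted and noise cutoffs."""
    if score >= trusted_cutoff:      # assumption: equality counts as trusted
        return "above trusted cutoff"
    if score < noise_cutoff:
        return "below noise cutoff"
    return "between cutoffs"


# Hypothetical cutoffs for one model: trusted = 100 bits, noise = 50 bits.
for score in (150.0, 75.0, 20.0):
    print(score, classify_hit(score, trusted_cutoff=100.0, noise_cutoff=50.0))
```

Hits in the middle band are neither safely assignable to the family nor clearly uninteresting, so they are the ones that usually need manual review.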