Bellerophon: a program to detect chimeric sequences in multiple sequence alignments
Thomas Huber and Philip Hugenholtz#
#DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA
ComBinE group, Advanced Computational Modelling Centre, The University of Queensland, Brisbane 4072,
Summary: Bellerophon is a program for detecting chimeric sequences in
multiple sequence datasets by an adaption of partial treeing analysis.
Bellerophon was specifically developed to detect 16S rRNA gene chimeras in
PCR-clone libraries of environmental samples but can be applied to other
nucleotide sequence alignments.
Availability: Bellerophon is available as an interactive web server at
Contact: [email protected]
Bellerophon detects chimeras based on a partial treeing approach (Wang and Wang,
1997; Hugenholtz and Huber, 2003), i.e. phylogenetic trees are inferred from
independent regions (fragments) of a multiple sequence alignment and the branching
patterns are compared for incongruencies that may be indicative of chimeric
sequences. No trees are actually built during the procedure and the only calculations
required are distance (sequence similarity) calculations. A full matrix of distances
(dm) between all pairs of sequences is calculated for fragments left and right of an
assumed break point. The total absolute deviation of the distance matrices (distance
matrix error, dme) of n sequences is then
$$\mathrm{dme} = \sum_{i=1}^{n} \sum_{j=1}^{n} \left| \mathrm{dm}_{\mathrm{left}}[i][j] - \mathrm{dm}_{\mathrm{right}}[i][j] \right|$$
where dm[i][j] denotes the distance between two sequences i and j. The largest
contribution to the dme is expected to arise from chimeras, since fragments from
these sequences have distinctly different locations relative to all other sequences in
the dataset, and therefore distinctly different distance matrices. To rank the sequences
by their contribution to the dme value, we calculate the ratio of the dme value from
all sequences over the dme value (dme[i]) of a reference dataset lacking the sequence
i under consideration. This ratio is called the preference score of the sequence.
Chimeric sequences will have a preference score >1, whereas non-chimeric
sequence scores are expected to be ~1. To detect all putative chimeras in a
dataset, preference scores have to be calculated for all sequences. Naively, the
calculation would require a computationally expensive distance matrix comparison
for each sequence in the dataset. This can, however, be implemented more efficiently
by taking advantage of previously performed calculations. Because the calculation of
the dme involves column sums in the form of
$$\mathrm{col}[i] = \sum_{j=1}^{n} \left| \mathrm{dm}_{\mathrm{left}}[i][j] - \mathrm{dm}_{\mathrm{right}}[i][j] \right|$$
and the distances between identical sequences dm[i][i] are by definition zero,
equation (1) can be rewritten as:
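$$\mathrm{preference}[i] = \frac{\mathrm{dme}}{\mathrm{dme}[i]} = \frac{\mathrm{dme}}{\mathrm{dme} - 2\,\mathrm{col}[i]}$$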
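The equivalence of the naive and the column-sum calculation can be checked on a small example. The following is a minimal sketch, not the Bellerophon implementation: the function names are hypothetical, and the two matrices hold invented mismatch counts for four sequences, of which sequence 3 is constructed to behave like a chimera (close to sequence 0 on the left, close to sequence 2 on the right). Removing sequence i deletes row i and column i from the sum; because the matrices are symmetric with a zero diagonal, this removes exactly 2 col[i].

```python
def dme(dm_left, dm_right, skip=None):
    """Total absolute deviation of two distance matrices.

    If skip is given, the row and column of that sequence are left out,
    which corresponds to the reference dataset lacking that sequence.
    """
    keep = [k for k in range(len(dm_left)) if k != skip]
    return sum(abs(dm_left[i][j] - dm_right[i][j]) for i in keep for j in keep)


def column_sums(dm_left, dm_right):
    """col[i]: summed absolute deviation of sequence i to all other sequences.

    Computed row by row; for symmetric matrices row and column sums agree.
    """
    return [sum(abs(l - r) for l, r in zip(row_l, row_r))
            for row_l, row_r in zip(dm_left, dm_right)]


def ratio(total, reduced):
    """Preference score; treated as infinite if the reduced dme is zero (sketch choice)."""
    return float("inf") if reduced == 0 else total / reduced


def preference_naive(dm_left, dm_right):
    """One full matrix comparison per sequence."""
    total = dme(dm_left, dm_right)
    return [ratio(total, dme(dm_left, dm_right, skip=i))
            for i in range(len(dm_left))]


def preference_fast(dm_left, dm_right):
    """Single matrix comparison plus stored column sums."""
    col = column_sums(dm_left, dm_right)
    total = sum(col)
    return [ratio(total, total - 2 * c) for c in col]


# Invented toy distances (mismatch counts); sequence 3 plays the chimera.
dm_left = [[0, 10, 30, 2],
           [10, 0, 30, 10],
           [30, 30, 0, 30],
           [2, 10, 30, 0]]
dm_right = [[0, 12, 28, 30],
            [12, 0, 31, 30],
            [28, 31, 0, 2],
            [30, 30, 2, 0]]

naive = preference_naive(dm_left, dm_right)
fast = preference_fast(dm_left, dm_right)
print(naive == fast)
print(max(range(len(fast)), key=fast.__getitem__))
```

With only four sequences the background scores are well above 1; the expectation of ~1 for non-chimeric sequences applies to datasets large enough that a single sequence contributes little to the dme.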
A PCR-generated chimeric sequence usually comprises two phylogenetically
distinct parent sequences and occurs when a prematurely terminated amplicon
reanneals to a foreign DNA strand and is copied to completion in the following
PCR cycles. The point at which the chimeric sequence changes from one
parent to the next is called the conversion, recombination or break point.
Chimeras are problematic in culture-independent surveys of microbial
communities because they suggest the presence of non-existent organisms
(von Wintzingerode et al., 1997). Several methods have been developed for
detecting chimeric sequences (Cole et al., 2003; Komatsoulis and Waterman,
1997; Liesack et al., 1991; Robinson-Cox et al., 1995) that generally rely on
direct comparison of individual sequences to one or two putative parent
sequences at a time. Here we present an alternative approach based on how
well sequences fit into their complete phylogenetic context.
The rewritten preference score involves only the calculation of a single matrix and some intermediate storage of
the column sums. To determine the optimal break point for putative chimeras, all
sequences are scanned along their length by dividing the alignment into fragment
pairs at 10-character intervals. Distances are calculated from equally sized windows
(200, 300 or 400 characters) of the fragments left and right of the break point to
obtain similar signal-to-noise ratios for each fragment. The highest preference score
calculated for each sequence in all fragment pairs indicates the optimal break point.
Sequences are ranked according to their highest recorded preference score and
reported as potentially chimeric if that score is >1. Absolute preference scores are
dataset-dependent and should only be used for relative ranking of putative chimeras
within a given dataset. For manual confirmation of identified chimeras and
phylogenetic placement of the chimeric halves, it is necessary to specify the
parent sequences in the dataset that most likely gave rise to the chimera.
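The break point scan described above can be sketched as follows. This is an illustration under stated assumptions rather than the program's actual code: the function names are hypothetical, raw mismatch counts stand in for the distance measure (equivalent to a fractional distance here because both windows have the same size), and a break point is skipped for a sequence when the reduced dme is zero. The toy alignment is invented and uses a window of 20 and a step of 10 instead of the 200 to 400 character windows and 10-character step, purely to keep it small.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length fragments."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(fragments):
    """Full matrix of pairwise distances between fragments."""
    return [[hamming(a, b) for b in fragments] for a in fragments]


def scan_break_points(alignment, window=300, step=10):
    """Return (highest preference score, break point) for every sequence."""
    n = len(alignment)
    length = len(alignment[0])
    best = [(0.0, None)] * n
    for bp in range(window, length - window + 1, step):
        # Equally sized windows directly left and right of the break point.
        left = distance_matrix([s[bp - window:bp] for s in alignment])
        right = distance_matrix([s[bp:bp + window] for s in alignment])
        col = [sum(abs(l - r) for l, r in zip(left[i], right[i]))
               for i in range(n)]
        total = sum(col)  # the dme at this break point
        for i in range(n):
            reduced = total - 2 * col[i]  # dme without sequence i
            if reduced <= 0:
                continue  # score undefined here (sketch choice)
            score = total / reduced
            if score > best[i][0]:
                best[i] = (score, bp)
    return best


def substitute(seq, counts, offset, block=10, sub="T"):
    """Toy helper: put counts[k] substitutions into the k-th block of seq."""
    chars = list(seq)
    for k, count in enumerate(counts):
        for t in range(count):
            chars[k * block + offset + t] = sub
    return "".join(chars)


# Invented toy alignment: two parents, their chimera (break at column 40),
# and three background sequences with a little positional variation.
noise = [0, 0, 0, 1, 0, 2, 0, 3]
parent_a = "A" * 80
parent_b = "C" * 80
chimera = parent_a[:40] + parent_b[40:]
ref_d = "G" * 80
ref_e = substitute(ref_d, noise, offset=0)
ref_f = substitute(ref_d, noise, offset=5)
toy = [parent_a, parent_b, chimera, ref_d, ref_e, ref_f]

best = scan_break_points(toy, window=20, step=10)
ranked = sorted(range(len(toy)), key=lambda i: best[i][0], reverse=True)
print(ranked[0], best[ranked[0]])  # index 2 (the chimera), score 11.0 at column 40
```

In this toy set the chimera ranks first and its highest score falls on the true break point. The background sequences also score above 1 because the dataset is tiny, which illustrates why absolute scores are only meaningful for ranking within one dataset.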
Parent sequences are assigned to each putative chimera by selecting the two
sequences with the highest opposing paired distance contributions (dm[i][j]) to
the dme at the optimal break point. The parent sequences of a chimera are most
likely to be found in the same PCR-clone library and therefore as many
sequences as possible from this one library should be included in the analysis.
However, even if the exact parent sequences of a given chimera are not present
in the dataset, Bellerophon will identify and report the closest phylogenetic
neighbours of the parents. In addition, the output from Bellerophon includes
the location of the optimal break point relative to an Escherichia coli reference
alignment (Brosius et al., 1978) and the percentage identities of the parent
sequences to the chimera on either side of the break point. These features aid in
verification of chimeras. Mutually incompatible chimeras are screened from
the Bellerophon output. That is, once a sequence (A) has been identified as
chimeric, subsequent putative chimeras with lower preference scores that
identify sequence A as one of the parents are removed from the output list.
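Parent assignment and the screening of mutually incompatible chimeras could look roughly like the sketch below. Both functions and the toy data are hypothetical. "Highest opposing paired distance contributions" is interpreted here as the most negative and the most positive signed difference dm_left[i][j] - dm_right[i][j], and only chimeras that stay in the output are treated as "identified as chimeric"; the poster does not spell out either detail, so both are assumptions.

```python
def assign_parents(dm_left, dm_right, i):
    """Pick the putative left and right parent of sequence i.

    The left parent is much closer to i on the left than on the right
    (most negative signed difference); the right parent is the opposite.
    """
    others = [j for j in range(len(dm_left)) if j != i]
    delta = {j: dm_left[i][j] - dm_right[i][j] for j in others}
    left_parent = min(delta, key=delta.get)
    right_parent = max(delta, key=delta.get)
    return left_parent, right_parent


def drop_incompatible(candidates):
    """Screen candidates given as (name, preference score, parents) tuples.

    Candidates are visited from highest to lowest score; one is dropped
    if it names an already accepted chimera as a parent.
    """
    accepted, chimeric = [], set()
    for name, score, parents in sorted(candidates, key=lambda c: c[1],
                                       reverse=True):
        if chimeric.intersection(parents):
            continue
        accepted.append((name, score, parents))
        chimeric.add(name)
    return accepted


# Invented toy distances at the optimal break point; sequence 3 is the chimera.
dm_left = [[0, 10, 30, 2],
           [10, 0, 30, 10],
           [30, 30, 0, 30],
           [2, 10, 30, 0]]
dm_right = [[0, 12, 28, 30],
            [12, 0, 31, 30],
            [28, 31, 0, 2],
            [30, 30, 2, 0]]
parents = assign_parents(dm_left, dm_right, 3)
print(parents)

# Invented candidate list: "A" names the stronger candidate "X" as a parent.
candidates = [("A", 1.8, ("X", "D")),
              ("X", 11.0, ("A", "B")),
              ("E", 4.0, ("D", "F"))]
kept = [name for name, _, _ in drop_incompatible(candidates)]
print(kept)
```

In the toy list, candidate "A" is dropped because its proposed parent "X" has already been accepted as a chimera with a higher preference score.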
Usage: More than 275 users worldwide
To date (10 November 2004), Bellerophon has been used by more than 275 researchers
worldwide (Fig. 1) to detect chimeric sequences in more than 2500 PCR clone
libraries. Fig. 2 shows the total number of monthly requests processed by
Bellerophon. The screening of approximately 250 sequences each month is a direct
indication of its popularity. This has to be seen in particular in the context of the
importance of 16S marker genes in molecular microbial biology for identifying new
species in microbial communities and the experimental time involved in generating
a single PCR clone library from an
Fig. 1: user locations.
Fig. 2: Server usage.
Hugenholtz,P. and Huber,T. (2003) Chimeric 16S rDNA sequences
of diverse origin are accumulating in the public databases.
Int. J. Syst. Evol. Microbiol., 53, 289–293.
Huber,T., Faulkner,G. and Hugenholtz,P. (2004) Bellerophon: a program to
detect chimeric sequences in multiple sequence alignments. Bioinformatics, 20