Calculation of trees from alignments
Trees are calculated on either the complete alignment, or just the
currently selected group of sequences, via the calculations dialog opened from the Calculate→Calculate
Tree or PCA... menu entry. Once calculated, trees are displayed in a new tree viewing
window. Each calculation combines one of the distance measures
described below with one of two tree construction algorithms:
Distance Measures
Trees are calculated on the basis of a measure of similarity
between each pair of sequences in the alignment:
- PID
The percentage identity
between the two sequences at each aligned position.
- PID = Number of equivalent aligned non-gap symbols *
100 / Smallest number of non-gap positions in either of the two
sequences
This is essentially the 'number of
identical bases (or residues) per 100 base pairs (or
residues)'.
- BLOSUM62, PAM250, DNA
These options
use one of the available substitution matrices to compute a sum of
scores for the residue pairs at each aligned position.
- Sequence Feature Similarity
Trees
are constructed from a distance matrix formed from Jaccard
distances between sequence features observed at each column of the
alignment.
- Similarity at column i = (Total number of
features displayed - Sum of number of features in common at i)
Similarities are summed over all columns and divided by
the number of columns.
Since the total number of
feature types is constant over all columns of the alignment,
we do not scale the matrix, so tree distances can be
interpreted as the average number of features that differ over
all sites in the aligned region.
Distances are computed based on the currently displayed feature
types. Sequences with similar distributions of features of the
same type will be grouped together in trees computed with this
metric. This measure was introduced in Jalview 2.9.
- Secondary Structure Similarity
Trees are
generated using a distance matrix, which is constructed from Jaccard
distances that specifically consider the secondary structure annotation
observed at each column of the alignment.
- For secondary structure similarity analysis, at any given column
i, the range of unique secondary structures is between 0 and 2,
reflecting the presence of helices, sheets, coils and gaps.
The similarity at column i = Total
number of unique secondary structures (which can range from 0 to 2)
- Sum of the number of secondary structures in common at column
i (which can be either 0 or 1)
The similarity scores are
summed across all columns and then divided by the total number of
columns to calculate an average similarity score.
Distance calculations are based on the secondary structures
currently displayed. Sequences with similar distributions of secondary
structures will be grouped together in trees.
The distance between two sequences is at its maximum when one
sequence has a defined secondary structure annotation track and the
other does not, indicating complete dissimilarity between them.
Conversely, the distance is at its minimum when neither sequence in
the comparison has a defined secondary structure annotation track.
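The PID formula and the feature-based distance described above can be sketched in a few lines of Python. This is an illustration only, not Jalview's implementation: the function names, the set of gap characters, the case-insensitive comparison and the per-column input format are all assumptions. The feature distance takes one reading of "features that differ", namely the number of feature types present on exactly one of the two sequences at a column.

```python
# Illustrative sketch only; names and input formats are assumptions,
# not Jalview's API.
GAP_CHARS = {"-", ".", " "}  # assumed gap symbols


def pid(seq_a, seq_b):
    """Percentage identity of two aligned (equal-length) sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Aligned positions where both symbols are non-gap and equal
    # (compared case-insensitively, which is an assumption).
    matches = sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a not in GAP_CHARS and b not in GAP_CHARS and a.upper() == b.upper()
    )
    # Smallest number of non-gap positions in either sequence.
    shortest = min(
        sum(1 for a in seq_a if a not in GAP_CHARS),
        sum(1 for b in seq_b if b not in GAP_CHARS),
    )
    return 100.0 * matches / shortest if shortest else 0.0


def feature_distance(features_a, features_b):
    """Average number of differing feature types per column.

    Each argument is a list holding one set of feature types per column.
    """
    if len(features_a) != len(features_b):
        raise ValueError("both sequences must cover the same columns")
    if not features_a:
        return 0.0
    # Symmetric difference: types present on exactly one sequence.
    differing = sum(len(fa ^ fb) for fa, fb in zip(features_a, features_b))
    return differing / len(features_a)


if __name__ == "__main__":
    print(pid("ACGT-A", "ACCTTA"))
    cols_a = [{"helix"}, {"helix", "metal"}, set()]
    cols_b = [{"helix"}, {"metal"}, {"site"}]
    print(feature_distance(cols_a, cols_b))
```

In the first call, four of the aligned non-gap positions are identical and the shorter sequence has five non-gap positions, so the PID is 4 * 100 / 5. In the second, two feature types differ across three columns.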
Tree Construction Methods
Jalview currently supports two kinds of agglomerative
clustering methods. These are not intended to substitute for
rigorous phylogenetic tree construction, and may fail on very large
alignments.
- UPGMA tree
UPGMA stands for
Unweighted Pair-Group Method using Arithmetic averages. Clusters
are iteratively formed and extended by finding a non-member
sequence with the lowest average dissimilarity over the cluster
members.
- Neighbour Joining tree
First
described in 1987 by Saitou and Nei, this method applies a greedy
algorithm that aims to find the tree with the smallest total branch length.
This method, as implemented in Jalview, is considerably more
expensive than UPGMA.
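To make the agglomerative idea concrete, here is a minimal sketch of textbook UPGMA in Python, working from a precomputed distance matrix. It is not Jalview's code: the function name `upgma`, the nested-tuple tree representation and the tie-breaking behaviour are assumptions chosen for brevity.

```python
# Textbook UPGMA sketch; not Jalview's implementation.
def upgma(names, dist):
    """Cluster leaves by average linkage.

    names: list of leaf labels.
    dist:  symmetric matrix (list of lists) of pairwise distances.
    Returns (tree, merges), where tree is a nested tuple and merges
    lists (left, right, height) for each join in order.
    """
    n = len(names)
    # Cluster id -> (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    merges = []
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        (a, b), best = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean, i.e. the plain average over all leaf pairs.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop distances involving the two clusters just merged.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        # In an ultrametric tree the join sits at half the distance.
        merges.append((tree_a, tree_b, best / 2))
        next_id += 1
    ((tree, _),) = clusters.values()
    return tree, merges


if __name__ == "__main__":
    names = ["A", "B", "C"]
    dist = [
        [0, 2, 6],
        [2, 0, 6],
        [6, 6, 0],
    ]
    tree, merges = upgma(names, dist)
    print(tree)
    print(merges)
```

Neighbour Joining follows the same merge loop but chooses the pair to join using a corrected distance that accounts for each cluster's average distance to all others, which is part of why it costs more per step.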
A newly calculated tree will be displayed in a new tree viewing
window. In addition, a new entry with the same tree viewer window
name will be added in the Sort menu so that the alignment can be
reordered to reflect the ordering of the leaves of the tree. If the
tree was calculated on a selected region of the alignment, then the
title of the tree view will reflect this.
External Sources for Phylogenetic Trees
A number of programs exist for the reliable construction of
phylogenetic trees, which can cope with large numbers of sequences,
use better distance methods and can perform bootstrapping. Jalview
can read Newick
format tree files using the 'Load Associated Tree' entry of the
alignment's File menu. Sequences in the alignment will be
automatically associated with nodes in the tree by matching Sequence
IDs to the tree's leaf names.
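Before loading an externally built tree, it can help to check which sequence IDs will find a matching leaf. The Python sketch below pulls leaf labels out of a Newick string and compares them with a list of IDs. It assumes exact, case-sensitive matching and unquoted leaf labels; Jalview's own parser and matching rules may differ, and the function names are hypothetical.

```python
import re

# Leaf labels follow "(" or ","; labels that follow ")" are internal
# node names or support values and are skipped. Quoted labels are not
# handled in this sketch.
_LEAF = re.compile(r"[(,]\s*([^(),:;\s][^(),:;]*)")


def leaf_names(newick):
    """Return the leaf labels of a Newick string, in file order."""
    return [name.strip() for name in _LEAF.findall(newick)]


def associate(sequence_ids, newick):
    """Split IDs and leaves into matched and unmatched groups.

    Returns (matched_ids, unmatched_ids, unmatched_leaves).
    """
    leaves = leaf_names(newick)
    leaf_set = set(leaves)
    id_set = set(sequence_ids)
    matched = [s for s in sequence_ids if s in leaf_set]
    unmatched_ids = [s for s in sequence_ids if s not in leaf_set]
    unmatched_leaves = [leaf for leaf in leaves if leaf not in id_set]
    return matched, unmatched_ids, unmatched_leaves


if __name__ == "__main__":
    newick = "((A:0.1,B:0.2)90:0.3,C:0.4);"
    print(leaf_names(newick))
    print(associate(["A", "C", "X"], newick))
```

Any IDs reported as unmatched usually point to naming differences between the alignment and the tree file, such as truncated or reformatted identifiers.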