Least squares inference in phylogeny generates aphylogenetic tree based on anobserved matrix of pairwisegenetic distances andoptionally a weightmatrix. The goal is to find a tree which satisfies the distance constraints asbest as possible.
The discrepancy between the observed pairwise distancesand the distances over a phylogenetic tree (i.e. the sumof the branch lengths in the path from leaf to leaf) is measured by
where the weights depend on the least squares method used.Least squaresdistance tree construction aims to find the tree (topology and branch lengths)with minimal S. This is a non-trivial problem. It involves searching thediscrete space of unrooted binary tree topologies whose size is exponential inthe number of leaves. For n leaves there are1 • 3 • 5 • ... • (2n-3)different topologies. Enumerating them is not feasible already for a smallnumber of leaves. Heuristic search methods are used to find a reasonablygood topology. The evaluation of S for a given topology (which includes thecomputation of the branch lengths) is alinear least squares problem.There are several ways to weight the squared errors,depending on the knowledge and assumptions about the variances of the observeddistances. When nothing is known about the errors, or if they are assumed to beindependently distributed and equal for all observed distances, then all theweights are set to one. This leads to an ordinary leastsquares estimate.In the weighted least squares case the errors are assumed to be independent(or their correlations are not known). Given independent errors, a particularweight should ideally be set to the inverse of the variance of the corresponding distanceestimate. Sometimes the variances may not be known, but theycan be modeled as a function of the distance estimates. In the Fitch andMargoliash method[1]for instance it is assumed that the variances are proportional to the squareddistances.
The ordinary and weighted least squares methods described aboveassume independent distance estimates. If the distancesare derived from genomic data their estimates covary, because evolutionaryevents on internalbranches (of the true tree) can push several distances up or down atthe same time. The resulting covariances can be taken into account using themethod of generalized least squares, i.e. minimizing the following quantity
where are the entries of the inverse of thecovariance matrix of the distance estimates.
Finding the tree and branch lengths minimizing the least squares residual is anNP-complete problem.[2] However, for a given tree, the optimal branch lengths can be determined in time for ordinary least squares, time for weighted least squares, and time for generalised least squares (given the inverse of thecovariance matrix).[3]