                                  fdnapars



Wiki

   The master copies of EMBOSS documentation are available at
   http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

   Please help by correcting and extending the Wiki pages.

Function

   DNA parsimony algorithm

Description

   Estimates phylogenies by the parsimony method using nucleic acid
   sequences. Allows use the full IUB ambiguity codes, and estimates
   ancestral nucleotide states. Gaps treated as a fifth nucleotide state.
   It can also fo transversion parsimony. Can cope with multifurcations,
   reconstruct ancestral states, use 0/1 character weights, and infer
   branch lengths

Algorithm

   This program carries out unrooted parsimony (analogous to Wagner trees)
   (Eck and Dayhoff, 1966; Kluge and Farris, 1969) on DNA sequences. The
   method of Fitch (1971) is used to count the number of changes of base
   needed on a given tree. The assumptions of this method are analogous to
   those of MIX:
    1. Each site evolves independently.
    2. Different lineages evolve independently.
    3. The probability of a base substitution at a given site is small
       over the lengths of time involved in a branch of the phylogeny.
    4. The expected amounts of change in different branches of the
       phylogeny do not vary by so much that two changes in a high-rate
       branch are more probable than one change in a low-rate branch.
    5. The expected amounts of change do not vary enough among sites that
       two changes in one site are more probable than one change in
       another.

   That these are the assumptions of parsimony methods has been documented
   in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b,
   1988b). For an opposing view arguing that the parsimony methods make no
   substantive assumptions such as these, see the papers by Farris (1983)
   and Sober (1983a, 1983b, 1988), but also read the exchange between
   Felsenstein and Sober (1986).

   Change from an occupied site to a deletion is counted as one change.
   Reversion from a deletion to an occupied site is allowed and is also
   counted as one change. Note that this in effect assumes that a deletion
   N bases long is N separate events.

   Dnapars can handle both bifurcating and multifurcating trees. In doing
   its search for most parsimonious trees, it adds species not only by
   creating new forks in the middle of existing branches, but it also
   tries putting them at the end of new branches which are added to
   existing forks. Thus it searches among both bifurcating and
   multifurcating trees. If a branch in a tree does not have any
   characters which might change in that branch in the most parsimonious
   tree, it does not save that tree. Thus in any tree that results, a
   branch exists only if some character has a most parsimonious
   reconstruction that would involve change in that branch. It also saves
   a number of trees tied for best (you can alter the

   number it saves using the V option in the menu). When rearranging
   trees, it tries rearrangements of all of the saved trees. This makes
   the algorithm slower than earlier versions of Dnapars.

   The input data is standard. The first line of the input file contains
   the number of species and the number of sites.

   Next come the species data. Each sequence starts on a new line, has a
   ten-character species name that must be blank-filled to be of that
   length, followed immediately by the species data in the one-letter
   code. The sequences must either be in the "interleaved" or "sequential"
   formats described in the Molecular Sequence Programs document. The I
   option selects between them. The sequences can have internal blanks in
   the sequence but there must be no extra blanks at the end of the
   terminated line. Note that a blank is not a valid symbol for a
   deletion.

Usage

   Here is a sample session with fdnapars


% fdnapars
DNA parsimony algorithm
Input (aligned) nucleotide sequence set(s): dnapars.dat
Phylip tree file (optional):
Phylip dnapars program output file [dnapars.fdnapars]:

Adding species:
   1. Alpha
   2. Beta
   3. Gamma
   4. Delta
   5. Epsilon

Doing global rearrangements on the first of the trees tied for best
  !---------!
   .........
   .........

Collapsing best trees
   .

Output written to file "dnapars.fdnapars"

Tree also written onto file "dnapars.treefile"

Done.



   Go to the input files for this example
   Go to the output files for this example

Command line arguments

DNA parsimony algorithm
Version: EMBOSS:6.4.0.0

   Standard (Mandatory) qualifiers:
  [-sequence]          seqsetall  File containing one or more sequence
                                  alignments
  [-intreefile]        tree       Phylip tree file (optional)
  [-outfile]           outfile    [*.fdnapars] Phylip dnapars program output
                                  file

   Additional (Optional) qualifiers (* if not always prompted):
   -weights            properties Weights file
   -maxtrees           integer    [10000] Number of trees to save (Integer
                                  from 1 to 1000000)
*  -[no]thorough       toggle     [Y] More thorough search
*  -[no]rearrange      boolean    [Y] Rearrange on just one best tree
   -transversion       boolean    [N] Use transversion parsimony
*  -njumble            integer    [0] Number of times to randomise (Integer 0
                                  or more)
*  -seed               integer    [1] Random number seed between 1 and 32767
                                  (must be odd) (Integer from 1 to 32767)
   -outgrno            integer    [0] Species number to use as outgroup
                                  (Integer 0 or more)
   -thresh             toggle     [N] Use threshold parsimony
*  -threshold          float      [1.0] Threshold value (Number 1.000 or more)
   -[no]trout          toggle     [Y] Write out trees to tree file
*  -outtreefile        outfile    [*.fdnapars] Phylip tree output file
                                  (optional)
   -printdata          boolean    [N] Print data at start of run
   -[no]progress       boolean    [Y] Print indications of progress of run
   -stepbox            boolean    [N] Print out steps in each site
   -ancseq             boolean    [N] Print sequences at all nodes of tree
   -[no]treeprint      boolean    [Y] Print out tree
*  -[no]dotdiff        boolean    [Y] Use dot differencing to display results

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   "-outtreefile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit


Input file format

   fdnapars reads any normal sequence USAs.

  Input files for usage example

  File: dnapars.dat

   5   13
Alpha     AACGUGGCCAAAU
Beta      AAGGUCGCCAAAC
Gamma     CAUUUCGUCACAA
Delta     GGUAUUUCGGCCU
Epsilon   GGGAUCUCGGCCC

Output file format

   fdnapars output is standard: if option 1 is toggled on, the data is
   printed out, with the convention that "." means "the same as in the
   first species". Then comes a list of equally parsimonious trees. Each
   tree has branch lengths. These are computed using an algorithm
   published by Hochbaum and Pathria (1997) which I first heard of from
   Wayne Maddison who invented it independently of them. This algorithm
   averages the number of reconstructed changes of state over all sites a
   over all possible most parsimonious placements of the changes of state
   among branches. Note that it does not correct in any way for multiple
   changes that overlay each other. If option 2 is toggled on a table of
   the number of changes of state required in each character is also
   printed. If option 5 is toggled on, a table is printed out after each
   tree, showing for each branch whether there are known to be changes in
   the branch, and what the states are inferred to have been at the top
   end of the branch. This is a reconstruction of the ancestral sequences
   in the tree. If you choose option 5, a menu item "." appears which
   gives you the opportunity to turn off dot-differencing so that complete
   ancestral sequences are shown. If the inferred state is a "?" or one of
   the IUB ambiguity symbols, there will be multiple equally-parsimonious
   assignments of states; the user must work these out for themselves by
   hand. A "?" in the reconstructed states means that in addition to one
   or more bases, a deletion may or may not be present. If option 6 is
   left in its default state the trees found will be written to a tree
   file, so that they are available to be used in other programs. If the U
   (User Tree) option is used and more than one tree is supplied, the
   program also performs a statistical test of each of these trees against
   the best tree. This test, which is a version of the test proposed by
   Alan Templeton (1983) and evaluated in a test case by me (1985a). It is
   closely parallel to a test using log likelihood differences due to
   Kishino and Hasegawa (1989), and uses the mean and variance of step
   differences between trees, taken across sites. If the mean is more than
   1.96 standard deviations different then the trees are declared
   significantly different. The program prints out a table of the steps
   for each tree, the differences of each from the best one, the variance
   of that quantity as determined by the step differences at individual
   sites, and a conclusion as to whether that tree is or is not
   significantly worse than the best one. If the U (User Tree) option is
   used and more than one tree is supplied, and the program is not told to
   assume autocorrelation between the rates at different sites, the
   program also performs a statistical test of each of these trees against
   the one with highest likelihood. If there are two user trees, this is a
   version of the test proposed by Alan Templeton (1983) and evaluated in
   a test case by me (1985a). It is closely parallel to a test using log
   likelihood differences due to Kishino and Hasegawa (1989) It uses the
   mean and variance of the differences in the number of steps between
   trees, taken across sites. If the two trees' means are more than 1.96
   standard deviations different, then the trees are declared
   significantly different. If there are more than two trees, the test
   done is an extension of the KHT test, due to Shimodaira and Hasegawa
   (1999). They pointed out that a correction for the number of trees was
   necessary, and they introduced a resampling method to make this
   correction. In the version used here the variances and covariances of
   the sums of steps across sites are computed for all pairs of trees. To
   test whether the difference between each tree and the best one is
   larger than could have been expected if they all had the same expected
   number of steps, numbers of steps for all trees are sampled with these
   covariances and equal means (Shimodaira and Hasegawa's "least favorable
   hypothesis"), and a P value is computed from the fraction of times the
   difference between the tree's value and the lowest number of steps
   exceeds that actually observed. Note that this sampling needs random
   numbers, and so the program will prompt the user for a random number
   seed if one has not already been supplied. With the two-tree KHT test
   no random numbers are used. In either the KHT or the SH test the
   program prints out a table of the number of steps for each tree, the
   differences of each from the lowest one, the variance of that quantity
   as determined by the differences of the numbers of steps at individual
   sites, and a conclusion as to whether that tree is or is not
   significantly worse than the best one. Option 6 in the menu controls
   whether the tree estimated by the program is written onto a tree file.
   The default name of this output tree file is "outtree". If the U option
   is in effect, all the user-defined trees are written to the output tree
   file. If the program finds multiple trees tied for best, all of these
   are written out onto the output tree file. Each is followed by a
   numerical weight in square brackets (such as [0.25000]). This is needed
   when we use the trees to make a consensus tree of the results of
   bootstrapping or jackknifing, to avoid overrepresenting replicates that
   find many tied trees.

  Output files for usage example

  File: dnapars.fdnapars


DNA parsimony algorithm, version 3.69


One most parsimonious tree found:


                                            +-----Epsilon
               +----------------------------3
  +------------2                            +-------Delta
  |            |
  |            +----------------Gamma
  |
  1----Beta
  |
  +---------Alpha


requires a total of     19.000

  between      and       length
  -------      ---       ------
     1           2       0.217949
     2           3       0.487179
     3      Epsilon      0.096154
     3      Delta        0.134615
     2      Gamma        0.275641
     1      Beta         0.076923
     1      Alpha        0.173077


  File: dnapars.treefile

(((Epsilon:0.09615,Delta:0.13462):0.48718,Gamma:0.27564):0.21795,
Beta:0.07692,Alpha:0.17308);

Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

                    Program name                         Description
                    distmat      Create a distance matrix from a multiple sequence alignment
                    ednacomp     DNA compatibility algorithm
                    ednadist     Nucleic acid sequence Distance Matrix program
                    ednainvar    Nucleic acid sequence Invariants method
                    ednaml       Phylogenies from nucleic acid Maximum Likelihood
                    ednamlk      Phylogenies from nucleic acid Maximum Likelihood with clock
                    ednapars     DNA parsimony algorithm
                    ednapenny    Penny algorithm for DNA
                    eprotdist    Protein distance algorithm
                    eprotpars    Protein parsimony algorithm
                    erestml      Restriction site Maximum Likelihood method
                    eseqboot     Bootstrapped sequences algorithm
                    fdiscboot    Bootstrapped discrete sites algorithm
                    fdnacomp     DNA compatibility algorithm
                    fdnadist     Nucleic acid sequence Distance Matrix program
                    fdnainvar    Nucleic acid sequence Invariants method
                    fdnaml       Estimates nucleotide phylogeny by maximum likelihood
                    fdnamlk      Estimates nucleotide phylogeny by maximum likelihood
                    fdnamove     Interactive DNA parsimony
                    fdnapenny    Penny algorithm for DNA
                    fdolmove     Interactive Dollo or Polymorphism Parsimony
                    ffreqboot    Bootstrapped genetic frequencies algorithm
                    fproml       Protein phylogeny by maximum likelihood
                    fpromlk      Protein phylogeny by maximum likelihood
                    fprotdist    Protein distance algorithm
                    fprotpars    Protein parsimony algorithm
                    frestboot    Bootstrapped restriction sites algorithm
                    frestdist    Distance matrix from restriction sites or fragments
                    frestml      Restriction site maximum Likelihood method
                    fseqboot     Bootstrapped sequences algorithm
                    fseqbootall  Bootstrapped sequences algorithm

Author(s)

                    This program is an EMBOSS conversion of a program written by Joe
                    Felsenstein as part of his PHYLIP package.

                    Please report all bugs to the EMBOSS bug team
                    (emboss-bug (c) emboss.open-bio.org) not to the original author.

History

                    Written (2004) - Joe Felsenstein, University of Washington.

                    Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

                    This program is intended to be used by everyone and everything, from
                    naive users to embedded scripts.
