                                   ffactor



Wiki

   The master copies of EMBOSS documentation are available at
   http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

   Please help by correcting and extending the Wiki pages.

Function

   Multistate to binary recoding program

Description

   Takes discrete multistate data with character state trees and produces
   the corresponding data set with two states (0 and 1). Written by
   Christopher Meacham. This program was formerly used to accommodate
   multistate characters in MIX, but this is less necessary now that PARS
   is available.

Algorithm

   This program factors a data set that contains multistate characters,
   creating a data set consisting entirely of binary (0,1) characters
   that, in turn, can be used as input to any of the other discrete
   character programs in this package, except for PARS. Besides this
   primary function, FACTOR also provides an easy way of deleting
   characters from a data set. The input format for FACTOR is very similar
   to the input format for the other discrete character programs except
   for the addition of character-state tree descriptions.

   Note that this program has no way of converting an unordered multistate
   character into binary characters. Fortunately, PARS has joined the
   package, and it enables unordered multistate characters, in which any
   state can change to any other in one step, to be analyzed with
   parsimony.

   FACTOR is really for a different case, that in which there are multiple
   states related on a "character state tree", which specifies for each
   state which other states it can change to. That graph of states is
   assumed to be a tree, with no loops in it.
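
   The text above does not spell out how states on a character-state tree
   become binary characters. One common scheme for a rooted tree is
   additive binary coding: each edge of the tree becomes one binary
   factor, and a state scores 1 for every edge on its path back to the
   root. The sketch below illustrates that scheme only. It is an
   assumption made for illustration, not a transcription of the FACTOR
   source; the function name additive_binary_coding is hypothetical, and
   the sketch assumes the root state is supplied by the caller.

```python
from collections import defaultdict, deque


def additive_binary_coding(pairs, root):
    """Map each state to a tuple of 0/1 values, one value per tree edge.

    pairs: adjacent state pairs, e.g. [("A", "B"), ("B", "C")]
    root:  the ancestral state, which is coded as all zeros
    """
    neighbours = defaultdict(list)
    for a, b in pairs:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Breadth-first walk from the root. Each newly reached state owns the
    # factor for the edge that leads to it from its parent.
    parent = {root: None}
    factor_of = {}
    queue = deque([root])
    while queue:
        state = queue.popleft()
        for nxt in neighbours[state]:
            if nxt not in parent:
                parent[nxt] = state
                factor_of[nxt] = len(factor_of)
                queue.append(nxt)

    codes = {}
    for state in parent:
        bits = [0] * len(factor_of)
        walk = state
        while parent[walk] is not None:  # mark every edge on the path to the root
            bits[factor_of[walk]] = 1
            walk = parent[walk]
        codes[state] = tuple(bits)
    return codes


# Four states joined at B and rooted at C give three binary factors.
codes = additive_binary_coding([("B", "D"), ("A", "B"), ("C", "B")], root="C")
for state in sorted(codes):
    print(state, codes[state])
```

   Under this scheme a character with k states always yields k - 1 binary
   factors, because a tree with k nodes has k - 1 edges.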

Usage

   Here is a sample session with ffactor


% ffactor
Multistate to binary recoding program
Phylip factor program input file: factor.dat
Phylip factor program output file [factor.ffactor]:


Data matrix written on file "factor.ffactor"

Done.




Command line arguments

Multistate to binary recoding program
Version: EMBOSS:6.4.0.0

   Standard (Mandatory) qualifiers:
  [-infile]            infile     Phylip factor program input file
  [-outfile]           outfile    [*.ffactor] Phylip factor program output
                                  file

   Additional (Optional) qualifiers:
   -anc                boolean    [N] Put ancestral states in output file
   -factors            boolean    [N] Put factors information in output file
   -[no]progress       boolean    [Y] Print indications of progress of run

   Advanced (Unprompted) qualifiers:
   -outfactorfile      outfile    [*.ffactor] Phylip factor data output file
                                  (optional)
   -outancfile         outfile    [*.ffactor] Phylip ancestor data output file
                                  (optional)

   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   "-outfactorfile" associated qualifiers
   -odirectory         string     Output directory

   "-outancfile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit
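
   As an illustration of how the qualifiers above combine, the following
   sketch assembles a non-interactive ffactor command line from Python.
   It uses only qualifiers listed in the table (-infile, -outfile, -anc,
   -factors, -auto). The helper name ffactor_command is hypothetical, and
   the sketch only builds the argument list; it does not run the program.

```python
import shlex


def ffactor_command(infile, outfile, anc=False, factors=False):
    """Build an argument list for a non-interactive ffactor run."""
    argv = ["ffactor", "-infile", infile, "-outfile", outfile]
    if anc:
        argv.append("-anc")       # put ancestral states in the output file
    if factors:
        argv.append("-factors")   # put factors information in the output file
    argv.append("-auto")          # turn off prompts
    return argv


argv = ffactor_command("factor.dat", "factor.ffactor", anc=True)
print(shlex.join(argv))
```

   If ffactor is installed, the returned list can be handed to
   subprocess.run(argv, check=True).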


Input file format

   ffactor reads character state tree data.

  First line

   The first line of the input file should contain the number of species
   and the number of multistate characters. This first line is followed by
   the lines describing the character-state trees, one description per
   line. The species information constitutes the last part of the file.
   Any number of lines may be used for a single species.

   The first line is free format with the number of species first,
   separated by at least one blank (space) from the number of multistate
   characters, which in turn is separated by at least one blank from the
   options, if present.
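
   A minimal sketch of reading this free-format first line is shown
   below. The function name parse_first_line is hypothetical, and the
   sketch simply returns whatever follows the two counts as an options
   string without interpreting it.

```python
def parse_first_line(line):
    """Return (n_species, n_characters, options) from the first input line."""
    fields = line.split()             # any run of blanks separates the fields
    if len(fields) < 2:
        raise ValueError("first line needs species and character counts")
    n_species = int(fields[0])
    n_characters = int(fields[1])
    options = "".join(fields[2:])     # empty string when no options are given
    return n_species, n_characters, options


print(parse_first_line("   4   6"))
```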

  Character-state tree descriptions

   The character-state trees are described in free format. The character
   number of the multistate character is given first followed by the
   description of the tree itself. Each description must be completed on a
   single line. Each character that is to be factored must have a
   description, and the characters must be described in the order that
   they occur in the input, that is, in numerical order.

   The tree is described by listing the pairs of character states that are
   adjacent to each other in the character-state tree. The two character
   states in each adjacent pair are separated by a colon (":"). If
   character fifteen has this character state tree for possible states
   "A", "B", "C", and "D":

                         A ---- B ---- C
                                |
                                |
                                |
                                D

   then the character-state tree description would be

                        15  A:B B:C D:B

   Note that either symbol may appear first. The ancestral state is
   identified, if desired, by putting it "adjacent" to a period. If we
   wanted to root character fifteen at state C:

                         A <--- B <--- C
                                |
                                |
                                V
                                D

   we could write

                      15  B:D A:B C:B .:C

   Both the order in which the pairs are listed and the order of the
   symbols in each pair are arbitrary. However, each pair may appear only
   once in the list. Any symbol may be used for a character state in the
   input except the character that signals the connection between two
   states (in the distribution copy this is set to ":"), ".", and, of
   course, a blank. Blanks are ignored completely in the tree description
   so that even B:DA:BC:B.:C or B : DA : BC : B. : C would be equivalent
   to the above example. However, at least one blank must separate the
   character number from the tree description.
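
   The rules above can be sketched as a small parser. This is an
   illustrative reading of the format, not the program's own code: the
   name parse_tree_description is hypothetical, and the sketch assumes
   every state is a single symbol, which is what allows blanks inside the
   description to be ignored.

```python
def parse_tree_description(line, connector=":", root_marker="."):
    """Return (character_number, pairs, root) for one description line.

    pairs is a list of (state, state) tuples; root is None when no
    ancestral state is marked.
    """
    fields = line.split(None, 1)      # the number ends at the first blank
    number = int(fields[0])
    rest = fields[1] if len(fields) > 1 else ""
    compact = "".join(rest.split())   # blanks in the description are ignored
    if len(compact) % 3 != 0:
        raise ValueError("description is not a sequence of X:Y pairs")

    pairs, root = [], None
    for i in range(0, len(compact), 3):
        left, sep, right = compact[i], compact[i + 1], compact[i + 2]
        if sep != connector:
            raise ValueError(f"expected {connector!r} in {compact[i:i + 3]!r}")
        if root_marker in (left, right):
            # The state paired with "." is the ancestral state.
            root = right if left == root_marker else left
        else:
            pairs.append((left, right))
    return number, pairs, root


print(parse_tree_description("15  B:D A:B C:B .:C"))
```

   The three spellings given above (with normal spacing, with no blanks,
   and with blanks around every colon) all parse to the same result.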

  Deleting characters from a data set

   If no description line appears in the input for a particular character,
   then that character will be omitted from the output. If the character
   number is given on the line, but no character-state tree is provided,
   then the symbol for the character in the input will be copied directly
   to the output without change. This is useful for characters that are
   already coded "0" and "1". Characters can be deleted from a data set
   simply by listing only those that are to appear in the output.
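
   The three cases (factored, copied unchanged, omitted) can be
   illustrated with a short sketch. The function name
   classify_characters and the action labels are hypothetical; the
   description lines used are those of the factor.dat example.

```python
def classify_characters(n_characters, description_lines):
    """Return {character_number: action} for every input character.

    action is "factor" (a tree is given), "copy" (number only),
    or "omit" (no description line at all).
    """
    actions = {n: "omit" for n in range(1, n_characters + 1)}
    for line in description_lines:
        fields = line.split(None, 1)
        number = int(fields[0])
        actions[number] = "factor" if len(fields) > 1 else "copy"
    return actions


lines = ["1 A:B B:C", "2 A:B B:.", "4", "5 0:1 1:2 .:0", "6 .:# #:$ #:%"]
print(classify_characters(6, lines))
```

   For this example, character 3 has no description line and is dropped,
   while character 4 is listed without a tree and is copied as is.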

  Terminating the list of tree descriptions

   The last character-state tree description should be followed by a line
   containing the number "999". This terminates processing of the trees
   and indicates the beginning of the species information.
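
   A minimal sketch of splitting an input file at the "999" line follows.
   The function name split_input is hypothetical; the sample text is the
   factor.dat example.

```python
def split_input(text):
    """Split input text into (first_line, tree_lines, species_lines)."""
    lines = text.splitlines()
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "999":     # terminator for the tree descriptions
            return lines[0], lines[1:i], lines[i + 1:]
    raise ValueError("no terminating 999 line found")


SAMPLE = """   4   6
1 A:B B:C
2 A:B B:.
4
5 0:1 1:2 .:0
6 .:# #:$ #:%
999
Alpha     CAW00#
Beta      BBX01%
Gamma     ABY12#
Epsilon   CAZ01$
"""

first_line, trees, species = split_input(SAMPLE)
print(len(trees), "description lines,", len(species), "species lines")
```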

  Species information

   The format for the species information is basically identical to the
   other discrete character programs. The first ten character positions
   are allotted to the species name (this value may be changed by altering
   the value of the constant nmlngth at the beginning of the program). The
   character states follow and may be continued to as many lines as
   desired. There is currently no method for indicating polymorphisms.
   Blanks between character states are optional.

   There is a method for indicating uncertainty about states. There is one
   character value that stands for "unknown". If this appears in the input
   data then "?" is written out in all the corresponding positions in the
   output file. The character value that designates "unknown" is given in
   the constant unkchar at the beginning of the program, and can be
   changed by changing that constant. It is set to "?" in the distribution
   copy.
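
   The species layout can be sketched as follows. The function name
   parse_species is hypothetical, and the sketch assumes that
   continuation lines hold character states only (no name field). The
   "unknown" symbol needs no special handling at this stage; it is read
   like any other state.

```python
def parse_species(lines, n_characters, name_length=10):
    """Return a list of (name, states) tuples from the species lines."""
    species = []
    rows = iter(lines)
    for line in rows:
        if not line.strip():
            continue
        name = line[:name_length].strip()     # first ten character positions
        states = [c for c in line[name_length:] if not c.isspace()]
        while len(states) < n_characters:     # states continue on later lines
            extra = next(rows, None)
            if extra is None:
                raise ValueError(f"too few character states for {name}")
            states.extend(c for c in extra if not c.isspace())
        species.append((name, "".join(states)))
    return species


rows = ["Alpha     CAW00#", "Beta      BBX0", "1%", "Gamma     A B Y 1 2 #"]
print(parse_species(rows, 6))
```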

  Input files for usage example

  File: factor.dat

   4   6
1 A:B B:C
2 A:B B:.
4
5 0:1 1:2 .:0
6 .:# #:$ #:%
999
Alpha     CAW00#
Beta      BBX01%
Gamma     ABY12#
Epsilon   CAZ01$

Output file format

   The first line of ffactor output will contain the number of species and
   the number of binary characters in the factored data set followed by
   the letter "A" if the A option (ancestral states, -anc) was specified.
   If option F (factors, -factors) was specified, the next line will begin
   "FACTORS". If option A was specified, the line describing the ancestor
   will follow next. Finally,
   the factored characters will be written for each species in the format
   required for input by the other discrete programs in the package. The
   maximum length of the output lines is 80 characters, but this maximum
   length can be changed prior to compilation.

   In fact, the output format produced by the A and F options is not
   correct for the current release of PHYLIP. These options need to be
   changed to write a separate factors file and ancestors file instead of
   putting the Factors and Ancestors information into the data file.

  Output files for usage example

  File: factor.ffactor

    4    5
Alpha     CA00#
Beta      BB01%
Gamma     AB12#
Epsilon   CA01$

  File: factor.factor


  File: factor.ancestor


Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   The output should be checked for error messages. Errors will occur in
   the character-state tree descriptions if the format is incorrect
   (colons in the wrong place, etc.), if more than one root is specified,
   if the tree contains loops (and hence is not a tree), and if the tree
   is not connected, e.g.


                             A:B B:C D:E

   describes

                  A ---- B ---- C          D ---- E

   This "tree" is in two unconnected pieces. An error will also occur if a
   symbol appears in the data set that is not in the tree description for
   that character. Blanks at the end of lines when the species information
   is continued to a new line will cause this kind of error.
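
   The structural checks described above (repeated pairs, loops, and
   unconnected pieces) can be sketched with a union-find pass over the
   adjacent pairs. This is an illustration only: the function name
   check_tree and the message wording are hypothetical and do not
   reproduce the program's actual error messages.

```python
def check_tree(pairs):
    """Return a list of problems found in a list of (state, state) pairs."""
    problems = []
    group = {}                        # union-find parent pointers

    def find(state):
        group.setdefault(state, state)
        while group[state] != state:
            group[state] = group[group[state]]   # path halving
            state = group[state]
        return state

    seen = set()
    for a, b in pairs:
        key = frozenset((a, b))       # A:B and B:A are the same pair
        if key in seen:
            problems.append(f"pair {a}:{b} appears more than once")
            continue
        seen.add(key)
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            problems.append(f"pair {a}:{b} closes a loop")
        else:
            group[root_a] = root_b
    if len({find(state) for state in list(group)}) > 1:
        problems.append("tree is not connected")
    return problems


print(check_tree([("A", "B"), ("B", "C"), ("D", "E")]))
```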

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

                    Program name                Description
                    eclique      Largest clique program
                    edollop      Dollo and polymorphism parsimony algorithm
                    edolpenny    Penny algorithm Dollo or polymorphism
                    efactor      Multistate to binary recoding program
                    emix         Mixed parsimony algorithm
                    epenny       Penny algorithm, branch-and-bound
                    fclique      Largest clique program
                    fdollop      Dollo and polymorphism parsimony algorithm
                    fdolpenny    Penny algorithm Dollo or polymorphism
                    fmix         Mixed parsimony algorithm
                    fmove        Interactive mixed method parsimony
                    fpars        Discrete character parsimony
                    fpenny       Penny algorithm, branch-and-bound

Author(s)

   This program is an EMBOSS conversion of a program written by Joe
   Felsenstein as part of his PHYLIP package.

   Please report all bugs to the EMBOSS bug team
   (emboss-bug (c) emboss.open-bio.org) not to the original author.

History

   Written (2004) - Joe Felsenstein, University of Washington.

   Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.
