.. role:: bolditalic .. role:: hlight .. _config: ********************** The Configuration File ********************** .. _sample_config: Configuration files are used to control analysis settings for |dpp-msbayes|_ and |msbayes|_. |pmb|_ allows you to use |dpp-msbayes|_ and |msbayes|_ configuration files (and their respective models) interchangeably. Here is an example of a configuration file for |dpp-msbayes|_:: concentrationShape = 1000.0 concentrationScale = 0.00437 thetaShape = 4.0 thetaScale = 0.001 ancestralThetaShape = 0 ancestralThetaScale = 0 thetaParameters = 000 tauShape = 1.0 tauScale = 0.02 timeInSubsPerSite = 1 bottleProportionShapeA = 0 bottleProportionShapeB = 0 bottleProportionShared = 0 migrationShape = 0 migrationScale = 0 numTauClasses = 0 BEGIN SAMPLE_TBL species-1 locus-1 1.0 1.0 10 8 32.42 389 0.27 0.24 0.26 species-1-locus-1.fasta species-1 locus-2 1.0 1.0 8 6 5.51 500 0.25 0.22 0.24 species-1-locus-2.fasta species-1 locus-3 1.0 1.0 6 8 8.38 524 0.26 0.23 0.26 species-1-locus-3.fasta species-1 locus-4 1.0 1.0 8 10 5.20 345 0.25 0.23 0.24 species-1-locus-4.fasta species-1 locus-5 1.0 1.0 8 8 29.59 417 0.27 0.23 0.21 species-1-locus-5.fasta species-1 mito-1 0.25 4.0 5 5 8.15 600 0.22 0.24 0.27 species-1-mito-1.fasta species-2 locus-1 1.0 1.0 6 10 7.53 400 0.25 0.24 0.26 species-2-locus-1.fasta species-2 locus-3 1.0 1.0 10 8 11.14 550 0.27 0.22 0.24 species-2-locus-3.fasta species-2 locus-4 1.0 1.0 8 8 9.39 350 0.24 0.24 0.23 species-2-locus-4.fasta species-2 locus-5 1.0 1.0 10 10 13.32 450 0.26 0.24 0.22 species-2-locus-5.fasta species-2 mito-1 0.25 4.0 4 5 7.59 549 0.23 0.26 0.23 species-2-mito-1.fasta species-3 locus-1 1.0 1.0 10 6 17.03 367 0.25 0.23 0.27 species-3-locus-1.fasta species-3 locus-3 1.0 1.0 8 10 59.17 541 0.26 0.22 0.25 species-3-locus-3.fasta species-3 locus-4 1.0 1.0 6 8 6.90 333 0.28 0.23 0.21 species-3-locus-4.fasta species-3 mito-1 0.25 4.0 5 4 11.42 587 0.22 0.22 0.25 species-3-mito-1.fasta END SAMPLE_TBL A configuration file has two parts: #. The preamble_ containing keyword arguments for specifying prior probability distributions of model parameters. #. The sample_table_ containing information and parameter values for each sequence alignment. This section is delimited by the ``BEGIN SAMPLE_TBL`` and ``END SAMPLE_TBL`` lines. .. _preamble: The Preamble ============ There are two main differences to the preamble between |pmb|_ and |msbayes|_. #. The |dpp-msbayes|_ preamble gives you more (and different) options to control the number of parameters in the model and the priors on those parameters. #. The |msbayes|_- and |dpp-msbayes|_-style preambles are processed differently in |pmb|_: The keywords in the preamble are case insensitive, and if any unrecognized keywords are encountered, an error is reported and the process exits (i.e., crashes). In |msbayes|_, the keywords are case-sensitive and, more importantly, the the default value for a setting is :hlight:`quietly` used when any options are mis-typed. For example, if you specify ``uppertheta = 0.01`` in an |msbayes|_ config file (note the lower case "t"), it will quietly use the default setting for ``upperTheta`` and :hlight:`not` report any warning or error. So, if you prefer the model implemented in |msbayes|_, |pmb|_ is useful as a "safer" interface for it. Below we walk through all of the options for the |dpp-msbayes|_ preamble. If you want to use the |msbayes|_ model with |pmb|_, please see the `msBayes documentation `_ for information about the preamble. .. contents:: Preamble Elements :local: .. _concentration_parameter_prior: ``concentrationShape`` / ``concentrationScale`` ----------------------------------------------- The shape and scale parameters of a gamma-distributed prior on the concentration parameter of the Dirichlet-process prior on divergence models. See the ":ref:`dpp`" section for more information on the Dirichlet process and the concentration parameter that controls it. .. _theta_parameterization: ``thetaParameters`` ------------------- |dpp-msbayes|_ gives you full control over the parameterization of the population sizes for each pair of populations. We will refer to the sizes of the ancestral and two descendant populations as :math:`\ancestralTheta{}`, :math:`\descendantTheta{1}{}`, and :math:`\descendantTheta{2}{}`. This setting is controlled by a sequence of three integers that always starts with ``0`` and increments by 1 whenever a free parameter is added. The first, second, and third integers represent :math:`\descendantTheta{1}{}`, :math:`\descendantTheta{2}{}`, and :math:`\ancestralTheta{}`, respectively. ``000`` is one extreme, where :math:`\ancestralTheta{}`, :math:`\descendantTheta{1}{}`, and :math:`\descendantTheta{2}{}` are all constrained to be equal for each population pair (population sizes will still vary among pairs). ``012`` is the other extreme, where :math:`\ancestralTheta{}`, :math:`\descendantTheta{1}{}`, and :math:`\descendantTheta{2}{}` are all estimated as independent parameters for each pair of populations. This is most similar to the original |msbayes|_. However, the descendant population sizes are *not* constrained to be negatively correlated, as they are in |msbayes|_ (see :cite:`Oaks2014dpp` and :cite:`Oaks2012`). Another example is ``001``: the descendant populations share the same size parameter, but the ancestral population size is free to vary. For ``011`` and ``010``, one of the descendant population is constrained to the same size as the ancestral, and the other is free to vary. .. _theta_prior: ``thetaShape`` / ``thetaScale`` ------------------------------- These settings define the shape and scale parameters of a gamma prior on the effective population sizes. Population sizes are scaled by the per-site mutation rate (:math:`\mu`): :math:`4\effectivePopSize\mutationRate`. .. _ancestral_theta_prior: ``ancestralThetaShape`` / ``ancestralThetaScale`` ------------------------------------------------- If these settings are both provided, and are both positive, they define the shape and scale parameters of a gamma prior on the effective population size of the ancestral population of each pair of populations. If they are excluded, or both are zero, the :ref:`theta_prior` settings are used for the gamma prior on ancestral population sizes (i.e., all population size parameters have the same prior). .. note:: :class: keypoint If you specify a setting such as ``000`` for the :ref:`theta_parameterization` option, the ancestral population size is constrained to be the same as the descendant populations, and thus the :ref:`ancestral_theta_prior` options will be overridden and have no affect. .. _divergence_time_prior: ``tauShape`` / ``tauScale`` --------------------------- These settings define the shape and scale parameters of a gamma prior on divergence times. See the :ref:`timescale_setting` setting for the information on the time units. .. _timescale_setting: ``timeInSubsPerSite`` --------------------- This setting controls the time scale of the model and has two settings: * ``timeInSubsPerSite = 1``: Time units are in expected substitutions per site. For example, a divergence of 0.05 means that, on average, 5% of sites have changed since the populations diverged (so you expect 10% divergence between the populations since the population divergence). Thus, you can convert these units to the number of generations or years by dividing by the mutation rate. * ``timeInSubsPerSite = 0``: Time units are in coalescent units, :math:`\globalcoalunit` generations, where :math:`\globalpopsize` is the size of a constant reference population based on the mean of the prior on theta (defined by the :ref:`theta_prior` settings ). If we use :math:`\globaltheta` to represent the mean of the theta prior, then .. math:: \globalcoalunit = \frac{\globaltheta}{\mutationRate}, where :math:`\mutationRate` is the per-site mutation rate. Thus, you can convert these ":math:`\globalcoalunit` generations" units to the number of generations by assuming a mutation rate and multiplying by :math:`(\globaltheta/\mutationRate)`. See :cite:`Oaks2014dpp` for more details. |pmb|_ supports the use of the ``timeInSubsPerSite`` option for |msbayes|_ configuration files as well. .. note:: :hlight:`Why use the mean of the prior on theta to scale time?` I have no idea. This is legacy from |msbayes|_, and is the default setting. However, I strongly discourage using this time scale, because it makes it very difficult to compare results across analyses with different settings for the theta_prior_. It also requires you to re-scale the divergence_time_prior_ every time you change the theta_prior_. Having time in units of expected substitutions per site is much more straight forward (i.e., ``timeInSubsPerSite = 1``). .. _bottleneck_prior: ``bottleProportionShapeA`` / ``bottleProportionShapeB`` ------------------------------------------------------- If both are positive, these settings define a beta prior on the magnitude of a post-divergence bottleneck in each of the descendant populations. ``bottleProportionShapeA`` and ``bottleProportionShapeB`` correspond to the shape parameters alpha and beta, respectively, of the beta prior. The bottleneck magnitude is the proportion of the effective population size that remains during the bottleneck. For example, a value of 0.95 would mean that bottleneck reduces the effective population size by 5%. If either or both are zero or less, there is no post-divergence population bottleneck in the descendant populations (i.e., the bottleneck-magnitude parameters, along with the timing of each bottleneck, are removed from the model). .. note:: There are also parameters in the model for the timing of the end of the bottleneck (it begins at speciation in forward time). There is one of these parameters for each pair of populations (i.e., the descendant populations of each pair share the same bottleneck-end-time parameter). Thus if either or both of the :ref:`bottleneck_prior` settings are zero or less, you are also removing these bottleneck timing parameters from the model. This means you are removing :math:`3\npairs{}` parameters from the model, where ":math:`\npairs{}`" is the number of pairs of populations. .. _bottleneck_parameterization: ``bottleProportionShared`` -------------------------- If ``bottleProportionShared = 0``, then there are two free bottleneck-magnitude parameters for each population pair (one for each descendant population). If ``bottleProportionShared = 1``, then there is one bottleneck-magnitude parameter for each population pair (i.e., the descendant populations of each pair share the same bottleneck magnitude; the bottleneck magnitude still varies among the pairs). .. note:: This setting is overridden if either or both of the ``bottleProportionShapeA`` and ``bottleProportionShapeB`` settings is zero or less (because then there is no bottleneck at all). ``migrationShape`` / ``migrationScale`` --------------------------------------- These settings define the shape and scale parameters of a gamma prior on the symmetric migration between the descendant populations of each pair (in units of the number of gene copies per generation). If either or both settings are zero or less, there is no migration in the model. ``recombinationShape`` / ``recombinationScale`` ----------------------------------------------- These settings define the shape and scale parameters of a gamma prior on the intragenic recombination rate. If either or both are zero or less, there is no recombination in the model. .. note:: I recommend :hlight:`not` including intragenic recombination in the model, because the current implementation is very inefficient and poorly tested. ``numTauClasses`` ----------------- If this setting is zero (the default), the number of divergence events is free to vary according to the Dirichlet process prior on divergence models. If it is greater than zero, then the model is constrained to ``numTauClasses`` divergence events. This is useful for simulation-based power analyses, but should not be used for empirical analyses. .. _sample_table: The Sample Table ================ Delimited by the ``BEGIN SAMPLE_TBL`` and ``END SAMPLE_TBL`` lines is the sample table section of our :ref:`example configuration file` above. This is a tab-delimited table where each column represents information or parameter values for each row, which represents a locus for a particular taxon (i.e., each row will correspond to a sequence alignment we wish to include in our analysis). .. _sample_table_columns: .. admonition:: Sample Table Columns :class: definitions Column 1---Taxon label A unique label must be used to identify each pair of populations in the analysis. Column 2---Locus label A unique label must be used to identify each locus in the analysis. Column 3---Ploidy and/or generation-time multiplier The number in this column is used to scale for differences in ploidy among loci or for differences in generation-times among taxa. In our :ref:`example configuration file` 1.0 is used for loci from a diploid nuclear genome, whereas 0.25 is used for a region of the mitochondrial genome (because its haploid and maternally inherited). However, if a taxon "species-3" had :math:`1/4` the generation times of the other two taxa, we would specify "1.0" for the third column for its mitochondrial locus, and "4.0" for the third column for its nuclear loci. Column 4---Mutation-rate multiplier The number in this column is used to scale for differences in mutation rates among taxa and/or loci. In our :ref:`example configuration file`, we are assuming that the mitochondrial locus ("mito-1") is evolving four-times faster than the other loci (hence, the "4.0" in this fourth column for the three rows representing the mitochondrial locus). Columns 5 and 6---Number of gene copies from Populations 1 and 2 For example, for "locus-1" of "species-1" in our :ref:`example configuration file`, we have "10" and "8" in Columns 5 and 6. Thus, the first 10 sequences in the "``species-1-locus-1.fasta``" file specified in Column 12 must be from one population, and the last 8 sequences must be from the other population. NOTE, which population is "1" or "2" is completely arbitrary. However, the numbers in Columns 5 and 6 must correspond to the first and last sequences, respectively, in the sequence file specified in Column 12. Column 7---The transition/transversion rate ratio This is the transition/transversion rate ratio ("Kappa") of the HKY85 model of nucleotide substitution :cite:`HKY` for this alignment. NOTE: This is the transition/transversion *rate* ratio, not the "count" ratio. I.e., Kappa = 1 is equal to the Jukes-Cantor model. Column 8---The length of the alignment The number of columns in the alignment specified in Column 12. Columns 9, 10, and 11---Base frequencies Equilibrium frequencies of nucleotide bases A, C, and G, respectively. The frequency of T is simply 1 minus the sum of these three columns. Column 12---Path to sequence alignment The path to the fasta-formatted file containing the aligned sequences. The first sequences in the file should correspond to Column 5, and the last sequences in the file should correspond to Column 6. If only the name of the file is given (as in the :ref:`example configuration file` above), the file must be in the same directory ("folder") as the configuration file. The paths can be relative (e.g., ``../alignments/species-1-locus-1.fasta``).