This is a major update to the underlying data structures in msprime to
generalise the information that can be modelled, and allow
for data from external sources to be efficiently processed. The
new Tables API enables efficient interchange of tree sequence data using
numpy arrays. Many updates have also been made to the tree sequence
API to make it more Pythonic and general. Most changes are backwards
compatible; the exceptions are listed below.
**Breaking changes**:
- The `SparseTree.mutations()` and `TreeSequence.mutations()` iterators no
  longer support tuple-like access to values. For example, code like

      for x, u, j in ts.mutations():
          print("mutation at position", x, "node = ", u)

  will no longer work. Code using the old `Mutation.position` and
  `Mutation.index` will still work through deprecated aliases,
  but new code should access these values through `Site.position`
  and `Site.id`, respectively.
- The `TreeSequence.diffs()` method no longer works. Please use
the `TreeSequence.edge_diffs()` method instead.
- `TreeSequence.get_num_records()` no longer works. Any code using
this or the `records()` iterator should be rewritten to work with
the `edges()` iterator and `num_edges` instead.
- Files stored in the HDF5 format will need to be upgraded using the
`msp upgrade` command.
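
To illustrate the move from `records()` to `edges()` described above, here is a minimal sketch of how one coalescence record maps onto one edge per child. The `Record` and `Edge` tuples and the `records_to_edges` helper are hypothetical stand-ins written for this example; they are not msprime classes, and the sketch does not merge adjacent edges or create sample nodes as a real conversion would need to.

```python
from collections import namedtuple

# Hypothetical stand-ins for illustration only; not msprime classes.
Record = namedtuple("Record", ["left", "right", "node", "children", "time"])
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])


def records_to_edges(records):
    """Split each record into one edge per child and collect parent times."""
    edges = []
    node_times = {}
    for rec in records:
        # The time is a property of the parent node, so it is stored once
        # per node instead of being repeated on every record.
        node_times[rec.node] = rec.time
        for child in rec.children:
            edges.append(Edge(rec.left, rec.right, rec.node, child))
    return node_times, edges


# Two trees over samples 0, 1, 2: ((0,1)3,2)4 on [0, 0.5) and
# ((0,2)3,1)4 on [0.5, 1).
records = [
    Record(left=0.0, right=0.5, node=3, children=(0, 1), time=1.0),
    Record(left=0.5, right=1.0, node=3, children=(0, 2), time=1.0),
    Record(left=0.0, right=0.5, node=4, children=(2, 3), time=2.0),
    Record(left=0.5, right=1.0, node=4, children=(1, 3), time=2.0),
]

node_times, edges = records_to_edges(records)
for edge in edges:
    print(edge)
```

Code that previously counted records would count edges instead; in this toy example four records become eight edges.
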
**New features**:
- The API has been made more Pythonic by replacing (e.g.)
`tree.get_parent(u)` with `tree.parent(u)`, and
`tree.get_total_branch_length()` with `tree.total_branch_length`.
The old forms have been maintained as deprecated aliases. (64)
- Efficient interchange of tree sequence data using the new Tables
API. This consists of classes representing the various
tables (e.g. `NodeTable`) and some utility functions (such
as `load_tables`, `sort_tables`, etc).
- Support for a much more general class of tree sequence topologies.
For example, trees with multiple roots are fully supported.
- Substantially generalised mutation model. Mutations now occur at
specific sites, which can be associated with zero to many mutations.
Each site has an ancestral state (any character string) and
each mutation a derived state (any character string).
- Substantially updated documentation to rigorously define the
underlying data model and requirements for imported data.
- The `variants()` method now returns a list of alleles for each
site, and genotypes are indexes into this array. This is both
consistent with existing usage and works with the newly generalised
mutation model, which allows arbitrary strings of characters as
mutational states.
- Added the formal concept of a sample, as distinguished from 'leaves'.
Changed `tracked_leaves`, etc. to `tracked_samples` (225).
Also renamed `sample_size` to `num_samples` for consistency (227).
- The `simplify()` method returns subsets of a large tree sequence.
- `TreeSequence.first()` returns the first tree in the sequence.
- Windows support. Msprime is now routinely tested on Windows as
part of the suite of continuous integration tests.
- Newick output is now supported for more general trees. (117)
- The `genotype_matrix` method allows efficient access to the
full genotype matrix. (306)
- The variants iterator no longer uses a single buffer for
genotype data, removing a common source of error (253).
- Unicode and ASCII output formats for `SparseTree.draw()`.
- `SparseTree.draw()` renders trees in the more conventional 'square
shoulders' format.
- `SparseTree.draw()` by default returns an SVG string, so it can
be easily displayed in a Jupyter notebook. (204)
- Preliminary support for a broad class of site-based statistics,
including Patterson's f-statistics, has been added, through
the `SiteStatCalculator`, and its branch length analog,
`BranchLengthStatCalculator`. The interface is still in development,
and is expected to change.
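
The relationship between alleles and genotypes in the generalised mutation model can be sketched in plain Python. The helper names `build_alleles` and `decode_genotypes` are hypothetical and exist only for this example, and placing the ancestral state at index 0 is an assumption of the sketch. It shows genotypes as indexes into a per-site list of allele strings, where each allele may be any character string.

```python
def build_alleles(ancestral_state, derived_states):
    """Collect the distinct states seen at a site, ancestral state first."""
    alleles = [ancestral_state]
    for state in derived_states:
        if state not in alleles:
            alleles.append(state)
    return alleles


def decode_genotypes(alleles, genotypes):
    """Map each per-sample genotype index back to its allele string."""
    return [alleles[g] for g in genotypes]


# A hypothetical site with ancestral state "A" and three mutations,
# two of which share the derived state "T".
alleles = build_alleles("A", ["T", "GG", "T"])

# One index per sample, pointing into the alleles list.
genotypes = [0, 1, 0, 2]
decoded = decode_genotypes(alleles, genotypes)
print(alleles, decoded)
```

Storing small integer indexes rather than strings keeps the genotype data compact even when the states themselves are long.
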
**Bug fixes**:
- Duplicate sites are no longer possible (159).
- Fix for incorrect population sizes in DemographyDebugger (66).
**Deprecated**:
- The `records` iterator has been deprecated, and the underlying data
model has moved away from the concept of coalescence records. The
structure of a tree sequence is now defined in terms of a set of nodes
and edges, essentially a normalised version of coalescence records.
- Changed `population_id` to `population` in various DemographicEvent
classes for consistency. The old `population_id` argument is kept as a
deprecated alias.
- Changed `destination` to `dest` in MassMigrationEvent. The old
`destination` argument is retained as a deprecated alias.
- Changed `sample_size` to `num_samples` in TreeSequence and
SparseTree. The older versions are retained as deprecated aliases.
- Changed `get_num_leaves` to `num_samples` in SparseTree. The
`get_num_leaves` method and other related methods that have
been retained for backwards compatibility are semantically incorrect,
in that they now return the number of **samples**. This should have
no effect on existing code, since samples and leaves were synonymous.
New code should use the documented `num_samples` form.
- Accessing the `position` attribute on a `Mutation` or
`Variant` object is now deprecated, as this is a property of a `Site`.
- Accessing the `index` attribute on a `Mutation` or `Variant` object
is now deprecated. Please use `variant.site.id` instead. In general,
objects with IDs (i.e., derived from tables) now have an `id` field.
- Various `get_` methods in TreeSequence and SparseTree have been
replaced by more Pythonic alternatives.
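
As a general illustration of how a `get_` method can be kept as a deprecated alias for a newer name, here is a minimal sketch. `TinyTree` is a hypothetical stand-in, not msprime's `SparseTree`. Emitting a `DeprecationWarning` and returning `-1` for a root are assumptions of this example; these notes do not say whether msprime's own aliases warn.

```python
import warnings


class TinyTree:
    """Hypothetical stand-in used only to illustrate the alias pattern."""

    def __init__(self, parents):
        # Maps each child node to its parent node.
        self._parents = parents

    def parent(self, u):
        """Preferred spelling: return the parent of u, or -1 if u has none."""
        return self._parents.get(u, -1)

    def get_parent(self, u):
        """Older spelling, kept so that existing code keeps working."""
        warnings.warn(
            "get_parent() is deprecated; use parent() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.parent(u)


tree = TinyTree({0: 2, 1: 2})
print(tree.parent(0))
```

Because the old method delegates to the new one, both spellings return the same value and only the new one needs maintaining.
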