This is a major feature release that was motivated in many respects by the migration of `statsmodels` from `patsy` to `formulaic`. Many thanks to bashtage for driving those invasive changes forward. There are some semantic breaking changes, but unless you are deep in the internals of `formulaic` (which I do not believe to be the case for any external library) these are not expected to break common usage.
**Breaking changes:**
- `Formula` is no longer always "structured" with special cases to handle the
case where it has no structure. Legacy shims have been added to support old
patterns, with `DeprecationWarning`s raised when they are used. It is not
expected to break anyone not explicitly checking whether the `Formula.root` is
a list instance (which formerly should have been simply assumed) [it is now a
`SimpleFormula` instance that acts like an ordered sequence of `Term`
instances].
- The column names associated with categorical factors have changed. Previously,
a prefix was unconditionally added to the level in the column name like
`feature[T.A]`, whether or not the encoding resulted in that term acting
as a contrast. Now, in keeping with `patsy`, we only add the prefix if the
categorical factor is encoded with reduced rank. Otherwise, `feature[A]` will
be used instead.
- `formulaic.parsers.types.structured` has been promoted to
`formulaic.utils.structured`.
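
To make the new column naming rule concrete, here is a minimal sketch in plain Python. The `column_name` helper is hypothetical and is not part of `formulaic`; it only mirrors the rule described above, and it assumes treatment coding in which the first level is the dropped reference level.

```python
from typing import List


def column_name(factor: str, level: str, reduced_rank: bool) -> str:
    """Hypothetical helper (not a formulaic API) mirroring the naming rule.

    The `T.` prefix is only added when the factor is encoded with reduced rank.
    """
    prefix = "T." if reduced_rank else ""
    return f"{factor}[{prefix}{level}]"


levels: List[str] = ["A", "B", "C"]

# Reduced rank: assume the first level is the reference and is dropped.
reduced = [column_name("feature", lvl, reduced_rank=True) for lvl in levels[1:]]

# Full rank: every level gets its own column and no prefix is added.
full = [column_name("feature", lvl, reduced_rank=False) for lvl in levels]

print(reduced)  # ['feature[T.B]', 'feature[T.C]']
print(full)     # ['feature[A]', 'feature[B]', 'feature[C]']
```

Code that selects columns by name (for example by matching on `[T.`) is the most likely place to notice this change.
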
**New features and enhancements:**
- `Formula` now instantiates to `SimpleFormula` or `StructuredFormula`, the
latter being a tree-structure of `SimpleFormula` instances (as compared to
`List[Term]` previously). This simplifies various internal logic and makes the
propagation of formula metadata more explicit. (222)
- Added support for restricting the set of features used by the default formula
parser so that libraries can more easily restrict the structure of output
formulae. (207)
- `dict` and `recarray` types are now associated with the `pandas` materializer
by default (rather than raising), simplifying some user workflows. (225)
- Added support for the `.` operator (which is replaced with all variables not
used on the left-hand-side of formulae). (216)
- Added **experimental** support for nested formulae of form `[ ... ~ ... ]`.
This is useful for (e.g.) generating formulae for IV 2SLS. (108)
- Added support for subsetting `ModelSpec[s]` based on an arbitrary
strictly reduced `FormulaSpec`. (208)
- Added `Formula.required_variables` to more easily surface the expected data
requirements of the formula. (205)
- Added support for extracting rows dropped during materialization. (197)
- Added support for cyclic (`cc`) and natural (`cr`) cubic splines. See
`formulaic.materializers.transforms.cubic_spline.cubic_spline` for
more details.
- Added a `lag()` transform.
- Constructing `LinearConstraints` can now be done from a list of strings (for
increased parity with `patsy`). (201)
- Categorical factors are now preceded with (e.g.) `T.` when they actually
describe contrasts (i.e. when they are encoded with reduced rank). (220)
- Contrasts metadata is now added to the encoder state via `encode_categorical`,
and is surfaced via `ModelSpec.factor_contrasts`. (204)
- `Operator` instances now receive `context`, which is optionally specified by
the user during formula parsing and updated by the parser. This is what makes
the `.` implementation possible. (216)
- Given the generic usefulness of `Structured`, it has been promoted to
`formulaic.utils`. (223)
- Added explicit support and testing for Python 3.13. (202)
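
As an illustration of the `.` operator semantics listed above (it is replaced with all variables not used on the left-hand side), here is a simplified sketch in plain Python. The `expand_dot` function is hypothetical and does not reflect how the `formulaic` parser is implemented; it assumes a formula with exactly one `~`, terms joined only by `+`, and an explicit list of available column names.

```python
import re
from typing import Iterable, List


def expand_dot(formula: str, columns: Iterable[str]) -> str:
    """Hypothetical, simplified expansion of a stand-alone `.` term.

    Not a formulaic API: assumes one `~` and right-hand-side terms joined by `+`.
    """
    lhs, rhs = (side.strip() for side in formula.split("~", 1))

    # Treat every identifier on the left-hand side as a "used" variable.
    lhs_vars = set(re.findall(r"[A-Za-z_]\w*", lhs))
    remaining = [col for col in columns if col not in lhs_vars]

    expanded: List[str] = []
    for term in (t.strip() for t in rhs.split("+")):
        if term == ".":
            # Substitute all variables not used on the left-hand side.
            expanded.extend(remaining)
        else:
            expanded.append(term)
    return f"{lhs} ~ {' + '.join(expanded)}"


print(expand_dot("y ~ .", ["y", "x1", "x2"]))      # y ~ x1 + x2
print(expand_dot("y ~ 1 + .", ["y", "x1", "x2"]))  # y ~ 1 + x1 + x2
```

The set of available columns has to come from somewhere at parse time, which is why this feature depends on operators receiving `context`.
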
**Bugfixes and cleanups:**
- Fixed nested ordering of `Formula` instances. (200)
- Allow Python tokens to contain multiple chained parentheses and brackets without
using quotes as long as the parentheses are balanced. (214, 218)
- Reduced the number of redundant initialisation operations in `Structured`
instances. (200)
- Fixed pickling `ModelMatrix` and `FactorValues` instances (whenever wrapped
objects are picklable). (209; thanks bashtage)
- `basis_spline`: Fixed evaluation involving datasets with null values, and
disallow out-of-bounds knots. (217; thanks bashtage)
- Improved robustness of data contexts involving PyArrow datasets.
- We now use the same sentinels throughout the code-base, rather than having
module-specific sentinels in some places.
- Migrated to `ruff` for linting, and updated `mypy` and `pre-commit` tooling.
- Fixes from `ruff` are automatically applied when using
`hatch run lint:format`.
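
For the Python token fix above, "balanced" can be read as in the following minimal sketch. The `is_balanced` helper is hypothetical and is not the tokenizer's actual code; it checks parentheses and brackets only, and it deliberately ignores any that appear inside string literals.

```python
def is_balanced(token: str) -> bool:
    """Hypothetical check (not a formulaic API) that brackets are balanced."""
    closers = {")": "(", "]": "["}
    stack = []
    for char in token:
        if char in "([":
            # Remember every opener until its matching closer is seen.
            stack.append(char)
        elif char in closers:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != closers[char]:
                return False
    # Any opener left on the stack was never closed.
    return not stack


print(is_balanced("f(x)(y)[0]"))  # True
print(is_balanced("f(x))[0]"))    # False
```
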
**Documentation:**
- Fixed and updated docsite build, as well as other minor tweaks.