Added
- Added function `dt.models.kfold(nrows, nsplits)` to prepare indices for
k-fold splitting. This function will return `nsplits` pairs of row selectors
such that when these selectors are applied to an `nrows`-row frame, that
frame will be split into train and test parts according to the k-fold
splitting scheme.
- Added function `dt.models.kfold_random(nrows, nsplits, seed)`, which is
similar to `kfold(nrows, nsplits)`, except that the assignment of rows into
folds is randomized, not deterministic.
- `Frame.rbind()` can now also accept a list or tuple of frames (previously
only a vararg sequence was allowed).
- Method `.len()` can be applied to a string column to obtain the lengths
of strings in each row.
- Method `.re_match(re)` applies to a string column, and produces a boolean
indicator of whether each value matches the regular expression `re`.
The method matches the entire string, not just the beginning. Thus, it
most closely resembles Python's `re.fullmatch()` function.
- Added early stopping support to the FTRL algorithm, which can now do
binomial and multinomial classification for categorical targets, as well as
regression for continuous targets.
- New function `dt.median()` can be used to compute the median of a certain
column or expression, either per group or for the entire Frame (1530).
- `Frame.__str__()` now returns a string containing the preview of the
frame's data. This allows datatable frames to be used with `print()`.
- Added method `dt.options.describe()`, which will print the available
options together with their values and descriptions.
- Added `dt.options.context(option=value)`, which can be used in a `with`
statement to temporarily change the value of one or more options, and
then restore their original values at the end of the with-block.
- Added options `fread.log.escape_unicode` (controls treatment of unicode
characters in fread's verbose log) and `display.use_colors` (turns colored
output in the console on or off).
- `dt.options` now helps the user when they make a typo: if an option
with a certain name does not exist, the error message will suggest the
correct spelling.
- Most long-running operations in `datatable` will now show a progress bar.
Its behavior can be controlled via the `dt.options.progress` set of options.
- Added internal function `dt.internal.compiler_version()`.
- New `datatable.math` module is a library of various mathematical functions
that can be applied to datatable Frames. The set of functions is close to
what is available in the standard python `math` module. See documentation
for more details.
- New module `datatable.sphinxext.dtframe_directive`, which can be used as
a plugin for Sphinx. This module adds the directive `.. dtframe` that makes
it easy to include a Frame display in an .rst document.
- Frame can now be treated as an iterable over the columns. Thus, a Frame
object can now be used in a for-loop, producing its individual columns.
- A Frame can now be treated as a mapping; in particular both `dict(frame)`
and `**frame` are now valid.
- Single-column frames can be used as sources for Frame construction.
- CSV writer now quotes fields containing a single-quote mark (`'`).
- Added parameter `quoting=` to method `Frame.to_csv()`. The accepted values
are 4 constants from the standard `csv` module: `csv.QUOTE_MINIMAL`
(default), `csv.QUOTE_ALL`, `csv.QUOTE_NONNUMERIC` and `csv.QUOTE_NONE`.
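
Two of the entries above defer to standard-library semantics: `.re_match()` is described as behaving like `re.fullmatch()`, and `quoting=` takes constants from the `csv` module. The sketch below uses only the Python standard library to show what those semantics are; it does not call `datatable`, and the exact output of `Frame.to_csv()` may differ in detail from the stdlib writer. The helper `render()` and the sample values are illustrative only.

```python
import csv
import io
import re

# re.match() only anchors at the start of the string, whereas
# re.fullmatch() requires the pattern to consume the whole string.
pattern = r"\d+"
values = ["123", "123abc", "abc"]
prefix_hits = [re.match(pattern, v) is not None for v in values]
full_hits = [re.fullmatch(pattern, v) is not None for v in values]
print(prefix_hits)  # [True, True, False]
print(full_hits)    # [True, False, False]


def render(row, quoting):
    """Write one row with the stdlib csv writer and return the text line."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=quoting, lineterminator="\n")
    writer.writerow(row)
    return buf.getvalue().rstrip("\n")


row = ["a,b", 1, "plain"]
quoted = {
    # Quote only the fields that need it (here: the one containing a comma).
    "QUOTE_MINIMAL": render(row, csv.QUOTE_MINIMAL),
    # Quote every field, including numbers.
    "QUOTE_ALL": render(row, csv.QUOTE_ALL),
    # Quote every non-numeric field.
    "QUOTE_NONNUMERIC": render(row, csv.QUOTE_NONNUMERIC),
    # Never quote; the row must not contain characters that need escaping.
    "QUOTE_NONE": render(["x", 2], csv.QUOTE_NONE),
}
for name, line in quoted.items():
    print(f"{name:17} {line}")
```

In the stdlib writer, `csv.QUOTE_NONE` raises an error for a field such as `"a,b"` unless an `escapechar` is configured, which is why that case uses a row without special characters.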
Fixed
- Fixed crash in certain circumstances when a key was applied after a
groupby (1639).
- `Frame.to_numpy()` now returns a numpy `masked_array` if the frame has
any NA values (1619).
- A keyed frame will now be rendered correctly when viewing it in python
console via `Frame.view()` (1672).
- Str32 column can no longer overflow during the `.replace()` operation,
or when converting from python, numpy or pandas, etc. In all these cases
we will now transparently create a Str64 column instead (1694).
- The reported frame size (`sys.getsizeof(DT)`) is now more accurate; in
particular the content of string columns is no longer ignored (1697).
- Type casting into str32 no longer produces an error if the resulting column
is larger than 2GB. Now a str64 column will be returned instead (1695).
- Fixed memory leak during computation of a generic `DT[i, j]` expression.
Another memory leak, which occurred during generation of string columns, is
now also fixed (1705).
- Fixed crash upon exiting from a python terminal, if the user ever called
function `frame_column_rowindex().type` (1703).
- Pandas "boolean column with NAs" (of dtype `object`) now converts into
datatable `bool8` column when pandas DataFrame is converted into a datatable
Frame (1730).
- Fixed conversion to numpy of a view Frame which contains NAs (1738).
- `datatable` can now be safely used with `multiprocessing`, or other modules
that perform fork-without-exec (1758). The child process will spawn its
own thread pool that will have the same number of threads as the parent.
Adjust `dt.options.nthreads` in the child process(es) if a different number
of threads is required.
- The interactive mode is no longer improperly turned on in IPython (1789).
- Fixed issue with mis-aligned frame headers in IPython, caused by IPython
inserting `Out[X]:` in front of the rendered Frame display (1793).
- Improved rendering of Frames in terminals with white background: we no longer
use 'bright_white' color for emphasis, only 'bold' (1793).
- Fixed crash when a new column was created via partial assignment, i.e.
`DT[i, "new_col"] = expr` (1800).
- Fixed memory leaks/crashes when materializing an object column (1805).
- Fixed creating a Frame from a pandas DataFrame that has duplicate column
names (1816).
- Fixed a UnicodeDecodeError that could be thrown when viewing a Frame with
unicode characters in Jupyter notebook. The error only manifested for
strings that were longer than 50 bytes in length (1825).
- Fixed crash when `Frame.colindex()` was used without any arguments; this now
raises an exception instead (1834).
- Fixed possible crash when writing to a disk that doesn't have enough free
space on it (1837).
- Fixed invalid Frame being created when reading a large string column (str64)
with fread, and the column contains NA values.
- Fixed FTRL model not resuming properly after unpickling (1846).
- Fixed crash that occurred when sorting by multiple columns, and the first
column is of low cardinality (1857).
- Fixed display of NA values produced during a join, when a Frame was displayed
in Jupyter Lab (1872).
- Fixed a crash when replacing values in a str64 column (1890).
- `cbind()` no longer throws an error when passed a generator producing
temporary frames (1905).
- Fixed comparison of string columns vs. value `None` (1912).
- Fixed a crash when trying to select individual cells from a joined Frame,
for the cells that were un-matched during the join (1917).
- Fixed a crash when writing a joined frame into CSV (1919).
- Fixed a crash when writing into CSV string view columns, especially of
str64 type (1921).
Changed
- A Frame will no longer be shown in "interactive" mode in console by default.
The previous behavior can be restored with
`dt.options.display.interactive = True`. Alternatively, you can explore a
Frame interactively using `frame.view(True)`.
- Improved performance of type-casting a view column: now the code avoids
materializing the column before performing the cast.
- `Frame` class is now defined fully in C++, improving code robustness and
performance. The property `Frame.internal` was removed, as it no longer
represents anything. Certain internal properties of `Frame` can be accessed
via functions declared in the `dt.internal` module.
- `datatable` no longer uses OpenMP for parallelism. Instead, we use our own
thread pool to perform multi-threaded computations (1736).
- Parameter `progress_fn` in function `dt.models.aggregate()` is removed.
In its place you can set the global option `dt.options.progress.callback`.
- Removed deprecated Frame methods `.topython()`, `.topandas()`, `.tonumpy()`,
and `Frame.__call__()`.
- Syntax `DT[col]` has been restored (was previously deprecated in 0.7.0),
however it works only when `col` is an integer or a string. Support for slices
may be added in the future, or not: there is a potential to confuse
`DT[a:b]` for a row selection. A column slice may still be selected via
the i-j selector `DT[:, a:b]`.
- The `nthreads=` parameter in `Frame.to_csv()` was removed. If needed, please
set the global option `dt.options.nthreads`.
Deprecated
- Frame method `.scalar()` is now deprecated and will be removed in release
0.10.0. Please use `frame[0, 0]` instead.
- Frame method `.append()` is now deprecated and will be removed in release
0.10.0. Please use `.rbind()` instead.
- Frame method `.save()` was renamed to `.to_jay()` (for consistency with
other `.to_*()` methods). The old name is still usable, but marked as
deprecated and will be removed in 0.10.0.
Notes
- Thanks to everyone who helped make `datatable` more stable by discovering
and reporting bugs that were fixed in this release:
- [Arno Candel][] (1619, 1730, 1738, 1800, 1803, 1846, 1857, 1890,
1891, 1919, 1921),
- [Antorsae][] (1639),
- [Olivier][] (1872),
- [Hawk Berry][] (1834),
- [Jonathan McKinney][] (1816, 1837),
- [Mateusz Dymczyk][] (1912),
- [NachiGithub][] (1789, 1793),
- [Pasha Stetsenko][] (1672, 1694, 1695, 1697, 1703, 1705, 1905,
1917),
- [Tom Kraljevic][] (1805),
- [XiaomoWu][] (1825)