Diive

Latest version: v0.75.0

0.75.0

XGBoost gap-filling

[XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) can now be used to fill gaps in time series data.
In `diive`, `XGBoost` is implemented in the class `XGBoostTS`, which adds options for easily including, e.g.,
lagged variants of feature variables, timestamp info (DOY, month, ...) and a continuous record number. It also allows
direct feature reduction by including a purely random feature (consisting of completely random numbers) and calculating
the 'permutation importance'. All features whose permutation importance is lower than that of the random feature can
then be removed from the dataset, i.e., the list of features, before building the final model.
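
The feature reduction idea can be illustrated with a minimal sketch. It is not the `XGBoostTS` implementation: an ordinary least-squares fit stands in for XGBoost, the importance is computed by hand on the training data, and all variable names (`ta`, `sw_in`, `noise`, `.RANDOM`) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic data: the target depends on "ta" and "sw_in" only.
features = {
    "ta": rng.normal(size=n),
    "sw_in": rng.normal(size=n),
    "noise": rng.normal(size=n),    # irrelevant feature
    ".RANDOM": rng.normal(size=n),  # purely random benchmark feature
}
names = list(features)
X = np.column_stack([features[k] for k in names])
y = 3.0 * features["ta"] + 1.5 * features["sw_in"] + rng.normal(scale=0.5, size=n)


def fit_predict(X_train, y_train):
    """Fit ordinary least squares (stand-in model) and return a predict function."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return lambda X_new: np.column_stack([X_new, np.ones(len(X_new))]) @ coef


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=10):
    """Mean drop in R2 when one column at a time is shuffled."""
    baseline = r2(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - r2(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances


predict = fit_predict(X, y)
importances = dict(zip(names, permutation_importance(predict, X, y)))

# Keep only features that are more important than the random benchmark.
threshold = importances[".RANDOM"]
kept = [name for name in names if name != ".RANDOM" and importances[name] > threshold]
print(importances)
print("kept:", kept)
```

The final model would then be built using only the features in `kept`. In practice the importance is usually evaluated on held-out data rather than on the training data used here.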

`XGBoostTS` and `RandomForestTS` both use the same base class `MlRegressorGapFillingBase`. This base class will also
facilitate the implementation of other gap-filling algorithms in the future.

Another fun (for me) addition is the new class `TimeSince`. It calculates the time since the last occurrence of
specific conditions. One example where this class can be useful is the calculation of 'time since last precipitation',
expressed as a number of records, which can be helpful in identifying dry conditions. More examples: 'time since freezing
conditions' based on air temperature; 'time since management' based on management info, e.g. fertilization events.
Please see the notebook for some illustrative examples.
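
To make the counting logic concrete, here is a small sketch in plain `pandas`. It does not use the `TimeSince` class; the helper `records_since` and its handling of records before the first occurrence (set to missing) are assumptions made for this example.

```python
import pandas as pd

# Half-hourly precipitation record (made-up values).
precip = pd.Series(
    [0.0, 1.2, 0.0, 0.0, 0.0, 0.4, 0.0, 0.0],
    index=pd.date_range("2024-01-01", periods=8, freq="30min"),
    name="PREC",
)


def records_since(series, lower, upper):
    """Count records since the series was last inside [lower, upper]."""
    in_range = series.between(lower, upper)  # inclusive on both sides
    # Every occurrence starts a new group of records.
    group = in_range.cumsum()
    # Within each group, count the records that are outside the range.
    counter = (~in_range).astype(int).groupby(group).cumsum()
    # Before the first occurrence the counter is undefined (assumption).
    return counter.where(group > 0)


since_precip = records_since(precip, lower=0.1, upper=float("inf"))
print(since_precip)
```

The counter is reset to zero whenever precipitation occurs and increases by one for every following dry record.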

**Please note that `diive` is still under development and bugs can be expected.**

New features

- Added gap-filling class `XGBoostTS` for time series data,
using [XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) (`diive.pkgs.gapfilling.xgboost_ts.XGBoostTS`)
- Added new class `TimeSince`: counts the number of records (incremental counter) since the last time a time
series was inside a specified range, useful e.g. for counting the time since last precipitation, since last freezing
temperature, etc. (`diive.pkgs.createvar.timesince.TimeSince`)

Additions

- Added base class for machine learning regressors, which is basically the code shared between the different
methods. At the moment used by `RandomForestTS` and `XGBoostTS`. (`diive.core.ml.common.MlRegressorGapFillingBase`)
- Added option to change line color directly in `TimeSeries` plots (`diive.core.plotting.timeseries.TimeSeries.plot`)

Notebooks

- Added new notebook for gap-filling using `XGBoostTS` with minimal settings (`notebooks/GapFilling/XGBoostGapFillingMinimal.ipynb`)
- Added new notebook for gap-filling using `XGBoostTS` with more extensive settings (`notebooks/GapFilling/XGBoostGapFillingExtensive.ipynb`)
- Added new notebook for creating `TimeSince` variables (`notebooks/CalculateVariable/TimeSince.ipynb`)

Tests

- Added test case for XGBoost gap-filling (`tests.test_gapfilling.TestGapFilling.test_gapfilling_xgboost`)
- Updated test case for random forest gap-filling (`tests.test_gapfilling.TestGapFilling.test_gapfilling_randomforest`)
- Harmonized test case for XGBoostTS with test case of RandomForestTS
- Added test case for `TimeSince` variable creation (`tests.test_createvar.TestCreateVar.test_timesince`)

0.74.1

This update adds the first notebooks (and tests) for outlier detection methods. Only two tests are included so far and
both tests are relatively simple, but both notebooks already show in principle how outlier removal is handled. An
important aspect is that the individual outlier methods in `diive` do not remove outliers by default; instead, they create a flag
that shows where the outliers are located. The flag can then be used to remove the data points.
This update also includes the addition of a small function that creates artificial spikes in time series data and is
therefore very useful for testing outlier detection methods.
More outlier removal notebooks will be added in the future, including a notebook that shows how to combine results from
multiple outlier tests into one single overall outlier flag.
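
The flag-based workflow can be sketched in plain `pandas`. This is not the `diive` API: the spike insertion only mimics what an impulse-noise function might do, and the flag name and convention (0 = ok, 1 = outlier) are assumptions for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.date_range("2024-06-01", periods=200, freq="30min")

# Clean signal between 10 and 20 (made-up air temperature).
values = 15 + 5 * np.sin(np.linspace(0, 8 * np.pi, len(index)))

# Insert artificial spikes (impulse noise) at five random positions.
spike_pos = rng.choice(len(index), size=5, replace=False)
values[spike_pos] += rng.choice([-40.0, 40.0], size=5)
spiky = pd.Series(values, index=index, name="TA")

# Absolute limits test: create a flag, do not touch the data.
lower, upper = 0.0, 30.0
flag = pd.Series(0, index=spiky.index, name="FLAG_TA_OUTLIER")
flag[(spiky < lower) | (spiky > upper)] = 1

# Removal is a separate, explicit step that uses the flag.
filtered = spiky.where(flag == 0)
print(flag.value_counts())
```

Keeping the flag separate from the data means the original series stays intact and flags from several tests can later be combined into one overall flag.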

New features

- **Added**: new function to add impulse noise to time series (`diive.pkgs.createvar.noise.impulse`)

Notebooks

- **Added**: new notebook for outlier detection: absolute limits, separately for daytime and nighttime
data (`notebooks/OutlierDetection/AbsoluteLimitsDaytimeNighttime.ipynb`)
- **Added**: new notebook for outlier detection: absolute limits (`notebooks/OutlierDetection/AbsoluteLimits.ipynb`)

Tests

- **Added**: test case for outlier detection: absolute limits, separately for daytime and
nighttime data (`tests.test_outlierdetection.TestOutlierDetection.test_absolute_limits`)
- **Added**: test case for outlier detection: absolute
limits (`tests.test_outlierdetection.TestOutlierDetection.test_absolute_limits`)

0.74.0

Additions

- **Added**: new function to remove rows that do not have timestamp
info (`NaT`) (`diive.core.times.times.remove_rows_nat` and `diive.core.times.times.TimestampSanitizer`)
- **Added**: new settings `VARNAMES_ROW` and `VARUNITS_ROW` in filetypes YAML files, which allow a more specific
configuration when reading data files (`diive/configs/filetypes`)
- **Added**: many (small) example data files for various filetypes, e.g. `ETH-RECORD-TOA5-CSVGZ-20HZ`
- **Added**: new optional check in `TimestampSanitizer` that compares the detected time resolution of a time series with
the nominal (expected) time resolution. Runs automatically when reading files with `ReadFileType`, in which case
the `FREQUENCY` from the filetype configs is used as the nominal time
resolution. (`diive.core.times.times.TimestampSanitizer`, `diive.core.io.filereader.ReadFileType`)
- **Added**: application of `TimestampSanitizer` after inserting a timestamp and setting it as index with
function `insert_timestamp`; this makes sure the freq/freqstr info is available for the new timestamp
index (`diive.core.times.times.insert_timestamp`)
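
The following sketch shows, in plain `pandas`, what removing `NaT` rows and comparing the detected with the nominal time resolution can look like. It does not call `TimestampSanitizer` or `remove_rows_nat`; the data and the nominal frequency `30min` are made up for the example.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# One row has an invalid timestamp mixed in with regular timestamps.
df = pd.DataFrame(
    {"TA": [10.1, 10.3, 10.2, 10.4]},
    index=["2024-01-01 00:30", "not a timestamp", "2024-01-01 01:00", "2024-01-01 01:30"],
)

# Invalid timestamps become NaT instead of raising an error.
df.index = pd.to_datetime(df.index, errors="coerce")

# Remove rows without timestamp info.
df = df[df.index.notna()]

# Compare the detected with the nominal (expected) time resolution.
nominal = "30min"
detected = pd.infer_freq(df.index)
matches = to_offset(detected) == to_offset(nominal)
print(f"detected={detected}, nominal={nominal}, matches={matches}")
```

Comparing offsets instead of raw strings avoids false mismatches when different `pandas` versions use different aliases for the same resolution.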

Notebooks

- General: Ran all notebook examples to make sure they work with this version of `diive`
- **Added**: new notebook for reading EddyPro _fluxnet_ output file with `DataFileReader`
parameters (`notebooks/ReadFiles/Read_single_EddyPro_fluxnet_output_file_with_DataFileReader.ipynb`)
- **Added**: new notebook for reading EddyPro _fluxnet_ output file with `ReadFileType` and pre-defined
filetype `EDDYPRO-FLUXNET-CSV-30MIN` (`notebooks/ReadFiles/Read_single_EddyPro_fluxnet_output_file_with_ReadFileType.ipynb`)
- **Added**: new notebook for reading multiple EddyPro _fluxnet_ output files with `MultiDataFileReader` and pre-defined
filetype `EDDYPRO-FLUXNET-CSV-30MIN` (`notebooks/ReadFiles/Read_multiple_EddyPro_fluxnet_output_files_with_MultiDataFileReader.ipynb`)

Changes

- **Renamed**: function `get_len_header` to `parse_header` (`diive.core.dfun.frames.parse_header`)
- **Renamed**: exampledata files (`diive/configs/exampledata`)
- **Renamed**: filetypes YAML files to always include the file extension in the file name (`diive/configs/filetypes`)
- **Reduced**: file size for most example data files

Tests

- **Added**: various test cases for loading filetypes (`tests/test_loaddata.py`)
- **Added**: test case for loading and merging multiple
files (`tests.test_loaddata.TestLoadFiletypes.test_load_exampledata_multiple_EDDYPRO_FLUXNET_CSV_30MIN`)
- **Added**: test case for reading EddyPro _fluxnet_ output file with `DataFileReader`
parameters (`tests.test_loaddata.TestLoadFiletypes.test_load_exampledata_EDDYPRO_FLUXNET_CSV_30MIN_datafilereader_parameters`)
- **Added**: test case for resampling series to 30MIN time
resolution (`tests.test_time.TestTime.test_resampling_to_30MIN`)
- **Added**: test case for inserting timestamp with a different convention (middle, start,
end) (`tests.test_time.TestTime.test_insert_timestamp`)
- **Added**: test case for inserting timestamp as index (`tests.test_time.TestTime.test_insert_timestamp_as_index`)

Bugfixes

- **Fixed**: bug in class `DetectFrequency` when inferred frequency is `None` (`diive.core.times.times.DetectFrequency`)
- **Fixed**: bug in class `DetectFrequency` where `pd.Timedelta()` would crash if the input frequency does not have a
number. `Timedelta` does not accept e.g. the frequency string `min` for minutely time resolution, even though
e.g. `pd.infer_freq()` outputs `min` for data in 1-minute time resolution. `Timedelta` requires a number, in this
case `1min`. Results from `infer_freq()` are now checked for a number and, if there is none, `1` is added at the
beginning of the frequency string. (`diive.core.times.times.DetectFrequency`)
- **Fixed**: bug in notebook `WindDirectionOffset`, related to frequency detection during heatmap plotting
- **Fixed**: bug in `TimestampSanitizer` where the script would crash if the timestamp contained an element that could
not be converted to datetime, e.g., when there is a string mixed in with the regular timestamps. Data rows with
invalid timestamps are now parsed as `NaT` by using `errors='coerce'`
in `pd.to_datetime(data.index, errors='coerce')`. (`diive.core.times.times.convert_timestamp_to_datetime`
and `diive.core.times.times.TimestampSanitizer`)
- **Fixed**: bug when plotting heatmap (`diive.core.plotting.heatmap_datetime.HeatmapDateTime`)
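
The `Timedelta` fix listed above can be sketched as follows. The helper `ensure_leading_number` is hypothetical and only checks whether the string starts with a digit, which may differ from the check in `DetectFrequency`.

```python
import re

import pandas as pd


def ensure_leading_number(freq):
    """Prepend '1' if the frequency string does not start with a digit."""
    return freq if re.match(r"\d", freq) else f"1{freq}"


# A frequency string without a number is rejected by pd.Timedelta.
try:
    pd.Timedelta("min")
except ValueError as err:
    print("pd.Timedelta('min') failed:", err)

# With the number added, the conversion works.
for raw in ["min", "30min", "h"]:
    fixed = ensure_leading_number(raw)
    print(raw, "->", fixed, "->", pd.Timedelta(fixed))
```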

0.73.0

New features

- Added new function `trim_frame` that trims the start and end of a dataframe based on available records of a
variable (`diive.core.dfun.frames.trim_frame`)
- Added new option to export borderless
heatmaps (`diive.core.plotting.heatmap_base.HeatmapBase.export_borderless_heatmap`)
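
Trimming a dataframe to the available records of one variable, as described for `trim_frame` above, could look like the sketch below. It uses plain `pandas`; the helper `trim_to_variable` is hypothetical and its signature is not that of `trim_frame`.

```python
import numpy as np
import pandas as pd


def trim_to_variable(df, var):
    """Keep only rows between the first and last valid record of `var`."""
    first = df[var].first_valid_index()
    last = df[var].last_valid_index()
    if first is None:  # variable has no valid records at all
        return df.iloc[0:0]
    return df.loc[first:last]


df = pd.DataFrame(
    {
        "NEE": [np.nan, np.nan, 1.0, np.nan, 2.0, np.nan],
        "TA": [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    },
    index=pd.date_range("2024-01-01", periods=6, freq="30min"),
)

trimmed = trim_to_variable(df, "NEE")
print(trimmed)
```

Only leading and trailing rows are removed; gaps inside the trimmed period are kept.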

Additions

- Added more info in comments of class `WindRotation2D` (`diive.pkgs.echires.windrotation.WindRotation2D`)
- Added example data for EddyPro full_output
files (`diive.configs.exampledata.load_exampledata_eddypro_full_output_CSV_30MIN`)
- Added code in an attempt to harmonize frequency detection from data: in class `DetectFrequency` the detected
frequency strings are now converted from `Timedelta` (pandas) to `offset` (pandas) to `.freqstr`. This will yield
the frequency string as seen by (the current version of) pandas. The idea is to harmonize between different
representations e.g. `T` or `min` for minutes (
see [here](https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html)). (`diive.core.times.times.DetectFrequency`)
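
The conversion chain mentioned for `DetectFrequency` (`Timedelta` to offset to `.freqstr`) can be tried directly in `pandas`. This sketch only shows the conversion itself, not how `DetectFrequency` determines the time step; the most frequent timestamp difference is used here as an assumption.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

index = pd.date_range("2024-01-01", periods=6, freq="30min")

# Most frequent difference between consecutive timestamps, as a Timedelta.
deltas = index.to_series().diff().dropna()
delta = deltas.value_counts().idxmax()

# Timedelta -> offset -> frequency string as seen by the installed pandas.
freqstr = to_offset(delta).freqstr
print(pd.__version__, freqstr)
```

Depending on the installed `pandas` version, the printed frequency string uses `T` or `min` for minutes.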

Changes

- Updated class `DataFileReader` to comply with new `pandas` kwargs when
using `.read_csv()` (`diive.core.io.filereader.DataFileReader._parse_file`)
- Environment: updated `pandas` to v2.2.2 and `pyarrow` to v15.0.2
- Updated date offsets in config filetypes to be compliant with `pandas` version 2.2+ (
see [here](https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html)
and [here](https://pandas.pydata.org/docs/user_guide/timeseries.html#dateoffset-objects)), e.g., `30T` was changed
to `30min`. This seems to work without raising a warning, however, if frequency is inferred from available data,
the resulting frequency string shows e.g. `30T`, i.e. still showing `T` for minutes instead
of `min`. (`diive/configs/filetypes`)
- Changed variable names in `WindRotation2D` to be in line with the variable names given in the paper by Wilczak et
al. (2001) https://doi.org/10.1023/A:1018966204465

Removals

- Removed function `timedelta_to_string` because this can be done with pandas `to_offset().freqstr`
- Removed function `generate_freq_str` (unused)

Tests

- Added test case for reading EddyPro full_output
files (`tests.test_loaddata.TestLoadFiletypes.test_load_exampledata_eddypro_full_output_CSV_30MIN`)
- Updated test for frequency detection (`tests.test_timestamps.TestTime.test_detect_freq`)

0.72.1

- `pyproject.toml` now uses the inequality syntax `>=` instead of caret syntax `^` because the version capping is
restrictive and prevents compatibility in conda installations. See [74](https://github.com/holukas/diive/pull/74)
- Added badges in `README.md`
- Smaller `diive` logo in `README.md`

0.72.0

New feature

- Added new heatmap plotting class `HeatmapYearMonth` that plots a variable in year/month
classes (`diive.core.plotting.heatmap_datetime.HeatmapYearMonth`)

![DIIVE](images/plotHeatmapYearMonth_diive_v0.72.0.png)
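
The data behind such a year/month heatmap is a grid with one row per year and one column per month. The sketch below builds this grid in plain `pandas`. It does not use `HeatmapYearMonth`, and the monthly mean is an assumed aggregation.

```python
import numpy as np
import pandas as pd

# Three years of daily data (made-up values).
index = pd.date_range("2021-01-01", "2023-12-31", freq="D")
series = pd.Series(np.arange(len(index), dtype=float), index=index, name="VAR")

# Aggregate by year and month, then spread months across columns.
grid = series.groupby([series.index.year, series.index.month]).mean().unstack()
grid.index.name = "year"
grid.columns.name = "month"
print(grid.round(1))
```

The resulting grid can be passed to any plotting function that draws a 2D array as colored cells.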

Changes

- Refactored code for class `HeatmapDateTime` (`diive.core.plotting.heatmap_datetime.HeatmapDateTime`)
- Added new base class `HeatmapBase` for heatmap plots. Currently used by `HeatmapYearMonth`
and `HeatmapDateTime` (`diive.core.plotting.heatmap_base.HeatmapBase`)

Notebooks

- Added new notebook for `HeatmapDateTime` (`notebooks/Plotting/HeatmapDateTime.ipynb`)
- Added new notebook for `HeatmapYearMonth` (`notebooks/Plotting/HeatmapYearMonth.ipynb`)

Bugfixes

- Fixed bug in `HeatmapDateTime` where the last record of each day was not shown
