Diive

0.58.0

Random forest update

The class `RandomForestTS` has been refactored. In essence, it still uses the same
`RandomForestRegressor` as before, but now additionally outputs feature importances
computed by permutation. More details about permutation importance can be found in
scikit-learn's official
documentation: [Permutation feature importance](https://scikit-learn.org/stable/modules/permutation_importance.html).

When the model is trained using `.trainmodel()`, a random variable is included as an additional
feature. Permutation importances of all features, including the random variable, are then
analyzed. Variables that yield a lower importance score than the random variable are removed
from the dataset and are not used to build the model. Typically, the permutation importance of
the random variable is very close to zero or even negative.

The built-in importance calculation of the `RandomForestRegressor` uses the Gini importance,
an impurity-based feature importance that favors high-cardinality features over low-cardinality
features. This is not ideal for time series data that is combined with categorical data.
Permutation importance is therefore a better indicator of whether a variable included in the
model is an important predictor or not.
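
The screening step can be illustrated with scikit-learn directly. The following is a minimal
sketch of the idea, not the `diive` implementation; the feature names are made up:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative data: two informative features, target depends on both
rng = np.random.default_rng(42)
X = pd.DataFrame({'TA': rng.normal(10, 5, 1000),    # e.g. air temperature
                  'VPD': rng.normal(5, 2, 1000)})   # e.g. vapor pressure deficit
y = 0.8 * X['TA'] - 0.5 * X['VPD'] + rng.normal(0, 1, 1000)

# Add a purely random feature as an importance baseline
X['RANDOM'] = rng.normal(0, 1, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# Permutation importance, evaluated on the held-out test set
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X.columns)

# Keep only features that score higher than the random baseline
keep = [c for c in importances.index
        if c != 'RANDOM' and importances[c] > importances['RANDOM']]
print(importances.sort_values(ascending=False))
print('Features kept:', keep)
```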

The class now splits input data into training and testing datasets (holdout set). By
default, the training set comprises 75% of the input data and the testing set 25%. After
the model is trained, it is evaluated on the testing set, which gives a better indication
of how well the model performs on unseen data.

Once `.trainmodel()` is finished, the model is stored internally and can be used to gap-fill
the target variable by calling `.fillgaps()`.
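
In code, the intended workflow looks roughly like the sketch below; only `.trainmodel()`
and `.fillgaps()` are named in this release, so the constructor arguments and the target
column shown here are assumptions for illustration:

```python
from diive.configs.exampledata import load_exampledata_parquet
from diive.pkgs.gapfilling.randomforest_ts import RandomForestTS

df = load_exampledata_parquet()  # example dataset, see the Additions below

# Hypothetical arguments, for illustration only; check the class docstring
# for the actual signature.
rfts = RandomForestTS(input_df=df, target_col='NEE')
rfts.trainmodel()  # trains on the 75% training set, evaluates on the 25% holdout
rfts.fillgaps()    # gap-fills the target variable with the stored model
```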

In addition, the class now offers improved output, with additional text and plots
that give more information about model training, testing and application during gap-filling.

`RandomForestTS` has also been streamlined: the option to include timestamp info as features
(e.g., a column describing the season of the respective record) during model building was
removed from the class and is now a standalone function (`include_timestamp_as_cols()`, see new features below).

New features

- New class `QuickFillRFTS` that uses `RandomForestTS` in the background to quickly fill time series
data (`pkgs.gapfilling.randomforest_ts.QuickFillRFTS`)
- New function to include timestamp info as features, e.g. YEAR and DOY (`core.times.times.include_timestamp_as_cols`)
- New function to calculate various model scores, e.g. mean absolute error, R2 and
more (`core.ml.common.prediction_scores_regr`)
- New function to insert the meteorological season (Northern hemisphere) as a variable (`core.times.times.insert_season`).
For each record in the time series, the seasonal info from spring (March, April, May) to winter (December,
January, February) is added as an integer (0=spring, 1=summer, 2=autumn, 3=winter); the mapping is sketched below.
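
The season mapping itself reduces to a month lookup. A minimal standalone sketch of the
logic described above (not the `diive` implementation):

```python
import pandas as pd

# Meteorological seasons (Northern hemisphere) as integers:
# 0=spring (MAM), 1=summer (JJA), 2=autumn (SON), 3=winter (DJF)
MONTH_TO_SEASON = {3: 0, 4: 0, 5: 0, 6: 1, 7: 1, 8: 1,
                   9: 2, 10: 2, 11: 2, 12: 3, 1: 3, 2: 3}

def insert_season_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Add a SEASON column derived from the month of the DatetimeIndex."""
    df = df.copy()
    df['SEASON'] = df.index.month.map(MONTH_TO_SEASON)
    return df
```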

Additions

- Added new example dataset, comprising ecosystem fluxes between 1997 and 2022 from the
[ICOS Class 1 Ecosystem station CH-Dav](https://www.swissfluxnet.ethz.ch/index.php/sites/ch-dav-davos/site-info-ch-dav/).
This dataset will be used for testing code on long-term time series. The dataset is stored in the `parquet`
file format, which allows fast loading and saving of datafiles in combination with good compression.
The simplest way to load the dataset is to use:

```python
from diive.configs.exampledata import load_exampledata_parquet

df = load_exampledata_parquet()
```

Changes

- Updated README with installation details

Notebooks

- Updated notebook `notebooks/CalculateVariable/Calculate_VPD_from_TA_and_RH.ipynb`

0.57.1

Changes

Updates to class `FormatEddyProFluxnetFileForUpload` for quickly formatting the EddyPro _fluxnet_
output file to comply with [FLUXNET](https://fluxnet.org/) requirements for uploading data.

Additions

- **Formatting EddyPro _fluxnet_ files for upload to FLUXNET**: `FormatEddyProFluxnetFileForUpload`

- Added new method to rename variables from the EddyPro _fluxnet_ file to comply
with [FLUXNET variable codes](http://www.europe-fluxdata.eu/home/guidelines/how-to-submit-data/variables-codes).
`._rename_to_variable_codes()`
  - Added new method to remove erroneous time periods from the dataset: `.remove_erroneous_data()`
- Added new method to remove fluxes from time periods of insufficient signal strength / AGC
`.remove_low_signal_data()`

Bugfixes

- Fixed bug: when data points are removed manually using class `ManualRemoval` and the data to be removed
is a single datetime (e.g., `2005-07-05 23:15:00`), the removal now also works if the
provided datetime is not found in the time series. Previously, the class raised an error stating that
the provided datetime is not part of the index. (`pkgs.outlierdetection.manualremoval.ManualRemoval`)

Notebooks

- Updated notebook `notebooks/Formats/FormatEddyProFluxnetFileForUpload.ipynb` to version `3`

0.57.0

Changes

- Relaxed the conditions for inferring the time resolution of time
series (`core.times.times.timestamp_infer_freq_progressively`, `core.times.times.timestamp_infer_freq_from_timedelta`)

Additions

- When reading parquet files, the `TimestampSanitizer` is now applied by default, e.g. to detect the
time resolution of the time series. Parquet files do not store time-resolution info the way pandas
dataframes do (e.g. `30T` for 30MIN time resolution), even if the dataframe containing that info was
saved to a parquet file.
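
  This behavior is easy to verify with pandas and `pyarrow`: the frequency attribute is lost on a
  parquet round trip but can be re-inferred from the timestamps, which is what the sanitizing step
  takes care of:

```python
import numpy as np
import pandas as pd

# Build a dataframe with an explicit 30MIN frequency on its index
index = pd.date_range('2023-01-01', periods=100, freq='30T')
df = pd.DataFrame({'TA': np.random.default_rng(0).normal(size=100)}, index=index)
print(df.index.freq)  # <30 * Minutes>

df.to_parquet('data.parquet')          # requires pyarrow
df2 = pd.read_parquet('data.parquet')
print(df2.index.freq)                  # None: freq info is not stored in parquet
print(pd.infer_freq(df2.index))        # e.g. '30T': re-inferred from the timestamps
```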

Bugfixes

- Fixed bug where interactive time series plot did not show in Jupyter notebooks (`core.plotting.timeseries.TimeSeries`)
- Fixed bug where certain parts of the flux processing chain could not be used for the sensible heat flux `H`.
The issue was that `H` is calculated from sonic temperature (`T_SONIC` in EddyPro `_fluxnet_` output files),
which was not considered in function `pkgs.flux.common.detect_flux_basevar`.
- Fixed bug: interactive plotting in notebooks using `bokeh` did not work. The reason was that the `bokeh` plot
tools (controls) `ZoomInTool()` and `ZoomOutTool()` do not seem to work anymore. Both tools are now deactivated.

Notebooks

- Added new notebook for simple (interactive) time series plotting `notebooks/Plotting/TimeSeries.ipynb`
- Updated notebook `notebooks/FluxProcessingChain/FluxProcessingChain.ipynb` to version 3

0.55.0

This update focuses on the flux processing chain, in particular the creation of the extended
quality flags, the flux storage correction and the creation of the overall quality flag `QCF`.

New Features

- Added new class `StepwiseOutlierDetection` that can be used for general outlier detection in
time series data. It is based on the `StepwiseMeteoScreeningDb` class introduced in v0.50.0,
but aims to be more generally applicable to all sorts of time series data stored in
files (`pkgs.outlierdetection.stepwiseoutlierdetection.StepwiseOutlierDetection`)
- Added new outlier detection class that identifies outliers based on seasonal-trend decomposition
and z-score calculations (`pkgs.outlierdetection.seasonaltrend.OutlierSTLRZ`)
- Added new outlier detection class that flags values based on absolute limits that can be defined
separately for daytime and nighttime (`pkgs.outlierdetection.absolutelimits.AbsoluteLimitsDaytimeNighttime`);
a generic sketch of the idea follows after this list
- Added small functions to directly save (`core.io.files.save_as_parquet`) and
load (`core.io.files.load_parquet`) parquet files. Parquet files offer fast loading and saving in
combination with good compression. For more information about the Parquet format
see [here](https://parquet.apache.org/)
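
The daytime/nighttime limits idea can be sketched generically with pandas; the threshold values
and the precomputed daytime flag below are illustrative, not the `diive` API:

```python
import pandas as pd

def flag_absolute_limits(series: pd.Series, is_daytime: pd.Series,
                         day_limits=(-20.0, 20.0),
                         night_limits=(-5.0, 5.0)) -> pd.Series:
    """Flag values outside absolute limits defined separately for day and night.

    Returns an integer flag series: 0 = OK, 2 = rejected (illustrative convention).
    """
    # Pick the applicable lower/upper limit per record
    lo = pd.Series(night_limits[0], index=series.index).where(~is_daytime, day_limits[0])
    hi = pd.Series(night_limits[1], index=series.index).where(~is_daytime, day_limits[1])
    outside = (series < lo) | (series > hi)
    return outside.astype(int) * 2
```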

Additions

- **Angle-of-attack**: The angle-of-attack test can now be used during QC flag creation
(`pkgs.fluxprocessingchain.level2_qualityflags.FluxQualityFlagsLevel2.angle_of_attack_test`)
- Various smaller additions

Changes

- Renamed class `FluxQualityFlagsLevel2` to `FluxQualityFlagsLevel2EddyPro` because it is directly based
on the EddyPro output (`pkgs.fluxprocessingchain.level2_qualityflags.FluxQualityFlagsLevel2EddyPro`)
- Renamed class `FluxStorageCorrectionSinglePoint`
to `FluxStorageCorrectionSinglePointEddyPro` (
`pkgs.fluxprocessingchain.level31_storagecorrection.FluxStorageCorrectionSinglePointEddyPro`)
- Refactored creation of flux quality
flags (`pkgs.fluxprocessingchain.level2_qualityflags.FluxQualityFlagsLevel2EddyPro`)
- **Missing storage correction terms** are now gap-filled using random forest before the storage terms are
added to the flux. For some records the calculated flux was available but the storage term was missing, resulting
in a missing storage-corrected flux (example: 97% of fluxes had a storage term available, but for 3% it was missing).
The gap-filling makes sure that each flux value has a corresponding storage term and thus more values are
available for further processing. The gap-filling is done solely based on timestamp information, such as DOY
and hour (sketched after this list). (`pkgs.fluxprocessingchain.level31_storagecorrection.FluxStorageCorrectionSinglePoint`)
- The **outlier detection using z-scores for daytime and nighttime data** uses latitude/longitude settings to
calculate daytime/nighttime via `pkgs.createvar.daynightflag.nighttime_flag_from_latlon`. Before z-score
calculation, the time resolution of the time series is now checked and assigned automatically.
(`pkgs.outlierdetection.zscore.zScoreDaytimeNighttime`)
- Removed `pkgs.fluxprocessingchain.level32_outlierremoval.FluxOutlierRemovalLevel32` since flux outlier
removal is now done in the generally applicable class `StepwiseOutlierDetection` (see new features)
- Various smaller changes and refactorings
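
A generic sketch of gap-filling a variable purely from timestamp features such as DOY and hour
(illustrative, not the `diive` implementation):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def gapfill_from_timestamp(series: pd.Series) -> pd.Series:
    """Fill gaps in a time series using only DOY and hour as predictors."""
    features = pd.DataFrame({'DOY': series.index.dayofyear,
                             'HOUR': series.index.hour}, index=series.index)
    known = series.notna()
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(features[known], series[known])
    filled = series.copy()
    if (~known).any():
        filled[~known] = model.predict(features[~known])
    return filled
```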

Environment

- Updated `poetry` to newest version `v1.5.1`. The `lock` files have a new format since `v1.3.0`.
- Created new `lock` file for `poetry`.
- Added new package `pyarrow`.
- Added new package `pymannkendall` (see [PyPI](https://pypi.org/project/pymannkendall/)) to analyze
time series data for trends. Functions of this package are not yet implemented in `diive`.
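
Until then, the package can be used alongside `diive` directly; a minimal example of its
Mann-Kendall trend test:

```python
import numpy as np
import pymannkendall as mk

# Synthetic series with an upward trend plus noise
data = np.arange(100) + np.random.default_rng(0).normal(0, 10, 100)
result = mk.original_test(data)
print(result.trend, result.p, result.slope)  # e.g. 'increasing', p-value, Sen's slope
```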

Notebooks

- Added new notebook for loading and saving parquet files in `notebooks/Formats/LoadSaveParquetFile.ipynb`
- **Flux processing chain**: Added new notebook for flux post-processing
in `notebooks/FluxProcessingChain/FluxProcessingChain.ipynb`.

0.54.0

New Features

- Identify critical heat days for the ecosystem flux NEE (net ecosystem exchange), based on air temperature and VPD
(`pkgs.flux.criticalheatdays.FluxCriticalHeatDaysP95`)
- Calculate z-aggregates in classes of x and y (`pkgs.analyses.quantilexyaggz.QuantileXYAggZ`); see the sketch after this list
- Plot heatmap from pivoted dataframe, using x,y,z values (`core.plotting.heatmap_xyz.HeatmapPivotXYZ`)
- Calculate stats for time series and store results in dataframe (`core.dfun.stats.sstats`)
- New helper function to load and merge files of a specific filetype (`core.io.files.loadfiles`)
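
The x/y-class aggregation can be sketched generically with pandas quantile binning; the column
names and the number of classes are illustrative, not the `diive` API:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=1000), 'y': rng.normal(size=1000)})
df['z'] = df['x'] * df['y'] + rng.normal(0, 0.1, 1000)

# Bin x and y into 5 quantile classes each, then aggregate z per (x, y) class
df['x_class'] = pd.qcut(df['x'], q=5, labels=False)
df['y_class'] = pd.qcut(df['y'], q=5, labels=False)
pivot = df.pivot_table(values='z', index='y_class', columns='x_class', aggfunc='mean')
print(pivot)  # ready for plotting as a heatmap
```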

Additions

- Added more parameters when formatting EddyPro _fluxnet_ file for FLUXNET
(`pkgs.formats.fluxnet.FormatEddyProFluxnetFileForUpload`)

Changes

- Removed left-over code
- Multiple smaller refactorings

Notebooks

- Added new notebook for calculating VPD in `notebooks/CalculateVariable/Calculate_VPD_from_TA_and_RH.ipynb`
- Added new notebook for calculating time series stats `notebooks/Stats/TimeSeriesStats.ipynb`
- Added new notebook for formatting EddyPro output for upload to
FLUXNET `notebooks/Formats/FormatEddyProFluxnetFileForUpload.ipynb`

0.53.3

Notebooks

- Added new notebooks for reading data files (ICOS BM files)
- Added additional output to other notebooks
- Added new notebook section `Workbench` for practical use cases

Additions

- New filetype `configs/filetypes/ICOS_H1R_CSVZIP_1MIN.yml`
