New release of gordo-components!
Small changes:
- All dependencies are updated, including pandas (0.24.2 → 0.25.0)
- Fix an issue where the IROC reader used 1 thread by default (#409)
- Add exponential retries to the influx forwarder (#413)
- Filter bad data (code 0) from the datalake (#423)
- Add a wrapper enabling use of standard scikit-learn scorers (#427)
Major change:
Change all our Keras neural networks to take an explicit `y` instead of using the passed (and possibly scaled) `X` as the target.
This gives more freedom in several ways:
- It allows training towards an unscaled `y` with a scaled `X`, or having them scaled in different ways.
- It allows the `y` and `X` to be different sets of tags. The target `y` can be a subset of `X` or even a completely different set of tags.
- It follows the standard scikit-learn pattern, making it easier to use e.g. standard scikit-learn scorers (more on this below).
But it also involves some changes in the model definitions to get the same behavior as before.
Change in model format:
Previous model definition:
```yaml
model:
  sklearn.pipeline.Pipeline:
    steps:
      - sklearn.preprocessing.data.MinMaxScaler
      - gordo_components.model.models.KerasLSTMAutoEncoder:
          kind: lstm_hourglass
          lookback_window: 10
```
New model definition:
```yaml
model:
  gordo_components.model.anomaly.diff.DiffBasedAnomalyDetector:
    base_estimator:
      sklearn.compose.TransformedTargetRegressor:
        transformer: sklearn.preprocessing.data.MinMaxScaler
        regressor:
          sklearn.pipeline.Pipeline:
            steps:
              - sklearn.preprocessing.data.MinMaxScaler
              - gordo_components.model.models.KerasLSTMAutoEncoder:
                  kind: lstm_hourglass
                  lookback_window: 10
```
Explanation:
The first class, `gordo_components.model.anomaly.diff.DiffBasedAnomalyDetector`, takes a base estimator as a parameter and provides a new method, `anomaly`, in addition to any methods the `base_estimator` already has (like `fit` and `predict`). In `DiffBasedAnomalyDetector`, a call to `anomaly(X, y)` is implemented by calling `predict` on the `base_estimator`, scaling the output, scaling the passed `y`, calculating the absolute values of the differences, and then calculating the norm. The output of `anomaly(X, y)` is a multi-level dataframe with the original input and output of the base estimator, in addition to the per-sensor errors (absolute differences) and the total error score. The major difference from before is that the error calculation is now an explicit class which can be used in e.g. notebooks, instead of existing as a function in the server class.
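The error calculation described above can be sketched in plain numpy/scikit-learn. This is an illustrative sketch of the listed steps only, not the actual `DiffBasedAnomalyDetector` implementation; the choice of `MinMaxScaler`, what the scaler is fitted on, and the use of the Euclidean norm are assumptions here:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def diff_based_errors(y_true, y_pred):
    # Assumed: scale both the target and the model output with the same scaler
    scaler = MinMaxScaler().fit(y_true)
    y_true_scaled = scaler.transform(y_true)
    y_pred_scaled = scaler.transform(y_pred)
    # Per-sensor error: absolute value of the differences
    per_sensor = np.abs(y_true_scaled - y_pred_scaled)
    # Total error score per sample: the norm across sensors
    total = np.linalg.norm(per_sensor, axis=1)
    return per_sensor, total

rng = np.random.RandomState(0)
y = rng.rand(100, 10)
pred = y + rng.normal(scale=0.01, size=y.shape)  # pretend model output
per_sensor, total = diff_based_errors(y, pred)
print(per_sensor.shape, total.shape)  # (100, 10) (100,)
```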
The second new class in the config above is `sklearn.compose.TransformedTargetRegressor`. This is a standard scikit-learn class which scales the target `y` before the model is fitted, and then inverse-scales the output of the `base_estimator` when `predict` is called. It is needed if you want the Keras network to train towards a scaled `y`, as it did before; if you do not want this, you can simply omit the `sklearn.compose.TransformedTargetRegressor`.
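A minimal standalone example of how `TransformedTargetRegressor` behaves: it fits its regressor on the transformed `y` and inverse-transforms predictions back to the original scale. A `LinearRegression` stands in for the Keras pipeline here, purely to keep the example small and fast:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

model = TransformedTargetRegressor(
    regressor=LinearRegression(),  # stand-in for the Keras pipeline
    transformer=MinMaxScaler(),    # y is min-max scaled before fitting
)

rng = np.random.RandomState(0)
X = rng.rand(100, 10)
y = rng.rand(100, 3) * 1000  # target on a much larger scale than X

model.fit(X, y)          # regressor sees MinMax-scaled y internally
pred = model.predict(X)  # output is inverse-transformed back to y's scale
print(pred.shape)  # (100, 3)
```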
Using scikit-learn scorers
It is now possible to use standard scikit-learn scorers with a simple wrapper.
Example:
```python
from gordo_components import serializer
import yaml
import numpy
from sklearn.metrics import r2_score

config = yaml.safe_load(
    """
    sklearn.pipeline.Pipeline:
      steps:
        - sklearn.preprocessing.data.MinMaxScaler
        - gordo_components.model.models.KerasLSTMAutoEncoder:
            kind: lstm_hourglass
            lookback_window: 10
            epochs: 20
    """
)
model = serializer.pipeline_from_definition(config)
X = numpy.random.rand(100, 10)
y = numpy.random.rand(100, 10)
model.fit(X, y)

# This will fail, since the output and the target are of different lengths:
r2_score(X, model.predict(X))
```

The fix:

```python
from gordo_components.model.utils import metric_wrapper

metric_wrapper(r2_score)(X, model.predict(X))
```