
Latest version: v0.2.9



Release Notes

`DataSplitter`: utility class for splitting a dataset according to its task type
> For a regression task, splitting the data is straightforward.
> For a classification task, the data must be split by label, because every split has to contain all available labels.

import numpy as np

from cfdata.types import np_int_type
from cfdata.tabular.types import TaskTypes
from cfdata.tabular.wrapper import TabularDataset
from cfdata.tabular.utils import DataSplitter

x = np.arange(12).reshape([6, 2])
# create an imbalanced dataset
y = np.zeros(6, np_int_type)
y[[-1, -2]] = 1
dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
data_splitter = DataSplitter().fit(dataset)
# labels in the result keep their ratio
result = data_splitter.split(3)
# [0 0 1]
print(result.dataset.y.ravel())
result = data_splitter.split(0.5)
# [0 0 1]
print(result.dataset.y.ravel())
# at least one sample of each class will be kept
y[-2] = 0
dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
data_splitter = DataSplitter().fit(dataset)
result = data_splitter.split(2)
# [0 0 0 0 0 1] [0 1]
print(y, result.dataset.y.ravel())
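
The label-aware splitting described above can be sketched with plain NumPy. This is only an illustration of the idea, not how `DataSplitter` is implemented: `stratified_split` is a hypothetical helper, and its rounding rule is an assumption, so it may return slightly more than the requested number of samples when a rare class has to be kept.

```python
import numpy as np


def stratified_split(y, n, rng):
    """Pick roughly `n` indices from `y`, keeping label ratios.

    Hypothetical helper for illustration only; not part of cfdata.
    """
    y = np.asarray(y).ravel()
    num_samples = len(y)
    if isinstance(n, float):
        # a ratio such as 0.5 is converted to a sample count
        n = int(round(n * num_samples))
    picked = []
    for label in np.unique(y):
        label_indices = np.flatnonzero(y == label)
        # proportional share of this class, but never fewer than one sample
        share = max(1, int(round(n * len(label_indices) / num_samples)))
        picked.append(rng.choice(label_indices, size=share, replace=False))
    return np.sort(np.concatenate(picked))


rng = np.random.default_rng(0)

y = np.array([0, 0, 0, 0, 1, 1])
picked = stratified_split(y, 3, rng)
print(y[picked])  # [0 0 1]

# with a single positive sample, that sample is still always picked
y_rare = np.array([0, 0, 0, 0, 0, 1])
picked_rare = stratified_split(y_rare, 2, rng)
print(y_rare[picked_rare])
```

Sampling per class, instead of from the whole index range at once, is what guarantees that no label disappears from the result.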

`KFold`: utility class that performs k-fold data splitting:
1. X = {x1, x2, ..., xn} -> [X1, X2, ..., Xk]
2. Xi ∩ Xj = ∅, ∀ i ≠ j, where i, j = 1, ..., k
3. X1 ∪ X2 ∪ ... ∪ Xk = X

> Notice that `KFold` does not always satisfy the principles listed above, because `DataSplitter` ensures that at least one sample of each class is kept. As a result, when `KFold` is applied to an imbalanced dataset, it may slightly violate principles 2 and 3.

+ k : int, number of folds
+ dataset : TabularDataset, dataset which we want to split
+ **kwargs : used to initialize `DataSplitter` instance

import numpy as np

from cfdata.types import np_int_type
from cfdata.tabular.types import TaskTypes
from cfdata.tabular.wrapper import TabularDataset
from cfdata.tabular.utils import KFold

x = np.arange(12).reshape([6, 2])
# create an imbalanced dataset
y = np.zeros(6, np_int_type)
y[[-1, -2]] = 1
dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
k_fold = KFold(3, dataset)
for train_fold, test_fold in k_fold:
    print(np.vstack([train_fold.dataset.x, test_fold.dataset.x]))
    print(np.vstack([train_fold.dataset.y, test_fold.dataset.y]))
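
The three k-fold principles listed above can be checked on a plain index partition. The sketch below is a label-agnostic illustration, assuming `k >= 2`; `k_fold_indices` is a hypothetical helper and does not reproduce the class-preserving behaviour of `KFold`.

```python
import numpy as np


def k_fold_indices(num_samples, k, rng):
    """Yield (train_indices, test_indices) pairs for k folds.

    Hypothetical helper for illustration only; not part of cfdata.
    """
    shuffled = rng.permutation(num_samples)
    folds = np.array_split(shuffled, k)
    for i, test_indices in enumerate(folds):
        # every fold except the i-th one forms the training part
        train_indices = np.concatenate(
            [fold for j, fold in enumerate(folds) if j != i]
        )
        yield train_indices, test_indices


rng = np.random.default_rng(0)
test_folds = [test for _, test in k_fold_indices(6, 3, rng)]

# principle 2: the folds are pairwise disjoint
for i in range(len(test_folds)):
    for j in range(i + 1, len(test_folds)):
        assert len(np.intersect1d(test_folds[i], test_folds[j])) == 0

# principle 3: the union of all folds is the full index set
assert sorted(np.concatenate(test_folds).tolist()) == list(range(6))
```

Forcing every class into every fold, as the note above describes, would require some samples to appear in more than one fold, which is why principles 2 and 3 can then be slightly violated.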

`KRandom`: utility class that performs k-random data splitting:
1. X = {x1, x2, ..., xn} -> [X1, X2, ..., Xk]
2. idx{X1} ≠ idx{X2} ≠ ... ≠ idx{Xk}, where idx{X} = {1, 2, ..., n}
3. X1 = X2 = ... = Xk = X

+ k : int, number of folds
+ num_test : {int, float}
    + if float and < 1 : ratio of the test dataset
    + if int and > 1 : exact number of test samples
+ dataset : TabularDataset, dataset which we want to split
+ **kwargs : used to initialize `DataSplitter` instance

import numpy as np

from cfdata.types import np_int_type
from cfdata.tabular.types import TaskTypes
from cfdata.tabular.wrapper import TabularDataset
from cfdata.tabular.utils import KRandom

x = np.arange(12).reshape([6, 2])
# create an imbalanced dataset
y = np.zeros(6, np_int_type)
y[[-1, -2]] = 1
dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
k_random = KRandom(3, 2, dataset)
for train_fold, test_fold in k_random:
    print(np.vstack([train_fold.dataset.x, test_fold.dataset.x]))
    print(np.vstack([train_fold.dataset.y, test_fold.dataset.y]))
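
Unlike k-fold, k-random splitting reshuffles the whole dataset for every split, so test sets from different splits may overlap. A minimal sketch with plain NumPy follows; `k_random_indices` is a hypothetical helper, it ignores labels, and it only mirrors the `num_test` convention described above (a float is a ratio, an int is a count).

```python
import numpy as np


def k_random_indices(num_samples, k, num_test, rng):
    """Yield k independent random (train_indices, test_indices) pairs.

    Hypothetical helper for illustration only; not part of cfdata.
    """
    if isinstance(num_test, float):
        # a float is interpreted as the ratio of test samples
        num_test = int(round(num_test * num_samples))
    for _ in range(k):
        # a fresh shuffle of the same index set for every split
        shuffled = rng.permutation(num_samples)
        yield shuffled[num_test:], shuffled[:num_test]


rng = np.random.default_rng(0)
splits = list(k_random_indices(6, 3, 2, rng))
for train_indices, test_indices in splits:
    print(train_indices, test_indices)
```

Each pair covers all six indices, which corresponds to principle 3 (every Xi contains the same samples as X, only in a different order).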

`ImbalancedSampler`: utility class that samples from an imbalanced dataset in a balanced way

+ data : TabularData, data which we want to sample from
+ imbalance_threshold : float
    + for binary class cases, if n_pos / n_neg < threshold, the data is treated as imbalanced
    + for multi class cases, if n_min_class / n_max_class < threshold, the data is treated as imbalanced
+ shuffle : bool, whether to shuffle the returned indices
+ sample_method : str, sampling method used in `cftool.misc.Sampler`
    + currently only 'multinomial' is supported
+ verbose_level : int, verbose level used in `LoggingMixin`

import numpy as np

from cfdata.types import np_int_type
from cfdata.tabular import TabularData
from cfdata.tabular.utils import ImbalancedSampler
from cftool.misc import get_counter_from_arr

n = 20
x = np.arange(2 * n).reshape([n, 2])
# create an imbalanced dataset
y = np.zeros([n, 1], np_int_type)
y[-1] = [1]
data = TabularData().read(x, y)
sampler = ImbalancedSampler(data)
# Counter({1: 12, 0: 8})
# This may vary, but will be rather balanced
# You might notice that positive samples are even more than negative samples!
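
The threshold rule and the balanced sampling described above can be illustrated without cfdata. In this sketch, `is_imbalanced` and `balanced_indices` are hypothetical helpers, the threshold value `0.1` is chosen only for the example, and the min / max count ratio is used for both the binary and the multi-class case. The sampling step draws indices with replacement, weighting each sample inversely to the size of its class, which is one way to realise multinomial sampling.

```python
from collections import Counter

import numpy as np


def is_imbalanced(y, threshold):
    """Return True if n_min_class / n_max_class is below `threshold`.

    Hypothetical helper for illustration only; not part of cfdata.
    """
    counts = np.unique(np.asarray(y).ravel(), return_counts=True)[1]
    return bool(counts.min() / counts.max() < threshold)


def balanced_indices(y, num_samples, rng):
    """Draw indices so that every class is equally likely to be picked."""
    y = np.asarray(y).ravel()
    _, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    # each sample is weighted inversely to the size of its class
    weights = 1.0 / counts[inverse]
    probabilities = weights / weights.sum()
    return rng.choice(len(y), size=num_samples, replace=True, p=probabilities)


rng = np.random.default_rng(0)
y = np.zeros(20, dtype=int)
y[-1] = 1

print(is_imbalanced(y, 0.1))  # True, because 1 / 19 is below 0.1
indices = balanced_indices(y, 20, rng)
# the exact counts vary with the seed, but both classes are roughly equally frequent
print(Counter(y[indices].tolist()))
```

Because the single positive sample carries half of the total probability mass, it is drawn about as often as all negative samples combined, which is why the positive count can even exceed the negative count.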

`DataLoader`: utility class that generates batches from an `ImbalancedSampler`

import numpy as np

from cfdata.types import np_int_type
from cfdata.tabular import TabularData
from cfdata.tabular.utils import DataLoader
from cfdata.tabular.utils import ImbalancedSampler
from cftool.misc import get_counter_from_arr

n = 20
x = np.arange(2 * n).reshape([n, 2])
# create an imbalanced dataset
y = np.zeros([n, 1], np_int_type)
y[-1] = [1]
data = TabularData().read(x, y)
sampler = ImbalancedSampler(data)
loader = DataLoader(16, sampler)
y_batches = []
for x_batch, y_batch in loader:
    y_batches.append(y_batch)
    # (16, 1) (16, 1)
    # (4, 1) (4, 1)
    print(x_batch.shape, y_batch.shape)
# Counter({1: 11, 0: 9})
print(get_counter_from_arr(np.vstack(y_batches).ravel()))
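
The batching step itself is simple to sketch: a list of sampled indices is cut into chunks of `batch_size`, and the last chunk holds whatever remains. The helper `iterate_batches` below is hypothetical, and a random permutation stands in for the sampler output. It slices the raw arrays without any preprocessing, so the feature batches keep both columns of `x`.

```python
import numpy as np


def iterate_batches(x, y, indices, batch_size):
    """Yield (x_batch, y_batch) pairs following the order of `indices`.

    Hypothetical helper for illustration only; not part of cfdata.
    """
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield x[batch], y[batch]


n = 20
x = np.arange(2 * n).reshape([n, 2])
y = np.zeros([n, 1], dtype=int)
y[-1] = 1

# stand-in for indices produced by a sampler
indices = np.random.default_rng(0).permutation(n)

shapes = [(xb.shape, yb.shape) for xb, yb in iterate_batches(x, y, indices, 16)]
print(shapes)  # [((16, 2), (16, 1)), ((4, 2), (4, 1))]
```

With 20 samples and a batch size of 16, the loop produces one full batch followed by a final batch of 4.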


First release
