Release Notes
`DataSplitter`
Utility class for splitting a dataset according to its task type.
> For a regression task, the data can be split directly.
> For a classification task, the data must be split by label, so that every split contains all available labels.
Examples
python
import numpy as np
from cfdata.types import np_int_type
from cfdata.tabular.types import TaskTypes
from cfdata.tabular.wrapper import TabularDataset
from cfdata.tabular.utils import DataSplitter
x = np.arange(12).reshape([6, 2])
# create an imbalanced dataset
y = np.zeros(6, np_int_type)
y[[-1, -2]] = 1
dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
data_splitter = DataSplitter().fit(dataset)
# labels in the result keep their original ratio
result = data_splitter.split(3)
# [0 0 1]
print(result.dataset.y.ravel())
data_splitter.reset()
result = data_splitter.split(0.5)
# [0 0 1]
print(result.dataset.y.ravel())
# at least one sample of each class is always kept
y[-2] = 0
dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
data_splitter = DataSplitter().fit(dataset)
result = data_splitter.split(2)
# [0 0 0 0 0 1] [0 1]
print(y, result.dataset.y.ravel())
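The label-aware splitting rule described in this section can be illustrated without `cfdata`. The following is a minimal sketch in plain Python, assuming a simple proportional allocation per class; `stratified_split` is a hypothetical helper, not the `DataSplitter` implementation, and it may return slightly more than `n` samples when a rare class has to be topped up to one sample.

```python
import random
from collections import defaultdict


def stratified_split(labels, n, seed=0):
    """Pick roughly `n` indices while keeping at least one sample per class.

    Hypothetical helper for illustration; not the cfdata implementation.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    picked = []
    for label, indices in sorted(by_class.items()):
        # proportional share of this class, but never fewer than one sample
        share = max(1, round(n * len(indices) / total))
        picked.extend(rng.sample(indices, share))
    return sorted(picked)


labels = [0, 0, 0, 0, 1, 1]
chosen = stratified_split(labels, 3)
print([labels[i] for i in chosen])  # [0, 0, 1]
```

With four negatives and two positives, a split of size 3 keeps the 2:1 ratio, which matches the behaviour shown in the example above.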
`KFold`
Utility class that performs k-fold data splitting:
1. X = {x1, x2, ..., xn} -> [X1, X2, ..., Xk]
2. Xi ∩ Xj = ∅, ∀ i ≠ j, where i, j = 1, ..., k
3. X1 ∪ X2 ∪ ... ∪ Xk = X
> Note that `KFold` does not always satisfy the properties listed above, because `DataSplitter` ensures that at least one sample of each class is kept. As a result, when `KFold` is applied to an imbalanced dataset, it may slightly violate properties 2 and 3.
Parameters
+ k : int, number of folds
+ dataset : TabularDataset, dataset which we want to split
+ **kwargs : used to initialize `DataSplitter` instance
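The partition properties listed for `KFold` can be checked on a small index set. This is a sketch in plain Python under the assumption of a simple shuffled, label-agnostic partition; `k_fold_indices` is a hypothetical helper and does not reproduce the per-class guarantee of `DataSplitter`.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split range(n) into k disjoint folds that together cover every index."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # fold i takes every k-th shuffled index, starting at offset i
    return [indices[i::k] for i in range(k)]


folds = k_fold_indices(6, 3)
for i, test_fold in enumerate(folds):
    # the training part is the union of all other folds
    train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    print(sorted(train), sorted(test_fold))
```

Each index lands in exactly one test fold, so the folds are pairwise disjoint and their union is the full index set.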
Examples
python
import numpy as np
from cfdata.types import np_int_type
from cfdata.tabular.types import TaskTypes
from cfdata.tabular.wrapper import TabularDataset
from cfdata.tabular.utils import KFold
x = np.arange(12).reshape([6, 2])
# create an imbalanced dataset
y = np.zeros(6, np_int_type)
y[[-1, -2]] = 1
dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
k_fold = KFold(3, dataset)
for train_fold, test_fold in k_fold:
    print(np.vstack([train_fold.dataset.x, test_fold.dataset.x]))
    print(np.vstack([train_fold.dataset.y, test_fold.dataset.y]))
`KRandom`
Utility class that performs k-random data splitting:
1. X = {x1, x2, ..., xn} -> [X1, X2, ..., Xk]
2. idx{X1} ≠ idx{X2} ≠ ... ≠ idx{Xk}, where idx{Xi} is the order (a permutation of {1, 2, ..., n}) in which the samples appear in Xi
3. X1 = X2 = ... = Xk = X
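The properties above say that every split contains the whole dataset, only in a different order. A minimal sketch of that idea in plain Python follows; `k_random_splits` is a hypothetical helper that assumes the test part is simply the tail of each shuffle, and it ignores the label handling that `DataSplitter` performs.

```python
import random


def k_random_splits(n, k, num_test, seed=0):
    """Yield k (train, test) index pairs, each from a fresh shuffle of range(n)."""
    rng = random.Random(seed)
    for _ in range(k):
        order = list(range(n))
        rng.shuffle(order)
        # the last `num_test` shuffled indices form the test part
        yield order[:-num_test], order[-num_test:]


for train, test in k_random_splits(6, 3, 2):
    print(train, test)
```

Unlike k-fold splitting, the test parts of different splits are drawn independently and may overlap.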
Parameters
+ k : int, number of folds
+ num_test : {int, float}
    + if float and < 1 : ratio of the test dataset
    + if int and > 1 : exact number of test samples
+ dataset : TabularDataset, dataset which we want to split
+ **kwargs : used to initialize `DataSplitter` instance
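The two accepted forms of `num_test` can be read as follows. This sketch is an assumption about how a ratio might be converted to a count (rounding to the nearest integer); `resolve_num_test` is a hypothetical helper, and values outside the two documented cases are rejected here only for illustration.

```python
def resolve_num_test(num_test, n_samples):
    """Turn a ratio or a count into an absolute number of test samples."""
    if isinstance(num_test, float) and 0 < num_test < 1:
        # a float below 1 is interpreted as a ratio of the dataset size
        return int(round(num_test * n_samples))
    if isinstance(num_test, int) and num_test > 1:
        # an int above 1 is interpreted as an exact sample count
        return num_test
    raise ValueError("num_test should be a float in (0, 1) or an int > 1")


print(resolve_num_test(0.5, 6))  # 3
print(resolve_num_test(2, 6))    # 2
```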
Examples
python
import numpy as np
from cfdata.types import np_int_type
from cfdata.tabular.types import TaskTypes
from cfdata.tabular.wrapper import TabularDataset
from cfdata.tabular.utils import KRandom
x = np.arange(12).reshape([6, 2])
# create an imbalanced dataset
y = np.zeros(6, np_int_type)
y[[-1, -2]] = 1
dataset = TabularDataset.from_xy(x, y, TaskTypes.CLASSIFICATION)
k_random = KRandom(3, 2, dataset)
for train_fold, test_fold in k_random:
    print(np.vstack([train_fold.dataset.x, test_fold.dataset.x]))
    print(np.vstack([train_fold.dataset.y, test_fold.dataset.y]))
`ImbalancedSampler`
Utility class that samples from an imbalanced dataset in a balanced way.
Parameters
+ data : TabularData, data which we want to sample from
+ imbalance_threshold : float
    + for binary cases, if n_pos / n_neg < threshold, the data is treated as imbalanced
    + for multi-class cases, if n_min_class / n_max_class < threshold, the data is treated as imbalanced
+ shuffle : bool, whether to shuffle the returned indices
+ sample_method : str, sampling method used in `cftool.misc.Sampler`
    + currently only 'multinomial' is supported
+ verbose_level : int, verbose level used in `LoggingMixin`
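The threshold test and the balanced draw can be sketched in plain Python. Both functions below are hypothetical helpers, not the `ImbalancedSampler` implementation: the check uses the min/max class ratio for binary and multi-class data alike, the draw assumes inverse-class-frequency weights as one way to realise multinomial sampling, and the threshold value 0.1 is only illustrative.

```python
import random
from collections import Counter


def is_imbalanced(labels, threshold):
    """Compare the rarest class count against the most common one."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values()) < threshold


def balanced_indices(labels, seed=0):
    """Draw len(labels) indices so that each class is equally likely."""
    counts = Counter(labels)
    # weight each sample by the inverse of its class frequency
    weights = [1.0 / counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=len(labels))


labels = [0] * 19 + [1]
print(is_imbalanced(labels, 0.1))  # True
print(Counter(labels[i] for i in balanced_indices(labels)))
```

Because sampling is done with replacement, the single positive sample is drawn repeatedly, which is why the minority class can end up as frequent as the majority class.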
Examples
python
import numpy as np
from cfdata.types import np_int_type
from cfdata.tabular import TabularData
from cfdata.tabular.utils import ImbalancedSampler
from cftool.misc import get_counter_from_arr
n = 20
x = np.arange(2 * n).reshape([n, 2])
# create an imbalanced dataset
y = np.zeros([n, 1], np_int_type)
y[-1] = [1]
data = TabularData().read(x, y)
sampler = ImbalancedSampler(data)
# Counter({1: 12, 0: 8})
# The exact counts may vary, but the result will be roughly balanced
# Note that positive samples can even outnumber negative samples!
print(get_counter_from_arr(y[sampler.get_indices()]))
`DataLoader`
Utility class that generates batches from an `ImbalancedSampler`.
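The batching step itself is simple to picture. The sketch below is plain Python with a hypothetical `iter_batches` helper; it assumes the sampler's indices are consumed in order and that the final, smaller batch is kept rather than dropped.

```python
def iter_batches(indices, batch_size):
    """Yield consecutive slices of `indices`; the last batch may be smaller."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


sampled = list(range(20))  # stand-in for the indices produced by a sampler
for batch in iter_batches(sampled, 16):
    print(len(batch))  # 16, then 4
```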
Examples
python
import numpy as np
from cfdata.types import np_int_type
from cfdata.tabular import TabularData
from cfdata.tabular.utils import DataLoader
from cfdata.tabular.utils import ImbalancedSampler
from cftool.misc import get_counter_from_arr
n = 20
x = np.arange(2 * n).reshape([n, 2])
y = np.zeros([n, 1], np_int_type)
y[-1] = [1]
data = TabularData().read(x, y)
sampler = ImbalancedSampler(data)
loader = DataLoader(16, sampler)
y_batches = []
for x_batch, y_batch in loader:
    y_batches.append(y_batch)
    # (16, 1) (16, 1)
    # (4, 1) (4, 1)
    print(x_batch.shape, y_batch.shape)
# Counter({1: 11, 0: 9})
print(get_counter_from_arr(np.vstack(y_batches).ravel()))