Abstract class for managing the data and samples during the training of a classifier. DataManagerClassifier is used with TEClassifierRegular and TEClassifierProtoNet.
Value
Objects of this class are used for ensuring the correct data management for training different types of classifiers. Objects of this class are also used for data augmentation by creating synthetic cases with different techniques.
See also
Other Data Management: EmbeddedText, LargeDataSetForText, LargeDataSetForTextEmbeddings
Public fields
config
('list')
Field for storing the configuration of the DataManagerClassifier.
state
('list')
Field for storing the current state of the DataManagerClassifier.
datasets
('list')
Field for storing the data sets used during training. All elements of the list are data sets of class datasets.arrow_dataset.Dataset. The following data sets are available:
data_labeled: all cases which have a label.
data_unlabeled: all cases which have no label.
data_labeled_synthetic: all synthetic cases with their corresponding labels.
data_labeled_pseudo: subset of data_unlabeled if pseudo labels were estimated by a classifier.
name_idx
('named vector')
Field for storing the pairs of indexes and names of every case. The pairs for labeled and unlabeled data are stored separately.
samples
('list')
Field for storing the assignment of every case to a train, validation, or test data set depending on the concrete fold. Only the indexes and not the names are stored. In addition, the list contains the assignment for the final training, which excludes a test data set. If the DataManagerClassifier uses i folds, the sample for the final training can be requested with i+1.
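The indexing rule for the final training can be illustrated with a minimal base R sketch. The nested list below is a hypothetical stand-in for the samples field. The element names train, val, and test are assumptions made for illustration and are not taken from the class itself.

```r
# Hypothetical stand-in for the `samples` field of a manager with 3 folds.
# The element names (train, val, test) are illustrative assumptions.
n_folds <- 3
samples <- list(
  list(train = c(1, 2, 3, 4), val = c(5, 6), test = c(7, 8)), # fold 1
  list(train = c(3, 4, 7, 8), val = c(1, 2), test = c(5, 6)), # fold 2
  list(train = c(5, 6, 7, 8), val = c(3, 4), test = c(1, 2)), # fold 3
  list(train = c(1, 2, 3, 4, 5, 6), val = c(7, 8))            # final training, no test set
)

# With i folds, the sample for the final training sits at position i + 1.
final_sample <- samples[[n_folds + 1]]
```

With a real object, the same position would presumably be read from the list returned by get_samples(), using get_n_folds() for the number of folds.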
Methods
Method new()
Creating a new instance of this class.
Usage
DataManagerClassifier$new(
data_embeddings,
data_targets,
folds = 5,
val_size = 0.25,
class_levels,
one_hot_encoding = TRUE,
add_matrix_map = TRUE,
sc_methods = "dbsmote",
sc_min_k = 1,
sc_max_k = 10,
trace = TRUE,
n_cores = auto_n_cores()
)
Arguments
data_embeddings
Object of class EmbeddedText or LargeDataSetForTextEmbeddings from which the DataManagerClassifier should be created.
data_targets
factor
containing the labels for the cases stored in data_embeddings. The factor must be named and has to use the same names as data_embeddings. Missing values are supported and should be supplied (e.g., for pseudo labeling).
folds
int
determining the number of cross-fold samples. Value must be at least 2.
val_size
double
between 0 and 1, indicating the proportion of cases of each class which should be used for the validation sample. The remaining cases are part of the training data.
class_levels
vector
containing the possible levels of the labels.
one_hot_encoding
bool
If TRUE, all labels are converted to one-hot encoding.
add_matrix_map
bool
If TRUE, all embeddings are transformed into a two-dimensional matrix. The number of rows equals the number of cases. The number of columns equals times*features.
sc_methods
string
determining the technique used for creating synthetic cases.
sc_min_k
int
determining the minimal number of neighbors during the creation of synthetic cases.
sc_max_k
int
determining the maximal number of neighbors during the creation of synthetic cases.
trace
bool
If TRUE, information on the process is printed to the console.
n_cores
int
Number of cores which should be used during the calculation of synthetic cases.
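As a rough sketch of how the arguments fit together, the following code prepares a named factor for data_targets and wraps the constructor call in a helper. The case names, the labels, and the helper build_data_manager are hypothetical. The sketch assumes that the package providing DataManagerClassifier is attached and that an embeddings object whose case names match the factor is available. The toy factor only shows the expected format and is far too small for real training.

```r
# Hypothetical case names; they must match the names used in data_embeddings.
case_ids <- c("case_1", "case_2", "case_3", "case_4", "case_5", "case_6")

# Unlabeled cases are supplied as NA so they remain available for pseudo labeling.
raw_labels <- c("pos", "neg", NA, "pos", "neg", NA)
class_levels <- c("neg", "pos")

data_targets <- factor(raw_labels, levels = class_levels)
names(data_targets) <- case_ids

# Hypothetical helper: builds the manager from an existing embeddings object.
# All arguments not listed here keep the defaults shown under Usage.
build_data_manager <- function(data_embeddings, data_targets, class_levels) {
  DataManagerClassifier$new(
    data_embeddings = data_embeddings,
    data_targets = data_targets,
    folds = 5,
    val_size = 0.25,
    class_levels = class_levels
  )
}
```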
Method get_samples()
Method for requesting the assignments to train, validation, and test data sets for every fold and the final training.
Method set_state()
Method for setting the current state of the DataManagerClassifier.
Arguments
iteration
int
determining the current iteration of the training. That is, iteration determines the fold to use for training, validation, and testing. If i is the number of folds, i+1 requests the sample for the final training. Alternatively, iteration can take the string "final" to request the sample for the final training.
step
int
determining the step for estimating and using pseudo labels during training. Only relevant if training is requested with pseudo labels.
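A typical way to walk through all folds and then switch to the final training might look like the sketch below. The helper iterate_states is hypothetical, and dm is assumed to be an existing DataManagerClassifier. Passing only iteration assumes that step can be left out when no pseudo labels are used.

```r
# Hypothetical helper: `dm` is assumed to be an existing DataManagerClassifier.
iterate_states <- function(dm) {
  n_folds <- dm$get_n_folds()

  for (i in seq_len(n_folds)) {
    # Select fold i for training, validation, and testing.
    dm$set_state(iteration = i)
    # ... train and evaluate a classifier on fold i here ...
  }

  # Switch to the sample for the final training (equivalent to n_folds + 1).
  dm$set_state(iteration = "final")

  invisible(n_folds)
}
```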
Method get_n_folds()
Method for requesting the number of folds the DataManagerClassifier can use with the current data.
Method get_dataset()
Method for requesting a data set for training depending on the current state of the DataManagerClassifier.
Usage
DataManagerClassifier$get_dataset(
inc_labeled = TRUE,
inc_unlabeled = FALSE,
inc_synthetic = FALSE,
inc_pseudo_data = FALSE
)
Arguments
inc_labeled
bool
If TRUE, the data set includes all cases which have labels.
inc_unlabeled
bool
If TRUE, the data set includes all cases which have no labels.
inc_synthetic
bool
If TRUE, the data set includes all synthetic cases with their corresponding labels.
inc_pseudo_data
bool
If TRUE, the data set includes all cases which have pseudo labels.
Returns
Returns an object of class datasets.arrow_dataset.Dataset containing the requested kind of data along with all requested transformations for training. Please note that this method returns a data set that is designed for training only. The corresponding validation data set is requested with get_val_dataset and the corresponding test data set with get_test_dataset.
Method get_val_dataset()
Method for requesting a data set for validation depending on the current state of the DataManagerClassifier.
Returns
Returns an object of class datasets.arrow_dataset.Dataset containing the requested kind of data along with all requested transformations for validation. The corresponding data set for training can be requested with get_dataset and the corresponding data set for testing with get_test_dataset.
Method get_test_dataset()
Method for requesting a data set for testing depending on the current state of the DataManagerClassifier.
Returns
Returns an object of class datasets.arrow_dataset.Dataset containing the requested kind of data along with all requested transformations for testing. The corresponding data set for training can be requested with get_dataset and the corresponding data set for validation with get_val_dataset.
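The three getters are meant to be used together after the state has been set. The following sketch collects the training, validation, and test data sets for one fold. The helper get_fold_datasets is hypothetical, and dm is assumed to be an existing DataManagerClassifier. The sketch targets regular folds only, since the final training excludes a test data set.

```r
# Hypothetical helper: `dm` is assumed to be an existing DataManagerClassifier.
get_fold_datasets <- function(dm, iteration) {
  # All three getters depend on the current state, so set it first.
  dm$set_state(iteration = iteration)

  list(
    # Training data: labeled cases only, using the defaults shown under Usage.
    train = dm$get_dataset(
      inc_labeled = TRUE,
      inc_unlabeled = FALSE,
      inc_synthetic = FALSE,
      inc_pseudo_data = FALSE
    ),
    val = dm$get_val_dataset(),
    test = dm$get_test_dataset()
  )
}
```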
Method create_synthetic()
Method for generating synthetic data used during training. The process uses all labeled data belonging to the current state of the DataManagerClassifier.
Arguments
trace
bool
If TRUE, information on the process is printed to the console.
inc_pseudo_data
bool
If TRUE, data with pseudo labels are used in addition to the labeled data for generating synthetic cases.
Returns
This method returns nothing. It generates a new data set of synthetic cases, which is stored as an object of class datasets.arrow_dataset.Dataset in the field datasets$data_labeled_synthetic. Please note that a call of this method will override an existing data set in the corresponding field.
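Because the synthetic cases are generated from the labeled data of the current state, a plausible order of calls is to set the state, create the synthetic cases, and then request a training data set that includes them. The helper train_set_with_synthetic below is hypothetical, and dm is assumed to be an existing DataManagerClassifier.

```r
# Hypothetical helper: `dm` is assumed to be an existing DataManagerClassifier.
train_set_with_synthetic <- function(dm, iteration) {
  # Synthetic cases depend on the labeled data of the current state.
  dm$set_state(iteration = iteration)

  # Overrides any data set already stored in datasets$data_labeled_synthetic.
  dm$create_synthetic(trace = FALSE, inc_pseudo_data = FALSE)

  # Request the training data including the newly created synthetic cases.
  dm$get_dataset(
    inc_labeled = TRUE,
    inc_unlabeled = FALSE,
    inc_synthetic = TRUE,
    inc_pseudo_data = FALSE
  )
}
```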
Method add_replace_pseudo_data()
Method for adding data with pseudo labels generated by a classifier.
Arguments
inputs
array or matrix
representing the input data.
labels
factor
containing the corresponding pseudo labels.
Returns
This method returns nothing. It generates a new data set of cases with pseudo labels, which is stored as an object of class datasets.arrow_dataset.Dataset in the field datasets$data_labeled_pseudo. Please note that a call of this method will override an existing data set in the corresponding field.
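A minimal sketch of one pseudo-labeling step could look as follows. The helper add_pseudo_labels and its arguments predicted and class_levels are hypothetical. dm is assumed to be an existing DataManagerClassifier, and inputs is assumed to hold the embeddings of the unlabeled cases for which a classifier produced predictions. Any further requirements on inputs, such as case names, are not covered here.

```r
# Hypothetical helper: `dm` is assumed to be an existing DataManagerClassifier.
add_pseudo_labels <- function(dm, inputs, predicted, class_levels) {
  # Convert the predicted classes into a factor with the full set of levels.
  labels <- factor(predicted, levels = class_levels)

  # Overrides any data set already stored in datasets$data_labeled_pseudo.
  dm$add_replace_pseudo_data(inputs = inputs, labels = labels)

  # Request the training data including the cases with pseudo labels.
  dm$get_dataset(
    inc_labeled = TRUE,
    inc_unlabeled = FALSE,
    inc_synthetic = FALSE,
    inc_pseudo_data = TRUE
  )
}
```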