Creates cross-validation samples such that the relative frequency of every category/label within a fold matches the relative frequency of that category/label in the initial data.

Usage

get_folds(target, k_folds)

Arguments

target

Named factor containing the relevant labels/categories. Missing cases should be declared with NA.

k_folds

int Number of folds requested.

Value

Returns a list with the following components:

  • val_sample: vector of strings containing the names of cases of the validation sample.

  • train_sample: vector of strings containing the names of cases of the train sample.

  • n_folds: int Number of realized folds.

  • unlabeled_cases: vector of strings containing the names of the unlabeled cases.
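
The idea behind the stratified split can be sketched in base R. This is a minimal illustration, not the package's implementation: it assumes that cases are shuffled within each category and dealt out to the folds in turn, and it assumes that val_sample and train_sample hold one vector of case names per fold. The object fold_id is a hypothetical helper.

```r
set.seed(42)

# Named factor: 12 cases of category "a" and 8 cases of category "b".
target <- factor(rep(c("a", "b"), times = c(12, 8)))
names(target) <- paste0("case_", seq_along(target))
k_folds <- 4L

# Assign a fold to every case, category by category, so that each fold
# receives the same share of every category (hypothetical helper).
fold_id <- integer(length(target))
names(fold_id) <- names(target)
for (category in levels(target)) {
  case_names <- names(target)[target == category]
  shuffled <- sample(case_names)
  fold_id[shuffled] <- rep_len(seq_len(k_folds), length(shuffled))
}

# One vector of case names per fold (assumed layout of the components).
val_sample <- lapply(seq_len(k_folds), function(i) names(fold_id)[fold_id == i])
train_sample <- lapply(seq_len(k_folds), function(i) names(fold_id)[fold_id != i])

# Relative frequencies in the first validation fold and in the full data.
prop.table(table(target[val_sample[[1]]]))
prop.table(table(target))
```

In this sketch the category sizes are multiples of the number of folds, so the relative frequencies match exactly. With other sizes they can only match approximately.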

Note

The parameter target allows cases with missing categories/labels. These should be declared with NA. Such cases are ignored when the folds are created, and their names are stored in the component unlabeled_cases. They can be used for pseudo-labeling.
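
A short base R sketch of how such a target can be prepared and how the unlabeled cases can be identified. The case names and the object labeled_target are made up for illustration. The code does not call get_folds itself; it only mirrors the behavior described above.

```r
# Named factor with two unlabeled cases declared as NA.
target <- factor(c("a", "b", NA, "a", "b", NA))
names(target) <- paste0("case_", seq_along(target))

# Names of the cases without a label.
unlabeled_cases <- names(target)[is.na(target)]

# Only the labeled cases would take part in the folds.
labeled_target <- target[!is.na(target)]

print(unlabeled_cases)
print(names(labeled_target))
```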

The function checks the absolute frequency of every category/label. If a frequency is too low to ensure at least four cases in every fold, the number of folds is adjusted and a warning is printed to the console. At least four cases per fold are necessary to ensure that the training of TextEmbeddingClassifierNeuralNet works well with all options turned on.
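
The adjustment can be illustrated with a small base R sketch. The exact rule used by the package is not stated here, so the code is an assumption: it reads the requirement as "at least four cases of the rarest category in every fold" and caps the number of folds accordingly. The names min_cases_per_fold, max_folds, and n_folds are chosen for illustration only.

```r
# 30 cases of "a", 9 cases of "b", and two unlabeled cases.
target <- factor(c(rep("a", 30), rep("b", 9), NA, NA))
k_folds <- 5L
min_cases_per_fold <- 4L  # assumed interpretation of the rule above

# Absolute frequencies; table() leaves out NA by default.
freq <- table(target)

# Largest number of folds that still leaves four cases of the
# rarest category in every fold (assumed rule).
max_folds <- as.integer(floor(min(freq) / min_cases_per_fold))
n_folds <- min(k_folds, max_folds)

if (n_folds < k_folds) {
  warning("Number of folds reduced from ", k_folds, " to ", n_folds, ".")
}
print(n_folds)
```

Under this reading, the 9 cases of the rarest category allow only 2 folds, so a request for 5 folds is reduced to 2.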