Text embedding classifier with a neural net

Abstract class for neural nets with 'keras'/'tensorflow' and 'pytorch'.

Value

Objects of this class are used for assigning texts to classes/categories. Creating and training a classifier requires two inputs: an object of class EmbeddedText or LargeDataSetForTextEmbeddings, and a factor.

The object of class EmbeddedText or LargeDataSetForTextEmbeddings contains the numerical text representations (text embeddings) of the raw texts generated by an object of class TextEmbeddingModel. For supporting large data sets it is recommended to use LargeDataSetForTextEmbeddings instead of EmbeddedText.

The factor contains the classes/categories for every text. Missing values (unlabeled cases) are supported and can be used for pseudo labeling.
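The following base R sketch illustrates what such a factor might look like. The text IDs and labels are hypothetical. The names attached to the factor are assumed to match the case names in the embeddings object, as required for data_targets in train().

```r
# Hypothetical example data: five texts, two of them without a label.
targets <- factor(
  c("positive", "negative", NA, "positive", NA),
  levels = c("negative", "positive")
)

# Names link every label to the ID of its text embedding.
names(targets) <- c("text_01", "text_02", "text_03", "text_04", "text_05")

# Labeled cases can be used for supervised training.
labeled_ids <- names(targets)[!is.na(targets)]

# Unlabeled cases (NA) are the candidates for pseudo-labeling.
unlabeled_ids <- names(targets)[is.na(targets)]

print(table(targets, useNA = "ifany"))
```

Setting the levels explicitly keeps the set of categories stable even if a category happens to be absent from a particular sample.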

For predictions an object of class EmbeddedText or LargeDataSetForTextEmbeddings has to be used which was created with the same TextEmbeddingModel as for training.

Super class

aifeducation::AIFEBaseModel -> TEClassifierRegular

Public fields

feature_extractor

('list()')
List for storing information and objects about the feature_extractor.

reliability

('list()')

List for storing central reliability measures of the last training.

reliability$test_metric: Array containing the reliability measures for the test data for every fold and step (in case of pseudo-labeling).
reliability$test_metric_mean: Array containing the reliability measures for the test data. The values represent the mean values for every fold.
reliability$raw_iota_objects: List containing all iota objects generated with the package iotarelr for every fold at the end of the last training for the test data.
reliability$raw_iota_objects$iota_objects_end: List of objects with class iotarelr_iota2 containing the estimated iota reliability of the second generation for the final model for every fold for the test data.
reliability$raw_iota_objects$iota_objects_end_free: List of objects with class iotarelr_iota2 containing the estimated iota reliability of the second generation for the final model for every fold for the test data. Please note that the model is estimated without forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.
reliability$iota_object_end: Object of class iotarelr_iota2 as a mean of the individual objects for every fold for the test data.
reliability$iota_object_end_free: Object of class iotarelr_iota2 as a mean of the individual objects for every fold. Please note that the model is estimated without forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.
reliability$standard_measures_end: Object of class list containing the final measures for precision, recall, and f1 for every fold.
reliability$standard_measures_mean: Matrix containing the mean measures for precision, recall, and f1.
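To make the relation between the per-fold measures and their mean more concrete, here is a minimal sketch in base R. It is not the package's implementation: the helper standard_measures() and the two folds are hypothetical, and the measures are computed per class from a plain confusion table.

```r
# Hypothetical helper: per-class precision, recall, and F1 for one fold.
standard_measures <- function(true, predicted) {
  lv <- levels(true)
  # Rows are the true classes, columns the predicted classes.
  tab <- table(factor(true, levels = lv), factor(predicted, levels = lv))
  tp <- diag(tab)
  precision <- tp / colSums(tab)
  recall <- tp / rowSums(tab)
  f1 <- 2 * precision * recall / (precision + recall)
  cbind(precision = precision, recall = recall, f1 = f1)
}

lv <- c("a", "b")
# Hypothetical test data for two folds.
folds <- list(
  list(
    true = factor(c("a", "a", "b", "b"), levels = lv),
    predicted = factor(c("a", "b", "b", "b"), levels = lv)
  ),
  list(
    true = factor(c("a", "a", "b", "b"), levels = lv),
    predicted = factor(c("a", "a", "b", "a"), levels = lv)
  )
)

# One matrix per fold (analogous to standard_measures_end).
standard_measures_end <- lapply(
  folds,
  function(fold) standard_measures(fold$true, fold$predicted)
)

# Element-wise mean over folds (analogous to standard_measures_mean).
standard_measures_mean <-
  Reduce("+", standard_measures_end) / length(standard_measures_end)

print(round(standard_measures_mean, 3))
```

A class that is never predicted in a fold yields NaN for precision in this sketch. How the package treats that case is not stated here.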

Methods

Public methods

TEClassifierRegular$configure()
TEClassifierRegular$train()
TEClassifierRegular$predict()
TEClassifierRegular$check_embedding_model()
TEClassifierRegular$check_feature_extractor_object_type()
TEClassifierRegular$requires_compression()
TEClassifierRegular$save()
TEClassifierRegular$load_from_disk()
TEClassifierRegular$clone()

Method `configure()`

Creating a new instance of this class.

Usage

TEClassifierRegular$configure(
  ml_framework = "pytorch",
  name = NULL,
  label = NULL,
  text_embeddings = NULL,
  feature_extractor = NULL,
  target_levels = NULL,
  dense_size = 4,
  dense_layers = 0,
  rec_size = 4,
  rec_layers = 2,
  rec_type = "gru",
  rec_bidirectional = FALSE,
  self_attention_heads = 0,
  intermediate_size = NULL,
  attention_type = "fourier",
  add_pos_embedding = TRUE,
  rec_dropout = 0.1,
  repeat_encoder = 1,
  dense_dropout = 0.4,
  recurrent_dropout = 0.4,
  encoder_dropout = 0.1,
  optimizer = "adam"
)

Arguments

ml_framework: string Framework to use for training and inference. ml_framework="tensorflow" for 'tensorflow' and ml_framework="pytorch" for 'pytorch'.
name: string Name of the new classifier. Please refer to common name conventions. Free text can be used with parameter label.
label: string Label for the new classifier. Here you can use free text.
text_embeddings: An object of class EmbeddedText or LargeDataSetForTextEmbeddings.
feature_extractor: Object of class TEFeatureExtractor which should be used in order to reduce the number of dimensions of the text embeddings. If no feature extractor should be applied set NULL.
target_levels: vector containing the levels (categories or classes) within the target data. Please note that order matters. For ordinal data please ensure that the levels are sorted correctly with later levels indicating a higher category/class. For nominal data the order does not matter.
dense_size: int Number of neurons for each dense layer.
dense_layers: int Number of dense layers.
rec_size: int Number of neurons for each recurrent layer.
rec_layers: int Number of recurrent layers.
rec_type: string Type of the recurrent layers. rec_type="gru" for Gated Recurrent Unit and rec_type="lstm" for Long Short-Term Memory.
rec_bidirectional: bool If TRUE a bidirectional version of the recurrent layers is used.
self_attention_heads: int determining the number of attention heads for a self-attention layer. Only relevant if attention_type="multihead"
intermediate_size: int determining the size of the projection layer within each transformer encoder.
attention_type: string Choose the relevant attention type. Possible values are attention_type="fourier" and attention_type="multihead". Please note that with attention_type="fourier" on Linux, the values for a case may differ depending on the input order.
add_pos_embedding: bool TRUE if positional embedding should be used.
rec_dropout: double ranging from 0 to less than 1, determining the dropout between bidirectional recurrent layers.
repeat_encoder: int determining how many times the encoder should be added to the network.
dense_dropout: double ranging from 0 to less than 1, determining the dropout between dense layers.
recurrent_dropout: double ranging from 0 to less than 1, determining the recurrent dropout for each recurrent layer. Only relevant for keras models.
encoder_dropout: double ranging from 0 to less than 1, determining the dropout for the dense projection within the encoder layers.
optimizer: string "adam" or "rmsprop".
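Because the order of target_levels matters for ordinal data, it can help to see how R orders levels by default. The sketch below uses hypothetical labels and only base R. It shows that the default ordering is alphabetical and how an explicit level vector fixes it.

```r
# Hypothetical ordinal labels.
raw_labels <- c("medium", "low", "high", "medium", "low")

# By default, factor() sorts the levels alphabetically, which does not
# reflect the ordinal structure low < medium < high.
default_levels <- levels(factor(raw_labels))

# Levels sorted from the lowest to the highest category.
target_levels <- c("low", "medium", "high")
targets <- factor(raw_labels, levels = target_levels)

# The integer codes now follow the intended order.
codes <- as.integer(targets)

print(default_levels)
print(codes)
```

The same target_levels vector would then be the natural candidate for the target_levels argument.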

Returns

Returns an object of class TEClassifierRegular which is ready for training.

Method `train()`

Method for training a neural net.

Training includes a routine for early stopping: training stops as soon as loss<0.0001, Accuracy=1.00, and Average Iota=1.00 are all reached. In that case, the history uses the values of the last trained epoch for the remaining epochs.

After training, the model with the best values for Average Iota, Accuracy, and Loss on the validation data set is used as the final model.
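The stopping rule and the selection of the final model can be sketched in base R as follows. All names and numbers are hypothetical. In particular, the priority used for ranking the epochs (Average Iota first, then Accuracy, then Loss) is an assumption, since the text above does not state how the three criteria are combined.

```r
# Stopping rule as described above.
should_stop <- function(loss, accuracy, avg_iota) {
  loss < 0.0001 && accuracy == 1 && avg_iota == 1
}

# After an early stop, the last value is repeated for the remaining epochs.
pad_history <- function(values, epochs) {
  c(values, rep(values[length(values)], epochs - length(values)))
}

# Hypothetical validation history for four epochs.
history <- data.frame(
  epoch = 1:4,
  loss = c(0.60, 0.35, 0.30, 0.32),
  accuracy = c(0.70, 0.85, 0.85, 0.84),
  avg_iota = c(0.55, 0.72, 0.72, 0.70)
)

# Assumed priority: highest Average Iota, then highest accuracy,
# then lowest loss.
ranking <- order(-history$avg_iota, -history$accuracy, history$loss)
best_epoch <- history$epoch[ranking[1]]

print(best_epoch)
print(pad_history(c(0.5, 0.1), epochs = 4))
```

In this example, epochs 2 and 3 tie on Average Iota and Accuracy, so the lower loss decides in favor of epoch 3.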

Usage

TEClassifierRegular$train(
  data_embeddings,
  data_targets,
  data_folds = 5,
  data_val_size = 0.25,
  balance_class_weights = TRUE,
  balance_sequence_length = TRUE,
  use_sc = TRUE,
  sc_method = "dbsmote",
  sc_min_k = 1,
  sc_max_k = 10,
  use_pl = TRUE,
  pl_max_steps = 3,
  pl_max = 1,
  pl_anchor = 1,
  pl_min = 0,
  sustain_track = TRUE,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 15,
  epochs = 40,
  batch_size = 32,
  dir_checkpoint,
  trace = TRUE,
  ml_trace = 1,
  log_dir = NULL,
  log_write_interval = 10,
  n_cores = auto_n_cores()
)

Arguments

data_embeddings: Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
data_targets: factor containing the labels for cases stored in data_embeddings. Factor must be named and has to use the same names used in data_embeddings.
data_folds: int determining the number of cross-fold samples.
data_val_size: double between 0 and 1, indicating the proportion of cases of each class which should be used for the validation sample during the estimation of the model. The remaining cases are part of the training data.
balance_class_weights: bool If TRUE class weights are generated based on the frequencies of the training data with the method 'Inverse Class Frequency'. If FALSE each class has the weight 1.
balance_sequence_length: bool If TRUE sample weights are generated for the length of sequences based on the frequencies of the training data with the method 'Inverse Class Frequency'. If FALSE each sequence length has the weight 1.
use_sc: bool TRUE if the estimation should integrate synthetic cases. FALSE if not.
sc_method: vector containing the method for generating synthetic cases. Possible values are sc_method="adas", sc_method="smote", and sc_method="dbsmote".
sc_min_k: int determining the minimal number of k which is used for creating synthetic units.
sc_max_k: int determining the maximal number of k which is used for creating synthetic units.
use_pl: bool TRUE if the estimation should integrate pseudo-labeling. FALSE if not.
pl_max_steps: int determining the maximum number of steps during pseudo-labeling.
pl_max: double between 0 and 1, setting the maximal level of confidence for considering a case for pseudo-labeling.
pl_anchor: double between 0 and 1 indicating the reference point for sorting the new cases of every label. See notes for more details.
pl_min: double between 0 and 1, setting the minimal level of confidence for considering a case for pseudo-labeling.
sustain_track: bool If TRUE energy consumption is tracked during training via the python library 'codecarbon'.
sustain_iso_code: string ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_region: Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html
sustain_interval: int Interval in seconds for measuring power usage.
epochs: int Number of training epochs.
batch_size: int Size of the batches for training.
dir_checkpoint: string Path to the directory where the checkpoint during training should be saved. If the directory does not exist, it is created.
trace: bool TRUE, if information about the estimation phase should be printed to the console.
ml_trace: int ml_trace=0 does not print any information about the training process from pytorch on the console.
log_dir: string Path to the directory where the log files should be saved. If no logging is desired set this argument to NULL.
log_write_interval: int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_dir is not NULL.
n_cores: int Number of cores which should be used during the calculation of synthetic cases. Only relevant if use_sc=TRUE.
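To illustrate the idea behind balance_class_weights, the sketch below computes inverse class frequency weights in base R. The exact formula used by the package is not given here, so this uses one common formulation (total number of cases divided by the number of classes times the class count) as an assumption. The labels are hypothetical.

```r
# Hypothetical training labels with an imbalanced distribution.
labels <- factor(
  c(rep("a", 6), rep("b", 3), "c"),
  levels = c("a", "b", "c")
)

# Absolute frequency of every class.
counts <- tabulate(labels, nbins = nlevels(labels))
names(counts) <- levels(labels)

# Assumed formulation of inverse class frequency:
# n_cases / (n_classes * class_count). Rare classes get larger weights.
balanced_weights <- sum(counts) / (length(counts) * counts)

# Without balancing, every class has the weight 1.
unbalanced_weights <- rep(1, nlevels(labels))
names(unbalanced_weights) <- levels(labels)

print(round(balanced_weights, 3))
```

With this formulation a perfectly balanced data set yields a weight of 1 for every class, which matches the behavior described for balance_class_weights=FALSE.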

Details

sc_max_k: All values from sc_min_k up to sc_max_k are successively used. If sc_max_k is too high, the value is reduced to a number that allows the calculation of synthetic units.
pl_anchor: With the help of this value, the new cases are sorted. For this aim, the distance from the anchor is calculated and all cases are arranged in ascending order.
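Both details can be sketched with a few lines of base R. The confidence values and the size of the smallest class are hypothetical. The upper limit for k (size of the smallest class minus one) and the inclusive treatment of pl_min and pl_max are assumptions made for illustration only. The sketch also ignores that the sorting is done separately for every label.

```r
# --- Range of k for synthetic cases ---
sc_min_k <- 1
sc_max_k <- 10
n_smallest_class <- 6 # hypothetical number of cases in the smallest class

# Assumed limit: a case can have at most n - 1 neighbors in its class.
usable_max_k <- min(sc_max_k, n_smallest_class - 1)
k_values <- seq(sc_min_k, usable_max_k)

# --- Sorting pseudo-labeled candidates by distance from the anchor ---
pl_min <- 0.5
pl_max <- 1
pl_anchor <- 1

# Hypothetical confidence values of the predicted label for new cases.
confidence <- c(
  case_a = 0.95, case_b = 0.55, case_c = 0.80,
  case_d = 0.99, case_e = 0.30
)

# Keep only cases within the confidence range (bounds assumed inclusive).
eligible <- confidence[confidence >= pl_min & confidence <= pl_max]

# Cases closest to the anchor come first.
distance <- abs(eligible - pl_anchor)
sorted_cases <- names(eligible)[order(distance)]

print(k_values)
print(sorted_cases)
```

With pl_anchor=1 the most confident cases are ranked first. A lower anchor would instead favor cases with a confidence near that value.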

Returns

Function does not return a value. It changes the object into a trained classifier.

Method `predict()`

Method for predicting new data with a trained neural net.

Usage

TEClassifierRegular$predict(newdata, batch_size = 32, ml_trace = 1)

Arguments

newdata: Object of class EmbeddedText or LargeDataSetForTextEmbeddings for which predictions should be made. In addition, this method accepts objects of class array and datasets.arrow_dataset.Dataset. However, these should be used only by developers.
batch_size: int Size of batches.
ml_trace: int ml_trace=0 does not print any information on the process from the machine learning framework.

Returns

Returns a data.frame containing the predictions and the probabilities of the different labels for each case.
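The shape of such a result can be illustrated with a small base R sketch. The probabilities, the case names, and the column name expected_category are hypothetical and chosen for illustration only. The actual column names of the returned data.frame may differ.

```r
# Hypothetical class probabilities for two cases and three labels.
probs <- matrix(
  c(
    0.7, 0.2, 0.1,
    0.1, 0.3, 0.6
  ),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("text_01", "text_02"), c("low", "medium", "high"))
)

# The predicted label is the one with the highest probability.
best_column <- max.col(probs, ties.method = "first")

predictions <- data.frame(
  probs,
  expected_category = factor(
    colnames(probs)[best_column],
    levels = colnames(probs)
  )
)

print(predictions)
```

Keeping the case names as row names makes it straightforward to match predictions back to the original texts.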

Method `check_embedding_model()`

Method for checking if the provided text embeddings are created with the same TextEmbeddingModel as the classifier.

Usage

TEClassifierRegular$check_embedding_model(
  text_embeddings,
  require_compressed = FALSE
)

Arguments

text_embeddings: Object of class EmbeddedText or LargeDataSetForTextEmbeddings.
require_compressed: TRUE if a compressed version of the embeddings is necessary. Compressed embeddings are created by an object of class TEFeatureExtractor.

Returns

TRUE if the underlying TextEmbeddingModel is the same. FALSE if the models differ.

Method `check_feature_extractor_object_type()`

Method for checking an object of class TEFeatureExtractor.

Usage

TEClassifierRegular$check_feature_extractor_object_type(feature_extractor)

Arguments

feature_extractor: Object of class TEFeatureExtractor

Returns

This method does not return anything. It raises an error if

the object is NULL,
the object does not rely on the same machine learning framework as the classifier, or
the object is not trained.

Method `requires_compression()`

Method for checking if provided text embeddings must be compressed via a TEFeatureExtractor before processing.

Usage

TEClassifierRegular$requires_compression(text_embeddings)

Arguments

text_embeddings: Object of class EmbeddedText, LargeDataSetForTextEmbeddings, array or datasets.arrow_dataset.Dataset.

Returns

Returns TRUE if compression is necessary and FALSE if not.

Method `save()`

Method for saving a model.

Usage

TEClassifierRegular$save(dir_path, folder_name)

Arguments

dir_path: string Path of the directory where the model should be saved.
folder_name: string Name of the folder that should be created within the directory.

Returns

Function does not return a value. It saves the model to disk.
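How dir_path and folder_name combine into a storage location can be sketched with base R alone. The directory and folder names below are hypothetical, and the sketch only prepares the location. It does not call the save() method itself.

```r
# Hypothetical parent directory and folder name.
dir_path <- file.path(tempdir(), "models")
folder_name <- "my_classifier"

# The model folder is located inside the parent directory.
save_location <- file.path(dir_path, folder_name)

# Create the folder, including missing parent directories.
dir.create(save_location, recursive = TRUE, showWarnings = FALSE)

print(dir.exists(save_location))
```

Under this reading, the combined path is presumably the one to pass as dir_path when loading the model again with load_from_disk(), although that is an assumption rather than something stated here.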

Method `load_from_disk()`

Loads an object from disk and updates the object to the current version of the package.

Usage

TEClassifierRegular$load_from_disk(dir_path)

Arguments

dir_path: string Path where the object is stored.

Returns

Method does not return anything. It loads an object from disk.

Method `clone()`

The objects of this class are cloneable with this method.

Usage

TEClassifierRegular$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
