Text embedding classifier with a neural net
Source: R/te_classifier_neuralnet_model.R
TextEmbeddingClassifierNeuralNet.Rd
Abstract class for neural nets with 'keras'/'tensorflow' and 'pytorch'.
Value
Objects of this class are used for assigning texts to classes/categories. Creating
and training a classifier requires an object of class EmbeddedText and a factor.
The object of class EmbeddedText contains the numerical text representations
(text embeddings) of the raw texts, generated by an object of class
TextEmbeddingModel. The factor contains the class/category of every text.
Missing values (unlabeled cases) are supported. For predictions, an object of
class EmbeddedText must be used that was created with the same text embedding
model as the one used for training.
Public fields
model
('tensorflow_model()')
Field for storing the tensorflow model after loading.

model_config
('list()')
List for storing information about the configuration of the model. This information is used to predict new data.

model_config$n_rec: Number of recurrent layers.

model_config$n_hidden: Number of dense layers.

model_config$target_levels: Levels of the target variable. Do not change this manually.

model_config$input_variables: Order and name of the input variables. Do not change this manually.

model_config$init_config: List storing all parameters passed to the method new().
last_training
('list()')
List for storing the history and the results of the last training. This information is overwritten when a new training is started.

last_training$learning_time: Duration of the training process.

last_training$history: History of the last training.

last_training$data: Object of class table storing the initial frequencies of the passed data.

last_training$data_pb: Matrix storing the number of additional cases (test and training) added during balanced pseudo-labeling. The rows refer to folds and final training. The columns refer to the steps during pseudo-labeling.

last_training$data_bsc_test: Matrix storing the number of cases for each category used for testing during the phase of balanced synthetic cases. Please note that the frequencies include original and synthetic cases. In case the number of original and synthetic cases exceeds the limit for the majority classes, the frequency represents the number of cases created by cluster analysis.

last_training$date: Time when the last training finished.

last_training$config: List storing which kind of estimation was requested during the last training.

last_training$config$use_bsc: TRUE if balanced synthetic cases were requested, FALSE if not.

last_training$config$use_baseline: TRUE if a baseline estimation was requested, FALSE if not.

last_training$config$use_bpl: TRUE if balanced pseudo-labeling was requested, FALSE if not.
reliability
('list()')
List for storing central reliability measures of the last training.

reliability$test_metric: Array containing the reliability measures for the validation data for every fold, method, and step (in case of pseudo-labeling).

reliability$test_metric_mean: Array containing the reliability measures for the validation data for every method and step (in case of pseudo-labeling). The values represent the mean values over all folds.

reliability$raw_iota_objects: List containing all iota objects generated with the package iotarelr for every fold at the start and the end of the last training.

reliability$raw_iota_objects$iota_objects_start: List of objects with class iotarelr_iota2 containing the estimated iota reliability of the second generation for the baseline model for every fold. If the estimation of the baseline model is not requested, the list is set to NULL.

reliability$raw_iota_objects$iota_objects_end: List of objects with class iotarelr_iota2 containing the estimated iota reliability of the second generation for the final model for every fold. Depending on the requested training method, these values refer to the baseline model, a model trained on the basis of balanced synthetic cases, balanced pseudo-labeling, or a combination of balanced synthetic cases and pseudo-labeling.

reliability$raw_iota_objects$iota_objects_start_free: List of objects with class iotarelr_iota2 containing the estimated iota reliability of the second generation for the baseline model for every fold. If the estimation of the baseline model is not requested, the list is set to NULL. Please note that the model is estimated without forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.

reliability$raw_iota_objects$iota_objects_end_free: List of objects with class iotarelr_iota2 containing the estimated iota reliability of the second generation for the final model for every fold. Depending on the requested training method, these values refer to the baseline model, a model trained on the basis of balanced synthetic cases, balanced pseudo-labeling, or a combination of balanced synthetic cases and pseudo-labeling. Please note that the model is estimated without forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.

reliability$iota_object_start: Object of class iotarelr_iota2 as a mean of the individual objects for every fold. If the estimation of the baseline model is not requested, the object is set to NULL.

reliability$iota_object_start_free: Object of class iotarelr_iota2 as a mean of the individual objects for every fold. If the estimation of the baseline model is not requested, the object is set to NULL. Please note that the model is estimated without forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.

reliability$iota_object_end: Object of class iotarelr_iota2 as a mean of the individual objects for every fold. Depending on the requested training method, this object refers to the baseline model, a model trained on the basis of balanced synthetic cases, balanced pseudo-labeling, or a combination of balanced synthetic cases and pseudo-labeling.

reliability$iota_object_end_free: Object of class iotarelr_iota2 as a mean of the individual objects for every fold. Depending on the requested training method, this object refers to the baseline model, a model trained on the basis of balanced synthetic cases, balanced pseudo-labeling, or a combination of balanced synthetic cases and pseudo-labeling. Please note that the model is estimated without forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.

reliability$standard_measures_end: Object of class list containing the final measures for precision, recall, and F1 for every fold. Depending on the requested training method, these values refer to the baseline model, a model trained on the basis of balanced synthetic cases, balanced pseudo-labeling, or a combination of balanced synthetic cases and pseudo-labeling.

reliability$standard_measures_mean: Matrix containing the mean measures for precision, recall, and F1 at the end of every fold.
Methods
Public methods
Method new()
Creating a new instance of this class.
Usage
TextEmbeddingClassifierNeuralNet$new(
ml_framework = aifeducation_config$get_framework(),
name = NULL,
label = NULL,
text_embeddings = NULL,
targets = NULL,
hidden = c(128),
rec = c(128),
self_attention_heads = 0,
intermediate_size = NULL,
attention_type = "fourier",
add_pos_embedding = TRUE,
rec_dropout = 0.1,
repeat_encoder = 1,
dense_dropout = 0.4,
recurrent_dropout = 0.4,
encoder_dropout = 0.1,
optimizer = "adam"
)
Arguments
ml_framework
string Framework to use for training and inference. ml_framework="tensorflow" for 'tensorflow' and ml_framework="pytorch" for 'pytorch'.

name
Character Name of the new classifier. Please refer to common naming conventions. Free text can be used with parameter label.

label
Character Label for the new classifier. Here you can use free text.

text_embeddings
An object of class EmbeddedText.

targets
factor containing the target values of the classifier.

hidden
vector containing the number of neurons for each dense layer. The length of the vector determines the number of dense layers. If you want no dense layers, set this parameter to NULL.

rec
vector containing the number of neurons for each recurrent layer. The length of the vector determines the number of recurrent layers. If you want no recurrent layers, set this parameter to NULL.

self_attention_heads
integer determining the number of attention heads for a self-attention layer. Only relevant if attention_type="multihead".

intermediate_size
int determining the size of the projection layer within each transformer encoder.

attention_type
string Choose the relevant attention type. Possible values are "fourier" and "multihead".

add_pos_embedding
bool TRUE if positional embeddings should be used.

rec_dropout
double ranging between 0 and less than 1, determining the dropout between bidirectional GRU layers.

repeat_encoder
int determining how many times the encoder should be added to the network.

dense_dropout
double ranging between 0 and less than 1, determining the dropout between dense layers.

recurrent_dropout
double ranging between 0 and less than 1, determining the recurrent dropout for each recurrent layer. Only relevant for keras models.

encoder_dropout
double ranging between 0 and less than 1, determining the dropout for the dense projection within the encoder layers.

optimizer
Object of class keras.optimizers.
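Taken together, a minimal instantiation might look like the following sketch. The objects `embeddings` (an EmbeddedText created beforehand with a TextEmbeddingModel) and `labels` (a named factor) are placeholders, and the hyperparameter values are illustrative only.

```r
# Sketch: creating a classifier. 'embeddings' and 'labels' are
# hypothetical objects that must be prepared beforehand.
classifier <- TextEmbeddingClassifierNeuralNet$new(
  ml_framework = "tensorflow",       # or "pytorch"
  name = "review_sentiment_2024",    # illustrative machine-readable name
  label = "Sentiment of product reviews",
  text_embeddings = embeddings,
  targets = labels,
  hidden = c(128),                   # one dense layer with 128 neurons
  rec = c(128),                      # one recurrent layer with 128 neurons
  attention_type = "fourier"
)
```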
Method train()
Method for training a neural net.
Usage
TextEmbeddingClassifierNeuralNet$train(
data_embeddings,
data_targets,
data_n_test_samples = 5,
balance_class_weights = TRUE,
use_baseline = TRUE,
bsl_val_size = 0.25,
use_bsc = TRUE,
bsc_methods = c("dbsmote"),
bsc_max_k = 10,
bsc_val_size = 0.25,
bsc_add_all = FALSE,
use_bpl = TRUE,
bpl_max_steps = 3,
bpl_epochs_per_step = 1,
bpl_dynamic_inc = FALSE,
bpl_balance = FALSE,
bpl_max = 1,
bpl_anchor = 1,
bpl_min = 0,
bpl_weight_inc = 0.02,
bpl_weight_start = 0,
bpl_model_reset = FALSE,
sustain_track = TRUE,
sustain_iso_code = NULL,
sustain_region = NULL,
sustain_interval = 15,
epochs = 40,
batch_size = 32,
dir_checkpoint,
trace = TRUE,
keras_trace = 2,
pytorch_trace = 2,
n_cores = 2
)
Arguments
data_embeddings
Object of class EmbeddedText.

data_targets
Factor containing the labels for the cases stored in data_embeddings. The factor must be named and has to use the same names as used in data_embeddings.

data_n_test_samples
int determining the number of cross-fold samples.

balance_class_weights
bool If TRUE, class weights are generated based on the frequencies of the training data with the method 'Inverse Class Frequency'. If FALSE, each class has the weight 1.

use_baseline
bool TRUE if the calculation of a baseline model is requested. This option is only relevant for use_bsc=TRUE or use_bpl=TRUE. If both are FALSE, a baseline model is calculated.

bsl_val_size
double between 0 and 1, indicating the proportion of cases of each class which should be used for the validation sample during the estimation of the baseline model. The remaining cases are part of the training data.

use_bsc
bool TRUE if the estimation should integrate balanced synthetic cases, FALSE if not.

bsc_methods
vector containing the methods for generating synthetic cases via 'smotefamily'. Multiple methods can be passed. Currently bsc_methods=c("adas"), bsc_methods=c("smote"), and bsc_methods=c("dbsmote") are possible.

bsc_max_k
int determining the maximal number of k which is used for creating synthetic units.

bsc_val_size
double between 0 and 1, indicating the proportion of cases of each class which should be used for the validation sample during the estimation with synthetic cases.

bsc_add_all
bool If FALSE, only the synthetic cases necessary to fill the gap between a class and the majority class are added to the data. If TRUE, all generated synthetic cases are added to the data.

use_bpl
bool TRUE if the estimation should integrate balanced pseudo-labeling, FALSE if not.

bpl_max_steps
int determining the maximum number of steps during pseudo-labeling.

bpl_epochs_per_step
int Number of training epochs within every step.

bpl_dynamic_inc
bool If TRUE, only a specific percentage of cases is included during each step. The percentage is determined by \(step/bpl_max_steps\). If FALSE, all cases are used.

bpl_balance
bool If TRUE, the same number of cases for every category/class of the pseudo-labeled data is used in training. That is, the number of cases is determined by the minority class/category.

bpl_max
double between 0 and 1, setting the maximal level of confidence for considering a case for pseudo-labeling.

bpl_anchor
double between 0 and 1, indicating the reference point for sorting the new cases of every label. See notes for more details.

bpl_min
double between 0 and 1, setting the minimal level of confidence for considering a case for pseudo-labeling.

bpl_weight_inc
double determining how much the sample weights of the cases with pseudo-labels should be increased in every step.

bpl_weight_start
double Starting value for the weights of the unlabeled cases.

bpl_model_reset
bool If TRUE, the model is re-initialized at every step.

sustain_track
bool If TRUE, energy consumption is tracked during training via the python library codecarbon.

sustain_iso_code
string ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.

sustain_region
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html

sustain_interval
integer Interval in seconds for measuring power usage.

epochs
int Number of training epochs.

batch_size
int Size of the batches.

dir_checkpoint
string Path to the directory where the checkpoint during training should be saved. If the directory does not exist, it is created.

trace
bool TRUE if information about the estimation phase should be printed to the console.

keras_trace
int keras_trace=0 does not print any information about the training process from keras on the console. keras_trace=1 prints a progress bar. keras_trace=2 prints one line of information for every epoch.

pytorch_trace
int pytorch_trace=0 does not print any information about the training process from pytorch on the console. pytorch_trace=1 prints a progress bar. pytorch_trace=2 prints one line of information for every epoch.

n_cores
int Number of cores used for creating synthetic units.
Details
bsc_max_k: All values from 2 up to bsc_max_k are successively used. If the value of bsc_max_k is too high, it is reduced to a number that allows the calculation of synthetic units.

bpl_anchor: With the help of this value, the new cases are sorted. For this aim, the distance from the anchor is calculated and all cases are arranged in ascending order.
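A training call combining the options above might look like the following sketch. The objects `classifier`, `embeddings`, and `labels` are placeholders (the classifier created with new(), the EmbeddedText used for training, and a named factor in which unlabeled cases are NA); all parameter values are illustrative, not recommendations.

```r
# Sketch: training with balanced synthetic cases and balanced
# pseudo-labeling. 'embeddings' and 'labels' are hypothetical objects.
classifier$train(
  data_embeddings = embeddings,
  data_targets = labels,            # named factor; NA marks unlabeled cases
  data_n_test_samples = 5,          # 5 cross-fold samples
  use_baseline = TRUE,
  use_bsc = TRUE,
  bsc_methods = c("dbsmote"),
  use_bpl = TRUE,
  bpl_max_steps = 3,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",         # example Alpha-3 code (Germany)
  epochs = 40,
  batch_size = 32,
  dir_checkpoint = "checkpoints"    # created if it does not exist
)
```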
Method predict()
Method for predicting new data with a trained neural net.
Arguments
newdata
Object of class EmbeddedText or data.frame for which predictions should be made.

batch_size
int Size of the batches.

verbose
int verbose=0 does not print any information about the prediction process from keras on the console. verbose=1 prints a progress bar. verbose=2 prints one line of information for every epoch.
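A prediction call might look like the following sketch. The object `new_embeddings` is a placeholder for an EmbeddedText that must have been created with the same TextEmbeddingModel as the one used for training.

```r
# Sketch: predicting categories for new cases. 'new_embeddings' is a
# hypothetical EmbeddedText object from the same TextEmbeddingModel
# that was used for training the classifier.
predictions <- classifier$predict(
  newdata = new_embeddings,
  batch_size = 32,
  verbose = 0    # silent prediction
)
```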
Method check_embedding_model()
Method for checking if the provided text embeddings were created with the same TextEmbeddingModel as the classifier.
Arguments
text_embeddings
Object of class EmbeddedText.
Returns
TRUE if the underlying TextEmbeddingModel is the same.
FALSE if the models differ.
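This check can guard a prediction call, as in the following sketch ('new_embeddings' is again a hypothetical EmbeddedText object):

```r
# Sketch: verifying embedding compatibility before predicting.
if (classifier$check_embedding_model(text_embeddings = new_embeddings)) {
  predictions <- classifier$predict(newdata = new_embeddings)
} else {
  warning("Embeddings were created with a different TextEmbeddingModel.")
}
```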
Method set_publication_info()
Method for setting the publication information of the classifier.
Method get_publication_info()
Method for requesting the bibliographic information of the classifier.
Method set_documentation_license()
Method for setting the license of the classifier's documentation.
Method get_documentation_license()
Method for getting the license of the classifier's documentation.
Method set_model_description()
Method for setting a description of the classifier.
Usage
TextEmbeddingClassifierNeuralNet$set_model_description(
eng = NULL,
native = NULL,
abstract_eng = NULL,
abstract_native = NULL,
keywords_eng = NULL,
keywords_native = NULL
)
Arguments
eng
string A text describing the training of the classifier, its theoretical and empirical background, and the different output labels in English.

native
string A text describing the training of the classifier, its theoretical and empirical background, and the different output labels in the native language of the classifier.

abstract_eng
string A text providing a summary of the description in English.

abstract_native
string A text providing a summary of the description in the native language of the classifier.

keywords_eng
vector of keywords in English.

keywords_native
vector of keywords in the native language of the classifier.
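A call documenting an English-only classifier might look like this sketch (all texts are illustrative placeholders):

```r
# Sketch: setting an English description for the classifier.
# The description texts are illustrative placeholders.
classifier$set_model_description(
  eng = "Trained on manually coded product reviews to distinguish
         positive from negative sentiment.",
  abstract_eng = "Predicts review sentiment from text embeddings.",
  keywords_eng = c("sentiment", "product reviews", "text classification")
)
```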
Method save_model()
Method for saving a model in 'Keras v3 format', 'tensorflow' SavedModel format, or HDF5 format.
Arguments
dir_path
string Path of the directory where the model should be saved.

save_format
Format for saving the model. For 'tensorflow'/'keras' models: "keras" for 'Keras v3 format', "tf" for SavedModel, or "h5" for HDF5. For 'pytorch' models: "safetensors" for 'safetensors' or "pt" for 'pytorch' via pickle. Use "default" for the standard format, which is "keras" for 'tensorflow'/'keras' models and "safetensors" for 'pytorch' models.
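Saving in the framework's standard format might look like this sketch (the directory path is illustrative):

```r
# Sketch: saving the classifier in the standard format for its framework
# ("keras" for tensorflow/keras models, "safetensors" for pytorch models).
classifier$save_model(
  dir_path = "saved_classifier",   # illustrative path; created if missing
  save_format = "default"
)
```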
Method load_model()
Method for importing a model from 'Keras v3 format', 'tensorflow' SavedModel format, or HDF5 format.
Method get_package_versions()
Method for requesting a summary of the R and python packages' versions used for creating the classifier.
Method get_sustainability_data()
Method for requesting a summary of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
Method get_ml_framework()
Method for requesting the machine learning framework used for the classifier.