Text embedding model

This R6 class stores a text embedding model which can be used to tokenize, encode, decode, and embed raw texts. The object provides a unified interface for different text processing methods.

Value

Objects of class TextEmbeddingModel transform raw texts into numerical representations which can be used for downstream tasks. To this end, objects of this class can tokenize raw texts, encode tokens into sequences of integers, and decode sequences of integers back into tokens.

Public fields

last_training

('list()')
List for storing the history and the results of the last training. This information will be overwritten if a new training is started.

tokenizer_statistics

('matrix()')
Matrix containing the tokenizer statistics for the creation of the tokenizer and all training runs according to Kaya & Tantuğ (2024).

Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. https://doi.org/10.1016/j.iswa.2024.200335

Methods

Public methods

TextEmbeddingModel$configure()
TextEmbeddingModel$load_from_disk()
TextEmbeddingModel$load()
TextEmbeddingModel$save()
TextEmbeddingModel$encode()
TextEmbeddingModel$decode()
TextEmbeddingModel$get_special_tokens()
TextEmbeddingModel$embed()
TextEmbeddingModel$embed_large()
TextEmbeddingModel$fill_mask()
TextEmbeddingModel$set_publication_info()
TextEmbeddingModel$get_publication_info()
TextEmbeddingModel$set_model_license()
TextEmbeddingModel$get_model_license()
TextEmbeddingModel$set_documentation_license()
TextEmbeddingModel$get_documentation_license()
TextEmbeddingModel$set_model_description()
TextEmbeddingModel$get_model_description()
TextEmbeddingModel$get_model_info()
TextEmbeddingModel$get_package_versions()
TextEmbeddingModel$get_basic_components()
TextEmbeddingModel$get_transformer_components()
TextEmbeddingModel$get_sustainability_data()
TextEmbeddingModel$get_ml_framework()
TextEmbeddingModel$count_parameter()
TextEmbeddingModel$is_configured()
TextEmbeddingModel$get_private()
TextEmbeddingModel$get_all_fields()
TextEmbeddingModel$clone()

Method `configure()`

Method for creating a new text embedding model

Usage

TextEmbeddingModel$configure(
  model_name = NULL,
  model_label = NULL,
  model_language = NULL,
  method = NULL,
  ml_framework = "pytorch",
  max_length = 0,
  chunks = 2,
  overlap = 0,
  emb_layer_min = "middle",
  emb_layer_max = "2_3_layer",
  emb_pool_type = "average",
  model_dir = NULL,
  trace = FALSE
)

Arguments

model_name: string containing the name of the new model.
model_label: string containing the label/title of the new model.
model_language: string containing the language which the model represents (e.g., English).
method: string determining the kind of embedding model. Currently the following models are supported: method="bert" for Bidirectional Encoder Representations from Transformers (BERT), method="roberta" for A Robustly Optimized BERT Pretraining Approach (RoBERTa), method="longformer" for Long-Document Transformer, method="funnel" for Funnel-Transformer, method="deberta_v2" for Decoding-enhanced BERT with Disentangled Attention (DeBERTa V2), method="glove" for GlobalVector Clusters, and method="lda" for topic modeling. See details for more information.
ml_framework: string Framework to use for the model. ml_framework="tensorflow" for 'tensorflow' and ml_framework="pytorch" for 'pytorch'. Only relevant for transformer models. To request a bag-of-words model, set ml_framework=NULL.
max_length: int determining the maximum length of token sequences used in transformer models. Not relevant for the other methods.
chunks: int Maximum number of chunks. Must be at least 2.
overlap: int determining the number of tokens which should be added at the beginning of the next chunk. Only relevant for transformer models.
emb_layer_min: int or string determining the first layer to be included in the creation of embeddings. An integer corresponds to the layer number. The first layer has the number 1. Instead of an integer the following strings are possible: "start" for the first layer, "middle" for the middle layer, "2_3_layer" for the layer located two-thirds of the way through the model, and "last" for the last layer.
emb_layer_max: int or string determining the last layer to be included in the creation of embeddings. An integer corresponds to the layer number. The first layer has the number 1. Instead of an integer the following strings are possible: "start" for the first layer, "middle" for the middle layer, "2_3_layer" for the layer located two-thirds of the way through the model, and "last" for the last layer.
emb_pool_type: string determining the method for pooling the token embeddings within each layer. If "cls", only the embedding of the CLS token is used. If "average", the token embeddings of all tokens are averaged (excluding padding tokens). "cls" is not supported for method="funnel".
model_dir: string path to the directory where the pretrained transformer model (e.g., BERT) is stored.
trace: bool TRUE prints information about the progress. FALSE does not.
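
The interplay of max_length, chunks, and overlap can be illustrated with a small base R sketch. It is not the package's implementation: the helper chunk_tokens() is hypothetical, it ignores special tokens, and it assumes that each chunk after the first starts with the last overlap tokens of the previous chunk and that tokens beyond the last permitted chunk are dropped.

```r
# Hypothetical helper illustrating chunking with overlap (assumed semantics).
chunk_tokens <- function(tokens, max_length, chunks, overlap) {
  stopifnot(max_length > overlap, chunks >= 1)
  step <- max_length - overlap # new tokens contributed by each later chunk
  out <- list()
  start <- 1
  while (start <= length(tokens) && length(out) < chunks) {
    end <- min(start + max_length - 1, length(tokens))
    out[[length(out) + 1]] <- tokens[start:end]
    if (end == length(tokens)) break # the whole sequence is covered
    start <- start + step
  }
  out
}

# Ten token ids, chunks of at most 4 tokens, 1 token of overlap
res <- chunk_tokens(1:10, max_length = 4, chunks = 3, overlap = 1)
print(res)
```

With these assumptions, a larger overlap preserves more context across chunk borders, but a text of a given length then needs more chunks to be covered completely.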

Details

In the case of any transformer (e.g., method="bert", method="roberta", and method="longformer"), a pretrained transformer model must be supplied via model_dir.
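
For transformer models, the arguments emb_layer_min, emb_layer_max, and emb_pool_type decide which hidden layers are used and how token embeddings are pooled within a layer. The following base R sketch shows one possible reading. Both helpers are hypothetical, and the rounding used for "middle" and "2_3_layer" is an assumption; the package may resolve these keywords differently. How the selected layers are combined afterwards is not covered here.

```r
# Hypothetical helper: translate a layer keyword or number into a layer index.
# The ceiling() rounding is an assumption, not taken from the package.
resolve_layer <- function(x, n_layers) {
  if (is.numeric(x)) {
    return(as.integer(x))
  }
  switch(x,
    start = 1L,
    middle = as.integer(ceiling(n_layers / 2)),
    "2_3_layer" = as.integer(ceiling(2 * n_layers / 3)),
    last = as.integer(n_layers),
    stop("Unknown layer keyword: ", x)
  )
}

# Hypothetical helper: average pooling within one layer.
# token_embeddings: matrix with one row per token and one column per feature.
# attention_mask: 1 for real tokens, 0 for padding tokens.
average_pool <- function(token_embeddings, attention_mask) {
  keep <- attention_mask == 1
  colMeans(token_embeddings[keep, , drop = FALSE])
}

# Three tokens with two features each; the third token is padding
emb <- matrix(c(1, 3, 100, 2, 4, 100), nrow = 3)
pooled <- average_pool(emb, attention_mask = c(1, 1, 0))
layers <- c(
  resolve_layer("middle", n_layers = 12),
  resolve_layer("2_3_layer", n_layers = 12)
)
```

Excluding padding rows before averaging matters because padding embeddings would otherwise pull the mean toward values that carry no information about the text.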

Returns

Returns an object of class TextEmbeddingModel.
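
A call to configure() for a BERT model might look like the sketch below. The wrapper configure_bert() and all argument values are illustrative. It assumes that an empty object is first created with TextEmbeddingModel$new() and then configured, and that model_dir points to a directory holding a pretrained BERT model.

```r
# Hypothetical wrapper around the documented configure() signature.
# `model` is assumed to be an unconfigured TextEmbeddingModel object.
configure_bert <- function(model, model_dir) {
  model$configure(
    model_name = "bert_demo", # illustrative name
    model_label = "Demo BERT embedding model", # illustrative label
    model_language = "english",
    method = "bert",
    ml_framework = "pytorch",
    max_length = 512,
    chunks = 4,
    overlap = 30,
    emb_layer_min = "middle",
    emb_layer_max = "2_3_layer",
    emb_pool_type = "average",
    model_dir = model_dir,
    trace = FALSE
  )
  # An object can only be used if is_configured() returns TRUE
  if (!model$is_configured()) {
    stop("Configuration did not complete.")
  }
  invisible(model)
}

# Usage (assumes the package is installed and the directory exists):
# model <- aifeducation::TextEmbeddingModel$new()
# configure_bert(model, model_dir = "path/to/pretrained_bert")
```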

Method `load_from_disk()`

Loads an object from disk and updates the object to the current version of the package.

Usage

TextEmbeddingModel$load_from_disk(dir_path)

Arguments

dir_path: Path where the object is stored.

Returns

Method does not return anything. It loads an object from disk.

Method `load()`

Method for loading a transformers model into R.

Usage

TextEmbeddingModel$load(dir_path)

Arguments

dir_path: string containing the path to the relevant model directory.

Returns

Function does not return a value. It is used for loading a saved transformer model into the R interface.

Method `save()`

Method for saving a transformer model on disk. Relevant only for transformer models.

Usage

TextEmbeddingModel$save(dir_path, folder_name)

Arguments

dir_path: string containing the path to the relevant model directory.
folder_name: string Name for the folder created within the directory. This folder contains all model files.

Returns

Function does not return a value. It is used for saving a transformer model to disk.

Method `encode()`

Method for encoding words of raw texts into integers.

Usage

TextEmbeddingModel$encode(
  raw_text,
  token_encodings_only = FALSE,
  to_int = TRUE,
  trace = FALSE
)

Arguments

raw_text: vector containing the raw texts.
token_encodings_only: bool If TRUE, only the token encodings are returned. If FALSE, the complete encoding is returned which is important for some transformer models.
to_int: bool If TRUE the integer ids of the tokens are returned. If FALSE the tokens are returned. Argument only applies for transformer models and if token_encodings_only=TRUE.
trace: bool If TRUE, information of the progress is printed. FALSE if not requested.

Returns

list containing the integer or token sequences of the raw texts with special tokens.

Method `decode()`

Method for decoding a sequence of integers into tokens

Usage

TextEmbeddingModel$decode(int_seqence, to_token = FALSE)

Arguments

int_seqence: list containing the integer sequences which should be transformed to tokens or plain text.
to_token: bool If FALSE plain text is returned. If TRUE a sequence of tokens is returned. Argument only relevant if the model is based on a transformer.

Returns

list of token sequences (to_token=TRUE) or plain texts (to_token=FALSE).
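
The relationship between encode() and decode() can be sketched as a round trip. The helper round_trip() is hypothetical and only uses the arguments documented above; the argument name int_seqence is spelled as it appears in the method signature. The sketch assumes that the list returned by encode() with token_encodings_only=TRUE can be passed to decode() unchanged.

```r
# Hypothetical helper: encode texts to integer ids and decode them again.
# `model` is assumed to be a configured TextEmbeddingModel object.
round_trip <- function(model, texts) {
  ids <- model$encode(
    raw_text = texts,
    token_encodings_only = TRUE, # only the token encodings
    to_int = TRUE, # integer ids instead of token strings
    trace = FALSE
  )
  list(
    ids = ids,
    tokens = model$decode(int_seqence = ids, to_token = TRUE),
    text = model$decode(int_seqence = ids, to_token = FALSE)
  )
}

# Usage (assumes `model` is already configured):
# res <- round_trip(model, c("A first text.", "A second text."))
# res$tokens[[1]]
```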

Method `get_special_tokens()`

Method for receiving the special tokens of the model

Usage

TextEmbeddingModel$get_special_tokens()

Returns

Returns a matrix containing the special tokens in the rows and their type, token, and id in the columns.

Method `embed()`

Method for creating text embeddings from raw texts. Use this method only when a small number of texts should be transformed into text embeddings; for a large number of texts, use the method embed_large. If you use a GPU with 'tensorflow' and run out of memory, reduce the batch size or restart R and switch to CPU-only mode via set_config_cpu_only. This is generally not relevant for 'pytorch'.

Usage

TextEmbeddingModel$embed(
  raw_text = NULL,
  doc_id = NULL,
  batch_size = 8,
  trace = FALSE,
  return_large_dataset = FALSE
)

Arguments

raw_text: vector containing the raw texts.
doc_id: vector containing the corresponding IDs for every text.
batch_size: int determining the maximal size of every batch.
trace: bool TRUE if information about the progress should be printed on the console.
return_large_dataset: bool If TRUE the returned object is of class LargeDataSetForTextEmbeddings. If FALSE it is of class EmbeddedText.

Returns

Method returns an object of class EmbeddedText or LargeDataSetForTextEmbeddings. This object contains the embeddings as a data.frame and information about the model creating the embeddings.
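
A small wrapper shows how embed() might be called for a handful of texts. The helper embed_texts() is hypothetical. It assumes that the names of the input vector, if present, are suitable document IDs, and otherwise generates IDs of the form doc_1, doc_2, and so on.

```r
# Hypothetical helper around the documented embed() signature.
# `model` is assumed to be a configured TextEmbeddingModel object.
embed_texts <- function(model, texts, batch_size = 8) {
  if (!model$is_configured()) {
    stop("The model must be configured before embedding texts.")
  }
  # Use vector names as document IDs, or generate IDs if there are none
  doc_ids <- names(texts)
  if (is.null(doc_ids)) {
    doc_ids <- paste0("doc_", seq_along(texts))
  }
  model$embed(
    raw_text = unname(texts),
    doc_id = doc_ids,
    batch_size = batch_size,
    trace = FALSE,
    return_large_dataset = FALSE # request an object of class EmbeddedText
  )
}

# Usage (assumes `model` is already configured):
# embeddings <- embed_texts(model, c(intro = "A first text.", body = "A second text."))
```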

Method `embed_large()`

Method for creating text embeddings from raw texts.

Usage

TextEmbeddingModel$embed_large(
  large_datas_set,
  batch_size = 32,
  trace = FALSE,
  log_file = NULL,
  log_write_interval = 2
)

Arguments

large_datas_set: Object of class LargeDataSetForText containing the raw texts.
batch_size: int determining the maximal size of every batch.
trace: bool TRUE, if information about the progression should be printed on console.
log_file: string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.
log_write_interval: int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.

Returns

Method returns an object of class LargeDataSetForTextEmbeddings.

Method `fill_mask()`

Method for predicting the tokens behind mask tokens.

Usage

TextEmbeddingModel$fill_mask(text, n_solutions = 5)

Arguments

text: string Text containing mask tokens.
n_solutions: int Number of estimated tokens for every mask.

Returns

Returns a list containing a data.frame for every mask. The data.frame contains the solutions in the rows and reports the score, token id, and token string in the columns.
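
The returned list can be reduced to the best-scoring solution per mask, as in the sketch below. The helper top_solutions() is hypothetical, and the column name "score" is an assumption because the exact column names of the data.frame are not stated here; pass a different score_col if the names differ.

```r
# Hypothetical helper: keep only the highest-scoring solution for every mask.
# `model` is assumed to be a configured transformer-based TextEmbeddingModel.
top_solutions <- function(model, text, n_solutions = 5, score_col = "score") {
  solutions <- model$fill_mask(text = text, n_solutions = n_solutions)
  lapply(solutions, function(df) {
    # score_col is an assumed column name; check names(df) on real output
    df[which.max(df[[score_col]]), , drop = FALSE]
  })
}

# Usage (assumes `model` is configured; the mask token depends on the model
# and can be looked up with model$get_special_tokens()):
# top_solutions(model, "The capital of France is [MASK].")
```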

Method `set_publication_info()`

Method for setting the bibliographic information of the model.

Usage

TextEmbeddingModel$set_publication_info(type, authors, citation, url = NULL)

Arguments

type: string Type of information which should be changed/added. developer and modifier are possible.
authors: List of people.
citation: string Citation in free text.
url: string Corresponding URL if applicable.

Returns

Function does not return a value. It is used to set the private members for publication information of the model.

Method `get_publication_info()`

Method for getting the bibliographic information of the model.

Usage

TextEmbeddingModel$get_publication_info()

Returns

list of bibliographic information.

Method `set_model_license()`

Method for setting the license of the model

Usage

TextEmbeddingModel$set_model_license(license = "CC BY")

Arguments

license: string containing the abbreviation of the license or the license text.

Returns

Function does not return a value. It is used for setting the private member for the software license of the model.

Method `get_model_license()`

Method for requesting the license of the model

Usage

TextEmbeddingModel$get_model_license()

Returns

string License of the model

Method `set_documentation_license()`

Method for setting the license of the model's documentation.

Usage

TextEmbeddingModel$set_documentation_license(license = "CC BY")

Arguments

license: string containing the abbreviation of the license or the license text.

Returns

Function does not return a value. It is used to set the private member for the documentation license of the model.

Method `get_documentation_license()`

Method for getting the license of the model's documentation.

Usage

TextEmbeddingModel$get_documentation_license()

Returns

string License of the model's documentation.

Method `set_model_description()`

Method for setting a description of the model

Usage

TextEmbeddingModel$set_model_description(
  eng = NULL,
  native = NULL,
  abstract_eng = NULL,
  abstract_native = NULL,
  keywords_eng = NULL,
  keywords_native = NULL
)

Arguments

eng: string A text describing the training of the classifier, its theoretical and empirical background, and the different output labels in English.
native: string A text describing the training of the classifier, its theoretical and empirical background, and the different output labels in the native language of the model.
abstract_eng: string A text providing a summary of the description in English.
abstract_native: string A text providing a summary of the description in the native language of the classifier.
keywords_eng: vector of keywords in English.
keywords_native: vector of keywords in the native language of the model.

Returns

Function does not return a value. It is used to set the private members for the description of the model.

Method `get_model_description()`

Method for requesting the model description.

Usage

TextEmbeddingModel$get_model_description()

Returns

list with the description of the model in English and the native language.
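
The setters and getters for publication information, licenses, and the model description are typically used together when a model is prepared for sharing. The sketch below combines them in a hypothetical helper document_model(). All values are placeholders, and passing a person object created with utils::person() as authors is an assumption about what "List of people" accepts.

```r
# Hypothetical helper: attach documentation metadata to a model.
# `model` is assumed to be a configured TextEmbeddingModel object.
document_model <- function(model, authors, citation, description, keywords,
                           license = "CC BY") {
  model$set_publication_info(
    type = "developer",
    authors = authors,
    citation = citation,
    url = NULL
  )
  model$set_model_license(license)
  model$set_documentation_license(license)
  model$set_model_description(
    eng = description,
    keywords_eng = keywords
  )
  # Read the stored information back for inspection
  list(
    publication = model$get_publication_info(),
    license = model$get_model_license(),
    description = model$get_model_description()
  )
}

# Usage (placeholder values; assumes `model` is already configured):
# document_model(
#   model,
#   authors = utils::person(given = "Jane", family = "Doe"),
#   citation = "Doe, J. (2024). Demo embedding model.",
#   description = "Short description of the model.",
#   keywords = c("text embedding", "demo")
# )
```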

Method `get_model_info()`

Method for requesting the model information

Usage

TextEmbeddingModel$get_model_info()

Returns

list of all relevant model information

Method `get_package_versions()`

Method for requesting a summary of the R and python packages' versions used for creating the model.

Usage

TextEmbeddingModel$get_package_versions()

Returns

Returns a list containing the versions of the relevant R and python packages.

Method `get_basic_components()`

Method for requesting the part of interface's configuration that is necessary for all models.

Usage

TextEmbeddingModel$get_basic_components()

Returns

Returns a list.

Method `get_transformer_components()`

Method for requesting the part of interface's configuration that is necessary for transformer models.

Usage

TextEmbeddingModel$get_transformer_components()

Returns

Returns a list.

Method `get_sustainability_data()`

Method for requesting a log of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.

Usage

TextEmbeddingModel$get_sustainability_data()

Returns

Returns a matrix containing the tracked energy consumption, CO2 equivalents in kg, information on the tracker used, and technical information on the training infrastructure for every training run.

Method `get_ml_framework()`

Method for requesting the machine learning framework used for the model.

Usage

TextEmbeddingModel$get_ml_framework()

Returns

Returns a string describing the machine learning framework used for the model.

Method `count_parameter()`

Method for counting the trainable parameters of a model.

Usage

TextEmbeddingModel$count_parameter(with_head = FALSE)

Arguments

with_head: bool If TRUE the number of parameters is returned including the language modeling head of the model. If FALSE only the number of parameters of the core model is returned.

Returns

Returns the number of trainable parameters of the model.

Method `is_configured()`

Method for checking if the model was successfully configured. An object can only be used if this value is TRUE.

Usage

TextEmbeddingModel$is_configured()

Returns

bool TRUE if the model is fully configured. FALSE if not.

Method `get_private()`

Method for requesting all private fields and methods. Used for loading and updating an object.

Usage

TextEmbeddingModel$get_private()

Returns

Returns a list with all private fields and methods.

Method `get_all_fields()`

Return all fields.

Usage

TextEmbeddingModel$get_all_fields()

Returns

Method returns a list containing all public and private fields of the object.

Method `clone()`

The objects of this class are cloneable with this method.

Usage

TextEmbeddingModel$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
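
The difference between a shallow and a deep clone is general R6 behavior and can be demonstrated with two toy classes. Inner and Outer are hypothetical and unrelated to TextEmbeddingModel; the sketch only assumes that the R6 package is installed.

```r
library(R6)

# Toy classes for illustration only
Inner <- R6Class("Inner", public = list(value = 1))
Outer <- R6Class("Outer",
  public = list(
    inner = NULL,
    initialize = function() {
      self$inner <- Inner$new()
    }
  )
)

original <- Outer$new()
shallow <- original$clone() # nested R6 object is shared
deep <- original$clone(deep = TRUE) # nested R6 object is copied

# Changing the nested object affects the shallow clone but not the deep clone
original$inner$value <- 2
shallow_value <- shallow$inner$value
deep_value <- deep$inner$value
```

A shallow clone copies the fields of the object itself, so fields holding other R6 objects still point to the same instances. With deep = TRUE, R6 also clones fields that are R6 objects.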
