
This R6 class stores a text embedding model which can be used to tokenize, encode, decode, and embed raw texts. The object provides a unified interface to these different text processing methods.

Value

Objects of class TextEmbeddingModel transform raw texts into numerical representations which can be used for downstream tasks. To this end, objects of this class provide methods to tokenize raw texts, to encode tokens into sequences of integers, and to decode sequences of integers back into tokens.

See also

Other Text Embedding: TEFeatureExtractor

Super classes

aifeducation::AIFEMaster -> aifeducation::AIFEBaseModel -> TextEmbeddingModel

Public fields

BaseModel

('BaseModelCore')
Object of class BaseModelCore.

Methods

Inherited methods


Method configure()

Method for creating a new text embedding model.

Usage

TextEmbeddingModel$configure(
  model_name = NULL,
  model_label = NULL,
  model_language = NULL,
  max_length = 0L,
  chunks = 2L,
  overlap = 0L,
  emb_layer_min = 1L,
  emb_layer_max = 2L,
  emb_pool_type = "Average",
  pad_value = -100L,
  base_model = NULL
)

Arguments

model_name

string Name of the new model. Please refer to common name conventions. Free text can be used with the parameter model_label. If set to NULL a unique ID is generated automatically. Allowed values: any

model_label

string Label for the new model. Here you can use free text. Allowed values: any

model_language

string Language that the model can work with. Allowed values: any

max_length

int Maximal number of tokens per chunk. Must be equal to or lower than the maximal number of positional embeddings of the model. Allowed values: 20 <= x

chunks

int Maximal number of chunks. Allowed values: 2 <= x

overlap

int Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values: 0 <= x

emb_layer_min

int Minimal layer from which the embeddings should be calculated. Allowed values: 1 <= x

emb_layer_max

int Maximal layer from which the embeddings should be calculated. Allowed values: 1 <= x

emb_pool_type

string Method to summarize the embedding of single tokens into a text embedding. In the case of 'CLS' all cls-tokens between emb_layer_min and emb_layer_max are averaged. In the case of 'Average' the embeddings of all tokens are averaged. Please note that BaseModelFunnel allows only 'CLS'. Allowed values: 'CLS', 'Average'

pad_value

int Value indicating padding. This value should not be in the range of regular values used for computations. Thus, it is not recommended to change this value. Default is -100. Allowed values: x <= -100

base_model

BaseModelCore Base model used for processing raw texts.

trace

bool TRUE if information about the estimation phase should be printed to the console.

Returns

Method does not return a value. It is used to configure the model.
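
A minimal, hypothetical sketch of configuring a new model. The object names tem and bert_base, the instantiation via TextEmbeddingModel$new(), and all parameter values are illustrative assumptions; bert_base stands for a previously created or loaded BaseModelCore object.

# 'bert_base' is assumed to be an existing BaseModelCore object
tem <- TextEmbeddingModel$new()
tem$configure(
  model_name = "example_embedding_model",
  model_label = "Example Text Embedding Model",
  model_language = "english",
  max_length = 128L,
  chunks = 4L,
  overlap = 20L,
  emb_layer_min = 11L,  # assumes a base model with at least 12 layers
  emb_layer_max = 12L,
  emb_pool_type = "Average",
  base_model = bert_base
)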


Method load_from_disk()

Loads an object from disk and updates the object to the current version of the package.

Usage

TextEmbeddingModel$load_from_disk(dir_path)

Arguments

dir_path

Path to the directory where the object is stored.

Returns

Function does not return a value. It loads an object from disk.
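
A short, hypothetical sketch of restoring a saved model; the directory path is purely illustrative.

tem <- TextEmbeddingModel$new()
# Load a previously saved model and update it to the current package version
tem$load_from_disk(dir_path = "models/example_embedding_model")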


Method save()

Method for saving a model on disk.

Usage

TextEmbeddingModel$save(dir_path, folder_name)

Arguments

dir_path

Path to the directory where to save the object.

folder_name

string Name of the folder where the model should be saved. Allowed values: any

Returns

Function does not return a value. It is used to save an object on disk.
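
A hypothetical usage sketch, assuming tem is a configured TextEmbeddingModel; the paths are placeholders.

# Saves the model into 'models/example_embedding_model/v1'
tem$save(dir_path = "models/example_embedding_model", folder_name = "v1")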


Method encode()

Method for encoding words of raw texts into integers.

Usage

TextEmbeddingModel$encode(
  raw_text,
  token_encodings_only = FALSE,
  token_to_int = TRUE,
  trace = FALSE
)

Arguments

raw_text

vector Raw text.

token_encodings_only

bool

  • TRUE: Returns a list containing only the tokens.

  • FALSE: Returns a list containing the tokens, the number of chunks, and the potential number of chunks for each document/text.

token_to_int

bool

  • TRUE: Returns the tokens as integer indices.

  • FALSE: Returns the tokens as strings.

trace

bool TRUE if information about the estimation phase should be printed to the console.

Returns

list containing the integer or token sequences of the raw texts with special tokens.
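
A hedged usage sketch, assuming tem is a configured TextEmbeddingModel; the example texts are placeholders.

texts <- c("First example document.", "Second example document.")
# Integer encodings plus information on the (potential) number of chunks
enc <- tem$encode(raw_text = texts, token_encodings_only = FALSE, token_to_int = TRUE)
# Tokens as strings only
tok <- tem$encode(raw_text = texts, token_encodings_only = TRUE, token_to_int = FALSE)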


Method decode()

Method for decoding a sequence of integers into tokens.

Usage

TextEmbeddingModel$decode(int_seqence, to_token = FALSE)

Arguments

int_seqence

list List of integer sequences that should be converted to tokens.

to_token

bool

  • FALSE: Transforms the integers to plain text.

  • TRUE: Transforms the integers to a sequence of tokens.

Returns

list of decoded sequences (plain text or tokens, depending on to_token).
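
A hedged sketch, assuming int_ids holds integer sequences such as those produced by encode() with token_to_int = TRUE.

# Back to plain text
txt <- tem$decode(int_seqence = int_ids, to_token = FALSE)
# Back to sequences of tokens
tok <- tem$decode(int_seqence = int_ids, to_token = TRUE)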


Method embed()

Method for creating text embeddings from raw texts. This method should only be used to transform a small number of texts into text embeddings. For a large number of texts, please use the method embed_large.

Usage

TextEmbeddingModel$embed(
  raw_text = NULL,
  doc_id = NULL,
  batch_size = 8L,
  trace = FALSE,
  return_large_dataset = FALSE
)

Arguments

raw_text

vector Raw text.

doc_id

vector Id for every text.

batch_size

int Size of the batches for training. Allowed values: 1 <= x

trace

bool TRUE if information about the estimation phase should be printed to the console.

return_large_dataset

bool If TRUE a LargeDataSetForTextEmbeddings is returned. If FALSE an object of class EmbeddedText is returned.

Returns

Method returns an object of class EmbeddedText or LargeDataSetForTextEmbeddings. This object contains the embeddings as a data.frame and information about the model creating the embeddings.
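
A hedged usage sketch with placeholder texts and document IDs, assuming tem is a configured TextEmbeddingModel.

emb <- tem$embed(
  raw_text = c("First example document.", "Second example document."),
  doc_id = c("doc_1", "doc_2"),
  batch_size = 8L
)
# 'emb' is an EmbeddedText object; with return_large_dataset = TRUE a
# LargeDataSetForTextEmbeddings would be returned instead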


Method embed_large()

Method for creating text embeddings from raw texts.

Usage

TextEmbeddingModel$embed_large(
  text_dataset,
  batch_size = 32L,
  trace = FALSE,
  log_file = NULL,
  log_write_interval = 2L
)

Arguments

text_dataset

LargeDataSetForText Object storing textual data.

batch_size

int Size of the batches for training. Allowed values: 1 <= x

trace

bool TRUE if information about the estimation phase should be printed to the console.

log_file

string Path to the file where the log should be written. If no logging is desired, set this argument to NULL. Allowed values: any

log_write_interval

int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL. Allowed values: 1 <= x

Returns

Method returns an object of class LargeDataSetForTextEmbeddings.
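
A hedged sketch, assuming large_texts is an existing LargeDataSetForText object.

emb_large <- tem$embed_large(
  text_dataset = large_texts,
  batch_size = 32L,
  log_file = NULL
)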


Method get_n_features()

Method for requesting the number of features.

Usage

TextEmbeddingModel$get_n_features()

Returns

Returns a double representing the number of features. This number corresponds to the hidden size of the embeddings for every chunk/time step.


Method get_pad_value()

Method for requesting the value used to indicate padding.

Usage

TextEmbeddingModel$get_pad_value()

Returns

Returns an int describing the value used for padding.
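
A brief sketch combining both getters, assuming tem is a configured TextEmbeddingModel.

n_features <- tem$get_n_features()  # hidden size of the embeddings per chunk
pad_value  <- tem$get_pad_value()   # value used for padding, -100 by default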


Method set_publication_info()

Method for setting the bibliographic information of the model.

Usage

TextEmbeddingModel$set_publication_info(type, authors, citation, url = NULL)

Arguments

type

string Type of information which should be changed/added. Allowed values: 'developer', 'modifier'

authors

List of people.

citation

string Citation in free text.

url

string Corresponding URL if applicable.

Returns

Function does not return a value. It is used to set the private members for publication information of the model.
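
A hypothetical sketch; the author, citation, and URL are placeholders (persons can be created with utils::person()).

tem$set_publication_info(
  type = "developer",
  authors = list(person(given = "Jane", family = "Doe")),
  citation = "Doe, J. (2024). Example text embedding model.",
  url = "https://example.org/example_model"
)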


Method get_sustainability_data()

Method for requesting a summary of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.

Usage

TextEmbeddingModel$get_sustainability_data(track_mode = "training")

Arguments

track_mode

string Determines the step to which the data refer. Allowed values: 'training', 'inference'

Returns

Returns a list containing the tracked energy consumption, CO2 equivalents in kg, information on the tracker used, and technical information on the training infrastructure.


Method estimate_sustainability_inference_embed()

Calculates the energy consumption for inference of the given task.

Usage

TextEmbeddingModel$estimate_sustainability_inference_embed(
  text_dataset = NULL,
  batch_size = 32L,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 10L,
  sustain_log_level = "warning",
  trace = TRUE
)

Arguments

text_dataset

LargeDataSetForText Object storing textual data.

batch_size

int Size of the batches for training. Allowed values: 1 <= x

sustain_iso_code

string ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any

sustain_region

string Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html. Allowed values: any

sustain_interval

int Interval in seconds for measuring power usage. Allowed values: 1 <= x

sustain_log_level

string Log level of the sustainability tracker. Default is 'warning'.

trace

bool TRUE if information about the estimation phase should be printed to the console.

Returns

Returns nothing. Method saves the statistics internally. The statistics can be accessed with the method get_sustainability_data(track_mode = "inference").
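
A hedged sketch, assuming large_texts is an existing LargeDataSetForText object; the ISO code is only an example.

tem$estimate_sustainability_inference_embed(
  text_dataset = large_texts,
  batch_size = 32L,
  sustain_iso_code = "DEU"
)
# Access the recorded statistics afterwards
tem$get_sustainability_data(track_mode = "inference")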


Method clone()

The objects of this class are cloneable with this method.

Usage

TextEmbeddingModel$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.