This R6 class stores a text embedding model which can be used to tokenize, encode, decode, and embed raw texts. The object provides a uniform interface for different text processing methods.
Value
Objects of class TextEmbeddingModel transform raw texts into numerical representations which can be used for downstream tasks. To this end, objects of this class can tokenize raw texts, encode tokens into sequences of integers, and decode sequences of integers back into tokens.
See also
Other Text Embedding:
TEFeatureExtractor
Super classes
aifeducation::AIFEMaster
-> aifeducation::AIFEBaseModel
-> TextEmbeddingModel
Methods
Public methods
Inherited methods
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEBaseModel$count_parameter()
Method configure()
Method for creating a new text embedding model
Usage
TextEmbeddingModel$configure(
  model_name = NULL,
  model_label = NULL,
  model_language = NULL,
  max_length = 0L,
  chunks = 2L,
  overlap = 0L,
  emb_layer_min = 1L,
  emb_layer_max = 2L,
  emb_pool_type = "Average",
  pad_value = -100L,
  base_model = NULL
)
Arguments
model_name
string
Name of the new model. Please refer to common naming conventions. Free text can be used with the parameter label. If set to NULL, a unique ID is generated automatically. Allowed values: any
model_label
string
Label for the new model. Here you can use free text. Allowed values: any
model_language
string
Languages that the model can work with. Allowed values: any
max_length
int
Maximum number of tokens per chunk. Must be less than or equal to the maximum number of positional embeddings of the model. Allowed values: 20 <= x
chunks
int
Maximum number of chunks. Allowed values: 2 <= x
overlap
int
Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values: 0 <= x
emb_layer_min
int
Minimal layer from which the embeddings should be calculated. Allowed values: 1 <= x
emb_layer_max
int
Maximal layer from which the embeddings should be calculated. Allowed values: 1 <= x
emb_pool_type
string
Method for summarizing the embeddings of single tokens into a text embedding. In the case of 'CLS', all CLS tokens between emb_layer_min and emb_layer_max are averaged. In the case of 'Average', the embeddings of all tokens are averaged. Please note that BaseModelFunnel allows only 'CLS'. Allowed values: 'CLS', 'Average'
pad_value
int
Value indicating padding. This value should not be in the range of regular values for computations. Thus it is not recommended to change this value. Default is -100. Allowed values: x <= -100
base_model
BaseModelCore
BaseModels for processing raw texts.
trace
bool
TRUE if information about the estimation phase should be printed to the console.
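The interplay of max_length, chunks, and overlap can be illustrated with a small base R sketch. The helper chunk_tokens is hypothetical and not part of aifeducation. It assumes that each chunk after the first starts with the last overlap tokens of the previous chunk, and that tokens beyond the last allowed chunk are dropped. The package's actual implementation may differ, for example in how special tokens are counted.

```r
# Hypothetical helper (an assumption, not the package's implementation):
# split a token vector into at most `chunks` chunks of up to `max_length`
# tokens; every chunk after the first repeats the last `overlap` tokens
# of the previous chunk.
chunk_tokens <- function(tokens, max_length, chunks, overlap = 0L) {
  stopifnot(max_length > overlap, chunks >= 1L, overlap >= 0L)
  step <- max_length - overlap # number of new tokens per additional chunk
  result <- list()
  start <- 1L
  while (start <= length(tokens) && length(result) < chunks) {
    end <- min(start + max_length - 1L, length(tokens))
    result[[length(result) + 1L]] <- tokens[start:end]
    if (end == length(tokens)) break # all tokens consumed
    start <- start + step
  }
  result
}

# Ten toy tokens, chunks of four tokens, one token of overlap.
tokens <- paste0("t", 1:10)
chunked <- chunk_tokens(tokens, max_length = 4L, chunks = 3L, overlap = 1L)
print(chunked)
```

With chunks = 2L instead of 3L, this sketch would keep only the first two chunks and discard the remaining tokens, which is why the maximum number of chunks limits how much of a long text can enter the embedding.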
Method load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
Method save()
Method for saving a model on disk.
Method encode()
Method for encoding words of raw texts into integers.
Usage
TextEmbeddingModel$encode(
  raw_text,
  token_encodings_only = FALSE,
  token_to_int = TRUE,
  trace = FALSE
)
Arguments
raw_text
vector
Raw text.
token_encodings_only
bool
TRUE: Returns a list containing only the tokens. FALSE: Returns a list containing a list for the tokens, the number of chunks, and the potential number of chunks for each document/text.
token_to_int
bool
TRUE: Returns the tokens as int indices. FALSE: Returns the tokens as strings.
trace
bool
TRUE if information about the estimation phase should be printed to the console.
Method decode()
Method for decoding a sequence of integers into tokens
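The relationship between encoding and decoding can be sketched in base R with a toy vocabulary. Everything here is an assumption for illustration: the vocabulary, the "[UNK]" fallback, and the helpers encode_tokens and decode_ids are hypothetical and not part of aifeducation, whose models rely on the tokenizer and vocabulary of the underlying base model.

```r
# Toy vocabulary mapping tokens to integer ids (hypothetical values).
vocab <- c("[PAD]" = 0L, "[UNK]" = 1L, "the" = 2L, "cat" = 3L, "sat" = 4L)

# Map tokens to integer ids; tokens missing from the vocabulary
# fall back to the id of "[UNK]".
encode_tokens <- function(tokens, vocab) {
  ids <- vocab[tokens] # lookup by name; unknown tokens yield NA
  ids[is.na(ids)] <- vocab[["[UNK]"]]
  unname(ids)
}

# Map integer ids back to their tokens.
decode_ids <- function(ids, vocab) {
  names(vocab)[match(ids, vocab)]
}

tokens <- c("the", "cat", "sat", "down")
ids <- encode_tokens(tokens, vocab)
decoded <- decode_ids(ids, vocab)
print(ids)
print(decoded)
```

In this sketch the round trip is lossy for out-of-vocabulary tokens: "down" is encoded as the id of "[UNK]" and therefore decodes to "[UNK]".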
Method embed()
Method for creating text embeddings from raw texts.
Use this method only if a small number of texts is to be transformed into text embeddings. For a large number of texts, please use the method embed_large.
Usage
TextEmbeddingModel$embed(
  raw_text = NULL,
  doc_id = NULL,
  batch_size = 8L,
  trace = FALSE,
  return_large_dataset = FALSE
)
Arguments
raw_text
vector
Raw text.
doc_id
vector
ID for every text.
batch_size
int
Size of the batches for training. Allowed values: 1 <= x
trace
bool
TRUE if information about the estimation phase should be printed to the console.
return_large_dataset
bool
If TRUE, a LargeDataSetForTextEmbeddings is returned. If FALSE, an object of class EmbeddedText is returned.
Returns
Method returns an object of class EmbeddedText or LargeDataSetForTextEmbeddings. This object contains the embeddings as a data.frame and information about the model creating the embeddings.
Method embed_large()
Method for creating text embeddings from raw texts.
Usage
TextEmbeddingModel$embed_large(
  text_dataset,
  batch_size = 32L,
  trace = FALSE,
  log_file = NULL,
  log_write_interval = 2L
)
Arguments
text_dataset
LargeDataSetForText
LargeDataSetForText object storing textual data.
batch_size
int
Size of the batches for training. Allowed values: 1 <= x
trace
bool
TRUE if information about the estimation phase should be printed to the console.
log_file
string
Path to the file where the log should be saved. If no logging is desired, set this argument to NULL. Allowed values: any
log_write_interval
int
Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL. Allowed values: 1 <= x
Returns
Method returns an object of class LargeDataSetForTextEmbeddings.
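The effect of batch_size can be illustrated with a generic mini-batching sketch in base R. The helper make_batches is hypothetical and not part of aifeducation; how embed_large iterates over a LargeDataSetForText internally is not described here, so this only shows the general idea of processing a large collection in consecutive groups of at most batch_size texts.

```r
# Hypothetical helper: split the indices 1..n_items into consecutive
# batches of at most `batch_size` elements each.
make_batches <- function(n_items, batch_size) {
  stopifnot(n_items >= 1L, batch_size >= 1L)
  idx <- seq_len(n_items)
  unname(split(idx, ceiling(idx / batch_size)))
}

# 70 texts with batch_size = 32 give two full batches and one partial batch.
batches <- make_batches(70L, 32L)
print(lengths(batches))
```

A larger batch size means fewer iterations but more memory per iteration, so the value is usually chosen to fit the available hardware.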
Method set_publication_info()
Method for setting the bibliographic information of the model.
Method get_sustainability_data()
Method for requesting a summary of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
Method estimate_sustainability_inference_embed()
Calculates the energy consumption for inference of the given task.
Usage
TextEmbeddingModel$estimate_sustainability_inference_embed(
  text_dataset = NULL,
  batch_size = 32L,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 10L,
  sustain_log_level = "warning",
  trace = TRUE
)
Arguments
text_dataset
LargeDataSetForText
LargeDataSetForText object storing textual data.
batch_size
int
Size of the batches for training. Allowed values: 1 <= x
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon (https://mlco2.github.io/codecarbon/parameters.html) for more information. Allowed values: any
sustain_interval
int
Interval in seconds for measuring power usage. Allowed values: 1 <= x
sustain_log_level
trace
bool
TRUE if information about the estimation phase should be printed to the console.