This R6 class stores a text embedding model which can be used to tokenize, encode, decode, and embed raw texts. The object provides a unified interface for the different text processing methods.
Value
Objects of class TextEmbeddingModel transform raw texts into numerical representations which can be used for downstream tasks. To this end, objects of this class allow users to tokenize raw texts, to encode tokens into sequences of integers, and to decode sequences of integers back into tokens.
See also
Other Text Embedding:
TEFeatureExtractor
Public fields
last_training
('list()')
List for storing the history and the results of the last training. This information will be overwritten if a new training is started.
tokenizer_statistics
('matrix()')
Matrix containing the tokenizer statistics for the creation of the tokenizer and all training runs according to Kaya & Tantuğ (2024).
Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. https://doi.org/10.1016/j.iswa.2024.200335
Methods
Method configure()
Method for creating a new text embedding model
Usage
TextEmbeddingModel$configure(
model_name = NULL,
model_label = NULL,
model_language = NULL,
method = NULL,
ml_framework = "pytorch",
max_length = 0,
chunks = 2,
overlap = 0,
emb_layer_min = "middle",
emb_layer_max = "2_3_layer",
emb_pool_type = "average",
model_dir = NULL,
trace = FALSE
)
Arguments
model_name
string
containing the name of the new model.
model_label
string
containing the label/title of the new model.
model_language
string
containing the language which the model represents (e.g., English).
method
string
determining the kind of embedding model. Currently the following models are supported: method="bert" for Bidirectional Encoder Representations from Transformers (BERT), method="roberta" for A Robustly Optimized BERT Pretraining Approach (RoBERTa), method="longformer" for Long-Document Transformer, method="funnel" for Funnel-Transformer, method="deberta_v2" for Decoding-enhanced BERT with Disentangled Attention (DeBERTa V2), method="glove" for GlobalVector Clusters, and method="lda" for topic modeling. See details for more information.
ml_framework
string
Framework to use for the model. ml_framework="tensorflow" for 'tensorflow' and ml_framework="pytorch" for 'pytorch'. Only relevant for transformer models. To request a bag-of-words model, set ml_framework=NULL.
max_length
int
determining the maximum length of token sequences used in transformer models. Not relevant for the other methods.
chunks
int
Maximum number of chunks. Must be at least 2.
overlap
int
determining the number of tokens which should be added at the beginning of the next chunk. Only relevant for transformer models.
emb_layer_min
int or string
determining the first layer to be included in the creation of embeddings. An integer corresponds to the layer number; the first layer has the number 1. Instead of an integer, the following strings are possible: "start" for the first layer, "middle" for the middle layer, "2_3_layer" for the layer at two-thirds depth, and "last" for the last layer.
emb_layer_max
int or string
determining the last layer to be included in the creation of embeddings. An integer corresponds to the layer number; the first layer has the number 1. Instead of an integer, the following strings are possible: "start" for the first layer, "middle" for the middle layer, "2_3_layer" for the layer at two-thirds depth, and "last" for the last layer.
emb_pool_type
string
determining the method for pooling the token embeddings within each layer. If "cls", only the embedding of the CLS token is used. If "average", the token embeddings of all tokens are averaged (excluding padding tokens). "cls" is not supported for method="funnel".
model_dir
string
path to the directory where the BERT model is stored.
trace
bool
TRUE prints information about the progress. FALSE does not.
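As a minimal sketch of a typical configuration (assuming the 'aifeducation' package and its python backend are set up; model_name, model_label, and model_dir below are illustrative values, not fixed names):

```r
library(aifeducation)

# Create the R6 object, then configure it as a BERT-based embedding model.
embedding_model <- TextEmbeddingModel$new()
embedding_model$configure(
  model_name = "bert_english",              # illustrative name
  model_label = "BERT for English Texts",   # illustrative label
  model_language = "english",
  method = "bert",
  ml_framework = "pytorch",
  max_length = 512,
  chunks = 4,
  overlap = 30,
  emb_layer_min = "middle",
  emb_layer_max = "2_3_layer",
  emb_pool_type = "average",
  model_dir = "models/bert_base",           # illustrative path to a stored BERT model
  trace = TRUE
)
```

The layer range and pooling settings determine how the hidden states of the transformer are condensed into one embedding per chunk.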
Method load_from_disk()
Method for loading an object from disk and updating it to the current version of the package.
Method load()
Method for loading a transformers model into R.
Method save()
Method for saving a transformer model on disk. Relevant only for transformer models.
Method encode()
Method for encoding words of raw texts into integers.
Usage
TextEmbeddingModel$encode(
raw_text,
token_encodings_only = FALSE,
to_int = TRUE,
trace = FALSE
)
Arguments
raw_text
vector
containing the raw texts.
token_encodings_only
bool
If TRUE, only the token encodings are returned. If FALSE, the complete encoding is returned, which is important for some transformer models.
to_int
bool
If TRUE, the integer ids of the tokens are returned. If FALSE, the tokens are returned. The argument only applies to transformer models and only if token_encodings_only=TRUE.
trace
bool
If TRUE, information on the progress is printed. FALSE if not requested.
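A short sketch of encoding the same texts once as integer ids and once as tokens ('embedding_model' is assumed to be an already configured TextEmbeddingModel):

```r
texts <- c("The first text.", "A second, somewhat longer text.")

# Integer ids of the tokens
int_sequences <- embedding_model$encode(
  raw_text = texts,
  token_encodings_only = TRUE,
  to_int = TRUE
)

# The tokens themselves instead of their ids
tokens <- embedding_model$encode(
  raw_text = texts,
  token_encodings_only = TRUE,
  to_int = FALSE
)
```

The method decode() reverses this step, turning sequences of integers back into tokens.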
Method decode()
Method for decoding a sequence of integers into tokens.
Method embed()
Method for creating text embeddings from raw texts.
This method should only be used if a small number of texts is to be transformed into text embeddings. For a large number of texts, please use the method embed_large.
If a GPU runs out of memory while using 'tensorflow', reduce the batch size or restart R and switch to CPU-only mode via set_config_cpu_only. In general, this is not relevant for 'pytorch'.
Usage
TextEmbeddingModel$embed(
raw_text = NULL,
doc_id = NULL,
batch_size = 8,
trace = FALSE,
return_large_dataset = FALSE
)
Arguments
raw_text
vector
containing the raw texts.
doc_id
vector
containing the corresponding IDs for every text.
batch_size
int
determining the maximal size of every batch.
trace
bool
TRUE if information about the progress should be printed to the console.
return_large_dataset
bool
If TRUE, the returned object is of class LargeDataSetForTextEmbeddings. If FALSE, it is of class EmbeddedText.
Returns
Method returns an object of class EmbeddedText or LargeDataSetForTextEmbeddings. This object contains the embeddings as a data.frame and information about the model creating the embeddings.
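A sketch of embedding a small number of texts ('embedding_model' is assumed to be a configured TextEmbeddingModel; the texts and IDs are illustrative):

```r
embeddings <- embedding_model$embed(
  raw_text = c("First document.", "Second document."),
  doc_id = c("doc_1", "doc_2"),
  batch_size = 2,
  return_large_dataset = FALSE
)
# 'embeddings' is of class EmbeddedText; its data.frame of numerical
# representations can be passed to downstream tasks.
```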
Method embed_large()
Method for creating text embeddings from raw texts.
Usage
TextEmbeddingModel$embed_large(
large_datas_set,
batch_size = 32,
trace = FALSE,
log_file = NULL,
log_write_interval = 2
)
Arguments
large_datas_set
Object of class LargeDataSetForText containing the raw texts.
batch_size
int
determining the maximal size of every batch.
trace
bool
TRUE if information about the progress should be printed to the console.
log_file
string
Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int
Time in seconds determining the interval at which the logger should try to update the log files. Only relevant if log_file is not NULL.
Returns
Method returns an object of class LargeDataSetForTextEmbeddings.
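A sketch of embedding a large collection of texts ('large_text_set' is assumed to be an existing object of class LargeDataSetForText; the log file path is illustrative):

```r
embeddings <- embedding_model$embed_large(
  large_datas_set = large_text_set,
  batch_size = 32,
  trace = TRUE,
  log_file = "logs/embedding.log",  # set to NULL to disable logging
  log_write_interval = 2
)
# 'embeddings' is of class LargeDataSetForTextEmbeddings.
```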
Method fill_mask()
Method for predicting the tokens behind mask tokens.
Method set_publication_info()
Method for setting the bibliographic information of the model.
Method set_model_description()
Method for setting a description of the model.
Usage
TextEmbeddingModel$set_model_description(
eng = NULL,
native = NULL,
abstract_eng = NULL,
abstract_native = NULL,
keywords_eng = NULL,
keywords_native = NULL
)
Arguments
eng
string
A text describing the training of the classifier, its theoretical and empirical background, and the different output labels in English.
native
string
A text describing the training of the classifier, its theoretical and empirical background, and the different output labels in the native language of the model.
abstract_eng
string
A text providing a summary of the description in English.
abstract_native
string
A text providing a summary of the description in the native language of the classifier.
keywords_eng
vector
of keywords in English.
keywords_native
vector
of keywords in the native language of the classifier.
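A sketch of setting the description for a model whose native language is, for illustration, German (all texts and keywords are placeholders):

```r
embedding_model$set_model_description(
  eng = "This model was trained on ... and produces embeddings for ...",
  native = "Dieses Modell wurde trainiert auf ...",  # description in the model's native language
  abstract_eng = "Short English summary of the model.",
  abstract_native = "Kurze Zusammenfassung des Modells.",
  keywords_eng = c("text embedding", "education"),
  keywords_native = c("Texteinbettung", "Bildung")
)
```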
Method get_package_versions()
Method for requesting a summary of the versions of the R and python packages used for creating the model.
Method get_basic_components()
Method for requesting the part of the interface's configuration that is necessary for all models.
Method get_transformer_components()
Method for requesting the part of the interface's configuration that is necessary for transformer models.
Method get_sustainability_data()
Method for requesting a log of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
Method get_ml_framework()
Method for requesting the machine learning framework used for the classifier.
Method count_parameter()
Method for counting the trainable parameters of a model.
Method is_configured()
Method for checking if the model was successfully configured.
An object can only be used if this value is TRUE.
Method get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.