Abstract class for small data sets containing text embeddings
Source: R/obj_EmbeddedText.R
Object of class R6 which stores the text embeddings generated by an object of class
TextEmbeddingModel. The text embeddings are held in memory (RAM). With a large number of documents
the data may not fit into memory, so please use this object only for a small sample of texts. In general, it
is recommended to use an object of class LargeDataSetForTextEmbeddings, which can deal with any number of texts.
Value
Returns an object of class EmbeddedText. These objects are used for storing and managing the text embeddings created with objects of class TextEmbeddingModel. Objects of class EmbeddedText serve as input for objects of class TEClassifierRegular, TEClassifierProtoNet, and TEFeatureExtractor. The main aim of this class is to provide a structured link between embedding models and classifiers. Because objects of this class store information on the text embedding model that created the embeddings, they ensure that only embeddings generated with the same embedding model are combined. Furthermore, the stored information allows other objects to check whether embeddings from the correct text embedding model are used for training and prediction.
See also
Other Data Management:
LargeDataSetForText,
LargeDataSetForTextEmbeddings
Public fields
embeddings ('data.frame()')
data.frame containing the text embeddings for all chunks. Documents are in the rows. Embedding dimensions are in the columns.
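As a minimal sketch of the layout described above, the following base R snippet builds a toy data.frame with one row per document and one column per embedding dimension. The document IDs, column names, and random values are invented for illustration; they are not produced by a TextEmbeddingModel, and the snippet does not use the package itself.

```r
# Toy illustration only: the values are random numbers, not real embeddings.
set.seed(42)
n_documents <- 3
n_features <- 4

toy_embeddings <- as.data.frame(
  matrix(rnorm(n_documents * n_features), nrow = n_documents, ncol = n_features)
)

# Hypothetical document IDs and feature names, chosen for readability.
rownames(toy_embeddings) <- paste0("doc_", seq_len(n_documents))
colnames(toy_embeddings) <- paste0("feature_", seq_len(n_features))

# First entry: number of documents; second entry: number of embedding dimensions.
dim(toy_embeddings)
```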
Methods
Method configure()
Creates a new object representing text embeddings.
Usage
EmbeddedText$configure(
  embeddings,
  model_name = NA,
  model_label = NA,
  model_date = NA,
  model_method = NA,
  model_version = NA,
  model_language = NA,
  param_seq_length = NA,
  param_chunks = NULL,
  param_features = NULL,
  param_overlap = NULL,
  param_emb_layer_min = NULL,
  param_emb_layer_max = NULL,
  param_emb_pool_type = NULL,
  param_aggregation = NULL,
  param_pad_value = -100L
)
Arguments
embeddings (data.frame): Data frame containing the text embeddings.
model_name (string): Name of the model that generated this embedding.
model_label (string): Label of the model that generated this embedding.
model_date (string): Date when the embedding-generating model was created.
model_method (string): Method of the underlying embedding model.
model_version (string): Version of the model that generated this embedding.
model_language (string): Language of the model that generated this embedding.
param_seq_length (int): Maximum number of tokens that the generating model processes per chunk.
param_chunks (int): Maximum number of chunks supported by the generating model.
param_features (int): Number of dimensions of the text embeddings.
param_overlap (int): Number of tokens that were added at the beginning of the sequence for the next chunk by this model.
param_emb_layer_min (int or string): Determines the first layer to be included in the creation of embeddings.
param_emb_layer_max (int or string): Determines the last layer to be included in the creation of embeddings.
param_emb_pool_type (string): Determines the method for pooling the token embeddings within each layer.
param_aggregation (string): Aggregation method of the hidden states. Deprecated. Only included for backward compatibility.
param_pad_value (int): Value indicating padding. This value should not be in the range of regular values used in computations, so changing it is not recommended. Default is -100. Allowed values: x <= -100
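The constraint on param_pad_value (allowed values: x <= -100) can be checked before calling configure(). The helper below is a hypothetical base R sketch and is not part of the package, which may perform its own validation.

```r
# Hypothetical helper, not part of aifeducation.
# Returns TRUE only for a single, non-missing number that satisfies x <= -100.
is_valid_pad_value <- function(x) {
  is.numeric(x) && length(x) == 1L && !is.na(x) && x <= -100
}

is_valid_pad_value(-100L) # the documented default
is_valid_pad_value(0L)    # lies inside the range of regular values
```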
Returns
Returns an object of class EmbeddedText which stores the text embeddings produced by an object of class TextEmbeddingModel.
Method save()
Saves a data set to disk.
Method is_configured()
Method for checking if the model was successfully configured. An object can only be used if this
value is TRUE.
Method load_from_disk()
Loads an object of class EmbeddedText from disk and updates the object to the current version of the package.
Method get_model_info()
Method for retrieving information about the model that generated this embedding.
Method get_model_label()
Method for retrieving the label of the model that generated this embedding.
Method get_features()
Number of actual features/dimensions of the text embeddings. If a
feature extractor was used, the number of features is smaller than the original number of
features. To retrieve the original number of features (the number of features before applying a
feature extractor), use the method get_original_features of this class.
Method get_original_features()
Number of original features/dimensions of the text embeddings.
Returns
Returns an int describing the number of features/dimensions if no
feature extractor is used, or the number before a feature extractor is
applied.
Method is_compressed()
Checks if the text embeddings were reduced by a feature extractor.
Returns
Returns TRUE if the number of dimensions was reduced by a feature extractor.
Otherwise returns FALSE.
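The relationship between get_features(), get_original_features(), and is_compressed() can be sketched in base R. This is an illustrative assumption based only on the descriptions above: the function name is hypothetical, and the comparison used here does not reproduce the package's internal code.

```r
# Hypothetical illustration, not the package's implementation:
# embeddings are treated as compressed when a feature extractor has reduced
# the number of dimensions below the original number of dimensions.
is_compressed_sketch <- function(features, original_features) {
  features < original_features
}

# Example values are made up for illustration.
is_compressed_sketch(features = 128L, original_features = 768L)
is_compressed_sketch(features = 768L, original_features = 768L)
```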
Method add_feature_extractor_info()
Method for setting information on the feature extractor that was used to reduce the number of dimensions of the text embeddings. This information should only be set if a feature extractor was applied.
Usage
EmbeddedText$add_feature_extractor_info(
  model_name,
  model_label = NA,
  features = NA,
  method = NA,
  noise_factor = NA,
  optimizer = NA
)
Arguments
model_name (string): Name of the underlying TextEmbeddingModel.
model_label (string): Label of the underlying TextEmbeddingModel.
features (int): Number of dimensions (features) of the compressed text embeddings.
method (string): Method that the TEFeatureExtractor applies for generating the compressed text embeddings.
noise_factor (double): Noise factor of the TEFeatureExtractor.
optimizer (string): Optimizer used during training of the TEFeatureExtractor.
Returns
This method returns nothing. It sets information on a feature extractor.
Method get_feature_extractor_info()
Method for retrieving information on the feature extractor that was used to reduce the number of dimensions of the text embeddings.
Returns
Returns a list with information on the feature extractor. If no
feature extractor was used, it returns NULL.
Method convert_to_LargeDataSetForTextEmbeddings()
Method for converting this object to an object of class LargeDataSetForTextEmbeddings.
Returns
Returns an object of class LargeDataSetForTextEmbeddings, which uses memory mapping and can therefore work with large data sets.
Method set_package_versions()
Method for setting the package version for 'aifeducation', 'reticulate', 'torch', and 'numpy' to the currently used versions.
Method get_package_versions()
Method for requesting a summary of the R and python packages' versions used for creating the model.