Abstract class for large data sets containing text embeddings

This object stores text embeddings, which are usually produced by an object of class TextEmbeddingModel. The data of these objects is not stored in memory directly. Instead, the objects use memory mapping, which makes it possible to work with data sets that do not fit into memory/RAM.

Objects of class LargeDataSetForTextEmbeddings are used for storing and managing the text embeddings created with objects of class TextEmbeddingModel. They serve as input for objects of class ClassifiersBasedOnTextEmbeddings and TEFeatureExtractor. The main aim of this class is to provide a structured link between embedding models and classifiers. Since objects of this class save information on the text embedding model that created the text embeddings, they ensure that only embeddings generated with the same embedding model are combined. Furthermore, the stored information allows objects to check whether embeddings of the correct text embedding model are used for training and predicting.

This class is not designed for direct use.
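The consistency check described above can be pictured with a small base R sketch. The helper `same_embedding_model()` and the example lists are hypothetical and only illustrate the idea of comparing the stored model name before combining embeddings; the check that the package actually performs may differ.

```r
# Hypothetical helper (not part of the package): two sets of embeddings are
# treated as compatible only if they were created by the same embedding model.
same_embedding_model <- function(info_a, info_b) {
  identical(info_a$model_name, info_b$model_name)
}

# Illustrative model information, loosely modelled on the arguments of configure().
info_train <- list(model_name = "bert_a", model_label = "BERT A")
info_new <- list(model_name = "bert_b", model_label = "BERT B")

check_same <- same_embedding_model(info_train, info_train)
check_diff <- same_embedding_model(info_train, info_new)

print(check_same)
print(check_diff)
```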

Value

Returns a new object of this class.

Super class

aifeducation::LargeDataSetBase -> LargeDataSetForTextEmbeddings

Methods

Public methods

LargeDataSetForTextEmbeddings$configure()
LargeDataSetForTextEmbeddings$is_configured()
LargeDataSetForTextEmbeddings$get_text_embedding_model_name()
LargeDataSetForTextEmbeddings$get_model_info()
LargeDataSetForTextEmbeddings$load_from_disk()
LargeDataSetForTextEmbeddings$get_model_label()
LargeDataSetForTextEmbeddings$add_feature_extractor_info()
LargeDataSetForTextEmbeddings$get_feature_extractor_info()
LargeDataSetForTextEmbeddings$is_compressed()
LargeDataSetForTextEmbeddings$get_times()
LargeDataSetForTextEmbeddings$get_features()
LargeDataSetForTextEmbeddings$get_original_features()
LargeDataSetForTextEmbeddings$get_pad_value()
LargeDataSetForTextEmbeddings$add_embeddings_from_array()
LargeDataSetForTextEmbeddings$add_embeddings_from_EmbeddedText()
LargeDataSetForTextEmbeddings$add_embeddings_from_LargeDataSetForTextEmbeddings()
LargeDataSetForTextEmbeddings$convert_to_EmbeddedText()
LargeDataSetForTextEmbeddings$clone()

Method `configure()`

Creates a new object representing text embeddings.

Usage

LargeDataSetForTextEmbeddings$configure(
  model_name = NA,
  model_label = NA,
  model_date = NA,
  model_method = NA,
  model_version = NA,
  model_language = NA,
  param_seq_length = NA,
  param_chunks = NULL,
  param_features = NULL,
  param_overlap = NULL,
  param_emb_layer_min = NULL,
  param_emb_layer_max = NULL,
  param_emb_pool_type = NULL,
  param_pad_value = -100L,
  param_aggregation = NULL
)

Arguments

model_name: string Name of the model that generates this embedding.
model_label: string Label of the model that generates this embedding.
model_date: string Date when the embedding generating model was created.
model_method: string Method of the underlying embedding model.
model_version: string Version of the model that generated this embedding.
model_language: string Language of the model that generated this embedding.
param_seq_length: int Maximum number of tokens that the generating model processes per chunk.
param_chunks: int Maximum number of chunks which are supported by the generating model.
param_features: int Number of dimensions of the text embeddings.
param_overlap: int Number of tokens that this model added at the beginning of the sequence for the next chunk.
param_emb_layer_min: int or string determining the first layer to be included in the creation of embeddings.
param_emb_layer_max: int or string determining the last layer to be included in the creation of embeddings.
param_emb_pool_type: string determining the method for pooling the token embeddings within each layer.
param_pad_value: int Value indicating padding. This value should not be in the range of regular values for computations. Thus, it is not recommended to change this value. Default is -100. Allowed values: x <= -1
param_aggregation: string Aggregation method of the hidden states. Deprecated. Only included for backward compatibility.

Returns

The method returns a new object of this class.
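The role of `param_pad_value` can be illustrated with plain base R. The sketch below assumes that embeddings are laid out as a three-dimensional array of cases x chunks (times) x features, which this page does not state explicitly, and it does not call the package. It shows why the padding value should lie outside the range of regular embedding values: padded chunks can then be identified unambiguously.

```r
# Assumption: embeddings are arranged as cases x times (chunks) x features.
pad_value <- -100L
n_cases <- 2L
n_times <- 3L
n_features <- 4L

# Start with an array that is completely filled with the padding value.
emb <- array(
  data = pad_value,
  dim = c(n_cases, n_times, n_features),
  dimnames = list(c("doc_1", "doc_2"), NULL, NULL)
)

# doc_1 has two real chunks, doc_2 has only one real chunk.
emb["doc_1", 1:2, ] <- seq(0.1, 0.8, length.out = 2 * n_features)
emb["doc_2", 1, ] <- c(0.5, -0.2, 0.3, 0.9)

# A chunk counts as padding if all of its features equal the padding value.
is_padding <- apply(emb == pad_value, c(1, 2), all)

# Number of real (non-padded) chunks per case.
n_real_chunks <- rowSums(!is_padding)
print(n_real_chunks)
```

If the padding value were a number that can also occur in real embeddings, such as 0, a genuine chunk could be mistaken for padding.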

Method `is_configured()`

Method for checking if the model was successfully configured. An object can only be used if this value is TRUE.

Usage

LargeDataSetForTextEmbeddings$is_configured()

Returns

bool TRUE if the model is fully configured. FALSE if not.

Method `get_text_embedding_model_name()`

Method for requesting the name (unique id) of the underlying text embedding model.

Usage

LargeDataSetForTextEmbeddings$get_text_embedding_model_name()

Returns

Returns a string describing the name of the text embedding model.

Method `get_model_info()`

Method for retrieving information about the model that generated this embedding.

Usage

LargeDataSetForTextEmbeddings$get_model_info()

Returns

list containing all saved information about the underlying text embedding model.

Method `load_from_disk()`

Loads an object of class LargeDataSetForTextEmbeddings from disk and updates the object to the current version of the package.

Usage

LargeDataSetForTextEmbeddings$load_from_disk(dir_path)

Arguments

dir_path: Path where the data set is stored.

Returns

Method does not return anything. It loads an object from disk.

Method `get_model_label()`

Method for retrieving the label of the model that generated this embedding.

Usage

LargeDataSetForTextEmbeddings$get_model_label()

Returns

string Label of the corresponding text embedding model

Method `add_feature_extractor_info()`

Method for setting information on the TEFeatureExtractor that was used to reduce the number of dimensions of the text embeddings. This information should only be set if a TEFeatureExtractor was applied.

Usage

LargeDataSetForTextEmbeddings$add_feature_extractor_info(
  model_name,
  model_label = NA,
  features = NA,
  method = NA,
  noise_factor = NA,
  optimizer = NA
)

Arguments

model_name: string Name of the underlying TextEmbeddingModel.
model_label: string Label of the underlying TextEmbeddingModel.
features: int Number of dimensions (features) of the compressed text embeddings.
method: string Method that the TEFeatureExtractor applies for generating the compressed text embeddings.
noise_factor: double Noise factor of the TEFeatureExtractor.
optimizer: string Optimizer used during training the TEFeatureExtractor.

Returns

Method does not return anything. It sets information on a TEFeatureExtractor.

Method `get_feature_extractor_info()`

Method for receiving information on the TEFeatureExtractor that was used to reduce the number of dimensions of the text embeddings.

Usage

LargeDataSetForTextEmbeddings$get_feature_extractor_info()

Returns

Returns a list with information on the TEFeatureExtractor. If no TEFeatureExtractor was used it returns NULL.

Method `is_compressed()`

Checks if the text embeddings were reduced by a TEFeatureExtractor.

Usage

LargeDataSetForTextEmbeddings$is_compressed()

Returns

Returns TRUE if the number of dimensions was reduced by a TEFeatureExtractor. Otherwise it returns FALSE.

Method `get_times()`

Number of chunks/times of the text embeddings.

Usage

LargeDataSetForTextEmbeddings$get_times()

Returns

Returns an int describing the number of chunks/times of the text embeddings.

Method `get_features()`

Number of actual features/dimensions of the text embeddings. If a TEFeatureExtractor was used, the number of features is smaller than the original number of features. To receive the original number of features (the number of features before applying a TEFeatureExtractor), use the method get_original_features of this class.

Usage

LargeDataSetForTextEmbeddings$get_features()

Returns

Returns an int describing the number of features/dimensions of the text embeddings.

Method `get_original_features()`

Number of original features/dimensions of the text embeddings.

Usage

LargeDataSetForTextEmbeddings$get_original_features()

Returns

Returns an int describing the number of features/dimensions if no TEFeatureExtractor is used or before a TEFeatureExtractor is applied.

Method `get_pad_value()`

Value for indicating padding.

Usage

LargeDataSetForTextEmbeddings$get_pad_value()

Returns

Returns an int describing the value used for padding.

Method `add_embeddings_from_array()`

Method for adding new data to the data set from an array. Please note that the method does not check if cases already exist in the data set. To reduce the data set to unique cases call the method reduce_to_unique_ids.

Usage

LargeDataSetForTextEmbeddings$add_embeddings_from_array(embedding_array)

Arguments

embedding_array: array containing the text embeddings.

Returns

The method does not return anything. It adds new data to the data set.
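Because the method does not check for existing cases, duplicates can arise when the same case is added twice. The base R sketch below illustrates the situation on plain vectors of case ids. It assumes that cases are identified by ids, and all names in it are illustrative; it does not call the package.

```r
# Ids of the cases that are already part of the data set (illustrative).
ids_existing <- c("doc_1", "doc_2")

# Ids of the cases in the array that is about to be added (illustrative).
ids_new <- c("doc_2", "doc_3")

# Adding without a check simply appends the new cases.
ids_after_adding <- c(ids_existing, ids_new)

# Detect whether at least one id now occurs more than once.
has_duplicates <- anyDuplicated(ids_after_adding) > 0

# Keeping only the first occurrence of every id yields unique cases.
ids_unique <- ids_after_adding[!duplicated(ids_after_adding)]

print(has_duplicates)
print(ids_unique)
```

Checking the ids before adding, or reducing the data set to unique cases afterwards, both avoid counting a case twice during training.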

Method `add_embeddings_from_EmbeddedText()`

Method for adding new data to the data set from an EmbeddedText. Please note that the method does not check if cases already exist in the data set. To reduce the data set to unique cases call the method reduce_to_unique_ids.

Usage

LargeDataSetForTextEmbeddings$add_embeddings_from_EmbeddedText(EmbeddedText)

Arguments

EmbeddedText: Object of class EmbeddedText.

Returns

The method does not return anything. It adds new data to the data set.

Method `add_embeddings_from_LargeDataSetForTextEmbeddings()`

Method for adding new data to the data set from a LargeDataSetForTextEmbeddings. Please note that the method does not check if cases already exist in the data set. To reduce the data set to unique cases call the method reduce_to_unique_ids.

Usage

LargeDataSetForTextEmbeddings$add_embeddings_from_LargeDataSetForTextEmbeddings(
  dataset
)

Arguments

dataset: Object of class LargeDataSetForTextEmbeddings.

Returns

The method does not return anything. It adds new data to the data set.

Method `convert_to_EmbeddedText()`

Method for converting this object to an object of class EmbeddedText.

Attention: This object uses memory mapping to allow the usage of data sets that do not fit into memory. Calling this method loads the complete data set and stores it in memory/RAM. This may lead to an out-of-memory error.

Usage

LargeDataSetForTextEmbeddings$convert_to_EmbeddedText()

Returns

Returns an object of class EmbeddedText which is stored in memory/RAM.
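Before converting, it can help to estimate how much memory the resulting object would need. The following base R sketch is a rough lower bound only: the function `estimate_memory_gib()` is hypothetical, it assumes a dense array of cases x chunks x features, and it assumes 8 bytes per value (the size of an R double). The number of chunks and features could be taken from get_times() and get_features(); the number of cases is an illustrative value here.

```r
# Hypothetical helper: rough memory demand of a dense numeric array in GiB.
# Assumes 8 bytes per value, which is the size of a double in R.
estimate_memory_gib <- function(n_cases, n_times, n_features, bytes_per_value = 8) {
  n_cases * n_times * n_features * bytes_per_value / 1024^3
}

# Illustrative numbers: one million cases, 4 chunks, 768 features.
required_gib <- estimate_memory_gib(
  n_cases = 1e6,
  n_times = 4,
  n_features = 768
)

# Roughly 22.9 GiB, before any overhead from copies or attributes.
print(round(required_gib, 1))
```

If the estimate is close to or above the available RAM, keeping the data in the memory-mapped LargeDataSetForTextEmbeddings is the safer choice.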

Method `clone()`

The objects of this class are cloneable with this method.

Usage

LargeDataSetForTextEmbeddings$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
