Embedded text

R6 object that stores the text embeddings generated by an object of class TextEmbeddingModel. The text embeddings are held in memory (RAM), so the data may not fit if the number of documents is high. Use this object only for a small sample of texts. In general, an object of class LargeDataSetForTextEmbeddings is recommended, since it can deal with any number of texts.
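A rough size estimate can help to judge whether a set of embeddings is likely to fit into memory. The sketch below assumes that every value is stored as an 8-byte double (R's default numeric type) and ignores any object overhead. The helper name `estimate_embedding_size_gb` and the example sizes are hypothetical and not part of the package.

```r
# Rough in-memory size of text embeddings in GiB.
# Assumption: 8 bytes per value (double), object overhead ignored.
# Hypothetical helper, not part of the package.
estimate_embedding_size_gb <- function(n_documents, n_chunks, n_features) {
  bytes <- as.numeric(n_documents) * n_chunks * n_features * 8
  bytes / 1024^3
}

# Example sizes chosen for illustration only.
small_sample <- estimate_embedding_size_gb(1000, 4, 768)
large_corpus <- estimate_embedding_size_gb(5e6, 4, 768)
```

Under these assumptions the small sample stays far below 1 GiB, while the large corpus needs more than 100 GiB, which is the kind of case where LargeDataSetForTextEmbeddings is the better choice.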

Value

Returns an object of class EmbeddedText. These objects are used for storing and managing the text embeddings created with objects of class TextEmbeddingModel. Objects of class EmbeddedText serve as input for objects of class TEClassifierRegular, TEClassifierProtoNet, and TEFeatureExtractor. The main aim of this class is to provide a structured link between embedding models and classifiers. Since objects of this class save information on the text embedding model that created the text embeddings, they ensure that only embeddings generated with the same embedding model are combined. Furthermore, the stored information allows objects to check whether embeddings of the correct text embedding model are used for training and predicting.
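The idea of checking that two sets of embeddings come from the same model can be sketched with plain lists. This is only an illustration of the principle, not the package's internal check: the helper `same_embedding_model` is hypothetical, and the assumption that the model information is a list with fields named like the `model_*` arguments of `configure()` is not confirmed by this page.

```r
# Hypothetical helper: compares the model information of two embedding sets.
# Assumption: the information is a named list whose fields mirror the
# model_* arguments of configure().
same_embedding_model <- function(info_a, info_b,
                                 fields = c("model_name", "model_label", "model_method")) {
  all(vapply(
    fields,
    function(field) identical(info_a[[field]], info_b[[field]]),
    logical(1)
  ))
}

# Illustrative values only.
info_train <- list(model_name = "bert_a", model_label = "BERT A", model_method = "bert")
info_new <- list(model_name = "bert_a", model_label = "BERT A", model_method = "bert")
info_other <- list(model_name = "roberta_b", model_label = "RoBERTa B", model_method = "roberta")

match_same <- same_embedding_model(info_train, info_new)
match_other <- same_embedding_model(info_train, info_other)
```

In this sketch only the first comparison passes, so embeddings described by `info_other` would be rejected before being combined with the training data.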

Public fields

embeddings: ('data.frame()')
data.frame containing the text embeddings for all chunks. Documents are in the rows. Embedding dimensions are in the columns.

Methods

Public methods

EmbeddedText$configure()
EmbeddedText$save()
EmbeddedText$is_configured()
EmbeddedText$load_from_disk()
EmbeddedText$get_model_info()
EmbeddedText$get_model_label()
EmbeddedText$get_times()
EmbeddedText$get_features()
EmbeddedText$get_original_features()
EmbeddedText$is_compressed()
EmbeddedText$add_feature_extractor_info()
EmbeddedText$get_feature_extractor_info()
EmbeddedText$convert_to_LargeDataSetForTextEmbeddings()
EmbeddedText$n_rows()
EmbeddedText$get_all_fields()
EmbeddedText$clone()

Method `configure()`

Creates a new object representing text embeddings.

Usage

EmbeddedText$configure(
  model_name = NA,
  model_label = NA,
  model_date = NA,
  model_method = NA,
  model_version = NA,
  model_language = NA,
  param_seq_length = NA,
  param_chunks = NULL,
  param_features = NULL,
  param_overlap = NULL,
  param_emb_layer_min = NULL,
  param_emb_layer_max = NULL,
  param_emb_pool_type = NULL,
  param_aggregation = NULL,
  embeddings
)

Arguments

model_name: string Name of the model that generates this embedding.
model_label: string Label of the model that generates this embedding.
model_date: string Date when the embedding generating model was created.
model_method: string Method of the underlying embedding model.
model_version: string Version of the model that generated this embedding.
model_language: string Language of the model that generated this embedding.
param_seq_length: int Maximum number of tokens that the generating model processes per chunk.
param_chunks: int Maximum number of chunks which are supported by the generating model.
param_features: int Number of dimensions of the text embeddings.
param_overlap: int Number of tokens that were added at the beginning of the sequence for the next chunk by this model.
param_emb_layer_min: int or string determining the first layer to be included in the creation of embeddings.
param_emb_layer_max: int or string determining the last layer to be included in the creation of embeddings.
param_emb_pool_type: string determining the method for pooling the token embeddings within each layer.
param_aggregation: string Aggregation method of the hidden states. Deprecated. Only included for backward compatibility.
embeddings: data.frame containing the text embeddings.

Returns

Returns an object of class EmbeddedText which stores the text embeddings produced by an object of class TextEmbeddingModel.

Method `save()`

Saves a data set to disk.

Usage

EmbeddedText$save(dir_path, folder_name, create_dir = TRUE)

Arguments

dir_path: Path where to store the data set.
folder_name: string Name of the folder for storing the data set.
create_dir: bool If TRUE the directory will be created if it does not exist.

Returns

Method does not return anything. It writes the data set to disk.

Method `is_configured()`

Method for checking if the model was successfully configured. An object can only be used if this value is TRUE.

Usage

EmbeddedText$is_configured()

Returns

bool TRUE if the model is fully configured. FALSE if not.

Method `load_from_disk()`

Loads an object of class EmbeddedText from disk and updates the object to the current version of the package.

Usage

EmbeddedText$load_from_disk(dir_path)

Arguments

dir_path: Path where the data set is stored.

Returns

Method does not return anything. It loads an object from disk.

Method `get_model_info()`

Method for retrieving information about the model that generated this embedding.

Usage

EmbeddedText$get_model_info()

Returns

list containing all saved information about the underlying text embedding model.

Method `get_model_label()`

Method for retrieving the label of the model that generated this embedding.

Usage

EmbeddedText$get_model_label()

Returns

string Label of the corresponding text embedding model.

Method `get_times()`

Number of chunks/times of the text embeddings.

Usage

EmbeddedText$get_times()

Returns

Returns an int describing the number of chunks/times of the text embeddings.

Method `get_features()`

Number of actual features/dimensions of the text embeddings. If a feature extractor was used, the number of features is smaller than the original number of features. To receive the original number of features (the number of features before applying a feature extractor), use the method get_original_features of this class.

Usage

EmbeddedText$get_features()

Returns

Returns an int describing the number of features/dimensions of the text embeddings.

Method `get_original_features()`

Number of original features/dimensions of the text embeddings.

Usage

EmbeddedText$get_original_features()

Returns

Returns an int describing the number of features/dimensions if no feature extractor is used, or the number before a feature extractor is applied.

Method `is_compressed()`

Checks if the text embeddings were reduced by a feature extractor.

Usage

EmbeddedText$is_compressed()

Returns

Returns TRUE if the number of dimensions was reduced by a feature extractor. Otherwise it returns FALSE.
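The relationship between `get_features()`, `get_original_features()`, and `is_compressed()` can be summarised in a small helper. The function `compression_summary` below is hypothetical and takes plain values standing in for the results of the three methods, so it runs without the package.

```r
# Hypothetical helper, not part of the package.
# The arguments stand in for the results of get_features(),
# get_original_features(), and is_compressed().
compression_summary <- function(n_features, n_original_features, compressed) {
  if (!compressed) {
    return(sprintf("%d features (not compressed)", as.integer(n_features)))
  }
  sprintf(
    "%d of %d original features (%.0f%% of the original size)",
    as.integer(n_features),
    as.integer(n_original_features),
    100 * n_features / n_original_features
  )
}

# Illustrative values only.
summary_plain <- compression_summary(768, 768, FALSE)
summary_reduced <- compression_summary(128, 768, TRUE)
```

With an actual EmbeddedText object the three arguments would come from the corresponding methods. The numbers used here are made up.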

Method `add_feature_extractor_info()`

Method for setting information on the feature extractor that was used to reduce the number of dimensions of the text embeddings. This information should only be set if a feature extractor was applied.

Usage

EmbeddedText$add_feature_extractor_info(
  model_name,
  model_label = NA,
  features = NA,
  method = NA,
  noise_factor = NA,
  optimizer = NA
)

Arguments

model_name: string Name of the underlying TextEmbeddingModel.
model_label: string Label of the underlying TextEmbeddingModel.
features: int Number of dimensions (features) of the compressed text embeddings.
method: string Method that the TEFeatureExtractor applies for generating the compressed text embeddings.
noise_factor: double Noise factor of the TEFeatureExtractor.
optimizer: string Optimizer used while training the TEFeatureExtractor.

Returns

Method does not return anything. It sets information on a feature extractor.

Method `get_feature_extractor_info()`

Method for retrieving information on the feature extractor that was used to reduce the number of dimensions of the text embeddings.

Usage

EmbeddedText$get_feature_extractor_info()

Returns

Returns a list with information on the feature extractor. If no feature extractor was used it returns NULL.

Method `convert_to_LargeDataSetForTextEmbeddings()`

Method for converting this object to an object of class LargeDataSetForTextEmbeddings.

Usage

EmbeddedText$convert_to_LargeDataSetForTextEmbeddings()

Returns

Returns an object of class LargeDataSetForTextEmbeddings, which uses memory mapping and therefore makes it possible to work with large data sets.

Method `n_rows()`

Number of rows.

Usage

EmbeddedText$n_rows()

Returns

Returns the number of rows of the text embeddings, which represents the number of cases.

Method `get_all_fields()`

Return all fields.

Usage

EmbeddedText$get_all_fields()

Returns

Method returns a list containing all public and private fields of the object.

Method `clone()`

The objects of this class are cloneable with this method.

Usage

EmbeddedText$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
