Base class for tokenizers containing all methods shared by the sub-classes.
See also
Other R6 Classes for Developers: AIFEBaseModel, AIFEMaster, BaseModelCore, ClassifiersBasedOnTextEmbeddings, DataManagerClassifier, LargeDataSetBase, ModelsBasedOnTextEmbeddings, TEClassifiersBasedOnProtoNet, TEClassifiersBasedOnRegular
Super class
aifeducation::AIFEMaster -> TokenizerBase
Methods
Inherited methods
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$get_sustainability_data()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEMaster$set_publication_info()
Method save()
Method for saving a model to disk.
Method load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
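The following sketch shows a possible save/load round trip. It assumes that tokenizer is an instance of a configured TokenizerBase subclass; the directory path and the argument names are illustrative assumptions, not taken from this page.

library(aifeducation)

# Hypothetical round trip; dir_path and folder_name are assumed argument
# names, used here only for illustration.
tokenizer$save(dir_path = "models", folder_name = "my_tokenizer")

# In a later session: restore the saved state into an existing object,
# which also updates it to the current package version.
tokenizer$load_from_disk(dir_path = "models/my_tokenizer")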
Method encode()
Method for encoding words of raw texts into integers.
Usage
TokenizerBase$encode(
raw_text,
token_overlap = 0L,
max_token_sequence_length = 512L,
n_chunks = 1L,
token_encodings_only = FALSE,
token_to_int = TRUE,
return_token_type_ids = TRUE,
trace = FALSE
)
Arguments
raw_text
vector
Raw text.
token_overlap
int
Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values: 0 <= x
max_token_sequence_length
int
Maximal number of tokens per chunk. Allowed values: 20 <= x
n_chunks
int
Maximal number of chunks. Allowed values: 1 <= x
token_encodings_only
bool
TRUE: Returns a list containing only the tokens.
FALSE: Returns a list containing a list for the tokens, the number of chunks, and the potential number of chunks for each document/text.
token_to_int
bool
TRUE: Returns the tokens as int indices.
FALSE: Returns the tokens as strings.
return_token_type_ids
bool
If TRUE, additionally returns the token type ids.
trace
bool
TRUE if information about the estimation phase should be printed to the console.
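A minimal sketch of an encode() call, using the argument names from the usage block above. tokenizer stands for a configured TokenizerBase subclass and texts for a character vector of raw documents; both are placeholders.

# Placeholder input; any character vector of raw texts works here.
texts <- c(
  "Tokenizers split raw text into smaller units.",
  "Long documents are cut into chunks of limited length."
)

encodings <- tokenizer$encode(
  raw_text = texts,
  token_overlap = 10L,              # carry 10 tokens into the next chunk
  max_token_sequence_length = 512L, # at most 512 tokens per chunk
  n_chunks = 2L,                    # keep at most 2 chunks per document
  token_encodings_only = TRUE,      # return only the token sequences
  token_to_int = TRUE               # integer ids instead of token strings
)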
Method decode()
Method for decoding a sequence of integers into tokens.
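A sketch of the inverse operation, continuing the encode() example above. Because this section does not show the method's signature, the call is positional and should be treated as an assumption.

# Map the integer ids produced by encode() back to tokens; the argument
# name is not documented here, so the value is passed positionally.
tokens <- tokenizer$decode(encodings)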
Method calculate_statistics()
Method for calculating tokenizer statistics as suggested by Kaya and Tantuğ (2024).
Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. <https://doi.org/10.1016/j.iswa.2024.200335>
Usage
TokenizerBase$calculate_statistics(
text_dataset,
statistics_max_tokens_length,
step = "creation"
)
Arguments
text_dataset
LargeDataSetForText
Object storing textual data.
statistics_max_tokens_length
int
Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192
step
string
Describes the context of the estimation, e.g. "creation".
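A hedged sketch of a statistics call using the argument names from the usage block. text_data is a placeholder for a LargeDataSetForText object holding the corpus.

# Illustrative call; text_data is assumed to be a LargeDataSetForText
# object that stores the raw texts.
stats <- tokenizer$calculate_statistics(
  text_dataset = text_data,
  statistics_max_tokens_length = 512L,  # must lie within [20, 8192]
  step = "creation"                     # context label for the estimation
)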