Base class for tokenizers containing all methods shared by the sub-classes.
See also
Other R6 Classes for Developers:
AIFEBaseModel,
AIFEMaster,
BaseModelCore,
ClassifiersBasedOnTextEmbeddings,
DataManagerClassifier,
LargeDataSetBase,
ModelsBasedOnTextEmbeddings,
TEClassifiersBasedOnProtoNet,
TEClassifiersBasedOnRegular
Super class
aifeducation::AIFEMaster -> TokenizerBase
Methods
Inherited methods
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$get_sustainability_data()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEMaster$set_publication_info()
Method save()
Method for saving a model on disk.
Method load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
Method encode()
Method for encoding words of raw texts into integers.
Usage
TokenizerBase$encode(
raw_text,
token_overlap = 0L,
max_token_sequence_length = 512L,
n_chunks = 1L,
token_encodings_only = FALSE,
token_to_int = TRUE,
return_token_type_ids = TRUE,
trace = FALSE
)

Arguments

raw_text: vector. Raw text.
token_overlap: int. Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values: 0 <= x.
max_token_sequence_length: int. Maximal number of tokens per chunk. Allowed values: 20 <= x.
n_chunks: int. Maximal number of chunks. Allowed values: 1 <= x.
token_encodings_only: bool. TRUE: returns a list containing only the tokens. FALSE: returns a list containing the tokens, the number of chunks, and the potential number of chunks for each document/text.
token_to_int: bool. TRUE: returns the tokens as int indices. FALSE: returns the tokens as strings.
return_token_type_ids: bool. If TRUE, additionally returns the token type ids.
trace: bool. TRUE if information about the estimation phase should be printed to the console.
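The call below is a minimal sketch of encoding two documents, assuming a tokenizer object `tokenizer` (an instance of a TokenizerBase subclass) has already been created and configured; the argument names and defaults are taken from the signature documented above.

```r
# Sketch: encoding raw texts into integer token chunks. The object
# `tokenizer` is assumed to exist; argument names follow the
# documented signature of TokenizerBase$encode().
encodings <- tokenizer$encode(
  raw_text = c("The first document.", "A second, longer document."),
  token_overlap = 0L,               # no tokens shared between chunks
  max_token_sequence_length = 512L, # maximal tokens per chunk
  n_chunks = 1L,                    # keep only the first chunk
  token_encodings_only = TRUE,      # return just the token lists
  token_to_int = TRUE,              # integer ids instead of strings
  return_token_type_ids = FALSE,
  trace = FALSE
)
```

With `token_encodings_only = FALSE`, the returned list additionally reports the number of chunks and the potential number of chunks per document, which is useful for checking whether texts were truncated.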
Method decode()
Method for decoding a sequence of integers into tokens.
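A minimal round-trip sketch: since the signature of decode() is not shown here, the positional call below is an assumption, passing it the integer sequences produced by encode().

```r
# Sketch: round trip through encode()/decode(). `tokenizer` is assumed
# to exist; the positional decode() call is an assumption, as its
# arguments are not documented above.
ids <- tokenizer$encode(
  raw_text = "A short example sentence.",
  token_encodings_only = TRUE,
  token_to_int = TRUE
)
tokens <- tokenizer$decode(ids)
```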
Method calculate_statistics()
Method for calculating tokenizer statistics as suggested by Kaya and Tantuğ (2024).
Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. <https://doi.org/10.1016/j.iswa.2024.200335>
Usage
TokenizerBase$calculate_statistics(
text_dataset,
statistics_max_tokens_length,
step = "creation"
)

Arguments

text_dataset: LargeDataSetForText. Object storing textual data.
statistics_max_tokens_length: int. Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192.
step: string. String describing the context of the estimation.
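The call below is a sketch of computing tokenizer statistics, assuming both `tokenizer` and a LargeDataSetForText object `text_data` already exist; the argument names follow the signature documented above.

```r
# Sketch: calculating tokenizer statistics on a text data set.
# `tokenizer` and `text_data` (a LargeDataSetForText) are assumed
# to exist; argument names follow the documented signature.
stats <- tokenizer$calculate_statistics(
  text_dataset = text_data,
  statistics_max_tokens_length = 512L,
  step = "creation"
)
```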