Base class for tokenizers containing all methods shared by the sub-classes.
See also
Other R6 Classes for Developers: AIFEBaseModel, AIFEMaster, BaseModelCore, ClassifiersBasedOnTextEmbeddings, DataManagerClassifier, LargeDataSetBase, ModelsBasedOnTextEmbeddings, TEClassifiersBasedOnProtoNet, TEClassifiersBasedOnRegular
Super class
aifeducation::AIFEMaster -> TokenizerBase
Methods
Inherited methods
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$get_sustainability_data()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEMaster$set_publication_info()
Method save()
Method for saving a model to disk.
Method load_from_disk()
Loads an object from disk and updates the object to the current version of the package.
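The following sketch shows a possible save/load round trip. It assumes that tokenizer is an instance of a configured TokenizerBase subclass; the directory path and the argument names are illustrative assumptions, not taken from this page.

library(aifeducation)

# Hypothetical round trip; dir_path and folder_name are assumed argument
# names, used here only for illustration.
tokenizer$save(dir_path = "models", folder_name = "my_tokenizer")

# In a later session: restore the saved state into an existing object,
# which also updates it to the current package version.
tokenizer$load_from_disk(dir_path = "models/my_tokenizer")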
Method encode()
Method for encoding words of raw texts into integers.
Usage
TokenizerBase$encode(
raw_text,
token_overlap = 0L,
max_token_sequence_length = 512L,
n_chunks = 1L,
token_encodings_only = FALSE,
token_to_int = TRUE,
return_token_type_ids = TRUE,
trace = FALSE
)
Arguments
raw_text
vector
Raw text.
token_overlap
int
Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values: 0 <= x
max_token_sequence_length
int
Maximal number of tokens per chunk. Allowed values: 20 <= x
n_chunks
int
Maximal number of chunks. Allowed values: 1 <= x
token_encodings_only
bool
TRUE: Returns a list containing only the tokens.
FALSE: Returns a list containing a list for the tokens, the number of chunks, and the potential number of chunks for each document/text.
token_to_int
bool
TRUE: Returns the tokens as int indices.
FALSE: Returns the tokens as strings.
return_token_type_ids
bool
If TRUE, additionally returns the token type ids.
trace
bool
TRUE if information about the estimation phase should be printed to the console.
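A minimal sketch of an encode() call, using the argument names from the usage block above. tokenizer stands for a configured TokenizerBase subclass and texts for a character vector of raw documents; both are placeholders.

# Placeholder input; any character vector of raw texts works here.
texts <- c(
  "Tokenizers split raw text into smaller units.",
  "Long documents are cut into chunks of limited length."
)

encodings <- tokenizer$encode(
  raw_text = texts,
  token_overlap = 10L,              # carry 10 tokens into the next chunk
  max_token_sequence_length = 512L, # at most 512 tokens per chunk
  n_chunks = 2L,                    # keep at most 2 chunks per document
  token_encodings_only = TRUE,      # return only the token sequences
  token_to_int = TRUE               # integer ids instead of token strings
)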
Method decode()
Method for decoding a sequence of integers into tokens.
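A sketch of the inverse operation, continuing the encode() example above. Because this section does not show the method's signature, the call is positional and should be treated as an assumption.

# Map the integer ids produced by encode() back to tokens; the argument
# name is not documented here, so the value is passed positionally.
tokens <- tokenizer$decode(encodings)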
Method calculate_statistics()
Method for calculating tokenizer statistics as suggested by Kaya and Tantuğ (2024).
Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. <https://doi.org/10.1016/j.iswa.2024.200335>
Usage
TokenizerBase$calculate_statistics(
text_dataset,
statistics_max_tokens_length,
step = "creation"
)
Arguments
text_dataset
LargeDataSetForText
Object storing textual data.
statistics_max_tokens_length
int
Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192
step
string
Describes the context of the estimation, e.g. "creation".
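A hedged sketch of a statistics call using the argument names from the usage block. text_data is a placeholder for a LargeDataSetForText object holding the corpus.

# Illustrative call; text_data is assumed to be a LargeDataSetForText
# object that stores the raw texts.
stats <- tokenizer$calculate_statistics(
  text_dataset = text_data,
  statistics_max_tokens_length = 512L,  # must lie within [20, 8192]
  step = "creation"                     # context label for the estimation
)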