Base class for tokenizers containing all methods shared by the sub-classes.

Value

Returns a new object of this class.

Returns a data.frame containing the estimates.

Super class

aifeducation::AIFEMaster -> TokenizerBase

Methods

Inherited methods


Method save()

Method for saving a model to disk.

Usage

TokenizerBase$save(dir_path, folder_name)

Arguments

dir_path

Path to the directory where the object should be saved.

folder_name

string Name of the folder where the model should be saved. Allowed values: any

Returns

Function does not return anything. It is used to save an object to disk.
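
For illustration, a minimal sketch of a call (not taken from the package reference); tok is assumed to be an object of a class inheriting from TokenizerBase, and the paths are placeholders:

# Save the tokenizer into the folder "my_tokenizer" inside "models"
tok$save(
  dir_path = "models",
  folder_name = "my_tokenizer"
)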


Method load_from_disk()

Loads an object from disk and updates the object to the current version of the package.

Usage

TokenizerBase$load_from_disk(dir_path)

Arguments

dir_path

Path to the directory where the object is stored.

Returns

Function does not return anything. It loads an object from disk.
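
A sketch of restoring a previously saved object; the path is a placeholder and assumes the object was saved with save() as shown above:

# Load the saved object and update it to the current package version
tok$load_from_disk(dir_path = "models/my_tokenizer")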


Method get_tokenizer_statistics()

Tokenizer statistics

Usage

TokenizerBase$get_tokenizer_statistics()

Returns

Returns a data.frame containing the tokenizer's statistics.
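
A short usage sketch; tok is an assumed instance of a subclass of TokenizerBase:

# Request the statistics gathered for the tokenizer as a data.frame
stats <- tok$get_tokenizer_statistics()
head(stats)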


Method get_tokenizer()

Python tokenizer

Usage

TokenizerBase$get_tokenizer()

Returns

Returns the Python tokenizer within the model.
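
A usage sketch (tok is an assumed instance of a subclass of TokenizerBase):

# Access the underlying Python tokenizer object wrapped by the R object,
# e.g. for inspecting it directly
py_tokenizer <- tok$get_tokenizer()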


Method encode()

Method for encoding words of raw texts into integers.

Usage

TokenizerBase$encode(
  raw_text,
  token_overlap = 0L,
  max_token_sequence_length = 512L,
  n_chunks = 1L,
  token_encodings_only = FALSE,
  token_to_int = TRUE,
  return_token_type_ids = TRUE,
  trace = FALSE
)

Arguments

raw_text

vector Raw text.

token_overlap

int Number of tokens from the previous chunk that should be added at the beginning of the next chunk. Allowed values: 0 <= x

max_token_sequence_length

int Maximal number of tokens per chunk. Allowed values: 20 <= x

n_chunks

int Maximal number of chunks. Allowed values: 1 <= x

token_encodings_only

bool

  • TRUE: Returns a list containing only the tokens.

  • FALSE: Returns a list containing a list of the tokens, the number of chunks, and the potential number of chunks for each document/text.

token_to_int

bool

  • TRUE: Returns the tokens as integer indices.

  • FALSE: Returns the tokens as strings.

return_token_type_ids

bool If TRUE, the token type IDs are additionally returned.

trace

bool TRUE if information about the estimation phase should be printed to the console.

Returns

list containing the integer or token sequences of the raw texts with special tokens.
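
A sketch of encoding two short texts (tok is an assumed instance of a subclass of TokenizerBase; the texts and parameter values are placeholders):

# Encode two documents into integer sequences of at most 512 tokens,
# splitting each document into up to 2 chunks that overlap by 10 tokens
encodings <- tok$encode(
  raw_text = c("First example document.", "Second example document."),
  token_overlap = 10L,
  max_token_sequence_length = 512L,
  n_chunks = 2L,
  token_encodings_only = TRUE,
  token_to_int = TRUE,
  return_token_type_ids = FALSE,
  trace = FALSE
)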


Method decode()

Method for decoding a sequence of integers into tokens or plain text.

Usage

TokenizerBase$decode(int_seqence, to_token = FALSE)

Arguments

int_seqence

list List of integer sequences that should be converted to tokens.

to_token

bool

  • FALSE: Transforms the integers to plain text.

  • TRUE: Transforms the integers to a sequence of tokens.

Returns

list of token sequences
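
A sketch of reversing an encoding; encodings is the assumed result of the encode() call sketched above:

# Back to plain text
texts <- tok$decode(int_seqence = encodings, to_token = FALSE)

# Back to sequences of tokens instead
tokens <- tok$decode(int_seqence = encodings, to_token = TRUE)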


Method get_special_tokens()

Method for retrieving the special tokens of the model.

Usage

TokenizerBase$get_special_tokens()

Returns

Returns a matrix containing the special tokens in the rows and their type, token, and id in the columns.
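
A usage sketch (tok is an assumed instance of a subclass of TokenizerBase):

# Matrix with one row per special token and the columns type, token, and id
special_tokens <- tok$get_special_tokens()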


Method n_special_tokens()

Method for retrieving the number of special tokens of the model.

Usage

TokenizerBase$n_special_tokens()

Returns

Returns an 'int' counting the number of special tokens.
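
A usage sketch (tok as above):

# Number of special tokens defined for the tokenizer
n_special <- tok$n_special_tokens()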


Method calculate_statistics()

Method for calculating tokenizer statistics as suggested by Kaya and Tantuğ (2024).

Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. <https://doi.org/10.1016/j.iswa.2024.200335>

Usage

TokenizerBase$calculate_statistics(
  text_dataset,
  statistics_max_tokens_length,
  step = "creation"
)

Arguments

text_dataset

LargeDataSetForText Object storing the textual data.

statistics_max_tokens_length

int Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192

step

string describing the context of the estimation.

Returns

Returns a data.frame containing the estimated statistics.
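
A sketch of a call; tok is an assumed instance of a subclass of TokenizerBase and text_data an assumed, already created LargeDataSetForText object:

# Calculate the statistics proposed by Kaya and Tantuğ (2024)
# for sequences of at most 512 tokens
tok$calculate_statistics(
  text_dataset = text_data,
  statistics_max_tokens_length = 512L,
  step = "creation"
)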


Method clone()

The objects of this class are cloneable with this method.

Usage

TokenizerBase$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.
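
A usage sketch (tok as above):

# Create an independent copy of the tokenizer object
tok_copy <- tok$clone(deep = TRUE)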