Function for estimating the tokenizer statistics described by Kaya & Tantuğ (2024).

Usage

calc_tokenizer_statistics(dataset, step = "creation")

Arguments

dataset

Object of class datasets.arrow_dataset.Dataset. The data set must contain a column "length" holding the number of tokens in every sequence and a column "word_ids" holding the word IDs within every sequence.

step

String indicating the step to which the statistics belong. Recommended values are

  • "creation" for the creation of the tokenizer.

  • "initial_training" for the first training of the transformer.

  • "fine_tuning" for all following trainings of the transformer.

  • "training" for a training run of the transformer.

Value

Returns a list with the following entries:

  • n_sequences: Number of sequences

  • n_words: Number of words in the whole corpus

  • n_tokens: Number of tokens in the whole corpus

  • mu_t: n_tokens / n_sequences (mean number of tokens per sequence)

  • mu_w: n_words / n_sequences (mean number of words per sequence)

  • mu_g: n_tokens / n_words (mean number of tokens per word)
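The computation behind these entries can be sketched in Python from the two required columns. This is an illustrative sketch, not the package's implementation; the function name and the convention that None in "word_ids" marks special tokens are assumptions for the example.

```python
def tokenizer_statistics(lengths, word_ids):
    # lengths: tokens per sequence (the "length" column).
    # word_ids: word index of each token per sequence (the "word_ids" column);
    # None is assumed to mark special tokens without a word.
    n_sequences = len(lengths)
    n_tokens = sum(lengths)
    # Count distinct words in each sequence, ignoring special tokens.
    n_words = sum(len({w for w in ids if w is not None}) for ids in word_ids)
    return {
        "n_sequences": n_sequences,
        "n_words": n_words,
        "n_tokens": n_tokens,
        "mu_t": n_tokens / n_sequences,  # mean tokens per sequence
        "mu_w": n_words / n_sequences,   # mean words per sequence
        "mu_g": n_tokens / n_words,      # mean tokens per word
    }

# Two toy sequences: 5 and 4 tokens, each covering 2 words plus special tokens.
stats = tokenizer_statistics(
    lengths=[5, 4],
    word_ids=[[None, 0, 1, 1, None], [None, 0, 1, None]],
)
```

Here mu_g > 1 indicates that the tokenizer splits words into more than one token on average, the granularity effect studied by Kaya & Tantuğ (2024).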

References

Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. https://doi.org/10.1016/j.iswa.2024.200335