Function for estimating the tokenizer statistics described by Kaya & Tantuğ (2024).

Usage

calc_tokenizer_statistics(dataset, step = "creation")

Arguments

dataset

Object of class datasets.arrow_dataset.Dataset. The data set must contain a column "length" holding the number of tokens in every sequence and a column "word_ids" holding the word IDs within every sequence.

step

String indicating the step to which the statistics belong. Recommended values are

  • "creation" for the creation of the tokenizer.

  • "initial_training" for the first training of the transformer.

  • "fine_tuning" for all following trainings of the transformer.

  • "training" for a training run of the transformer.

Value

Returns a list with the following entries:

  • n_sequences: Number of sequences

  • n_words: Number of words in the whole corpus

  • n_tokens: Number of tokens in the whole corpus

  • mu_t: n_tokens / n_sequences (mean number of tokens per sequence)

  • mu_w: n_words / n_sequences (mean number of words per sequence)

  • mu_g: n_tokens / n_words (mean number of tokens per word)
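The computation behind these entries can be sketched in Python from the two required columns. This is an illustrative sketch, not the package's implementation; the function name and the convention that None in "word_ids" marks special tokens are assumptions for the example.

```python
def tokenizer_statistics(lengths, word_ids):
    # lengths: tokens per sequence (the "length" column).
    # word_ids: word index of each token per sequence (the "word_ids" column);
    # None is assumed to mark special tokens without a word.
    n_sequences = len(lengths)
    n_tokens = sum(lengths)
    # Count distinct words in each sequence, ignoring special tokens.
    n_words = sum(len({w for w in ids if w is not None}) for ids in word_ids)
    return {
        "n_sequences": n_sequences,
        "n_words": n_words,
        "n_tokens": n_tokens,
        "mu_t": n_tokens / n_sequences,  # mean tokens per sequence
        "mu_w": n_words / n_sequences,   # mean words per sequence
        "mu_g": n_tokens / n_words,      # mean tokens per word
    }

# Two toy sequences: 5 and 4 tokens, each covering 2 words plus special tokens.
stats = tokenizer_statistics(
    lengths=[5, 4],
    word_ids=[[None, 0, 1, 1, None], [None, 0, 1, None]],
)
```

Here mu_g > 1 indicates that the tokenizer splits words into more than one token on average, the granularity effect studied by Kaya & Tantuğ (2024).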

References

Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. https://doi.org/10.1016/j.iswa.2024.200335