Function for estimating the tokenizer statistics described by Kaya & Tantuğ (2024).
Arguments
- dataset: Object of class datasets.arrow_dataset.Dataset. The data set must contain a column "length" containing the number of tokens for every sequence and a column "word_ids" containing the word ids within every sequence.
- step: string indicating to which step the statistics belong. Recommended values are "creation" for the creation of the tokenizer, "initial_training" for the first training of the transformer, "fine_tuning" for all following trainings of the transformer, and "training" for a training run of the transformer.
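A minimal sketch of what the two required columns look like and how statistics of the kind described by Kaya & Tantuğ (2024) can be derived from them. Plain Python lists stand in for the datasets.arrow_dataset.Dataset object here, and the two statistics shown (mean sequence length and mean tokens per word, often called tokenizer "fertility") are illustrative assumptions, not the function's full output.

```python
# Stand-in for a datasets.arrow_dataset.Dataset with the required columns.
dataset = {
    # "length": number of tokens for every sequence
    "length": [5, 3],
    # "word_ids": word id of every token within every sequence
    "word_ids": [[0, 0, 1, 2, 2], [0, 1, 1]],
}

n_sequences = len(dataset["length"])
n_tokens = sum(dataset["length"])
# Count distinct word ids per sequence to get the number of words.
n_words = sum(len(set(ids)) for ids in dataset["word_ids"])

mean_length = n_tokens / n_sequences  # mean tokens per sequence
fertility = n_tokens / n_words        # mean tokens per word
```

With the toy data above, the first sequence splits 3 words into 5 tokens and the second splits 2 words into 3 tokens, so the fertility is 8/5 = 1.6.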