Function for estimating the tokenizer statistics described by Kaya & Tantuğ (2024).
Arguments
- dataset
Object of class datasets.arrow_dataset.Dataset. The data set must contain a column
"length" containing the number of tokens for every sequence and a column "word_ids" containing the word ids within every sequence.
- step
string indicating to which step the statistics belong. Recommended values are:
  - "creation" for the creation of the tokenizer.
  - "initial_training" for the first training of the transformer.
  - "fine_tuning" for all following trainings of the transformer.
  - "training" for a training run of the transformer.
- statistics_max_tokens_length
int. Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192.
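The expected shape of the input data can be sketched as follows. This is an illustrative sketch, not the package's implementation: plain Python dicts stand in for rows of a datasets.arrow_dataset.Dataset, and the cap applied via statistics_max_tokens_length is modeled as a simple filter.

```python
# Illustrative rows: each sequence carries its token count in "length"
# and one word id per token in "word_ids".
rows = [
    {"length": 5, "word_ids": [0, 0, 1, 2, 2]},
    {"length": 3, "word_ids": [0, 1, 1]},
]

# Hypothetical value for statistics_max_tokens_length
# (allowed range: 20 <= x <= 8192).
STATISTICS_MAX_TOKENS_LENGTH = 8192

# Only sequences whose token count fits within the cap
# enter the statistics.
kept = [r for r in rows if r["length"] <= STATISTICS_MAX_TOKENS_LENGTH]

# Consistency check: every row carries one word id per token.
assert all(len(r["word_ids"]) == r["length"] for r in kept)
```

In an actual run, both columns would come from a fast tokenizer, which reports per-sequence token counts and word ids directly.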