Tokenizer based on the WordPiece model (Wu et al. 2016).
References
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. <https://doi.org/10.48550/arXiv.1609.08144>
See also
Other tokenizers:
HuggingFaceTokenizer
Super classes
aifeducation::AIFEMaster
-> aifeducation::TokenizerBase
-> WordPieceTokenizer
Methods
Inherited methods
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$get_sustainability_data()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEMaster$set_publication_info()
aifeducation::TokenizerBase$calculate_statistics()
aifeducation::TokenizerBase$decode()
aifeducation::TokenizerBase$encode()
aifeducation::TokenizerBase$get_special_tokens()
aifeducation::TokenizerBase$get_tokenizer()
aifeducation::TokenizerBase$get_tokenizer_statistics()
aifeducation::TokenizerBase$load_from_disk()
aifeducation::TokenizerBase$n_special_tokens()
aifeducation::TokenizerBase$save()
Method configure()
Configures a new object of this class.
Method train()
Trains a new object of this class.
Usage
WordPieceTokenizer$train(
  text_dataset,
  statistics_max_tokens_length = 512L,
  sustain_track = FALSE,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 15L,
  sustain_log_level = "warning",
  trace = FALSE
)
Arguments
text_dataset
LargeDataSetForText
LargeDataSetForText object storing textual data.
statistics_max_tokens_length
int
Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192
sustain_track
bool
If TRUE, energy consumption is tracked during training via the Python library 'codecarbon'.
sustain_iso_code
string
ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region
string
Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html. Allowed values: any
sustain_interval
int
Interval in seconds for measuring power usage. Allowed values: 1 <= x
sustain_log_level
string
Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
trace
bool
TRUE if information about the estimation phase should be printed to the console.
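Example
The allowed values listed above can be checked before a training run is started. The following is a minimal base R sketch, not part of aifeducation: check_train_args() is a hypothetical helper that only mirrors the documented defaults and ranges, and the commented call at the end assumes that a configured WordPieceTokenizer (tokenizer) and a LargeDataSetForText (my_texts) already exist.

```r
# Hypothetical helper (not part of aifeducation): checks values against the
# ranges listed in the Arguments section and returns them as a named list.
check_train_args <- function(statistics_max_tokens_length = 512L,
                             sustain_track = FALSE,
                             sustain_iso_code = NULL,
                             sustain_region = NULL,
                             sustain_interval = 15L,
                             sustain_log_level = "warning",
                             trace = FALSE) {
  stopifnot(
    is.numeric(statistics_max_tokens_length),
    statistics_max_tokens_length >= 20,
    statistics_max_tokens_length <= 8192,
    is.logical(sustain_track),
    is.numeric(sustain_interval),
    sustain_interval >= 1,
    sustain_log_level %in% c("debug", "info", "warning", "error", "critical"),
    is.logical(trace)
  )
  # The documentation states that the ISO code must be set when tracking is on.
  if (isTRUE(sustain_track) && is.null(sustain_iso_code)) {
    stop("sustain_iso_code must be set if sustain_track is TRUE")
  }
  list(
    statistics_max_tokens_length = as.integer(statistics_max_tokens_length),
    sustain_track = sustain_track,
    sustain_iso_code = sustain_iso_code,
    sustain_region = sustain_region,
    sustain_interval = as.integer(sustain_interval),
    sustain_log_level = sustain_log_level,
    trace = trace
  )
}

# Validate a configuration with sustainability tracking enabled.
train_args <- check_train_args(sustain_track = TRUE, sustain_iso_code = "DEU")

# Assumption: `tokenizer` and `my_texts` are created elsewhere.
# do.call(tokenizer$train, c(list(text_dataset = my_texts), train_args))
```

Keeping the checks in a separate function means an invalid value fails immediately, before the tokenizer or the codecarbon tracker is started.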