Tokenizer based on the WordPiece model (Wu et al., 2016).
References
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. <https://doi.org/10.48550/arXiv.1609.08144>
See also
Other Tokenizers:
HuggingFaceTokenizer
Super classes
aifeducation::AIFEMaster -> aifeducation::TokenizerBase -> WordPieceTokenizer
Methods
Inherited methods
aifeducation::AIFEMaster$get_all_fields()
aifeducation::AIFEMaster$get_documentation_license()
aifeducation::AIFEMaster$get_ml_framework()
aifeducation::AIFEMaster$get_model_config()
aifeducation::AIFEMaster$get_model_description()
aifeducation::AIFEMaster$get_model_info()
aifeducation::AIFEMaster$get_model_license()
aifeducation::AIFEMaster$get_package_versions()
aifeducation::AIFEMaster$get_private()
aifeducation::AIFEMaster$get_publication_info()
aifeducation::AIFEMaster$get_sustainability_data()
aifeducation::AIFEMaster$is_configured()
aifeducation::AIFEMaster$is_trained()
aifeducation::AIFEMaster$set_documentation_license()
aifeducation::AIFEMaster$set_model_description()
aifeducation::AIFEMaster$set_model_license()
aifeducation::AIFEMaster$set_publication_info()
aifeducation::TokenizerBase$calculate_statistics()
aifeducation::TokenizerBase$decode()
aifeducation::TokenizerBase$encode()
aifeducation::TokenizerBase$get_special_tokens()
aifeducation::TokenizerBase$get_tokenizer()
aifeducation::TokenizerBase$get_tokenizer_statistics()
aifeducation::TokenizerBase$load_from_disk()
aifeducation::TokenizerBase$n_special_tokens()
aifeducation::TokenizerBase$save()
Method configure()
Configures a new object of this class.
Method train()
Trains a new object of this class on a text dataset.
Usage
WordPieceTokenizer$train(
text_dataset,
statistics_max_tokens_length = 512L,
sustain_track = FALSE,
sustain_iso_code = NULL,
sustain_region = NULL,
sustain_interval = 15L,
sustain_log_level = "warning",
trace = FALSE
)
Arguments
text_dataset
LargeDataSetForText. Object storing textual data.

statistics_max_tokens_length
int. Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192.

sustain_track
bool. If TRUE, energy consumption is tracked during training via the python library 'codecarbon'.

sustain_iso_code
string. ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any.

sustain_region
string. Region within a country. Only available for the USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html. Allowed values: any.

sustain_interval
int. Interval in seconds for measuring power usage. Allowed values: 1 <= x.

sustain_log_level
string. Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'.

trace
bool. TRUE if information about the estimation phase should be printed to the console.
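The following is a minimal sketch of a call to train(), based on the usage above. It assumes that a WordPieceTokenizer has already been created and configured, and that a LargeDataSetForText object holding the training texts is available; the names tokenizer and text_data are placeholders, not part of the package.

library(aifeducation)

# Placeholder objects (assumptions): a configured WordPieceTokenizer and a
# LargeDataSetForText containing the training corpus.
# tokenizer <- ...
# text_data <- ...

tokenizer$train(
  text_dataset = text_data,
  statistics_max_tokens_length = 512L,
  # Sustainability tracking requires the python library 'codecarbon' and an
  # Alpha-3 country code; "DEU" is used here only as an example value.
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_interval = 15L,
  sustain_log_level = "warning",
  # Print progress information during the estimation phase.
  trace = TRUE
)

If sustainability tracking is not needed, sustain_track can be left at its default FALSE, in which case the remaining sustain_* arguments can be omitted.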