Tokenizer based on the WordPiece model (Wu et al. 2016).

Value

Returns a new object of this class.

References

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. <https://doi.org/10.48550/arXiv.1609.08144>

See also

Other Tokenizer: HuggingFaceTokenizer

Super classes

aifeducation::AIFEMaster -> aifeducation::TokenizerBase -> WordPieceTokenizer

Methods

Method configure()

Configures a new object of this class.

Usage

WordPieceTokenizer$configure(vocab_size = 10000L, vocab_do_lower_case = FALSE)

Arguments

vocab_size

int Size of the vocabulary. Allowed values: 1000 <= x <= 500000

vocab_do_lower_case

bool TRUE if all tokens should be lower case.

Returns

Returns nothing.
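As a minimal sketch of how configure() might be called, the wrapper below checks the documented range for vocab_size before forwarding the arguments. The function name configure_wordpiece is hypothetical and not part of aifeducation. The commented call to WordPieceTokenizer$new() assumes the usual R6 constructor and an installed Python backend.

```r
# Hypothetical wrapper (not part of aifeducation): validates the documented
# range for vocab_size, then forwards both arguments to configure().
configure_wordpiece <- function(tokenizer,
                                vocab_size = 10000L,
                                vocab_do_lower_case = FALSE) {
  stopifnot(
    is.numeric(vocab_size), length(vocab_size) == 1,
    vocab_size >= 1000, vocab_size <= 500000,
    is.logical(vocab_do_lower_case), length(vocab_do_lower_case) == 1
  )
  tokenizer$configure(
    vocab_size = as.integer(vocab_size),
    vocab_do_lower_case = vocab_do_lower_case
  )
  # Return the tokenizer invisibly so calls can be chained
  invisible(tokenizer)
}

# Intended use (assumes aifeducation and its Python backend are available):
# tokenizer <- WordPieceTokenizer$new()
# configure_wordpiece(tokenizer, vocab_size = 30000L, vocab_do_lower_case = TRUE)
```

Checking the range up front gives an early, readable error instead of relying on whatever validation the class performs internally.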


Method train()

Trains a new object of this class.

Usage

WordPieceTokenizer$train(
  text_dataset,
  statistics_max_tokens_length = 512L,
  sustain_track = FALSE,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 15L,
  sustain_log_level = "warning",
  trace = FALSE
)

Arguments

text_dataset

LargeDataSetForText Object storing textual data.

statistics_max_tokens_length

int Maximum sequence length for calculating the statistics. Allowed values: 20 <= x <= 8192

sustain_track

bool If TRUE, energy consumption is tracked during training via the Python library 'codecarbon'.

sustain_iso_code

string ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any

sustain_region

string Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html. Allowed values: any

sustain_interval

int Interval in seconds for measuring power usage. Allowed values: 1 <= x

sustain_log_level

string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'

trace

bool TRUE if information about the estimation phase should be printed to the console.

Returns

Returns nothing.
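The constraints on the train() arguments can be sketched as a small pre-flight check. The helper build_train_args is hypothetical and not part of aifeducation. It only assembles and validates an argument list according to the ranges documented above, including the rule that sustain_iso_code must be set when tracking is enabled. The commented call at the end assumes an already configured tokenizer and an existing LargeDataSetForText object named my_dataset.

```r
# Hypothetical helper (not part of aifeducation): assembles the arguments for
# train() and checks the documented constraints before training starts.
build_train_args <- function(text_dataset,
                             statistics_max_tokens_length = 512L,
                             sustain_track = FALSE,
                             sustain_iso_code = NULL,
                             sustain_region = NULL,
                             sustain_interval = 15L,
                             sustain_log_level = "warning",
                             trace = FALSE) {
  log_levels <- c("debug", "info", "warning", "error", "critical")
  stopifnot(
    statistics_max_tokens_length >= 20,
    statistics_max_tokens_length <= 8192,
    sustain_interval >= 1,
    sustain_log_level %in% log_levels
  )
  # The ISO code is mandatory as soon as sustainability tracking is enabled
  if (isTRUE(sustain_track) && is.null(sustain_iso_code)) {
    stop("sustain_iso_code must be set when sustain_track is TRUE")
  }
  list(
    text_dataset = text_dataset,
    statistics_max_tokens_length = as.integer(statistics_max_tokens_length),
    sustain_track = sustain_track,
    sustain_iso_code = sustain_iso_code,
    sustain_region = sustain_region,
    sustain_interval = as.integer(sustain_interval),
    sustain_log_level = sustain_log_level,
    trace = trace
  )
}

# Intended use (assumes a configured tokenizer and a LargeDataSetForText
# object called my_dataset; both names are placeholders):
# args <- build_train_args(my_dataset, sustain_track = TRUE, sustain_iso_code = "DEU")
# do.call(tokenizer$train, args)
```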


Method clone()

The objects of this class are cloneable with this method.

Usage

WordPieceTokenizer$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.
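The effect of the deep argument follows general R6 semantics and can be illustrated with toy classes. Inner and Outer below are invented for this sketch and have nothing to do with the tokenizer's internals. The sketch assumes the R6 package is installed.

```r
library(R6)

# Toy classes for illustration only; they are not part of aifeducation.
Inner <- R6Class("Inner", public = list(value = 1))
Outer <- R6Class("Outer", public = list(
  inner = NULL,
  initialize = function() {
    self$inner <- Inner$new()
  }
))

original <- Outer$new()
shallow <- original$clone()             # deep = FALSE: nested R6 fields are shared
deep <- original$clone(deep = TRUE)     # nested R6 fields are copied as well

# Changing the nested object through the original...
original$inner$value <- 2

shallow$inner$value  # 2: the shallow clone sees the change
deep$inner$value     # 1: the deep clone kept its own copy
```

A deep clone is therefore the safer choice when the copy should be modified independently of the original object.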