WordPieceTokenizer

Value

Returns a new object of this class.

References

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. <https://doi.org/10.48550/arXiv.1609.08144>

Super classes

aifeducation::AIFEMaster -> aifeducation::TokenizerBase -> WordPieceTokenizer

Methods

Method `configure()`

Configures a new object of this class.

Usage

WordPieceTokenizer$configure(vocab_size = 10000L, vocab_do_lower_case = FALSE)

Arguments

vocab_size: int Size of the vocabulary. Allowed values: $1000 <= x <= 500000$
vocab_do_lower_case: bool TRUE if all tokens should be lower case.

Returns

Returns nothing.
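A minimal sketch of how `configure()` might be called. It assumes that an object is created with `WordPieceTokenizer$new()` (the usual R6 constructor, not shown in this section) and that the Python backend of 'aifeducation' is already set up. The helper `vocab_size_ok()` is hypothetical and only mirrors the documented range for `vocab_size`.

```r
# Hypothetical helper mirroring the documented range 1000 <= x <= 500000.
vocab_size_ok <- function(x) {
  is.integer(x) && length(x) == 1L && x >= 1000L && x <= 500000L
}

vocab_size <- 30000L
stopifnot(vocab_size_ok(vocab_size))

# Assumption: the class is instantiated with $new() and the package is installed.
# Guarded so the sketch only touches the package in an interactive session.
if (interactive() && requireNamespace("aifeducation", quietly = TRUE)) {
  tokenizer <- aifeducation::WordPieceTokenizer$new()
  tokenizer$configure(
    vocab_size = vocab_size,
    vocab_do_lower_case = TRUE
  )
}
```

Passing the size with the `L` suffix keeps it an integer, which matches the `int` type stated for `vocab_size`.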

Method `train()`

Trains a new object of this class.

Usage

WordPieceTokenizer$train(
  text_dataset,
  statistics_max_tokens_length = 512L,
  sustain_track = FALSE,
  sustain_iso_code = NULL,
  sustain_region = NULL,
  sustain_interval = 15L,
  sustain_log_level = "warning",
  trace = FALSE
)

Arguments

text_dataset: LargeDataSetForText Object storing textual data.
statistics_max_tokens_length: int Maximum sequence length for calculating the statistics. Allowed values: $20 <= x <= 8192$
sustain_track: bool If TRUE, energy consumption is tracked during training via the Python library 'codecarbon'.
sustain_iso_code: string ISO code (Alpha-3-Code) for the country. This variable must be set if sustainability should be tracked. A list can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. Allowed values: any
sustain_region: string Region within a country. Only available for USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html. Allowed values: any
sustain_interval: int Interval in seconds for measuring power usage. Allowed values: $1 <= x$
sustain_log_level: string Level for printing information to the console. Allowed values: 'debug', 'info', 'warning', 'error', 'critical'
trace: bool TRUE if information about the estimation phase should be printed to the console.

Returns

Returns nothing.
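The following sketch shows one way to prepare and check the arguments of `train()` before starting a run. The object names `tokenizer` (a configured WordPieceTokenizer) and `text_data` (a LargeDataSetForText) are hypothetical and must exist beforehand. The checks only restate the allowed values listed above and are not part of the package.

```r
# Hypothetical settings; the names match the documented arguments of train().
train_args <- list(
  statistics_max_tokens_length = 512L,
  sustain_track = TRUE,
  sustain_iso_code = "DEU", # Alpha-3 code, required when sustain_track is TRUE
  sustain_interval = 15L,
  sustain_log_level = "warning",
  trace = TRUE
)

# Restate the documented constraints as plain checks.
stopifnot(
  train_args$statistics_max_tokens_length >= 20L,
  train_args$statistics_max_tokens_length <= 8192L,
  train_args$sustain_interval >= 1L,
  train_args$sustain_log_level %in%
    c("debug", "info", "warning", "error", "critical"),
  !train_args$sustain_track || !is.null(train_args$sustain_iso_code)
)

# Assumption: `tokenizer` and `text_data` were created earlier in the session.
if (interactive() && exists("tokenizer") && exists("text_data")) {
  do.call(
    tokenizer$train,
    c(list(text_dataset = text_data), train_args)
  )
}
```

Collecting the arguments in a list makes it easy to validate them once and to reuse the same settings across several training runs.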

Method `clone()`

The objects of this class are cloneable with this method.

Usage

WordPieceTokenizer$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
