Function for creating a first draft of a vocabulary

This function creates a list of tokens that carry specific universal part-of-speech (UPOS) tags and provides the corresponding lemmas.

Usage

bow_pp_create_vocab_draft(
  path_language_model,
  data,
  upos = c("NOUN", "ADJ", "VERB"),
  label_language_model = NULL,
  language = NULL,
  chunk_size = 100,
  trace = TRUE
)

Arguments

path_language_model

string Path to a udpipe language model that should be used for tagging and lemmatization.

data

vector containing the raw texts.

upos

vector containing the universal part-of-speech tags which should be used to build the vocabulary.

label_language_model

string Label for the udpipe language model used.

language

string Name of the language (e.g., English, German).

chunk_size

int Number of raw texts which should be processed at once.

trace

bool TRUE if information about the progress should be printed to the console.
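
A call with these arguments might look like the following sketch. It assumes that the function is provided by the aifeducation package (this page does not name the package, so treat that as an assumption), that the udpipe package is installed, and that a model can be downloaded. The example texts and the label are placeholders.

```r
# Sketch only: needs the udpipe package, network access for the model
# download, and a package version that exports bow_pp_create_vocab_draft()
# (assumed here to be aifeducation).
library(udpipe)
library(aifeducation)

# Placeholder raw texts; any character vector of texts can be used.
example_texts <- c(
  "The students read a short story about a clever fox.",
  "Teachers often prepare detailed lesson plans for each week."
)

# Download an English udpipe model into a temporary directory.
# The returned data.frame holds the path to the model in `file_model`.
model_info <- udpipe_download_model(language = "english", model_dir = tempdir())

vocab_draft <- bow_pp_create_vocab_draft(
  path_language_model = model_info$file_model,
  data = example_texts,
  upos = c("NOUN", "ADJ", "VERB"),
  label_language_model = "english-udpipe", # free-form label (placeholder)
  language = "english",
  chunk_size = 100,
  trace = FALSE
)
```

For larger corpora, a smaller chunk_size may reduce the memory needed per tagging step, at the cost of more iterations.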

Value

list with the following components.

  • vocab: data.frame containing the tokens, lemmas, tokens in lower case, and lemmas in lower case.

  • ud_language_model: udpipe language model that is used for tagging.

  • label_language_model: Label of the udpipe language model.

  • language: Language of the raw texts.

  • upos: Universal part-of-speech tags that were used.

  • n_sentence: int Estimated number of sentences in the raw texts.

  • n_token: int Estimated number of tokens in the raw texts.

  • n_document_segments: int Estimated number of document segments/raw texts.
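
To illustrate how the returned list can be inspected, the sketch below builds a stand-in object with the documented components. All values are invented for illustration, and the column names of vocab (token, lemma, token_tolower, lemma_tolower) are assumptions that may differ from the real output.

```r
# Stand-in object mirroring the documented components. The values and the
# column names of `vocab` are illustrative assumptions, not real output.
draft_example <- list(
  vocab = data.frame(
    token = c("Students", "stories", "clever"),
    lemma = c("student", "story", "clever"),
    token_tolower = c("students", "stories", "clever"),
    lemma_tolower = c("student", "story", "clever"),
    stringsAsFactors = FALSE
  ),
  ud_language_model = "placeholder-model.udpipe", # placeholder value
  label_language_model = "english-udpipe",        # placeholder value
  language = "english",
  upos = c("NOUN", "ADJ", "VERB"),
  n_sentence = 2L,
  n_token = 19L,
  n_document_segments = 2L
)

# Overview of the top-level components.
str(draft_example, max.level = 1)

# Number of distinct lower-cased lemmas in the draft vocabulary.
n_lemmas <- length(unique(draft_example$vocab$lemma_tolower))

# Average number of tokens per raw text, based on the estimated counts.
tokens_per_text <- draft_example$n_token / draft_example$n_document_segments
```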

Note

A list of possible tags can be found here: https://universaldependencies.org/u/pos/index.html.
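
The effect of the upos selection can be sketched in base R. The annotation table below is hand-made and only stands in for tagger output. The filtering shown is an illustration of the idea, not the function's actual implementation.

```r
# Hand-made annotation table standing in for tagger output (illustrative only).
annotation <- data.frame(
  token = c("The", "Students", "read", "long", "stories", "quickly"),
  lemma = c("the", "student", "read", "long", "story", "quickly"),
  upos = c("DET", "NOUN", "VERB", "ADJ", "NOUN", "ADV"),
  stringsAsFactors = FALSE
)

selected_upos <- c("NOUN", "ADJ", "VERB")

# Keep only tokens whose tag is part of the selection.
kept <- annotation[annotation$upos %in% selected_upos, c("token", "lemma")]

# Add lower-case variants and drop duplicate rows.
kept$token_tolower <- tolower(kept$token)
kept$lemma_tolower <- tolower(kept$lemma)
kept <- unique(kept)

print(kept)
```

With this selection, the determiner and the adverb are dropped, while nouns, adjectives, and verbs remain together with their lemmas.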

A large collection of models can be found here: https://ufal.mff.cuni.cz/udpipe/2/models.

See also

Other Preparation: bow_pp_create_basic_text_rep()