Function for creating a first draft of a vocabulary
This function creates a list of tokens tagged with the selected universal part-of-speech (UPOS) tags and provides the corresponding lemmas.
Source: R/preparation.R, bow_pp_create_vocab_draft.Rd
Usage
bow_pp_create_vocab_draft(
  path_language_model,
  data,
  upos = c("NOUN", "ADJ", "VERB"),
  label_language_model = NULL,
  language = NULL,
  chunk_size = 100,
  trace = TRUE
)
Arguments
- path_language_model
  string. Path to a udpipe language model that should be used for tagging and lemmatization.
- data
  vector containing the raw texts.
- upos
  vector containing the universal part-of-speech tags which should be used to build the vocabulary.
- label_language_model
  string. Label for the udpipe language model used.
- language
  string. Name of the language (e.g., English, German).
- chunk_size
  int. Number of raw texts which should be processed at once.
- trace
  bool. TRUE if information about the progress should be printed to the console.
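The arguments above can be combined as in the following minimal sketch. It is not taken from the package documentation and rests on several assumptions: that the function is exported by the aifeducation package in the installed version, that the udpipe package is installed, and that network access is available for `udpipe::udpipe_download_model()`. The example texts and the label "english-ewt" are made up for illustration. The call is skipped when any prerequisite is missing.

```r
# Two made-up raw texts used only for illustration.
example_texts <- c(
  "The students read two short stories.",
  "Good teachers explain difficult ideas clearly."
)

# Assumption: the function lives in the aifeducation package. The check
# guards against versions that do not export it.
have_pkgs <- requireNamespace("udpipe", quietly = TRUE) &&
  requireNamespace("aifeducation", quietly = TRUE) &&
  "bow_pp_create_vocab_draft" %in% getNamespaceExports("aifeducation")

if (have_pkgs) {
  # Download an English udpipe model into a temporary directory
  # (the directory is an arbitrary choice for this sketch).
  model_info <- udpipe::udpipe_download_model(
    language = "english",
    model_dir = tempdir()
  )

  # Continue only if the download did not report a failure.
  if (!isTRUE(model_info$download_failed)) {
    vocab_draft <- aifeducation::bow_pp_create_vocab_draft(
      path_language_model = model_info$file_model,
      data = example_texts,
      upos = c("NOUN", "ADJ", "VERB"),
      label_language_model = "english-ewt", # illustrative label
      language = "English",
      chunk_size = 100,
      trace = FALSE
    )

    # Show the first rows of the vocabulary table.
    print(head(vocab_draft$vocab))
  }
}
```

With only a handful of texts, `chunk_size = 100` means everything is processed in a single chunk; a smaller value may be worth trying for very long documents.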
Value
list with the following components:
- vocab
  data.frame containing the tokens, lemmas, tokens in lower case, and lemmas in lower case.
- ud_language_model
  udpipe language model that is used for tagging.
- label_language_model
  Label of the udpipe language model.
- language
  Language of the raw texts.
- upos
  Universal part-of-speech tags used.
- n_sentence
  int. Estimated number of sentences in the raw texts.
- n_token
  int. Estimated number of tokens in the raw texts.
- n_document_segments
  int. Estimated number of document segments/raw texts.
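To illustrate how the returned list might be used, the sketch below works on a hand-built stand-in rather than real output. The column names of `vocab` (`token`, `lemma`, `token_tolower`, `lemma_tolower`) are assumptions based on the description above and should be checked against the actual result. All values are invented, and the `ud_language_model` component is left out because it cannot be mocked meaningfully.

```r
# Hypothetical stand-in for the list returned by the function.
# Column names and all values are assumptions made for illustration.
vocab_draft <- list(
  vocab = data.frame(
    token = c("Students", "stories", "read"),
    lemma = c("student", "story", "read"),
    token_tolower = c("students", "stories", "read"),
    lemma_tolower = c("student", "story", "read"),
    stringsAsFactors = FALSE
  ),
  label_language_model = "english-ewt",
  language = "English",
  upos = c("NOUN", "ADJ", "VERB"),
  n_sentence = 1L,
  n_token = 6L,
  n_document_segments = 1L
)

# Unique lower-case lemmas, sorted alphabetically.
unique_lemmas <- sort(unique(vocab_draft$vocab$lemma_tolower))

# Average number of tokens per document segment.
tokens_per_segment <- vocab_draft$n_token / vocab_draft$n_document_segments

print(unique_lemmas)
print(tokens_per_segment)
```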
Note
The list of possible tags is available at https://universaldependencies.org/u/pos/index.html.
Many pre-trained models are available at https://ufal.mff.cuni.cz/udpipe/2/models.
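Because the `upos` argument expects tags from the Universal Dependencies tag set, it can help to check a selection before starting a long tagging run. The helper below is a hypothetical convenience function, not part of the package. It compares a selection against the 17 universal part-of-speech tags and stops on anything unknown.

```r
# The 17 universal part-of-speech tags defined by Universal Dependencies.
universal_pos_tags <- c(
  "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
  "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"
)

# Hypothetical helper: stops with an error if any requested tag is not a
# universal part-of-speech tag; otherwise returns the input invisibly.
check_upos <- function(upos) {
  unknown <- setdiff(upos, universal_pos_tags)
  if (length(unknown) > 0) {
    stop("Unknown UPOS tag(s): ", paste(unknown, collapse = ", "))
  }
  invisible(upos)
}

# The default selection of the function passes the check.
check_upos(c("NOUN", "ADJ", "VERB"))
```

The comparison is case-sensitive, so a lower-case tag such as "noun" is reported as unknown.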
See also
Other Preparation:
bow_pp_create_basic_text_rep()