Prepare texts for text embeddings with a bag-of-words approach.
Source: R/preparation.R
bow_pp_create_basic_text_rep.Rd
This function prepares raw texts for use with TextEmbeddingModel.
Usage
bow_pp_create_basic_text_rep(
  data,
  vocab_draft,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = FALSE,
  split_tags = FALSE,
  language_stopwords = "de",
  use_lemmata = FALSE,
  to_lower = FALSE,
  min_termfreq = NULL,
  min_docfreq = NULL,
  max_docfreq = NULL,
  window = 5,
  weights = 1/(1:5),
  trace = TRUE
)
Arguments
- data
vector containing the raw texts.
- vocab_draft
Object created with bow_pp_create_vocab_draft.
- remove_punct
bool TRUE if punctuation should be removed.
- remove_symbols
bool TRUE if symbols should be removed.
- remove_numbers
bool TRUE if numbers should be removed.
- remove_url
bool TRUE if URLs should be removed.
- remove_separators
bool TRUE if separators should be removed.
- split_hyphens
bool TRUE if hyphenated words should be split into several tokens.
- split_tags
bool TRUE if tags should be split.
- language_stopwords
string Abbreviation for the language for which stopwords should be removed.
- use_lemmata
bool TRUE if lemmas should be used instead of the original tokens.
- to_lower
bool TRUE if tokens or lemmas should be converted to lower case.
- min_termfreq
int Minimum frequency a token must have to be part of the vocabulary.
- min_docfreq
int Minimum number of documents in which a token must appear to be part of the vocabulary.
- max_docfreq
int Maximum number of documents in which a token may appear to be part of the vocabulary.
- window
int Size of the window for creating the feature co-occurrence matrix.
- weights
vector Weights for the corresponding window. The vector length must be equal to the window size.
- trace
bool TRUE if information about the progress should be printed to the console.
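The requirement that weights has one element per window position can be checked before calling the function. The following is a minimal sketch in base R that only reproduces the default values from the usage above. The interpretation that the first weight applies to the nearest neighbour and the last to the most distant one is an assumption borrowed from how quanteda::fcm treats weights; this page does not state it.

```r
# Default window size and weights from the usage section.
window <- 5
weights <- 1 / (1:window)  # 1, 1/2, 1/3, 1/4, 1/5

# The documentation requires one weight per position in the window.
stopifnot(length(weights) == window)

# Assumption: position 1 is the directly adjacent token, so weights
# shrink as the distance between two tokens grows.
print(round(weights, 3))
```

If window is changed, weights has to be rebuilt with the same length, for example 1 / (1:window) again.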
Value
Returns a list of class basic_text_rep with the following components.
dfm: Document-feature matrix. Rows correspond to the documents, columns to the tokens (or lemmas). Each cell holds the number of times the token occurs in the document.
fcm: Feature co-occurrence matrix.
information: list containing information about the used vocabulary. These are:
  n_sentence: Number of sentences
  n_document_segments: Number of document segments/raw texts
  n_token_init: Number of initial tokens
  n_token_final: Number of final tokens
  n_lemmata: Number of lemmas
configuration: list containing information on whether the vocabulary was created with lower case and whether it uses original tokens or lemmas.
language_model: list containing information about the applied language model. These are:
  model: the udpipe language model
  label: the label of the udpipe language model
  upos: the applied universal part-of-speech tags
  language: the language
  vocab: a data.frame with the original vocabulary
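To make the layout of the dfm component concrete, here is a small sketch in base R that builds a document-feature matrix by hand. The documents, the variable names and the construction are hypothetical and only illustrate the row/column layout described above; the function itself creates its matrices internally and may store them in a different (for example sparse) format.

```r
# Two toy documents, already tokenized (hypothetical data).
docs <- list(
  doc1 = c("school", "teacher", "school"),
  doc2 = c("teacher", "student")
)

# Vocabulary: all distinct tokens in alphabetical order.
vocab <- sort(unique(unlist(docs)))

# One row per document, one column per token; each cell counts how
# often the token occurs in the document.
dfm_toy <- t(vapply(
  docs,
  function(tokens) tabulate(match(tokens, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dfm_toy) <- vocab

print(dfm_toy)
```

Because the return value is a list, the corresponding component of a real result would be accessed as result$dfm, where result is a hypothetical name for the returned object.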
See also
Other Preparation:
bow_pp_create_vocab_draft()