Prepare texts for text embeddings with a bag-of-words approach.

This function prepares raw texts for use with TextEmbeddingModel.

Usage

bow_pp_create_basic_text_rep(
  data,
  vocab_draft,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = FALSE,
  split_tags = FALSE,
  language_stopwords = "de",
  use_lemmata = FALSE,
  to_lower = FALSE,
  min_termfreq = NULL,
  min_docfreq = NULL,
  max_docfreq = NULL,
  window = 5,
  weights = 1/(1:5),
  trace = TRUE
)

Arguments

data: vector containing the raw texts.
vocab_draft: Object created with bow_pp_create_vocab_draft.
remove_punct: bool TRUE if punctuation should be removed.
remove_symbols: bool TRUE if symbols should be removed.
remove_numbers: bool TRUE if numbers should be removed.
remove_url: bool TRUE if URLs should be removed.
remove_separators: bool TRUE if separators should be removed.
split_hyphens: bool TRUE if hyphenated words should be split into several tokens.
split_tags: bool TRUE if tags should be split.
language_stopwords: string Abbreviation for the language for which stopwords should be removed.
use_lemmata: bool TRUE if lemmas should be used instead of the original tokens.
to_lower: bool TRUE if tokens or lemmas should be converted to lower case.
min_termfreq: int Minimum frequency of a token to be part of the vocabulary.
min_docfreq: int Minimum number of documents in which a token must appear to be part of the vocabulary.
max_docfreq: int Maximum number of documents in which a token may appear to be part of the vocabulary.
window: int Size of the window for creating the feature-co-occurrence matrix.
weights: vector Weights for the positions within the window. The length of the vector must equal the window size.
trace: bool TRUE if information about the progress should be printed to console.
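
The relationship between window and weights can be illustrated in base R. The sketch below is a toy illustration only and not the package's implementation: the helper weighted_cooc is hypothetical, and it assumes that the d-th weight applies to pairs of tokens that are d positions apart.

```r
# Default arguments from the usage section.
window <- 5
weights <- 1 / (1:5)   # 1, 1/2, 1/3, 1/4, 1/5

# The weights vector needs one entry per distance inside the window.
stopifnot(length(weights) == window)

# Hypothetical helper: weighted co-occurrence counts for one tokenized text.
# Assumption: a pair of tokens d positions apart contributes weights[d].
weighted_cooc <- function(tokens, window, weights) {
  stopifnot(length(weights) == window)
  vocab <- sort(unique(tokens))
  m <- matrix(0, nrow = length(vocab), ncol = length(vocab),
              dimnames = list(vocab, vocab))
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j > n) break
      a <- tokens[i]
      b <- tokens[j]
      m[a, b] <- m[a, b] + weights[d]
      if (a != b) m[b, a] <- m[b, a] + weights[d]  # keep the matrix symmetric
    }
  }
  m
}

# Small example with a window of 2 and weights 1 and 1/2.
tokens <- c("text", "embedding", "model", "text")
cooc <- weighted_cooc(tokens, window = 2, weights = 1 / (1:2))
print(cooc)
```

With weights that decrease with distance, as in the default 1/(1:5), tokens that appear directly next to each other contribute more to the matrix than tokens at the edge of the window.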

Value

Returns a list of class basic_text_rep with the following components.

dfm: Document-Feature-Matrix. Rows correspond to the documents. Columns correspond to the tokens; each cell contains the number of times the token occurs in the document.
fcm: Feature-Co-Occurrence-Matrix.
information: list containing information about the used vocabulary. These are:
- n_sentence: Number of sentences
- n_document_segments: Number of document segments/raw texts
- n_token_init: Number of initial tokens
- n_token_final: Number of final tokens
- n_lemmata: Number of lemmas
configuration: list indicating whether the vocabulary was created in lower case and whether it uses original tokens or lemmas.
language_model: list containing information about the applied language model. These are:
- model: the udpipe language model
- label: the label of the udpipe language model
- upos: the applied universal part-of-speech tags
- language: the language
- vocab: a data.frame with the original vocabulary
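
The layout of a document-feature matrix can be shown with a small base R sketch. This is an illustration under assumptions: the documents and the name dfm_toy are made up, and the dfm component of the returned object may be stored in a different (for example sparse) class than the plain matrix used here.

```r
# Two already tokenized toy documents (made-up data).
docs <- list(
  doc1 = c("text", "embedding", "text"),
  doc2 = c("embedding", "model")
)

# Vocabulary: all distinct tokens across the documents.
vocab <- sort(unique(unlist(docs)))

# Count every vocabulary entry in every document.
counts <- vapply(docs, function(tokens) {
  as.integer(table(factor(tokens, levels = vocab)))
}, integer(length(vocab)))

# Rows correspond to documents, columns to tokens.
dfm_toy <- t(counts)
colnames(dfm_toy) <- vocab
print(dfm_toy)
```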
