This function prepares raw texts for use with TextEmbeddingModel.

Usage

bow_pp_create_basic_text_rep(
  data,
  vocab_draft,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = FALSE,
  split_tags = FALSE,
  language_stopwords = "de",
  use_lemmata = FALSE,
  to_lower = FALSE,
  min_termfreq = NULL,
  min_docfreq = NULL,
  max_docfreq = NULL,
  window = 5,
  weights = 1/(1:5),
  trace = TRUE
)

Arguments

data

vector containing the raw texts.

vocab_draft

Object created with bow_pp_create_vocab_draft.

remove_punct

bool TRUE if punctuation should be removed.

remove_symbols

bool TRUE if symbols should be removed.

remove_numbers

bool TRUE if numbers should be removed.

remove_url

bool TRUE if URLs should be removed.

remove_separators

bool TRUE if separators should be removed.

split_hyphens

bool TRUE if hyphenated words should be split into several tokens.

split_tags

bool TRUE if tags should be split.

language_stopwords

string Abbreviation for the language for which stopwords should be removed.

use_lemmata

bool TRUE if lemmas should be used instead of the original tokens.

to_lower

bool TRUE if tokens or lemmas should be converted to lower case.

min_termfreq

int Minimum frequency of a token to be part of the vocabulary.

min_docfreq

int Minimum number of documents in which a token must appear to be part of the vocabulary.

max_docfreq

int Maximum number of documents in which a token may appear to be part of the vocabulary.

window

int Size of the window for creating the feature co-occurrence matrix.

weights

vector weights for the corresponding window. The vector length must be equal to the window size.

trace

bool TRUE if information about the progress should be printed to console.
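The frequency arguments can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy documents and the threshold values are hypothetical, and it assumes that min_termfreq, min_docfreq and max_docfreq are absolute counts, as the int type suggests.

```r
# Toy tokenised corpus (hypothetical data, not taken from the package).
docs <- list(
  d1 = c("haus", "garten", "haus"),
  d2 = c("haus", "auto"),
  d3 = c("haus", "garten", "baum")
)

vocab <- sort(unique(unlist(docs)))

# Count matrix: rows are documents, columns are tokens.
counts <- t(vapply(
  docs,
  function(tokens) as.integer(table(factor(tokens, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab

term_freq <- colSums(counts)      # total occurrences of each token
doc_freq  <- colSums(counts > 0)  # number of documents containing each token

# Hypothetical thresholds, interpreted as absolute counts (assumption).
min_termfreq <- 2
min_docfreq  <- 2
max_docfreq  <- 2

keep <- term_freq >= min_termfreq &
  doc_freq >= min_docfreq &
  doc_freq <= max_docfreq

# Only tokens that satisfy all three conditions stay in the vocabulary.
trimmed <- counts[, keep, drop = FALSE]
print(colnames(trimmed))
```

In this toy corpus only "garten" survives: "haus" occurs in more documents than max_docfreq allows, while "auto" and "baum" are too rare.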

Value

Returns a list of class basic_text_rep with the following components.

  • dfm: Document-feature matrix. Rows correspond to the documents and columns to the tokens. Each cell contains the number of times the token occurs in the document.

  • fcm: Feature co-occurrence matrix.

  • information: list containing information about the used vocabulary. These are:

    • n_sentence: Number of sentences

    • n_document_segments: Number of document segments/raw texts

    • n_token_init: Number of initial tokens

    • n_token_final: Number of final tokens

    • n_lemmata: Number of lemmas

  • configuration: list indicating whether the vocabulary was created in lower case and whether it uses original tokens or lemmas.

  • language_model: list containing information about the applied language model. These are:

    • model: the udpipe language model

    • label: the label of the udpipe language model

    • upos: the applied universal part-of-speech tags

    • language: the language

    • vocab: a data.frame with the original vocabulary
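To make the dfm and fcm components more concrete, the following base R sketch builds both matrices for two toy documents. It is only an illustration under stated assumptions: the data are hypothetical, pairs are counted symmetrically, and each pair is weighted by its distance within the window. The package may count or store co-occurrences differently.

```r
# Toy tokenised documents (hypothetical data).
docs <- list(
  d1 = c("a", "b", "c"),
  d2 = c("a", "c")
)
vocab <- sort(unique(unlist(docs)))

# Document-feature matrix: one row per document, one column per token,
# each cell holding the count of that token in that document.
dfm <- t(vapply(
  docs,
  function(tokens) as.integer(table(factor(tokens, levels = vocab))),
  integer(length(vocab))
))
colnames(dfm) <- vocab

# Weighted feature co-occurrence matrix.
window  <- 2
weights <- 1 / (1:window)  # same pattern as the default 1/(1:5)
stopifnot(length(weights) == window)  # one weight per distance

fcm <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
for (tokens in docs) {
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j > n) break
      # Add the pair in both directions, weighted by its distance d.
      fcm[tokens[i], tokens[j]] <- fcm[tokens[i], tokens[j]] + weights[d]
      fcm[tokens[j], tokens[i]] <- fcm[tokens[j], tokens[i]] + weights[d]
    }
  }
}

print(dfm)
print(fcm)
```

Here "a" and "c" are adjacent in d2 (weight 1) and two positions apart in d1 (weight 0.5), so their entry in the fcm is 1.5.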

See also

Other Preparation: bow_pp_create_vocab_draft()