
This object stores raw texts. The data of this object is not held in memory directly. By using memory mapping, these objects can work with data sets that do not fit into memory (RAM).

Value

Returns a new object of this class.

Super class

aifeducation::LargeDataSetBase -> LargeDataSetForText

Methods

Inherited methods


Method new()

Method for creating a LargeDataSetForText instance. If init_data is passed, the instance is initialized with it (using the add_from_data.frame() method if init_data is a data.frame).

Usage

LargeDataSetForText$new(init_data = NULL)

Arguments

init_data

Initial data.frame for the data set. Defaults to NULL.

Returns

A new instance of this class initialized with init_data if passed.
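As a minimal sketch of initializing the object from a data.frame: the ids and texts below are hypothetical example data, and the call is guarded so that it only runs when the aifeducation package is installed. The package may need further setup before the call succeeds.

```r
# Hypothetical example data; "id" and "text" are the mandatory columns.
init_data <- data.frame(
  id = c("doc_1", "doc_2"),
  text = c("First raw text.", "Second raw text."),
  stringsAsFactors = FALSE
)

# Guarded so the sketch is skipped when the package is not available.
if (requireNamespace("aifeducation", quietly = TRUE)) {
  text_data <- aifeducation::LargeDataSetForText$new(init_data = init_data)
}
```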


Method add_from_files_txt()

Method for adding raw texts saved within .txt files to the data set. Please note that the directory should contain one folder for each .txt file. To create an informative data set, every folder can contain the following additional files:

  • bib_entry.txt: containing a text version of the bibliographic information of the raw text.

  • license.txt: containing a statement about the license to use the raw text such as "CC BY".

  • url_license.txt: containing the url/link to the license on the internet.

  • text_license.txt: containing the license in raw text.

  • url_source.txt: containing the url/link to the source on the internet.

The id of every .txt file is the file name without the file extension, so make sure that all file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.
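The folder layout described above can be sketched with base R as follows. The directory name, ids, texts, and metadata contents are hypothetical; the call to add_from_files_txt() is guarded and only runs when the aifeducation package is installed.

```r
# Build a temporary directory with one folder per .txt file.
dir_path <- file.path(tempdir(), "raw_texts")
docs <- list(
  doc_1 = "First raw text.",
  doc_2 = "Second raw text."
)

for (id in names(docs)) {
  doc_dir <- file.path(dir_path, id)
  dir.create(doc_dir, recursive = TRUE, showWarnings = FALSE)
  # The file name without its extension serves as the id of the text.
  writeLines(docs[[id]], file.path(doc_dir, paste0(id, ".txt")))
  # Optional metadata files (hypothetical contents).
  writeLines("CC BY", file.path(doc_dir, "license.txt"))
  writeLines("Hypothetical Author (2024). Example title.",
             file.path(doc_dir, "bib_entry.txt"))
}

# Inspect the resulting layout.
list.files(dir_path, recursive = TRUE)

# Guarded so the sketch is skipped when the package is not available.
if (requireNamespace("aifeducation", quietly = TRUE)) {
  text_data <- aifeducation::LargeDataSetForText$new()
  text_data$add_from_files_txt(dir_path = dir_path, trace = FALSE)
}
```

The same layout applies to add_from_files_pdf(), with a .pdf file in each folder instead of a .txt file.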

Usage

LargeDataSetForText$add_from_files_txt(
  dir_path,
  batch_size = 500,
  log_file = NULL,
  log_write_interval = 2,
  log_top_value = 0,
  log_top_total = 1,
  log_top_message = NA,
  trace = TRUE
)

Arguments

dir_path

Path to the directory where the files are stored.

batch_size

int determining the number of files to process at once.

log_file

string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.

log_write_interval

int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.

log_top_value

int indicating the current iteration of the process.

log_top_total

int determining the maximal number of iterations.

log_top_message

string providing additional information about the process.

trace

bool If TRUE information on the progress is printed to the console.

Returns

The method does not return anything. It adds new raw texts to the data set.


Method add_from_files_pdf()

Method for adding raw texts saved within .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. To create an informative data set, every folder can contain the following additional files:

  • bib_entry.txt: containing a text version of the bibliographic information of the raw text.

  • license.txt: containing a statement about the license to use the raw text such as "CC BY".

  • url_license.txt: containing the url/link to the license on the internet.

  • text_license.txt: containing the license in raw text.

  • url_source.txt: containing the url/link to the source on the internet.

The id of every .pdf file is the file name without the file extension, so make sure that all file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.

Usage

LargeDataSetForText$add_from_files_pdf(
  dir_path,
  batch_size = 500,
  log_file = NULL,
  log_write_interval = 2,
  log_top_value = 0,
  log_top_total = 1,
  log_top_message = NA,
  trace = TRUE
)

Arguments

dir_path

Path to the directory where the files are stored.

batch_size

int determining the number of files to process at once.

log_file

string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.

log_write_interval

int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.

log_top_value

int indicating the current iteration of the process.

log_top_total

int determining the maximal number of iterations.

log_top_message

string providing additional information about the process.

trace

bool If TRUE information on the progress is printed to the console.

Returns

The method does not return anything. It adds new raw texts to the data set.


Method add_from_files_xlsx()

Method for adding raw texts saved within .xlsx files to the data set. The method assumes that each row holds one text and that the columns store the id and the raw text. In addition, columns for the bibliographic information, the license, the license's url, the license's text, and the source's url can be added. The names of these columns must be specified with the following arguments and must be the same for all .xlsx files in the chosen directory. Id and raw texts are mandatory; all other columns are optional. Additional columns are dropped.

Usage

LargeDataSetForText$add_from_files_xlsx(
  dir_path,
  trace = TRUE,
  id_column = "id",
  text_column = "text",
  bib_entry_column = "bib_entry",
  license_column = "license",
  url_license_column = "url_license",
  text_license_column = "text_license",
  url_source_column = "url_source",
  log_file = NULL,
  log_write_interval = 2,
  log_top_value = 0,
  log_top_total = 1,
  log_top_message = NA
)

Arguments

dir_path

Path to the directory where the files are stored.

trace

bool If TRUE prints information on the progress to the console.

id_column

string Name of the column storing the ids for the texts.

text_column

string Name of the column storing the raw text.

bib_entry_column

string Name of the column storing the bibliographic information of the texts.

license_column

string Name of the column storing information about the licenses.

url_license_column

string Name of the column storing the url to the license on the internet.

text_license_column

string Name of the column storing the license as text.

url_source_column

string Name of the column storing the url to the source on the internet.

log_file

string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.

log_write_interval

int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.

log_top_value

int indicating the current iteration of the process.

log_top_total

int determining the maximal number of iterations.

log_top_message

string providing additional information about the process.

Returns

The method does not return anything. It adds new raw texts to the data set.
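A rough sketch of mapping custom column names: the sheet contents, the column names doc_id and content, and the file name are hypothetical, and writing the file with the writexl package is an assumption made for illustration only. The calls are guarded so that they run only when both packages are installed.

```r
# Hypothetical sheet layout: one row per text, with custom column names.
sheet <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  content = c("First raw text.", "Second raw text."),
  license = c("CC BY", "CC BY-SA"),
  stringsAsFactors = FALSE
)

dir_path <- file.path(tempdir(), "xlsx_texts")
dir.create(dir_path, recursive = TRUE, showWarnings = FALSE)

# Guarded so the sketch is skipped when either package is not available.
if (requireNamespace("writexl", quietly = TRUE) &&
    requireNamespace("aifeducation", quietly = TRUE)) {
  writexl::write_xlsx(sheet, file.path(dir_path, "texts.xlsx"))
  text_data <- aifeducation::LargeDataSetForText$new()
  # Tell the method which columns hold the ids and the raw texts.
  text_data$add_from_files_xlsx(
    dir_path = dir_path,
    id_column = "doc_id",
    text_column = "content",
    trace = FALSE
  )
}
```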


Method add_from_data.frame()

Method for adding raw texts from a data.frame.

Usage

LargeDataSetForText$add_from_data.frame(data_frame)

Arguments

data_frame

Object of class data.frame with the columns "id", "text", "bib_entry", "license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing, an error occurs. If any of the other columns are not present in the data.frame, they are added with empty values (NA). Additional columns are dropped.

Returns

The method does not return anything. It adds new raw texts to the data set.
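The column handling described for data_frame can be illustrated with a small base R helper. This is only a sketch of the documented behavior, not the package's implementation; normalize_text_frame() and the example data are hypothetical.

```r
# Hypothetical helper that mirrors the documented column rules.
normalize_text_frame <- function(data_frame) {
  required <- c("id", "text")
  optional <- c("bib_entry", "license", "url_license",
                "text_license", "url_source")

  # "id" and "text" are mandatory.
  missing_required <- setdiff(required, colnames(data_frame))
  if (length(missing_required) > 0) {
    stop("Missing mandatory column(s): ",
         paste(missing_required, collapse = ", "))
  }

  # Absent optional columns are added and filled with NA.
  for (column in setdiff(optional, colnames(data_frame))) {
    data_frame[[column]] <- NA_character_
  }

  # Keeping only the known columns drops any additional ones.
  data_frame[, c(required, optional), drop = FALSE]
}

example <- data.frame(
  id = c("doc_1", "doc_2"),
  text = c("First raw text.", "Second raw text."),
  license = c("CC BY", NA),
  notes = c("dropped", "dropped"),
  stringsAsFactors = FALSE
)

normalized <- normalize_text_frame(example)
colnames(normalized)
```

A data.frame prepared this way has the shape that add_from_data.frame() expects, although passing only "id" and "text" should be sufficient according to the argument description.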


Method get_private()

Method for requesting all private fields and methods. Used for loading and updating an object.

Usage

LargeDataSetForText$get_private()

Returns

Returns a list with all private fields and methods.


Method clone()

The objects of this class are cloneable with this method.

Usage

LargeDataSetForText$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.