Abstract class for large data sets containing raw texts
Source: R/obj_LargeDataSetForTexts.R
LargeDataSetForText.Rd
This object stores raw texts. The data of this object is not stored in memory directly. By using memory mapping, these objects make it possible to work with data sets that do not fit into memory (RAM).
See also
Other Data Management:
EmbeddedText,
LargeDataSetForTextEmbeddings
Super class
aifeducation::LargeDataSetBase -> LargeDataSetForText
Methods
Inherited methods
aifeducation::LargeDataSetBase$get_all_fields()
aifeducation::LargeDataSetBase$get_colnames()
aifeducation::LargeDataSetBase$get_dataset()
aifeducation::LargeDataSetBase$get_ids()
aifeducation::LargeDataSetBase$get_package_versions()
aifeducation::LargeDataSetBase$load()
aifeducation::LargeDataSetBase$load_from_disk()
aifeducation::LargeDataSetBase$n_cols()
aifeducation::LargeDataSetBase$n_rows()
aifeducation::LargeDataSetBase$reduce_to_unique_ids()
aifeducation::LargeDataSetBase$save()
aifeducation::LargeDataSetBase$select()
aifeducation::LargeDataSetBase$set_package_versions()
Method new()
Method for creating a LargeDataSetForText instance. If init_data is passed, the instance is initialized with it (the add_from_data.frame() method is used if init_data is a data.frame).
Usage
LargeDataSetForText$new(init_data = NULL)
Method add_from_files_txt()
Method for adding raw texts saved within .txt files to the data set. Please note that the directory should contain one folder for each .txt file. In order to create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text, such as "CC BY".
url_license.txt: containing the URL/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the URL/link to the source on the internet.
The id of every .txt file is the file name without the file extension. Please make sure that the file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.
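The folder layout described above can be sketched with base R as follows. The root directory, the folder name `example_doc`, and the file contents are placeholders chosen for illustration. The final call is disabled by the hypothetical flag `run_aifeducation`, because it only works once aifeducation and its dependencies are set up.

```r
# Hypothetical root directory; only its path is passed to the method.
root <- file.path(tempdir(), "raw_texts")

# One folder per document. The .txt file name without extension is the id.
doc_dir <- file.path(root, "example_doc")
dir.create(doc_dir, recursive = TRUE, showWarnings = FALSE)
writeLines("This is the raw text of the document.",
           file.path(doc_dir, "example_doc.txt"))

# Optional metadata files, named as listed above (placeholder contents).
writeLines("Example Author (2020). Example title.",
           file.path(doc_dir, "bib_entry.txt"))
writeLines("CC BY", file.path(doc_dir, "license.txt"))
writeLines("https://creativecommons.org/licenses/by/4.0/",
           file.path(doc_dir, "url_license.txt"))

# Inspect the resulting layout (paths relative to the root directory).
layout <- sort(list.files(root, recursive = TRUE))
print(layout)

# Assumption: set this flag to TRUE only in a configured environment.
run_aifeducation <- FALSE
if (run_aifeducation && requireNamespace("aifeducation", quietly = TRUE)) {
  text_data <- aifeducation::LargeDataSetForText$new()
  text_data$add_from_files_txt(dir_path = root)
}
```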
Usage
LargeDataSetForText$add_from_files_txt(
  dir_path,
  batch_size = 500L,
  log_file = NULL,
  log_write_interval = 2L,
  log_top_value = 0L,
  log_top_total = 1L,
  log_top_message = NA,
  clean_text = TRUE,
  trace = TRUE
)
Arguments
dir_path: Path to the directory where the files are stored.
batch_size: int determining the number of files to process at once.
log_file: string. Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval: int. Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information on the process.
clean_text: bool. If TRUE, the text is modified to improve the quality of the subsequent analysis:
Some special symbols are removed.
All spaces at the beginning and the end of a row are removed.
Multiple spaces are reduced to a single space.
All rows with a number from 1 to 999 at the beginning or at the end are removed (header and footer).
The table of contents is removed.
Hyphenation is undone.
Line breaks within a paragraph are removed.
Multiple line breaks are reduced to a single line break.
trace: bool. If TRUE, information on the progress is printed to the console.
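To make the effect of clean_text more concrete, the following base R sketch imitates some of the listed steps: trimming rows, collapsing spaces, dropping rows that start or end with a number from 1 to 999, undoing hyphenation, and collapsing line breaks. It is an illustration only and not the package's implementation. The function name `clean_text_sketch` and the regular expressions are assumptions, and the steps for special symbols, the table of contents, and line breaks within paragraphs are left out.

```r
# Illustrative only: NOT the implementation used by aifeducation.
clean_text_sketch <- function(text) {
  rows <- strsplit(text, "\n", fixed = TRUE)[[1]]
  rows <- trimws(rows)              # remove spaces at row start and end
  rows <- gsub(" {2,}", " ", rows)  # reduce multiple spaces to one
  # Drop rows that begin or end with a number from 1 to 999.
  is_page_row <- grepl("^[1-9][0-9]{0,2}\\b|\\b[1-9][0-9]{0,2}$",
                       rows, perl = TRUE)
  rows <- rows[!is_page_row]
  text <- paste(rows, collapse = "\n")
  # Undo hyphenation across a line break ("ex-\nample" becomes "example").
  text <- gsub("([[:alpha:]])-\n([[:alpha:]])", "\\1\\2", text)
  # Reduce multiple line breaks to a single line break.
  gsub("\n{2,}", "\n", text)
}

raw <- paste0("Example Report   12\n  This is an ex-\nample   text.  ",
              "\n\n\nSecond paragraph.\n3")
cleaned <- clean_text_sketch(raw)
cat(cleaned, "\n")
```

Note that a rule like the page-number filter also removes ordinary rows that happen to end with a small number, so it may be worth checking a few cleaned texts before relying on the result.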
Method add_from_files_pdf()
Method for adding raw texts saved within .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. In order to create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text, such as "CC BY".
url_license.txt: containing the URL/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the URL/link to the source on the internet.
The id of every .pdf file is the file name without the file extension. Please make sure that the file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.
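Because the ids are derived from the file names, it can help to check for duplicates before importing. The sketch below uses hypothetical file paths. In practice they could come from `list.files(dir_path, pattern = "\\.pdf$", recursive = TRUE)`.

```r
# Hypothetical paths following the one-folder-per-file convention.
pdf_files <- c(
  "pdf_texts/report_2021/report_2021.pdf",
  "pdf_texts/report_2022/report_2022.pdf",
  "pdf_texts/report_2021_copy/report_2021.pdf"
)

# The id is the file name without folder and without extension.
ids <- tools::file_path_sans_ext(basename(pdf_files))

# Ids that occur more than once would not be unique in the data set.
duplicated_ids <- unique(ids[duplicated(ids)])
print(duplicated_ids)
```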
Usage
LargeDataSetForText$add_from_files_pdf(
  dir_path,
  batch_size = 500L,
  log_file = NULL,
  log_write_interval = 2L,
  log_top_value = 0L,
  log_top_total = 1L,
  log_top_message = NA,
  clean_text = TRUE,
  trace = TRUE
)
Arguments
dir_path: Path to the directory where the files are stored.
batch_size: int determining the number of files to process at once.
log_file: string. Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval: int. Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information on the process.
clean_text: bool. If TRUE, the text is modified to improve the quality of the subsequent analysis:
Some special symbols are removed.
All spaces at the beginning and the end of a row are removed.
Multiple spaces are reduced to a single space.
All rows with a number from 1 to 999 at the beginning or at the end are removed (header and footer).
The table of contents is removed.
Hyphenation is undone.
Line breaks within a paragraph are removed.
Multiple line breaks are reduced to a single line break.
trace: bool. If TRUE, information on the progress is printed to the console.
Method add_from_files_xlsx()
Method for adding raw texts saved within .xlsx files to the data set. The method assumes that each row holds one text and that the columns store the id and the raw text. In addition, columns for the bibliographic information and the license can be added. The names of these columns must be specified with the following arguments and must be the same for all .xlsx files in the chosen directory. Id and raw texts are mandatory; bibliographic information, license, license URL, license text, and source URL are optional. Additional columns are dropped.
Usage
LargeDataSetForText$add_from_files_xlsx(
  dir_path,
  trace = TRUE,
  id_column = "id",
  text_column = "text",
  bib_entry_column = "bib_entry",
  license_column = "license",
  url_license_column = "url_license",
  text_license_column = "text_license",
  url_source_column = "url_source",
  log_file = NULL,
  log_write_interval = 2L,
  log_top_value = 0L,
  log_top_total = 1L,
  log_top_message = NA
)
Arguments
dir_path: Path to the directory where the files are stored.
trace: bool. If TRUE, information on the progress is printed to the console.
id_column: string. Name of the column storing the ids for the texts.
text_column: string. Name of the column storing the raw text.
bib_entry_column: string. Name of the column storing the bibliographic information of the texts.
license_column: string. Name of the column storing information about the licenses.
url_license_column: string. Name of the column storing the URL to the license on the internet.
text_license_column: string. Name of the column storing the license as text.
url_source_column: string. Name of the column storing the URL to the source on the internet.
log_file: string. Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval: int. Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information on the process.
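As a rough sketch of how the column arguments map to a spreadsheet, the example below builds a table whose id and text columns use the custom names `doc_id` and `content`, while the optional columns keep their default names. The table contents, the directory, and the use of the writexl package to write the file are assumptions made for illustration. The import call is disabled by the hypothetical flag `run_aifeducation` and should only be enabled in a configured environment.

```r
# Placeholder table: one text per row, custom names for id and text.
texts <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  content = c("First raw text.", "Second raw text."),
  bib_entry = c("Example Author (2020). Example title.", NA),
  license = c("CC BY", "CC BY"),
  url_license = NA_character_,
  text_license = NA_character_,
  url_source = NA_character_,
  stringsAsFactors = FALSE
)

# Hypothetical directory that holds the .xlsx files.
xlsx_dir <- file.path(tempdir(), "xlsx_texts")
dir.create(xlsx_dir, recursive = TRUE, showWarnings = FALSE)

# Assumption: writexl is one possible way to create the file.
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(texts, file.path(xlsx_dir, "texts.xlsx"))
}

# Assumption: set this flag to TRUE only in a configured environment.
run_aifeducation <- FALSE
if (run_aifeducation && requireNamespace("aifeducation", quietly = TRUE)) {
  text_data <- aifeducation::LargeDataSetForText$new()
  text_data$add_from_files_xlsx(
    dir_path = xlsx_dir,
    id_column = "doc_id",     # custom name of the id column
    text_column = "content"   # custom name of the text column
  )
}
```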
Method add_from_data.frame()
Method for adding raw texts from a data.frame.
Arguments
data_frame: Object of class data.frame with at least the following columns: "id", "text", "bib_entry", "license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing, an error occurs. If the other columns are not present in the data.frame, they are added with empty values (NA). Additional columns are dropped.
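The column handling described above can be imitated in base R as shown below. This is a sketch of the documented behavior, not the package's code. The object names `raw_df` and `normalized` and the example values are placeholders, and the final call is disabled by the hypothetical flag `run_aifeducation`.

```r
required <- c("id", "text")
optional <- c("bib_entry", "license", "url_license",
              "text_license", "url_source")

# Placeholder input: has the mandatory columns, one optional column,
# and one extra column ("note") that is not part of the expected set.
raw_df <- data.frame(
  id = c("doc_1", "doc_2"),
  text = c("First raw text.", "Second raw text."),
  license = c("CC BY", NA),
  note = c("extra", "extra"),
  stringsAsFactors = FALSE
)

# Missing "id" or "text" is an error according to the description.
stopifnot(all(required %in% names(raw_df)))

# Add missing optional columns with NA and drop additional columns.
normalized <- raw_df
for (col in setdiff(optional, names(normalized))) {
  normalized[[col]] <- NA_character_
}
normalized <- normalized[, c(required, optional)]
print(names(normalized))

# Assumption: set this flag to TRUE only in a configured environment.
run_aifeducation <- FALSE
if (run_aifeducation && requireNamespace("aifeducation", quietly = TRUE)) {
  text_data <- aifeducation::LargeDataSetForText$new()
  text_data$add_from_data.frame(data_frame = raw_df)
}
```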
Method get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.