Abstract class for large data sets containing raw texts
Source:R/LargeDataSetForTexts.R
LargeDataSetForText.Rd
This object stores raw texts. The data of this object is not held in memory directly. By using memory mapping, these objects make it possible to work with data sets that do not fit into memory (RAM).
See also
Other Data Management: DataManagerClassifier, EmbeddedText, LargeDataSetForTextEmbeddings
Super class
aifeducation::LargeDataSetBase -> LargeDataSetForText
Methods
Inherited methods
aifeducation::LargeDataSetBase$get_all_fields()
aifeducation::LargeDataSetBase$get_colnames()
aifeducation::LargeDataSetBase$get_dataset()
aifeducation::LargeDataSetBase$get_ids()
aifeducation::LargeDataSetBase$load()
aifeducation::LargeDataSetBase$load_from_disk()
aifeducation::LargeDataSetBase$n_cols()
aifeducation::LargeDataSetBase$n_rows()
aifeducation::LargeDataSetBase$reduce_to_unique_ids()
aifeducation::LargeDataSetBase$save()
aifeducation::LargeDataSetBase$select()
Method new()
Method for creating a LargeDataSetForText instance. If the init_data parameter is passed, the instance is initialized with it. If init_data is a data.frame, the add_from_data.frame() method is used.
Usage
LargeDataSetForText$new(init_data = NULL)
Method add_from_files_txt()
Method for adding raw texts saved within .txt files to the data set. Please note that the directory should contain one folder for each .txt file. To create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text, such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .txt file is the file name without the file extension. Please make sure that the file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.
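The layout described above can be sketched in base R as follows. This is only an illustration under assumptions. The folder names, the file names (`text_a`, `text_b`) and the temporary directory are hypothetical. The filter that separates the raw text from the metadata files is an assumption of this sketch, not a statement about how the package does it. The final call is shown as a comment because it needs an installed and configured aifeducation.

```r
# Build a small example directory: one folder per .txt file (names are hypothetical).
root <- file.path(tempdir(), "txt_corpus")
unlink(root, recursive = TRUE)
for (id in c("text_a", "text_b")) {
  folder <- file.path(root, id)
  dir.create(folder, recursive = TRUE)
  # Mandatory: the raw text. The file name without extension becomes the id.
  writeLines(paste("Raw text of", id), file.path(folder, paste0(id, ".txt")))
  # Optional metadata file, one of those listed above.
  writeLines("CC BY", file.path(folder, "license.txt"))
}

# Derive the ids as described: file name without file extension.
# Assumption of this sketch: metadata files are recognized by their fixed names.
meta_files <- c("bib_entry.txt", "license.txt", "url_license.txt",
                "text_license.txt", "url_source.txt")
txt_files <- list.files(root, pattern = "\\.txt$", recursive = TRUE)
txt_files <- txt_files[!basename(txt_files) %in% meta_files]
ids <- tools::file_path_sans_ext(basename(txt_files))

# Not run here (requires aifeducation):
# data_set <- LargeDataSetForText$new()
# data_set$add_from_files_txt(dir_path = root, batch_size = 500, trace = TRUE)
```

Checking `ids` before the import is a cheap way to see which identifiers the texts will probably receive.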
Usage
LargeDataSetForText$add_from_files_txt(
  dir_path,
  batch_size = 500,
  log_file = NULL,
  log_write_interval = 2,
  log_top_value = 0,
  log_top_total = 1,
  log_top_message = NA,
  trace = TRUE
)
Arguments
dir_path
Path to the directory where the files are stored.
batch_size
int determining the number of files to process at once.
log_file
string Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value
int indicating the current iteration of the process.
log_top_total
int determining the maximal number of iterations.
log_top_message
string providing additional information on the process.
trace
bool If TRUE, information on the progress is printed to the console.
Method add_from_files_pdf()
Method for adding raw texts saved within .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. To create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text, such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .pdf file is the file name without the file extension. Please make sure that the file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.
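Because the id is taken from the file name, two files with the same name in different folders would produce the same id. A minimal sketch of a check that can be run before the import is given below. The helper `find_duplicate_ids()` is hypothetical and not part of aifeducation, and the files it inspects are empty placeholders rather than valid PDFs.

```r
# Hypothetical helper: returns the ids that occur more than once under dir_path.
find_duplicate_ids <- function(dir_path, extension = "pdf") {
  pattern <- paste0("\\.", extension, "$")
  files <- list.files(dir_path, pattern = pattern, recursive = TRUE,
                      ignore.case = TRUE)
  # The id is the file name without folder and without file extension.
  ids <- tools::file_path_sans_ext(basename(files))
  unique(ids[duplicated(ids)])
}

# Placeholder files (empty, not valid PDFs), used only to exercise the check.
root <- file.path(tempdir(), "pdf_corpus")
unlink(root, recursive = TRUE)
for (folder in c("study_1", "study_2", "study_3")) {
  dir.create(file.path(root, folder), recursive = TRUE)
}
file.create(file.path(root, "study_1", "report.pdf"))
file.create(file.path(root, "study_2", "report.pdf")) # same name: id clash
file.create(file.path(root, "study_3", "summary.pdf"))

duplicates <- find_duplicate_ids(root)
```

If `duplicates` is not empty, renaming the affected files before calling add_from_files_pdf() avoids ambiguous ids.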
Usage
LargeDataSetForText$add_from_files_pdf(
  dir_path,
  batch_size = 500,
  log_file = NULL,
  log_write_interval = 2,
  log_top_value = 0,
  log_top_total = 1,
  log_top_message = NA,
  trace = TRUE
)
Arguments
dir_path
Path to the directory where the files are stored.
batch_size
int determining the number of files to process at once.
log_file
string Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value
int indicating the current iteration of the process.
log_top_total
int determining the maximal number of iterations.
log_top_message
string providing additional information on the process.
trace
bool If TRUE, information on the progress is printed to the console.
Method add_from_files_xlsx()
Method for adding raw texts saved within .xlsx files to the data set. The method assumes that each row holds one text and that the id and the raw text are stored in separate columns. In addition, columns for the bibliographic information and the license can be added. The names of these columns must be specified with the following arguments and must be the same for all .xlsx files in the chosen directory. Id and raw texts are mandatory; bibliographic information, license, license's url, license's text, and source's url are optional. Additional columns are dropped.
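A table with the expected shape can be sketched as a data.frame before it is written to an .xlsx file. The example below uses the default column names of add_from_files_xlsx(). All values are placeholders, the output directory is hypothetical, and the use of the writexl package for writing the file is an assumption of this sketch. The import call is shown as a comment because it needs a configured aifeducation.

```r
# One row per text, with the default column names of add_from_files_xlsx().
# All values are placeholders.
texts <- data.frame(
  id = c("text_a", "text_b"),
  text = c("First raw text.", "Second raw text."),
  bib_entry = c("Placeholder bibliographic entry", NA),
  license = c("CC BY", NA),
  url_license = NA_character_,
  text_license = NA_character_,
  url_source = NA_character_,
  stringsAsFactors = FALSE
)
# A column not named in the arguments; per the description it is dropped on import.
texts$comment <- "internal note"

out_dir <- file.path(tempdir(), "xlsx_corpus")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
xlsx_path <- file.path(out_dir, "texts.xlsx")

# Writing .xlsx needs an extra package; writexl is an assumption of this sketch.
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(texts, xlsx_path)
}

# Not run here (requires aifeducation):
# data_set <- LargeDataSetForText$new()
# data_set$add_from_files_xlsx(dir_path = out_dir, id_column = "id", text_column = "text")
```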
Usage
LargeDataSetForText$add_from_files_xlsx(
  dir_path,
  trace = TRUE,
  id_column = "id",
  text_column = "text",
  bib_entry_column = "bib_entry",
  license_column = "license",
  url_license_column = "url_license",
  text_license_column = "text_license",
  url_source_column = "url_source",
  log_file = NULL,
  log_write_interval = 2,
  log_top_value = 0,
  log_top_total = 1,
  log_top_message = NA
)
Arguments
dir_path
Path to the directory where the files are stored.
trace
bool If TRUE, prints information on the progress to the console.
id_column
string Name of the column storing the ids for the texts.
text_column
string Name of the column storing the raw text.
bib_entry_column
string Name of the column storing the bibliographic information of the texts.
license_column
string Name of the column storing information about the licenses.
url_license_column
string Name of the column storing the url to the license on the internet.
text_license_column
string Name of the column storing the license as text.
url_source_column
string Name of the column storing the url to the source on the internet.
log_file
string Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval
int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value
int indicating the current iteration of the process.
log_top_total
int determining the maximal number of iterations.
log_top_message
string providing additional information on the process.
Method add_from_data.frame()
Method for adding raw texts from a data.frame.
Arguments
data_frame
Object of class data.frame with at least the following columns: "id", "text", "bib_entry", "license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing, an error occurs. If the other columns are not present in the data.frame, they are added with empty values (NA). Additional columns are dropped.
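The column rules described for data_frame can be illustrated with a small base R sketch. The helper `prepare_text_data_frame()` is hypothetical. It mirrors the documented behavior (error on missing "id" or "text", NA for missing optional columns, extra columns dropped) and is not the package's own implementation.

```r
# Hypothetical helper mirroring the documented rules; not the package's own code.
prepare_text_data_frame <- function(data_frame) {
  required <- c("id", "text")
  optional <- c("bib_entry", "license", "url_license", "text_license", "url_source")

  # Mandatory columns: stop with an error if any of them is missing.
  missing_required <- setdiff(required, colnames(data_frame))
  if (length(missing_required) > 0) {
    stop("Missing mandatory column(s): ", paste(missing_required, collapse = ", "))
  }
  # Optional columns: add them with NA if they are not present.
  for (column in setdiff(optional, colnames(data_frame))) {
    data_frame[[column]] <- NA_character_
  }
  # Keep only the known columns, in a fixed order; everything else is dropped.
  data_frame[, c(required, optional), drop = FALSE]
}

raw <- data.frame(
  id = c("text_a", "text_b"),
  text = c("First raw text.", "Second raw text."),
  license = c("CC BY", "CC BY"),
  page_count = c(3, 7), # extra column, dropped by the helper
  stringsAsFactors = FALSE
)
prepared <- prepare_text_data_frame(raw)
```

Running such a check beforehand shows which columns a given data.frame would contribute and which would be filled with NA.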
Method get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.