
03 Using R syntax
Florian Berding, Yuliia Tykhonova, Julia Pargmann, Andreas Slopinski, Elisabeth Riebenbauer, Karin Rebmann
Source: `vignettes/classification_tasks.Rmd`
1 Introduction and Overview
1.1 Preface
This vignette introduces the package aifeducation and its usage with R syntax. For users who are unfamiliar with R or who do not have coding skills in relevant languages (e.g., Python), we recommend starting with the graphical user interface AI for Education - Studio, which is described in the vignette 02 Using the graphical user interface Aifeducation - Studio.
We assume that aifeducation is installed as described in vignette 01 Get Started. The introduction starts with a brief explanation of the basic concepts necessary for working with this package.
1.2 Basic Concepts
In the educational and social sciences, assigning scientific concepts to an observation is an important task that allows researchers to understand an observation, to generate new insights, and to derive recommendations for research and practice.
In educational science, several areas deal with this kind of task. For example, diagnosing students’ characteristics is an important aspect of a teacher’s profession and necessary to understand and promote learning. Another example is the use of learning analytics, where data about students is used to provide learning environments adapted to their individual needs. On another level, educational institutions such as schools and universities can use this information for data-driven performance decisions (Laurusson & White 2014) as well as for deciding where and how to improve. In any case, a real-world observation is aligned with scientific models to use scientific knowledge as a technology for improved learning and instruction.
Supervised machine learning is one concept that allows a link between real-world observations and existing scientific models and theories (Berding et al. 2022). For educational science, this is a great advantage because it allows researchers to use existing knowledge and insights for applying AI. The drawback of this approach is that the training of AI requires both information about the real-world observations and information on the corresponding alignment with scientific models and theories.
A valuable source of data in educational science are written texts, since textual data can be found almost everywhere in the realm of learning and teaching (Berding et al. 2022). For example, teachers often require students to solve a task which they provide in a written form. Students have to create a solution for the tasks which they often document with a short written essay or a presentation. This data can be used to analyze learning and teaching. Teachers’ written tasks for their students may provide insights into the quality of instruction while students’ solutions may provide insights into their learning outcomes and prerequisites.
AI can be a helpful assistant in analyzing textual data since the analysis of textual data is a challenging and time-consuming task for humans.
Please note that an introduction to content analysis, natural language processing or machine learning is beyond the scope of this vignette. If you would like to learn more, please refer to the cited literature.
Before we start, it is necessary to introduce a definition of our understanding of some basic concepts, since applying AI to educational contexts means to combine the knowledge of different scientific disciplines using different, sometimes overlapping, concepts. Even within a single research area, concepts are not unified. Figure 1 illustrates this package’s understanding.

Since aifeducation looks at the application of AI for classification tasks from the perspective of the empirical method of content analysis, there is some overlap between the concepts of content analysis and machine learning. In content analysis, a phenomenon like performance or colors can be described as a scale/dimension which is made up of several categories (e.g. Schreier 2012, pp. 59). In our example, an exam’s performance (scale/dimension) could be “good”, “average” or “poor”. In terms of colors (scale/dimension), categories could be “blue”, “green”, etc. Machine learning literature uses other words to describe this kind of data. In machine learning, “scale” and “dimension” correspond to the term “label” while “categories” refers to the term “classes” (Chollet, Kalinowski & Allaire 2022, p. 114).
With these clarifications, classification means that a text is assigned to the correct category of a scale or, respectively, that the text is labeled with the correct class. As Figure 2 illustrates, two kinds of data are necessary to train an AI to classify text in line with supervised machine learning principles.

By providing AI with both the textual data as input data and the corresponding information about the class as target data, AI can learn which texts imply a specific class or category. In the above exam example, AI can learn which texts imply a “good”, an “average” or a “poor” judgment. After training, AI can be applied to new texts and predict the most likely class of every new text. The generated class can be used for further statistical analysis or to derive recommendations about learning and teaching.
In use cases as described in this vignette, AI has to “understand” natural language: “Natural language processing is an area of research in computer science and artificial intelligence (AI) concerned with processing natural languages such as English and Mandarin. This processing generally involves translating natural language into data (numbers) that a computer can use to learn about the world. (…)” (Lane, Howard & Hapke 2019, p. 4)
Thus, the first step is to transform raw texts into a form that is usable for a computer; that is, raw texts must be transformed into numbers. In modern approaches, this is usually done through word embeddings. Campesato (2021, p. 102) describes them as “the collective name for a set of language modeling and feature learning techniques (…) where words or phrases from the vocabulary are mapped to vectors of real numbers.” The definition of a word vector is similar: “Word vectors represent the semantic meaning of words as vectors in the context of the training corpus.” (Lane, Howard & Hapke 2019, p. 191). In the next step, the word or text embeddings can be used as input data and the labels as target data when training AI to classify a text.
In aifeducation, these steps are covered with three different types of models, as shown in Figure 3.

Base Models: The base models contain the capacities to understand natural language. In general, these are transformers such as BERT, RoBERTa, etc. A huge number of pre-trained models can be found on Hugging Face.
Text Embedding Models: These models are built on top of base models and store instructions on how to use these base models for converting raw texts into sequences of numbers. Please note that the same base model can be used to create different text embedding models.
Classifiers: Classifiers are used on top of a text embedding model. They are used to classify a text into categories/classes based on the numeric representation provided by the corresponding text embedding model. Please note that a text embedding model can be used to create different classifiers (e.g. one classifier for colors, one classifier to estimate the quality of a text, etc.).
2 Start Working
2.1 Starting a New Session
Before you can work with aifeducation, you must set up a new R session. First, you load aifeducation. Second, it is necessary to set up python via ‘reticulate’ and choose the environment where all necessary python libraries are available. In case you installed python as suggested in vignette 01 Get Started, you may start a new session like this:
library(aifeducation)
prepare_session()
#> Python is already initalized with the virtual environment ' aifeducation '.
#> Try to use this environment.
#> Detected OS: windows
#> Checking python packages. This can take a moment.
#> All necessary python packages are available.
#> python: 3.10
#> torch: 2.7.1+cu126
#> pyarrow: 20.0.0
#> transformers: 4.52.4
#> tokenizers: 0.21.1
#> pandas: 2.3.0
#> datasets: 3.6.0
#> codecarbon: 3.0.2
#> safetensors: 0.5.3
#> torcheval: 0.0.7
#> accelerate: 1.8.1
#> numpy: 2.2.6
#> GPU Acceleration: TRUE
#> Location for Temporary Files:C:\Users\User\AppData\Local\Temp\RtmpMHnMI8/r_aifeducation
Please remember: Every time you start a new session in R, you have to load the library aifeducation and configure python. We recommend using the function `prepare_session` because it performs all necessary steps for setting up python correctly.
Now you can start your work.
2.2 Data Management
2.2.1 Introduction
In the context of use cases for aifeducation, three different types of data are necessary: raw texts, text embeddings, and target data which represent the categories/classes of a text.
To deal with the first two types and to allow the use of large data sets that may not fit into the memory of your machine, the package ships with two specialized objects.
The first is `LargeDataSetForText`. Objects of this class are used to read raw texts from .txt, .pdf, and .xlsx files and store them for further computations. The second is `LargeDataSetForTextEmbeddings`, which is used to store the text embeddings of raw texts generated with `TextEmbeddingModel`s. We will describe the transformation of raw texts into text embeddings later.
2.2.2 Raw Texts
Creating a `LargeDataSetForText` is necessary if you would like to create or train a base model or to generate text embeddings. In case you would like to create such a data set for the first time, you have to create an empty data set first:
raw_texts <- LargeDataSetForText$new()
To fill this object with raw texts, different methods are available depending on the file type used for storing the raw texts.
.txt files
The first alternative is to store raw texts in .txt files. To use these, you have to structure your data in a specific way:
- Create a main folder for storing your data.
- Store every raw text/document in a single .txt file in its own folder within the main folder. Every folder should contain only one file for a raw text/document.
- Add an additional .txt file to the folder named `bib_entry.txt`. This file contains bibliographic information for the raw text.
- Add an additional .txt file to the folder named `license.txt` which contains a short statement of the text's license, such as “CC BY”.
- Add an additional .txt file to the folder named `url_license.txt` which contains the URL/link to the license's text, such as “https://creativecommons.org/licenses/by/4.0/”.
- Add an additional .txt file to the folder named `text_license.txt` which contains the full license as raw text.
- Add an additional .txt file to the folder named `url_source.txt` which contains the URL/link to the text file on the internet.
Applying these rules may result in a data structure as follows:
- Folder “main folder”
- Folder Text A
- text_a.txt
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
- Folder Text B
- text_b.txt
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
- Folder Text C
- text_c.txt
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
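If you prefer to set up this structure programmatically, a minimal sketch in base R may look like the following (the folder names and file contents are purely illustrative):
# Create the folder for one text and write the raw text plus metadata files
dir.create("main folder/Text A", recursive = TRUE)
writeLines("This is the raw text of document A.", "main folder/Text A/text_a.txt")
writeLines("Author (2020). Example Title.", "main folder/Text A/bib_entry.txt")
writeLines("CC BY", "main folder/Text A/license.txt")
writeLines("https://creativecommons.org/licenses/by/4.0/", "main folder/Text A/url_license.txt")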
Now you can call the method `add_from_files_txt` by passing the path to the directory of the main folder to `dir_path`.
raw_texts$add_from_files_txt(
dir_path = "main folder",
clean_text=TRUE
)
The data set will now read all the raw texts in the main folder and assign every text its corresponding bib entry, license, etc. Please note that adding a `bib_entry.txt`, `license.txt`, `url_license.txt`, `text_license.txt`, and `url_source.txt` to every folder is optional. If there is no such file in the corresponding folder, there will be an empty entry in the data set. However, against the backdrop of the European AI Act, we recommend providing both the license and bibliographic information to make the documentation of your models more straightforward. Furthermore, some licenses such as those provided by Creative Commons require statements about the creators, a copyright note, a URL or link to the source material (if possible), the license of the material, and a URL or link to the license's text on the internet or the license text itself. Please check the licenses of the material you are using for the requirements.
.pdf files
The second alternative is to use .pdf files as a source for raw texts. Here, the necessary structure is similar to that for .txt files:
- Create a main folder for storing your data.
- Store every raw text/document in a single .pdf file in its own folder within the main folder. Every folder should contain only one file for a raw text/document.
- Add an additional .txt file to the folder named `bib_entry.txt`. This file contains bibliographic information for the raw text.
- Add an additional .txt file to the folder named `license.txt` which contains a short statement of the text's license, such as “CC BY”.
- Add an additional .txt file to the folder named `url_license.txt` which contains the URL/link to the license's text, such as “https://creativecommons.org/licenses/by/4.0/”.
- Add an additional .txt file to the folder named `text_license.txt` which contains the full license as raw text.
- Add an additional .txt file to the folder named `url_source.txt` which contains the URL/link to the text file on the internet.
Applying these rules may result in a data structure as follows:
- Folder “main folder”
- Folder Text A
- text_a.pdf
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
- Folder Text B
- text_b.pdf
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
- Folder Text C
- text_c.pdf
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
Please note that all files except the text file itself must be .txt files, not .pdf.
Now you can call the method `add_from_files_pdf` by passing the path to the directory of the main folder to `dir_path`.
raw_texts$add_from_files_pdf(
dir_path = "main folder",
clean_text=TRUE
)
As stated above, `bib_entry.txt`, `license.txt`, `url_license.txt`, `text_license.txt`, and `url_source.txt` are optional.
.xlsx files
The third alternative is to store the raw texts in .xlsx files. This alternative is useful if you have many small raw texts. For raw texts that are very large, such as books or papers, we recommend storing them as .txt or .pdf files.
In order to add raw texts from .xlsx files, the files need a special structure:
- Create a main folder for storing all .xlsx files you would like to read.
- All .xlsx files must contain the names of the columns in the first row, and the names must be identical for each column across all .xlsx files you would like to read.
- Every .xlsx file must contain a column storing the text ID and a column storing the raw text. Every text must have a unique ID across all .xlsx files.
- Every .xlsx file can contain an additional column for the bib entry.
- Every .xlsx file can contain an additional column for the license.
- Every .xlsx file can contain an additional column for the license's URL.
- Every .xlsx file can contain an additional column for the license's text.
- Every .xlsx file can contain an additional column for the source's URL.
Your .xlsx file may look like this:

| id | text | bib | license | url_license | text_license | url_source |
|----|------|-----|---------|-------------|--------------|------------|
| z3 | This is an example. | Author (2019) | CC BY | Example URL | Text | Example URL |
| a3 | This is a second example. | Author (2022) | CC BY | Example URL | Text | Example URL |
| … | … | … | … | … | … | … |
Now you can call the method `add_from_files_xlsx` by passing the path to the directory of the main folder to `dir_path`. Please do not forget to specify the column names for ID and text as well as for the other columns.
raw_texts$add_from_files_xlsx(
dir_path = "main folder",
id_column = "id",
text_column = "text",
bib_entry_column = "bib_entry",
license_column = "license",
url_license_column = "url_license",
text_license_column = "text_license",
url_source_column = "url_source"
)
Clean text
For .txt and .pdf files you can set the argument `clean_text = TRUE`. This requests an algorithm that pre-processes the raw texts and applies the following modifications:
- Some special symbols are removed.
- All spaces at the beginning and the end of a row are removed.
- Multiple spaces are reduced to a single space.
- All rows with a number from 1 to 999 at the beginning or at the end are removed (headers and footers).
- Lists of content are removed.
- Hyphenation is undone.
- Line breaks within a paragraph are removed.
- Multiple line breaks are reduced to a single line break.
The aim of these changes is to provide a clean plain text in order to increase the performance and quality of all analyses.
IDs
In case of .xlsx files, the texts' IDs are set to the IDs stored in the corresponding ID column. In case of .pdf and .txt files, the file names are used as IDs (without the file extension).
Please note that, as a consequence, two files text_01.txt and text_01.pdf would have the same ID, which is not allowed. Please ensure that you use unique IDs across file formats.
Saving and loading a data set
Once you have created a `LargeDataSetForText`, you can save your data to disk by calling the function `save_to_disk`. In our example the code would be:
save_to_disk(
object = raw_texts,
dir_path = "examples",
folder_name = "raw_texts"
)
The argument `object` requires the object you would like to save, in our case `raw_texts`. With `dir_path` you specify the location where the object is saved, and with `folder_name` you define the name of the folder that will be created within that directory. The data set is saved in this folder.
To load an existing data set, you can call the function `load_from_disk` with the directory path where you stored the data. In our case this would be:
raw_text_dataset <- load_from_disk("examples/raw_texts")
Now you can work with your data.
2.2.3 Text Embeddings
The numerical representations of raw texts (called text embeddings) are stored in objects of class `LargeDataSetForTextEmbeddings`. These kinds of data sets are generated by models such as `TextEmbeddingModel`s. Thus, you will never need to create such a data set manually.
However, you will need this kind of data set to train a classifier or to predict the categories/classes of raw texts. Thus, it may be advantageous to save already transformed data. You can save and load an object of this class with the functions `save_to_disk` and `load_from_disk`.
Let us assume that we have a `LargeDataSetForTextEmbeddings` called `text_embeddings`. Saving this object may look like:
save_to_disk(
object = text_embeddings,
dir_path = "examples",
folder_name = "text_embeddings"
)
The data set will be saved at `examples/text_embeddings`. Loading this data set may look like:
new_text_embeddings <- load_from_disk("examples/text_embeddings")
2.2.4 Target Data
The last data type necessary for working with aifeducation are the categories/classes of given raw texts. For this kind of data we currently do not provide a special object. You just need a named `factor` storing the classes/categories for a dimension. It is important that the names equal the IDs of the corresponding raw texts/text embeddings, since matching the classes/categories to texts is done with the help of these names.
Saving and loading can be done with R's functions `save` and `load`, as sketched below.
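A minimal sketch of such a factor; the IDs and labels here are purely illustrative:
# A named factor: values are the categories, names are the text IDs
target_data <- factor(c("pos", "neg", "pos"), levels = c("neg", "pos"))
names(target_data) <- c("review_1", "review_2", "review_3")

# Saving and loading with base R
save(target_data, file = "examples/target_data.rda")
load("examples/target_data.rda")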
2.3 Example Data for this Vignette
To illustrate the steps in this vignette, we cannot use data from educational settings, since such data is generally protected by privacy policies. Therefore, we use a subset of the Stanford Movie Review Dataset provided by Maas et al. (2011), which is part of the package. You can access the data set with `imdb_movie_reviews`.
We now have a data set with three columns. The first column contains the raw text, the second contains the rating of the movie (positive or negative), and the third contains the ID of the movie review. About 200 reviews imply a positive rating of a movie and about 100 imply a negative rating.
For this tutorial, we modify this data set by setting about 50 positive and 25 negative reviews to `NA`, indicating that these reviews are not labeled.
example_data <- imdb_movie_reviews
example_data$label <- as.character(example_data$label)
example_data$label[c(76:100)] <- NA
example_data$label[c(201:250)] <- NA
table(example_data$label)
#>
#> neg pos
#> 75 150
We will now create a `LargeDataSetForText` from this `data.frame`. Before we can do this, we must ensure that the `data.frame` has all necessary columns:
colnames(example_data)
#> [1] "text" "label" "id"
Now we have to add two columns. For this tutorial we do not add any bibliographic or license information, although this is recommended in practice.
example_data$bib_entry <- NA
example_data$license <- NA
colnames(example_data)
#> [1] "text" "label" "id" "bib_entry" "license"
Now the `data.frame` is ready as input for our data set. The “label” column will not be included.
data_set_reviews_text <- LargeDataSetForText$new()
data_set_reviews_text$add_from_data.frame(example_data)
We save the categories/labels in a separate named factor, as sketched below; the resulting object `review_labels` is used as target data in section 5.3.
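A minimal sketch, assuming the labels and IDs from the `data.frame` above:
# Store the labels as a named factor whose names are the review IDs
# (unlabeled reviews keep NA)
review_labels <- factor(example_data$label, levels = c("neg", "pos"))
names(review_labels) <- example_data$id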
We will now use this data to show you how to use the different objects and functions in aifeducation.
3 Base Models
3.1 Overview
Base models are the foundation of all further models in aifeducation. At the moment, these are transformer models such as ModernBERT (Warner et al. 2024), MPNet (Song et al. 2020), BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), Funnel-Transformer (Dai et al. 2020), and Longformer (Beltagy, Peters & Cohan 2020). In general, these models are trained on a large corpus of general texts in a first step. In the next step, the models are fine-tuned on domain-specific texts and/or for specific tasks. Since the creation of base models requires a huge number of texts, resulting in high computational time, it is recommended to use pre-trained models. These can be found on Hugging Face. Sometimes, however, it is more straightforward to create a new model to fit a specific purpose. aifeducation supports the option to both create and train/fine-tune base models.
3.2 Creation of Base Models
Every transformer model is composed of two parts:
- the tokenizer which splits raw texts into smaller pieces to model a large number of words with a limited, small number of tokens and
- the neural network that is used to model the capabilities for understanding natural language.
At the beginning, you can choose between the different supported transformer architectures. Depending on the architecture, you have different options determining the shape of your neural network. For this vignette we use a BERT (Devlin et al. 2019) model, which can be created with the function `aife_transformer.make`.
base_model <- aife_transformer.make("bert")
#> [1] "BERT Model has been initialized."
base_model$create(
model_dir = "examples/my_own_transformer",
text_dataset = data_set_reviews_text,
vocab_size = 30522,
vocab_do_lower_case = FALSE,
max_position_embeddings = 512,
hidden_size = 768,
num_hidden_layer = 12,
num_attention_heads = 12,
intermediate_size = 3072,
hidden_act = "gelu",
hidden_dropout_prob = 0.1,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
trace = TRUE,
log_dir = NULL,
log_write_interval = 2
)
#> Mon Aug 11 11:10:08 2025 Start Sustainability Tracking
#> Mon Aug 11 11:10:09 2025 Creating Tokenizer Draft
#> Mon Aug 11 11:10:09 2025 Start Computing Vocabulary
#> Mon Aug 11 11:10:09 2025 Start Computing Vocabulary - Done
#> Mon Aug 11 11:10:09 2025 Saving Draft
#> Mon Aug 11 11:10:10 2025 Creating Tokenizer
#> Mon Aug 11 11:10:10 2025 Creating Tokenizer - Done
#> Mon Aug 11 11:10:13 2025 Creating Transformer Model
#> Mon Aug 11 11:10:13 2025 Saving BERT Model
#> Mon Aug 11 11:10:14 2025 Saving Tokenizer Model
#> Mon Aug 11 11:10:14 2025 Saving Sustainability Data
#> Mon Aug 11 11:10:14 2025 Done
For this function to work, you must provide a path to a directory where your new transformer should be saved (`model_dir`). Furthermore, you must provide raw texts to `text_dataset`. This object should be of class `LargeDataSetForText` as described in section 2.2.2. These texts are not used to train the transformer but for calculating the vocabulary. In this example we use the texts from the movie reviews. Please note that this data set is too small for creating a new transformer; we use it here only as a fast-running illustration. For real use cases a larger data set is necessary.
The maximum size of the vocabulary is determined by `vocab_size`. Modern tokenizers such as WordPiece (Wu et al. 2016) use algorithms that split words into smaller elements, allowing them to build a huge number of words with a small number of elements. Thus, even with only about 30,000 tokens, they are able to represent a very large number of words.
The other parameters allow you to customize your BERT model. For example, you could increase the number of hidden layers from 12 to 24 or reduce the hidden size from 768 to 256, allowing you to build and test larger or smaller models, as sketched below.
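For illustration, here is a sketch of a smaller variant, mirroring the call from above. The reduced values are illustrative assumptions; note that the number of attention heads is lowered so that it divides the hidden size evenly:
# A hypothetical smaller BERT model (values for illustration only)
smaller_model <- aife_transformer.make("bert")
smaller_model$create(
  model_dir = "examples/my_smaller_transformer",
  text_dataset = data_set_reviews_text,
  vocab_size = 30522,
  vocab_do_lower_case = FALSE,
  max_position_embeddings = 512,
  hidden_size = 256,        # reduced from 768
  num_hidden_layer = 12,
  num_attention_heads = 8,  # reduced from 12 so that 256 / 8 is an integer
  intermediate_size = 1024, # commonly 4 x hidden_size
  hidden_act = "gelu",
  hidden_dropout_prob = 0.1,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  trace = TRUE,
  log_dir = NULL,
  log_write_interval = 2
)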
The vignette 04 Model configuration provides details on how to configure a base model.
Please note that with `max_position_embeddings` you determine how many tokens your transformer can process. If your text has more tokens, these tokens are ignored. However, if you would like to analyze long documents, please avoid increasing this number too much, because the computational time does not grow linearly but quadratically (Beltagy, Peters & Cohan 2020). For long documents you can use another architecture (e.g. Longformer from Beltagy, Peters & Cohan 2020) or split a long document into several chunks which are used sequentially for classification (e.g. Pappagari et al. 2019). Using chunks is supported by aifeducation for all models.
Since creating a transformer model is energy consuming, aifeducation allows you to estimate its ecological impact with the help of the python library `codecarbon`. Thus, `sustain_track` is set to `TRUE` by default. If you use the sustainability tracker, you must provide the alpha-3 code for the country where your computer is located (e.g., “CAN” = Canada, “DEU” = Germany). A list with the codes can be found on Wikipedia. The reason is that different countries use different sources and techniques for generating their energy, resulting in a specific impact on CO2 emissions. For the USA and Canada you can additionally specify a region by setting `sustain_region`. Please refer to the documentation of `codecarbon` for more information.
After calling the function, you will find your new model in your model directory.
3.3 Train/Fine-Tune a Base Model
If you would like to train a new base model (see section 3.2) for the first time or want to adapt a pre-trained model to a domain-specific language or task, you can call the corresponding `train` method.
base_model <- aife_transformer.make("bert")
base_model$train(
output_dir = "examples/my_own_transformer_trained",
model_dir_path = "examples/my_own_transformer",
text_dataset = data_set_reviews_text,
p_mask = 0.15,
whole_word = TRUE,
val_size = 0.1,
n_epoch = 1,
batch_size = 12,
chunk_size = 250,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
trace = TRUE,
log_dir = NULL,
log_write_interval = 2
)
#> Mon Aug 11 11:10:14 2025 Start Sustainability Tracking
#> Mon Aug 11 11:10:15 2025 Loading Existing Model
#> Mon Aug 11 11:10:15 2025 Creating Chunks of Sequences for Training
#> Mon Aug 11 11:10:16 2025 459 Chunks Created
#> Mon Aug 11 11:10:16 2025 Using Whole Word Masking
#> Mon Aug 11 11:10:16 2025 Preparing Training of the Model
#> Mon Aug 11 11:10:16 2025 Start Fine Tuning
#> Mon Aug 11 11:10:24 2025 Saving BERT Model
#> Mon Aug 11 11:10:25 2025 Saving Tokenizer
#> Mon Aug 11 11:10:25 2025 Saving Sustainability Data
#> Mon Aug 11 11:10:25 2025 Done
Here it is important that you provide the path to the directory where your new transformer is stored. Furthermore, it is important that you provide another directory where your trained transformer should be saved to avoid reading and writing collisions.
Now, the provided raw data is used to train your model. In the case of a BERT model, the learning objective is Masked Language Modeling. Other models may use other learning objectives. Please refer to the documentation for more details on every model.
First, you can set the length of the token sequences with `chunk_size`, leading the tokenizer to split long texts into several chunks of the given size. With `val_size`, you set how many of these chunks should be used for the validation sample. With `whole_word` you can choose between masking single tokens or masking complete words (please remember that modern tokenizers split words into several tokens, so tokens and words do not necessarily match each other directly). Finally, with `p_mask` you can determine how many tokens should be masked.
Please remember to set the correct alpha-3 code for tracking the ecological impact of training your model (`sustain_iso_code`).
If your machine's graphics device has only a small memory capacity, please reduce the batch size significantly.
After the training finishes, you can find the transformer ready to use in your `output_dir`. Now you are able to create a text embedding model.
4 Text Embedding Models
4.1 Introduction
The text embedding model is used to transform raw texts into numerical representations. In order to create a new model, you need a base model that provides the ability to understand natural language. A text embedding model is stored as an object of class `TextEmbeddingModel`.
In aifeducation, the transformation of raw texts into numbers is a separate step from downstream tasks such as classification. This reduces computational time on machines with low performance. By separating text embedding from other tasks, the text embedding has to be calculated only once and can be used for different tasks at the same time. Another advantage is that the training of the downstream tasks involves only the downstream tasks and not the parameters of the embedding model, making training less time-consuming and thus decreasing computational intensity. Finally, this approach allows the analysis of long documents by applying the same algorithm to different parts of a text.
The text embedding model provides a unified interface: After creating the model with different methods, the handling of the model is always the same.
4.2 Create a Text Embedding Model
First you have to choose the base model that forms the foundation of your new text embedding model. In order to illustrate its use, we apply a pre-trained model from Hugging Face called BERT base model (uncased), published by Devlin et al. (2019). Download all files into a new folder. Here we store the model in `"examples/bert_uncased"`.
bert_modeling <- TextEmbeddingModel$new()
bert_modeling$configure(
model_label = "Text Embedding via BERT",
model_language = "english",
max_length = 512,
chunks = 4,
overlap = 30,
emb_layer_min = "Middle",
emb_layer_max = "2_3_layer",
emb_pool_type = "Average",
model_dir = "examples/bert_uncased"
)
Next, you have to provide the directory where your base model is stored. In this example this would be `model_dir = "examples/bert_uncased"`. Of course you can use any other pre-trained model from Hugging Face which addresses your needs and is supported by aifeducation.
Using a BERT model for text embedding is not a problem as long as your text does not produce more tokens than the transformer can process. This maximum value is set in the configuration of the transformer (see section 3.2). If the text produces more tokens, the last tokens are ignored. In some instances you might want to analyze long texts. In these situations, reducing the text to the first tokens (e.g. only the first 512 tokens) could result in a problematic loss of information. To deal with these situations, you can configure a text embedding model in aifeducation to split long texts into several chunks which are processed by the base model. The maximum number of chunks is set with `chunks`. In our example above, the text embedding model would split a text consisting of 1,024 tokens into two chunks, with every chunk consisting of 512 tokens. For every chunk, a text embedding is calculated. As a result, you receive a sequence of embeddings: the first embedding characterizes the first part of the text, the second embedding characterizes the second part of the text, and so on. Thus, our sample text embedding model is able to process texts of up to about 4 × 512 = 2048 tokens. This approach is inspired by the work of Pappagari et al. (2019).
Since transformers are able to account for context, it may be useful to interconnect the chunks to bring context into the calculations. This can be done with `overlap`, which determines how many tokens from the end of a prior chunk should be added to the next. In our example, the last 30 tokens of the prior chunk are added at the beginning of the following chunk. This can help to add the correct context of the text sections into the analysis. Accounting for the overlap, this model can analyse a maximum of 512 + (4 − 1) × (512 − 30) = 1958 unique tokens of a text.
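The same arithmetic, computed in R from the configuration values above:
max_length <- 512 # tokens per chunk (max_length)
chunks <- 4       # maximum number of chunks
overlap <- 30     # tokens repeated from the prior chunk
max_length + (chunks - 1) * (max_length - overlap)
#> [1] 1958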
Finally, you have to decide from which hidden layer(s) the embeddings should be drawn. With `emb_layer_min` and `emb_layer_max` you can decide from which layers the average value for every token should be calculated. Please note that the calculation considers all layers between `emb_layer_min` and `emb_layer_max`. In their initial work, Devlin et al. (2019) used the hidden states of different layers for classification.
With `emb_pool_type`, you decide which tokens are used for pooling within every layer. In the case of `emb_pool_type = "CLS"`, only the CLS token is used. In the case of `emb_pool_type = "Average"`, all tokens within a layer are averaged, except padding tokens.
The vignette 04 Model configuration provides details on how to configure a text embedding model.
After deciding about the configuration, you can use your model.
You can see the number of learnable parameters of the underlying base model with
bert_modeling$count_parameter()
#> [1] 108891648
Another important value is the number of features, which you can request by calling `get_n_features`.
bert_modeling$get_n_features()
#> [1] 768
This number describes the number of dimensions of a text embedding, that is, the number of dimensions used to characterize the content of every chunk of text. This value is important since it determines the complexity a classifier or feature extractor has to deal with. Some of the classifier's and feature extractor's parameters depend on this value. We elaborate on this at the relevant points for the different models.
4.3 Transforming Raw Texts into Embedded Texts
To transform raw text into a numeric representation, you only have to use the `embed_large` method of your model. To do this, you must provide a `LargeDataSetForText` to `large_datas_set`. Relying on the sample data from section 2.3, we can use the movie reviews as raw texts.
review_embeddings <- bert_modeling$embed_large(
large_datas_set = data_set_reviews_text,
trace = TRUE
)
#> Mon Aug 11 11:10:28 2025 Batch 1 / 10 done
#> Mon Aug 11 11:10:29 2025 Batch 2 / 10 done
#> Mon Aug 11 11:10:30 2025 Batch 3 / 10 done
#> Mon Aug 11 11:10:31 2025 Batch 4 / 10 done
#> Mon Aug 11 11:10:31 2025 Batch 5 / 10 done
#> Mon Aug 11 11:10:32 2025 Batch 6 / 10 done
#> Mon Aug 11 11:10:33 2025 Batch 7 / 10 done
#> Mon Aug 11 11:10:34 2025 Batch 8 / 10 done
#> Mon Aug 11 11:10:35 2025 Batch 9 / 10 done
#> Mon Aug 11 11:10:36 2025 Batch 10 / 10 done
The method `embed_large` creates an object of class `LargeDataSetForTextEmbeddings`. This is just a data set consisting of the embeddings of every text. The embeddings are an array whose first dimension refers to the specific texts, whose second dimension refers to the chunks/sequences, and whose third dimension refers to the features.
With the embedded texts you now have the input to train a new classifier or to apply a pre-trained classifier for predicting categories/classes. In the next chapter we will show you how to use these classifiers. But before we start, we will show you how to save and load your model.
4.4 Saving and Loading Embedded Texts
Since transforming raw texts into text embeddings is time- and energy-consuming, we recommend saving them to disk in order to use the embeddings for further tasks and analyses.
To save them, just call the function `save_to_disk` as shown below.
save_to_disk(
object = review_embeddings,
dir_path = "examples",
folder_name = "imdb_movie_reviews"
)
To load the embeddings, you can call the function `load_from_disk`.
review_embeddings <- load_from_disk("examples/imdb_movie_reviews")
4.5 Saving and Loading Text Embedding Models
Saving a created text embedding model is very easy in aifeducation using the function `save_to_disk`. This function provides a unified interface for all text embedding models. For saving your work, you pass your model to `object` and the directory where to save the model to `dir_path`. With `folder_name` you determine the name of the folder that should be created in that directory to store the model.
save_to_disk(
object = bert_modeling,
dir_path = "examples",
folder_name = "bert_te_model"
)
In this example the model is saved in a folder at the location `examples/bert_te_model`. If you want to load your model, you can call `load_from_disk`.
bert_modeling <- load_from_disk("examples/bert_te_model")
4.6 Sustainability
In case the underlying model was trained with an active sustainability tracker (sections 3.2 and 3.3), you can receive a table showing the energy consumption, CO2 emissions, and hardware used during training by calling the method `get_sustainability_data()`. For our example this would be:
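# Returns a table with energy consumption, CO2 emissions, and the
# hardware used during training
bert_modeling$get_sustainability_data()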
4.7 Training History
If you would like to see the training history of the underlying base model you can call a special method.
bert_modeling$plot_training_history()
Please note that this plot is not available for this example since the necessary data is not directly available for this model on Hugging Face. If you train a model with this package the training history is always saved.
5 Classifiers
5.1 Overview
Classifiers are built on top of a `TextEmbeddingModel`. They use the embedded texts produced by these models and predict classes/categories. You can build your classifier with the help of two components.
First, you choose a core model. It determines where different layers are located and how the outputs of the different layers are combined into the final output of the model.

The sequential architecture (Figure 4) provides models where the input is passed to a specific number of layers step by step. All layers are grouped by their kind into stacks.
In contrast, the parallel architecture (Figure 5) offers a model where an input is passed to different types of layers separately. At the end the outputs are combined to create the final output of the whole model.

You can find the name of the core model used in the name of the classifier. For example, `TEClassifierSequential` uses a sequential core model, while `TEClassifierParallel` uses a parallel core model.
In general, all layers within a core model allow further customization, enabling you to build a large number of different models.
A detailed description of all layers can be found in vignette A01 Layers and Stack of Layers.
Second, you can choose how the core model is used for classification. At the moment, probability-based and metric-based classifiers are available.
- Probability Classifiers: Probability classifiers are used to predict a probability distribution for different classes/categories. This is the standard case most common in literature.
- Prototype Based Classifiers: Prototype based classifiers are a kind of metric-based classifier. Here the classifier does not predict a probability distribution. Instead, it calculates a prototype for every class/category and measures the distance between a case and all prototypes. The class/category of the prototype with the smallest distance to the case is assigned to that case. In contrast to probability classifiers, these models can handle classes/categories that were not part of the training. For more details please refer to section 6.1.
Please note that creating, training, and predicting works for all types of classifiers as described in the sections below.
5.2 Create a Classifier
To show you how to create a classifier, we use a classifier of class `TEClassifierSequential` as an example. With the sample data from section 2.3 and the text embeddings from section 4.3, the creation of a new classifier may look like:
classifier <- TEClassifierSequential$new()
classifier$configure(
label = "Classifier for Estimating a Postive or Negative Rating of Movie Reviews",
text_embeddings = review_embeddings,
feature_extractor = NULL,
target_levels = c("neg", "pos"),
skip_connection_type="ResidualGate",
cls_pooling_features=50,
cls_pooling_type="MinMax",
feat_act_fct="ELU",
feat_size=256,
feat_bias=TRUE,
feat_dropout=0.0,
feat_parametrizations="None",
feat_normalization_type="LayerNorm",
ng_conv_act_fct="ELU",
ng_conv_n_layers=0,
ng_conv_ks_min=2,
ng_conv_ks_max=4,
ng_conv_bias=FALSE,
ng_conv_dropout=0.1,
ng_conv_parametrizations="None",
ng_conv_normalization_type="LayerNorm",
ng_conv_residual_type="ResidualGate",
dense_act_fct="ELU",
dense_n_layers=1,
dense_dropout=0.5,
dense_bias=FALSE,
dense_parametrizations="None",
dense_normalization_type="LayerNorm",
dense_residual_type="ResidualGate",
rec_act_fct="Tanh",
rec_n_layers=2,
rec_type="GRU",
rec_bidirectional=FALSE,
rec_dropout=0.2,
rec_bias=FALSE,
rec_parametrizations="None",
rec_normalization_type="LayerNorm",
rec_residual_type="ResidualGate",
tf_act_fct="ELU",
tf_dense_dim=512,
tf_n_layers=0,
tf_dropout_rate_1=0.2,
tf_dropout_rate_2=0.5,
tf_attention_type="MultiHead",
tf_positional_type ="absolute",
tf_num_heads=1,
tf_bias=FALSE,
tf_parametrizations="None",
tf_normalization_type="LayerNorm",
tf_residual_type="ResidualGate"
)
Similarly to the text embedding model, you should provide a label (`label`) for your new classifier. With `text_embeddings` you have to provide a `LargeDataSetForTextEmbeddings`. The data set is created with a `TextEmbeddingModel` as described in section 4. Here we continue our example and use the embeddings produced by our BERT model.
`target_levels` takes the categories/classes your classifier should predict. These can be numbers or even words.
In case you would like to use ordinal data, it is very important that you provide the classes/categories in the correct order. That is, classes/categories representing a “higher” level must be stated after categories/classes with a “lower” level, as illustrated below. If you provide the wrong order, the performance indices are not valid. In case of nominal data the order does not matter.
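For example, a hypothetical ordinal performance scale would be stated like this:
# Hypothetical ordinal scale: lower levels must be stated before higher ones
target_levels <- c("poor", "average", "good")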
With `feature_extractor` you can add a feature extractor that tries to reduce the number of features of your text embeddings before passing the embeddings to the classifier. You can read more on this in section 6.2.
With the help of the other parameters you can define the complexity and abilities of your model. A description of the different models can be found in A01 Layers and Stack of Layers.
Please note that you have to choose the parameter `feat_size` depending on the number of features of the underlying text embedding model. You can request this number by calling the method `get_n_features` of the text embedding model used. In our example this would be:
bert_modeling$get_n_features()
#> [1] 768
The number for `feat_size` should be equal to or less than this value, since this layer tries to compress the text embeddings to a lower number of dimensions. While this reduces the number of parameters for all following layers and decreases the time needed to train and use the model, it can cost information. You can experiment with this value to find a good balance between speed and performance.
In addition, the parameter `cls_pooling_features` should be equal to or less than the number you used for `feat_size`. With `cls_pooling_features` you determine how many of the resulting features should be used for classification. Thus, this value acts as a filter.
The vignette 04 Model configuration provides details on how to configure a classifier.
In our example we use only two recurrent layers (`rec_n_layers = 2`) and one dense layer (`dense_n_layers = 1`). All other layers are omitted from the model by setting the number of layers to zero (`ng_conv_n_layers = 0`, `tf_n_layers = 0`).
As the pooling method we use minimum and maximum values (`cls_pooling_type = "MinMax"`). That is, the lowest and the highest feature values are used for calculating the classes/labels. The number of values used is determined with `cls_pooling_features = 50`, i.e. the 25 highest and the 25 lowest features.
After you have created a new classifier, you can begin training. You can see the number of learnable parameters of your model with:
classifier$count_parameter()
#> [1] 1049450
5.3 Training a Classifier
To start the training of your classifier, you have to call the `train` method. Similarly to the creation of the classifier, you must provide the text embeddings to `data_embeddings` and the categories/classes as target data to `data_targets`. Please remember that `data_targets` expects a named factor where the names correspond to the IDs of the corresponding text embeddings. Text embeddings and target data that cannot be matched are omitted from training.
For performance estimation, training splits the data into several chunks based on cross-fold validation. The number of folds is set with `data_folds`. In every case, one fold is not used for training and serves as a test sample. The remaining data is used to create a training and a validation sample. The percentage of cases within each fold used as a validation sample is determined with `data_val_size`. This sample is used to determine the state of the model that generalizes best. All performance values saved in the trained classifier refer to the test sample. This data has never been used during training and provides a more realistic estimation of a classifier's performance.
classifier$train(
data_embeddings = review_embeddings,
data_targets = review_labels,
data_folds = 10,
data_val_size = 0.25,
loss_balance_class_weights = TRUE,
loss_balance_sequence_length = TRUE,
loss_cls_fct_name="FocalLoss",
use_sc = FALSE,
sc_method = "knnor",
sc_min_k = 1,
sc_max_k = 10,
use_pl = FALSE,
pl_max_steps = 3,
pl_max = 1.00,
pl_anchor = 1.00,
pl_min = 0.00,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
epochs = 150,
batch_size = 32,
trace = TRUE,
ml_trace = 0,
log_dir = NULL,
log_write_interval = 10,
n_cores = auto_n_cores(),
lr_rate=1e-3,
lr_warm_up_ratio=0.02,
optimizer="AdamW"
)
#> Mon Aug 11 11:10:37 2025 Total Cases: 300 Unique Cases: 300 Labeled Cases: 225
#> Mon Aug 11 11:10:37 2025 Start
#> Mon Aug 11 11:10:39 2025 | Iteration 1 from 10
#> Mon Aug 11 11:10:39 2025 | Iteration 1 from 10 | Training
#> Mon Aug 11 11:10:56 2025 | Iteration 2 from 10
#> Mon Aug 11 11:10:56 2025 | Iteration 2 from 10 | Training
#> Mon Aug 11 11:11:13 2025 | Iteration 3 from 10
#> Mon Aug 11 11:11:13 2025 | Iteration 3 from 10 | Training
#> Mon Aug 11 11:11:29 2025 | Iteration 4 from 10
#> Mon Aug 11 11:11:29 2025 | Iteration 4 from 10 | Training
#> Mon Aug 11 11:11:46 2025 | Iteration 5 from 10
#> Mon Aug 11 11:11:46 2025 | Iteration 5 from 10 | Training
#> Mon Aug 11 11:12:03 2025 | Iteration 6 from 10
#> Mon Aug 11 11:12:03 2025 | Iteration 6 from 10 | Training
#> Mon Aug 11 11:12:20 2025 | Iteration 7 from 10
#> Mon Aug 11 11:12:20 2025 | Iteration 7 from 10 | Training
#> Mon Aug 11 11:12:37 2025 | Iteration 8 from 10
#> Mon Aug 11 11:12:37 2025 | Iteration 8 from 10 | Training
#> Mon Aug 11 11:12:53 2025 | Iteration 9 from 10
#> Mon Aug 11 11:12:53 2025 | Iteration 9 from 10 | Training
#> Mon Aug 11 11:13:10 2025 | Iteration 10 from 10
#> Mon Aug 11 11:13:10 2025 | Iteration 10 from 10 | Training
#> Mon Aug 11 11:13:27 2025 | Final training
#> Mon Aug 11 11:13:27 2025 | Final training | Training
#> Mon Aug 11 11:13:44 2025 Training Complete
You can further modify the training process with different arguments.
With `loss_balance_class_weights = TRUE` the absolute frequencies of the classes/categories are adjusted according to the ‘Inverse Class Frequency’ method. This option should be activated if you have to deal with imbalanced data.
With `loss_balance_sequence_length = TRUE` you can increase performance if you have to deal with texts that differ in their lengths and have an imbalanced frequency. If this option is enabled, the loss is adjusted to the absolute frequencies of the lengths of your texts according to the ‘Inverse Class Frequency’ method.
`epochs` determines the maximal number of epochs. During training, the model with the best balanced accuracy is saved and used.
`batch_size` sets the number of cases that should be processed simultaneously. Please adjust this value to your machine's capacities. Please note that the batch size can have an impact on the classifier's performance.
Since aifeducation tries to address the special needs of educational and social science, some special training steps are integrated into this method.
- Synthetic Cases: In case of imbalanced data, it is recommended to set `use_sc = TRUE`. Before training, a number of synthetic units is created via different techniques. Currently you can request the K-Nearest Neighbor OveRsampling approach (KNNOR) developed by Islam et al. (2022). The aim is to create new cases that fill the gap to the majority class. For generating these units, multi-class problems are reduced to a two-class problem (class under investigation vs. all others). If the technique allows setting the number of neighbors during generation, you can configure the data generation with `sc_min_k` and `sc_max_k`. The synthetic cases for every class are generated for all k between `sc_min_k` and `sc_max_k`. Every k contributes proportionally to the synthetic cases.
- Pseudo-Labeling: This technique is relevant if you have labeled target data and a large number of unlabeled target data. With the different parameters starting with “pl_”, you can configure the process of pseudo-labeling. The implementation of pseudo-labeling is based on Cascante-Bonilla et al. (2020). To apply pseudo-labeling, you have to set `use_pl = TRUE`. `pl_max = 1.00`, `pl_anchor = 1.00`, and `pl_min = 0.00` are used to describe the certainty of a prediction: 0 refers to random guessing while 1 refers to perfect certainty. `pl_anchor` is used as a reference value. The distance to `pl_anchor` is calculated for every case; then the cases are sorted by increasing distance from `pl_anchor`. The proportion of pseudo-labeled data added to the training increases with every step; the maximum number of steps is determined with `pl_max_steps`. Cases close to `pl_anchor` are included first.
Figure 6 illustrates the training loop for the case that all options are set to `TRUE`.

The example above applies the generation of synthetic cases and the algorithm proposed by Cascante-Bonilla et al. (2020). For every fold, the training starts with generating synthetic cases to fill the gap between the minority classes and the majority class. After this, an initial training of the classifier starts. The trained classifier is used to predict pseudo-labels for the unlabeled part of the data, and the 20% of cases with the highest certainty for their pseudo-labels are added to the training data set. Now new synthetic cases are generated based on both the labeled data and the newly added pseudo-labeled data. The classifier is re-initialized and trained again. After training, the classifier predicts the potential labels of all originally unlabeled data, and the 40% of pseudo-labeled cases with the highest certainty are added to the training data. Again, new synthetic cases are generated on both the labeled and the added pseudo-labeled data. The model is again re-initialized and trained, and this repeats until the maximum number of steps for pseudo-labeling (`pl_max_steps`) is reached. After this, the algorithm is restarted for the next fold until the number of folds (`data_folds`) is reached. All of these steps are only used to estimate the performance of the classifier on unknown data.
The last phase of the training begins after the last fold. In the final training, the data set is split only into a training and validation set without a test set to provide the maximum amount of data for the best performance in final training. All configurations of the performance estimation phase are used in the final training phase.
Since training a neural net is energy consuming, aifeducation allows you to estimate its ecological impact with the help of the python library `codecarbon`. Thus, `sustain_track` is set to `TRUE` by default. If you use the sustainability tracker, you must provide the alpha-3 code for the country where your computer is located (e.g., “CAN” = Canada, “DEU” = Germany). A list with the codes can be found on Wikipedia. The reason is that different countries use different sources and techniques for generating their energy, resulting in a specific impact on CO2 emissions. For the USA and Canada, you can additionally specify a region by setting `sustain_region`. Please refer to the documentation of `codecarbon` for more information.
Finally, `trace` and `ml_trace` allow you to control how much information about the training progress is printed to the console.
Please note that training the classifier can take some time. In case options like the generation of synthetic cases (`use_sc`) or pseudo-labeling (`use_pl`) are disabled, the training process is shorter.
Please note that after performance estimation, the final training of the classifier makes use of all data available. That is, the test sample is left empty.
In order to visualize the progress of training you can request a plot which shows how important performance measures develop over epochs.
classifier$plot_training_history(
final_training=FALSE,
pl_step=NULL,
measure="loss",
y_min=NULL,
y_max=NULL,
add_min_max=TRUE,
text_size=10)

Figure 7: Training History of a Classifier
For this method it is important to decide which performance measure you would like to plot. For classifiers, loss (`"loss"`), accuracy (`"accuracy"`), and balanced accuracy (`"balanced_accuracy"`) are possible. If you would like to see only the development of the last training phase, set `final_training = TRUE`. If this parameter is `FALSE`, the plot uses the data generated during the performance estimation phase.
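For example, a call mirroring the one above, but plotting the balanced accuracy of the final training phase:
classifier$plot_training_history(
  final_training = TRUE,
  pl_step = NULL,
  measure = "balanced_accuracy",
  y_min = NULL,
  y_max = NULL,
  add_min_max = TRUE,
  text_size = 10
)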
5.4 Evaluating Classifier’s Performance
After finishing training, you can evaluate the performance of the classifier. For every fold, the classifier is applied to the test sample and the results are compared to the true categories/classes. Since the test sample is never part of the training, all performance measures provide a more realistic idea of the classifier’s performance.
To support researchers in judging the quality of the predictions, aifeducation utilizes several measures and concepts from content analysis. These are
- Iota Concept of the Second Generation (Berding & Pargmann 2022).
- Krippendorff’s Alpha (Krippendorff 2019).
- Percentage Agreement.
- Gwet’s AC1/AC2 (Gwet 2014).
- Kendall’s coefficient of concordance W.
- Cohen’s Kappa unweighted (Cohen 1960).
- Cohen’s Kappa with equal weights (Cohen 1968).
- Cohen’s Kappa with squared weights (Cohen 1968).
- Fleiss’ Kappa for multiple raters without exact estimation (Fleiss 1971).
You can access the concrete values as mean values across all folds via `reliability$test_metric_mean`. In our example this would be:
classifier$reliability$test_metric_mean
#> iota_index min_iota2 avg_iota2
#> 0.5784585 0.5061328 0.6170804
#> max_iota2 min_alpha avg_alpha
#> 0.7280280 0.6213095 0.7543452
#> max_alpha static_iota_index dynamic_iota_index
#> 0.8873810 0.2767836 0.4880141
#> kalpha_nominal kalpha_ordinal kendall
#> 0.5171082 0.5171082 0.7634111
#> c_kappa_unweighted c_kappa_linear c_kappa_squared
#> 0.5127365 0.5127365 0.5127365
#> kappa_fleiss percentage_agreement balanced_accuracy
#> 0.5061392 0.7869565 0.7543452
#> gwet_ac1_nominal gwet_ac2_linear gwet_ac2_quadratic
#> 0.6218700 0.6218700 0.6218700
#> avg_precision avg_recall avg_f1
#> 0.7780917 0.7543452 0.7530696
In addition, standard measures from machine learning are reported. These are:
- Precision
- Recall
- F1-Score
You can access these values as follows:
classifier$reliability$standard_measures_mean
#> precision recall f1
#> neg 0.7170851 0.6553571 0.6654309
#> pos 0.8390982 0.8533333 0.8407083
Finally, you can plot a coding stream scheme showing how the cases of different classes are labeled.
classifier$plot_coding_stream()

Figure 8: Coding Stream of the Classifier
Evaluating the performance of a classifier is a complex task and beyond the scope of this vignette. Instead, we refer to the cited literature on content analysis and machine learning if you would like to dive deeper into this topic.
5.5 Sustainability
In case the classifier was trained with an active sustainability
tracker, you can receive information on sustainability by calling
classifier$get_sustainability_data()
.
classifier$get_sustainability_data()
#> $sustainability_tracked
#> [1] TRUE
#>
#> $date
#> [1] "Mon Aug 11 11:13:44 2025"
#>
#> $sustainability_data
#> $sustainability_data$duration_sec
#> [1] 185.3819
#>
#> $sustainability_data$co2eq_kg
#> [1] 0.001557352
#>
#> $sustainability_data$cpu_energy_kwh
#> [1] 0.002188148
#>
#> $sustainability_data$gpu_energy_kwh
#> [1] 0.00138514
#>
#> $sustainability_data$ram_energy_kwh
#> [1] 0.0005147879
#>
#> $sustainability_data$total_energy_kwh
#> [1] 0.004088076
#>
#>
#> $technical
#> $technical$tracker
#> [1] "codecarbon"
#>
#> $technical$py_package_version
#> [1] "3.0.2"
#>
#> $technical$cpu_count
#> [1] 12
#>
#> $technical$cpu_model
#> [1] "12th Gen Intel(R) Core(TM) i5-12400F"
#>
#> $technical$gpu_count
#> [1] 1
#>
#> $technical$gpu_model
#> [1] "1 x NVIDIA GeForce RTX 4070"
#>
#> $technical$ram_total_size
#> [1] 15.84258
#>
#>
#> $region
#> $region$country_name
#> [1] "Germany"
#>
#> $region$country_iso_code
#> [1] "DEU"
#>
#> $region$region
#> [1] NA
5.6 Saving and Loading a Classifier
Saving and loading follows the same pattern as for the other objects
in aifeducation. You can save the classifier by calling
save_to_disk
. In our example this may be:
save_to_disk(
object = classifier,
dir_path = "examples",
folder_name = "cls_imdb_movie_reviews"
)
The classifier is saved to examples/cls_imdb_movie_reviews. To load the model, call load_from_disk.
classifier <- load_from_disk("examples/cls_imdb_movie_reviews")
5.7 Predicting New Data
If you would like to apply your classifier to new data, two steps are necessary. You must first transform the raw text into a numerical representation by using exactly the same text embedding model that was used to train your classifier (see section 4). In the case of our example classifier, we use our BERT model.
# If our model is not loaded
bert_modeling <- load_from_disk("examples/bert_te_model")
# Create a numerical representation of the text
review_embeddings <- bert_modeling$embed_large(
large_datas_set = data_set_reviews_text,
trace = TRUE
)
#> Mon Aug 11 11:13:47 2025 Batch 1 / 10 done
#> Mon Aug 11 11:13:48 2025 Batch 2 / 10 done
#> Mon Aug 11 11:13:49 2025 Batch 3 / 10 done
#> Mon Aug 11 11:13:50 2025 Batch 4 / 10 done
#> Mon Aug 11 11:13:51 2025 Batch 5 / 10 done
#> Mon Aug 11 11:13:52 2025 Batch 6 / 10 done
#> Mon Aug 11 11:13:53 2025 Batch 7 / 10 done
#> Mon Aug 11 11:13:53 2025 Batch 8 / 10 done
#> Mon Aug 11 11:13:54 2025 Batch 9 / 10 done
#> Mon Aug 11 11:13:55 2025 Batch 10 / 10 done
To transform raw texts into a numeric representation, just pass the raw texts to the method embed_large of the loaded model. The raw texts should be an object of class LargeDataSetForText. To create such a data set, please refer to section 2.
The resulting object can then be passed to the method predict of our classifier, and you will get the predictions together with an estimate of certainty for each class/category.
# If your classifier is not loaded
classifier <- load_from_disk("examples/cls_imdb_movie_reviews")
# Predict the classes of new texts
predicted_categories <- classifier$predict(
newdata = review_embeddings,
batch_size = 8
)
# Show predicted categories
head(predicted_categories)
#> neg pos expected_category
#> 11329 0.5059361 0.4940639 neg
#> 10460 0.7263541 0.2736459 neg
#> 10536 0.7444522 0.2555478 neg
#> 12035 0.7413885 0.2586115 neg
#> 7267 0.5032131 0.4967869 neg
#> 4761 0.7389779 0.2610221 neg
# Count frequencies
table(predicted_categories$expected_category)
#>
#> neg pos
#> 115 185
After the classifier finishes the prediction, the estimated categories/classes are stored as predicted_categories. This object is a data.frame containing the texts’ IDs in the rows and the probabilities of the different categories/classes in the columns. The last column, named expected_category, represents the category assigned to a text because it has the highest probability.
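If your further analysis needs the certainty of each assignment, you can, for example, attach the probability of the assigned category to every row. A small base R sketch, using only the columns shown above:
# Attach the probability of the assigned category to every row
probs <- predicted_categories[, c("neg", "pos")]
predicted_categories$certainty <- apply(probs, 1, max)
head(predicted_categories)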
The estimates can be used in further analysis with common methods of the educational and social sciences such as correlation analysis, regression analysis, structural equation modeling, latent class analysis or analysis of variance.
Now you are ready to use aifeducation. In section 6 we describe further models for classification tasks and for improving model performance.
6 Extensions
6.1 Classifiers: ProtoNet
6.1.1 Introduction
The classifier introduced in section 5 is a regular classifier which comes with the traditional challenges of deep learning, such as the need for a large amount of training data, expensive hardware requirements, and only a limited possibility to interpret the model’s parameters (Jadon & Garg 2020, pp. 13-14). Since data is a bottleneck in the educational and social sciences, a classifier that can work with only small data sets would be preferable. These types of models are discussed in the literature under terms such as “meta-learning” (Zou 2023) or “few-shot learning” (Jadon & Garg 2020). The basic idea behind these approaches is that the model learns to use a supporting data set to predict the output for a query data set (e.g., Zou 2023, pp. 2-3). However, the model is not explicitly trained for the query data set.
One type of model within this area is the Prototypical Network (ProtoNet), initially proposed by Snell, Swersky, and Zemel (2017). This type of network was developed to create classifiers that are able to generalize to new classes that the model did not see during training, using only the information of a few examples of each class provided to the network (support data set). To achieve this goal, the networks learn to create a prototype for every class in the support data set with the help of the examples for every class. Then, the network compares the new data with these prototypes and assigns the class of the nearest prototype to the new data. Since the network calculates the distance of every new case to every prototype, it belongs to the metric-based meta-learning approaches (Zou 2023, p. 48).
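To make this idea concrete, here is a tiny, purely illustrative base R sketch (two features, Euclidean distance; not the package’s implementation): prototypes are the mean embedding of each class’s support examples, and a query case receives the class of the nearest prototype.
# Support examples with two features and their classes
support <- rbind(c(0, 0), c(0, 1), c(4, 4), c(5, 4))
support_classes <- c("neg", "neg", "pos", "pos")
# A prototype is the mean of the support embeddings of its class
prototypes <- t(sapply(split(as.data.frame(support), support_classes), colMeans))
# A query case is assigned to the class of the nearest prototype
query <- c(4.5, 3.5)
dists <- apply(prototypes, 1, function(p) sqrt(sum((query - p)^2)))
names(which.min(dists)) # "pos"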
Since ProtoNet is a simple, easy-to-understand approach that provides good performance, several extensions have been suggested. aifeducation replaces the original loss function with the loss function suggested by Zhang et al. (2019) and adds the learnable metric described by Oreshkin, Rodriguez, and Lacoste (2018) to increase performance.
6.1.2 Configuration, Training, and Application without Sample Data
The application of a classifier based on ProtoNet is similar to the regular classifiers. The only difference is embedding_dim. A ProtoNet classifier uses a network to project the similarities and differences between the single cases and all prototypes into an n-dimensional space. Similar cases are located near each other, while different cases are located further away. The number of dimensions of this space is determined by embedding_dim. In case embedding_dim is set to 1, 2, or 3, the position of every case and the prototypes can be easily visualized. For this example we use the same data as in section 5. Let us first create and configure the new classifier.
classifier_prototype <- TEClassifierSequentialPrototype$new()
classifier_prototype$configure(
label = "ProtoNet classifier for Estimating a Postive or Negative Rating of Movie Reviews",
text_embeddings = review_embeddings,
feature_extractor = NULL,
target_levels = c("neg", "pos"),
skip_connection_type = "ResidualGate",
cls_pooling_features = 50,
cls_pooling_type = "MinMax",
metric_type = "Euclidean",
feat_act_fct = "ELU",
feat_size = 256,
feat_bias = TRUE,
feat_dropout = 0.0,
feat_parametrizations = "None",
feat_normalization_type = "LayerNorm",
ng_conv_act_fct = "ELU",
ng_conv_n_layers = 0,
ng_conv_ks_min = 2,
ng_conv_ks_max = 4,
ng_conv_bias = FALSE,
ng_conv_dropout = 0.1,
ng_conv_parametrizations = "None",
ng_conv_normalization_type = "LayerNorm",
ng_conv_residual_type = "ResidualGate",
dense_act_fct = "ELU",
dense_n_layers = 1,
dense_dropout = 0.2,
dense_bias = FALSE,
dense_parametrizations = "None",
dense_normalization_type = "LayerNorm",
dense_residual_type = "ResidualGate",
rec_act_fct = "Tanh",
rec_n_layers = 2,
rec_type = "GRU",
rec_bidirectional = FALSE,
rec_dropout = 0.2,
rec_bias = FALSE,
rec_parametrizations = "None",
rec_normalization_type = "LayerNorm",
rec_residual_type = "ResidualGate",
tf_act_fct = "ELU",
tf_dense_dim = 512,
tf_n_layers = 0,
tf_dropout_rate_1 = 0.1,
tf_dropout_rate_2 = 0.2,
tf_attention_type = "MultiHead",
tf_positional_type = "absolute",
tf_num_heads = 1,
tf_bias = FALSE,
tf_parametrizations = "None",
tf_normalization_type = "LayerNorm",
tf_residual_type = "ResidualGate",
embedding_dim = 2
)
Now we can plot how the untrained classifier embeds the different cases and the prototypes. To create the corresponding plot, you can call the method plot_embeddings. The argument embeddings_q takes the embeddings of the different cases as the input of the classifier. In case you have the true classes for all or some of the cases, you can add them to the plot by using the argument classes_q. The resulting plot is shown in the following figure.
plot_untrained <- classifier_prototype$plot_embeddings(
embeddings_q = review_embeddings,
classes_q = review_labels,
inc_margin = FALSE
)
plot_untrained

Figure 9: Embeddings of an untrained classifier of type ‘ProtoNet’
The large triangles represent the prototypes for every class while the dots refer to the labeled cases in the data set. For these, the color represents their true class. For unlabeled cases, a square is used. Here, the color indicates the estimated class. As you can see, all cases are located very similarly and there seems to be no clear structure. Let us see how this changes when we train the model.
classifier_prototype$train(
data_embeddings = review_embeddings,
data_targets = review_labels,
data_folds = 10,
data_val_size = 0.25,
loss_pt_fct_name = "MultiWayContrastiveLoss",
use_sc = FALSE,
sc_method = "knnor",
sc_min_k = 1,
sc_max_k = 10,
use_pl = FALSE,
pl_max_steps = 3,
pl_max = 1.00,
pl_anchor = 1.00,
pl_min = 0.00,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
epochs = 300,
batch_size = 35,
Ns = 5,
Nq = 3,
loss_alpha = 0.5,
loss_margin = 0.05,
sampling_separate = FALSE,
sampling_shuffle = TRUE,
trace = TRUE,
ml_trace = 0,
log_dir = NULL,
log_write_interval = 10,
n_cores = auto_n_cores(),
lr_rate = 1e-3,
lr_warm_up_ratio = 0.02,
optimizer = "AdamW"
)
#> Mon Aug 11 11:13:56 2025 Total Cases: 300 Unique Cases: 300 Labeled Cases: 225
#> Mon Aug 11 11:13:56 2025 Start
#> Mon Aug 11 11:13:58 2025 | Iteration 1 from 10
#> Mon Aug 11 11:13:58 2025 | Iteration 1 from 10 | Training
#> Mon Aug 11 11:14:20 2025 | Iteration 2 from 10
#> Mon Aug 11 11:14:20 2025 | Iteration 2 from 10 | Training
#> Mon Aug 11 11:14:31 2025 | Iteration 3 from 10
#> Mon Aug 11 11:14:31 2025 | Iteration 3 from 10 | Training
#> Mon Aug 11 11:14:55 2025 | Iteration 4 from 10
#> Mon Aug 11 11:14:55 2025 | Iteration 4 from 10 | Training
#> Mon Aug 11 11:15:28 2025 | Iteration 5 from 10
#> Mon Aug 11 11:15:28 2025 | Iteration 5 from 10 | Training
#> Mon Aug 11 11:15:40 2025 | Iteration 6 from 10
#> Mon Aug 11 11:15:40 2025 | Iteration 6 from 10 | Training
#> Mon Aug 11 11:15:54 2025 | Iteration 7 from 10
#> Mon Aug 11 11:15:54 2025 | Iteration 7 from 10 | Training
#> Mon Aug 11 11:16:24 2025 | Iteration 8 from 10
#> Mon Aug 11 11:16:24 2025 | Iteration 8 from 10 | Training
#> Mon Aug 11 11:16:41 2025 | Iteration 9 from 10
#> Mon Aug 11 11:16:41 2025 | Iteration 9 from 10 | Training
#> Mon Aug 11 11:17:45 2025 | Iteration 10 from 10
#> Mon Aug 11 11:17:45 2025 | Iteration 10 from 10 | Training
#> Mon Aug 11 11:18:02 2025 | Final training
#> Mon Aug 11 11:18:02 2025 | Final training | Training
#> Mon Aug 11 11:18:19 2025 Training Complete
While there are no arguments for requesting a balance of the class weights (loss_balance_class_weights) or balancing the sequence length (loss_balance_sequence_length), four new arguments are available. With Ns you determine how many examples of every class should be used during training within the support sample. These examples are used to calculate the prototypes for every class. With Nq you determine how many examples of every class should be part of the query sample. During training, the network tries to predict the correct classes of the query sample.
The arguments loss_alpha and loss_margin refer to the configuration of the loss function described by Zhang et al. (2019). loss_margin refers to the minimal distance all examples of the query sample should have to all prototypes that do not represent their class. loss_alpha determines whether the loss should pay more attention to minimizing the distance between the examples and their corresponding prototype or to maximizing the distance to the prototypes that do not represent their class. If you set loss_alpha=1, the loss tries to minimize the distance of the examples to their corresponding prototype. If you set loss_alpha=0, the loss tries to maximize the distance of all examples to all prototypes that do not reflect their class.
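The following toy function is only a schematic illustration of this trade-off and not the exact loss of Zhang et al. (2019): loss_alpha weights a “pull” term toward the own prototype against a “push” term that penalizes prototypes of other classes lying closer than loss_margin.
# Schematic illustration of how alpha and margin could shape such a loss
toy_proto_loss <- function(d_own, d_other, alpha = 0.5, margin = 0.05) {
  pull <- d_own # distance to the case's own prototype
  push <- pmax(0, margin - d_other) # penalty if another prototype is too close
  alpha * pull + (1 - alpha) * push
}
toy_proto_loss(d_own = 0.2, d_other = 0.01)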
The next two important arguments refer to the sampling strategies during training. With sampling_separate=FALSE, cases for sample and query are drawn from the same pool of cases. Thus, a specific case can be a sample case in one epoch and a query case in another epoch. However, it is ensured that a specific case never occurs as both a sample and a query case during the same training step. In addition, it is ensured that every case exists only once during a training step. If you set sampling_separate=TRUE, the training data set is split into one data pool for sample and one data pool for query. Thus, a case can only be either a sample case or a query case. With sampling_shuffle you can request that for every training step a random sample is chosen from the training data set, resulting in different combinations of sample and query cases. For the training we highly recommend setting sampling_shuffle=TRUE, since this will result in better performing classifiers.
During training the model generates prototypes based on all available data and classes of the training data set (sample and query). These classes and prototypes are used in case no sample data is provided.
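As a purely illustrative sketch of one such episodic training step (not the package’s sampler), the following base R function draws Ns support and Nq query cases per class without reusing a case within the same step:
# Illustrative episodic sampling with Ns support and Nq query cases per class
draw_episode <- function(labels, Ns = 5, Nq = 3) {
  lapply(split(seq_along(labels), labels), function(i) {
    picked <- sample(i, Ns + Nq) # a case is used at most once per step
    list(support = picked[1:Ns], query = picked[(Ns + 1):(Ns + Nq)])
  })
}
set.seed(42)
str(draw_episode(rep(c("neg", "pos"), each = 20)))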
After training we can request a visualization of the data again. We first omit all unlabeled cases by setting inc_unlabeled=FALSE in order to get an impression of the quality of training.
plot_trained_1 <- classifier_prototype$plot_embeddings(
embeddings_q = review_embeddings,
classes_q = review_labels,
inc_unlabeled = FALSE
)
plot_trained_1

Figure 10: Embeddings of a trained classifier of type ‘ProtoNet’ without unlabeled cases
As shown in the figure, all cases are now sorted. Cases of the class “neg” are located close to the prototype for “neg”, while cases of the class “pos” are located near the prototype for “pos”. The black circle around the prototypes represents the margin used during training (loss_margin). Since we use the same data as during training, this result is to be expected. Only a small number of cases are located near the wrong prototype. This can be seen when a red dot is close to the prototype for “pos” or a blue dot is close to the prototype for “neg”.
Let us now add the unlabeled cases to the plot by setting inc_unlabeled=TRUE.
As the following figure shows, the model estimates the class of these cases according to their distance to the two prototypes. Cases that are close to the prototype for “pos” are assigned to “pos”, while cases near the prototype for “neg” are assigned to “neg”.
plot_trained_2 <- classifier_prototype$plot_embeddings(
embeddings_q = review_embeddings,
classes_q = review_labels,
inc_unlabeled = TRUE
)
plot_trained_2

Figure 11: Embeddings of a trained classifier of type ‘ProtoNet’ including unlabeled cases
Finally, let us report the reliability of this classifier.
classifier_prototype$reliability$test_metric_mean
#> iota_index min_iota2 avg_iota2
#> 0.6731225 0.6205375 0.7017015
#> max_iota2 min_alpha avg_alpha
#> 0.7828655 0.7279762 0.8205952
#> max_alpha static_iota_index dynamic_iota_index
#> 0.9132143 0.3505739 0.5784623
#> kalpha_nominal kalpha_ordinal kendall
#> 0.6407910 0.6407910 0.8232339
#> c_kappa_unweighted c_kappa_linear c_kappa_squared
#> 0.6360612 0.6360612 0.6360612
#> kappa_fleiss percentage_agreement balanced_accuracy
#> 0.6326775 0.8365613 0.8205952
#> gwet_ac1_nominal gwet_ac2_linear gwet_ac2_quadratic
#> 0.7042844 0.7042844 0.7042844
#> avg_precision avg_recall avg_f1
#> 0.8286150 0.8205952 0.8163388
6.1.3 Application with Sample Data
Up to this point we did not provide sample data. Thus, the model used the classes and prototypes available during training. This is not the regular use case for this kind of model. In general, both a query and a sample data set are given to the classifier. We describe how this works in the following paragraphs.
To illustrate this process, we modify the data from section 2.3. We label some of the positive reviews as “very positive” and some of the negative reviews as “very negative”. Thus, we increase the number of classes/categories from 2 to 4.
example_data <- imdb_movie_reviews
example_data$label <- as.character(example_data$label)
# Relabel some reviews and remove some labels to create four classes plus unlabeled cases
example_data$label[c(1:15)] <- "very negative"
example_data$label[c(76:100)] <- NA
example_data$label[c(201:250)] <- NA
example_data$label[c(251:260)] <- "very positive"
example_data$label <- factor(example_data$label)
table(example_data$label, useNA = "ifany")
#>
#> neg pos very negative very positive <NA>
#> 60 140 15 10 75
Our aim is now to predict the 75 cases without labels with the help of our trained model, although the model was trained for only two classes. Thus, we first have to split the data. The cases with classes/categories form the sample set, and the cases without any classes/categories form the query set.
sample_set_raw <- subset(example_data, !is.na(example_data$label))
sample_classes <- sample_set_raw$label
table(sample_classes, useNA = "ifany")
#> sample_classes
#> neg pos very negative very positive
#> 60 140 15 10
sample_texts <- LargeDataSetForText$new()
sample_texts$add_from_data.frame(sample_set_raw)
query_set_raw <- subset(example_data, is.na(example_data$label))
table(query_set_raw$label, useNA = "ifany")
#>
#> neg pos very negative very positive <NA>
#> 0 0 0 0 75
query_texts <- LargeDataSetForText$new()
query_texts$add_from_data.frame(query_set_raw)
Now we must create text embeddings for the query and the sample data set, using the same embedding model as we used for training the classifier.
sample_embeddings <- bert_modeling$embed_large(
large_datas_set = sample_texts,
trace = TRUE
)
#> Mon Aug 11 11:18:21 2025 Batch 1 / 8 done
#> Mon Aug 11 11:18:22 2025 Batch 2 / 8 done
#> Mon Aug 11 11:18:23 2025 Batch 3 / 8 done
#> Mon Aug 11 11:18:24 2025 Batch 4 / 8 done
#> Mon Aug 11 11:18:25 2025 Batch 5 / 8 done
#> Mon Aug 11 11:18:26 2025 Batch 6 / 8 done
#> Mon Aug 11 11:18:27 2025 Batch 7 / 8 done
#> Mon Aug 11 11:18:27 2025 Batch 8 / 8 done
query_embeddings <- bert_modeling$embed_large(
large_datas_set = query_texts,
trace = TRUE
)
#> Mon Aug 11 11:18:28 2025 Batch 1 / 3 done
#> Mon Aug 11 11:18:29 2025 Batch 2 / 3 done
#> Mon Aug 11 11:18:29 2025 Batch 3 / 3 done
Now we are ready to use our classifier. First, we predict the classes of the query sample.
new_predictions <- classifier_prototype$predict_with_samples(
newdata = query_embeddings,
embeddings_s = sample_embeddings,
classes_s = sample_classes
)
head(new_predictions)
#> neg pos very negative very positive expected_category
#> 8016 0.2545253 0.2450940 0.2551192 0.2452614 very negative
#> 9118 0.2546990 0.2456401 0.2538508 0.2458101 neg
#> 6376 0.2544617 0.2450445 0.2552831 0.2452107 very negative
#> 5766 0.2544981 0.2449839 0.2553656 0.2451523 very negative
#> 5942 0.2544968 0.2450003 0.2553333 0.2451696 very negative
#> 478 0.2548592 0.2452956 0.2543800 0.2454651 neg
table(new_predictions$expected_category)
#>
#> neg pos very negative very positive
#> 13 37 11 14
Although we trained our classifier for two classes, we can now use it to predict other classes. Next, we plot the results.
plot_with_samples<-classifier_prototype$plot_embeddings(
embeddings_q = query_embeddings,
embeddings_s = sample_embeddings,
classes_s = sample_classes,
inc_unlabeled = TRUE,
inc_margin = FALSE
)
plot_with_samples

Figure 12: Embeddings of a classifier of type ‘ProtoNet’ trained for two classes and applied to a sample set with four classes.
6.2 Feature Extractors
Another option to increase a model’s performance and/or to increase computational speed is to apply a feature extractor. For example, the work by Ganesh et al. (2021) indicates that a reduction of the hidden size can increase a model’s accuracy. In aifeducation, a feature extractor is a model that tries to reduce the number of features of given text embeddings before feeding the embeddings as input to a classifier.
The feature extractors implemented in aifeducation are auto-encoders that support sequential data and sequences of different length. The basic architecture of all extractors is shown in the following figure.

The learning objective of the feature extractors is first to compress information by reducing the number of features to the number of features of the latent space (Frochte 2019, p. 281). In the figure above, this would mean reducing the number of features from 8 to 4 and storing as much information as possible from the 8 dimensions in only 4 dimensions. In the next step, the extractor tries to reconstruct the original information from the compressed information of the latent space (Frochte 2019, pp. 280-281). The information is extended from 4 dimensions to 8. After training, the hidden representation of the latent space is used as a compression of the original input.
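As a minimal numeric sketch of this idea (random weights, no training; not the package’s implementation), the following base R lines project 8 features into a 4-dimensional latent space and reconstruct them:
# 10 cases with 8 features
set.seed(1)
x <- matrix(rnorm(10 * 8), nrow = 10)
W_enc <- matrix(rnorm(8 * 4), nrow = 8) # encoder: 8 -> 4
W_dec <- matrix(rnorm(4 * 8), nrow = 4) # decoder: 4 -> 8
latent <- x %*% W_enc # compressed representation (latent space)
x_hat <- latent %*% W_dec # reconstruction
mean((x - x_hat)^2) # reconstruction (MSE) loss to be minimized during training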
You can create a feature extractor as follows. In this example we use the text embeddings from section 4.
feature_extractor <- TEFeatureExtractor$new()
feature_extractor$configure(
name = "feature_extractor_bert_movie_reviews",
label = "Feature extractor for Text Embeddings via BERT",
text_embeddings = review_embeddings,
features = 576,
method = "Dense",
noise_factor = 0.2
)
Similarly to the other models, you can use label for the model’s label. The argument text_embeddings takes an object of class EmbeddedText or LargeDataSetForTextEmbeddings. With this object you connect your feature extractor with a specific TextEmbeddingModel. That is, the feature extractor works only with embeddings from exactly the same TextEmbeddingModel.
features determines the number of features of the compressed representation. The lower the number, the higher the requested compression. This value corresponds to the features of the latent space in the figure above.
You should choose this value depending on the number of features of the underlying text embedding model. You can request this value by calling the method get_n_features like this:
bert_modeling$get_n_features()
#> [1] 768
In our example we reduce the number of dimensions by 25 %, from 768 to 576 features.
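Expressed in code, you can derive the latent size directly from the embedding model:
# Keep 75 % of the original features: 0.75 * 768 = 576
0.75 * bert_modeling$get_n_features()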
With method you determine the type of layer the feature extractor should use. If you set method="LSTM", all layers of the model are long short-term memory layers. If you set method="Dense", all layers are standard dense layers. Independently of your choice, all models try to generate the latent space such that the covariance of the features is zero. Thus, all features represent unique information. In addition, all methods except "LSTM" use an orthogonal parametrization to prevent over-fitting and apply parameter sharing: the opposite layers use the same parameters. For more details please refer to Ranjan (2019).
With noise_factor you can add some noise during training, turning the feature extractor into a denoising auto-encoder, which can provide more robust generalizations.
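Illustrative only (not the package’s training loop): with noise_factor = 0.2, the input is corrupted before encoding, while the reconstruction target remains the clean input.
# Corrupt the input with scaled Gaussian noise (noise_factor = 0.2)
set.seed(2)
x_clean <- matrix(rnorm(5 * 8), nrow = 5)
x_noisy <- x_clean + 0.2 * matrix(rnorm(5 * 8), nrow = 5)
# a denoising auto-encoder learns to map x_noisy back to x_clean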
Training the extractor is identical to the other models in aifeducation. Please note that the text embeddings provided to data_embeddings must be generated with the same TextEmbeddingModel as the embeddings provided during the configuration of your model.
feature_extractor$train(
data_embeddings = review_embeddings,
data_val_size = 0.25,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
epochs = 200,
batch_size = 32,
trace = TRUE,
ml_trace = 0,
optimizer = "AdamW"
)
#> Mon Aug 11 11:18:30 2025 Start
#> Mon Aug 11 11:45:14 2025 Training finished
In this example we use the same text embeddings as we used to train the classifier. It can be beneficial to use a larger sample of texts for training a TEFeatureExtractor to improve performance and/or to allow a broad application of the feature extractor.
You can plot the training history of the model with:
feature_extractor$plot_training_history()

Figure 13: Training History of a Feature Extractor.
After you have trained your feature extractor, you can use it for every classifier. Just pass the feature extractor to feature_extractor during the configuration of the classifier. Please note that you now have to set the values for feat_size and cls_pooling_features depending on the number of features of the feature extractor and not on the number of features of the text embedding model, since the aim of the feature extractor is to reduce this number. feat_size should be equal to or less than the number of features of the feature extractor, and cls_pooling_features should be equal to or less than the value of feat_size. For the classifier described in section 5 this would look like:
classifier_with_fe <- TEClassifierSequential$new()
classifier_with_fe$configure(
label = "Classifier for Estimating a Postive or Negative Rating of Movie Reviews",
text_embeddings = review_embeddings,
feature_extractor = feature_extractor,
target_levels = c("neg", "pos"),
skip_connection_type="ResidualGate",
cls_pooling_features=50,
cls_pooling_type="MinMax",
feat_act_fct="ELU",
feat_size=256,
feat_bias=TRUE,
feat_dropout=0.0,
feat_parametrizations="None",
feat_normalization_type="LayerNorm",
ng_conv_act_fct="ELU",
ng_conv_n_layers=0,
ng_conv_ks_min=2,
ng_conv_ks_max=4,
ng_conv_bias=FALSE,
ng_conv_dropout=0.1,
ng_conv_parametrizations="None",
ng_conv_normalization_type="LayerNorm",
ng_conv_residual_type="ResidualGate",
dense_act_fct="ELU",
dense_n_layers=1,
dense_dropout=0.5,
dense_bias=FALSE,
dense_parametrizations="None",
dense_normalization_type="LayerNorm",
dense_residual_type="ResidualGate",
rec_act_fct="Tanh",
rec_n_layers=2,
rec_type="GRU",
rec_bidirectional=FALSE,
rec_dropout=0.2,
rec_bias=FALSE,
rec_parametrizations="None",
rec_normalization_type="LayerNorm",
rec_residual_type="ResidualGate",
tf_act_fct="ELU",
tf_dense_dim=512,
tf_n_layers=0,
tf_dropout_rate_1=0.2,
tf_dropout_rate_2=0.5,
tf_attention_type="MultiHead",
tf_positional_type ="absolute",
tf_num_heads=1,
tf_bias=FALSE,
tf_parametrizations="None",
tf_normalization_type="LayerNorm",
tf_residual_type="ResidualGate"
)
classifier_with_fe$train(
data_embeddings = review_embeddings,
data_targets = review_labels,
data_folds = 10,
data_val_size = 0.25,
loss_balance_class_weights = TRUE,
loss_balance_sequence_length = TRUE,
loss_cls_fct_name="FocalLoss",
use_sc = FALSE,
sc_method = "knnor",
sc_min_k = 1,
sc_max_k = 10,
use_pl = FALSE,
pl_max_steps = 3,
pl_max = 1.00,
pl_anchor = 1.00,
pl_min = 0.00,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
epochs = 150,
batch_size = 32,
trace = TRUE,
ml_trace = 0,
log_dir = NULL,
log_write_interval = 10,
n_cores = auto_n_cores(),
lr_rate=1e-3,
lr_warm_up_ratio=0.02,
optimizer="AdamW"
)
#> Mon Aug 11 11:45:15 2025 Batch 1 / 10 done
#> Mon Aug 11 11:45:15 2025 Batch 2 / 10 done
#> Mon Aug 11 11:45:15 2025 Batch 3 / 10 done
#> Mon Aug 11 11:45:16 2025 Batch 4 / 10 done
#> Mon Aug 11 11:45:16 2025 Batch 5 / 10 done
#> Mon Aug 11 11:45:16 2025 Batch 6 / 10 done
#> Mon Aug 11 11:45:17 2025 Batch 7 / 10 done
#> Mon Aug 11 11:45:17 2025 Batch 8 / 10 done
#> Mon Aug 11 11:45:17 2025 Batch 9 / 10 done
#> Mon Aug 11 11:45:18 2025 Batch 10 / 10 done
#> Mon Aug 11 11:45:18 2025 Total Cases: 300 Unique Cases: 300 Labeled Cases: 225
#> Mon Aug 11 11:45:18 2025 Start
#> Mon Aug 11 11:45:20 2025 | Iteration 1 from 10
#> Mon Aug 11 11:45:20 2025 | Iteration 1 from 10 | Training
#> Mon Aug 11 11:45:36 2025 | Iteration 2 from 10
#> Mon Aug 11 11:45:36 2025 | Iteration 2 from 10 | Training
#> Mon Aug 11 11:45:53 2025 | Iteration 3 from 10
#> Mon Aug 11 11:45:53 2025 | Iteration 3 from 10 | Training
#> Mon Aug 11 11:46:10 2025 | Iteration 4 from 10
#> Mon Aug 11 11:46:10 2025 | Iteration 4 from 10 | Training
#> Mon Aug 11 11:46:26 2025 | Iteration 5 from 10
#> Mon Aug 11 11:46:26 2025 | Iteration 5 from 10 | Training
#> Mon Aug 11 11:46:43 2025 | Iteration 6 from 10
#> Mon Aug 11 11:46:43 2025 | Iteration 6 from 10 | Training
#> Mon Aug 11 11:46:59 2025 | Iteration 7 from 10
#> Mon Aug 11 11:46:59 2025 | Iteration 7 from 10 | Training
#> Mon Aug 11 11:47:16 2025 | Iteration 8 from 10
#> Mon Aug 11 11:47:16 2025 | Iteration 8 from 10 | Training
#> Mon Aug 11 11:47:32 2025 | Iteration 9 from 10
#> Mon Aug 11 11:47:32 2025 | Iteration 9 from 10 | Training
#> Mon Aug 11 11:47:49 2025 | Iteration 10 from 10
#> Mon Aug 11 11:47:49 2025 | Iteration 10 from 10 | Training
#> Mon Aug 11 11:48:05 2025 | Final training
#> Mon Aug 11 11:48:05 2025 | Final training | Training
#> Mon Aug 11 11:48:22 2025 Training Complete
That is all. Now you can use and train the classifier in the same way as you did without a feature extractor. You do not even need to save and load the feature extractor manually; this is done automatically for all classifiers.
For example, let us explore the performance of the classifier.
classifier_with_fe$reliability$test_metric_mean
#> iota_index min_iota2 avg_iota2
#> 0.5416996 0.4618898 0.5779117
#> max_iota2 min_alpha avg_alpha
#> 0.6939337 0.5655952 0.7169048
#> max_alpha static_iota_index dynamic_iota_index
#> 0.8682143 0.2782546 0.4608339
#> kalpha_nominal kalpha_ordinal kendall
#> 0.4402654 0.4402654 0.7287288
#> c_kappa_unweighted c_kappa_linear c_kappa_squared
#> 0.4384357 0.4384357 0.4384357
#> kappa_fleiss percentage_agreement balanced_accuracy
#> 0.4274902 0.7549407 0.7169048
#> gwet_ac1_nominal gwet_ac2_linear gwet_ac2_quadratic
#> 0.5651404 0.5651404 0.5651404
#> avg_precision avg_recall avg_f1
#> 0.7504277 0.7169048 0.7137451
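Prediction also works exactly as before; the feature extractor is applied internally. A brief sketch reusing the objects from above:
# Predict with the classifier that uses a feature extractor
preds_fe <- classifier_with_fe$predict(
  newdata = review_embeddings,
  batch_size = 8
)
head(preds_fe)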
If you would like to save and load a TEFeatureExtractor independently of a classifier, you can use the function pair save_to_disk and load_from_disk as with the other objects of this package. This is useful if you would like to use the TEFeatureExtractor at a later point in time in combination with other models.
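A brief sketch mirroring the classifier example above (the folder name is an assumption):
# Save the feature extractor on its own
save_to_disk(
  object = feature_extractor,
  dir_path = "examples",
  folder_name = "fe_bert_movie_reviews"
)
# Load it again at a later time point
feature_extractor <- load_from_disk("examples/fe_bert_movie_reviews")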
References
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. https://doi.org/10.48550/arXiv.2004.05150
Berding, F., & Pargmann, J. (2022). Iota Reliability Concept of the Second Generation. Berlin: Logos. https://doi.org/10.30819/5581
Berding, F., Riebenbauer, E., Stütz, S., Jahncke, H., Slopinski, A., & Rebmann, K. (2022). Performance and Configuration of Artificial Intelligence in Educational Settings.: Introducing a New Reliability Concept Based on Content Analysis. Frontiers in Education, 1–21. https://doi.org/10.3389/feduc.2022.818365
Campesato, O. (2021). Natural Language Processing Fundamentals for Developers. Mercury Learning & Information. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6647713
Cascante-Bonilla, P., Tan, F., Qi, Y. & Ordonez, V. (2020). Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning. https://doi.org/10.48550/arXiv.2001.06001
Chollet, F., Kalinowski, T., & Allaire, J. J. (2022). Deep learning with R (Second edition). Manning Publications Co. https://learning.oreilly.com/library/view/-/9781633439849/?ar
Dai, Z., Lai, G., Yang, Y. & Le, Q. V. (2020). Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. https://doi.org/10.48550/arXiv.2006.03236
Devlin, J., Chang, M.‑W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
Frochte, J. (2019). Maschinelles Lernen: Grundlagen und Algorithmen in Python (2., aktualisierte Auflage). Hanser.
Ganesh, P., Chen, Y., Lou, X., Khan, M. A., Yang, Y., Sajjad, H., Nakov, P., Chen, D., & Winslett, M. (2021). Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. Transactions of the Association for Computational Linguistics, 9, 1061–1080. https://doi.org/10.1162/tacl_a_00413
Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (Fourth edition). Gaithersburg: STATAXIS.
He, P., Liu, X., Gao, J. & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. https://doi.org/10.48550/arXiv.2006.03654
Islam, A., Belhaouari, S. B., Rehman, A. U. & Bensmail, H. (2022). KNNOR: An oversampling technique for imbalanced datasets. Applied Soft Computing, 115, 108288. https://doi.org/10.1016/j.asoc.2021.108288
Jadon, S., & Garg, A. (2020). Hands-On One-shot Learning with Python: Learn to Implement Fast and Accurate Deep Learning Models with Fewer Training Samples Using Pytorch. Packt Publishing Limited. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6175328
Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology (4th ed.). Los Angeles: SAGE.
Lane, H., Howard, C., & Hapke, H. M. (2019). Natural language processing in action: Understanding, analyzing, and generating text with Python. Shelter Island: Manning.
Larusson, J. A., & White, B. (Eds.). (2014). Learning Analytics: From Research to Practice. New York: Springer. https://doi.org/10.1007/978-1-4614-3305-7
Lee, D.‑H. (2013). Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. CML 2013 Workshop: Challenges in Representation Learning.
Lee-Thorp, J., Ainslie, J., Eckstein, I. & Ontanon, S. (2021). FNet: Mixing Tokens with Fourier Transforms. https://doi.org/10.48550/arXiv.2105.03824
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://doi.org/10.48550/arXiv.1907.11692
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In D. Lin, Y. Matsumoto, & R. Mihalcea (Eds.), Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 142–150). Association for Computational Linguistics. https://aclanthology.org/P11-1015
Oreshkin, B. N., Rodriguez, P., & Lacoste, A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. Advance online publication. https://doi.org/10.48550/arXiv.1805.10123
Papilloud, C., & Hinneburg, A. (2018). Qualitative Textanalyse mit Topic-Modellen: Eine Einführung für Sozialwissenschaftler. Wiesbaden: Springer. https://doi.org/10.1007/978-3-658-21980-2
Pappagari, R., Zelasko, P., Villalba, J., Carmiel, Y., & Dehak, N. (2019). Hierarchical Transformers for Long Document Classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 838–844). IEEE. https://doi.org/10.1109/ASRU46091.2019.9003958
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. https://aclanthology.org/D14-1162.pdf
Ranjan, & Chitta. (2019). Build the right Autoencoder — Tune and Optimize using PCA principles.: Part I. https://towardsdatascience.com/build-the-right-autoencoder-tune-and-optimize-using-pca-principles-part-i-1f01f821999b
Schreier, M. (2012). Qualitative Content Analysis in Practice. Los Angeles: SAGE.
Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. https://doi.org/10.48550/arXiv.1703.05175
Song, K., Tan, X., Qin, T., Lu, J. & Liu, T.‑Y. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. https://doi.org/10.48550/arXiv.2004.09297
Tunstall, L., Werra, L. von, Wolf, T., & Géron, A. (2022). Natural language processing with transformers: Building language applications with hugging face (Revised edition). Heidelberg: O’Reilly.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. https://doi.org/10.48550/arXiv.1706.03762
Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., Cooper, N., Adams, G., Howard, J. & Poli, I. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. https://doi.org/10.48550/arXiv.2412.13663
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. https://doi.org/10.48550/arXiv.1609.08144
Zhang, X., Nie, J., Zong, L., Yu, H., & Liang, W. (2019). One Shot Learning with Margin. In Q. Yang, Z.-H. Zhou, Z. Gong, M.-L. Zhang, & S.-J. Huang (Eds.), Lecture Notes in Computer Science. Advances in Knowledge Discovery and Data Mining (Vol. 11440, pp. 305–317). Springer International Publishing. https://doi.org/10.1007/978-3-030-16145-3_24
Zou, L. (2023). Meta-Learning: Theory, Algorithms and Applications. Elsevier Science & Technology. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=7134465