1 Introduction and Overview

1.1 Preface

This vignette introduces the package aifeducation and its usage with R syntax. For users who are unfamiliar with R or who do not have coding skills in relevant languages (e.g., python), we recommend starting with the graphical user interface Aifeducation - Studio, which is described in the vignette 02 Using the graphical user interface Aifeducation - Studio.

We assume that aifeducation is installed as described in vignette 01 Get Started. The introduction starts with a brief explanation of basic concepts, which are necessary to work with this package.

1.2 Basic Concepts

In the educational and social sciences, assigning scientific concepts to an observation is an important task that allows researchers to understand an observation, to generate new insights, and to derive recommendations for research and practice.

In educational science, several areas deal with this kind of task. For example, diagnosing students' characteristics is an important aspect of a teacher's profession and necessary to understand and promote learning. Another example is the use of learning analytics, where data about students is used to provide learning environments adapted to their individual needs. On another level, educational institutions such as schools and universities can use this information for data-driven performance decisions (Laurusson & White 2014) as well as for decisions about where and how to improve their institution. In any case, a real-world observation is aligned with scientific models to use scientific knowledge as a technology for improved learning and instruction.

Supervised machine learning is one concept that allows a link between real-world observations and existing scientific models and theories (Berding et al. 2022). For educational science, this is a great advantage because it allows researchers to use existing knowledge and insights when applying AI. The drawback of this approach is that the training of AI requires both information about the real-world observations and information on the corresponding alignment with scientific models and theories.

A valuable source of data in educational science is written text, since textual data can be found almost everywhere in the realm of learning and teaching (Berding et al. 2022). For example, teachers often require students to solve a task provided in written form. Students have to create a solution for these tasks, which they often document in a short written essay or a presentation. This data can be used to analyze learning and teaching. Teachers' written tasks for their students may provide insights into the quality of instruction, while students' solutions may provide insights into their learning outcomes and prerequisites.

AI can be a helpful assistant in analyzing textual data, since this kind of analysis is a challenging and time-consuming task for humans.

Please note that an introduction to content analysis, natural language processing or machine learning is beyond the scope of this vignette. If you would like to learn more, please refer to the cited literature.

Before we start, it is necessary to define our understanding of some basic concepts, since applying AI to educational contexts means combining the knowledge of different scientific disciplines that use different, sometimes overlapping, concepts. Even within a single research area, concepts are not unified. Figure 1 illustrates this package's understanding.

Figure 1: Understanding of Central Concepts

Since aifeducation looks at the application of AI for classification tasks from the perspective of the empirical method of content analysis, there is some overlap between the concepts of content analysis and machine learning. In content analysis, a phenomenon such as performance or color can be described as a scale/dimension which is made up of several categories (e.g., Schreier 2012, pp. 59). In our example, an exam's performance (scale/dimension) could be "good", "average", or "poor". In terms of colors (scale/dimension), categories could be "blue", "green", etc. Machine learning literature uses other words to describe this kind of data: "scale" and "dimension" correspond to the term "label", while "categories" corresponds to the term "classes" (Chollet, Kalinowski & Allaire 2022, p. 114).

With these clarifications, classification means that a text is assigned to the correct category of a scale or, respectively, that the text is labeled with the correct class. As Figure 2 illustrates, two kinds of data are necessary to train an AI to classify text in line with supervised machine learning principles.

Figure 2: Basic Structure of Supervised Machine Learning

By providing AI with both the textual data as input data and the corresponding information about the class as target data, AI can learn which texts imply a specific class or category. In the above exam example, AI can learn which texts imply a “good”, an “average” or a “poor” judgment. After training, AI can be applied to new texts and predict the most likely class of every new text. The generated class can be used for further statistical analysis or to derive recommendations about learning and teaching.

In use cases as described in this vignette, AI has to "understand" natural language: "Natural language processing is an area of research in computer science and artificial intelligence (AI) concerned with processing natural languages such as English and Mandarin. This processing generally involves translating natural language into data (numbers) that a computer can use to learn about the world. (…)" (Lane, Howard & Hapke 2019, p. 4)

Thus, the first step is to transform raw texts into a form that is usable for a computer; that is, raw texts must be transformed into numbers. In modern approaches, this is usually done through word embeddings. Campesato (2021, p. 102) describes them as "the collective name for a set of language modeling and feature learning techniques (…) where words or phrases from the vocabulary are mapped to vectors of real numbers." The definition of a word vector is similar: "Word vectors represent the semantic meaning of words as vectors in the context of the training corpus." (Lane, Howard & Hapke 2019, p. 191). In the next step, the word or text embeddings can be used as input data and the labels as target data when training AI to classify a text.
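
For illustration, a word embedding is nothing more than a numeric vector of fixed length. A minimal sketch in R (the values are invented):

# Two hypothetical three-dimensional word vectors
vector_learning <- c(0.21, -0.47, 0.88)
vector_teaching <- c(0.19, -0.45, 0.91)
# Semantically related words are mapped to similar vectors;
# real models use several hundred dimensions per token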

In aifeducation, these steps are covered with three different types of models, as shown in Figure 3.

Figure 3: Model Types in aifeducation
  • Base Models: The base models contain the capacities to understand natural language. In general, these are transformers such as BERT, RoBERTa, etc. A huge number of pre-trained models can be found on Hugging Face.

  • Text Embedding Models: These models are built on top of base models and store the instructions on how to use these base models for converting raw texts into sequences of numbers. Please note that the same base model can be used to create different text embedding models.

  • Classifiers: Classifiers are used on top of a text embedding model. They are used to classify a text into categories/classes based on the numeric representation provided by the corresponding text embedding model. Please note that a text embedding model can be used to create different classifiers (e.g. one classifier for colors, one classifier to estimate the quality of a text, etc.).

2 Start Working

2.1 Starting a New Session

Before you can work with aifeducation, you must set up a new R session. First, it is necessary to set up python via 'reticulate' and choose the conda environment where all necessary python libraries are available. Second, you can load aifeducation. In case you installed python as suggested in vignette 01 Get started, you may start a new session like this:

reticulate::use_condaenv(condaenv = "aifeducation")
library(aifeducation)

Note: Every time you start a new session in R, you have to set the correct conda environment and load the library aifeducation.

2.2 Data Management

2.2.1 Introduction

In the context of use cases for aifeducation, three different types of data are necessary: raw texts, text embeddings, and target data which represent the categories/classes of a text.

To deal with the first two types and to allow the use of large data sets that may not fit into the memory of your machine, the package ships with two specialized objects.

The first is LargeDataSetForText. Objects of this class are used to read raw texts from .txt, .pdf, and .xlsx files and store them for further computations. The second is LargeDataSetForTextEmbeddings, which is used to store the text embeddings of raw texts generated with TextEmbeddingModels. We will describe the transformation of raw texts into text embeddings later.

2.2.2 Raw Texts

The creation of a LargeDataSetForText is necessary if you would like to create or train a base model or to generate text embeddings. In case you would like to create such a data set for the first time, you have to call:

raw_texts <- LargeDataSetForText$new()

Now you have an empty data set. To fill this object with raw texts, different methods are available, depending on the file type used for storing the raw texts.

.txt files

The first alternative is to store raw texts in .txt files. To use these, you have to structure your data in a specific way:

  • Create a main folder for storing your data.
  • Store every raw text/document as a single .txt file in its own folder within the main folder. Every folder should contain only one file holding a raw text/document.
  • Add an additional .txt file to the folder named bib_entry.txt. This file contains the bibliographic information for the raw text.
  • Add an additional .txt file to the folder named license.txt which contains a short statement of the text's license, such as "CC BY".
  • Add an additional .txt file to the folder named url_license.txt which contains the URL/link to the license's text, such as "https://creativecommons.org/licenses/by/4.0/".
  • Add an additional .txt file to the folder named text_license.txt which contains the full license text.
  • Add an additional .txt file to the folder named url_source.txt which contains the URL/link to the source of the text on the internet.

Applying these rules may result in a data structure as follows:

  • Folder “main folder”
    • Folder Text A
      • text_a.txt
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
    • Folder Text B
      • text_b.txt
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
    • Folder Text C
      • text_c.txt
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
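
If you prefer to set up this structure from within R, a minimal sketch for one document may look like this (paths and file contents are hypothetical):

# Create the folder for one document and fill it with the text and metadata
dir.create("main folder/Text A", recursive = TRUE)
writeLines("This is the raw text of document A.", "main folder/Text A/text_a.txt")
writeLines("Author, A. (2019). Example Title.", "main folder/Text A/bib_entry.txt")
writeLines("CC BY", "main folder/Text A/license.txt")
writeLines("https://creativecommons.org/licenses/by/4.0/", "main folder/Text A/url_license.txt")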

Now you can call the method add_from_files_txt by passing the path to the directory of the main folder to dir_path.

raw_texts$add_from_files_txt(
  dir_path = "main folder"
)

The data set will now read all the raw texts in the main folder and assign every text its corresponding bib entry and license. Please note that adding a bib_entry.txt, license.txt, url_license.txt, text_license.txt, and url_source.txt to every folder is optional. If there is no such file in the corresponding folder, there will be an empty entry in the data set. However, against the backdrop of the European AI Act, we recommend providing both the license and the bibliographic information to make the documentation of your models more straightforward. Furthermore, some licenses such as those provided by Creative Commons require statements about the creators, a copyright note, a URL or link to the source material (if possible), the license of the material, and a URL or link to the license's text on the internet or the license text itself. Please check the licenses of the material you are using for the requirements.

.pdf files

The second alternative is to use .pdf files as a source for raw texts. Here, the necessary structure is similar to .txt files:

  • Create a main folder for storing your data.
  • Store every raw text/document as a single .pdf file in its own folder within the main folder. Every folder should contain only one file holding a raw text/document.
  • Add an additional .txt file to the folder named bib_entry.txt. This file contains the bibliographic information for the raw text.
  • Add an additional .txt file to the folder named license.txt which contains a short statement of the text's license, such as "CC BY".
  • Add an additional .txt file to the folder named url_license.txt which contains the URL/link to the license text, such as "https://creativecommons.org/licenses/by/4.0/".
  • Add an additional .txt file to the folder named text_license.txt which contains the full license text.
  • Add an additional .txt file to the folder named url_source.txt which contains the URL/link to the source of the text on the internet.

Applying these rules may result in a data structure as follows:

  • Folder “main folder”
    • Folder Text A
      • text_a.pdf
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
    • Folder Text B
      • text_b.pdf
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
    • Folder Text C
      • text_c.pdf
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt

Please note that all files except the text file itself must be .txt files, not .pdf files.

Now you can call the method add_from_files_pdf by passing the path to the directory of the main folder to dir_path.

raw_texts$add_from_files_pdf(
  dir_path = "main folder"
)

As stated above, bib_entry.txt, license.txt, url_license.txt, text_license.txt, and url_source.txt are optional.

.xlsx files

The third alternative is to store the raw texts in .xlsx files. This alternative is useful if you have many small raw texts. For raw texts that are very large, such as books or papers, we recommend storing them as .txt or .pdf files.

In order to add raw texts from .xlsx files, the files need a special structure:

  • Create a main folder for storing all .xlsx files you would like to read.
  • All .xlsx files must contain the names of the columns in the first row, and the names must be identical for each column across all .xlsx files you would like to read.
  • Every .xlsx file must contain a column storing the text ID and a column storing the raw text. Every text must have a unique ID across all .xlsx files.
  • Every .xlsx file can contain an additional column for the bib entry.
  • Every .xlsx file can contain an additional column for the license.
  • Every .xlsx file can contain an additional column for the license's URL.
  • Every .xlsx file can contain an additional column for the license text.
  • Every .xlsx file can contain an additional column for the source's URL.

Your .xlsx file may look like this:

  id   text                        bib_entry       license   url_license   text_license   url_source
  z3   This is an example.         Author (2019)   CC BY     Example URL   Text           Example URL
  a3   This is a second example.   Author (2022)   CC BY     Example URL   Text           Example URL

Now you can call the method add_from_files_xlsx by passing the path of the main folder to dir_path. Please do not forget to specify the column names for the ID and the text as well as for the bibliographic and license information.

raw_texts$add_from_files_xlsx(
  dir_path = "main folder",
  id_column = "id",
  text_column = "text",
  bib_entry_column = "bib_entry",
  license_column = "license",
  url_license_column = "url_license",
  text_license_column = "text_license",
  url_source_column = "url_source"
)

Saving and loading a data set

Once you have created a LargeDataSetForText, you can save your data to disk by calling the function save_to_disk. In our example the code would be:

save_to_disk(
  object = raw_texts,
  dir_path = "C:/",
  folder_name = "raw_texts"
)

The argument object requires the object you would like to save. In our case this is raw_texts. With dir_path you specify the location where to save the object, and with folder_name you define the name of the folder that will be created within that directory. The data set is saved in this folder.

To load an existing data set, you can call the function load_from_disk with the directory path where you stored the data. In our case this would be:

raw_text_dataset <- load_from_disk("C:/raw_texts")

Now you can work with your data.

2.2.3 Text Embeddings

The numerical representations of raw texts (called text embeddings) are stored in objects of class LargeDataSetForTextEmbeddings. These data sets are generated by models such as TextEmbeddingModels. Thus, you will never need to create such a data set manually.

However, you will need this kind of data set to train a classifier or to predict the categories/classes of raw texts. Thus, it may be advantageous to save already transformed data. You can save and load an object of this class with the functions save_to_disk and load_from_disk.

Let us assume that we have a LargeDataSetForTextEmbeddings text_embeddings. Saving this object may look like:

save_to_disk(
  object = text_embeddings,
  dir_path = "C:/",
  folder_name = "text_embeddings"
)

The data set will be saved at C:/text_embeddings. Loading this data set may look like:

new_text_embeddings <- load_from_disk("C:/text_embeddings")

2.2.4 Target Data

The last type of data necessary for working with aifeducation is the categories/classes of the given raw texts. For this kind of data, we currently do not provide a special object. You just need a named factor storing the classes/categories for a dimension. It is important that the names equal the IDs of the corresponding raw texts/text embeddings, since matching the classes/categories to texts is done with the help of these names.

Saving and loading can be done with R’s functions save and load.
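
For example, a named factor for our exam scale could be created and saved like this (the IDs and values are hypothetical):

# Named factor storing the categories; the names must equal the text IDs
target_data <- factor(c("good", "average", "poor"))
names(target_data) <- c("text_01", "text_02", "text_03")
save(target_data, file = "target_data.rda")
# load("target_data.rda") restores the factor in a later session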

2.3 Example Data for this Vignette

To illustrate the steps in this vignette, we cannot use data from educational settings since such data is generally protected by privacy policies. Therefore, we use a subset of the Stanford Movie Review Dataset provided by Maas et al. (2011), which is part of the package. You can access the data set with imdb_movie_reviews.

We now have a data set with three columns. The first column contains the raw text, the second contains the rating of the movie (positive or negative), and the third contains the ID of the movie review. About 200 reviews imply a positive rating of a movie and about 100 imply a negative rating.

For this tutorial, we modify this data set by setting about 50 positive and 25 negative reviews to NA, indicating that these reviews are not labeled.

example_data <- imdb_movie_reviews
example_data$label <- as.character(example_data$label)
example_data$label[c(76:100)] <- NA
example_data$label[c(201:250)] <- NA
example_targets <- as.factor(example_data$label)
table(example_data$label)
#> 
#> neg pos 
#>  75 150

We will now create a LargeDataSetForText from this data.frame. Before we can do this, we must ensure that the data.frame has all necessary columns:

colnames(example_data)
#> [1] "text"  "label" "id"

Now we have to add two columns. For this tutorial we do not add any bibliographic or license information although this is recommended in practice.

example_data$bib_entry <- NA
example_data$license <- NA
colnames(example_data)
#> [1] "text"      "label"     "id"        "bib_entry" "license"

Now the data.frame is ready as input for our data set. The “label” column will not be included in this data set.

data_set_reviews_text <- LargeDataSetForText$new()
data_set_reviews_text$add_from_data.frame(example_data)

We save the categories/labels within a separate factor.

review_labels <- example_data$label
names(review_labels) <- example_data$id

We will now use this data to show you how to use the different objects and functions in aifeducation.

3 Base Models

3.1 Overview

Base models are the foundation of all further models in aifeducation. At the moment, these are transformer models such as MPNet, BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), DeBERTa version 2 (He et al. 2020), Funnel-Transformer (Dai et al. 2020), and Longformer (Beltagy, Peters & Cohan 2020). In general, these models are first trained on a large corpus of general texts. In a second step, the models are fine-tuned on domain-specific texts and/or for specific tasks. Since the creation of base models requires a huge number of texts, resulting in high computational time, it is recommended to use pre-trained models. These can be found on Hugging Face. Sometimes, however, it is more straightforward to create a new model to fit a specific purpose. aifeducation supports both creating and training/fine-tuning base models.

3.2 Creation of Base Models

Every transformer model is composed of two parts: 1) the tokenizer which splits raw texts into smaller pieces to model a large number of words with a limited, small number of tokens and 2) the neural network that is used to model the capabilities for understanding natural language.

At the beginning you can choose between the different supported transformer architectures. Depending on the architecture, you have different options determining the shape of your neural network. For this vignette we use a BERT (Devlin et al. 2019) model which can be created with the create-method of the Transformer class. Use aife_transformer_maker to create a transformer object.

See section 3 'Transformer Maker' of the vignette 01 Transformers for Developers for details.

base_model <- aife_transformer_maker$make("bert")
base_model$create(
  ml_framework = "pytorch",
  model_dir = "my_own_transformer",
  text_dataset = LargeDataSetForText$new(example_data),
  vocab_size = 30522,
  vocab_do_lower_case = FALSE,
  max_position_embeddings = 512,
  hidden_size = 768,
  num_hidden_layer = 12,
  num_attention_heads = 12,
  intermediate_size = 3072,
  hidden_act = "gelu",
  hidden_dropout_prob = 0.1,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  trace = TRUE,
  log_dir = NULL,
  log_write_interval = 2
)

First, the function receives the machine learning framework you chose at the start of the session. However, you can change this by setting ml_framework="tensorflow" or ml_framework="pytorch".

For this function to work, you must provide a path to a directory where your new transformer should be saved (model_dir). Furthermore, you must provide raw texts. These texts are not used to train the transformer but to build the vocabulary. The maximum size of the vocabulary is determined by vocab_size. Modern tokenizers such as WordPiece (Wu et al. 2016) use algorithms that split tokens into smaller elements, allowing them to build a huge number of words from a small number of elements. Thus, even with a small vocabulary of about 30,000 tokens, they are able to represent a very large number of words.

The other parameters allow you to customize your BERT model. For example, you could increase the number of hidden layers from 12 to 24 or reduce the hidden size from 768 to 256, allowing you to build and to test larger or smaller models.

The vignette 04 Model configuration provides details on how to configure a base model.

Please note that with max_position_embeddings you determine how many tokens your transformer can process. If your text has more tokens, these tokens are ignored. However, if you would like to analyze long documents, please avoid increasing this number too significantly because the computational time does not grow linearly but quadratically (Beltagy, Peters & Cohan 2020). For long documents, you can use another architecture (e.g., Longformer from Beltagy, Peters & Cohan 2020) or split a long document into several chunks which are used sequentially for classification (e.g., Pappagari et al. 2019). Using chunks is supported by aifeducation for all models.

Since creating a transformer model is energy consuming, aifeducation allows you to estimate its ecological impact with the help of the python library codecarbon. Thus, sustain_track is set to TRUE by default. If you use the sustainability tracker, you must provide the alpha-3 code for the country where your computer is located (e.g., "CAN" = Canada, "DEU" = Germany). A list with the codes can be found on Wikipedia. The reason is that different countries use different sources and techniques for generating their energy, resulting in a specific impact on CO2 emissions. For the USA and Canada, you can additionally specify a region by setting sustain_region. Please refer to the documentation of codecarbon for more information.

After calling the function, you will find your new model in your model directory.

3.3 Train/Fine-Tune a Base Model

If you would like to train a new base model (see section 3.2) for the first time or want to adapt a pre-trained model to a domain-specific language or task, you can call the corresponding train-method.

See section 3 'Transformer Maker' of the vignette 01 Transformers for Developers for details.

base_model$train(
  ml_framework = "pytorch",
  output_dir = "my_own_transformer_trained",
  model_dir_path = "my_own_transformer",
  text_dataset = LargeDataSetForText$new(example_data[1:10, ]),
  p_mask = 0.15,
  whole_word = TRUE,
  val_size = 0.1,
  n_epoch = 1,
  batch_size = 12,
  chunk_size = 250,
  n_workers = 1,
  multi_process = FALSE,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  trace = TRUE,
  log_dir = NULL,
  log_write_interval = 2
)

Here it is important that you provide the path to the directory where your new transformer is stored (model_dir_path). Furthermore, it is important that you provide another directory (output_dir) where your trained transformer should be saved to avoid reading and writing collisions.

Now, the provided raw data is used to train your model. In case of a BERT model, the learning objective is Masked Language Modeling. Other models may use other learning objectives. Please refer to the documentation for more details on every model.

First, you can set the length of the token sequences with chunk_size. With whole_word you can choose between masking single tokens or masking complete words (please remember that modern tokenizers split words into several tokens, so tokens and words do not necessarily match directly). With p_mask you can determine how many tokens should be masked. Finally, with val_size you set how many chunks of tokens should be used for the validation sample. The minimum is 2.

Please remember to set the correct alpha-3 code for tracking the ecological impact of training your model (sustain_iso_code).

If you work on a machine whose graphics device has only a small memory capacity, please reduce the batch size significantly. We also recommend changing the memory usage with set_config_gpu_low_memory() at the beginning of the session if you use tensorflow as the framework.

After the training finishes, you can find the transformer ready to use in the directory set with output_dir. Now you are able to create a text embedding model.

Again you can change the machine learning framework by setting ml_framework="tensorflow" or ml_framework="pytorch". If you do not change this argument, the framework you chose at the beginning is used.

4 Text Embedding Models

4.1 Introduction

The text embedding model is the interface to R in aifeducation. In order to create a new model, you need a base model that provides the ability to understand natural language. A text embedding model is stored as an object of class TextEmbeddingModel. This object contains all relevant information for transforming raw texts into a numeric representation that can be used for machine learning.

In aifeducation, the transformation of raw texts into numbers is a separate step from downstream tasks such as classification. This reduces computational time on machines with low performance. By separating text embedding from other tasks, the text embedding has to be calculated only once and can be used for different tasks at the same time. Another advantage is that the training of the downstream tasks involves only the parameters of the downstream tasks and not the parameters of the embedding model, making training less time-consuming and thus decreasing computational intensity. Finally, this approach allows the analysis of long documents by applying the same algorithm to different parts.

The text embedding model provides a unified interface: After creating the model with different methods, the handling of the model is always the same.

4.2 Create a Text Embedding Model

First you have to choose the base model that forms the foundation of your new text embedding model. Since we use a BERT model in our example, we have to set method = "bert".

bert_modeling <- TextEmbeddingModel$new()
bert_modeling$configure(
  model_name = "bert_embedding",
  model_label = "Text Embedding via BERT",
  model_language = "english",
  method = "bert",
  max_length = 512,
  chunks = 4,
  overlap = 30,
  emb_layer_min = "middle",
  emb_layer_max = "2_3_layer",
  emb_pool_type = "average",
  model_dir = "my_own_transformer_trained"
)

Next, you have to provide the directory where your base model is stored. In this example this would be model_dir = "my_own_transformer_trained". Of course, you can use any other pre-trained model from Hugging Face which addresses your needs.

Using a BERT model for text embedding is unproblematic as long as a text does not contain more tokens than the transformer can process. This maximum value is set in the configuration of the transformer (see section 3.2). If a text produces more tokens, the last tokens are ignored. In some instances, however, you might want to analyze long texts. In these situations, reducing the text to the first tokens (e.g., only the first 512 tokens) could result in a problematic loss of information. To deal with these situations, you can configure a text embedding model in aifeducation to split long texts into several chunks which are processed by the base model. The maximum number of chunks is set with chunks. In our example above, the text embedding model would split a text consisting of 1,024 tokens into two chunks, with every chunk consisting of 512 tokens. For every chunk, a text embedding is calculated. As a result, you receive a sequence of embeddings. The first embedding characterizes the first part of the text, the second embedding characterizes the second part of the text, and so on. Thus, our sample text embedding model is able to process texts with about 4*512 = 2,048 tokens. This approach is inspired by the work of Pappagari et al. (2019).

Since transformers are able to account for the context, it may be useful to interconnect the chunks to bring context into the calculations. This can be done with overlap, which determines how many tokens from the end of the prior chunk should be added to the beginning of the next. In our example, the last 30 tokens of the prior chunk are added at the beginning of the following chunk. This can help to provide the correct context of the text sections in the analysis. Altogether, this sample model can analyze a maximum of 512 + (4 - 1) * (512 - 30) = 1958 tokens of a text.
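
A small sketch makes this arithmetic explicit:

# Maximum number of tokens the sample model can process
max_length <- 512
chunks <- 4
overlap <- 30
# Every chunk after the first re-uses 'overlap' tokens from its predecessor
max_length + (chunks - 1) * (max_length - overlap)
#> [1] 1958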

Finally, you have to decide from which hidden layer(s) the embeddings should be drawn. With emb_layer_min and emb_layer_max you can decide from which layers the average value for every token should be calculated. Please note that the calculation considers all layers between emb_layer_min and emb_layer_max. In their initial work, Devlin et al. (2019) used the hidden states of different layers for classification.

With emb_pool_type, you decide which tokens are used for pooling within every layer. In the case of emb_pool_type="cls", only the cls token is used. In the case of emb_pool_type="average" all tokens within a layer are averaged except padding tokens.

The vignette 04 Model configuration provides details on how to configure a text embedding model.

After deciding about the configuration, you can use your model.

4.3 Transforming Raw Texts into Embedded Texts

To transform raw text into a numeric representation, you only have to use the embed_large method of your model. To do this, you must provide a LargeDataSetForText to large_datas_set. Relying on the sample data from section 2.3, we can use the movie reviews as raw texts.

review_embeddings <- bert_modeling$embed_large(
  large_datas_set = data_set_reviews_text,
  trace = TRUE
)

The method embed_large creates an object of class LargeDataSetForTextEmbeddings. This is a data set consisting of the embeddings of every text. The embeddings are stored in an array whose first dimension refers to the specific texts, whose second dimension refers to the chunks/sequences, and whose third dimension refers to the features.
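
To illustrate this layout with plain R (a sketch of the shape only, not the package's internal storage format):

# 3 texts, 4 chunks per text, 768 features per chunk
example_array <- array(0, dim = c(3, 4, 768))
dim(example_array)
#> [1]   3   4 768
# example_array[1, 2, ] would address the embedding of text 1, chunk 2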

With the embedded texts you now have the input to train a new classifier or to apply a pre-trained classifier for predicting categories/classes. In the next chapter we will show you how to use these classifiers. But before we start, we will show you how to save and load your model.

4.4 Saving and Loading Text Embedding Models

Saving a created text embedding model is very easy in aifeducation using the function save_to_disk. This function provides a unified interface for all text embedding models. For saving your work, you pass your model to object and the directory where to save the model to dir_path. With folder_name you determine the name of the folder that should be created in that directory to store the model.

save_to_disk(
  object = bert_modeling,
  dir_path = "C:/text_embedding_models",
  folder_name = "bert_model"
)

In this example the model is saved in a folder at the location C:/text_embedding_models/bert_model. If you want to load your model you can call load_from_disk.

bert_modeling <- load_from_disk("C:/text_embedding_models/bert_model")

4.5 Sustainability

In case the underlying model was trained with an active sustainability tracker (sections 3.2 and 3.3), you can receive a table showing the energy consumption, CO2 emissions, and hardware used during training by calling the method get_sustainability_data(). For our example this would be bert_modeling$get_sustainability_data().

5 Classifiers

5.1 Create a Classifier

Classifiers are built on top of a TextEmbeddingModel. You can create a new classifier by calling TEClassifierRegular$new(). The 'TE' in the class name refers to the idea that the classifier uses text embeddings instead of raw texts.

With the sample data from section 2.3 and the text embeddings from section 4.3, the creation of a new classifier may look like:

classifier <- TEClassifierRegular$new()
classifier$configure(
  name = "movie_review_classifier",
  label = "Classifier for Estimating a Postive or Negative Rating of Movie Reviews",
  text_embeddings = review_embeddings,
  feature_extractor = NULL,
  target_levels = c("neg", "pos"),
  dense_layers = 2,
  dense_size = 5,
  rec_layers = 2,
  rec_size = 10,
  rec_type = "gru",
  rec_bidirectional = FALSE,
  self_attention_heads = 0,
  intermediate_size = NULL,
  attention_type = "fourier",
  add_pos_embedding = FALSE,
  rec_dropout = 0.5,
  repeat_encoder = 0,
  dense_dropout = 0.2,
  recurrent_dropout = 0.6,
  encoder_dropout = 0.1,
  optimizer = "adam"
)

Similarly to the text embedding model, you should provide a name (name) and a label (label) for your new classifier. With text_embeddings you have to provide a LargeDataSetForTextEmbeddings. The data set is created with a TextEmbeddingModel as described in section 4. Here, we continue our example and use the embeddings produced by our BERT model.

target_levels takes the categories/classes your classifier should predict. These can be numbers or even words.

In case you would like to use ordinal data, it is very important that you provide the classes/categories in the correct order: classes/categories representing a "higher" level must be stated before categories/classes with a lower level. If you provide the wrong order, the performance indices will not be valid. In case of nominal data, the order does not matter.
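
For example, a hypothetical ordinal performance scale could be passed as follows:

# "higher" categories are stated before "lower" ones
target_levels = c("good", "average", "poor")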

With feature_extractor you can add a feature extractor that tries to reduce the number of features of your text embedding before passing the embeddings to the classifier. You can read more on this in Section 6.2.

With the other parameters you decide about the structure of your classifier. Figure 4 illustrates this.

Figure 4: Overview of Possible Structure of a Classifier

dense_layers determines the number of dense layers, and dense_size determines the number of neurons for all dense layers. In our example, there are two dense layers with 5 neurons each. rec_layers determines the number of recurrent layers, while rec_size determines the size of all recurrent layers. In this example, we use two layers with 10 neurons each. With rec_type you can choose between two types of recurrent layers: rec_type="gru" implements a Gated Recurrent Unit (GRU) network and rec_type="lstm" implements a Long Short-Term Memory network. With rec_bidirectional you decide whether the recurrent layers should be unidirectional or bidirectional.

Since the classifiers in aifeducation use a standardized scheme for their creation, dense layers are placed after the recurrent layers. If you want to omit recurrent layers or dense layers, set the corresponding argument for the number of layers to 0 (dense_layers = 0, rec_layers = 0).

If you use a text embedding model that processes more than one chunk, we recommend using recurrent layers, since they make use of the sequential structure of your data. In all other cases you can rely on dense layers only.

If you use text embeddings with more than one chunk, you can add self-attention layers in order to take the context of all chunks into account. To add self-attention you have two choices:

  • You can use the attention mechanism known from classic transformer models, i.e., multi-head attention (Vaswani et al. 2017). For this variant, you have to set attention_type="multihead", repeat_encoder to a value of at least 1, and self_attention_heads to a value of at least 1.

  • Furthermore, you can use the attention mechanism described in Lee-Thorp et al. (2021) of the FNet model, which allows much faster computation at low accuracy costs. To use this kind of attention, you have to set attention_type="fourier" and repeat_encoder to a value of at least 1.

With repeat_encoder you can choose how many times an encoder layer should be added. The encoder is implemented as described by Chollet, Kalinowski, and Allaire (2022, pp. 373) for both variants of attention. In our example we have only 300 cases altogether and only 4 chunks. Thus, we do not use any encoder layers.

You can further extend the abilities of your network by adding positional embeddings. Positional embeddings take care of the order of your chunks. Thus, adding such a layer may increase performance if the order of information is important. You can add this layer by setting add_pos_embedding=TRUE. The layer is created as described by Chollet, Kalinowski, and Allaire (2022, pp. 378).

The vignette 04 Model configuration provides details on how to configure a classifier.

Masking, normalization, and the creation of the input layer as well as the output layer are done automatically.

After you have created a new classifier, you can begin training.

5.2 Training a Classifier

To start the training of your classifier, you have to call the train method. As for the creation of the classifier, you must provide the text embeddings to data_embeddings and the categories/classes as target data to data_targets. Please remember that data_targets expects a named factor where the names correspond to the IDs of the corresponding text embeddings. Text embeddings and target data that cannot be matched are omitted from training.

To train a classifier, it is necessary that you provide a path to dir_checkpoint. This directory stores the best set of weights during each training epoch. After training, these weights are automatically used as final weights for the classifier.

For performance estimation, training splits the data into several folds based on cross-validation. The number of folds is set with data_folds. In every iteration, one fold is not used for training and serves as a test sample. The remaining data is used to create a training and a validation sample. The percentage of cases within each fold used as a validation sample is determined with data_val_size. The validation sample is used to determine the state of the model that generalizes best. All performance values saved in the trained classifier refer to the test sample. This data has never been used during training and provides a more realistic estimation of a classifier's performance.

classifier$train(
  data_embeddings = review_embeddings,
  data_targets = review_labels,
  data_folds = 10,
  data_val_size = 0.25,
  balance_class_weights = TRUE,
  balance_sequence_length = TRUE,
  use_sc = FALSE,
  sc_method = "dbsmote",
  sc_min_k = 1,
  sc_max_k = 10,
  use_pl = FALSE,
  pl_max_steps = 5,
  pl_max = 1.00,
  pl_anchor = 1.00,
  pl_min = 0.00,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  epochs = 300,
  batch_size = 32,
  dir_checkpoint = "training/classifier",
  trace = TRUE,
  ml_trace = 1
)

You can further modify the training process with different arguments. With balance_class_weights=TRUE, the class weights are adjusted according to the absolute frequencies of the classes/categories, following the 'Inverse Class Frequency' method. This option should be activated if you have to deal with imbalanced data.
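
The exact computation inside aifeducation may differ, but the following sketch shows one common variant of inverse class frequency weighting applied to our example data:

class_freq <- table(review_labels)  # neg: 75, pos: 150
class_weights <- sum(class_freq) / (length(class_freq) * class_freq)
# neg: 1.50, pos: 0.75 - the rare class receives the larger weight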

With balance_sequence_length=TRUE you can increase performance if you have to deal with texts that differ in their lengths with imbalanced frequencies. If this option is enabled, the loss is adjusted according to the absolute frequencies of your texts' lengths, again following the 'Inverse Class Frequency' method.

epochs determines the maximal number of epochs. During training, the model with the best balanced accuracy is saved and used.

batch_size sets the number of cases that should be processed simultaneously. Please adjust this value to your machine's capacities. Please note that the batch size can have an impact on the classifier's performance.

Since aifeducation tries to address the special needs of the educational and social sciences, some special training steps are integrated into this method:

  • Synthetic Cases: In case of imbalanced data, it is recommended to set use_sc=TRUE. Before training, a number of synthetic units is created via different techniques. Currently, you can request the Basic Synthetic Minority Oversampling Technique, the Density-Based Synthetic Minority Oversampling Technique, and the Adaptive Synthetic Sampling Approach for Imbalanced Learning. The aim is to create new cases that fill the gap to the majority class. Multi-class problems are reduced to a two-class problem (class under investigation vs. all others) for generating these units. If the technique allows setting the number of neighbors during generation, you can configure the data generation with sc_min_k and sc_max_k. The synthetic cases for every class are generated for all k between sc_min_k and sc_max_k. Every k contributes proportionally to the synthetic cases.

  • Pseudo-Labeling: This technique is relevant if you have labeled target data together with a large amount of unlabeled target data. With the different parameters starting with "pl_", you can configure the process of pseudo-labeling. The implementation of pseudo-labeling is based on Cascante-Bonilla et al. (2020). To apply pseudo-labeling, you have to set use_pl=TRUE. pl_max=1.00, pl_anchor=1.00, and pl_min=0.00 describe the certainty of a prediction, where 0 refers to random guessing and 1 refers to perfect certainty. pl_anchor is used as a reference value: the distance to pl_anchor is calculated for every case, and the cases are then sorted by increasing distance from pl_anchor (see the sketch after this list). The proportion of pseudo-labeled data added to the training increases with every step. The maximum number of steps is determined with pl_max_steps.
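
The following sketch illustrates the sorting logic behind pseudo-labeling with invented certainty values (our own illustration, not the package's internal code):

pl_anchor <- 1.00
certainty <- c(0.95, 0.60, 0.99, 0.75)  # hypothetical predicted certainties
distance <- abs(certainty - pl_anchor)
order(distance)  # cases sorted by increasing distance to pl_anchor
#> [1] 3 1 4 2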

Figure 5 illustrates the training loop for the case that all options are set to TRUE.

Figure 5: Overview of the Steps to Perform a Classification

The example above applies the generation of synthetic cases and the algorithm proposed by Cascante-Bonilla et al. (2020). For every fold, the training starts with generating synthetic cases to fill the gap between the minority classes and the majority class. After this, an initial training of the classifier starts. The trained classifier is used to predict pseudo-labels for the unlabeled part of the data, and the 20% of cases with the highest certainty for their pseudo-labels are added to the training data set. Now, new synthetic cases are generated based on both the labeled data and the newly added pseudo-labeled data. The classifier is re-initialized and trained again. After training, the classifier predicts the potential labels of all originally unlabeled data and adds the 40% of pseudo-labeled cases with the highest certainty to the training data. Again, new synthetic cases are generated on both the labeled and the added pseudo-labeled data. The model is re-initialized and trained again until the maximum number of steps for pseudo-labeling (pl_max_steps) is reached. After this, the algorithm is restarted for the next fold until the number of folds (data_folds) is reached. All of these steps are only used to estimate the performance of the classifier on data that is unknown to it.

The last phase of the training begins after the last fold. In the final training, the data set is split only into a training and validation set without a test set to provide the maximum amount of data for the best performance in final training.

In case options like the generation of synthetic cases (use_sc) or pseudo-labeling (use_pl) are disabled, the training process is shorter.

Since training a neural net is energy consuming, aifeducation allows you to estimate its ecological impact with the help of the python library codecarbon. Thus, sustain_track is set to TRUE by default. If you use the sustainability tracker, you must provide the alpha-3 code for the country where your computer is located (e.g., "CAN" = Canada, "DEU" = Germany). A list with the codes can be found on Wikipedia. The reason is that different countries use different sources and techniques for generating their energy, resulting in a specific impact on CO2 emissions. For the USA and Canada, you can additionally specify a region by setting sustain_region. Please refer to the documentation of codecarbon for more information.

Finally, trace and ml_trace allow you to control how much information about the training progress is printed to the console. Please note that training the classifier can take some time.

Please note that after performance estimation, the final training of the classifier makes use of all data available. That is, the test sample is left empty.

5.3 Evaluating a Classifier's Performance

After finishing training, you can evaluate the performance of the classifier. For every fold, the classifier is applied to the test sample and the results are compared to the true categories/classes. Since the test sample is never part of the training, all performance measures provide a more realistic idea of the classifier’s performance.

To support researchers in judging the quality of the predictions, aifeducation utilizes several measures and concepts from content analysis. These are

  • Iota Concept of the Second Generation (Berding & Pargmann 2022)
  • Krippendorff’s Alpha (Krippendorff 2019)
  • Percentage Agreement
  • Gwet’s AC1/AC2 (Gwet 2014)
  • Kendall’s coefficient of concordance W
  • Cohen’s Kappa unweighted
  • Cohen’s Kappa with equal weights
  • Cohen’s Kappa with squared weights
  • Fleiss’ Kappa for multiple raters without exact estimation

You can access the concrete values by accessing the field reliability, which stores all relevant information. In this list you will find the reliability values for every fold. In addition, the reliability of every step within pseudo-labeling is reported.

The central estimates for the reliability values can be found via reliability$test_metric_mean. In our example this would be:

classifier$reliability$test_metric_mean
#>              iota_index               min_iota2               avg_iota2 
#>               0.5606719               0.4584235               0.5869457 
#>               max_iota2               min_alpha               avg_alpha 
#>               0.7154678               0.5785714               0.7226190 
#>               max_alpha       static_iota_index      dynamic_iota_index 
#>               0.8666667               0.2620308               0.4736155 
#>          kalpha_nominal          kalpha_ordinal                 kendall 
#>               0.4654527               0.4654527               0.7369689 
#>       kappa2_unweighted   kappa2_equal_weighted kappa2_squared_weighted 
#>               0.4613610               0.4613610               0.4613610 
#>            kappa_fleiss    percentage_agreement       balanced_accuracy 
#>               0.4533283               0.7693676               0.7226190 
#>                 gwet_ac           avg_precision              avg_recall 
#>               0.5980910               0.7610960               0.7226190 
#>                  avg_f1 
#>               0.7266641

Of particular interest are the values for alpha from the Iota Concept, since they represent a measure of reliability which is independent of the frequency distribution of the classes/categories. The alpha values describe the probability that a case of a specific class is recognized as that specific class. Applying synthetic cases can increase the minimal value of alpha, reducing the risk of missing cases which belong to a rare class. On the contrary, the alpha values of the major category may decrease slightly, losing their unjustified bonus from a high number of cases in the training set. This provides a more realistic performance estimation of the classifier.

In addition, standard measures from machine learning are reported. These are

  • Precision
  • Recall
  • F1-Score

You can access these values as follows:

classifier$reliability$standard_measures_mean
#>     precision    recall        f1
#> neg 0.7155556 0.5785714 0.6209740
#> pos 0.8066364 0.8666667 0.8323543

Finally, you can plot a coding stream scheme showing how the cases of different classes are labeled. Here we use the package iotarelr.

library(iotarelr)
iotarelr::plot_iota2_alluvial(classifier$reliability$iota_object_end_free)

Figure 6: Coding Stream of the Classifier

Here you can see that a noticeable share of the negative reviews is treated as a positive review, while only a smaller share of the positive reviews is treated as a negative review. Thus, the data for the major class (positive reviews) is more reliable and valid than the data for the minor class (negative reviews).

Evaluating the performance of a classifier is a complex task and is beyond the scope of this vignette. Instead, we would like to refer to the cited literature on content analysis and machine learning if you would like to dive deeper into this topic.

5.4 Sustainability

In case the classifier was trained with an active sustainability tracker, you can receive information on sustainability by calling classifier$get_sustainability_data().

classifier$get_sustainability_data()
#> $sustainability_tracked
#> [1] TRUE
#> 
#> $date
#> [1] "Tue Oct  1 20:56:37 2024"
#> 
#> $sustainability_data
#> $sustainability_data$duration_sec
#> [1] 343.5135
#> 
#> $sustainability_data$co2eq_kg
#> [1] 0.0005406515
#> 
#> $sustainability_data$cpu_energy_kwh
#> [1] 0.001012826
#> 
#> $sustainability_data$gpu_energy_kwh
#> [1] 0
#> 
#> $sustainability_data$ram_energy_kwh
#> [1] 0.0004664782
#> 
#> $sustainability_data$total_energy_kwh
#> [1] 0.001479304
#> 
#> 
#> $technical
#> $technical$tracker
#> [1] "codecarbon"
#> 
#> $technical$py_package_version
#> [1] "2.3.4"
#> 
#> $technical$cpu_count
#> [1] 8
#> 
#> $technical$cpu_model
#> [1] "11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz"
#> 
#> $technical$gpu_count
#> [1] NA
#> 
#> $technical$gpu_model
#> [1] NA
#> 
#> $technical$ram_total_size
#> [1] 15.73279
#> 
#> 
#> $region
#> $region$country_name
#> [1] "Germany"
#> 
#> $region$country_iso_code
#> [1] "DEU"
#> 
#> $region$region
#> [1] NA

5.5 Saving and Loading a Classifier

Saving and loading follows the same pattern as for the other objects in aifeducation. You can save the classifier by calling save_to_disk. In our example this may be:

save_to_disk(
  object = classifier,
  dir_path = "C:/classifiers",
  folder_name = "imdb_movie_reviews"
)

The classifier is saved to C:/classifiers/imdb_movie_reviews. To load the model call load_from_disk.

classifier <- load_from_disk("C:/classifiers/imdb_movie_reviews")

5.6 Predicting New Data

If you would like to apply your classifier to new data, two steps are necessary. First, you must transform the raw texts into a numerical representation by using exactly the same text embedding model that was used to train your classifier (see section 4). In the case of our example classifier, we use our BERT model.

# If our model is not loaded
bert_modeling <- load_from_disk("C:/text_embedding_models/bert_model")

# Create a numerical representation of the text
review_embeddings <- bert_modeling$embed_large(
  large_datas_set = data_set_reviews_text,
  trace = TRUE
)

To transform raw texts into a numeric representation just pass the raw texts to the method embed_large of the loaded model. The raw texts should be an object of class LargeDataSetForText. To create such a data set, please refer to section 2.

In the example above, the text embeddings are stored in review_embeddings. Since embedding texts may take some time, it is a good idea to save the embeddings for future analysis (see section 2 for more details). This allows you to load the embeddings without the need to apply the text embedding model on the same raw texts again.

The resulting object can then be passed to the method predict of our classifier and you will get the predictions together with an estimate of certainty for each class/category.

# If your classifier is not loaded
classifier <- load_from_disk("C:/classifiers/imdb_movie_reviews")

# Predict the classes of new texts
predicted_categories <- classifier$predict(
  newdata = review_embeddings,
  batch_size = 8
)

After the classifier finishes the prediction, the estimated categories/classes are stored in predicted_categories. This object is a data.frame containing the texts' IDs in the rows and the probabilities of the different categories/classes in the columns. The last column, named expected_category, represents the category assigned to a text based on the highest probability.
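
For example, you can inspect the predictions and extract the assigned classes like this:

# Probabilities per class plus the assigned class in expected_category
head(predicted_categories)
assigned_classes <- predicted_categories$expected_category
table(assigned_classes)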

The estimates can be used in further analysis with common methods of the educational and social sciences such as correlation analysis, regression analysis, structural equation modeling, latent class analysis or analysis of variance.

Now you are ready to use aifeducation. In section 6 we describe further models for classification tasks and for improving model performance.

6 Extensions

6.1 Classifiers: ProtoNet

The classifier introduced in section 5 is a regular classifier which comes with the traditional challenges of deep learning, such as the need for a large amount of training data, expensive hardware requirements, and only a limited possibility to interpret the model's parameters (Jadon & Garg 2020, pp. 13-14). Since data is a bottleneck in the educational and social sciences, a classifier that can work with only small data sets would be preferable. These types of models are discussed in the literature under terms such as "meta-learning" (Zou 2023) or "few-shot learning" (Jadon & Garg 2020). The basic idea behind these approaches is that the model learns to use a supporting data set to predict the output for a query data set (e.g., Zou 2023, pp. 2-3). However, the model is not explicitly trained for the query data set.

One type of model within this area is the Prototypical Network (ProtoNet), initially proposed by Snell, Swersky, and Zemel (2017). This type of network was developed to create classifiers that are able to generalize to new classes that the model did not see during training, using only the information of a few examples of each class provided to the network (support data set). To achieve this goal, the network learns to create a prototype for every class in the support data set with the help of the examples for every class. Then, the network compares new data with these prototypes and assigns the class of the nearest prototype to the new data. Since the network calculates the distance of every new case to every prototype, it belongs to the metric-based meta-learning approaches (Zou 2023, pp. 48).
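
To illustrate the core idea with plain R (a minimal sketch with invented two-dimensional embeddings, not the package's implementation):

# Prototypes are the mean embedding of each class's support examples
support_neg <- matrix(c(0.1, 0.2, 0.0, 0.3, 0.2, 0.1), ncol = 2)
support_pos <- matrix(c(1.9, 2.1, 2.0, 1.8, 2.2, 2.0), ncol = 2)
proto_neg <- colMeans(support_neg)
proto_pos <- colMeans(support_pos)

# A new case receives the class of the nearest prototype
new_case <- c(1.8, 2.1)
distances <- c(
  neg = sqrt(sum((new_case - proto_neg)^2)),
  pos = sqrt(sum((new_case - proto_pos)^2))
)
names(which.min(distances))
#> [1] "pos"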

Since ProtoNet is a simple, easy-to-understand approach that provides good performance, several extensions have been suggested. aifeducation replaces the original loss function with the loss function suggested by Zhang et al. (2019) and adds the learnable metric described by Oreshkin, Rodriguez, and Lacoste (2018) to increase performance.

The implementation provided in aifeducation currently applies only to a fixed set of classes, and the prototypes are learned during training using all available training data. This will be extended in the future to allow users to select the support data set themselves.

The application of a classifier based on ProtoNet is similar to that of the regular classifier. The only difference is the argument embedding_dim. A ProtoNet classifier uses a network to project the similarities and differences between the single cases and all prototypes into an n-dimensional space: similar cases are located near each other, while different cases are located further apart. The number of dimensions of this space is determined by embedding_dim. If embedding_dim is set to 1, 2, or 3, the position of every case and of the prototypes can easily be visualized. For this example we use the same data as in section 5. Let us first create and configure the new classifier.

classifier <- TEClassifierProtoNet$new()
classifier$configure(
  name = "proto_net_movie_review_classifier",
  label = "ProtoNet classifier for Estimating a Postive or Negative Rating of Movie Reviews",
  text_embeddings = review_embeddings,
  feature_extractor = NULL,
  target_levels = c("neg", "pos"),
  hidden = c(5),
  rec = c(6, 6),
  rec_type = "gru",
  rec_bidirectional = FALSE,
  embedding_dim = 2,
  self_attention_heads = 0,
  intermediate_size = NULL,
  attention_type = "fourier",
  add_pos_embedding = TRUE,
  rec_dropout = 0.3,
  repeat_encoder = 0,
  dense_dropout = 0.4,
  recurrent_dropout = 0.4,
  encoder_dropout = 0.1,
  optimizer = "adam"
)

Now we can plot how the untrained classifier embeds the different cases and the prototypes. To create the corresponding plot, call the method plot_embeddings. The argument embeddings_q takes the embeddings of the cases as input. If you have the true classes for all or some of the cases, you can add them to the plot via the argument classes_q. The resulting plot is shown in the following figure.

plot_untrained <- classifier$plot_embeddings(
  embeddings_q = review_embeddings,
  classes_q = review_labels
)
plot_untrained

Figure 7: Embedding of an untrained classifier of type “ProtoNet”

The large triangles represent the prototypes for every class, while the dots refer to the labeled cases in the data set; for these, the color represents their true class. Unlabeled cases are drawn as squares, with the color indicating their estimated class. As you can see, all cases are located very similarly and there seems to be no clear structure. Let us see how this changes when we train the model.

classifier$train(
  data_embeddings = review_embeddings,
  data_targets = review_labels,
  data_folds = 5,
  data_val_size = 0.25,
  use_sc = TRUE,
  sc_method = "dbsmote",
  sc_min_k = 1,
  sc_max_k = 10,
  use_pl = TRUE,
  pl_max_steps = 5,
  pl_max = 1.00,
  pl_anchor = 1.00,
  pl_min = 0.00,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  epochs = 400,
  batch_size = 32,
  Ns = 2,
  Nq = 10,
  loss_alpha = 0.5,
  loss_margin = 0.5,
  sampling_separate = FALSE,
  sampling_shuffle = TRUE,
  dir_checkpoint = "training/classifier",
  trace = TRUE,
  ml_trace = 1
)

While the arguments for balancing the class weights (balance_class_weights) and the sequence length (balance_sequence_length) are not available, six new arguments are. With Ns you determine how many examples of every class are used in the support sample during training; these examples are used to calculate the prototype for every class. With Nq you determine how many examples of every class form the query sample, whose classes the network tries to predict correctly during training.

The arguments loss_alpha and loss_margin configure the loss function described by Zhang et al. (2019). loss_margin is the minimal distance that all examples of the query sample should keep from all prototypes that do not represent their class. loss_alpha determines whether the loss pays more attention to minimizing the distance between the examples and their corresponding prototype or to maximizing the distance to the prototypes that do not represent their class. If you set loss_alpha = 1, the loss only tries to minimize the distance of the examples to their corresponding prototype. If you set loss_alpha = 0, the loss only tries to maximize the distance of all examples to all prototypes that do not reflect their class.
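The following toy computation (illustrative only, not the package’s internal code) shows how loss_alpha weights the two objectives for a single query case:

# Illustrative only: loss_alpha weights pulling a case towards its own
# prototype against pushing it at least loss_margin away from wrong ones
d_correct <- 0.8      # distance to the case's own prototype
d_wrong <- 0.3        # distance to a prototype of another class
loss_alpha <- 0.5
loss_margin <- 0.5

loss <- loss_alpha * d_correct + (1 - loss_alpha) * max(0, loss_margin - d_wrong)
loss
#> [1] 0.5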

The next two important arguments refer to the sampling strategies during training. If you set sampling_separate = TRUE, the training data set is split into one data pool for the support sample and one data pool for the query sample; thus, a specific case can only ever serve as a support case or as a query case. If you set sampling_separate = FALSE, the support and query cases are drawn from the same pool, so a specific case can be a support case in one epoch and a query case in another. In this case it is ensured that a specific case never occurs as both support and query during the same training step and that every case occurs at most once within a training step. With sampling_shuffle you can request that a random sample is drawn from the training data set for every training step, resulting in different combinations of support and query cases. For training we highly recommend setting sampling_shuffle = TRUE, since this results in better-performing classifiers.
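A plain-R sketch of the difference, ignoring the per-class sampling for simplicity:

# Illustrative only: the two sampling strategies for one training step
case_ids <- 1:20

# sampling_separate = TRUE: fixed, disjoint pools for support and query
support_pool <- case_ids[1:10]
query_pool <- case_ids[11:20]

# sampling_separate = FALSE: support and query cases come from the same
# pool but never overlap within a single training step
step_cases <- sample(case_ids, size = 12)
support_step <- step_cases[1:2]    # cf. Ns (per class in real training)
query_step <- step_cases[3:12]     # cf. Nq (per class in real training)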

After training we can request a visualization of the data again. We first omit all unlabeled cases by setting inc_unlabeled=FALSE in order to get an impression of the quality of training.

plot_trained_1 <- classifier$plot_embeddings(
  embeddings_q = review_embeddings,
  classes_q = review_labels,
  inc_unlabeled = FALSE
)
plot_trained_1

As shown in the figure, all cases are now well separated. Cases of the class “neg” are located close to the prototype for “neg”, while cases of the class “pos” are located near the prototype for “pos”. Since we use the same data as during training, this result is to be expected. Only a small number of cases is located near the wrong prototype, which can be seen when a red dot lies close to the prototype for “pos” or a green dot lies close to the prototype for “neg”.

Figure 8: Embedding of a trained classifier of type “ProtoNet” without unlabeled cases

Let us now add the unlabeled cases to the plot by setting inc_unlabeled=TRUE.

plot_trained_2 <- classifier$plot_embeddings(
  embeddings_q = review_embeddings,
  classes_q = review_labels,
  inc_unlabeled = TRUE
)
plot_trained_2

As the following figure shows, the model estimates the class of these cases according to their distance to the two prototypes. Cases that are close to the prototype for “pos” are assigned to “pos”, while cases near the prototype for “neg” are assigned to “neg”.

Figure 9: Embedding of a trained classifier of type “ProtoNet” including unlabeled cases

Finally, let us report the reliability of this classifier.

classifier$reliability$test_metric_mean
#>              iota_index               min_iota2               avg_iota2 
#>               0.4375494               0.3161485               0.4643332 
#>               max_iota2               min_alpha               avg_alpha 
#>               0.6125178               0.4013095               0.6123214 
#>               max_alpha       static_iota_index      dynamic_iota_index 
#>               0.8233333               0.1946912               0.3727251 
#>          kalpha_nominal          kalpha_ordinal                 kendall 
#>               0.2297705               0.2297705               0.6241177 
#>       kappa2_unweighted   kappa2_equal_weighted kappa2_squared_weighted 
#>               0.2345552               0.2345552               0.2345552 
#>            kappa_fleiss    percentage_agreement       balanced_accuracy 
#>               0.2122470               0.6717391               0.6123214 
#>                 gwet_ac           avg_precision              avg_recall 
#>               0.4255430               0.6485922               0.6123214 
#>                  avg_f1 
#>               0.6061235

6.2 Feature Extractors

Another option to increase a model’s performance and/or its computational speed is to apply a feature extractor. For example, the work by Ganesh et al. (2021) indicates that a reduction of the hidden size can increase a model’s accuracy. In aifeducation, a feature extractor is a model that tries to reduce the number of features of given text embeddings before they are fed as input to a classifier.

The feature extractors implemented in aifeducation are auto-encoders that support sequential data and sequences of different lengths. The basic architecture of all extractors is shown in the following figure.

Figure 10: Basic architecture of feature extractors

The learning objective of a feature extractor is first to compress the information by reducing the number of features to the number of features of the latent space (Frochte 2019, p. 281). In the figure above, this means reducing the number of features from 8 to 4 while storing as much of the information from the 8 dimensions as possible in only 4 dimensions. In the next step, the extractor tries to reconstruct the original information from the compressed representation in the latent space (Frochte 2019, pp. 280-281), extending the information from 4 dimensions back to 8. After training, the hidden representation of the latent space is used as a compressed version of the original input.
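A plain-R illustration of these shapes (random weights, illustrative only):

# Illustrative only: compression from 8 features to a 4-dimensional
# latent space and reconstruction back to 8 features
set.seed(42)
x <- matrix(rnorm(3 * 8), nrow = 3)          # 3 cases with 8 features

W_encode <- matrix(rnorm(8 * 4), nrow = 8)   # encoder weights: 8 -> 4
W_decode <- matrix(rnorm(4 * 8), nrow = 4)   # decoder weights: 4 -> 8

latent <- x %*% W_encode                     # compressed representation
reconstruction <- latent %*% W_decode        # reconstructed input

dim(latent)
#> [1] 3 4
dim(reconstruction)
#> [1] 3 8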

You can create a feature extractor as follows.

feature_extractor <- TEFeatureExtractor$new()
feature_extractor$configure(
  name = "feature_extractor_bert_movie_reviews",
  label = "Feature extractor for Text Embeddings via BERT",
  text_embeddings = review_embeddings,
  features = 128,
  method = "lstm",
  noise_factor = 0.2,
  optimizer = "adam"
)

As with the other models, you can use name for the model’s name and label for the model’s label. The argument text_embeddings takes an object of class EmbeddedText or LargeDataSetForTextEmbeddings. With this object you connect your feature extractor to a specific TextEmbeddingModel; that is, the feature extractor works only with embeddings from exactly that TextEmbeddingModel.

features determines the number of features of the compressed representation. The lower the number, the stronger the requested compression. This value corresponds to the number of features of the latent space in the figure above.

With method you determine the type of layers the feature extractor should use. If you set method = "lstm", all layers of the model are long short-term memory layers. If you set method = "dense", all layers are standard dense layers.

Independently of your choice, all models try to generate the latent space such that the covariance of the features is zero; thus, every feature represents unique information. In addition, all methods except "lstm" use an orthogonal parameterization to prevent over-fitting and apply parameter sharing, i.e., mirrored encoder and decoder layers use the same parameters. For more details, please refer to Ranjan (2019).

With noise_factor you can add some noise during training, turning the feature extractor into a denoising auto-encoder, which can yield more robust generalization.
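As a plain-R illustration (not the package’s internal code), the corruption step conceptually looks like this:

# Illustrative only: corrupt the input with Gaussian noise scaled by
# noise_factor; the auto-encoder is trained to reconstruct the clean input
noise_factor <- 0.2
x_clean <- matrix(rnorm(3 * 8), nrow = 3)
x_noisy <- x_clean + noise_factor * matrix(rnorm(length(x_clean)), nrow = 3)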

Training the extractor is identical to training the other models in aifeducation. Please note that the text embeddings provided to data_embeddings must be generated with the same TextEmbeddingModel as the embeddings provided during the configuration of your model.

feature_extractor$train(
  data_embeddings = review_embeddings,
  data_val_size = 0.25,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  epochs = 40,
  batch_size = 32,
  dir_checkpoint = "training/feature_extractor",
  trace = TRUE,
  ml_trace = 1
)

After you have trained your feature extractor, you can use it for every classifier. Just pass the feature extractor to feature_extractor during configuration of the classifier. For the classifier described in section 5 this would look like:

classifier <- TEClassifierRegular$new()
classifier$configure(
  name = "movie_review_classifier",
  label = "Classifier for Estimating a Postive or Negative Rating of Movie Reviews",
  text_embeddings = review_embeddings,
  feature_extractor = feature_extractor,
  target_levels = c("neg", "pos"),
  hidden = c(5),
  rec = c(6, 6),
  rec_type = "gru",
  rec_bidirectional = FALSE,
  self_attention_heads = 0,
  intermediate_size = NULL,
  attention_type = "fourier",
  add_pos_embedding = TRUE,
  rec_dropout = 0.1,
  repeat_encoder = 0,
  dense_dropout = 0.4,
  recurrent_dropout = 0.4,
  encoder_dropout = 0.1,
  optimizer = "adam"
)

That is all. You can now use and train the classifier in the same way as without a feature extractor. The feature extractor is even saved and loaded automatically together with the classifier.
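As a minimal sketch (with the same assumptions about save_to_disk() and load_from_disk() as above; the paths are placeholders):

# Save the classifier; the attached feature extractor is stored with it
save_to_disk(
  object = classifier,
  dir_path = "C:/classifiers",
  folder_name = "imdb_movie_reviews_with_extractor"
)

# Loading restores the classifier and its feature extractor together
classifier <- load_from_disk("C:/classifiers/imdb_movie_reviews_with_extractor")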

References

Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. https://doi.org/10.48550/arXiv.2004.05150

Berding, F., & Pargmann, J. (2022). Iota Reliability Concept of the Second Generation. Berlin: Logos. https://doi.org/10.30819/5581

Berding, F., Riebenbauer, E., Stütz, S., Jahncke, H., Slopinski, A., & Rebmann, K. (2022). Performance and Configuration of Artificial Intelligence in Educational Settings: Introducing a New Reliability Concept Based on Content Analysis. Frontiers in Education, 1–21. https://doi.org/10.3389/feduc.2022.818365

Campesato, O. (2021). Natural Language Processing Fundamentals for Developers. Mercury Learning & Information. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6647713

Cascante-Bonilla, P., Tan, F., Qi, Y. & Ordonez, V. (2020). Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning. https://doi.org/10.48550/arXiv.2001.06001

Chollet, F., Kalinowski, T., & Allaire, J. J. (2022). Deep learning with R (Second edition). Manning Publications Co. https://learning.oreilly.com/library/view/-/9781633439849/?ar

Dai, Z., Lai, G., Yang, Y. & Le, Q. V. (2020). Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. https://doi.org/10.48550/arXiv.2006.03236

Devlin, J., Chang, M.‑W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

Frochte, J. (2019). Maschinelles Lernen: Grundlagen und Algorithmen in Python (2., aktualisierte Auflage). Hanser.

Ganesh, P., Chen, Y., Lou, X., Khan, M. A., Yang, Y., Sajjad, H., Nakov, P., Chen, D., & Winslett, M. (2021). Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. Transactions of the Association for Computational Linguistics, 9, 1061–1080. https://doi.org/10.1162/tacl_a_00413

Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (Fourth edition). Gaithersburg: STATAXIS.

He, P., Liu, X., Gao, J. & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. https://doi.org/10.48550/arXiv.2006.03654

Jadon, S., & Garg, A. (2020). Hands-On One-shot Learning with Python: Learn to Implement Fast and Accurate Deep Learning Models with Fewer Training Samples Using Pytorch. Packt Publishing Limited. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6175328

Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology (4th ed.). Los Angeles: SAGE.

Lane, H., Howard, C., & Hapke, H. M. (2019). Natural language processing in action: Understanding, analyzing, and generating text with Python. Shelter Island: Manning.

Larusson, J. A., & White, B. (Eds.). (2014). Learning Analytics: From Research to Practice. New York: Springer. https://doi.org/10.1007/978-1-4614-3305-7

Lee, D.‑H. (2013). Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. ICML 2013 Workshop: Challenges in Representation Learning.

Lee-Thorp, J., Ainslie, J., Eckstein, I. & Ontanon, S. (2021). FNet: Mixing Tokens with Fourier Transforms. https://doi.org/10.48550/arXiv.2105.03824

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://doi.org/10.48550/arXiv.1907.11692

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In D. Lin, Y. Matsumoto, & R. Mihalcea (Eds.), Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 142–150). Association for Computational Linguistics. https://aclanthology.org/P11-1015

Oreshkin, B. N., Rodriguez, P., & Lacoste, A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. Advance online publication. https://doi.org/10.48550/arXiv.1805.10123

Papilloud, C., & Hinneburg, A. (2018). Qualitative Textanalyse mit Topic-Modellen: Eine Einführung für Sozialwissenschaftler. Wiesbaden: Springer. https://doi.org/10.1007/978-3-658-21980-2

Pappagari, R., Zelasko, P., Villalba, J., Carmiel, Y., & Dehak, N. (2019). Hierarchical Transformers for Long Document Classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 838–844). IEEE. https://doi.org/10.1109/ASRU46091.2019.9003958

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. https://aclanthology.org/D14-1162.pdf

Ranjan, C. (2019). Build the right Autoencoder — Tune and Optimize using PCA principles: Part I. https://towardsdatascience.com/build-the-right-autoencoder-tune-and-optimize-using-pca-principles-part-i-1f01f821999b

Schreier, M. (2012). Qualitative Content Analysis in Practice. Los Angeles: SAGE.

Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. https://doi.org/10.48550/arXiv.1703.05175

Song, K., Tan, X., Qin, T., Lu, J. & Liu, T.‑Y. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. https://doi.org/10.48550/arXiv.2004.09297

Tunstall, L., Werra, L. von, Wolf, T., & Géron, A. (2022). Natural language processing with transformers: Building language applications with hugging face (Revised edition). Heidelberg: O’Reilly.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. https://doi.org/10.48550/arXiv.1706.03762

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. https://doi.org/10.48550/arXiv.1609.08144

Zhang, X., Nie, J., Zong, L., Yu, H., & Liang, W. (2019). One Shot Learning with Margin. In Q. Yang, Z.-H. Zhou, Z. Gong, M.-L. Zhang, & S.-J. Huang (Eds.), Lecture Notes in Computer Science. Advances in Knowledge Discovery and Data Mining (Vol. 11440, pp. 305–317). Springer International Publishing. https://doi.org/10.1007/978-3-030-16145-3_24

Zhou, L. (2023). Meta-Learning: Theory, Algorithms and Applications. Elsevier Science & Technology. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=7134465