1 Introduction and Overview

1.1 Preface

This vignette introduces the package aifeducation and its usage with R syntax. For users who are unfamiliar with R or who do not have coding skills in relevant languages (e.g., python), we recommend starting with the graphical user interface Aifeducation - Studio, which is described in the vignette 02 Using the graphical user interface Aifeducation - Studio.

We assume that aifeducation is installed as described in vignette 01 Get Started. The introduction starts with a brief explanation of basic concepts, which are necessary to work with this package.

1.2 Basic Concepts

In the educational and social sciences, assigning scientific concepts to an observation is an important task that allows researchers to understand an observation, to generate new insights, and to derive recommendations for research and practice.

In educational science, several areas deal with this kind of task. For example, diagnosing students' characteristics is an important aspect of a teacher's profession and necessary to understand and promote learning. Another example is the use of learning analytics, where data about students is used to provide learning environments adapted to their individual needs. On another level, educational institutions such as schools and universities can use this information for data-driven performance decisions (Laurusson & White 2014) as well as for decisions about where and how to improve their institution. In any case, a real-world observation is aligned with scientific models to use scientific knowledge as a technology for improved learning and instruction.

Supervised machine learning is one concept that allows a link between real-world observations and existing scientific models and theories (Berding et al. 2022). For educational science, this is a great advantage because it allows researchers to use existing knowledge and insights when applying AI. The drawback of this approach is that the training of AI requires both information about the real-world observations and information on the corresponding alignment with scientific models and theories.

A valuable source of data in educational science is written text, since textual data can be found almost everywhere in the realm of learning and teaching (Berding et al. 2022). For example, teachers often require students to solve a task provided in written form. Students have to create a solution for these tasks, which they often document in a short written essay or a presentation. This data can be used to analyze learning and teaching. Teachers' written tasks for their students may provide insights into the quality of instruction, while students' solutions may provide insights into their learning outcomes and prerequisites.

AI can be a helpful assistant in analyzing textual data, since this kind of analysis is a challenging and time-consuming task for humans.

Please note that an introduction to content analysis, natural language processing or machine learning is beyond the scope of this vignette. If you would like to learn more, please refer to the cited literature.

Before we start, it is necessary to define our understanding of some basic concepts, since applying AI to educational contexts means combining the knowledge of different scientific disciplines that use different, sometimes overlapping, concepts. Even within a single research area, concepts are not unified. Figure 1 illustrates this package's understanding.

Figure 1: Understanding of Central Concepts

Since aifeducation looks at the application of AI for classification tasks from the perspective of the empirical method of content analysis, there is some overlap between the concepts of content analysis and machine learning. In content analysis, a phenomenon such as performance or color can be described as a scale/dimension which is made up of several categories (e.g., Schreier 2012, pp. 59). In our example, an exam's performance (scale/dimension) could be "good", "average", or "poor". In terms of colors (scale/dimension), categories could be "blue", "green", etc. Machine learning literature uses other words to describe this kind of data: "scale" and "dimension" correspond to the term "label", while "categories" corresponds to the term "classes" (Chollet, Kalinowski & Allaire 2022, p. 114).

With these clarifications, classification means that a text is assigned to the correct category of a scale or, respectively, that the text is labeled with the correct class. As Figure 2 illustrates, two kinds of data are necessary to train an AI to classify text in line with supervised machine learning principles.

Figure 2: Basic Structure of Supervised Machine Learning

By providing AI with both the textual data as input data and the corresponding information about the class as target data, AI can learn which texts imply a specific class or category. In the above exam example, AI can learn which texts imply a “good”, an “average” or a “poor” judgment. After training, AI can be applied to new texts and predict the most likely class of every new text. The generated class can be used for further statistical analysis or to derive recommendations about learning and teaching.

In use cases as described in this vignette, AI has to "understand" natural language: "Natural language processing is an area of research in computer science and artificial intelligence (AI) concerned with processing natural languages such as English and Mandarin. This processing generally involves translating natural language into data (numbers) that a computer can use to learn about the world. (…)" (Lane, Howard & Hapke 2019, p. 4)

Thus, the first step is to transform raw texts into a form that is usable for a computer; that is, raw texts must be transformed into numbers. In modern approaches, this is usually done through word embeddings. Campesato (2021, p. 102) describes them as "the collective name for a set of language modeling and feature learning techniques (…) where words or phrases from the vocabulary are mapped to vectors of real numbers." The definition of a word vector is similar: "Word vectors represent the semantic meaning of words as vectors in the context of the training corpus." (Lane, Howard & Hapke 2019, p. 191). In the next step, the word or text embeddings can be used as input data and the labels as target data when training AI to classify a text.
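
For illustration, a word embedding is nothing more than a numeric vector of fixed length. A minimal sketch in R (the values are invented):

# Two hypothetical three-dimensional word vectors
vector_learning <- c(0.21, -0.47, 0.88)
vector_teaching <- c(0.19, -0.45, 0.91)
# Semantically related words are mapped to similar vectors;
# real models use several hundred dimensions per token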

In aifeducation, these steps are covered with three different types of models, as shown in Figure 3.

Figure 3: Model Types in aifeducation
  • Base Models: The base models contain the capacities to understand natural language. In general, these are transformers such as BERT, RoBERTa, etc. A huge number of pre-trained models can be found on Hugging Face.

  • Text Embedding Models: These models are built on top of base models and store the instructions on how to use these base models for converting raw texts into sequences of numbers. Please note that the same base model can be used to create different text embedding models.

  • Classifiers: Classifiers are used on top of a text embedding model. They are used to classify a text into categories/classes based on the numeric representation provided by the corresponding text embedding model. Please note that a text embedding model can be used to create different classifiers (e.g. one classifier for colors, one classifier to estimate the quality of a text, etc.).

2 Start Working

2.1 Starting a New Session

Before you can work with aifeducation, you must set up a new R session. First, it is necessary to set up python via 'reticulate' and choose the conda environment where all necessary python libraries are available. Second, you can load aifeducation. In case you installed python as suggested in vignette 01 Get started, you may start a new session like this:

reticulate::use_condaenv(condaenv = "aifeducation")
library(aifeducation)

Note: Every time you start a new session in R, you have to set the correct conda environment and load the library aifeducation.

2.2 Data Management

2.2.1 Introduction

In the context of use cases for aifeducation, three different types of data are necessary: raw texts, text embeddings, and target data which represent the categories/classes of a text.

To deal with the first two types and to allow the use of large data sets that may not fit into the memory of your machine, the package ships with two specialized objects.

The first is LargeDataSetForText. Objects of this class are used to read raw texts from .txt, .pdf, and .xlsx files and store them for further computations. The second is LargeDataSetForTextEmbeddings, which is used to store the text embeddings of raw texts generated with TextEmbeddingModels. We will describe the transformation of raw texts into text embeddings later.

2.2.2 Raw Texts

The creation of a LargeDataSetForText is necessary if you would like to create or train a base model or to generate text embeddings. In case you would like to create such a data set for the first time, you have to call:

raw_texts <- LargeDataSetForText$new()

Now you have an empty data set. To fill this object with raw texts, different methods are available, depending on the file type used for storing the raw texts.

.txt files

The first alternative is to store raw texts in .txt files. To use these, you have to structure your data in a specific way:

  • Create a main folder for storing your data.
  • Store every raw text/document as a single .txt file in its own folder within the main folder. Every folder should contain only one file holding a raw text/document.
  • Add an additional .txt file to the folder named bib_entry.txt. This file contains the bibliographic information for the raw text.
  • Add an additional .txt file to the folder named license.txt which contains a short statement of the text's license, such as "CC BY".
  • Add an additional .txt file to the folder named url_license.txt which contains the URL/link to the license's text, such as "https://creativecommons.org/licenses/by/4.0/".
  • Add an additional .txt file to the folder named text_license.txt which contains the full license text.
  • Add an additional .txt file to the folder named url_source.txt which contains the URL/link to the source of the text on the internet.

Applying these rules may result in a data structure as follows:

  • Folder “main folder”
    • Folder Text A
      • text_a.txt
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
    • Folder Text B
      • text_b.txt
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
    • Folder Text C
      • text_c.txt
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
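
If you prefer to set up this structure from within R, a minimal sketch for one document may look like this (paths and file contents are hypothetical):

# Create the folder for one document and fill it with the text and metadata
dir.create("main folder/Text A", recursive = TRUE)
writeLines("This is the raw text of document A.", "main folder/Text A/text_a.txt")
writeLines("Author, A. (2019). Example Title.", "main folder/Text A/bib_entry.txt")
writeLines("CC BY", "main folder/Text A/license.txt")
writeLines("https://creativecommons.org/licenses/by/4.0/", "main folder/Text A/url_license.txt")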

Now you can call the method add_from_files_txt by passing the path to the directory of the main folder to dir_path.

raw_texts$add_from_files_txt(
  dir_path = "main folder"
)

The data set will now read all the raw texts in the main folder and assign every text its corresponding bib entry and license. Please note that adding a bib_entry.txt, license.txt, url_license.txt, text_license.txt, and url_source.txt to every folder is optional. If there is no such file in the corresponding folder, there will be an empty entry in the data set. However, against the backdrop of the European AI Act, we recommend providing both the license and the bibliographic information to make the documentation of your models more straightforward. Furthermore, some licenses such as those provided by Creative Commons require statements about the creators, a copyright note, a URL or link to the source material (if possible), the license of the material, and a URL or link to the license's text on the internet or the license text itself. Please check the licenses of the material you are using for the requirements.

.pdf files

The second alternative is to use .pdf files as a source for raw texts. Here, the necessary structure is similar to .txt files:

  • Create a main folder for storing your data.
  • Store every raw text/document as a single .pdf file in its own folder within the main folder. Every folder should contain only one file holding a raw text/document.
  • Add an additional .txt file to the folder named bib_entry.txt. This file contains the bibliographic information for the raw text.
  • Add an additional .txt file to the folder named license.txt which contains a short statement of the text's license, such as "CC BY".
  • Add an additional .txt file to the folder named url_license.txt which contains the URL/link to the license text, such as "https://creativecommons.org/licenses/by/4.0/".
  • Add an additional .txt file to the folder named text_license.txt which contains the full license text.
  • Add an additional .txt file to the folder named url_source.txt which contains the URL/link to the source of the text on the internet.

Applying these rules may result in a data structure as follows:

  • Folder “main folder”
    • Folder Text A
      • text_a.pdf
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
    • Folder Text B
      • text_b.pdf
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt
    • Folder Text C
      • text_c.pdf
      • bib_entry.txt
      • license.txt
      • url_license.txt
      • text_license.txt
      • url_source.txt

Please note that all files except the text file itself must be .txt files, not .pdf files.

Now you can call the method add_from_files_pdf by passing the path to the directory of the main folder to dir_path.

raw_texts$add_from_files_pdf(
  dir_path = "main folder"
)

As stated above, bib_entry.txt, license.txt, url_license.txt, text_license.txt, and url_source.txt are optional.

.xlsx files

The third alternative is to store the raw texts in .xlsx files. This alternative is useful if you have many small raw texts. For raw texts that are very large, such as books or papers, we recommend storing them as .txt or .pdf files.

In order to add raw texts from .xlsx files, the files need a special structure:

  • Create a main folder for storing all .xlsx files you would like to read.
  • All .xlsx files must contain the names of the columns in the first row, and the names must be identical for each column across all .xlsx files you would like to read.
  • Every .xlsx file must contain a column storing the text ID and a column storing the raw text. Every text must have a unique ID across all .xlsx files.
  • Every .xlsx file can contain an additional column for the bib entry.
  • Every .xlsx file can contain an additional column for the license.
  • Every .xlsx file can contain an additional column for the license's URL.
  • Every .xlsx file can contain an additional column for the license text.
  • Every .xlsx file can contain an additional column for the source's URL.

Your .xlsx file may look like this:

  id   text                        bib_entry       license   url_license   text_license   url_source
  z3   This is an example.         Author (2019)   CC BY     Example URL   Text           Example URL
  a3   This is a second example.   Author (2022)   CC BY     Example URL   Text           Example URL

Now you can call the method add_from_files_xlsx by passing the path of the main folder to dir_path. Please do not forget to specify the column names for the ID and the text as well as for the bibliographic and license information.

raw_texts$add_from_files_xlsx(
  dir_path = "main folder",
  id_column = "id",
  text_column = "text",
  bib_entry_column = "bib_entry",
  license_column = "license",
  url_license_column = "url_license",
  text_license_column = "text_license",
  url_source_column = "url_source"
)

Saving and loading a data set

Once you have created a LargeDataSetForText, you can save your data to disk by calling the function save_to_disk. In our example the code would be:

save_to_disk(
  object = raw_texts,
  dir_path = "C:/",
  folder_name = "raw_texts"
)

The argument object requires the object you would like to save. In our case this is raw_texts. With dir_path you specify the location where to save the object, and with folder_name you define the name of the folder that will be created within that directory. The data set is saved in this folder.

To load an existing data set, you can call the function load_from_disk with the directory path where you stored the data. In our case this would be:

raw_text_dataset <- load_from_disk("C:/raw_texts")

Now you can work with your data.

2.2.3 Text Embeddings

The numerical representations of raw texts (called text embeddings) are stored in objects of class LargeDataSetForTextEmbeddings. These data sets are generated by models such as TextEmbeddingModels. Thus, you will never need to create such a data set manually.

However, you will need this kind of data set to train a classifier or to predict the categories/classes of raw texts. Thus, it may be advantageous to save already transformed data. You can save and load an object of this class with the functions save_to_disk and load_from_disk.

Let us assume that we have a LargeDataSetForTextEmbeddings text_embeddings. Saving this object may look like:

save_to_disk(
  object = text_embeddings,
  dir_path = "C:/",
  folder_name = "text_embeddings"
)

The data set will be saved at C:/text_embeddings. Loading this data set may look like:

new_text_embeddings <- load_from_disk("C:/text_embeddings")

2.2.4 Target Data

The last type of data necessary for working with aifeducation is the categories/classes of the given raw texts. For this kind of data, we currently do not provide a special object. You just need a named factor storing the classes/categories for a dimension. It is important that the names equal the IDs of the corresponding raw texts/text embeddings, since matching the classes/categories to texts is done with the help of these names.

Saving and loading can be done with R’s functions save and load.
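
For example, a named factor for our exam scale could be created and saved like this (the IDs and values are hypothetical):

# Named factor storing the categories; the names must equal the text IDs
target_data <- factor(c("good", "average", "poor"))
names(target_data) <- c("text_01", "text_02", "text_03")
save(target_data, file = "target_data.rda")
# load("target_data.rda") restores the factor in a later session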

2.3 Example Data for this Vignette

To illustrate the steps in this vignette, we cannot use data from educational settings since such data is generally protected by privacy policies. Therefore, we use a subset of the Stanford Movie Review Dataset provided by Maas et al. (2011), which is part of the package. You can access the data set with imdb_movie_reviews.

We now have a data set with three columns. The first column contains the raw text, the second contains the rating of the movie (positive or negative), and the third contains the ID of the movie review. About 200 reviews imply a positive rating of a movie and about 100 imply a negative rating.

For this tutorial, we modify this data set by setting about 50 positive and 25 negative reviews to NA, indicating that these reviews are not labeled.

example_data <- imdb_movie_reviews
example_data$label <- as.character(example_data$label)
example_data$label[c(76:100)] <- NA
example_data$label[c(201:250)] <- NA
example_targets <- as.factor(example_data$label)
table(example_data$label)
#> 
#> neg pos 
#>  75 150

We will now create a LargeDataSetForText from this data.frame. Before we can do this, we must ensure that the data.frame has all necessary columns:

colnames(example_data)
#> [1] "text"  "label" "id"

Now we have to add two columns. For this tutorial we do not add any bibliographic or license information although this is recommended in practice.

example_data$bib_entry <- NA
example_data$license <- NA
colnames(example_data)
#> [1] "text"      "label"     "id"        "bib_entry" "license"

Now the data.frame is ready as input for our data set. The “label” column will not be included in this data set.

data_set_reviews_text <- LargeDataSetForText$new()
data_set_reviews_text$add_from_data.frame(example_data)

We save the categories/labels within a separate factor.

review_labels <- example_data$label
names(review_labels) <- example_data$id

We will now use this data to show you how to use the different objects and functions in aifeducation.

3 Base Models

3.1 Overview

Base models are the foundation of all further models in aifeducation. At the moment, these are transformer models such as MPNet, BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), DeBERTa version 2 (He et al. 2020), Funnel-Transformer (Dai et al. 2020), and Longformer (Beltagy, Peters & Cohan 2020). In general, these models are first trained on a large corpus of general texts. In a second step, the models are fine-tuned on domain-specific texts and/or for specific tasks. Since the creation of base models requires a huge number of texts, resulting in high computational time, it is recommended to use pre-trained models. These can be found on Hugging Face. Sometimes, however, it is more straightforward to create a new model to fit a specific purpose. aifeducation supports both creating and training/fine-tuning base models.

3.2 Creation of Base Models

Every transformer model is composed of two parts: 1) the tokenizer which splits raw texts into smaller pieces to model a large number of words with a limited, small number of tokens and 2) the neural network that is used to model the capabilities for understanding natural language.

At the beginning you can choose between the different supported transformer architectures. Depending on the architecture, you have different options determining the shape of your neural network. For this vignette we use a BERT (Devlin et al. 2019) model which can be created with the create-method of the Transformer class. Use aife_transformer_maker to create a transformer object.

See section 3 'Transformer Maker' of the vignette 01 Transformers for Developers for details.

base_model <- aife_transformer_maker$make("bert")
base_model$create(
  ml_framework = "pytorch",
  model_dir = "my_own_transformer",
  text_dataset = LargeDataSetForText$new(example_data),
  vocab_size = 30522,
  vocab_do_lower_case = FALSE,
  max_position_embeddings = 512,
  hidden_size = 768,
  num_hidden_layer = 12,
  num_attention_heads = 12,
  intermediate_size = 3072,
  hidden_act = "gelu",
  hidden_dropout_prob = 0.1,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  trace = TRUE,
  log_dir = NULL,
  log_write_interval = 2
)

First, the function receives the machine learning framework you chose at the start of the session. However, you can change this by setting ml_framework="tensorflow" or ml_framework="pytorch".

For this function to work, you must provide a path to a directory where your new transformer should be saved (model_dir). Furthermore, you must provide raw texts. These texts are not used to train the transformer but to build the vocabulary. The maximum size of the vocabulary is determined by vocab_size. Modern tokenizers such as WordPiece (Wu et al. 2016) use algorithms that split tokens into smaller elements, allowing them to build a huge number of words from a small number of elements. Thus, even with a small vocabulary of about 30,000 tokens, they are able to represent a very large number of words.

The other parameters allow you to customize your BERT model. For example, you could increase the number of hidden layers from 12 to 24 or reduce the hidden size from 768 to 256, allowing you to build and to test larger or smaller models.

The vignette 04 Model configuration provides details on how to configure a base model.

Please note that with max_position_embeddings you determine how many tokens your transformer can process. If your text has more tokens, these tokens are ignored. However, if you would like to analyze long documents, please avoid increasing this number too significantly because the computational time does not grow linearly but quadratically (Beltagy, Peters & Cohan 2020). For long documents, you can use another architecture (e.g., Longformer from Beltagy, Peters & Cohan 2020) or split a long document into several chunks which are used sequentially for classification (e.g., Pappagari et al. 2019). Using chunks is supported by aifeducation for all models.

Since creating a transformer model is energy consuming, aifeducation allows you to estimate its ecological impact with the help of the python library codecarbon. Thus, sustain_track is set to TRUE by default. If you use the sustainability tracker, you must provide the alpha-3 code for the country where your computer is located (e.g., "CAN" = Canada, "DEU" = Germany). A list with the codes can be found on Wikipedia. The reason is that different countries use different sources and techniques for generating their energy, resulting in a specific impact on CO2 emissions. For the USA and Canada, you can additionally specify a region by setting sustain_region. Please refer to the documentation of codecarbon for more information.

After calling the function, you will find your new model in your model directory.

3.3 Train/Fine-Tune a Base Model

If you would like to train a new base model (see section 3.2) for the first time or want to adapt a pre-trained model to a domain-specific language or task, you can call the corresponding train-method.

See section 3 'Transformer Maker' of the vignette 01 Transformers for Developers for details.

base_model$train(
  ml_framework = "pytorch",
  output_dir = "my_own_transformer_trained",
  model_dir_path = "my_own_transformer",
  text_dataset = LargeDataSetForText$new(example_data[1:10, ]),
  p_mask = 0.15,
  whole_word = TRUE,
  val_size = 0.1,
  n_epoch = 1,
  batch_size = 12,
  chunk_size = 250,
  n_workers = 1,
  multi_process = FALSE,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  trace = TRUE,
  log_dir = NULL,
  log_write_interval = 2
)

Here it is important that you provide the path to the directory where your new transformer is stored (model_dir_path). Furthermore, it is important that you provide another directory (output_dir) where your trained transformer should be saved to avoid reading and writing collisions.

Now, the provided raw data is used to train your model. In case of a BERT model, the learning objective is Masked Language Modeling. Other models may use other learning objectives. Please refer to the documentation for more details on every model.

First, you can set the length of the token sequences with chunk_size. With whole_word you can choose between masking single tokens or masking complete words (please remember that modern tokenizers split words into several tokens, so tokens and words do not necessarily match directly). With p_mask you can determine how many tokens should be masked. Finally, with val_size you set how many chunks of tokens should be used for the validation sample. The minimum is 2.

Please remember to set the correct alpha-3 code for tracking the ecological impact of training your model (sustain_iso_code).

If you work on a machine whose graphics device has only a small memory capacity, please reduce the batch size significantly. We also recommend changing the memory usage with set_config_gpu_low_memory() at the beginning of the session if you use tensorflow as the framework.

After the training finishes, you can find the transformer ready to use in the directory set with output_dir. Now you are able to create a text embedding model.

Again you can change the machine learning framework by setting ml_framework="tensorflow" or ml_framework="pytorch". If you do not change this argument, the framework you chose at the beginning is used.

4 Text Embedding Models

4.1 Introduction

The text embedding model is the interface to R in aifeducation. In order to create a new model, you need a base model that provides the ability to understand natural language. A text embedding model is stored as an object of class TextEmbeddingModel. This object contains all relevant information for transforming raw texts into a numeric representation that can be used for machine learning.

In aifeducation, the transformation of raw texts into numbers is a separate step from downstream tasks such as classification. This reduces computational time on machines with low performance. By separating text embedding from other tasks, the text embedding has to be calculated only once and can be used for different tasks at the same time. Another advantage is that the training of the downstream tasks involves only the parameters of the downstream tasks and not the parameters of the embedding model, making training less time-consuming and thus decreasing computational intensity. Finally, this approach allows the analysis of long documents by applying the same algorithm to different parts.

The text embedding model provides a unified interface: After creating the model with different methods, the handling of the model is always the same.

4.2 Create a Text Embedding Model

First you have to choose the base model that forms the foundation of your new text embedding model. Since we use a BERT model in our example, we have to set method = "bert".

bert_modeling <- TextEmbeddingModel$new()
bert_modeling$configure(
  model_name = "bert_embedding",
  model_label = "Text Embedding via BERT",
  model_language = "english",
  method = "bert",
  max_length = 512,
  chunks = 4,
  overlap = 30,
  emb_layer_min = "middle",
  emb_layer_max = "2_3_layer",
  emb_pool_type = "average",
  model_dir = "my_own_transformer_trained"
)

Next, you have to provide the directory where your base model is stored. In this example this would be model_dir = "my_own_transformer_trained". Of course, you can use any other pre-trained model from Hugging Face which addresses your needs.

Using a BERT model for text embedding is unproblematic as long as a text does not contain more tokens than the transformer can process. This maximum value is set in the configuration of the transformer (see section 3.2). If a text produces more tokens, the last tokens are ignored. In some instances, however, you might want to analyze long texts. In these situations, reducing the text to the first tokens (e.g., only the first 512 tokens) could result in a problematic loss of information. To deal with these situations, you can configure a text embedding model in aifeducation to split long texts into several chunks which are processed by the base model. The maximum number of chunks is set with chunks. In our example above, the text embedding model would split a text consisting of 1,024 tokens into two chunks, with every chunk consisting of 512 tokens. For every chunk, a text embedding is calculated. As a result, you receive a sequence of embeddings. The first embedding characterizes the first part of the text, the second embedding characterizes the second part of the text, and so on. Thus, our sample text embedding model is able to process texts with about 4*512 = 2,048 tokens. This approach is inspired by the work of Pappagari et al. (2019).

Since transformers are able to account for the context, it may be useful to interconnect the chunks to bring context into the calculations. This can be done with overlap, which determines how many tokens from the end of the prior chunk should be added to the beginning of the next. In our example, the last 30 tokens of the prior chunk are added at the beginning of the following chunk. This can help to provide the correct context of the text sections in the analysis. Altogether, this sample model can analyze a maximum of 512 + (4 - 1) * (512 - 30) = 1958 tokens of a text.
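
A small sketch makes this arithmetic explicit:

# Maximum number of tokens the sample model can process
max_length <- 512
chunks <- 4
overlap <- 30
# Every chunk after the first re-uses 'overlap' tokens from its predecessor
max_length + (chunks - 1) * (max_length - overlap)
#> [1] 1958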

Finally, you have to decide from which hidden layer(s) the embeddings should be drawn. With emb_layer_min and emb_layer_max you can decide from which layers the average value for every token should be calculated. Please note that the calculation considers all layers between emb_layer_min and emb_layer_max. In their initial work, Devlin et al. (2019) used the hidden states of different layers for classification.

With emb_pool_type, you decide which tokens are used for pooling within every layer. In the case of emb_pool_type="cls", only the cls token is used. In the case of emb_pool_type="average" all tokens within a layer are averaged except padding tokens.

The vignette 04 Model configuration provides details on how to configure a text embedding model.

After deciding about the configuration, you can use your model.

4.3 Transforming Raw Texts into Embedded Texts

To transform raw text into a numeric representation, you only have to use the embed_large method of your model. To do this, you must provide a LargeDataSetForText to large_datas_set. Relying on the sample data from section 2.3, we can use the movie reviews as raw texts.

review_embeddings <- bert_modeling$embed_large(
  large_datas_set = data_set_reviews_text,
  trace = TRUE
)

The method embed_large creates an object of class LargeDataSetForTextEmbeddings. This is a data set consisting of the embeddings of every text. The embeddings are stored in an array whose first dimension refers to the specific texts, whose second dimension refers to the chunks/sequences, and whose third dimension refers to the features.
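
To illustrate this layout with plain R (a sketch of the shape only, not the package's internal storage format):

# 3 texts, 4 chunks per text, 768 features per chunk
example_array <- array(0, dim = c(3, 4, 768))
dim(example_array)
#> [1]   3   4 768
# example_array[1, 2, ] would address the embedding of text 1, chunk 2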

With the embedded texts you now have the input to train a new classifier or to apply a pre-trained classifier for predicting categories/classes. In the next chapter we will show you how to use these classifiers. But before we start, we will show you how to save and load your model.

4.4 Saving and Loading Text Embedding Models

Saving a created text embedding model is very easy in aifeducation using the function save_to_disk. This function provides a unified interface for all text embedding models. For saving your work, you pass your model to object and the directory where to save the model to dir_path. With folder_name you determine the name of the folder that should be created in that directory to store the model.

save_to_disk(
  object = bert_modeling,
  dir_path = "C:/text_embedding_models",
  folder_name = "bert_model"
)

In this example the model is saved in a folder at the location C:/text_embedding_models/bert_model. If you want to load your model you can call load_from_disk.

bert_modeling <- load_from_disk("C:/text_embedding_models/bert_model")

4.5 Sustainability

In case the underlying model was trained with an active sustainability tracker (sections 3.2 and 3.3), you can receive a table showing the energy consumption, CO2 emissions, and hardware used during training by calling the method get_sustainability_data(). For our example this would be bert_modeling$get_sustainability_data().

5 Classifiers

5.1 Create a Classifier

Classifiers are built on top of a TextEmbeddingModel. You can create a new classifier by calling TEClassifierRegular$new(). The 'TE' in the class name refers to the idea that the classifier uses text embeddings instead of raw texts.

With the sample data from section 2.3 and the text embeddings from section 4.3, the creation of a new classifier may look like:

classifier <- TEClassifierRegular$new()
classifier$configure(
  name = "movie_review_classifier",
  label = "Classifier for Estimating a Postive or Negative Rating of Movie Reviews",
  text_embeddings = review_embeddings,
  feature_extractor = NULL,
  target_levels = c("neg", "pos"),
  dense_layers = 2,
  dense_size = 5,
  rec_layers = 2,
  rec_size = 10,
  rec_type = "gru",
  rec_bidirectional = FALSE,
  self_attention_heads = 0,
  intermediate_size = NULL,
  attention_type = "fourier",
  add_pos_embedding = FALSE,
  rec_dropout = 0.5,
  repeat_encoder = 0,
  dense_dropout = 0.2,
  recurrent_dropout = 0.6,
  encoder_dropout = 0.1,
  optimizer = "adam"
)

Similarly to the text embedding model, you should provide a name (name) and a label (label) for your new classifier. With text_embeddings you have to provide a LargeDataSetForTextEmbeddings. The data set is created with a TextEmbeddingModel as described in section 4. Here, we continue our example and use the embeddings produced by our BERT model.

target_levels takes the categories/classes your classifier should predict. These can be numbers or even words.

In case you would like to use ordinal data, it is very important that you provide the classes/categories in the correct order: classes/categories representing a "higher" level must be stated before categories/classes with a lower level. If you provide the wrong order, the performance indices will not be valid. In case of nominal data, the order does not matter.
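
For example, a hypothetical ordinal performance scale could be passed as follows:

# "higher" categories are stated before "lower" ones
target_levels = c("good", "average", "poor")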

With feature_extractor you can add a feature extractor that tries to reduce the number of features of your text embedding before passing the embeddings to the classifier. You can read more on this in Section 6.2.

With the other parameters you decide about the structure of your classifier. Figure 4 illustrates this.

Figure 4: Overview of Possible Structure of a Classifier

dense_layers determines the number of dense layers, and dense_size determines the number of neurons for all dense layers. In our example, there are two dense layers with 5 neurons each. rec_layers determines the number of recurrent layers, while rec_size determines the size of all recurrent layers. In this example, we use two layers with 10 neurons each. With rec_type you can choose between two types of recurrent layers: rec_type="gru" implements a Gated Recurrent Unit (GRU) network and rec_type="lstm" implements a Long Short-Term Memory network. With rec_bidirectional you decide whether the recurrent layers should be unidirectional or bidirectional.

Since the classifiers in aifeducation use a standardized scheme for their creation, dense layers are placed after the recurrent layers. If you want to omit recurrent layers or dense layers, set the corresponding argument for the number of layers to 0 (dense_layers = 0, rec_layers = 0).

If you use a text embedding model that processes more than one chunk, we recommend using recurrent layers, since they make use of the sequential structure of your data. In all other cases you can rely on dense layers only.

If you use text embeddings with more than one chunk, you can add self-attention layers in order to take the context of all chunks into account. To add self-attention you have two choices:

  • You can use the attention mechanism known from classic transformer models, i.e., multi-head attention (Vaswani et al. 2017). For this variant, you have to set attention_type="multihead", repeat_encoder to a value of at least 1, and self_attention_heads to a value of at least 1.

  • Furthermore, you can use the attention mechanism described in Lee-Thorp et al. (2021) of the FNet model, which allows much faster computation at low accuracy costs. To use this kind of attention, you have to set attention_type="fourier" and repeat_encoder to a value of at least 1.

With repeat_encoder you can choose how many times an encoder layer should be added. The encoder is implemented as described by Chollet, Kalinowski, and Allaire (2022, pp. 373) for both variants of attention. In our example we have only 300 cases altogether and only 4 chunks. Thus, we do not use any encoder layers.

You can further extend the abilities of your network by adding positional embeddings. Positional embeddings take care of the order of your chunks. Thus, adding such a layer may increase performance if the order of information is important. You can add this layer by setting add_pos_embedding=TRUE. The layer is created as described by Chollet, Kalinowski, and Allaire (2022, pp. 378).

The vignette 04 Model configuration provides details on how to configure a classifier.

Masking, normalization, and the creation of the input layer as well as the output layer are done automatically.

After you have created a new classifier, you can begin training.

5.2 Training a Classifier

To start the training of your classifier, you have to call the train method. As for the creation of the classifier, you must provide the text embeddings to data_embeddings and the categories/classes as target data to data_targets. Please remember that data_targets expects a named factor where the names correspond to the IDs of the corresponding text embeddings. Text embeddings and target data that cannot be matched are omitted from training.

To train a classifier, it is necessary that you provide a path to dir_checkpoint. This directory stores the best set of weights during each training epoch. After training, these weights are automatically used as final weights for the classifier.

For performance estimation, training splits the data into several folds based on cross-validation. The number of folds is set with data_folds. In every iteration, one fold is not used for training and serves as a test sample. The remaining data is used to create a training and a validation sample. The percentage of cases within each fold used as a validation sample is determined with data_val_size. The validation sample is used to determine the state of the model that generalizes best. All performance values saved in the trained classifier refer to the test sample. This data has never been used during training and provides a more realistic estimation of a classifier's performance.

classifier$train(
  data_embeddings = review_embeddings,
  data_targets = review_labels,
  data_folds = 10,
  data_val_size = 0.25,
  balance_class_weights = TRUE,
  balance_sequence_length = TRUE,
  use_sc = FALSE,
  sc_method = "dbsmote",
  sc_min_k = 1,
  sc_max_k = 10,
  use_pl = FALSE,
  pl_max_steps = 5,
  pl_max = 1.00,
  pl_anchor = 1.00,
  pl_min = 0.00,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  epochs = 300,
  batch_size = 32,
  dir_checkpoint = "training/classifier",
  trace = TRUE,
  ml_trace = 1
)

You can further modify the training process with different arguments. With balance_class_weights=TRUE, the class weights are adjusted according to the absolute frequencies of the classes/categories, following the 'Inverse Class Frequency' method. This option should be activated if you have to deal with imbalanced data.
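
The exact computation inside aifeducation may differ, but the following sketch shows one common variant of inverse class frequency weighting applied to our example data:

class_freq <- table(review_labels)  # neg: 75, pos: 150
class_weights <- sum(class_freq) / (length(class_freq) * class_freq)
# neg: 1.50, pos: 0.75 - the rare class receives the larger weight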

With balance_sequence_length=TRUE you can increase performance if you have to deal with texts that differ in their lengths with imbalanced frequencies. If this option is enabled, the loss is adjusted according to the absolute frequencies of your texts' lengths, again following the 'Inverse Class Frequency' method.

epochs determines the maximal number of epochs. During training, the model with the best balanced accuracy is saved and used.

batch_size sets the number of cases that should be processed simultaneously. Please adjust this value to your machine's capacities. Please note that the batch size can have an impact on the classifier's performance.

Since aifeducation tries to address the special needs of the educational and social sciences, some special training steps are integrated into this method:

  • Synthetic Cases: In case of imbalanced data, it is recommended to set use_sc=TRUE. Before training, a number of synthetic units is created via different techniques. Currently, you can request the Basic Synthetic Minority Oversampling Technique, the Density-Based Synthetic Minority Oversampling Technique, and the Adaptive Synthetic Sampling Approach for Imbalanced Learning. The aim is to create new cases that fill the gap to the majority class. Multi-class problems are reduced to a two-class problem (class under investigation vs. all others) for generating these units. If the technique allows setting the number of neighbors during generation, you can configure the data generation with sc_min_k and sc_max_k. The synthetic cases for every class are generated for all k between sc_min_k and sc_max_k. Every k contributes proportionally to the synthetic cases.

  • Pseudo-Labeling: This technique is relevant if you have labeled target data together with a large amount of unlabeled target data. With the different parameters starting with "pl_", you can configure the process of pseudo-labeling. The implementation of pseudo-labeling is based on Cascante-Bonilla et al. (2020). To apply pseudo-labeling, you have to set use_pl=TRUE. pl_max=1.00, pl_anchor=1.00, and pl_min=0.00 describe the certainty of a prediction, where 0 refers to random guessing and 1 refers to perfect certainty. pl_anchor is used as a reference value: the distance to pl_anchor is calculated for every case, and the cases are then sorted by increasing distance from pl_anchor (see the sketch after this list). The proportion of pseudo-labeled data added to the training increases with every step. The maximum number of steps is determined with pl_max_steps.
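
The following sketch illustrates the sorting logic behind pseudo-labeling with invented certainty values (our own illustration, not the package's internal code):

pl_anchor <- 1.00
certainty <- c(0.95, 0.60, 0.99, 0.75)  # hypothetical predicted certainties
distance <- abs(certainty - pl_anchor)
order(distance)  # cases sorted by increasing distance to pl_anchor
#> [1] 3 1 4 2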

Figure 5 illustrates the training loop for the case that all options are set to TRUE.

Figure 5: Overview of the Steps to Perform a Classification

The example above applies the generation of synthetic cases and the algorithm proposed by Cascante-Bonilla et al. (2020). For every fold, the training starts with generating synthetic cases to fill the gap between the minority classes and the majority class. After this, an initial training of the classifier starts. The trained classifier is used to predict pseudo-labels for the unlabeled part of the data, and the 20% of cases with the highest certainty for their pseudo-labels are added to the training data set. Now, new synthetic cases are generated based on both the labeled data and the newly added pseudo-labeled data. The classifier is re-initialized and trained again. After training, the classifier predicts the potential labels of all originally unlabeled data and adds the 40% of pseudo-labeled cases with the highest certainty to the training data. Again, new synthetic cases are generated on both the labeled and the added pseudo-labeled data. The model is re-initialized and trained again until the maximum number of steps for pseudo-labeling (pl_max_steps) is reached. After this, the algorithm is restarted for the next fold until the number of folds (data_folds) is reached. All of these steps are only used to estimate the performance of the classifier on data that is unknown to it.

The last phase of the training begins after the last fold. In the final training, the data set is split only into a training and validation set without a test set to provide the maximum amount of data for the best performance in final training.

In case options like the generation of synthetic cases (use_sc) or pseudo-labeling (use_pl) are disabled, the training process is shorter.

Since training a neural net is energy consuming, aifeducation allows you to estimate its ecological impact with the help of the python library codecarbon. Thus, sustain_track is set to TRUE by default. If you use the sustainability tracker, you must provide the alpha-3 code for the country where your computer is located (e.g., "CAN" = Canada, "DEU" = Germany). A list with the codes can be found on Wikipedia. The reason is that different countries use different sources and techniques for generating their energy, resulting in a specific impact on CO2 emissions. For the USA and Canada, you can additionally specify a region by setting sustain_region. Please refer to the documentation of codecarbon for more information.

Finally, trace and ml_trace allow you to control how much information about the training progress is printed to the console. Please note that training the classifier can take some time.

Please note that after performance estimation, the final training of the classifier makes use of all data available. That is, the test sample is left empty.

5.3 Evaluating a Classifier's Performance

After finishing training, you can evaluate the performance of the classifier. For every fold, the classifier is applied to the test sample and the results are compared to the true categories/classes. Since the test sample is never part of the training, all performance measures provide a more realistic idea of the classifier’s performance.

To support researchers in judging the quality of the predictions, aifeducation utilizes several measures and concepts from content analysis. These are

  • Iota Concept of the Second Generation (Berding & Pargmann 2022)
  • Krippendorff’s Alpha (Krippendorff 2019)
  • Percentage Agreement
  • Gwet’s AC1/AC2 (Gwet 2014)
  • Kendall’s coefficient of concordance W
  • Cohen’s Kappa unweighted
  • Cohen’s Kappa with equal weights
  • Cohen’s Kappa with squared weights
  • Fleiss’ Kappa for multiple raters without exact estimation

You can access the concrete values by accessing the field reliability, which stores all relevant information. In this list you will find the reliability values for every fold. In addition, the reliability of every step within pseudo-labeling is reported.

The central estimates for the reliability values can be found via reliability$test_metric_mean. In our example this would be:

classifier$reliability$test_metric_mean
#>              iota_index               min_iota2               avg_iota2 
#>               0.5606719               0.4584235               0.5869457 
#>               max_iota2               min_alpha               avg_alpha 
#>               0.7154678               0.5785714               0.7226190 
#>               max_alpha       static_iota_index      dynamic_iota_index 
#>               0.8666667               0.2620308               0.4736155 
#>          kalpha_nominal          kalpha_ordinal                 kendall 
#>               0.4654527               0.4654527               0.7369689 
#>       kappa2_unweighted   kappa2_equal_weighted kappa2_squared_weighted 
#>               0.4613610               0.4613610               0.4613610 
#>            kappa_fleiss    percentage_agreement       balanced_accuracy 
#>               0.4533283               0.7693676               0.7226190 
#>                 gwet_ac           avg_precision              avg_recall 
#>               0.5980910               0.7610960               0.7226190 
#>                  avg_f1 
#>               0.7266641

Of particular interest are the values for alpha from the Iota Concept, since they represent a measure of reliability which is independent of the frequency distribution of the classes/categories. The alpha values describe the probability that a case of a specific class is recognized as that specific class. Applying synthetic cases can increase the minimal value of alpha, reducing the risk of missing cases which belong to a rare class. On the contrary, the alpha values of the major category may decrease slightly, losing their unjustified bonus from a high number of cases in the training set. This provides a more realistic performance estimation of the classifier.

In addition, standard measures from machine learning are reported. These are

  • Precision
  • Recall
  • F1-Score

You can access these values as follows:

classifier$reliability$standard_measures_mean
#>     precision    recall        f1
#> neg 0.7155556 0.5785714 0.6209740
#> pos 0.8066364 0.8666667 0.8323543

Finally, you can plot a coding stream scheme showing how the cases of different classes are labeled. Here we use the package iotarelr.

library(iotarelr)
iotarelr::plot_iota2_alluvial(classifier$reliability$iota_object_end_free)

Figure 6: Coding Stream of the Classifier

Here you can see that a noticeable share of the negative reviews is treated as a positive review, while only a smaller share of the positive reviews is treated as a negative review. Thus, the data for the major class (positive reviews) is more reliable and valid than the data for the minor class (negative reviews).

Evaluating the performance of a classifier is a complex task and is beyond the scope of this vignette. Instead, we would like to refer to the cited literature on content analysis and machine learning if you would like to dive deeper into this topic.

5.4 Sustainability

In case the classifier was trained with an active sustainability tracker, you can receive information on sustainability by calling classifier$get_sustainability_data().

classifier$get_sustainability_data()
#> $sustainability_tracked
#> [1] TRUE
#> 
#> $date
#> [1] "Tue Oct  1 20:56:37 2024"
#> 
#> $sustainability_data
#> $sustainability_data$duration_sec
#> [1] 343.5135
#> 
#> $sustainability_data$co2eq_kg
#> [1] 0.0005406515
#> 
#> $sustainability_data$cpu_energy_kwh
#> [1] 0.001012826
#> 
#> $sustainability_data$gpu_energy_kwh
#> [1] 0
#> 
#> $sustainability_data$ram_energy_kwh
#> [1] 0.0004664782
#> 
#> $sustainability_data$total_energy_kwh
#> [1] 0.001479304
#> 
#> 
#> $technical
#> $technical$tracker
#> [1] "codecarbon"
#> 
#> $technical$py_package_version
#> [1] "2.3.4"
#> 
#> $technical$cpu_count
#> [1] 8
#> 
#> $technical$cpu_model
#> [1] "11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz"
#> 
#> $technical$gpu_count
#> [1] NA
#> 
#> $technical$gpu_model
#> [1] NA
#> 
#> $technical$ram_total_size
#> [1] 15.73279
#> 
#> 
#> $region
#> $region$country_name
#> [1] "Germany"
#> 
#> $region$country_iso_code
#> [1] "DEU"
#> 
#> $region$region
#> [1] NA

5.5 Saving and Loading a Classifier

Saving and loading follows the same pattern as for the other objects in aifeducation. You can save the classifier by calling save_to_disk. In our example this may be:

save_to_disk(
  object = classifier,
  dir_path = "C:/classifiers",
  folder_name = "imdb_movie_reviews"
)

The classifier is saved to C:/classifiers/imdb_movie_reviews. To load the model call load_from_disk.

classifier <- load_from_disk("C:/classifiers/imdb_movie_reviews")

5.6 Predicting New Data

If you would like to apply your classifier to new data, two steps are necessary. First, you must transform the raw texts into a numerical representation by using exactly the same text embedding model that was used to train your classifier (see section 4). In the case of our example classifier, we use our BERT model.

# If our model is not loaded
bert_modeling <- load_from_disk("C:/text_embedding_models/bert_model")

# Create a numerical representation of the text
review_embeddings <- bert_modeling$embed_large(
  large_datas_set = data_set_reviews_text,
  trace = TRUE
)

To transform raw texts into a numeric representation just pass the raw texts to the method embed_large of the loaded model. The raw texts should be an object of class LargeDataSetForText. To create such a data set, please refer to section 2.

In the example above, the text embeddings are stored in review_embeddings. Since embedding texts may take some time, it is a good idea to save the embeddings for future analysis (see section 2 for more details). This allows you to load the embeddings without the need to apply the text embedding model on the same raw texts again.

The resulting object can then be passed to the method predict of our classifier and you will get the predictions together with an estimate of certainty for each class/category.

# If your classifier is not loaded
classifier <- load_from_disk("C:/classifiers/imdb_movie_reviews")

# Predict the classes of new texts
predicted_categories <- classifier$predict(
  newdata = review_embeddings,
  batch_size = 8
)

After the classifier finishes the prediction, the estimated categories/classes are stored in predicted_categories. This object is a data.frame containing the texts' IDs in the rows and the probabilities of the different categories/classes in the columns. The last column, named expected_category, represents the category assigned to a text based on the highest probability.
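
For example, you can inspect the predictions and extract the assigned classes like this:

# Probabilities per class plus the assigned class in expected_category
head(predicted_categories)
assigned_classes <- predicted_categories$expected_category
table(assigned_classes)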

The estimates can be used in further analysis with common methods of the educational and social sciences such as correlation analysis, regression analysis, structural equation modeling, latent class analysis or analysis of variance.

Now you are ready to use aifeducation. In section 6 we describe further models for classification tasks and for improving model performance.

6 Extensions

6.1 Classifiers: ProtoNet

The classifier introduced in section 5 is a regular classifier which comes with the traditional challenges of deep learning, such as the need for a large amount of training data, expensive hardware requirements, and only a limited possibility to interpret the model's parameters (Jadon & Garg 2020, pp. 13-14). Since data is a bottleneck in the educational and social sciences, a classifier that can work with only small data sets would be preferable. These types of models are discussed in the literature under terms such as "meta-learning" (Zou 2023) or "few-shot learning" (Jadon & Garg 2020). The basic idea behind these approaches is that the model learns to use a supporting data set to predict the output for a query data set (e.g., Zou 2023, pp. 2-3). However, the model is not explicitly trained for the query data set.

One type of model within this area is the Prototypical Network (ProtoNet), initially proposed by Snell, Swersky, and Zemel (2017). This type of network was developed to create classifiers that are able to generalize to new classes that the model did not see during training, using only the information of a few examples of each class provided to the network (support data set). To achieve this goal, the network learns to create a prototype for every class in the support data set with the help of the examples for every class. Then, the network compares new data with these prototypes and assigns the class of the nearest prototype to the new data. Since the network calculates the distance of every new case to every prototype, it belongs to the metric-based meta-learning approaches (Zou 2023, pp. 48).
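
To illustrate the core idea with plain R (a minimal sketch with invented two-dimensional embeddings, not the package's implementation):

# Prototypes are the mean embedding of each class's support examples
support_neg <- matrix(c(0.1, 0.2, 0.0, 0.3, 0.2, 0.1), ncol = 2)
support_pos <- matrix(c(1.9, 2.1, 2.0, 1.8, 2.2, 2.0), ncol = 2)
proto_neg <- colMeans(support_neg)
proto_pos <- colMeans(support_pos)

# A new case receives the class of the nearest prototype
new_case <- c(1.8, 2.1)
distances <- c(
  neg = sqrt(sum((new_case - proto_neg)^2)),
  pos = sqrt(sum((new_case - proto_pos)^2))
)
names(which.min(distances))
#> [1] "pos"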

Since ProtoNet is a simple, easy-to-understand approach that provides good performance, several extensions have been suggested. aifeducation replaces the original loss function with the loss function suggested by Zhang et al. (2019) and adds the learnable metric described by Oreshkin, Rodriguez, and Lacoste (2018) to increase performance.

The implementation provided in aifeducation currently applies only to a fixed set of classes, and the prototypes are learned during training using all available training data. This will be extended in the future to allow users to select the support data set themselves.

The application of a classifier based on ProtoNet is similar to that of the regular classifier. The only difference is the argument embedding_dim. A ProtoNet classifier uses a network to project the similarities and differences between the single cases and all prototypes into an n-dimensional space: similar cases are located near each other, while different cases are located further apart. The number of dimensions of this space is determined by embedding_dim. If embedding_dim is set to 1, 2, or 3, the position of every case and of the prototypes can easily be visualized. For this example we use the same data as in section 5. Let us first create and configure the new classifier.

classifier <- TEClassifierProtoNet$new()
classifier$configure(
  name = "proto_net_movie_review_classifier",
  label = "ProtoNet classifier for Estimating a Postive or Negative Rating of Movie Reviews",
  text_embeddings = review_embeddings,
  feature_extractor = NULL,
  target_levels = c("neg", "pos"),
  hidden = c(5),
  rec = c(6, 6),
  rec_type = "gru",
  rec_bidirectional = FALSE,
  embedding_dim = 2,
  self_attention_heads = 0,
  intermediate_size = NULL,
  attention_type = "fourier",
  add_pos_embedding = TRUE,
  rec_dropout = 0.3,
  repeat_encoder = 0,
  dense_dropout = 0.4,
  recurrent_dropout = 0.4,
  encoder_dropout = 0.1,
  optimizer = "adam"
)

Now we can plot how the untrained classifier embeds the different cases and the prototypes. To create the corresponding plot, call the method plot_embeddings. The argument embeddings_q takes the embeddings of the cases as input. If you have the true classes for all or some of the cases, you can add them to the plot via the argument classes_q. The resulting plot is shown in the following figure.

plot_untrained <- classifier$plot_embeddings(
  embeddings_q = review_embeddings,
  classes_q = review_labels
)
plot_untrained

Figure 7: Embedding of an untrained classifier of type “ProtoNet”

The large triangles represent the prototypes for every class, while the dots refer to the labeled cases in the data set; for these, the color represents their true class. Unlabeled cases are drawn as squares, with the color indicating their estimated class. As you can see, all cases are located very similarly and there seems to be no clear structure. Let us see how this changes when we train the model.

classifier$train(
  data_embeddings = review_embeddings,
  data_targets = review_labels,
  data_folds = 5,
  data_val_size = 0.25,
  use_sc = TRUE,
  sc_method = "dbsmote",
  sc_min_k = 1,
  sc_max_k = 10,
  use_pl = TRUE,
  pl_max_steps = 5,
  pl_max = 1.00,
  pl_anchor = 1.00,
  pl_min = 0.00,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  epochs = 400,
  batch_size = 32,
  Ns = 2,
  Nq = 10,
  loss_alpha = 0.5,
  loss_margin = 0.5,
  sampling_separate = FALSE,
  sampling_shuffle = TRUE,
  dir_checkpoint = "training/classifier",
  trace = TRUE,
  ml_trace = 1
)

While the arguments for balancing the class weights (balance_class_weights) and the sequence length (balance_sequence_length) are not available, six new arguments are. With Ns you determine how many examples of every class are used in the support sample during training; these examples are used to calculate the prototype for every class. With Nq you determine how many examples of every class form the query sample, whose classes the network tries to predict correctly during training.

The arguments loss_alpha and loss_margin configure the loss function described by Zhang et al. (2019). loss_margin is the minimal distance that all examples of the query sample should keep from all prototypes that do not represent their class. loss_alpha determines whether the loss pays more attention to minimizing the distance between the examples and their corresponding prototype or to maximizing the distance to the prototypes that do not represent their class. If you set loss_alpha = 1, the loss only tries to minimize the distance of the examples to their corresponding prototype. If you set loss_alpha = 0, the loss only tries to maximize the distance of all examples to all prototypes that do not reflect their class.
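The following toy computation (illustrative only, not the package’s internal code) shows how loss_alpha weights the two objectives for a single query case:

# Illustrative only: loss_alpha weights pulling a case towards its own
# prototype against pushing it at least loss_margin away from wrong ones
d_correct <- 0.8      # distance to the case's own prototype
d_wrong <- 0.3        # distance to a prototype of another class
loss_alpha <- 0.5
loss_margin <- 0.5

loss <- loss_alpha * d_correct + (1 - loss_alpha) * max(0, loss_margin - d_wrong)
loss
#> [1] 0.5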

The next two important arguments refer to the sampling strategies during training. If you set sampling_separate = TRUE, the training data set is split into one data pool for the support sample and one data pool for the query sample; thus, a specific case can only ever serve as a support case or as a query case. If you set sampling_separate = FALSE, the support and query cases are drawn from the same pool, so a specific case can be a support case in one epoch and a query case in another. In this case it is ensured that a specific case never occurs as both support and query during the same training step and that every case occurs at most once within a training step. With sampling_shuffle you can request that a random sample is drawn from the training data set for every training step, resulting in different combinations of support and query cases. For training we highly recommend setting sampling_shuffle = TRUE, since this results in better-performing classifiers.
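A plain-R sketch of the difference, ignoring the per-class sampling for simplicity:

# Illustrative only: the two sampling strategies for one training step
case_ids <- 1:20

# sampling_separate = TRUE: fixed, disjoint pools for support and query
support_pool <- case_ids[1:10]
query_pool <- case_ids[11:20]

# sampling_separate = FALSE: support and query cases come from the same
# pool but never overlap within a single training step
step_cases <- sample(case_ids, size = 12)
support_step <- step_cases[1:2]    # cf. Ns (per class in real training)
query_step <- step_cases[3:12]     # cf. Nq (per class in real training)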

After training we can request a visualization of the data again. We first omit all unlabeled cases by setting inc_unlabeled=FALSE in order to get an impression of the quality of training.

plot_trained_1 <- classifier$plot_embeddings(
  embeddings_q = review_embeddings,
  classes_q = review_labels,
  inc_unlabeled = FALSE
)
plot_trained_1

As shown in the figure, all cases are now well separated. Cases of the class “neg” are located close to the prototype for “neg”, while cases of the class “pos” are located near the prototype for “pos”. Since we use the same data as during training, this result is to be expected. Only a small number of cases is located near the wrong prototype, which can be seen when a red dot lies close to the prototype for “pos” or a green dot lies close to the prototype for “neg”.

Figure 8: Embedding of a trained classifier of type “ProtoNet” without unlabeled cases

Let us now add the unlabeled cases to the plot by setting inc_unlabeled=TRUE.

plot_trained_2 <- classifier$plot_embeddings(
  embeddings_q = review_embeddings,
  classes_q = review_labels,
  inc_unlabeled = TRUE
)
plot_trained_2

As the following figure shows, the model estimates the class of these cases according to their distance to the two prototypes. Cases that are close to the prototype for “pos” are assigned to “pos”, while cases near the prototype for “neg” are assigned to “neg”.

Figure 9: Embedding of a trained classifier of type “ProtoNet” including unlabeled cases

Finally, let us report the reliability of this classifier.

classifier$reliability$test_metric_mean
#>              iota_index               min_iota2               avg_iota2 
#>               0.4375494               0.3161485               0.4643332 
#>               max_iota2               min_alpha               avg_alpha 
#>               0.6125178               0.4013095               0.6123214 
#>               max_alpha       static_iota_index      dynamic_iota_index 
#>               0.8233333               0.1946912               0.3727251 
#>          kalpha_nominal          kalpha_ordinal                 kendall 
#>               0.2297705               0.2297705               0.6241177 
#>       kappa2_unweighted   kappa2_equal_weighted kappa2_squared_weighted 
#>               0.2345552               0.2345552               0.2345552 
#>            kappa_fleiss    percentage_agreement       balanced_accuracy 
#>               0.2122470               0.6717391               0.6123214 
#>                 gwet_ac           avg_precision              avg_recall 
#>               0.4255430               0.6485922               0.6123214 
#>                  avg_f1 
#>               0.6061235

6.2 Feature Extractors

Another option to increase a model’s performance and/or its computational speed is to apply a feature extractor. For example, the work by Ganesh et al. (2021) indicates that a reduction of the hidden size can increase a model’s accuracy. In aifeducation, a feature extractor is a model that tries to reduce the number of features of given text embeddings before they are fed as input to a classifier.

The feature extractors implemented in aifeducation are auto-encoders that support sequential data and sequences of different lengths. The basic architecture of all extractors is shown in the following figure.

Figure 10: Basic architecture of feature extractors

The learning objective of a feature extractor is first to compress the information by reducing the number of features to the number of features of the latent space (Frochte 2019, p. 281). In the figure above, this means reducing the number of features from 8 to 4 while storing as much of the information from the 8 dimensions as possible in only 4 dimensions. In the next step, the extractor tries to reconstruct the original information from the compressed representation in the latent space (Frochte 2019, pp. 280-281), extending the information from 4 dimensions back to 8. After training, the hidden representation of the latent space is used as a compressed version of the original input.
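A plain-R illustration of these shapes (random weights, illustrative only):

# Illustrative only: compression from 8 features to a 4-dimensional
# latent space and reconstruction back to 8 features
set.seed(42)
x <- matrix(rnorm(3 * 8), nrow = 3)          # 3 cases with 8 features

W_encode <- matrix(rnorm(8 * 4), nrow = 8)   # encoder weights: 8 -> 4
W_decode <- matrix(rnorm(4 * 8), nrow = 4)   # decoder weights: 4 -> 8

latent <- x %*% W_encode                     # compressed representation
reconstruction <- latent %*% W_decode        # reconstructed input

dim(latent)
#> [1] 3 4
dim(reconstruction)
#> [1] 3 8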

You can create a feature extractor as follows.

feature_extractor <- TEFeatureExtractor$new()
feature_extractor$configure(
  name = "feature_extractor_bert_movie_reviews",
  label = "Feature extractor for Text Embeddings via BERT",
  text_embeddings = review_embeddings,
  features = 128,
  method = "lstm",
  noise_factor = 0.2,
  optimizer = "adam"
)

As with the other models, you can use name for the model’s name and label for the model’s label. The argument text_embeddings takes an object of class EmbeddedText or LargeDataSetForTextEmbeddings. With this object you connect your feature extractor to a specific TextEmbeddingModel; that is, the feature extractor works only with embeddings from exactly that TextEmbeddingModel.

features determines the number of features of the compressed representation. The lower the number, the stronger the requested compression. This value corresponds to the number of features of the latent space in the figure above.

With method you determine the type of layers the feature extractor should use. If you set method = "lstm", all layers of the model are long short-term memory layers. If you set method = "dense", all layers are standard dense layers.

Independently of your choice, all models try to generate the latent space such that the covariance of the features is zero; thus, every feature represents unique information. In addition, all methods except "lstm" use an orthogonal parameterization to prevent over-fitting and apply parameter sharing, i.e., mirrored encoder and decoder layers use the same parameters. For more details, please refer to Ranjan (2019).

With noise_factor you can add some noise during training, turning the feature extractor into a denoising auto-encoder, which can yield more robust generalization.
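As a plain-R illustration (not the package’s internal code), the corruption step conceptually looks like this:

# Illustrative only: corrupt the input with Gaussian noise scaled by
# noise_factor; the auto-encoder is trained to reconstruct the clean input
noise_factor <- 0.2
x_clean <- matrix(rnorm(3 * 8), nrow = 3)
x_noisy <- x_clean + noise_factor * matrix(rnorm(length(x_clean)), nrow = 3)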

Training the extractor is identical to training the other models in aifeducation. Please note that the text embeddings provided to data_embeddings must be generated with the same TextEmbeddingModel as the embeddings provided during the configuration of your model.

feature_extractor$train(
  data_embeddings = review_embeddings,
  data_val_size = 0.25,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",
  sustain_region = NULL,
  sustain_interval = 15,
  epochs = 40,
  batch_size = 32,
  dir_checkpoint = "training/feature_extractor",
  trace = TRUE,
  ml_trace = 1
)

After you have trained your feature extractor, you can use it for every classifier. Just pass the feature extractor to feature_extractor during configuration of the classifier. For the classifier described in section 5 this would look like:

classifier <- TEClassifierRegular$new()
classifier$configure(
  name = "movie_review_classifier",
  label = "Classifier for Estimating a Postive or Negative Rating of Movie Reviews",
  text_embeddings = review_embeddings,
  feature_extractor = feature_extractor,
  target_levels = c("neg", "pos"),
  hidden = c(5),
  rec = c(6, 6),
  rec_type = "gru",
  rec_bidirectional = FALSE,
  self_attention_heads = 0,
  intermediate_size = NULL,
  attention_type = "fourier",
  add_pos_embedding = TRUE,
  rec_dropout = 0.1,
  repeat_encoder = 0,
  dense_dropout = 0.4,
  recurrent_dropout = 0.4,
  encoder_dropout = 0.1,
  optimizer = "adam"
)

That is all. You can now use and train the classifier in the same way as without a feature extractor. The feature extractor is even saved and loaded automatically together with the classifier.
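As a minimal sketch (with the same assumptions about save_to_disk() and load_from_disk() as above; the paths are placeholders):

# Save the classifier; the attached feature extractor is stored with it
save_to_disk(
  object = classifier,
  dir_path = "C:/classifiers",
  folder_name = "imdb_movie_reviews_with_extractor"
)

# Loading restores the classifier and its feature extractor together
classifier <- load_from_disk("C:/classifiers/imdb_movie_reviews_with_extractor")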

References

Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. https://doi.org/10.48550/arXiv.2004.05150

Berding, F., & Pargmann, J. (2022). Iota Reliability Concept of the Second Generation. Berlin: Logos. https://doi.org/10.30819/5581

Berding, F., Riebenbauer, E., Stütz, S., Jahncke, H., Slopinski, A., & Rebmann, K. (2022). Performance and Configuration of Artificial Intelligence in Educational Settings: Introducing a New Reliability Concept Based on Content Analysis. Frontiers in Education, 1–21. https://doi.org/10.3389/feduc.2022.818365

Campesato, O. (2021). Natural Language Processing Fundamentals for Developers. Mercury Learning & Information. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6647713

Cascante-Bonilla, P., Tan, F., Qi, Y. & Ordonez, V. (2020). Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning. https://doi.org/10.48550/arXiv.2001.06001

Chollet, F., Kalinowski, T., & Allaire, J. J. (2022). Deep learning with R (Second edition). Manning Publications Co. https://learning.oreilly.com/library/view/-/9781633439849/?ar

Dai, Z., Lai, G., Yang, Y. & Le, Q. V. (2020). Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. https://doi.org/10.48550/arXiv.2006.03236

Devlin, J., Chang, M.‑W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

Frochte, J. (2019). Maschinelles Lernen: Grundlagen und Algorithmen in Python (2., aktualisierte Auflage). Hanser.

Ganesh, P., Chen, Y., Lou, X., Khan, M. A., Yang, Y., Sajjad, H., Nakov, P., Chen, D., & Winslett, M. (2021). Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. Transactions of the Association for Computational Linguistics, 9, 1061–1080. https://doi.org/10.1162/tacl_a_00413

Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (Fourth edition). Gaithersburg: STATAXIS.

He, P., Liu, X., Gao, J. & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. https://doi.org/10.48550/arXiv.2006.03654

Jadon, S., & Garg, A. (2020). Hands-On One-shot Learning with Python: Learn to Implement Fast and Accurate Deep Learning Models with Fewer Training Samples Using Pytorch. Packt Publishing Limited. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6175328

Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology (4th ed.). Los Angeles: SAGE.

Lane, H., Howard, C., & Hapke, H. M. (2019). Natural language processing in action: Understanding, analyzing, and generating text with Python. Shelter Island: Manning.

Larusson, J. A., & White, B. (Eds.). (2014). Learning Analytics: From Research to Practice. New York: Springer. https://doi.org/10.1007/978-1-4614-3305-7

Lee, D.‑H. (2013). Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. ICML 2013 Workshop: Challenges in Representation Learning.

Lee-Thorp, J., Ainslie, J., Eckstein, I. & Ontanon, S. (2021). FNet: Mixing Tokens with Fourier Transforms. https://doi.org/10.48550/arXiv.2105.03824

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://doi.org/10.48550/arXiv.1907.11692

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In D. Lin, Y. Matsumoto, & R. Mihalcea (Eds.), Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 142–150). Association for Computational Linguistics. https://aclanthology.org/P11-1015

Oreshkin, B. N., Rodriguez, P., & Lacoste, A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. Advance online publication. https://doi.org/10.48550/arXiv.1805.10123

Papilloud, C., & Hinneburg, A. (2018). Qualitative Textanalyse mit Topic-Modellen: Eine Einführung für Sozialwissenschaftler. Wiesbaden: Springer. https://doi.org/10.1007/978-3-658-21980-2

Pappagari, R., Zelasko, P., Villalba, J., Carmiel, Y., & Dehak, N. (2019). Hierarchical Transformers for Long Document Classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 838–844). IEEE. https://doi.org/10.1109/ASRU46091.2019.9003958

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. https://aclanthology.org/D14-1162.pdf

Ranjan, C. (2019). Build the right Autoencoder — Tune and Optimize using PCA principles: Part I. https://towardsdatascience.com/build-the-right-autoencoder-tune-and-optimize-using-pca-principles-part-i-1f01f821999b

Schreier, M. (2012). Qualitative Content Analysis in Practice. Los Angeles: SAGE.

Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. https://doi.org/10.48550/arXiv.1703.05175

Song, K., Tan, X., Qin, T., Lu, J. & Liu, T.‑Y. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. https://doi.org/10.48550/arXiv.2004.09297

Tunstall, L., Werra, L. von, Wolf, T., & Géron, A. (2022). Natural language processing with transformers: Building language applications with hugging face (Revised edition). Heidelberg: O’Reilly.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. https://doi.org/10.48550/arXiv.1706.03762

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. https://doi.org/10.48550/arXiv.1609.08144

Zhang, X., Nie, J., Zong, L., Yu, H., & Liang, W. (2019). One Shot Learning with Margin. In Q. Yang, Z.-H. Zhou, Z. Gong, M.-L. Zhang, & S.-J. Huang (Eds.), Lecture Notes in Computer Science. Advances in Knowledge Discovery and Data Mining (Vol. 11440, pp. 305–317). Springer International Publishing. https://doi.org/10.1007/978-3-030-16145-3_24

Zhou, L. (2023). Meta-Learning: Theory, Algorithms and Applications. Elsevier Science & Technology. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=7134465