03 Using R syntax
Florian Berding, Yuliia Tykhonova, Julia Pargmann, Andreas Slopinski, Elisabeth Riebenbauer, Karin Rebmann
Source: vignettes/classification_tasks.Rmd
1 Introduction and Overview
1.1 Preface
This vignette introduces the package aifeducation and its usage with R syntax. For users who are unfamiliar with R or who do not have coding skills in relevant languages (e.g., python), we recommend starting with the graphical user interface Aifeducation - Studio, which is described in the vignette 02 Using the graphical user interface Aifeducation - Studio.
We assume that aifeducation is installed as described in vignette 01 Get Started. The introduction starts with a brief explanation of basic concepts, which are necessary to work with this package.
1.2 Basic Concepts
In the educational and social sciences, assigning scientific concepts to an observation is an important task that allows researchers to understand an observation, to generate new insights, and to derive recommendations for research and practice.
In educational science, several areas deal with this kind of task. For example, diagnosing students’ characteristics is an important aspect of a teacher’s profession and necessary to understand and promote learning. Another example is the use of learning analytics, where data about students is used to provide learning environments adapted to their individual needs. On another level, educational institutions such as schools and universities can use this information for data-driven performance decisions (Laurusson & White 2014) as well as for decisions about where and how to improve. In any case, a real-world observation is aligned with scientific models to use scientific knowledge as a technology for improved learning and instruction.
Supervised machine learning is one concept that allows a link between real-world observations and existing scientific models and theories (Berding et al. 2022). For educational science, this is a great advantage because it allows researchers to use existing knowledge and insights to apply AI. The drawback of this approach is that the training of AI requires both information about the real-world observations and information on the corresponding alignment with scientific models and theories.
A valuable source of data in educational science is written text, since textual data can be found almost everywhere in the realm of learning and teaching (Berding et al. 2022). For example, teachers often require students to solve a task which they provide in written form. Students have to create a solution for the task, which they often document in a short written essay or a presentation. This data can be used to analyze learning and teaching. Teachers’ written tasks for their students may provide insights into the quality of instruction, while students’ solutions may provide insights into their learning outcomes and prerequisites.
AI can be a helpful assistant in analyzing textual data since the analysis of textual data is a challenging and time-consuming task for humans.
Please note that an introduction to content analysis, natural language processing or machine learning is beyond the scope of this vignette. If you would like to learn more, please refer to the cited literature.
Before we start, it is necessary to clarify our understanding of some basic concepts, since applying AI to educational contexts means combining the knowledge of different scientific disciplines that use different, sometimes overlapping, concepts. Even within a single research area, concepts are not unified. Figure 1 illustrates this package’s understanding.
Since aifeducation looks at the application of AI for classification tasks from the perspective of the empirical method of content analysis, there is some overlap between the concepts of content analysis and machine learning. In content analysis, a phenomenon like performance or colors can be described as a scale/dimension which is made up of several categories (e.g. Schreier 2012, pp. 59). In our example, an exam’s performance (scale/dimension) could be “good”, “average” or “poor”. In terms of colors (scale/dimension), categories could be “blue”, “green”, etc. Machine learning literature uses other words to describe this kind of data. In machine learning, “scale” and “dimension” correspond to the term “label” while “categories” refer to the term “classes” (Chollet, Kalinowski & Allaire 2022, p. 114).
With these clarifications, classification means that a text is assigned to the correct category of a scale or, respectively, that the text is labeled with the correct class. As Figure 2 illustrates, two kinds of data are necessary to train an AI to classify text in line with supervised machine learning principles.
By providing AI with both the textual data as input data and the corresponding information about the class as target data, AI can learn which texts imply a specific class or category. In the above exam example, AI can learn which texts imply a “good”, an “average” or a “poor” judgment. After training, AI can be applied to new texts and predict the most likely class of every new text. The generated class can be used for further statistical analysis or to derive recommendations about learning and teaching.
In use cases as described in this vignette, AI has to “understand” natural language: “Natural language processing is an area of research in computer science and artificial intelligence (AI) concerned with processing natural languages such as English and Mandarin. This processing generally involves translating natural language into data (numbers) that a computer can use to learn about the world. (…)” (Lane, Howard & Hapke 2019, p. 4)
Thus, the first step is to transform raw texts into a form that is usable by a computer; that is, raw texts must be transformed into numbers. In modern approaches, this is usually done through word embeddings. Campesato (2021, p. 102) describes them as “the collective name for a set of language modeling and feature learning techniques (…) where words or phrases from the vocabulary are mapped to vectors of real numbers.” The definition of a word vector is similar: “Word vectors represent the semantic meaning of words as vectors in the context of the training corpus.” (Lane, Howard & Hapke 2019, p. 191). In the next step, the word or text embeddings can be used as input data and the labels as target data when training AI to classify a text.
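To make the idea of mapping words to vectors of real numbers more concrete, here is a purely illustrative toy example in R (hand-made values, not produced by any model or by aifeducation): three words represented as three-dimensional vectors, where semantically similar words receive similar vectors.

# Toy illustration only: hand-made vectors, not learned by a model
word_vectors <- matrix(
c(0.21, -0.53, 0.88,
0.19, -0.50, 0.91,
-0.72, 0.33, 0.05),
nrow = 3,
byrow = TRUE,
dimnames = list(c("teacher", "instructor", "banana"), c("dim_1", "dim_2", "dim_3"))
)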
In aifeducation, these steps are covered with three different types of models, as shown in Figure 3.
Base Models: The base models contain the capacities to understand natural language. In general, these are transformers such as BERT, RoBERTa, etc. A huge number of pre-trained models can be found on Hugging Face.
Text Embedding Models: These models are built on top of base models and store the directions for using these base models to convert raw texts into sequences of numbers. Please note that the same base model can be used to create different text embedding models.
Classifiers: Classifiers are used on top of a text embedding model. They are used to classify a text into categories/classes based on the numeric representation provided by the corresponding text embedding model. Please note that a text embedding model can be used to create different classifiers (e.g. one classifier for colors, one classifier to estimate the quality of a text, etc.).
2 Start Working
2.1 Starting a New Session
Before you can work with aifeducation, you must set up a new R session. First, it is necessary that you set up python via ‘reticulate’ and choose the conda environment where all necessary python libraries are available. Second, you can load aifeducation.

In case you installed python as suggested in vignette 01 Get started, you may start a new session like this:
reticulate::use_condaenv(condaenv = "aifeducation")
library(aifeducation)
Note: Please remember: Every time you start a new session in R, you have to set the correct conda environment and load the library aifeducation.
2.2 Data Management
2.2.1 Introduction
In the context of use cases for aifeducation, three different types of data are necessary: raw texts, text embeddings, and target data which represent the categories/classes of a text.
To deal with the first two types and to allow the use of large data sets that may not fit into the memory of your machine, the package ships with two specialized objects.
The first is LargeDataSetForText. Objects of this class are used to read raw texts from .txt, .pdf, and .xlsx files and store them for further computations. The second is LargeDataSetForTextEmbeddings, which is used to store the text embeddings of raw texts generated with TextEmbeddingModels. We will describe the transformation of raw texts into text embeddings later.
2.2.2 Raw Texts
The creation of a LargeDataSetForText is necessary if you would like to create or train a base model or to generate text embeddings. In case you would like to create such a data set for the first time, you have to call the method:
raw_texts <- LargeDataSetForText$new()
Now you have an empty data set. To fill this object with raw texts different methods are available depending on the file type you use for storing raw texts.
.txt files
The first alternative is to store raw texts in .txt files. To use these you have to structure your data in a specific way:
- Create a main folder for storing your data.
- Store every raw text/document in a single .txt file within its own folder inside the main folder. Every folder should contain only one file for a raw text/document.
- Add an additional .txt file to the folder named bib_entry.txt. This file contains bibliographic information for the raw text.
- Add an additional .txt file to the folder named license.txt which contains a short statement on the license of the text such as “CC BY”.
- Add an additional .txt file to the folder named url_license.txt which contains the URL/link to the license’s text such as “https://creativecommons.org/licenses/by/4.0/”.
- Add an additional .txt file to the folder named text_license.txt which contains the full license text.
- Add an additional .txt file to the folder named url_source.txt which contains the URL/link to the source of the text on the internet.
Applying these rules may result in a data structure as follows:
- Folder “main folder”
- Folder Text A
- text_a.txt
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
- Folder Text B
- text_b.txt
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
- Folder Text C
- text_c.txt
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
Now you can call the method add_from_files_txt by passing the path to the directory of the main folder to dir_path.
raw_texts$add_from_files_txt(
dir_path = "main folder"
)
The data set will now read all the raw texts in the main folder and will assign every text the corresponding bib entry and license. Please note that adding bib_entry.txt, license.txt, url_license.txt, text_license.txt, and url_source.txt to every folder is optional. If there is no such file in the corresponding folder, there will be an empty entry in the data set. However, against the backdrop of the European AI Act, we recommend providing both the license and the bibliographic information to make the documentation of your models more straightforward. Furthermore, some licenses such as those provided by Creative Commons require statements about the creators, a copyright note, a URL or link to the source material (if possible), the license of the material, and a URL or link to the license’s text on the internet or the license text itself. Please check the licenses of the material you are using for the requirements.
.pdf files
The second alternative is to use .pdf files as a source for raw texts. Here, the necessary structure is similar to .txt files:
- Create a main folder for storing your data.
- Store every raw text/document in a single .pdf file within its own folder inside the main folder. Every folder should contain only one file for a raw text/document.
- Add an additional .txt file to the folder named bib_entry.txt. This file contains bibliographic information for the raw text.
- Add an additional .txt file to the folder named license.txt which contains a short statement on the license of the text such as “CC BY”.
- Add an additional .txt file to the folder named url_license.txt which contains the URL/link to the license text such as “https://creativecommons.org/licenses/by/4.0/”.
- Add an additional .txt file to the folder named text_license.txt which contains the full license text.
- Add an additional .txt file to the folder named url_source.txt which contains the URL/link to the source of the text on the internet.
Applying these rules may result in a data structure as follows:
- Folder “main folder”
- Folder Text A
- text_a.pdf
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
- Folder Text B
- text_b.pdf
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
- Folder Text C
- text_c.pdf
- bib_entry.txt
- license.txt
- url_license.txt
- text_license.txt
- url_source.txt
Please note that all files except the raw text file must be .txt files, not .pdf files.
Now you can call the method add_from_files_pdf by passing the path to the directory of the main folder to dir_path.
raw_texts$add_from_files_pdf(
dir_path = "main folder"
)
As stated above, bib_entry.txt, license.txt, url_license.txt, text_license.txt, and url_source.txt are optional.
.xlsx files
The third alternative is to store the raw texts in .xlsx files. This alternative is useful if you have many small raw texts. For raw texts that are very large, such as books or papers, we recommend storing them as .txt or .pdf files.
In order to add raw texts from .xlsx files, the files need a special structure:
- Create a main folder for storing all .xlsx files you would like to read.
- All .xlsx files must contain the names of the columns in the first row, and the names must be identical for each column across all .xlsx files you would like to read.
- Every .xlsx file must contain a column storing the text ID and a column storing the raw text. Every text must have a unique ID across all .xlsx files.
- Every .xlsx file can contain an additional column for the bib entry.
- Every .xlsx file can contain an additional column for the license.
- Every .xlsx file can contain an additional column for the license’s URL.
- Every .xlsx file can contain an additional column for the license text.
- Every .xlsx file can contain an additional column for the source’s URL.
Your .xlsx file may look like this:

| id | text | bib | license | url_license | text_license | url_source |
|----|------|-----|---------|-------------|--------------|------------|
| z3 | This is an example. | Author (2019) | CC BY | Example URL | Text | Example URL |
| a3 | This is a second example. | Author (2022) | CC BY | Example URL | Text | Example URL |
| … | … | … | … | … | … | … |
Now you can call the method add_from_files_xlsx by passing the path to the directory of the main folder to dir_path. Please do not forget to specify the column names for the ID and text columns as well as for the bibliographic and license information.
raw_texts$add_from_files_xlsx(
dir_path = "main folder",
id_column = "id",
text_column = "text",
bib_entry_column = "bib_entry",
license_column = "license",
url_license_column = "url_license",
text_license_column = "text_license",
url_source_column = "url_source"
)
Saving and loading a data set
Once you have created a LargeDataSetForText, you can save your data to disk by calling the function save_to_disk. In our example the code would be:
save_to_disk(
object = raw_texts,
dir_path = "C:/",
folder_name = "raw_texts"
)
The argument object requires the object you would like to save. In our case this is raw_texts. With dir_path you specify the location where to save the object, and with folder_name you define the name of the folder that will be created within that directory. In this folder the data set is saved.
To load an existing data set, you can call the function load_from_disk with the directory path where you stored the data. In our case this would be:
raw_text_dataset <- load_from_disk("C:/raw_texts")
Now you can work with your data.
2.2.3 Text Embeddings
The numerical representations of raw texts (called text embeddings) are stored in objects of class LargeDataSetForTextEmbeddings. These kinds of data sets are generated by models such as TextEmbeddingModels. Thus, you will never need to create such a data set manually.

However, you will need this kind of data set to train a classifier or to predict the categories/classes of raw texts. Thus, it may be advantageous to save already transformed data. You can save and load an object of this class with the functions save_to_disk and load_from_disk.
Let us assume that we have a LargeDataSetForTextEmbeddings called text_embeddings. Saving this object may look like:
save_to_disk(
object = text_embeddings,
dir_path = "C:/",
folder_name = "text_embeddings"
)
The data set will be saved at C:/text_embeddings. Loading this data set may look like:
new_text_embeddings <- load_from_disk("C:/text_embeddings")
2.2.4 Target Data
The last type of data necessary for working with aifeducation is the categories/classes of given raw texts. For this kind of data we currently do not provide a special object. You just need a named factor storing the classes/categories for a dimension. It is also important that the names equal the IDs of the corresponding raw texts/text embeddings, since matching the classes/categories to texts is done with the help of these names. Saving and loading can be done with R’s functions save and load.
2.3 Example Data for this Vignette
To illustrate the steps in this vignette, we cannot use data from educational settings since this data is generally protected by privacy policies. Therefore, we use a subset of the Stanford Movie Review Dataset provided by Maas et al. (2011), which is part of the package. You can access the data set with imdb_movie_reviews.
We now have a data set with three columns. The first column contains the raw text, the second contains the rating of the movie (positive or negative), and the third contains the ID of the movie review. About 200 reviews imply a positive rating of a movie and about 100 imply a negative rating.

For this tutorial, we modify this data set by setting about 50 positive and 25 negative reviews to NA, indicating that these reviews are not labeled.
example_data <- imdb_movie_reviews
example_data$label <- as.character(example_data$label)
example_data$label[c(76:100)] <- NA
example_data$label[c(201:250)] <- NA
example_targets <- as.factor(example_data$label)
table(example_data$label)
#>
#> neg pos
#> 75 150
We will now create a LargeDataSetForText from this data.frame. Before we can do this, we must ensure that the data.frame has all necessary columns:
colnames(example_data)
#> [1] "text" "label" "id"
Now we have to add two columns. For this tutorial we do not add any bibliographic or license information although this is recommended in practice.
example_data$bib_entry <- NA
example_data$license <- NA
colnames(example_data)
#> [1] "text" "label" "id" "bib_entry" "license"
Now the data.frame is ready as input for our data set. The “label” column will not be included in this data set.
data_set_reviews_text <- LargeDataSetForText$new()
data_set_reviews_text$add_from_data.frame(example_data)
We save the categories/labels within a separate factor.
review_labels <- example_data$label
names(review_labels) <- example_data$id
We will now use this data to show you how to use the different objects and functions in aifeducation.
3 Base Models
3.1 Overview
Base models are the foundation of all further models in aifeducation. At the moment, these are transformer models such as MPNet, BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), DeBERTa version 2 (He et al. 2020), Funnel-Transformer (Dai et al. 2020), and Longformer (Beltagy, Peters & Cohan 2020). In general, these models are first trained on a large corpus of general texts. In the next step, the models are fine-tuned on domain-specific texts and/or for specific tasks. Since the creation of base models requires a huge number of texts, resulting in high computational time, it is recommended to use pre-trained models. These can be found on Hugging Face. Sometimes, however, it is more straightforward to create a new model to fit a specific purpose. aifeducation supports the option to both create and train/fine-tune base models.
3.2 Creation of Base Models
Every transformer model is composed of two parts: 1) the tokenizer which splits raw texts into smaller pieces to model a large number of words with a limited, small number of tokens and 2) the neural network that is used to model the capabilities for understanding natural language.
At the beginning you can choose between the different supported transformer architectures. Depending on the architecture, you have different options determining the shape of your neural network. For this vignette we use a BERT (Devlin et al. 2019) model, which can be created with the create method of the transformer class. Use aife_transformer_maker to create a transformer object. See section 3 “Transformer Maker” in the vignette 01 Transformers for Developers for details.
base_model <- aife_transformer_maker$make("bert")
base_model$create(
ml_framework = "pytorch",
model_dir = "my_own_transformer",
text_dataset = LargeDataSetForText$new(example_data),
vocab_size = 30522,
vocab_do_lower_case = FALSE,
max_position_embeddings = 512,
hidden_size = 768,
num_hidden_layer = 12,
num_attention_heads = 12,
intermediate_size = 3072,
hidden_act = "gelu",
hidden_dropout_prob = 0.1,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
trace = TRUE,
log_dir = NULL,
log_write_interval = 2
)
First, the function receives the machine learning framework you chose at the start of the session. However, you can change this by setting ml_framework = "tensorflow" or ml_framework = "pytorch".
For this function to work, you must provide a path to a directory where your new transformer should be saved (model_dir). Furthermore, you must provide raw texts. These texts are not used to train the transformer but to build the vocabulary. The maximum size of the vocabulary is determined by vocab_size. Modern tokenizers such as WordPiece (Wu et al. 2016) use algorithms that split tokens into smaller elements, allowing them to build a huge number of words with a small number of elements. Thus, even with a small vocabulary of about 30,000 tokens, they are able to represent a very large number of words.
The other parameters allow you to customize your BERT model. For example, you could increase the number of hidden layers from 12 to 24 or reduce the hidden size from 768 to 256, allowing you to build and to test larger or smaller models.
The vignette 04 Model configuration provides details on how to configure a base model.
Please note that with max_position_embeddings you determine how many tokens your transformer can process. If your text has more tokens, these tokens are ignored. However, if you would like to analyze long documents, please avoid increasing this number too much because the computational time does not increase linearly but quadratically (Beltagy, Peters & Cohan 2020). For long documents you can use another architecture (e.g., Longformer from Beltagy, Peters & Cohan 2020) or split a long document into several chunks which are used sequentially for classification (e.g., Pappagari et al. 2019). Using chunks is supported by aifeducation for all models.
Since creating a transformer model is energy consuming, aifeducation allows you to estimate its ecological impact with the help of the python library codecarbon. Thus, sustain_track is set to TRUE by default. If you use the sustainability tracker, you must provide the alpha-3 code of the country where your computer is located (e.g., “CAN” = Canada, “DEU” = Germany). A list with the codes can be found on Wikipedia. The reason is that different countries use different sources and techniques for generating their energy, resulting in a specific impact on CO2 emissions. For the USA and Canada, you can additionally specify a region by setting sustain_region. Please refer to the documentation of codecarbon for more information.
After calling the function, you will find your new model in your model directory.
3.3 Train/Fine-Tune a Base Model
If you would like to train a new base model (see section 3.2) for the first time or want to adapt a pre-trained model to a domain-specific language or task, you can call the corresponding train method. See section 3 “Transformer Maker” in the vignette 01 Transformers for Developers for details.
base_model$train(
ml_framework = "pytorch",
output_dir = "my_own_transformer_trained",
model_dir_path = "my_own_transformer",
text_dataset = LargeDataSetForText$new(example_data[1:10, ]),
p_mask = 0.15,
whole_word = TRUE,
val_size = 0.1,
n_epoch = 1,
batch_size = 12,
chunk_size = 250,
n_workers = 1,
multi_process = FALSE,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
trace = TRUE,
log_dir = NULL,
log_write_interval = 2
)
Here it is important that you provide the path to the directory where your new transformer is stored. Furthermore, it is important that you provide another directory where your trained transformer should be saved to avoid reading and writing collisions.
Now, the provided raw data is used to train your model. In case of a BERT model, the learning objective is Masked Language Modeling. Other models may use other learning objectives. Please refer to the documentation for more details on every model.
First, you can set the length of the token sequences with chunk_size. With whole_word you can choose between masking single tokens or masking complete words (please remember that modern tokenizers split words into several tokens, so tokens and words do not necessarily match each other directly). With p_mask you determine how many tokens should be masked. Finally, with val_size, you set how many chunks of tokens should be used for the validation sample. The minimum is 2.
Please remember to set the correct alpha-3 code for tracking the ecological impact of training your model (sustain_iso_code).
If the graphics device of your machine has only a small memory capacity, please reduce the batch size significantly. We also recommend changing the memory usage with set_config_gpu_low_memory() at the beginning of the session if you use tensorflow as your framework.
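For example, a session start on a machine with a small GPU could look like the following sketch, which combines the session setup from section 2.1 with the memory setting mentioned above (only relevant when tensorflow is used as the framework):

reticulate::use_condaenv(condaenv = "aifeducation")
library(aifeducation)

# Reduce the GPU memory usage of tensorflow for this session
set_config_gpu_low_memory()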
After the training finishes, you can find the transformer ready to use in your output directory (output_dir). Now you are able to create a text embedding model.
Again, you can change the machine learning framework by setting ml_framework = "tensorflow" or ml_framework = "pytorch". If you do not change this argument, the framework you chose at the beginning is used.
4 Text Embedding Models
4.1 Introduction
The text embedding model is the interface to R in aifeducation. In order to create a new model, you need a base model that provides the ability to understand natural language. A text embedding model is stored as an object of class TextEmbeddingModel. This object contains all relevant information for transforming raw texts into a numeric representation that can be used for machine learning.
In aifeducation, the transformation of raw texts into numbers is a separate step from downstream tasks such as classification. This is to reduce computational time on machines with low performance. By separating text embedding from other tasks, the text embedding has to be calculated only once and can be used for different tasks at the same time. Another advantage is that the training of the downstream tasks involves only the downstream tasks and not the parameters of the embedding model, making training less time-consuming and thus less computationally intensive. Finally, this approach allows the analysis of long documents by applying the same algorithm to different parts.
The text embedding model provides a unified interface: After creating the model with different methods, the handling of the model is always the same.
4.2 Create a Text Embedding Model
First you have to choose the base model that forms the foundation of your new text embedding model. Since we use a BERT model in our example, we have to set method = "bert".
bert_modeling <- TextEmbeddingModel$new()
bert_modeling$configure(
model_name = "bert_embedding",
model_label = "Text Embedding via BERT",
model_language = "english",
method = "bert",
max_length = 512,
chunks = 4,
overlap = 30,
emb_layer_min = "middle",
emb_layer_max = "2_3_layer",
emb_pool_type = "average",
model_dir = "my_own_transformer_trained"
)
Next, you have to provide the directory where your base model is stored. In this example this would be model_dir = "my_own_transformer_trained". Of course, you can use any other pre-trained model from Hugging Face which addresses your needs.
Using a BERT model for text embedding is unproblematic as long as your text does not produce more tokens than the transformer can process. This maximum value is set in the configuration of the transformer (see section 3.2). If a text produces more tokens, the last tokens are ignored. In some instances, however, you might want to analyze long texts. In these situations, reducing the text to the first tokens (e.g., only the first 512 tokens) could result in a problematic loss of information. To deal with these situations, you can configure a text embedding model in aifeducation to split long texts into several chunks which are processed by the base model. The maximum number of chunks is set with chunks. In our example above, the text embedding model would split a text consisting of 1024 tokens into two chunks, with every chunk consisting of 512 tokens. For every chunk, a text embedding is calculated. As a result, you receive a sequence of embeddings. The first embedding characterizes the first part of the text, the second embedding characterizes the second part, and so on. Thus, our sample text embedding model is able to process texts with about 4*512 = 2048 tokens. This approach is inspired by the work of Pappagari et al. (2019).
Since transformers are able to account for the context, it may be useful to interconnect the chunks to bring context into the calculations. This can be done with overlap, which determines how many tokens from the end of a prior chunk should be added to the beginning of the following chunk. In our example, the last 30 tokens of the prior chunk are added at the beginning of the following chunk. This can help to bring the correct context of the text sections into the analysis. Altogether, this sample model can analyse a maximum of 512 + 3 * (512 - 30) = 1958 tokens of a text.
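The following plain R arithmetic (not a package function) illustrates how this maximum follows from the configuration above; the first chunk uses its full length, and every further chunk contributes its length minus the overlapping tokens:

max_length <- 512 # tokens per chunk (max_length)
chunks <- 4       # maximum number of chunks
overlap <- 30     # tokens repeated at the start of every following chunk

max_length + (chunks - 1) * (max_length - overlap)
#> [1] 1958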
Finally, you have to decide from which hidden layer(s) the embeddings should be drawn. With emb_layer_min and emb_layer_max you decide from which layers the average value for every token should be calculated. Please note that the calculation considers all layers between emb_layer_min and emb_layer_max. In their initial work, Devlin et al. (2019) used the hidden states of different layers for classification.
With emb_pool_type, you decide which tokens are used for pooling within every layer. In the case of emb_pool_type = "cls", only the cls token is used. In the case of emb_pool_type = "average", all tokens within a layer are averaged except padding tokens.
The vignette 04 Model configuration provides details on how to configure a text embedding model.
After deciding about the configuration, you can use your model.
4.3 Transforming Raw Texts into Embedded Texts
To transform raw text into a numeric representation, you only have to use the embed_large method of your model. To do this, you must provide a LargeDataSetForText to large_datas_set. Relying on the sample data from section 2.3, we can use the movie reviews as raw texts.
review_embeddings <- bert_modeling$embed_large(
large_datas_set = data_set_reviews_text,
trace = TRUE
)
The method embed_large creates an object of class LargeDataSetForTextEmbeddings. This is just a data set consisting of the embeddings of every text. The embeddings are an array in which the first dimension refers to specific texts, the second dimension refers to chunks/sequences, and the third dimension refers to the features.
With the embedded texts you now have the input to train a new classifier or to apply a pre-trained classifier for predicting categories/classes. In the next chapter we will show you how to use these classifiers. But before we start, we will show you how to save and load your model.
4.4 Saving and Loading Text Embedding Models
Saving a created text embedding model is very easy in aifeducation by using the function save_to_disk. This function provides a unified interface for all text embedding models. For saving your work you pass your model to object and the directory where to save the model to dir_path. With folder_name you determine the name of the folder that should be created in that directory to store the model.
save_to_disk(
object = bert_modeling,
dir_path = "C:/text_embedding_models",
folder_name = "bert_model"
)
In this example the model is saved in a folder at the location C:/text_embedding_models/bert_model. If you want to load your model, you can call load_from_disk.
bert_modeling <- load_from_disk("C:/text_embedding_models/bert_model")
4.5 Sustainability
In case the underlying model was trained with an active sustainability tracker (sections 3.2 and 3.3), you can receive a table showing the energy consumption, CO2 emissions, and hardware used during training by calling the method get_sustainability_data(). For our example this would be bert_modeling$get_sustainability_data().
5 Classifiers
5.1 Create a Classifier
Classifiers are built on top of a TextEmbeddingModel. You can create a new classifier by calling TEClassifierRegular$new(). The TE in the name of the class refers to the idea that the classifier uses text embeddings instead of raw texts.
With the sample data from section 2.3 and the text embeddings from section 4.3, the creation of a new classifier may look like:
classifier <- TEClassifierRegular$new()
classifier$configure(
name = "movie_review_classifier",
label = "Classifier for Estimating a Postive or Negative Rating of Movie Reviews",
text_embeddings = review_embeddings,
feature_extractor = NULL,
target_levels = c("neg", "pos"),
dense_layers = 2,
dense_size = 5,
rec_layers = 2,
rec_size = 10,
rec_type = "gru",
rec_bidirectional = FALSE,
self_attention_heads = 0,
intermediate_size = NULL,
attention_type = "fourier",
add_pos_embedding = FALSE,
rec_dropout = 0.5,
repeat_encoder = 0,
dense_dropout = 0.2,
recurrent_dropout = 0.6,
encoder_dropout = 0.1,
optimizer = "adam"
)
Similarly to the text embedding model, you should provide a name (name) and a label (label) for your new classifier. With text_embeddings you have to provide a LargeDataSetForTextEmbeddings. This data set is created with a TextEmbeddingModel as described in section 4. Here we continue our example and use the embeddings produced by our BERT model.
target_levels takes the categories/classes your classifier should predict. These can be numbers or even words.

In case you would like to use ordinal data, it is very important that you provide the classes/categories in the correct order. That is, classes/categories representing a “higher” level must be stated before categories/classes with a lower level. If you provide the wrong order, the performance indices are not valid. In case of nominal data the order does not matter.
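As a purely hypothetical illustration based on the exam example from section 1.2, an ordinal scale would be passed with the “higher” category stated first:

# Hypothetical ordinal scale: "good" represents the highest level and is
# therefore stated first; this vector would be passed to target_levels
exam_levels <- c("good", "average", "poor")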
With feature_extractor you can add a feature extractor that tries to reduce the number of features of your text embedding before passing the embeddings to the classifier. You can read more on this in section 6.2.
With the other parameters you decide about the structure of your classifier. Figure 4 illustrates this.
dense_layers determines the number of dense layers and dense_size the number of neurons for all dense layers. In our example, there are two dense layers with 5 neurons each. rec_layers determines the number of recurrent layers and rec_size their size. In this example, we use two layers with 10 neurons each. With rec_type you can choose between two types of recurrent layers: rec_type = "gru" implements a Gated Recurrent Unit (GRU) network and rec_type = "lstm" implements a Long Short-Term Memory layer. With rec_bidirectional you can decide whether the recurrent layers should be unidirectional or bidirectional.
Since the classifiers in aifeducation use a standardized scheme for their creation, dense layers are used after the gru layers. If you want to omit gru layers or dense layers, set the corresponding argument for the number of layers to 0 (dense_layers = 0, rec_layers = 0).
If you use a text embedding model that processes more than one chunk, we recommend using recurrent layers, since they use the sequential structure of your data. In all other cases you can rely on dense layers only.
If you use text embeddings with more than one chunk, you can try self-attention layering in order to take the context of all chunks into account. To add self-attention you have two choices:
- You can use the multi-head attention mechanism used in classic transformer models (Vaswani et al. 2017). For this variant you have to set attention_type = "multihead", repeat_encoder to a value of at least 1, and self_attention_heads to a value of at least 1.
- Alternatively, you can use the attention mechanism of the FNet model described in Lee-Thorp et al. (2021), which allows much faster computations at low accuracy costs. To use this kind of attention, you have to set attention_type = "fourier" and repeat_encoder to a value of at least 1.
With repeat_encoder you can choose how many times an encoder layer should be added. The encoder is implemented as described by Chollet, Kalinowski, and Allaire (2022, pp. 373) for both variants of attention. In our example we have only 300 cases altogether and only 4 chunks. Thus, we do not use any encoder layers.
You can further extend the abilities of your network by adding positional embeddings. Positional embeddings take care of the order of your chunks. Thus, adding such a layer may increase performance if the order of information is important. You can add this layer by setting add_pos_embedding = TRUE. The layer is created as described by Chollet, Kalinowski, and Allaire (2022, pp. 378).
The vignette 04 Model configuration provides details on how to configure a classifier.
Masking, normalization, and the creation of the input layer as well as the output layer are done automatically.
After you have created a new classifier, you can begin training.
5.2 Training a Classifier
To start the training of your classifier, you have to call the train method. As for the creation of the classifier, you must provide the text embeddings to data_embeddings and the categories/classes as target data to data_targets. Please remember that data_targets expects a named factor where the names correspond to the IDs of the corresponding text embeddings. Text embeddings and target data that cannot be matched are omitted from training.
To train a classifier, it is necessary that you provide a path to dir_checkpoint. This directory stores the best set of weights during each training epoch. After training, these weights are automatically used as the final weights of the classifier.
For performance estimation, training splits the data into several folds based on cross-fold validation. The number of folds is set with data_folds. In every case, one fold is not used for training and serves as a test sample. The remaining data is used to create a training and a validation sample. The percentage of cases within each fold used as the validation sample is determined with data_val_size. This sample is used to determine the state of the model that generalizes best. All performance values saved in the trained classifier refer to the test sample. This data has never been used during training and provides a more realistic estimation of a classifier’s performance.
classifier$train(
data_embeddings = review_embeddings,
data_targets = review_labels,
data_folds = 10,
data_val_size = 0.25,
balance_class_weights = TRUE,
balance_sequence_length = TRUE,
use_sc = FALSE,
sc_method = "dbsmote",
sc_min_k = 1,
sc_max_k = 10,
use_pl = FALSE,
pl_max_steps = 5,
pl_max = 1.00,
pl_anchor = 1.00,
pl_min = 0.00,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
epochs = 300,
batch_size = 32,
dir_checkpoint = "training/classifier",
trace = TRUE,
ml_trace = 1
)
You can further modify the training process with different arguments. With balance_class_weights = TRUE the absolute frequencies of the classes/categories are adjusted according to the ‘Inverse Class Frequency’ method. This option should be activated if you have to deal with imbalanced data.

With balance_sequence_length = TRUE you can increase performance if you have to deal with texts that differ in their lengths and have an imbalanced frequency. If this option is enabled, the loss is adjusted to the absolute frequencies of the lengths of your texts according to the ‘Inverse Class Frequency’ method.
epochs determines the maximal number of epochs. During training, the model with the best balanced accuracy is saved and used.

batch_size sets the number of cases that should be processed simultaneously. Please adjust this value to your machine’s capacities. Please note that the batch size can have an impact on the classifier’s performance.
Since aifeducation tries to address the special needs of the educational and social sciences, some special training steps are integrated into this method.
- Synthetic Cases: In case of imbalanced data, it is recommended to set use_sc = TRUE. Before training, a number of synthetic units is created via different techniques. Currently you can request the Basic Synthetic Minority Oversampling Technique, the Density-Based Synthetic Minority Oversampling Technique, and the Adaptive Synthetic Sampling Approach for Imbalanced Learning. The aim is to create new cases that fill the gap to the majority class. For generating these units, multi-class problems are reduced to a two-class problem (class under investigation vs. all other classes). If the technique allows setting the number of neighbors during generation, you can configure the data generation with sc_min_k and sc_max_k. The synthetic cases for every class are generated for all k between sc_min_k and sc_max_k, and every k contributes proportionally to the synthetic cases.
- Pseudo-Labeling: This technique is relevant if you have labeled target data and a large number of unlabeled target data. With the different parameters starting with “pl_”, you can configure the process of pseudo-labeling. The implementation of pseudo-labeling is based on Cascante-Bonilla et al. (2020). To apply pseudo-labeling, you have to set use_pl = TRUE. pl_max = 1.00, pl_anchor = 1.00, and pl_min = 0.00 are used to describe the certainty of a prediction, where 0 refers to random guessing and 1 refers to perfect certainty. pl_anchor is used as a reference value. The distance to pl_anchor is calculated for every case; the cases are then sorted with increasing distance from pl_anchor. The proportion of pseudo-labeled data added to the training increases with every step. The maximum number of steps is determined with pl_max_steps. (A sketch of a training call with both options enabled follows this list.)
Figure 5 illustrates the training loop for the case that all options are set to TRUE.
The example above applies the generation of synthetic cases and the algorithm proposed by Cascante-Bonilla et al. (2020). For every fold, the training starts with generating synthetic cases to fill the gap between the minority classes and the majority class. After this, an initial training of the classifier starts. The trained classifier is used to predict pseudo-labels for the unlabeled part of the data, and the 20% of cases with the highest certainty for their pseudo-labels are added to the training data set. Now new synthetic cases are generated based on both the labeled data and the newly added pseudo-labeled data. The classifier is re-initialized and trained again. After training, the classifier predicts the potential labels of all originally unlabeled data and adds the 40% of the pseudo-labeled data with the highest certainty to the training data. Again, new synthetic cases are generated on both the labeled and the added pseudo-labeled data. The model is re-initialized and trained again until the maximum number of steps for pseudo-labeling (pl_max_steps) is reached. After this, the algorithm is restarted for the next fold until the number of folds (data_folds) is reached. All of these steps are only used to estimate the performance of the classifier on data unknown to the classifier.
The last phase of the training begins after the last fold. In the final training, the data set is split only into a training and validation set without a test set to provide the maximum amount of data for the best performance in final training.
In case options like the generation of synthetic cases (use_sc) or pseudo-labeling (use_pl) are disabled, the training process is shorter.
Since training a neural net is energy consuming, aifeducation allows you to estimate its ecological impact with the help of the python library codecarbon. Thus, sustain_track is set to TRUE by default. If you use the sustainability tracker, you must provide the alpha-3 code of the country where your computer is located (e.g., “CAN” = Canada, “DEU” = Germany). A list with the codes can be found on Wikipedia. The reason is that different countries use different sources and techniques for generating their energy, resulting in a specific impact on CO2 emissions. For the USA and Canada, you can additionally specify a region by setting sustain_region. Please refer to the documentation of codecarbon for more information.
Finally, trace and ml_trace allow you to control how much information about the training progress is printed to the console. Please note that training the classifier can take some time.
Please note that after performance estimation, the final training of the classifier makes use of all data available. That is, the test sample is left empty.
5.3 Evaluating Classifier’s Performance
After finishing training, you can evaluate the performance of the classifier. For every fold, the classifier is applied to the test sample and the results are compared to the true categories/classes. Since the test sample is never part of the training, all performance measures provide a more realistic idea of the classifier’s performance.
To support researchers in judging the quality of the predictions, aifeducation utilizes several measures and concepts from content analysis. These are
- Iota Concept of the Second Generation (Berding & Pargmann 2022)
- Krippendorff’s Alpha (Krippendorff 2019)
- Percentage Agreement
- Gwet’s AC1/AC2 (Gwet 2014)
- Kendall’s coefficient of concordance W
- Cohen’s Kappa unweighted
- Cohen’s Kappa with equal weights
- Cohen’s Kappa with squared weights
- Fleiss’ Kappa for multiple raters without exact estimation
You can access the concrete values via the field reliability, which stores all relevant information. In this list you will find the reliability values for every fold. In addition, the reliability of every step within pseudo-labeling is reported.

The central estimates for the reliability values can be found via reliability$test_metric_mean. In our example this would be:
classifier$reliability$test_metric_mean
#> iota_index min_iota2 avg_iota2
#> 0.5606719 0.4584235 0.5869457
#> max_iota2 min_alpha avg_alpha
#> 0.7154678 0.5785714 0.7226190
#> max_alpha static_iota_index dynamic_iota_index
#> 0.8666667 0.2620308 0.4736155
#> kalpha_nominal kalpha_ordinal kendall
#> 0.4654527 0.4654527 0.7369689
#> kappa2_unweighted kappa2_equal_weighted kappa2_squared_weighted
#> 0.4613610 0.4613610 0.4613610
#> kappa_fleiss percentage_agreement balanced_accuracy
#> 0.4533283 0.7693676 0.7226190
#> gwet_ac avg_precision avg_recall
#> 0.5980910 0.7610960 0.7226190
#> avg_f1
#> 0.7266641
Of particular interest are the values for alpha from the Iota Concept, since they represent a measure of reliability that is independent of the frequency distribution of the classes/categories. The alpha values describe the probability that a case of a specific class is recognized as that specific class. As you can see, compared to the baseline model, applying Balanced Synthetic Cases increases the minimal value of alpha, reducing the risk of missing cases which belong to a rare class (see row with “BSC”). On the contrary, the alpha values for the major category decrease slightly, thus losing their unjustified bonus from a high number of cases in the training set. This provides a more realistic performance estimation of the classifier.

In addition, standard measures from machine learning are reported. These are
- Precision
- Recall
- F1-Score
You can access these values as follows:
classifier$reliability$standard_measures_mean
#> precision recall f1
#> neg 0.7155556 0.5785714 0.6209740
#> pos 0.8066364 0.8666667 0.8323543
Finally, you can plot a coding stream scheme showing how the cases of different classes are labeled. Here we use the package iotarelr.
library(iotarelr)
iotarelr::plot_iota2_alluvial(classifier$reliability$iota_object_end_free)
Here you can see that a small number of negative reviews is treated as a good review, while a larger number of positive reviews is treated as a bad review. Thus, the data for the major class (negative reviews) is more reliable and valid than the data for the minor class (positive reviews).
Evaluating the performance of a classifier is a complex task and beyond the scope of this vignette. Instead, we would like to refer to the cited literature on content analysis and machine learning if you would like to dive deeper into this topic.
5.4 Sustainability
In case the classifier was trained with an active sustainability tracker, you can receive information on sustainability by calling classifier$get_sustainability_data().
classifier$get_sustainability_data()
#> $sustainability_tracked
#> [1] TRUE
#>
#> $date
#> [1] "Tue Oct 1 20:56:37 2024"
#>
#> $sustainability_data
#> $sustainability_data$duration_sec
#> [1] 343.5135
#>
#> $sustainability_data$co2eq_kg
#> [1] 0.0005406515
#>
#> $sustainability_data$cpu_energy_kwh
#> [1] 0.001012826
#>
#> $sustainability_data$gpu_energy_kwh
#> [1] 0
#>
#> $sustainability_data$ram_energy_kwh
#> [1] 0.0004664782
#>
#> $sustainability_data$total_energy_kwh
#> [1] 0.001479304
#>
#>
#> $technical
#> $technical$tracker
#> [1] "codecarbon"
#>
#> $technical$py_package_version
#> [1] "2.3.4"
#>
#> $technical$cpu_count
#> [1] 8
#>
#> $technical$cpu_model
#> [1] "11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz"
#>
#> $technical$gpu_count
#> [1] NA
#>
#> $technical$gpu_model
#> [1] NA
#>
#> $technical$ram_total_size
#> [1] 15.73279
#>
#>
#> $region
#> $region$country_name
#> [1] "Germany"
#>
#> $region$country_iso_code
#> [1] "DEU"
#>
#> $region$region
#> [1] NA
5.5 Saving and Loading a Classifier
Saving and loading follows the same pattern as for the other objects in aifeducation. You can save the classifier by calling save_to_disk. In our example this may be:
save_to_disk(
object = classifier,
dir_path = "C:/classifiers",
folder_name = "imdb_movie_reviews"
)
The classifier is saved to C:/classifiers/imdb_movie_reviews. To load the model, call load_from_disk.
classifier <- load_from_disk("C:/classifiers/imdb_movie_reviews")
5.6 Predicting New Data
If you would like to apply your classifier to new data, two steps are necessary. You must first transform the raw text into a numerical expression by using exactly the same text embedding model that was used to train your classifier (see section 4). In the case of our example classifier, we use our BERT model.
# If our model is not loaded
bert_modeling <- load_from_disk("C:/text_embedding_models/bert_model")
# Create a numerical representation of the text
review_embeddings <- bert_modeling$embed_large(
large_datas_set = data_set_reviews_text,
trace = TRUE
)
To transform raw texts into a numeric representation, just pass the raw texts to the method embed_large of the loaded model. The raw texts should be an object of class LargeDataSetForText. To create such a data set, please refer to section 2.
In the example above, the text embeddings are stored in review_embeddings. Since embedding texts may take some time, it is a good idea to save the embeddings for future analysis (see section 2 for more details). This allows you to load the embeddings without the need to apply the text embedding model to the same raw texts again.
The resulting object can then be passed to the method predict of our classifier, and you will get the predictions together with an estimate of certainty for each class/category.
# If your classifier is not loaded
classifier <- load_from_disk("C:/classifiers/imdb_movie_reviews")
# Predict the classes of new texts
predicted_categories <- classifier$predict(
newdata = review_embeddings,
batch_size = 8
)
After the classifier finishes the prediction, the estimated categories/classes are stored as predicted_categories. This object is a data.frame containing the texts’ IDs in the rows and the probabilities of the different categories/classes in the columns. The last column, named expected_category, represents the category which is assigned to a text due to its highest probability.
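For example, assuming the prediction above has finished, the result could be inspected like this:

# Show the first predictions: one row per text, one probability column per
# class, plus the column expected_category with the most likely class
head(predicted_categories)

# Extract the most likely class for every text
predicted_categories$expected_category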
The estimates can be used in further analysis with common methods of the educational and social sciences such as correlation analysis, regression analysis, structural equation modeling, latent class analysis or analysis of variance.
Now you are ready to use aifeducation. In section 6 we describe further models for classification tasks and for improving model performance.
6 Extensions
6.1 Classifiers: ProtoNet
The classifier introduced in section 5 is a regular classifier which comes with the traditional challenges of deep learning, such as the need for a large amount of training data, expensive hardware requirements, and only a limited possibility to interpret the model’s parameters (Jadon & Garg 2020, pp. 13-14). Since data is a bottleneck in the educational and social sciences, a classifier that can work with only small data sets would be preferable. These types of models are discussed in the literature under terms such as “meta-learning” (Zou 2023) or “few-shot learning” (Jadon & Garg 2020). The basic idea behind these approaches is that the model learns to use a supporting data set to predict the output for a query data set (e.g., Zou 2023, pp. 2-3). However, the model is not explicitly trained for the query data set.
One type of models within this area are Prototypical Networks (ProtoNet) which were initially proposed by Snell, Swersky, and Zemel (2017). This type of network was developed to create classifiers that are able to generalize to new classes that the model did not see during training, using only the information of a few examples of each class provided to the network (support data set). To achieve this goal, the networks learn to create a prototype for every class in the support data set with help of the examples for every class. Then, the network compares the new data with these prototypes and assigns the class of the nearest prototype to the new data. Since the network calculates the distance of every new case to every prototype, it belongs to the metric-based meta-learning approaches (Zhou 2023, pp. 48).
Since ProtoNet is a simple, easy-to-understand approach and provides good performance, several extensions have been suggested. aifeducation replaces the original loss function with the loss function suggested by Zhang et al. (2019) and adds the learnable metric described by Oreshkin, Rodriguez, and Lacoste (2019) to increase performance.
The implementation provided in aifeducation currently applies only to a fixed set of classes and the prototypes are learned during training by using all available training data. This will be extended/changed in the future to allow the selection of the support data set by the user.
The application of a classifier based on ProtoNet is similar to the regular classifiers. The only difference is embedding_dim. A ProtoNet classifier uses a network to project the similarities and differences between the single cases and all prototypes into an n-dimensional space. Similar cases are located near each other while different cases are located further away. The number of dimensions of this space is determined by embedding_dim. In case embedding_dim is set to 1, 2, or 3, the positions of every case and of the prototypes can be easily visualized. For this example we use the same data as in section 5. Let us first create and configure the new classifier.
classifier <- TEClassifierProtoNet$new()
classifier$configure(
name = "proto_net_movie_review_classifier",
label = "ProtoNet classifier for Estimating a Postive or Negative Rating of Movie Reviews",
text_embeddings = review_embeddings,
feature_extractor = NULL,
target_levels = c("neg", "pos"),
hidden = c(5),
rec = c(6, 6),
rec_type = "gru",
rec_bidirectional = FALSE,
embedding_dim = 2,
self_attention_heads = 0,
intermediate_size = NULL,
attention_type = "fourier",
add_pos_embedding = TRUE,
rec_dropout = 0.3,
repeat_encoder = 0,
dense_dropout = 0.4,
recurrent_dropout = 0.4,
encoder_dropout = 0.1,
optimizer = "adam"
)
Now we can plot how the untrained classifier embeds the different cases and the prototypes. To create the corresponding plot you can call the method plot_embeddings. The argument embeddings_q takes the embeddings of the different cases as the input of the classifier. If you have the true classes for all or some of the cases, you can add them to the plot by using the argument classes_q. The resulting plot is shown in the following figure.
plot_untrained <- classifier$plot_embeddings(
embeddings_q = review_embeddings,
classes_q = review_labels
)
plot_untrained
The large triangles represent the prototypes of the classes, while the dots refer to the labeled cases in the data set. For these, the color represents their true class. Unlabeled cases are shown as squares, and here the color indicates the estimated class. As you can see, all cases are located close together and there seems to be no clear structure. Let us see how this changes when we train the model.
classifier$train(
data_embeddings = review_embeddings,
data_targets = review_labels,
data_folds = 5,
data_val_size = 0.25,
use_sc = TRUE,
sc_method = "dbsmote",
sc_min_k = 1,
sc_max_k = 10,
use_pl = TRUE,
pl_max_steps = 5,
pl_max = 1.00,
pl_anchor = 1.00,
pl_min = 0.00,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
epochs = 400,
batch_size = 32,
Ns = 2,
Nq = 10,
loss_alpha = 0.5,
loss_margin = 0.5,
sampling_separate=FALSE,
sampling_shuffle = TRUE,
dir_checkpoint = "training/classifier",
trace = TRUE,
ml_trace=1
)
While there are no arguments for requesting a balance of the class weights (balance_class_weights) or of the sequence length (balance_sequence_length), four new arguments are available. With Ns you determine how many examples of every class should be used within the support sample during training. These examples are used to calculate the prototypes for every class. With Nq you determine how many examples of every class should be part of the query sample. During training the network tries to predict the correct classes of these examples.
The arguments loss_alpha and loss_margin configure the loss function described by Zhang et al. (2019). loss_margin is the minimal distance that all examples of the query sample should have to all prototypes that do not represent their class. loss_alpha determines whether the loss pays more attention to minimizing the distance between the examples and their corresponding prototype or to maximizing the distance to the prototypes that do not represent their class. If you set loss_alpha=1, the loss only tries to minimize the distance of the examples to their corresponding prototype. If you set loss_alpha=0, the loss only tries to maximize the distance of all examples to all prototypes that do not reflect their class.
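To give an intuition of how these two arguments interact, the following toy function is a simplified sketch in the spirit of a margin-based prototype loss. It is not the exact formula implemented in the package; the names d_own and d_other are hypothetical and stand for the distance of a query example to its own prototype and to the other prototypes.
# Simplified sketch of a margin-based prototype loss (illustrative only)
proto_loss <- function(d_own, d_other, loss_alpha = 0.5, loss_margin = 0.5) {
pull <- d_own # distance to the example's own prototype
push <- pmax(0, loss_margin - d_other) # penalty if foreign prototypes are closer than the margin
loss_alpha * pull + (1 - loss_alpha) * mean(push)
}
proto_loss(d_own = 0.2, d_other = c(0.9, 1.4)) # well separated -> small loss
proto_loss(d_own = 0.8, d_other = c(0.3, 0.4)) # poorly separated -> larger loss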
The next two important arguments refer to the sampling strategy during training. With sampling_separate=TRUE, the cases for the sample and the query are drawn from the same pool of cases. Thus, a specific case can be a sample case in one epoch and a query case in another epoch. However, it is ensured that a specific case never occurs as a sample and a query case within the same training step. In addition, it is ensured that every case occurs only once within a training step. If you set sampling_separate=FALSE, the training data set is split into one data pool for the sample and one data pool for the query. Thus, a case can only be either a sample case or a query case. With sampling_shuffle you can request that a random sample is drawn from the training data set for every training step, resulting in different combinations of sample and query cases. For training we highly recommend setting sampling_shuffle=TRUE, since this will result in better performing classifiers.
After training we can request a visualization of the data again. We first omit all unlabeled cases by setting inc_unlabeled=FALSE in order to get an impression of the quality of the training.
plot_trained_1 <- classifier$plot_embeddings(
embeddings_q = review_embeddings,
classes_q = review_labels,
inc_unlabeled = FALSE
)
plot_trained_1
As shown in the figure, the cases are now clearly grouped. Cases of the class “neg” are located close to the prototype for “neg”, while cases of the class “pos” are located near the prototype for “pos”. Since we use the same data as during training, this result is to be expected. Only a small number of cases is located near the wrong prototype; this is the case when a red dot lies close to the prototype for “pos” or a green dot lies close to the prototype for “neg”.
Let us now add the unlabeled cases to the plot by setting inc_unlabeled=TRUE.
plot_trained_2 <- classifier$plot_embeddings(
embeddings_q = review_embeddings,
classes_q = review_labels,
inc_unlabeled = TRUE
)
plot_trained_2
As the following figure shows, the model estimates the class of these cases according to their distance to the two prototypes. Cases that are close to the prototype for “pos” are assigned to “pos”, while cases near the prototype for “neg” are assigned to “neg”.
Finally, let us report the reliability of this classifier.
classifier$reliability$test_metric_mean
#> iota_index min_iota2 avg_iota2
#> 0.4375494 0.3161485 0.4643332
#> max_iota2 min_alpha avg_alpha
#> 0.6125178 0.4013095 0.6123214
#> max_alpha static_iota_index dynamic_iota_index
#> 0.8233333 0.1946912 0.3727251
#> kalpha_nominal kalpha_ordinal kendall
#> 0.2297705 0.2297705 0.6241177
#> kappa2_unweighted kappa2_equal_weighted kappa2_squared_weighted
#> 0.2345552 0.2345552 0.2345552
#> kappa_fleiss percentage_agreement balanced_accuracy
#> 0.2122470 0.6717391 0.6123214
#> gwet_ac avg_precision avg_recall
#> 0.4255430 0.6485922 0.6123214
#> avg_f1
#> 0.6061235
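If you are interested in specific measures only, you can extract them from the returned object. The following lines assume that test_metric_mean behaves like a named numeric vector, as the printed output above suggests.
# Extract selected measures (assuming a named numeric vector)
classifier$reliability$test_metric_mean["balanced_accuracy"]
classifier$reliability$test_metric_mean[c("avg_precision", "avg_recall", "avg_f1")]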
6.2 Feature Extractors
Another option to increase a model’s performance and/or its computational speed is to apply a feature extractor. For example, the work by Ganesh et al. (2021) indicates that a reduction of the hidden size can increase a model’s accuracy. In aifeducation, a feature extractor is a model that tries to reduce the number of features of given text embeddings before these embeddings are fed as input to a classifier.
The feature extractors implemented in aifeducation are auto-encoders that support sequential data and sequences of different length. The basic architecture of all extractors is shown in the following figure.
The learning objective of the feature extractors is first to compress information by reducing the number of features to the number of features of the latent space (Frochte 2019, p.281). In the figure above, this would mean to reduce the number of features from 8 to 4 and to store as much information as possible from the 8 dimensions in only 4 dimensions. In the next step, the extractor tries to reconstruct the original information from the compressed information of the latent space (Frochte 2019, pp.280-281). The information is extended from 4 dimensions to 8. After training, the hidden representation of the latent space is used as a compression of the original input.
You can create a feature extractor as follows.
feature_extractor <- TEFeatureExtractor$new()
feature_extractor$configure(
name = "feature_extractor_bert_movie_reviews",
label = "Feature extractor for Text Embeddings via BERT",
text_embeddings = review_embeddings,
features = 128,
method = "lstm",
noise_factor = 0.2,
optimizer = "adam"
)
Similar to the other models, you can use name for the model’s name and label for the model’s label. The argument text_embeddings takes an object of class EmbeddedText or LargeDataSetForTextEmbeddings. With this object you connect your feature extractor to a specific TextEmbeddingModel. That is, the feature extractor works only with embeddings from exactly this TextEmbeddingModel.
features determines the number of features of the compressed representation. The lower the number, the higher the requested compression. This value corresponds to the features of the latent space in the figure above.
With method you determine the type of layers the feature extractor should use. If you set method="lstm", all layers of the model are long short-term memory layers. If you set method="dense", all layers are standard dense layers. Independently of your choice, all models try to generate the latent space in such a way that the covariance of its features is zero. Thus, every feature represents unique information. In addition, all methods except "lstm" use an orthogonal parameterization to prevent over-fitting and apply parameter sharing, meaning that corresponding layers of the encoder and the decoder use the same parameters. For more details please refer to Ranjan (2019).
With noise_factor you can add some noise during training, which turns the feature extractor into a denoising auto-encoder and can provide more robust generalizations.
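As an intuition for what a denoising auto-encoder does (this is only an illustration of the general idea, not how aifeducation adds noise internally): the input is corrupted during training while the model still has to reconstruct the clean original.
# Illustration of the denoising idea: corrupt an embedding matrix with Gaussian
# noise scaled by a noise factor; the auto-encoder learns to reconstruct 'clean'
add_noise <- function(x, noise_factor = 0.2) {
x + noise_factor * matrix(rnorm(length(x)), nrow = nrow(x))
}
clean <- matrix(rnorm(12), nrow = 3)
noisy <- add_noise(clean, noise_factor = 0.2)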
Training the extractor is identical to the other models in aifeducation. Please note that the text embeddings provided to data_embeddings must be generated with the same TextEmbeddingModel as the embeddings provided during the configuration of your model.
feature_extractor$train(
data_embeddings = review_embeddings,
data_val_size = 0.25,
sustain_track = TRUE,
sustain_iso_code = "DEU",
sustain_region = NULL,
sustain_interval = 15,
epochs = 40,
batch_size = 32,
dir_checkpoint = "training/feature_extractor", # directory for checkpoints (choose any path)
trace = TRUE,
ml_trace = 1
)
After you have trained your feature extractor, you can use it with every classifier. Just pass the feature extractor to feature_extractor during the configuration of the classifier. For the classifier described in section 5 this would look like:
classifier <- TEClassifierRegular$new()
classifier$configure(
name = "movie_review_classifier",
label = "Classifier for Estimating a Postive or Negative Rating of Movie Reviews",
text_embeddings = review_embeddings,
feature_extractor = feature_extractor,
target_levels = c("neg", "pos"),
hidden = c(5),
rec = c(6, 6),
rec_type = "gru",
rec_bidirectional = FALSE,
self_attention_heads = 0,
intermediate_size = NULL,
attention_type = "fourier",
add_pos_embedding = TRUE,
rec_dropout = 0.1,
repeat_encoder = 0,
dense_dropout = 0.4,
recurrent_dropout = 0.4,
encoder_dropout = 0.1,
optimizer = "adam"
)
That is all. Now you can use and train the classifier in the same way you did without a feature extractor. Even saving and loading is done automatically.
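For example, predicting the classes of new texts works exactly as shown in section 5; the classifier compresses the embeddings with its feature extractor internally before they reach the classification layers.
# Prediction with a feature-extractor-backed classifier works as before
predicted_categories <- classifier$predict(
newdata = review_embeddings,
batch_size = 8
)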
References
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. https://doi.org/10.48550/arXiv.2004.05150
Berding, F., & Pargmann, J. (2022). Iota Reliability Concept of the Second Generation. Berlin: Logos. https://doi.org/10.30819/5581
Berding, F., Riebenbauer, E., Stütz, S., Jahncke, H., Slopinski, A., & Rebmann, K. (2022). Performance and Configuration of Artificial Intelligence in Educational Settings: Introducing a New Reliability Concept Based on Content Analysis. Frontiers in Education, 1–21. https://doi.org/10.3389/feduc.2022.818365
Campesato, O. (2021). Natural Language Processing Fundamentals for Developers. Mercury Learning & Information. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6647713
Cascante-Bonilla, P., Tan, F., Qi, Y. & Ordonez, V. (2020). Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning. https://doi.org/10.48550/arXiv.2001.06001
Chollet, F., Kalinowski, T., & Allaire, J. J. (2022). Deep learning with R (Second edition). Manning Publications Co. https://learning.oreilly.com/library/view/-/9781633439849/?ar
Dai, Z., Lai, G., Yang, Y. & Le, Q. V. (2020). Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. https://doi.org/10.48550/arXiv.2006.03236
Devlin, J., Chang, M.‑W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
Frochte, J. (2019). Maschinelles Lernen: Grundlagen und Algorithmen in Python (2., aktualisierte Auflage). Hanser.
Ganesh, P., Chen, Y., Lou, X., Khan, M. A., Yang, Y., Sajjad, H., Nakov, P., Chen, D., & Winslett, M. (2021). Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. Transactions of the Association for Computational Linguistics, 9, 1061–1080. https://doi.org/10.1162/tacl_a_00413
Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (Fourth edition). Gaithersburg: STATAXIS.
He, P., Liu, X., Gao, J. & Chen, W. (2020). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. https://doi.org/10.48550/arXiv.2006.03654
Jadon, S., & Garg, A. (2020). Hands-On One-shot Learning with Python: Learn to Implement Fast and Accurate Deep Learning Models with Fewer Training Samples Using Pytorch. Packt Publishing Limited. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6175328
Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology (4th ed.). Los Angeles: SAGE.
Lane, H., Howard, C., & Hapke, H. M. (2019). Natural language processing in action: Understanding, analyzing, and generating text with Python. Shelter Island: Manning.
Larusson, J. A., & White, B. (Eds.). (2014). Learning Analytics: From Research to Practice. New York: Springer. https://doi.org/10.1007/978-1-4614-3305-7
Lee, D.‑H. (2013). Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. CML 2013 Workshop: Challenges in Representation Learning.
Lee-Thorp, J., Ainslie, J., Eckstein, I. & Ontanon, S. (2021). FNet: Mixing Tokens with Fourier Transforms. https://doi.org/10.48550/arXiv.2105.03824
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://doi.org/10.48550/arXiv.1907.11692
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In D. Lin, Y. Matsumoto, & R. Mihalcea (Eds.), Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 142–150). Association for Computational Linguistics. https://aclanthology.org/P11-1015
Oreshkin, B. N., Rodriguez, P., & Lacoste, A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. Advance online publication. https://doi.org/10.48550/arXiv.1805.10123
Papilloud, C., & Hinneburg, A. (2018). Qualitative Textanalyse mit Topic-Modellen: Eine Einführung für Sozialwissenschaftler. Wiesbaden: Springer. https://doi.org/10.1007/978-3-658-21980-2
Pappagari, R., Zelasko, P., Villalba, J., Carmiel, Y., & Dehak, N. (2019). Hierarchical Transformers for Long Document Classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (pp. 838–844). IEEE. https://doi.org/10.1109/ASRU46091.2019.9003958
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. https://aclanthology.org/D14-1162.pdf
Ranjan, & Chitta. (2019). Build the right Autoencoder — Tune and Optimize using PCA principles.: Part I. https://towardsdatascience.com/build-the-right-autoencoder-tune-and-optimize-using-pca-principles-part-i-1f01f821999b
Schreier, M. (2012). Qualitative Content Analysis in Practice. Los Angeles: SAGE.
Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. https://doi.org/10.48550/arXiv.1703.05175
Song, K., Tan, X., Qin, T., Lu, J. & Liu, T.‑Y. (2020). MPNet: Masked and Permuted Pre-training for Language Understanding. https://doi.org/10.48550/arXiv.2004.09297
Tunstall, L., Werra, L. von, Wolf, T., & Géron, A. (2022). Natural language processing with transformers: Building language applications with hugging face (Revised edition). Heidelberg: O’Reilly.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. https://doi.org/10.48550/arXiv.1706.03762
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., . . . Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. https://doi.org/10.48550/arXiv.1609.08144
Zhang, X., Nie, J., Zong, L., Yu, H., & Liang, W. (2019). One Shot Learning with Margin. In Q. Yang, Z.-H. Zhou, Z. Gong, M.-L. Zhang, & S.-J. Huang (Eds.), Lecture Notes in Computer Science. Advances in Knowledge Discovery and Data Mining (Vol. 11440, pp. 305–317). Springer International Publishing. https://doi.org/10.1007/978-3-030-16145-3_24
Zou, L. (2023). Meta-Learning: Theory, Algorithms and Applications. Elsevier Science & Technology. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=7134465