End-To-End Text Classification with BERTurk

Apdullah Yayik
Mar 15, 2021

A Brief Introduction

Transformer-based machine learning models have led to substantial gains on several complex natural language processing (NLP) tasks. In this article, I will give a brief theoretical introduction to the architectural design of transformer models and to text classification, a popular downstream task. Then I will show how to fine-tune the pre-trained (a step nowadays often described as self-supervised) BERTurk model for text classification using the transformers library provided by Hugging Face.

Text classification, also known as text categorization, is a classical downstream task in natural language processing (NLP) that aims to assign labels or tags to textual segments such as sentences, questions, paragraphs, and documents. It has a broad range of applications, including question answering, spam detection, sentiment analysis, news categorization, user intent classification, and content moderation. Text data can be obtained from various sources, including web data, emails, chats, social media, tickets, insurance claims, user reviews, and questions and answers from customer services, to name a few. Text is a remarkably rich source of information, but obtaining insights from it can be challenging and time-consuming due to its unstructured and ambiguous nature.

Effective text classification requires discovering nonlinear relationships between words; it is prone to error when conventional representations such as TF-IDF vectors are employed.

What Are Transformer Models?

I will not provide details about the architecture of transformer models in this article (I promise a dedicated article shortly). But it is beneficial to understand some of the difficulties in computational linguistics. There are two primary, interconnected strategies: (1) word embeddings and (2) language models. Transformer models are dedicated to creating language models. As for word embeddings, they can be obtained via neural-based (CBOW, skip-gram) or probabilistic (WordPiece) approaches; this side is also not the focus of this article.

Have LSTM Models Become Obsolete?

Most old-fashioned techniques for language modeling rely on recurrent neural networks (RNNs). RNNs have a fundamental problem, the vanishing gradient, and therefore fail to model longer contextual dependencies. They were essentially displaced by long short-term memory networks (LSTMs), a special kind of RNN that can capture longer contextual dependencies. But LSTMs process sequences only unidirectionally, so they were in turn replaced by bidirectional LSTMs (Bi-LSTMs), which process sequences not only left to right but also right to left. Some very promising models, e.g., ELMo and ULMFiT, rely on LSTMs, and such models are still recognized by the modern NLP community.

However, Bi-LSTMs are difficult to train in parallel because of their sequential architecture. With attention layers, transformers have solved this issue and are much more cost-effective to train in parallel. A model can be pre-trained on a large corpus without supervision and then fine-tuned on any downstream task.

State-of-the-Art Transformer Models

Researchers in academia and industry have introduced many variants of transformer-based models over the last 3 years. As of March 2021, the most successful ones are listed below:

GPT (Generative Pre-Training)

BERT (Bidirectional Encoder Representations from Transformers)

ALBERT

GPT-3

XLnet

ELECTRA

GShard

For now, there are only small differences among these models. BERT is regarded as state-of-the-art on several NLP tasks. As a representative example, in this article I will work with the BERTurk base model.

BERT

BERT is a machine learning model that utilizes the pre-training of a language model. BERT comes in two sizes: base and large. In this article, the BERT-base model is used; its full details are out of the scope of this article. The model consists of 12 transformer blocks, 12 attention heads, and a hidden size of 768, and it has almost 110M pre-trained weights.
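For reference, the default configuration object in the transformers library corresponds to exactly these BERT-base numbers, so a quick sketch to verify them looks like this:

from transformers import BertConfig

config = BertConfig()  # default values correspond to BERT-base
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)
# 12 12 768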

Pre-training of the BERT model is carried out by optimizing two loss functions: (1) randomly select 15% of the tokens in the content, mask them, and predict the original tokens from their context (this task is called Masked Language Modeling), and (2) predict whether two sentences follow each other in the textual content (Next Sentence Prediction). The architectural design of creating word embeddings in the BERT model is shown in the figure below.

BERT input representation, by Devlin et al. (Figure 2 at https://arxiv.org/pdf/1810.04805.pdf)

As seen in the figure, the input representation is calculated by summing (1) segment embeddings, which indicate which sentence a token belongs to, (2) position embeddings, which keep the order of the subwords, and (3) token embeddings from the WordPiece model, which takes subword probabilities into account.

There are two special tokens: [CLS] and [SEP]. The [CLS] token is added at the beginning of each sequence, while the [SEP] token is added at the end.
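As a small illustration (using the publicly shared BERTurk checkpoint dbmdz/bert-base-turkish-cased, which is also used later in this article), the tokenizer inserts these special tokens automatically:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
encoding = tokenizer("merhaba dünya", add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'merhaba', 'dünya', '[SEP]'] (exact subwords depend on the vocabulary)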

In contrast to pre-training, during fine-tuning all weights are updated end to end at a relatively low cost.

BERTurk

BERTurk is a community-driven BERT model for the Turkish language. It is pre-trained on a large Turkish corpus and outperforms the official BERT model on Turkish NER tagging. In this article, for illustration, I will use the BERTurk model.

Let’s start coding!

Install Dependencies
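A minimal sketch of the installation cell (package versions are not pinned here; the original Colab may have used specific ones):

!pip install transformers torch pandas scikit-learn nltk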

Import Libraries
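The imports below are a sketch of what the rest of this walkthrough assumes:

import torch
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.preprocessing import LabelEncoder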

Check GPU

You can see the GPU details with the “nvidia-smi” command in a terminal.
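The same information is also available from Python; a sketch:

print(torch.cuda.is_available())          # True when a GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the GPU provided by the runtime
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")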

Read Data

The content of the train.data and test.data files should look like the following:

category, text
politics, siyaset,ön seçim vaadi mhp nin olağan büyük kurultayı
health, sıra dışı ağrı hastalık habercisi normal ağrı kesicilerin
justice, suçluluğu sabit oluncaya suçlu sayılamaz
religion, isa mesihin döneceğine gelişe inançta birleşmişlerdi
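A sketch for reading these files with pandas; skipinitialspace handles the space after the comma in the sample above, and the exact parsing arguments in the original notebook may differ:

train_df = pd.read_csv("train.data", skipinitialspace=True)   # columns: category, text
test_df = pd.read_csv("test.data", skipinitialspace=True)
print(train_df.head())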

Preprocessing

Stop-word, punctuation, and number removal, plus case-folding to lower case. This code can be used if preprocessing is required in your case.
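A minimal preprocessing sketch, assuming NLTK's Turkish stop-word list is suitable for your data:

import re
import nltk

nltk.download("stopwords")
stop_words = set(nltk.corpus.stopwords.words("turkish"))

def preprocess(text):
    text = text.lower()                      # case-folding
    text = re.sub(r"[^\w\s]", " ", text)     # remove punctuation
    text = re.sub(r"\d+", " ", text)         # remove numbers
    return " ".join(t for t in text.split() if t not in stop_words)

train_df["text"] = train_df["text"].apply(preprocess)
test_df["text"] = test_df["text"].apply(preprocess)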

Encode Labels

Convert the labels to numeric values.
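A sketch using scikit-learn's LabelEncoder:

label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train_df["category"])
test_labels = label_encoder.transform(test_df["category"])
num_labels = len(label_encoder.classes_)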

BERTurk Tokenization
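Load the BERTurk tokenizer from the Hugging Face hub; dbmdz/bert-base-turkish-cased is the publicly shared BERTurk checkpoint:

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")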

Encode Data

Input tokenization as described in Figure 2 by Devlin et al.: WordPiece mapping and insertion of the special tokens.
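A sketch of an encoding helper; max_length=128 is an assumed value rather than the one used in the original notebook:

def encode(texts, max_length=128):
    enc = tokenizer(
        list(texts),
        add_special_tokens=True,     # insert [CLS] and [SEP]
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    return enc["input_ids"], enc["attention_mask"]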

Apply the encoding to train and test sets.
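Continuing the sketch above:

train_input_ids, train_attention_mask = encode(train_df["text"])
test_input_ids, test_attention_mask = encode(test_df["text"])

train_dataset = TensorDataset(train_input_ids, train_attention_mask,
                              torch.tensor(train_labels))
test_dataset = TensorDataset(test_input_ids, test_attention_mask,
                             torch.tensor(test_labels))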

Build Model

Set Learning Parameters
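The values below are illustrative assumptions, not necessarily the hyperparameters of the original run:

EPOCHS = 4
BATCH_SIZE = 32
LEARNING_RATE = 2e-5

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)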

Create Model
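A sketch of creating the classification model from the same checkpoint; the classification head on top of [CLS] is initialized randomly and learned during fine-tuning:

model = BertForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased",
    num_labels=num_labels,           # number of categories from the label encoder
)
model.to(device)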

Model Architecture

As can be seen, the BERTurk base model has almost 109M parameters (a little less than BERT-base).
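The count can be verified with a one-liner like this:

print(sum(p.numel() for p in model.parameters()))   # total number of weights in the model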

Train Model

Model training takes 6 hours 15 minutes 32 seconds and requires at least 9 GB of memory on an Nvidia Tesla V4 GPU.

During training, the loss decreases and the accuracy increases for both the training and validation sets, which means the model has learned within the scope of the training data. Model files are saved at each epoch; each saved model takes 442 MB of disk space.
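A minimal training-loop sketch under the assumptions above; the optimizer choice and the checkpoint directory name are illustrative, and evaluation on the validation set is omitted for brevity:

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
    model.train()
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(
            input_ids.to(device),
            attention_mask=attention_mask.to(device),
            labels=labels.to(device),
        )
        outputs.loss.backward()      # cross-entropy loss computed by the classification head
        optimizer.step()
    model.save_pretrained(f"berturk-classifier-epoch-{epoch}")   # checkpoint at each epoch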

Model Serving

Import Libraries
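Only a few imports are needed on the serving side; a sketch:

import torch
from transformers import BertTokenizer, BertForSequenceClassification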

Load Model

When the model is loaded, it consumes 1.1 GB of memory.
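A sketch of loading the fine-tuned model; the checkpoint directory name follows the training sketch above:

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
model = BertForSequenceClassification.from_pretrained("berturk-classifier-epoch-3")
model.eval()                         # inference mode, no gradient updates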

BERTurk Tokenization

Given text:

text = """
Botosani son haftaların flaş ekibi… Şampiyonluk adayı Cluj’u da 2-1’le geçerek son 6 maçta 5. galibiyetlerini elde ettiler. Ay başında teknik adam değişikliğine giden Gaz Metan son yıllarda olmadığı kadar iyi sonuçlar aldı bu aralar. Son olarak Chindia’yı 1-0 mağlup ettiler. Formda iki takım karşılaşıyor, beraberlik öne çıksa da karşılıklı gol daha sağlam bir seçim.
"""

Preprocess and Encode the given text
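Reusing the preprocess and encode helpers sketched earlier:

clean_text = preprocess(text)
input_ids, attention_mask = encode([clean_text])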

Send tokens to the model
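A forward pass without gradient tracking:

with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs.logits              # raw, unnormalized scores, one per category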

Add a Softmax Layer
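Softmax turns the logits into probabilities that sum to 1:

probabilities = torch.nn.functional.softmax(logits, dim=1)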

Get the decision that has the highest probability
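Finally, take the most probable class; this assumes the label encoder from training is still available (in a real deployment it would be persisted alongside the model):

predicted_index = torch.argmax(probabilities, dim=1).item()
predicted_label = label_encoder.inverse_transform([predicted_index])[0]
print(predicted_label)               # the predicted category name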

Conclusion

We have built an end-to-end pipeline for applying transformers to the text classification task. These operations can be adapted to any other NLP downstream task with only minor adjustments to the code. The code is shared as a Colab project.
