How to perform text classification using TensorFlow in Python

The original article can be found on kalebujordan.com

Hi guys,

In this article, you're going to learn how to perform text classification with TensorFlow, a popular Python framework for machine learning, in just a few lines of code.

What is text classification?

Text classification is a branch of natural language processing that focuses on assigning a piece of text to one of several predefined groups based on its content, for instance classifying a news article as sports, business, music, and so on.

What will you learn?

In this tutorial, we will briefly cover how to perform text classification using TensorFlow. You're going to learn text processing concepts such as word embedding, and how to build a neural network with an embedding layer.

You will learn these concepts by building a simple model that classifies text as a positive or negative review, based on the data we use to train it.

What do you need to have?

To follow along with this tutorial, you need a few Python libraries installed on your machine: TensorFlow, NumPy, Matplotlib, and Jupyter Notebook.

Installation

There are two approaches you can follow to set up an environment for machine learning and data science projects:

  • Installing Anaconda
  • Installing independently using pip

Installing Anaconda

If it's your first time hearing about Anaconda, it is a toolkit that equips you to work with thousands of open-source packages and libraries. It saves you the time of installing each library independently and of handling dependency issues yourself.

All you need to do is go to the official website at Anaconda.com and follow the guide to download and install it on your machine, depending on the operating system you're using.

Once installed, it comes with many other packages for machine learning and data science tasks, such as numpy, pandas, matplotlib, scikit-learn, jupyter notebook, and many others.

Almost there

Once Anaconda and its bundled dependencies are installed, it's time to install the TensorFlow library. Anaconda comes with its own package manager, known as conda.

Now let's use conda to install TensorFlow:

conda create -n tf tensorflow

conda activate tf

Installing independently using pip

If you prefer handling every detail yourself, you can also install all the required Python libraries using pip, as shown below:

pip install tensorflow

pip install numpy

pip install matplotlib

pip install jupyter notebook

Now that everything is installed, let's start building our classification model.

Note:

The version of TensorFlow used while preparing this tutorial is TensorFlow 2.0, which comes with Keras already integrated. I therefore recommend using it, or a more recent version, to avoid bugs.
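If you are not sure which versions you ended up with, a quick check like the one below can help. It is a minimal sketch that uses only Python's standard library; the helper name installed_version is made up for this illustration, and the package names are the ones installed above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages used in this tutorial
for name in ("tensorflow", "numpy", "matplotlib", "notebook"):
    print(name, "->", installed_version(name) or "not installed")
```

A package installed under a different distribution name may show up as "not installed" here even though it can be imported, so treat the result as a hint rather than a verdict.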

Let's get started

For convenience, we usually use a Jupyter notebook when training machine learning models, so I would like you to use one too. In this article, I will be showing you individual chunks of code, each equivalent to a single cell in a Jupyter notebook.

Starting a Jupyter notebook

Starting a Jupyter notebook is simple and straightforward: just type jupyter notebook in your terminal, and it will automatically open a notebook in your default browser.

Importing all required libraries

import numpy as np

import tensorflow as tf

import matplotlib.pyplot as plt

Create an array of sample textual data (features) and labels

The array below acts as the features for training our model. It consists of 4 positive and 4 negative short sentences, together with their respective labels, where 1 stands for positive and 0 for negative.

data_x = [

 'good',  'well done', 'nice', 'Excellent',

 'Bad', 'OOps I hate it deadly', 'embrassing', 'A piece of shit'

]

label_x = np.array([1,1,1,1, 0,0,0,0])

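Use one-hot encoding to convert textual features to numerical values

One-hot encoding is a process by which categorical variables are converted into a numerical form that can be provided to ML algorithms to do a better job in prediction. Note that, despite its name, the Keras one_hot function used here does not produce one-hot vectors: it hashes each word to an integer index between 1 and n - 1, where n is the second argument (50 below).

Use the code below to encode the above textual features into numerical values.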

one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]

print(one_hot_x)

[[21], [9, 34], [24], [20], [28], [41, 26, 9, 17, 26], [36], [9, 41]]

As we can see, applying one-hot encoding to our textual data has resulted in arrays of different lengths. The exact integers may differ on your machine, since by default one_hot relies on Python's built-in hash, which is not stable across runs.

The arrays need to have the same length to be fed into a machine learning model, so we have to process them again to form arrays of identical length.
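To make the encoding step less of a black box, the idea behind it can be sketched in plain Python. This is a simplified illustration, not Keras's actual implementation: the function name hash_encode is hypothetical, the tokenizer here only lowercases and splits on whitespace, and md5 is swapped in for Python's built-in hash so that the result is the same on every run.

```python
import hashlib


def hash_encode(text, n):
    """Map each word of `text` to an integer index in the range 1..n-1."""
    indices = []
    for word in text.lower().split():
        # md5 gives a hash that is stable across runs (an assumption of this
        # sketch; Keras's one_hot uses Python's built-in hash by default)
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        # Index 0 is left free so it can be used later for padding
        indices.append(digest % (n - 1) + 1)
    return indices


encoded = [hash_encode(text, 50) for text in ["good", "well done", "Bad"]]
print(encoded)  # three lists holding 1, 2 and 1 indices respectively
```

Because the vocabulary size n is small, two different words can land on the same index; picking a larger n makes such collisions less likely.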

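Apply padding to the features array and restrict its length to 4

You can change the length of the individual arrays through the maxlen parameter. The right value for maxlen depends on where the lengths of most of the texts in your training data lie. Sequences shorter than maxlen are padded with zeros (at the end, because of padding='post'), while longer ones are truncated, by default from the front.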

padded_x = tf.keras.preprocessing.sequence.pad_sequences(one_hot_x, maxlen=4, padding = 'post')

print(padded_x)

Output:

array([[21,  0,  0,  0],
       [ 9, 34,  0,  0],
       [24,  0,  0,  0],
       [20,  0,  0,  0],
       [28,  0,  0,  0],
       [26,  9, 17, 26],
       [36,  0,  0,  0],
       [ 9, 41,  0,  0]], dtype=int32)
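The padding step can also be sketched in plain Python to show what happens to each sequence. This is a simplified stand-in for pad_sequences under the settings used above (zeros appended at the end, overly long sequences cut from the front); pad_post and suggest_maxlen are hypothetical helper names, and the coverage rule is just one possible way to pick maxlen.

```python
import math


def suggest_maxlen(sequences, coverage=0.9):
    """Smallest length that fits at least `coverage` of the sequences untruncated."""
    lengths = sorted(len(seq) for seq in sequences)
    position = max(math.ceil(coverage * len(lengths)) - 1, 0)
    return lengths[position]


def pad_post(sequences, maxlen):
    """Pad with zeros at the end; keep only the last `maxlen` items if too long.

    Assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        kept = list(seq[-maxlen:])            # truncate from the front
        kept += [0] * (maxlen - len(kept))    # pad at the end
        padded.append(kept)
    return padded


# The encoded sequences printed earlier in this article
one_hot_x = [[21], [9, 34], [24], [20], [28], [41, 26, 9, 17, 26], [36], [9, 41]]

print(suggest_maxlen(one_hot_x, coverage=0.75))  # 2
print(pad_post(one_hot_x, maxlen=4)[5])          # [26, 9, 17, 26]
```

A larger maxlen keeps more of the long texts at the cost of more padding in the short ones.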

Now that the training data has been processed, let's create a Sequential model to fit it.

Let's build a Sequential model for our classification

model = tf.keras.models.Sequential()

Now let's add an Embedding layer to receive the processed textual features

model.add(tf.keras.layers.Embedding(50, 8, input_length=4))

Add a Flatten layer to flatten the embedded features into a single vector

model.add(tf.keras.layers.Flatten())

Finally, let's add a Dense layer with a sigmoid activation function, which outputs a single value between 0 and 1 that we can read as the probability of a positive review

model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
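To see what these three layers compute together, here is a rough forward pass written in plain Python. It is a sketch with randomly initialised weights rather than the trained Keras model: the names embedding_table, dense_weights, and forward are hypothetical, and the sizes (50 indices, 8 dimensions, length 4) mirror the layers above.

```python
import math
import random

random.seed(0)  # reproducible random weights for the illustration

VOCAB_SIZE, EMBED_DIM, SEQ_LEN = 50, 8, 4

# Embedding: one learnable 8-dimensional vector per word index
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]
# Dense: one weight per flattened feature, plus a bias
dense_weights = [random.uniform(-0.5, 0.5) for _ in range(SEQ_LEN * EMBED_DIM)]
dense_bias = 0.0


def forward(padded_sequence):
    """Embedding lookup -> flatten -> dense layer with sigmoid."""
    embedded = [embedding_table[index] for index in padded_sequence]  # 4 x 8
    flat = [value for vector in embedded for value in vector]         # 32 values
    z = sum(w * x for w, x in zip(dense_weights, flat)) + dense_bias
    return 1.0 / (1.0 + math.exp(-z))                                 # in (0, 1)


print(forward([9, 34, 0, 0]))  # an untrained score strictly between 0 and 1
```

Training adjusts the numbers in embedding_table and dense_weights so that this output moves towards 1 for positive sentences and towards 0 for negative ones.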

Compile the model and check its summary

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

Output:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 4, 8)              400
_________________________________________________________________
flatten (Flatten)            (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 1)                 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
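The parameter counts in the summary can be checked by hand. The short calculation below is only an arithmetic sketch using the layer sizes from this model; the variable names are chosen for illustration.

```python
vocab_size, embed_dim, seq_len, units = 50, 8, 4, 1

embedding_params = vocab_size * embed_dim      # one 8-value vector per index
flatten_size = seq_len * embed_dim             # Flatten has no parameters
dense_params = flatten_size * units + units    # weights plus one bias per unit

print(embedding_params, flatten_size, dense_params)  # 400 32 33
print(embedding_params + dense_params)               # 433
```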

Now let's fit the model for 1000 epochs and visualize the learning process

history = model.fit(padded_x, label_x, epochs=1000,
                    batch_size=2, verbose=0)

plt.plot(history.history['loss'])
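The curve plotted here is the binary cross-entropy loss chosen when compiling the model. As a rough sketch of what that number measures, the function below computes it for a handful of labels and predicted probabilities; binary_crossentropy is a hand-written helper for illustration (with a small clipping constant assumed to avoid log(0)), not the Keras implementation.

```python
import math


def binary_crossentropy(labels, probabilities, eps=1e-7):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


confident_and_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_and_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_and_right < confident_and_wrong)  # True
```

A falling loss curve therefore means the model's probabilities are moving closer to the true labels.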

Testing the model

Let's create a simple function to predict the sentiment of new text using the model we have just created. It won't be very smart, since our training data is really small.

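def predict(word):
    # Encode and pad the input exactly as the training data was prepared
    one_hot_word = [tf.keras.preprocessing.text.one_hot(word, 50)]
    pad_word = tf.keras.preprocessing.sequence.pad_sequences(one_hot_word, maxlen=4, padding='post')
    # The sigmoid output is a value between 0 and 1
    result = model.predict(pad_word)
    # Cutoff chosen by the author; 0.5 is the more conventional threshold
    if result[0][0] > 0.1:
        print('you look positive')
    else:
        print("damn you're negative")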

Let's test the predict function with different inputs

>>> predict('this tutorial is cool')
you look positive

>>> predict('This tutorial is bad as me ')
damn you're negative
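The label printed by predict depends on the cutoff applied to the sigmoid output. The sketch below isolates that decision so different cutoffs can be compared; to_label is a hypothetical helper, and the scores are made-up values used only to illustrate the effect.

```python
def to_label(probability, threshold=0.5):
    """Turn a sigmoid output into a class label using a cutoff."""
    return "positive" if probability > threshold else "negative"


# Made-up scores, only to show how the cutoff changes the outcome
for score in (0.05, 0.30, 0.80):
    print(score, to_label(score, threshold=0.1), to_label(score, threshold=0.5))
```

A lower cutoff such as 0.1 labels more inputs as positive, while 0.5 treats both classes evenly.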

Congratulations, you have successfully trained a text classifier using TensorFlow. To get the Jupyter notebook guide, download here. Otherwise, in case of any comment, suggestion, or difficulty, drop it in the comment box.
