The original article can be found on kalebujordan.com
Hi guys,
In this article, you're going to learn about text classification using TensorFlow, a popular Python framework for machine learning, in just a few lines of code.
What is text classification?
Text classification is a subfield of natural language processing that focuses on assigning a piece of text to one of several predefined groups based on its content, for instance classifying news articles as sports, business, music, and so on.
What will you learn?
In this tutorial, we briefly cover how to perform text classification using TensorFlow. You're going to learn text processing concepts such as word embedding and how to build a neural network with an embedding layer.
You will learn these concepts by building a simple model that classifies short reviews as negative or positive, based on the data we use to train it.
What do you need?
To follow this tutorial, you need the following Python libraries installed on your machine: TensorFlow, NumPy, and Matplotlib, plus Jupyter Notebook for running the code.
Installation
There are two approaches you can follow to set up an environment for machine learning and data science projects.
- Installing Anaconda
- Installing independently using pip
Installing Anaconda
If it's your first time hearing about Anaconda, it is a toolkit that equips you to work with thousands of open-source packages and libraries. It saves you the time of installing each library independently and of handling dependency issues.
All you need to do is go to the official website at Anaconda.com and follow the guide to download and install it for the operating system you're using.
Installing it also installs many commonly used packages for machine learning and data science, such as NumPy, pandas, Matplotlib, scikit-learn, Jupyter Notebook, and many others.
Almost there
Now that Anaconda and its bundled packages are installed, it's time to install TensorFlow. Anaconda comes with its own package manager, known as conda.
Let's use conda to create an environment named tf with TensorFlow installed, and then activate it:
conda create -n tf tensorflow
conda activate tf
Installing independently using pip
If you prefer to handle every detail yourself, you can also install all the required Python libraries with pip, as shown below:
pip install tensorflow
pip install numpy
pip install matplotlib
pip install jupyter notebook
Now that everything is installed, let's start building our classification model.
Note:
The TensorFlow version used while preparing this tutorial is TensorFlow 2.0, which comes with Keras already integrated. I therefore recommend using it or a newer version to avoid bugs.
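Before moving on, it can help to confirm that the libraries are actually visible to the Python interpreter you are using. The snippet below is only a small sketch using the standard library; the helper name check_packages is made up for this example, and it assumes the packages were installed under their usual names (tensorflow, numpy, matplotlib).

```python
import importlib.util
from importlib import metadata

# Packages this tutorial imports (assumed to use their usual install names).
REQUIRED = ("tensorflow", "numpy", "matplotlib")

def check_packages(names=REQUIRED):
    """Return {name: version string, or None if the package cannot be imported}."""
    report = {}
    for name in names:
        if importlib.util.find_spec(name) is None:
            report[name] = None  # not importable from this interpreter
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but installed under a different distribution name.
            report[name] = "unknown"
    return report

for name, version in check_packages().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If TensorFlow shows up as NOT INSTALLED, the most likely cause is that the notebook is running in a different environment from the one you installed into.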
Let's get started
For convenience, we usually train machine learning models in a Jupyter notebook, so I would like you to use one too. In this article, each chunk of code I show is equivalent to a single cell in a Jupyter notebook.
Starting a Jupyter notebook
Starting a Jupyter notebook is simple and straightforward: type jupyter notebook in your terminal and it will automatically open a notebook in your default browser.
Importing all required libraries
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
Create an array of sample text data (features) and labels
The list below holds the features for training our model: 4 positive and 4 negative short sentences. Their respective labels are 1 for positive and 0 for negative.
data_x = [
'good', 'well done', 'nice', 'Excellent',
'Bad', 'OOps I hate it deadly', 'embarrassing', 'A piece of shit'
]
label_x = np.array([1,1,1,1, 0,0,0,0])
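Use one_hot to convert the text features to numbers
One-hot encoding is a process by which categorical variables are converted into a form that ML algorithms can work with. Note that, despite its name, the Keras one_hot function does not build one-hot vectors: it hashes each word to an integer between 1 and n - 1 (here n = 50), so each sentence becomes a list of integers. Because the mapping is a hash, two different words can occasionally share an index, and the exact integers you get may differ from the ones shown here.
Use the code below to encode the text features above into numerical values.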
one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]
print(one_hot_x)
Output:
[[21], [9, 34], [24], [20], [28], [41, 26, 9, 17, 26], [36], [9, 41]]
As we can see, encoding our text data has produced sequences of different lengths, because the sentences contain different numbers of words.
A machine learning model needs inputs of the same length. Therefore, we have to process the sequences again so that they all have an identical length.
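To make the encoding step less of a black box, here is a rough pure-Python sketch of the hashing idea behind one_hot. It is not the Keras implementation: the function name hash_encode is made up for this example, the tokenizer is simplified to lowercasing and splitting on whitespace, and it uses MD5 so the result is the same on every run (Keras uses Python's built-in hash by default, as far as I know).

```python
import hashlib

def hash_encode(text, n):
    """Map each word of `text` to an integer in [1, n - 1] by hashing it."""
    words = text.lower().split()  # simplified tokenizer (assumption)
    indices = []
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free for padding later.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

sentences = ["good", "well done", "OOps I hate it deadly"]
encoded = [hash_encode(s, 50) for s in sentences]
print(encoded)  # one index per word, so the inner lists differ in length
```

With only 49 possible indices, collisions between different words become likely as the vocabulary grows, which is one reason to pick a larger n for real datasets.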
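Apply padding to the features array and restrict its length to 4
You can change the length of each padded sequence with the maxlen parameter. A good value for maxlen depends on how long most of the texts in your training data are. Sequences shorter than maxlen are filled with zeros (at the end, because of padding='post'), and sequences longer than maxlen are truncated, by default from the front.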
padded_x = tf.keras.preprocessing.sequence.pad_sequences(one_hot_x, maxlen=4, padding = 'post')
print(padded_x)
Output:
array([[21,  0,  0,  0],
       [ 9, 34,  0,  0],
       [24,  0,  0,  0],
       [20,  0,  0,  0],
       [28,  0,  0,  0],
       [26,  9, 17, 26],
       [36,  0,  0,  0],
       [ 9, 41,  0,  0]], dtype=int32)
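The padding step can be sketched in a few lines of plain Python. This is only an illustration of the behaviour used above (zeros added at the end, extra items dropped from the front), not the Keras code; pad_post is a made-up helper name and it assumes maxlen is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate from the front when too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                  # too long: keep the last `maxlen` items
        seq = seq + [value] * (maxlen - len(seq))  # too short: fill the end with zeros
        padded.append(seq)
    return padded

# The encoded sequences printed earlier in this tutorial.
one_hot_x = [[21], [9, 34], [24], [20], [28], [41, 26, 9, 17, 26], [36], [9, 41]]

padded = pad_post(one_hot_x, maxlen=4)
for row in padded:
    print(row)
```

Notice that the five-word sequence loses its first index (41), which matches the row [26, 9, 17, 26] in the output above.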
Now that we have processed the training data, let's create a Sequential model to fit it.
Let's build a Sequential model for our classification
model = tf.keras.models.Sequential()
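Now let's add an Embedding layer to receive the processed text features. Its arguments are the vocabulary size (50, matching the value passed to one_hot), the size of each word vector (8), and the input length (4, matching maxlen).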
model.add(tf.keras.layers.Embedding(50, 8, input_length=4))
Add a Flatten layer to flatten the features array
model.add(tf.keras.layers.Flatten())
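If the Embedding and Flatten layers feel abstract, the NumPy sketch below mimics what they do to the data's shape. The table here is filled with random numbers as a stand-in for the weights the real layer learns, so treat it as an illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 50, 8

# Stand-in for the Embedding layer's weights: one row of 8 numbers per word index.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# Two padded sentences, as produced by the padding step.
batch = np.array([[21, 0, 0, 0],
                  [9, 34, 0, 0]])

embedded = embedding_table[batch]             # look up a row for every index
flattened = embedded.reshape(len(batch), -1)  # join the 4 word vectors per sentence

print(embedded.shape)   # (2, 4, 8)
print(flattened.shape)  # (2, 32)
```

Each sentence therefore reaches the next layer as a single vector of 4 x 8 = 32 numbers.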
Finally, let's add a Dense layer with a single unit and a sigmoid activation function, which outputs a value between 0 and 1 that we read as the probability that the text is positive
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
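As a rough picture of what this last layer computes, here is a NumPy sketch of a single sigmoid unit applied to flattened features. The weights and inputs are random placeholders (the real layer learns its weights during training), and dense_sigmoid is a name made up for this example.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_sigmoid(features, weights, bias):
    """One output unit: weighted sum of the inputs, plus a bias, then sigmoid."""
    return sigmoid(features @ weights + bias)

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 32))       # two flattened sentences (placeholder values)
weights = rng.normal(size=(32, 1)) * 0.1  # placeholder for the learned weights
bias = np.zeros(1)

probabilities = dense_sigmoid(features, weights, bias)
print(probabilities.shape)  # (2, 1): one score per sentence
```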
Compile the model and check its summary
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
Output:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 4, 8)              400
_________________________________________________________________
flatten (Flatten)            (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 1)                 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
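The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the layer sizes used in this tutorial.

```python
vocab_size, embed_dim, seq_len = 50, 8, 4

# Embedding: one vector of 8 weights for each of the 50 possible indices.
embedding_params = vocab_size * embed_dim

# Flatten has no weights; it only reshapes (4, 8) into 32 values.
flatten_width = seq_len * embed_dim

# Dense(1): one weight per input value plus one bias.
dense_params = flatten_width * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)  # 400 33 433
```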
Now let's fit the model for 1000 epochs and visualize the learning process
history = model.fit(padded_x, label_x, epochs=1000,
                    batch_size=2, verbose=0)
plt.plot(history.history['loss'])
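The curve plotted here is the binary cross-entropy loss chosen in the compile step. As a sketch of what that number measures, the NumPy function below computes it for a handful of made-up predictions; it follows the standard formula, and the clipping value eps is my own choice to avoid taking log(0).

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1 - y)*log(1 - p)] over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

labels = [1, 1, 0, 0]
confident = binary_crossentropy(labels, [0.9, 0.8, 0.1, 0.2])  # mostly right
unsure = binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5])     # always guessing 0.5

print(confident < unsure)  # True: better predictions give a lower loss
```

A falling curve in the plot therefore means the model's predicted probabilities are moving closer to the 0 and 1 labels.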
Testing the model
Let's create a simple function to classify new text using the model we have just created. It won't be very smart, since our training data is tiny.
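def predict(word):
    # Encode and pad the input the same way as the training data
    one_hot_word = [tf.keras.preprocessing.text.one_hot(word, 50)]
    pad_word = tf.keras.preprocessing.sequence.pad_sequences(one_hot_word, maxlen=4, padding='post')
    # model.predict returns an array of shape (1, 1) holding the sigmoid output
    result = model.predict(pad_word)
    # Cut-off of 0.1 as in the original; 0.5 is the more common choice for a sigmoid output
    if result[0][0] > 0.1:
        print('you look positive')
    else:
        print('damn you\'re negative')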
Let's test the predict function with a few different inputs
>>> predict('this tutorial is cool')
you look positive
>>> predict('This tutorial is bad as me ')
damn you're negative
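The cut-off used inside predict decides which scores count as positive, so it is worth seeing how much it matters. The sketch below uses a made-up helper, to_label, and made-up scores rather than real model output.

```python
def to_label(probability, threshold=0.5):
    """Turn a sigmoid score into a label using the given cut-off."""
    return "positive" if probability > threshold else "negative"

scores = [0.05, 0.30, 0.70]  # hypothetical model outputs

print([to_label(s, threshold=0.1) for s in scores])  # ['negative', 'positive', 'positive']
print([to_label(s, threshold=0.5) for s in scores])  # ['negative', 'negative', 'positive']
```

A low cut-off such as 0.1 labels more inputs as positive, while 0.5 treats both classes evenly; which one suits you depends on how costly each kind of mistake is.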
Congratulations, you have successfully trained a text classifier using TensorFlow! To get the Jupyter notebook for this guide, download it here. If you have any comments, suggestions, or difficulties, drop them in the comment box.
I also recommend reading these:
- 3 ways to convert text to speech in Python
- How to convert picture to sound in Python
- Build a Real-time barcode reader in Python
- How to perform Speech Recognition in Python
- How to detect emotion detection from text Python
- Make your own knowledge-based chatbot in Python
- Getting started with image processing using a pillow
- A Quick guide to twitter sentiment analysis using python
- How to detect Edges in a picture using OpenCV Canny algorithm