Working with Natural Language Processing on Restaurant Review Data

Natural Language Processing

Computers are great at working with structured data like spreadsheets and database tables. But humans usually communicate in words, not in tables. That’s unfortunate for computers. A lot of information in the world is unstructured — raw text in English or another human language. How can we get a computer to understand the unstructured text and extract data from it?

The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics. In simple terms, NLP is the sub-field of AI focused on enabling computers to analyze, understand, and derive meaning from human language. Let’s see how NLP lets a computer work with unstructured text and extract data from it.

Does the computer understand the English language?

As long as computers have been around, programmers have been trying to write programs that understand languages like English. The reason is pretty obvious — humans have been writing things down for thousands of years and it would be really helpful if a computer could read and understand all that data.

Computers don’t yet truly understand English in the way that humans do — but they can already do a lot! In certain limited areas, what you can do with NLP already seems like magic. You might be able to save a lot of time by applying NLP techniques to your own projects.

And even better, the latest advances in NLP are easily accessible through open-source Python libraries like spaCy, textacy and neuralcoref. What you can do with just a few lines of Python is amazing.

Here is a small hands-on project with Natural Language Processing:

We have a dataset of restaurant reviews, and the task is to predict the result label of each review from its text using Natural Language Processing.

Let’s see whether the computer can understand English or not.

The Link for Dataset is:

Importing All Essential Libraries
1. A (software) library is a collection of files (called modules) that contains functions for use by other programs.
2. It may also contain data values (e.g., numerical constants) and other things.
3. A library’s contents are supposed to be related, but there’s no way to enforce that.
4. The Python standard library is an extensive suite of modules that comes with Python itself.
5. Many additional libraries are available from PyPI (the Python Package Index).
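
As a small illustration of the points above, the sketch below uses only modules that ship with the Python standard library: `math` supplies a numerical constant, and `collections` supplies a ready-made class. The sample sentence is made up for this example.

```python
import math                      # standard-library module with functions and constants
from collections import Counter  # import a single name from a module

# A library can expose data values such as numerical constants.
circle_area = math.pi * 2 ** 2

# It can also expose functions and classes for other programs to use.
word_counts = Counter("the food was good and the service was good".split())

print(round(circle_area, 2))
print(word_counts["good"])
```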

Reading the Dataset by converting it into a readable Dataframe using Pandas

pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For the source file of pandas, you can go to the
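
To make the loading step more concrete, here is a minimal sketch that uses only the standard library's `csv` module on a made-up two-row sample. It assumes a tab-separated layout with a `Review` text column and a label column; the name `Liked` is an assumption for illustration, since the article's code only names `Review`. The `quoting = 3` argument in the article's `pd.read_csv` call is the integer value of `csv.QUOTE_NONE`, which tells the parser to treat quote characters as ordinary text.

```python
import csv
import io

# Hypothetical sample in a tab-separated layout (column names are assumptions).
sample_tsv = (
    "Review\tLiked\n"
    '"Wow" was all I could say.\t1\n'
    "Crust is not good.\t0\n"
)

# csv.QUOTE_NONE has the integer value 3, the same value passed as `quoting = 3`.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# The quotation marks around "Wow" survive because quoting is switched off.
for row in rows:
    print(row["Liked"], row["Review"])
```

Reviews often contain stray quotation marks, so switching quoting off keeps the parser from misreading them as field delimiters.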

Working on Data Visualizations task

Cleaning the texts
You cannot go straight from raw text to fitting a machine learning or deep learning model. You must clean your text first, which means splitting it into words and handling punctuation and case. In fact, there is a whole suite of text preparation methods that you may need to use, and the choice of methods really depends on your natural language processing task. After actually getting a hold of your text data, the first step in cleaning up text data is to have a strong idea about what you’re trying to achieve, and in that context review your text to see what exactly might help.
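
A minimal sketch of such a cleaning step is shown below. It keeps letters only, lowercases the text, splits it into words, and drops stop words. The tiny `STOP_WORDS` set and the `clean_review` helper are hypothetical and exist only for this illustration; the article's code uses NLTK's English stop-word list (with "not" removed) and also stems each word, which this sketch leaves out.

```python
import re

# Small hand-picked stop-word set for illustration only.
# "not" is deliberately absent because it carries sentiment.
STOP_WORDS = {"the", "is", "was", "a", "an", "and", "this", "i", "it", "to"}

def clean_review(text):
    """Keep letters only, lowercase, split into words, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)

print(clean_review("Wow... The food was NOT good!!"))  # wow food not good
```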

For more study on the topic, you can go on with the following

Creating the Bag of Words model
The bag-of-words model, or BoW for short, is a way of extracting features from text for use in machine learning algorithms. It is simple to understand and implement, and it has seen great success in problems such as language modeling and document classification. The approach is flexible and can be used in many ways for extracting features from documents. A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  1. A vocabulary of known words.
  2. A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document. For more study on this topic, you can move on with the [ which is very detailed on BoW
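
The two ingredients can be sketched in plain Python as follows. This is only an illustration of the idea on three made-up, already-cleaned reviews; the article's code relies on scikit-learn's `CountVectorizer` instead, with `max_features = 1500` limiting the vocabulary to the most frequent terms. The `to_vector` helper is a hypothetical name.

```python
from collections import Counter

# Made-up, already-cleaned reviews.
corpus = ["food good", "food not good", "service slow"]

# 1. A vocabulary of known words (sorted for a stable column order).
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# 2. A measure of the presence of known words: here, raw counts per document.
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in corpus]

print(vocabulary)
for vector in vectors:
    print(vector)

# Word order is discarded: both orderings map to the same vector.
print(to_vector("good food") == to_vector("food good"))
```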

Splitting the dataset into the Training set and Test set

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model. It is a fast and easy procedure to perform, and its results allow you to compare the performance of machine learning algorithms for your predictive modeling problem. Although it is simple to use and interpret, there are times when the procedure should not be used, such as when you have a small dataset, and situations where additional configuration is required, such as when it is used for classification and the dataset is not balanced.
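
For the unbalanced case, one common form of additional configuration is a stratified split, which keeps the class proportions similar in both parts. The sketch below is a plain-Python illustration on hypothetical labels; `stratified_split` is a made-up helper, not a library function. In scikit-learn, `train_test_split` offers a `stratify` parameter for the same purpose.

```python
import random

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, test_indices), keeping class proportions similar."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 80 negative, 20 positive.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=0)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx))  # number of positives in the test part
```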

For more study on this method, you can go on the following

Training the Naive Bayes model on the Training set

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every feature is assumed to be independent of every other feature, given the class.
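
To show how the independence assumption is used in practice, here is a small from-scratch sketch. Note that it implements a multinomial variant with add-one (Laplace) smoothing on word counts, which is a different member of the family from the `GaussianNB` class used in the article's code. The training reviews, labels, and function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect class counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary

def predict(doc, model):
    """Pick the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in doc.split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids log(0) for unseen word/class pairs.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)  # independence: log-likelihoods add up
        scores[label] = score
    return max(scores, key=scores.get)

docs = ["food good", "love place", "food not good", "service slow"]
labels = [1, 1, 0, 0]
model = train_naive_bayes(docs, labels)

print(predict("love food", model))
print(predict("slow service not good", model))
```

Because each word contributes its own factor, training only needs simple counts per class, which is why the method stays fast even with a large vocabulary.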
For more reading, you can go on with the blog of

The Probability Distribution of Naive Bayes Classifier using Bayes Theorem of Probability
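
Written out, the rule behind the classifier looks as follows, where $y$ is the class label and $x_1, \dots, x_n$ are the features (here, word counts). The notation is an assumption chosen for this sketch and is not taken from the original figure.

```latex
\begin{align}
P(y \mid x_1, \dots, x_n) &= \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)} \\
P(x_1, \dots, x_n \mid y) &= \prod_{i=1}^{n} P(x_i \mid y) \quad \text{(naive independence assumption)} \\
\hat{y} &= \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
\end{align}
```

The denominator does not depend on $y$, so it can be dropped when only the most probable class is needed.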

Predicting the Test set results
Once you choose and fit a final machine learning model in scikit-learn, you can use it to make predictions on new data instances.
There is some confusion amongst beginners about how exactly to do this. I often see questions such as:
How do I make predictions with my model in scikit-learn?
You can get your detailed answer in the

Making the Confusion Matrix

A confusion matrix is a technique for summarizing the performance of a classification algorithm. Classification accuracy alone can be misleading if you have an unequal number of observations in each class or if you have more than two classes in your dataset. Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making.
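
The point about misleading accuracy can be seen in a small plain-Python sketch with hypothetical labels. The `confusion_matrix_2x2` helper is a made-up name; it follows the same layout as scikit-learn's `confusion_matrix`, with true classes as rows and predicted classes as columns.

```python
from collections import Counter

def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true classes (0, 1); columns are predicted classes (0, 1)."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(0, 0)], pairs[(0, 1)]],
            [pairs[(1, 0)], pairs[(1, 1)]]]

# Hypothetical imbalanced labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that always predicts the majority class.
y_pred = [0] * 10

matrix = confusion_matrix_2x2(y_true, y_pred)
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)

print(matrix)    # [[8, 0], [2, 0]]
print(accuracy)  # 0.8
```

An accuracy of 0.8 looks respectable, yet the second row of the matrix shows that every positive review was missed.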

For More on the Confusion Matrix, you can go on with the

# Import the libraries
import os
import re
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Read the tab-separated file; quoting = 3 (csv.QUOTE_NONE) treats quotes as plain text
dataset = pd.read_csv('/kaggle/input/restaurant-reviews/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
dataset.head()
dataset.tail()
dataset.describe()
dataset.corr(numeric_only = True)  # skip the text column (needs pandas 1.5 or newer)

# Clean the texts
nltk.download('stopwords')
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')  # keep "not" because it carries sentiment
all_stopwords = set(all_stopwords)
corpus = []
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    review = ' '.join(review)
    corpus.append(review)
print(corpus)

# Create the Bag of Words model
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values  # the label is the last column

# Split the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Train the Naive Bayes model on the Training set
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predict the Test set results and print them next to the true labels
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# Make the Confusion Matrix and compute the accuracy
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

The Code is uploaded here on

My Kaggle Profile —

My Github Profile — 1920

If you like the blog, please give it a clap, and if you like the Kaggle work, please upvote it.

It will motivate me to do more work.

Bio: ( (Hons), 2016–2020) Electrical & Electronics Engineering, KIET Group of Institutions Ghaziabad, Uttar Pradesh, India. |Project Manager — | President — |Microsoft Learn Student Ambassador — Beta| Lifetime member at InSc | Student Member R10 IEEE | Student Member PES/PELS | Voice: +91 8445726929 | E-mail: bhushansharmaarpit@gmail.com

ML enthusiastic | Love Exploring | Electrical Engineering | Tech-Geek
