Working with Natural Language Processing on Restaurant Review Data
Computers are great at working with structured data like spreadsheets and database tables. But humans usually communicate in words, not in tables. That’s unfortunate for computers. A lot of information in the world is unstructured — raw text in English or another human language. How can we get a computer to understand the unstructured text and extract data from it?
The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics, and it gives computers a way to analyze, understand, and derive meaning from human language in a smart and useful way. In short, NLP is the sub-field of AI focused on enabling computers to understand and process human languages. Let's see how NLP works so that a computer can understand unstructured text and extract data from it.
Does the computer understand the English language?
As long as computers have been around, programmers have been trying to write programs that understand languages like English. The reason is pretty obvious — humans have been writing things down for thousands of years and it would be really helpful if a computer could read and understand all that data.
Computers don’t yet truly understand English in the way that humans do — but they can already do a lot! In certain limited areas, what you can do with NLP already seems like magic. You might be able to save a lot of time by applying NLP techniques to your own projects.
And even better, the latest advances in NLP are easily accessible through open-source Python libraries like spaCy, textacy, and neuralcoref. What you can do with just a few lines of Python is amazing.
Here is something new to try with Natural Language Processing: we have a dataset of restaurant reviews, and we have to predict whether each review is positive or negative from the text data using NLP.
Let's see if the computer can understand English or not.
The link for the dataset is: https://www.kaggle.com/nanuprasad/restaurant-reviews
Importing All Essential Libraries
1. A (software) library is a collection of files (called modules) that contains functions for use by other programs.
2. A library may also contain data values (e.g., numerical constants) and other things.
3. A library's contents are supposed to be related, but there's no way to enforce that.
4. The Python standard library is an extensive suite of modules that comes with Python itself.
5. Many additional libraries are available from PyPI (the Python Package Index), as the short sketch below illustrates.
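As a quick illustration (a minimal sketch; pandas stands in here for any PyPI package), importing from the standard library and from a third-party library looks the same once both are available:

import math          # a module from the Python standard library
import pandas as pd  # a third-party library installed from PyPI (pip install pandas)

print(math.pi)         # libraries can expose data values such as numerical constants
print(pd.__version__)  # and functions and attributes for use by other programs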
Reading the Dataset into a DataFrame using pandas
pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For the source code of pandas, you can go to its GitHub repository.
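As a minimal sketch (the full notebook code appears at the end of this post, where the Kaggle input path is used instead of a local filename), loading the tab-separated reviews file into a DataFrame looks like this:

import pandas as pd

# quoting = 3 tells pandas to ignore double quotes inside the reviews
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)
print(dataset.head())  # first few rows: a 'Review' text column and a 'Liked' 0/1 label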
Working on the Data Visualization task
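The notebook imports matplotlib for this step. As a minimal sketch, assuming the label column is named 'Liked' (0/1) as in the Kaggle file, one useful first plot is a bar chart of the class balance:

import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter='\t', quoting=3)

# Count how many reviews are labeled 0 (not liked) vs 1 (liked)
counts = dataset['Liked'].value_counts().sort_index()
plt.bar(['Not liked (0)', 'Liked (1)'], counts.values)
plt.title('Class balance of the restaurant reviews')
plt.ylabel('Number of reviews')
plt.show()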
Cleaning the texts
You cannot go straight from raw text to fitting a machine learning or deep learning model. You must clean your text first, which means splitting it into words and handling punctuation and case. In fact, there is a whole suite of text preparation methods that you may need to use, and the choice of methods really depends on your natural language processing task. Once you actually have your text data in hand, the first step in cleaning it up is to have a strong idea of what you are trying to achieve, and in that context review your text to see what exactly might help.
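To make this concrete, here is a minimal sketch of cleaning a single review the way the full code below does: strip non-letters, lowercase, remove stopwords (keeping 'not', which flips sentiment), and stem each word:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords')
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')  # 'not' carries sentiment, so keep it

review = "The food was NOT good, and the crust is not tasty!"
review = re.sub('[^a-zA-Z]', ' ', review).lower().split()
review = ' '.join(ps.stem(w) for w in review if w not in set(all_stopwords))
print(review)  # 'food not good crust not tasti'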
For more study on this topic, see the following blog post.
Creating the Bag of Words model
The bag-of-words model, or BoW for short, is a way of representing text data when modeling text with machine learning algorithms. It is simple to understand and implement and has seen great success in problems such as language modeling and document classification. The approach is very flexible and can be used in a myriad of ways to extract features from documents. A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of known words.
- A measure of the presence of known words.
It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document. For more study on this topic, you can move on to the Machine Learning Mastery blog, which covers BoW in detail.
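As a toy illustration (a minimal sketch with a made-up two-document corpus), scikit-learn's CountVectorizer builds the vocabulary of known words and counts their occurrences per document:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['food was great', 'service was not great']
cv = CountVectorizer()
X = cv.fit_transform(docs).toarray()
print(cv.get_feature_names_out())  # vocabulary (use get_feature_names() on older scikit-learn)
print(X)                           # one row per document, one column per known word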
Splitting the dataset into the Training set and Test set
The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model. It is a fast and easy procedure to perform, and its results allow you to compare the performance of machine learning algorithms on your predictive modeling problem. Although simple to use and interpret, there are times when the procedure should not be used, such as when you have a small dataset, and situations where additional configuration is required, such as classification on an imbalanced dataset.
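Here is a minimal sketch on toy arrays; the stratify argument (an option not used in this post's code) is the usual extra configuration when the classes are not balanced:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1] * 5)          # binary labels

# Hold out 20% of the data; stratify keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)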
For more study on this method, see the following Machine Learning Mastery blog post.
Training the Naive Bayes model on the Training set
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that all share a common principle: every pair of features is assumed to be independent of each other, given the class.
For more reading, you can go through the Naive Bayes blog post on GeeksforGeeks.
The probability distribution of the Naive Bayes classifier follows from Bayes' theorem of probability.
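In LaTeX notation, for a class y and features x_1, ..., x_n, Bayes' theorem combined with the naive independence assumption gives:

P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}

Since the denominator is constant for a given input, the classifier predicts the class with the highest numerator:

\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)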
Predicting the Test set results
Once you choose and fit a final machine learning model in scikit-learn, you can use it to make predictions on new data instances.
There is some confusion amongst beginners about how exactly to do this. I often see questions such as:
How do I make predictions with my model in scikit-learn?
You can find a detailed answer in the Machine Learning Mastery blog.
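Tying this back to this post's pipeline, here is a minimal sketch of scoring a brand-new review. It assumes cv and classifier are the fitted CountVectorizer and GaussianNB from the full code below; the key point is that new text must go through the same fitted vectorizer (transform, not fit_transform):

# A hypothetical new review; in practice it should be cleaned and stemmed
# exactly like the training corpus before vectorizing
new_review = 'food great servic friendli'
new_X = cv.transform([new_review]).toarray()  # reuse the fitted vocabulary
print(classifier.predict(new_X))              # e.g. [1] means the review is liked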
Making the Confusion Matrix
A confusion matrix is a technique for summarizing the performance of a classification algorithm. Classification accuracy alone can be misleading if you have an unequal number of observations in each class or if you have more than two classes in your dataset. Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making.
For more on the confusion matrix, you can go through the Machine Learning Mastery blog post on the topic.
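As a minimal sketch with made-up labels, scikit-learn's confusion_matrix returns, for a binary problem, a 2x2 array laid out as [[TN, FP], [FN, TP]]:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 2]]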
import os
import re
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import CountVectorizer
# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Read the tab-separated file; quoting = 3 tells pandas to ignore double quotes
dataset = pd.read_csv('/kaggle/input/restaurant-reviews/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
dataset.head()
dataset.tail()
dataset.describe()
dataset.corr()

# Download the NLTK stopword list
nltk.download('stopwords')

# Clean all 1,000 reviews and collect them into a corpus
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')  # 'not' carries sentiment, so keep it
    review = [ps.stem(word) for word in review if not word in set(all_stopwords)]
    review = ' '.join(review)
    corpus.append(review)
print(corpus)

# Create the Bag of Words model from the 1,500 most frequent words
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, -1].values

# Split the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Train the Naive Bayes model on the training set
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predict the test set results and show them next to the true labels
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# Evaluate with the confusion matrix and the accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
The code is uploaded here on Kaggle.
My Kaggle Profile — https://www.kaggle.com/arpit3043
My Github Profile — https://www.github.com/arpit1920
If you like the blog, please give it a clap, and if you like the Kaggle work, please upvote it.
It will give me more enthusiasm to work more.
Bio: Arpit Bhushan Sharma (B.Tech (Hons), 2016–2020) Electrical & Electronics Engineering, KIET Group of Institutions Ghaziabad, Uttar Pradesh, India. |Project Manager — Project4jungle | President — Edusakar |Microsoft Learn Student Ambassador — Beta| Lifetime member at InSc | Student Member R10 IEEE | Student Member PES/PELS | Voice: +91 8445726929 | E-mail: bhushansharmaarpit@gmail.com