Estimates tell us that unstructured data, i.e. data that is not organized in a way that allows easy analysis, will account for 80% of all data in companies (Van der Linden, 2018). We are surrounded by this kind of data, which often comes in textual form: emails, memos, social media posts, articles and endless documents. It would be naive to immediately dismiss such data as useless for analysis techniques such as machine learning models. You might be surprised how much we can do with it.

This article has a different setup than you might be used to from IThappens. It offers Python code to demonstrate how to put what we are discussing into practice. It does not go in depth into the programming language or libraries used, but it provides useful links if you want to learn more, and I invite you to play around with the code.

The dataset used in this article is rather trivial, but depending on how you like your poison, it might be of great interest to some of you. It contains 150,000 wine reviews gathered by a Kaggle user (‘Wine Reviews’, 2017). Every entry contains the name, price, country of origin, textual review and grade of the wine. The goal of the exercise in this article is to classify whether a wine is a top wine (at least 9 out of 10 points), based on the textual review.

It is important to note that there is no single best model, or as they like to say in machine learning: there is no free lunch. Therefore, we have to conduct trial and error until we find a solution that gives satisfactory results.

Load and pre-process data

First, we are going to load the data, look at some examples and prepare the data so that we can train models on it. Collecting, exploring and pre-processing your data can take up to 80% of a whole data science project. Luckily, the data is available in a CSV file and only a little pre-processing is necessary.

The data will be loaded into a pandas data frame, which allows easy exploration and visualisation of our data.

In [0]:
import pandas as pd

df = pd.read_csv("winemag-data_first150k.csv", encoding="utf-8")

The best way to explore our data is to simply look at it. Below, the first five entries and their reviews are presented.

In [0]:
df.head()
Out[0]:
Table
In [0]:
for text in (df.description[0:5]):
    print(text)
    print('\n')
This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.


Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vanilla. This is full, layered, intense and cushioned on the palate, with rich flavors of chocolaty black fruits and baking spices. A toasty, everlasting finish is heady but ideally balanced. Drink through 2023.


Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and complex botrytised white. Dark gold in color, it layers toasted hazelnut, pear compote and orange peel flavors, reveling in the succulence of its 122 g/L of residual sugar.


This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vineyards, among others. Aromatic, dense and toasty, it deftly blends aromas and flavors of toast, cigar box, blackberry, black cherry, coffee and graphite. Tannins are polished to a fine sheen, and frame a finish loaded with dark chocolate and espresso. Drink now through 2032.


This is the top wine from La Bégude, named after the highest point in the vineyard at 1200 feet. It has structure, density and considerable acidity that is still calming down. With 18 months in wood, the wine has developing an extra richness and concentration. Produced by the Tari family, formerly of Château Giscours in Margaux, it is a wine made for aging. Drink from 2020.


Notice that not every description has the same length. This could be problematic for training a model on the descriptions, since a longer description can contain many more important terms that indicate the quality of the wine than a shorter one. Plotting the length of the descriptions reveals a substantial number of outliers. Therefore, the descriptions will be restricted to one or two sentences, which roughly corresponds to descriptions between 15 and 45 words.

In [0]:
import matplotlib.pyplot as plt

text_length = [len(text.split()) for text in df.description]
plt.boxplot(text_length)
plt.ylabel("Number of words")
Out[0]:
Text(0, 0.5, 'Number of words')
Pyplot
In [0]:
#filtering out all the descriptions that do not meet our criteria
df['description'] = df['description'].astype('str')

filtered = df[df.description.str.split().apply(len) > 15]
data = filtered[filtered.description.str.split().apply(len) < 45].copy()  #copy so we can safely add a column later
In [0]:
text_length = [len(text.split()) for text in data.description]
plt.boxplot(text_length)
plt.ylabel("Number of words")
Out[0]:
Text(0, 0.5, 'Number of words')
Pyplot 2

To check how much of our original data remains, we simply compare the number of entries in the new set with the old one. About 66% of our data has been preserved, which still leaves us with a hefty data set of roughly 100,000 wines.

In [0]:
#around 2/3 of the data is being kept
len(data.description)/len(df.description)*100
Out[0]:
66.5162658185914

Additionally, we have to label our data in a fitting manner. Since we only deal with two classes (top wines and other wines), the task at hand is called binary classification. Therefore, we can simply label the top wines with a 1 and the other wines with a 0.

In [0]:
data["categorical"] = [1 if score > 90 else 0 for score in data.points]

We are not done with pre-processing yet. Before we can train models on our data, we would like to define what constitutes a good performance. It might be a good idea to look at the distribution of wine points in our data set.

The plot below clearly shows that most of our wines are not top wines. This means that our data is imbalanced, which might cause problems when training a model on it.

In [0]:
plt.hist(data.points)
plt.ylabel("count")
plt.xlabel("points")
plt.title("wine points distribution")
Out[0]:
Text(0.5, 1.0, 'wine points distribution')
Wine Points Distribution

The goal of our exercise is to correctly classify all the top wines. We could easily label all of our wines as top wines and call it a day; while some of us do not really care how much wine they end up drinking, others only want to try the true top wines. Therefore, we should evaluate our models on two metrics: accuracy and recall (other appropriate metrics exist, but to keep it simple, we stick to these two).

Accuracy is the percentage of all wines that are classified correctly. Recall is the percentage of the actual top wines that are classified correctly.
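To make these two definitions concrete, here is a tiny sketch on made-up labels (the numbers are hypothetical and have nothing to do with our wine data):

from sklearn.metrics import accuracy_score, recall_score

# hypothetical labels: 1 = top wine, 0 = other wine
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# accuracy: fraction of all wines classified correctly -> 6 out of 8 = 0.75
print(accuracy_score(y_true, y_pred))
# recall: fraction of the actual top wines that were found -> 2 out of 3 = 0.67
print(recall_score(y_true, y_pred))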

We would like our recall to be as high as possible and our accuracy to be at least as high as in the case where we classify everything as an inferior wine, which is about 87%.

In [0]:
from sklearn.metrics import recall_score, accuracy_score
predictions = [0 for i in data.points]
print("Accuracy: " + str(accuracy_score(predictions, data.categorical)))
print("Recall: " + str(recall_score(predictions, data.categorical)))
Accuracy: 0.8672417399619495
Recall: 0.0
Furthermore, we do not want to evaluate our models on the same data on which they were trained. When you study for an exam, you study (or train) on different exercises than the ones in the exam. If you had studied using the exam exercises themselves, the exam would be rather easy and would not fairly evaluate your knowledge of the subject. Therefore, we split the data into a large training set and a smaller set to evaluate our models on. The code below divides the data into 75% for the training set and 25% for the test set.
In [0]:
from sklearn.model_selection import train_test_split
#split descriptions and labels; y_train_label and y_test_label are used throughout the rest of the article
X_train_text, X_test_text, y_train_label, y_test_label = train_test_split(data.description, data.categorical, test_size = 0.25, random_state = 42)

Embeddings

In order to let machine learning models work with the wine reviews, the text needs to be converted to numerical representations. For this problem, sentence embeddings are generated in order to capture the semantics of the whole sentence(s).

Simply put, embeddings are vectors that are mapped according to the semantics of the input text. As a result, semantically similar sentences are mapped closer together in the feature space. This description clearly does not do justice to the miraculous and complex workings of sentence embedding models and the larger field of language understanding. This field has seen a tremendous amount of advancement in recent years and produces a new state of the art every few months. If you would like to read more on this topic, I advise you to start with https://machinelearningmastery.com/statistical-language-modeling-and-neural-language-models/ .
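As a minimal illustration of the "closer together in the feature space" idea, the sketch below compares vectors with cosine similarity. The vectors are placeholders; a real sentence embedding model produces much longer vectors, but the comparison works the same way.

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: close to 1 = similar direction, low or negative = dissimilar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# placeholder vectors standing in for real sentence embeddings
juicy_red = np.array([0.9, 0.1, 0.3])
ripe_berry = np.array([0.8, 0.2, 0.4])
flat_dull = np.array([-0.5, 0.9, -0.2])

print(cosine_similarity(juicy_red, ripe_berry))   # high: similar "meaning"
print(cosine_similarity(juicy_red, flat_dull))    # low: dissimilar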

To convert the text to vectors, a pre-trained Skip-Thought model is used (Kiros et al., 2015). The model was trained on a large data set on which it learned to predict the previous and next sentence given an input sentence. The idea is that by learning to predict the surrounding sentences, the semantics of the input sentence are captured. This model can be used to pre-process sentences for a wide variety of NLP tasks.

In [0]:
#prepare the data for the encoder
skt_data = [line.strip() for line in data.description]
assert len(skt_data) == len(data.description)
In [0]:
from skip_thoughts import configuration
from skip_thoughts import encoder_manager

bi_skt_encoder = encoder_manager.EncoderManager()

bi_skt_encoder.load_model(configuration.model_config(bidirectional_encoder=True),
                   vocabulary_file = r"vocab.txt",
                   embedding_matrix_file = r"embeddings.npy",
                   checkpoint_path = r"model.ckpt-500008")

Converting the sentences to embeddings takes some time, and we would not like to run the Skip-Thought model every time we clear our environment. Therefore, we save the embeddings in a pickle file.

In [0]:
import pickle

bi_skt_embedding = bi_skt_encoder.encode(skt_data)

bi_skt_embedding_file = open('bi_skt_embedding_file', 'wb')
pickle.dump(bi_skt_embedding, bi_skt_embedding_file)
bi_skt_embedding_file.close()

The three lines of code below allow us to load the embeddings from the pickle file.

In [0]:
file = open('bi_skt_embedding_file', "rb")
embeddings = pickle.load(file)
file.close()

To check whether all the embeddings are loaded, we can look at their shape: around 100,000 vectors of size 2400.

In [0]:
embeddings.shape
Out[0]:
(100393, 2400)

2400 dimensions might be too many for our task and will make our machine learning models a bit too complex. The sklearn library offers us a neat trick: principal component analysis (PCA). PCA allows us to reduce the number of dimensions (features) while trying to preserve the information stored in the original dimensions. The exact workings of the algorithm behind PCA can be a bit daunting, so I included a gif that might give you some intuition on how it works (Principal component analysis (PCA): Explained and implemented, 2018).

Principal Component Analysis - PCA
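If you want to convince yourself of what PCA does numerically, here is a small sketch on made-up, strongly correlated 2D data (nothing to do with our embeddings): nearly all of the variance ends up in the first component, so dropping the second dimension loses very little information.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x = rng.normal(size=500)
# two strongly correlated features
toy = np.column_stack([x, x * 0.5 + rng.normal(scale=0.1, size=500)])

toy_pca = PCA(n_components=2).fit(toy)
print(toy_pca.explained_variance_ratio_)  # the first component captures nearly all the variance

# reducing to 1 dimension therefore keeps almost all of the information
toy_reduced = PCA(n_components=1).fit_transform(toy)
print(toy_reduced.shape)  # (500, 1)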

We need to fit the PCA on our data in order to let the PCA model learn its distribution. We choose to reduce the vector size from 2400 to 1000. It is important to note that we fit the PCA only on our train data and not on our test data; we do not want the test data to influence how the train data is pre-processed. It is always important to keep in mind which pre-processing techniques we can apply to our train and test data separately! Below, we first split the embeddings into a train and a test set (with the same random_state as before, so the rows line up with the earlier label split) and then fit the PCA on the training embeddings only.

In [0]:
from sklearn.decomposition import PCA

#split the embeddings with the same random_state as before, so the rows line up with the label split above
X_train, X_test = train_test_split(embeddings, test_size = 0.25, random_state = 42)

pca = PCA(n_components=1000)

X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)

K-Nearest Neighbour

The first model that we will train on our data is K-nearest neighbours (KNN). This is a fairly simple model that can work remarkably well on certain data sets. The model remembers the positions of all the training examples in the feature space. When asked to classify a wine from the test set, it looks at the classes of the k nearest wines in the feature space and classifies the wine through a majority vote.
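Here is a minimal sketch of this majority-vote idea on made-up 2D points (k = 3), just to illustrate the mechanism before we apply it to the wine embeddings.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy training data: two small clusters with labels 0 and 1
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# a new point near the second cluster: its 3 nearest neighbours all have label 1
print(toy_knn.predict([[5, 5.5]]))  # [1]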

This model has one major weak point: it cannot handle data with many dimensions well. This weakness is also known as the curse of dimensionality. Therefore, we will apply PCA again, specifically for KNN; 20 dimensions seems reasonable.

In [0]:
knn_pca = PCA(n_components = 20)

X_train_knn = knn_pca.fit_transform(X_train)
X_test_knn = knn_pca.transform(X_test)

We import the KNN classifier from the sklearn library and fit it on our training examples in three lines of code. Yes, it can be this easy.

In [0]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_jobs = -1)
knn.fit(X_train_knn, y_train_label)

After fitting the model, we can ask it to classify wines. We make a difference between the performance on the wines in the training set (on which the model is trained) and the wines in the test set.

KNN does not perform badly on the training set: the model is able to identify more than 70% of the top wines. On the test set, however, the accuracy is not much higher than our baseline of 87%. The recall on the test data is also much lower than on the training set, a difference of about 25 percentage points.

In [0]:
print("Train accuracy: " + str(knn.score(X_train_knn, y_train_label)))
print("Train recall: " + str(recall_score(knn.predict(X_train_knn), y_train_label)))
Train accuracy: 0.9061545408664701
Train recall: 0.7414578587699316
In [0]:
print("Test accuracy: " + str(knn.score(X_test_knn, y_test_label)))
print("Test recall: " + str(recall_score(knn.predict(X_test_knn), y_test_label)))
Test accuracy: 0.8681222359456552
Test recall: 0.49576719576719575

The phenomenon in which a model scores much higher on the training set than on the test set is called overfitting. When a model overfits, what it has learned on the training set does not generalize well to the test set.

A good analogy for understanding overfitting is the anti-tank dogs trained by the Soviets. The dogs were trained to leave explosive packages under German tanks, which would then be detonated. However, they were trained on Soviet diesel tanks rather than on German gasoline tanks. When deployed, the dogs ran towards the familiar diesel-smelling Soviet tanks instead of the German gasoline tanks: they had unintentionally been trained to recognise a tank by its smell.

Logistic Regression

The second model we take a look at is logistic regression, a cousin of linear regression. The main difference between the two is that linear regression is used to predict a continuous value, while logistic regression is used to classify, e.g. as either 0 or 1. The curve of a logistic regression model fitted on only one feature looks like this:

Logistic Regression
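The curve above is simply the logistic (sigmoid) function. If you want to draw it yourself, a quick sketch (not fitted on our wine data) could look like this:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)
sigmoid = 1 / (1 + np.exp(-x))  # the logistic function squashes any value into (0, 1)

plt.plot(x, sigmoid)
plt.axhline(0.5, linestyle="--", color="grey")  # decision threshold
plt.xlabel("feature value")
plt.ylabel("probability of class 1")
plt.show()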

Instead of one feature, we have 1000. Just try to imagine what such a curve would look like in a 1000-dimensional space. Luckily, Python and the sklearn LogisticRegression object will take care of this.

In [0]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X_train_reduced, y_train_label)
Out[0]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

The performance on the training set is lower than that of the KNN model. However, the difference between the performance on the training set and the test set is much smaller. The logistic regression does not overfit and outperforms the KNN model!

In [0]:
print("Train accuracy: " + str(clf.score(X_train_reduced, y_train_label)))
print("Train recall: " + str(recall_score(clf.predict(X_train_reduced), y_train_label)))
Train accuracy: 0.8932982707785481
Train recall: 0.7241147467503362
In [0]:
print("Test accuracy: " + str(clf.score(X_test_reduced, y_test_label)))
print("Test recall: " + str(recall_score(clf.predict(X_test_reduced), y_test_label)))
Test accuracy: 0.8917885174708156
Test recall: 0.6918990703851262

Feed-forward network

We have arrived at the (in)famous branch of machine learning: deep learning. Deep neural networks have achieved unbelievable results on tasks such as image recognition, natural language generation and machine translation. I would hardly know where to start explaining how these networks work; calling them complex would be an understatement. They consist of conventional models on steroids stacked on top of each other. They are notorious for overfitting and ideally require much more data than we have in our example. If you are interested in learning how such models work, I advise you to start with the great videos of 3Blue1Brown and then progress to http://www.deeplearningbook.org/.

Below, a relatively simple feed-forward network is constructed using the keras package. This package is ideal when you are starting to learn how to use neural networks in Python. Because of how the network's output layer works, we have to change the encoding of the labels from 0 (mediocre wine) to [1, 0] and from 1 (top wine) to [0, 1].

In [0]:
from keras.utils import np_utils
# convert integers to dummy variables (i.e. one hot encoded)
train_dummy_y = np_utils.to_categorical(y_train_label)
test_dummy_y = np_utils.to_categorical(y_test_label)

First, we initialize a model. Next, we add layers to it. The model below has one input layer with 64 nodes, one hidden layer with 32 nodes and one output layer with 2 nodes, corresponding to the number of wine classes we have. The compile method specifies how the model will be trained; for the scope of this article, I will not go into too much detail on this.

In [0]:
import keras
from keras import Sequential
from keras import layers

model = Sequential()
model.add(layers.Dense(64, activation = "relu", input_shape = (1000, )))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(32, activation = "relu"))
model.add(layers.Dropout(0.3))
model.add(layers.Dense(2, activation = "sigmoid"))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_7 (Dense)              (None, 64)                64064     
_________________________________________________________________
dropout_5 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 32)                2080      
_________________________________________________________________
dropout_6 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 2)                 66        
=================================================================
Total params: 66,210
Trainable params: 66,210
Non-trainable params: 0
_________________________________________________________________

We can use some tricks to help our model learn and to reduce the training time. First, we specify the weights the model should attach to errors on each class. Since we have far fewer top wines in our data set, we weigh a misclassification of a top wine more heavily than a misclassification of another wine. The weight of the top-wine class is the inverse of its proportion in the training data. This helps to increase the recall on our imbalanced data set.

Second, we use early stopping. The model is trained for a given number of rounds, i.e. epochs. With early stopping, we ask the model to check its accuracy on the test data set after each epoch. If the accuracy does not improve for 5 epochs in a row, training stops. This method can reduce overfitting and shortens the training time.

In [0]:
import numpy as np

#weigh the rare top-wine class by the inverse of its proportion in the training data
class_weight = {1:1/(np.sum(y_train_label)/ len(y_train_label)), 0:1}

from keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_acc', mode='auto', verbose=1, patience=5)

Now, we finally train our model. In addition to the training data, we specify the test data as validation data, set the number of epochs to 50, pass the class weights and specify early stopping. Below, you can see the performance of the model increasing over the epochs. Due to early stopping, our model is done learning after 24 epochs instead of 50, which roughly cuts the training time in half.

After the model is done training, we plot the accuracy on the training and test set for each epoch. Further, we print the recall scores.

In [0]:
history = model.fit(X_train_reduced, train_dummy_y, validation_data=(X_test_reduced, test_dummy_y), epochs = 50, class_weight=class_weight, callbacks = [es])

plt.plot(history.history['acc'], label = "train")
plt.plot(history.history['val_acc'], label = "test")
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()

print("Recall train: ", recall_score(np.argmax(train_dummy_y, axis = 1),  np.argmax(model.predict(X_train_reduced), axis = 1)))
print("Recall test: ", recall_score(np.argmax(test_dummy_y, axis = 1),  np.argmax(model.predict(X_test_reduced), axis = 1)))
Train on 75294 samples, validate on 25099 samples
Epoch 1/50
75294/75294 [==============================] - 9s 122us/step - loss: 0.8636 - acc: 0.7452 - val_loss: 0.4774 - val_acc: 0.7574
Epoch 2/50
75294/75294 [==============================] - 6s 74us/step - loss: 0.7394 - acc: 0.7848 - val_loss: 0.4280 - val_acc: 0.7905
Epoch 3/50
75294/75294 [==============================] - 6s 81us/step - loss: 0.6613 - acc: 0.8115 - val_loss: 0.4815 - val_acc: 0.7695
Epoch 4/50
75294/75294 [==============================] - 5s 70us/step - loss: 0.5767 - acc: 0.8424 - val_loss: 0.3508 - val_acc: 0.8416
Epoch 5/50
75294/75294 [==============================] - 5s 71us/step - loss: 0.5032 - acc: 0.8661 - val_loss: 0.3213 - val_acc: 0.8616
Epoch 6/50
75294/75294 [==============================] - 5s 71us/step - loss: 0.4384 - acc: 0.8859 - val_loss: 0.3444 - val_acc: 0.8562
Epoch 7/50
75294/75294 [==============================] - 6s 80us/step - loss: 0.3934 - acc: 0.9003 - val_loss: 0.3232 - val_acc: 0.8728
Epoch 8/50
75294/75294 [==============================] - 7s 94us/step - loss: 0.3622 - acc: 0.9071 - val_loss: 0.3030 - val_acc: 0.8847
Epoch 9/50
75294/75294 [==============================] - 7s 95us/step - loss: 0.3280 - acc: 0.9164 - val_loss: 0.3714 - val_acc: 0.8568
Epoch 10/50
75294/75294 [==============================] - 7s 93us/step - loss: 0.3111 - acc: 0.9200 - val_loss: 0.3010 - val_acc: 0.8895
Epoch 11/50
75294/75294 [==============================] - 7s 94us/step - loss: 0.2903 - acc: 0.9264 - val_loss: 0.2951 - val_acc: 0.8968
Epoch 12/50
75294/75294 [==============================] - 7s 93us/step - loss: 0.2782 - acc: 0.9302 - val_loss: 0.3000 - val_acc: 0.8945
Epoch 13/50
75294/75294 [==============================] - 7s 94us/step - loss: 0.2516 - acc: 0.9371 - val_loss: 0.3253 - val_acc: 0.8887
Epoch 14/50
75294/75294 [==============================] - 7s 94us/step - loss: 0.2512 - acc: 0.9362 - val_loss: 0.2949 - val_acc: 0.9015
Epoch 15/50
75294/75294 [==============================] - 7s 97us/step - loss: 0.2348 - acc: 0.9405 - val_loss: 0.2825 - val_acc: 0.9109
Epoch 16/50
75294/75294 [==============================] - 7s 94us/step - loss: 0.2225 - acc: 0.9436 - val_loss: 0.3167 - val_acc: 0.8989
Epoch 17/50
75294/75294 [==============================] - 7s 95us/step - loss: 0.2103 - acc: 0.9467 - val_loss: 0.3038 - val_acc: 0.9082
Epoch 18/50
75294/75294 [==============================] - 7s 94us/step - loss: 0.2041 - acc: 0.9485 - val_loss: 0.2991 - val_acc: 0.9156
Epoch 19/50
75294/75294 [==============================] - 7s 95us/step - loss: 0.1942 - acc: 0.9504 - val_loss: 0.2931 - val_acc: 0.9160
Epoch 20/50
75294/75294 [==============================] - 7s 94us/step - loss: 0.1922 - acc: 0.9521 - val_loss: 0.3170 - val_acc: 0.9075
Epoch 21/50
75294/75294 [==============================] - 7s 94us/step - loss: 0.1782 - acc: 0.9559 - val_loss: 0.3107 - val_acc: 0.9104
Epoch 22/50
75294/75294 [==============================] - 7s 97us/step - loss: 0.1778 - acc: 0.9558 - val_loss: 0.3244 - val_acc: 0.9095
Epoch 23/50
75294/75294 [==============================] - 7s 95us/step - loss: 0.1738 - acc: 0.9562 - val_loss: 0.3167 - val_acc: 0.9109
Epoch 24/50
75294/75294 [==============================] - 7s 95us/step - loss: 0.1675 - acc: 0.9581 - val_loss: 0.3404 - val_acc: 0.9096
Epoch 00024: early stopping
Model Accuracy
Recall train:  0.9997010165437512
Recall test:  0.7802064359441409

As the plot clearly shows, our model is slightly overfitted. The accuracy on the training set is well above 95%, while the accuracy on the test set barely surpasses 90%. The difference is even larger for the recall: on the training set, almost 100% of the top wines are classified correctly, while on the test set only 78% are. Nevertheless, this neural network is the top performer so far.

Gradient Boosting

We have arrived at our last model: gradient boosting. To be honest, I had never used this model before writing this article. I found out that it is frequently used in online data science challenges and often ends on top. I had the idea that in the end nothing could really top the accuracy and recall of a neural network; I could not have been more wrong.

An extreme oversimplification of how this model works:

  • Fit a model to the data: F_1(x) ≈ y
  • Calculate the residuals: y – F_1(x)
  • Fit a new model to the residuals: h_1(x) ≈ y – F_1(x)
  • Combine the two previous models into a new model: F_2(x) = F_1(x) + h_1(x)
  • Calculate the residuals of the new model and keep stacking models

In the end, you have an ensemble of models that individually would not perform remarkably well on the data, but that together can achieve impressive results.
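To make this loop a bit more tangible, below is a toy sketch of the residual-fitting idea on made-up data, using small regression trees as the individual models. This only mimics the principle; it is not how XGBoost is actually implemented.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.zeros(200)   # start from a trivial model that always predicts 0
trees = []

for _ in range(20):
    residuals = y_toy - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)   # stack the new model on top of the ensemble
    trees.append(tree)

print(np.mean((y_toy - prediction) ** 2))  # the squared error shrinks as more trees are added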

The gradient boosting implementation of XGBoost is used here. The classifier has all its default hyperparameters, except max_depth, which is increased from its default of 3 to 8.

In [0]:
from xgboost import XGBClassifier
model = XGBClassifier(max_depth = 8)
model.fit(X_train_reduced, y_train_label)
print(model)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=8, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)
In [0]:
predictions = model.predict(X_train_reduced)
print("Train accuracy: " + str(accuracy_score(predictions, y_train_label)))
print("Train recall: " + str(recall_score(predictions, y_train_label)))
predictions = model.predict(X_test_reduced)
print("Test accuracy: " + str(accuracy_score(predictions, y_test_label)))
print("Test recall: " + str(recall_score(predictions, y_test_label)))
Train accuracy: 0.9840624750976174
Train recall: 0.998307761732852
Test accuracy: 0.9258536196661221
Test recall: 0.9197422378441711
When I first saw these results, I was baffled. The recall on the test set smashes the result of the neural network. I never expected to be able to correctly identify over 90% of all the top wines in the data set!

Conclusion

In this article, I demonstrated how you can use a range of methods and models to make sense of text. I hope you take the following away from this article:

  • How to go about converting raw text to data on which we can train models
  • Which metrics you can use to evaluate models
  • The notion that there is no free lunch in machine learning
  • And of course, the ridiculous performance of gradient boosting

This article demonstrates how one could approach a rather simple classification problem using embeddings and four different machine learning models. However, there is still a wide range of alternative methods to try out on this data set or a similar one:

  • Try a more recent, state-of-the-art sentence embedding model such as BERT or XLNet.
  • Most models have been used with their default hyperparameters. See if you can increase the performance by tuning them with cross validation (a small sketch follows below).
  • Instead of classifying top wines, you can also try to predict the actual score of the wine, which would make this a regression problem instead of a classification problem.
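For the second suggestion, here is a minimal sketch of what hyperparameter tuning with cross validation could look like for the logistic regression from earlier. The grid of C values is just an example; it reuses X_train_reduced and y_train_label from above.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10]}  # example grid of regularisation strengths

# 5-fold cross validation on the training set, optimising for recall
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="recall")
search.fit(X_train_reduced, y_train_label)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated recall of the best setting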

You can access the notebook of this code online with the following link: https://github.com/kcambrek/wines/blob/master/wine_review.ipynb

References

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Skip-thought vectors. In Advances in neural information processing systems (pp. 3294-3302).

Principal component analysis (PCA): Explained and implemented. (2018). [Photograph]. Retrieved from https://medium.com/@raghavan99o/principal-component-analysis-pca-explained-and-implemented-eeab7cb73b72

Van der Linden, P. (2018, August 28). Organizations need to give unstructured data its rightful place if they want to get value out of data. Retrieved 16 August 2019, from https://www.capgemini.com/2018/08/reorganizing-unstructured-data/

Wine Reviews. (2017, November 27). Retrieved 16 August 2019, from https://www.kaggle.com/zynicide/wine-reviews

Article by Kees Brekelmans