Prepping text for a neural network

I’ve been studying natural language processing using deep neural networks. Neural networks are great at discovering hidden correlations in high-dimensional data, or in data with nested structure such as images and text. But they don’t deal in English. You can’t just feed a neural network a sentence and expect it to compute anything – after all, these networks are just computational algorithms.

These things are hella cool – and they look cool as well.

[Figure: diagram of a simple feed-forward neural network]

The mathematical object (a vector, or more generally an n-dimensional array; 1-d for the simple networks) goes in at the input layer, and then a series of linear transformations occurs: take the dot product of the input vector and a weight matrix (all the lines to the right), then pass the result through an activation function to get the hidden layer, which repeats the same process. Error is calculated against the training label, and that error is backpropagated using partial derivatives and the chain rule (from multivariable calculus – you know, the class that sounded a lot harder than it was).
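
To make the dot-product-then-activation step concrete, here’s a minimal sketch of a single forward pass in numpy. The 3-input, 4-hidden, 2-output sizes and the sigmoid activation are made-up choices for illustration, not anything specific to the diagram above.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# toy network: 3 inputs -> 4 hidden nodes -> 2 outputs (illustrative sizes)
rng = np.random.default_rng(0)
W_hidden = rng.standard_normal((3, 4))  # input-to-hidden weights
W_output = rng.standard_normal((4, 2))  # hidden-to-output weights

x = np.array([0.5, 0.1, 0.9])           # one input vector

hidden = sigmoid(x.dot(W_hidden))       # dot product, then activation
output = sigmoid(hidden.dot(W_output))  # same process again
print(output)                           # two numbers we can read as class scores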

When dealing with natural language processing, you have to find a way to convert words into mathematical objects so that the network can incorporate them into its algorithm.

During my studies, I came across a training data set of movie reviews, each labeled with a sentiment of ‘positive’ or ‘negative’ based on its star rating. In order to train a neural network to ‘learn’ which reviews were positive and which were negative, I had to solve the problem of converting both the review into a number or sequence of numbers and the sentiment into a sequence of numbers.

It turned out that the way to do this was to convert the sentiment label into a list of boolean integers, i.e. 0s and 1s. The input file consisted of lines of text reading either “POSITIVE” or “NEGATIVE”. So the question was framed as: “How do I convert ‘positive’ and ‘negative’ to 0s and 1s?”

Okay, so this is NOT complicated. The first and obvious answer is to write a function that will read the file, parse the lines, and append 0s and 1s to a list.

def binarize_input(file):
    binarized_list = []
    with open(file, 'r') as infile:
        for row in infile:
            # strip the trailing newline before comparing
            if row.strip() == "POSITIVE":
                binarized_list.append(1)
            else:
                binarized_list.append(0)
    return binarized_list
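
Assuming a hypothetical labels.txt with one “POSITIVE” or “NEGATIVE” per line, calling it looks something like this (the output here is made up for illustration):

>>> binarize_input("labels.txt")
[1, 0, 0, 1, 1]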

Okay, easy. When my neural network outputs its final classification, it will give a probability of being a 1 or a 0: a positive or a negative.

Dealing with the input turned out to be not so easy. The weird thing about neural networks that I didn’t quite expect when I began was that you have to sort of encapsulate the entire scope of the data within each input. This is the case for simple neural networks anyway. First you have to convert the input string of text to a mathematical object, but you can’t just input any ol’ string. If you did, it would totally screw up the computation occurring within the network. You’d have, say, 30 input nodes for one sentence and then 100 input nodes for the next, etc. Having a dynamic input like this is not something I’m aware of working.

Instead, you have to take the entire set of words you’ll be considering in your network, and train on those words. This means you’ll have an input node for every word in your ‘vocabulary’, aka your ‘bag of words’. So what do you do? You input a vector with an integer representing each of the words. If you have 20,000 different words across all of your reviews, you’ll have 20,000 words in your ‘bag of words’ vocab! And if that’s the case, then you’ll have 20,000 input nodes! Wha!! Crazy!

So the solution, it turns out, is to vectorize all of the words and then create either a count representation or a binarization. In other words, we start with a vector of 20,000 word positions, all set to zero.

>>> bagowords = [0] * 20000   # 20,000 zeros, one per word in the vocab
>>> len(bagowords)
20000

For each word in our sentence, we go through and either flip that word’s position to a 1 (binarized) or add 1 to it (a count).

from collections import Counter
import numpy as np

# count every word across all reviews to build the vocabulary
total_counts = Counter()
with open(review_file, 'r') as infile:
    for review in infile:
        for word in review.strip().split(" "):
            total_counts[word] += 1

# dictionary comprehension like a ninja: map each word to its index
word2index = {word: index for (index, word) in enumerate(total_counts.keys())}

def get_input(review):
    vector = np.zeros(len(word2index))
    for word in review.split(" "):
        idx = word2index.get(word)  # None if the word isn't in our vocab
        if idx is not None:
            vector[idx] += 1  # or just = 1 to binarize
    return vector

With this we get a numpy array with 20,000 positions, and each time we see a word in the current review, we add 1 to that word’s position. We could just as well binarize this by setting the position to 1, effectively eliminating any word-frequency information from the neural network’s correlation analysis.
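
As a quick sanity check, assuming the 20,000-word vocab from above, passing in a made-up review would look something like this:

>>> vec = get_input("this movie was great great fun")
>>> vec.shape
(20000,)
>>> vec[word2index["great"]]   # "great" appears twice
2.0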

Pretty cool!
