Matrix binarization one liner

In my previous post I talked a little bit about preparing text for a neural network. While studying this problem, I came across a bit of Python that I didn’t quite understand.

import pandas as pd
import numpy as np

labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)  # What does this do??

Given a text file with one label per line, either ‘positive’ or ‘negative’, this reads the labels into a pandas data frame with no header, binarizes them, and assigns the result to a new variable.

In the last post, I gave an incredibly simple solution to reading the words and converting them to 1s and 0s. This solution is a little more elegant, and it turns out that thanks to NumPy, you can extend it to matrices and N-d arrays as well.

Basically, the line above compares every element of the pandas ‘labels’ data frame, which has only a single column, against the string ‘positive’ (thanks to the == operator). There is no explicit loop; the comparison is applied element-wise. If the element is ‘positive’, the result is True. If it’s not, the result is False.

When the .astype() method is called, the True and False values are converted to a new type, in this case np.int_, so True becomes 1 and False becomes 0.
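To see the two steps separately, here is a minimal sketch. I don't have labels.txt at hand here, so it builds a small data frame in memory instead; the four sample labels and the name `mask` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv('labels.txt', header=None):
# a single unnamed column (column 0) of label strings.
labels = pd.DataFrame(['positive', 'negative', 'positive', 'negative'])

# Step 1: the element-wise comparison produces a data frame of booleans.
mask = (labels == 'positive')

# Step 2: cast the booleans to integers (True -> 1, False -> 0).
Y = mask.astype(np.int_)

print(mask[0].tolist())  # [True, False, True, False]
print(Y[0].tolist())     # [1, 0, 1, 0]
```

The one-liner simply chains these two steps together without naming the intermediate boolean frame.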

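NumPy is THE Python package for array-based numerical work, including linear algebra. It treats a 1-d array as a vector, a 2-d array as a matrix and a 3d+ array as a generic tensor. Operators such as ==, > and + are applied element-wise across the whole array, so a single expression works on every element at once. pandas is built on top of NumPy, which is why the same trick works on a data frame.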
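As a small illustration of what element-wise means, the sketch below applies an arithmetic operation and a comparison to a whole vector, then checks the comparison against an explicit Python loop. The array values and variable names are my own, picked only to make the result easy to read.

```python
import numpy as np

v = np.array([0.2, 0.7, 0.5, 0.9])

# Arithmetic and comparisons apply to every element at once.
doubled = v * 2
above_half = v > 0.5

# The same comparison written as an explicit Python loop.
looped = np.array([x > 0.5 for x in v])

print(above_half.tolist())                 # [False, True, False, True]
print(np.array_equal(above_half, looped))  # True
```

Note that 0.5 itself is not greater than 0.5, so it maps to False.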

Let’s see another example. The following is the case of a random matrix where, instead of performing an == comparison, we’ll convert any number that is greater than 0.5 to a 1, and anything else to a 0.

>>> import numpy as np
>>> np.random.seed(0)
>>> np.set_printoptions(precision=3)

>>> a = np.random.rand(4, 4)  # 4x4 matrix of uniform random values in [0, 1)

>>> a
array([[ 0.549,  0.715,  0.603,  0.545],
       [ 0.424,  0.646,  0.438,  0.892],
       [ 0.964,  0.383,  0.792,  0.529],
       [ 0.568,  0.926,  0.071,  0.087]])

>>> a = (a > 0.5).astype(np.int_)  # Where the numpy magic happens.

>>> a
array([[1, 1, 1, 1],
       [0, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 1, 0, 0]])

What’s going on here is that the boolean comparison is applied to every element of every row in the 4×4 matrix, without you having to write a loop.

If > 0.5 return True, else return False.

Again, by calling the .astype method and passing np.int_ as the argument, you’re telling NumPy to replace all boolean values with their integer representation, in effect binarizing the matrix based on your comparison value.
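The same expression carries over to arrays with more dimensions. Here is a quick sketch on a 2×2×2 array with hand-picked values (my own, not from the original example), alongside np.where, which is another way to spell the same idea.

```python
import numpy as np

# A small 2x2x2 array with hand-picked values so the result is easy to check.
t = np.array([[[0.1, 0.9], [0.5, 0.6]],
              [[0.8, 0.2], [0.3, 0.7]]])

# Same one-liner as before, now on a 3-d array.
binarized = (t > 0.5).astype(np.int_)

# np.where picks 1 where the condition holds and 0 elsewhere.
same = np.where(t > 0.5, 1, 0)

print(binarized.shape)                  # (2, 2, 2)
print(binarized.tolist())               # [[[0, 1], [0, 1]], [[1, 0], [0, 1]]]
print(np.array_equal(binarized, same))  # True
```

Which spelling to use is mostly a matter of taste; np.where becomes handy when the two output values are something other than 1 and 0.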

I love it!

 
