In my previous post I talked a little bit about preparing text for a neural network. During my study of this problem, I came across a bit of Python that I didn’t quite understand.
import pandas as pd
import numpy as np

labels = pd.read_csv('labels.txt', header=None)
Y = (labels == 'positive').astype(np.int_)  # What does this do??
Given a text file with one label per line, either ‘positive’ or ‘negative’, this code reads the labels into a pandas data frame with no header row, binarizes them, and assigns the result to a new name, Y.
In the last post, I gave an incredibly simple solution to reading the words and converting them to 1s and 0s. This solution is a little more elegant. And it turns out that thanks to Numpy, you can extend it to matrices and N-d arrays as well.
Basically, the line above walks through the pandas ‘labels’ data frame, which is only a single column (much like a Series), and performs a boolean comparison on each element (thanks to the == operator). If the element is ‘positive’, it gets a value of True. If it’s not, it gets False.
When the .astype() method is called, the True or False values are converted to a new type, in this case np.int_, so True becomes 1 and False becomes 0.
Numpy is THE Python package for performing linear algebra operations, and as such it treats every 1d array as a vector, every 2d array as a sequence of vectors (a matrix), and every 3d+ array as a generic tensor. This means that when we apply an operation such as == or > to an array, it is applied to every element at once, as vector math.
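To make the two steps visible, here is a minimal sketch that splits the one-liner into its comparison and its conversion. It assumes a labels file with one lowercase label per line; the in-memory `fake_file` and the name `is_positive` are made up for illustration and stand in for the real labels.txt.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for labels.txt: one label per line, no header.
fake_file = io.StringIO("positive\nnegative\npositive\n")
labels = pd.read_csv(fake_file, header=None)

# Step 1: the comparison alone yields a data frame of booleans.
is_positive = labels == 'positive'
print(is_positive[0].tolist())  # [True, False, True]

# Step 2: astype converts each boolean to an integer (True -> 1, False -> 0).
Y = is_positive.astype(np.int_)
print(Y[0].tolist())            # [1, 0, 1]
```

With header=None, pandas names the single column 0, which is why the sketch indexes with [0] before printing.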
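As a rough illustration of that point, the same comparison-then-astype pattern works unchanged on 1d, 2d and 3d arrays. The array names and values below are invented for the example.

```python
import numpy as np

vector = np.array([0.2, 0.7, 0.9])             # 1d
matrix = np.array([[0.1, 0.6], [0.8, 0.3]])    # 2d
tensor = np.arange(8).reshape(2, 2, 2) / 10.0  # 3d, values 0.0 through 0.7

# The expression is identical for every shape.
vec_bits = (vector > 0.5).astype(np.int_)
mat_bits = (matrix > 0.5).astype(np.int_)
ten_bits = (tensor > 0.5).astype(np.int_)

print(vec_bits.tolist())  # [0, 1, 1]
print(mat_bits.tolist())  # [[0, 1], [1, 0]]
print(ten_bits.shape)     # (2, 2, 2)
```

The result always has the same shape as the input, with each element replaced by 1 or 0.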
Let’s see another example. The following is the case of a random matrix where, instead of performing an == comparison, we’ll convert any number that is greater than 0.5 to a 1, and anything else to a 0.
>>> import numpy as np
>>> np.random.seed(0)
>>> np.set_printoptions(precision=3)
>>> a = np.random.rand(4, 4)
>>> a
array([[ 0.549, 0.715, 0.603, 0.545],
       [ 0.424, 0.646, 0.438, 0.892],
       [ 0.964, 0.383, 0.792, 0.529],
       [ 0.568, 0.926, 0.071, 0.087]])
>>> a = (a > 0.5).astype(np.int_)  # Where the numpy magic happens.
>>> a
array([[1, 1, 1, 1],
       [0, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 1, 0, 0]])
What’s going on here is that you are automatically iterating through every element of every row in the 4×4 matrix and applying a boolean comparison to each element.
If the element is greater than 0.5, it becomes True; otherwise, it becomes False.
Again, by calling the .astype method and passing np.int_ as the argument, you’re telling numpy to replace all boolean values with their integer representation, in effect binarizing the matrix based on your comparison value.
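For comparison, here is a sketch of what the one-liner saves you from writing by hand. The names `fast` and `slow` are just labels for this example. The explicit double loop produces the same matrix as the vectorized expression.

```python
import numpy as np

np.random.seed(0)
a = np.random.rand(4, 4)

# Vectorized version from the example above.
fast = (a > 0.5).astype(np.int_)

# Hand-written equivalent, shown only for comparison.
slow = np.zeros(a.shape, dtype=np.int_)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        slow[i, j] = 1 if a[i, j] > 0.5 else 0

print(np.array_equal(fast, slow))  # True
```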
I love it!