In my previous post I talked a little about preparing text for a neural network. While studying the problem, I came across a bit of Python that I didn’t quite understand.

import pandas as pd
import numpy as np
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_) # What does this do??

The setup: a text file holds one label per line, either ‘positive’ or ‘negative’. The file is read into a pandas DataFrame with no header, and the last line binarizes the labels and assigns the result to a new name, Y.

In the last post, I gave an incredibly simple solution to reading the words and converting them to 1s and 0s. This solution is a little more elegant, and thanks to NumPy it extends to matrices and N-d arrays as well.

Basically, the commented line above compares every element of the pandas ‘labels’ DataFrame, which is only a single column (effectively a Series), against ‘positive’ using the == operator. There is no explicit loop; pandas applies the comparison element by element. If the element is ‘positive’, the result for that element is True. If it’s not, the result is False.

When the .astype() method is called, the True and False values are converted to a new type, in this case np.int_, so True becomes 1 and False becomes 0.
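To make the two steps visible, here is a minimal sketch that splits the one-liner in half. I don’t have the original labels.txt handy, so the three-line `fake_file` below is a made-up stand-in, and `is_positive` is just a name I picked for the intermediate result.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for labels.txt: one label per line, no header.
fake_file = io.StringIO("positive\nnegative\npositive\n")
labels = pd.read_csv(fake_file, header=None)

# Step 1: the == comparison runs element by element and
# produces a DataFrame of booleans with the same shape.
is_positive = labels == 'positive'

# Step 2: cast the booleans to integers (True -> 1, False -> 0).
Y = is_positive.astype(np.int_)

# With header=None the single column is named 0.
print(is_positive[0].tolist())  # [True, False, True]
print(Y[0].tolist())            # [1, 0, 1]
```

Chaining the two steps gives back the original line, `(labels == 'positive').astype(np.int_)`.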

NumPy is THE Python package for performing linear algebra operations, and as such it treats every 1-d array as a vector, every 2-d array as a sequence of vectors (a matrix) and every 3-d+ array as a generic tensor. This means that operations such as arithmetic and comparisons are applied to the whole array at once, element by element, rather than to one value at a time.
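A tiny sketch of what that looks like in practice. The arrays `v` and `m` are made-up values, chosen only so the results are easy to check by eye.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])        # 1-d array: a vector
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # 2-d array: a matrix

# Arithmetic is applied to every element, no loop required.
doubled = v * 2

# Comparisons work the same way and give back a boolean array
# with the same shape as the input.
mask = m > 2.5

print(doubled.tolist())  # [2.0, 4.0, 6.0]
print(mask.tolist())     # [[False, False, True], [True, True, True]]
```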

Let’s see another example. The following is a random matrix where, instead of performing an == comparison, we’ll convert any number greater than 0.5 to a 1 and anything else to a 0.

`>>> a = (a > 0.5).astype(np.int_)`

```
>>> import numpy as np
>>> np.random.seed(0)
>>> np.set_printoptions(precision=3)
>>> a = np.random.rand(4, 4)
>>> a
array([[0.549, 0.715, 0.603, 0.545],
       [0.424, 0.646, 0.438, 0.892],
       [0.964, 0.383, 0.792, 0.529],
       [0.568, 0.926, 0.071, 0.087]])
>>> a = (a > 0.5).astype(np.int_) # Where the numpy magic happens.
>>> a
array([[1, 1, 1, 1],
       [0, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 1, 0, 0]])
```

What’s going on here is that NumPy applies the boolean comparison to every element of every row in the 4×4 matrix, without you having to write a loop.

If the element is greater than 0.5, the result is True; otherwise it is False.

Again, by calling the **.astype** method and passing **np.int_** as the argument, you’re telling NumPy to convert every boolean value to its integer representation (True to 1, False to 0), in effect binarizing the matrix based on your comparison value.
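To convince myself the same trick holds beyond 2-d, here is a rough sketch on a 3-d array. The helpers `binarize` and `binarize_loop` are names I made up for illustration, not NumPy functions, and the `cube` values are arbitrary. The loop version spells out what the one-liner does for you.

```python
import numpy as np


def binarize(arr, threshold=0.5):
    """Hypothetical helper: 1 where arr > threshold, else 0."""
    return (np.asarray(arr) > threshold).astype(np.int_)


def binarize_loop(arr, threshold=0.5):
    """The same logic written as an explicit loop over every element."""
    arr = np.asarray(arr)
    out = np.zeros(arr.shape, dtype=np.int_)
    for idx in np.ndindex(arr.shape):
        out[idx] = 1 if arr[idx] > threshold else 0
    return out


# A 2x2x2 array; note that exactly 0.5 is NOT greater than 0.5.
cube = np.array([[[0.1, 0.9], [0.5, 0.6]],
                 [[0.7, 0.2], [0.51, 0.0]]])

result = binarize(cube)
print(result.tolist())
# [[[0, 1], [0, 1]], [[1, 0], [1, 0]]]

# The vectorized and looped versions agree.
print(np.array_equal(result, binarize_loop(cube)))  # True
```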

I love it!
