Matrix binarization one-liner

In my previous post I talked a little bit about preparing text for a neural network. During my study of this problem, I came across a bit of Python that I didn’t quite understand.

import pandas as pd
import numpy as np

labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)  # What does this do??

Given a list of labels in a text file, each either ‘positive’ or ‘negative’, read into a pandas data frame object with no header, binarize the labels and assign the result to a new object reference.

In the last post, I gave an incredibly simple solution to reading the words and converting them to 1s and 0s. This solution is a little more elegant. And it turns out that thanks to Numpy, you can extend it to matrices and N-d arrays as well.

Basically, the line above iterates through the pandas ‘labels’ data frame, which is only a single column (or Series), and performs a boolean comparison on each element (thanks to the == operator). If the element is ‘positive’, it gets assigned a value of True. If it’s not, it is assigned False.

When the .astype() function is called, the True or False values are converted to a new type, in this case the np.int_ type.
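To make that intermediate step concrete, here’s a minimal sketch in the interpreter, using a hand-built Series as a stand-in for the contents of labels.txt (the exact integer dtype printed may differ by platform):

>>> import pandas as pd
>>> import numpy as np

>>> labels = pd.Series(['positive', 'negative', 'positive'])  # stand-in for labels.txt
>>> labels == 'positive'
0     True
1    False
2     True
dtype: bool

>>> (labels == 'positive').astype(np.int_)
0    1
1    0
2    1
dtype: int64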

NumPy is THE Python package for performing linear algebra operations, and as such it treats every 1d array as a vector, every 2d array as a sequence of vectors (a matrix), and every 3d+ array as a generic tensor. This means when we perform operations, we are performing vector math.

Let’s see another example. The following is the case of a random matrix where, instead of performing an == comparison, we’ll convert any number that is greater than 0.5 to a 1, and anything less than or equal to 0.5 to a 0.

>>> import numpy as np
>>> np.random.seed(0)
>>> np.set_printoptions(precision=3)

>>> a = np.random.rand(4, 4)
>>> a
array([[ 0.549,  0.715,  0.603,  0.545],
       [ 0.424,  0.646,  0.438,  0.892],
       [ 0.964,  0.383,  0.792,  0.529],
       [ 0.568,  0.926,  0.071,  0.087]])

>>> a = (a > 0.5).astype(np.int_)  # Where the numpy magic happens.
>>> a
array([[1, 1, 1, 1],
       [0, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 1, 0, 0]])

What’s going on here is that you are automatically iterating through every element of every row in the 4×4 matrix and applying a boolean comparison to each element.

If > 0.5 return True, else return False.

Again, by calling the .astype() method and passing np.int_ as the argument, you’re telling numpy to replace all boolean values with their integer representations, in effect binarizing the matrix based on your comparison value.
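If you find yourself doing this a lot, you could wrap the idea in a tiny helper. This is just a hypothetical convenience function of my own, not something that ships with numpy:

import numpy as np

def binarize(arr, threshold=0.5):
    # 1 wherever the value is greater than the threshold, 0 everywhere else.
    return (np.asarray(arr) > threshold).astype(np.int_)

>>> binarize([0.1, 0.7, 0.5, 0.9])
array([0, 1, 0, 1])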

I love it!


Prepping text for a neural network

I’ve been studying natural language processing using deep neural networks. Neural networks are great for discovering hidden correlations within high-dimensional data, or within data that exhibits nested structure such as images and text. But they don’t deal in English. So you can’t just feed a neural network a sentence and expect it to compute anything – after all, these networks are just computational algorithms.

These things are hella cool – and they look cool as well.

[Figure: diagram of a simple feed-forward neural network]

The mathematical object (some vector or n-dimensional array – 1d for the simple networks) is fed to the input layer, and then a series of linear transformations occurs by taking the dot product of the input values and the weight matrix (all the lines to the right). The result is passed through an activation function to get the hidden layer, which repeats the same process. Error is calculated using the training label, and the error is backpropagated using partial derivatives and the chain rule (from multivariable calculus – you know, the class that sounded a lot harder than it was).
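Here’s a rough, self-contained numpy sketch of one forward and backward pass. The layer sizes, the sigmoid activation, and the missing learning rate are all simplifying assumptions on my part, not the exact network from the course:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(1)
# Illustrative shapes: 3 input nodes, 4 hidden nodes, 1 output node.
W_in_hidden = np.random.randn(3, 4)
W_hidden_out = np.random.randn(4, 1)

x = np.array([[0.2, 0.5, 0.1]])              # one training example
label = 1                                    # its training label

hidden = sigmoid(x.dot(W_in_hidden))         # dot product, then activation
output = sigmoid(hidden.dot(W_hidden_out))   # same process again

error = label - output                       # how far off were we?
# Backpropagate: the chain rule pushes the error back through each layer.
output_delta = error * output * (1 - output)
hidden_delta = output_delta.dot(W_hidden_out.T) * hidden * (1 - hidden)

W_hidden_out += hidden.T.dot(output_delta)   # weight updates
W_in_hidden += x.T.dot(hidden_delta)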

When dealing with natural language processing, you have to find a way to convert words to mathematical objects, so that the network can incorporate them in to its algorithm.

During my studies, I came across a training data set of movie reviews and their respective sentiment of ‘positive’ or ‘negative’ based on the star rating. In order to train a neural network to ‘learn’ which reviews were positive and which were negative, I had to solve the problem of converting both the review into a number or sequence of numbers and the sentiment into a sequence of numbers.

It turned out that the way to do this was to convert the sentiment label into a list of boolean integers, i.e. 0s and 1s. The input file consisted of lines of text that read either “POSITIVE” or “NEGATIVE”. So the question was framed as – “How do I convert ‘positive’ and ‘negative’ to 0s and 1s?”

Okay, so this is NOT complicated. The first and obvious answer is to write a function that will read the file, parse the lines, and append 0s and 1s to a list.

def binarize_input(file):

    with open(file, 'r') as infile:
        binarized_list = []
        for row in infile.readlines():

            if row.strip() == "POSITIVE":
                binarized_list.append(1)
            else:
                binarized_list.append(0)

        return binarized_list

Okay, easy. When my neural network outputs its final classification, it will give a probability of being either a 1 or a 0 – a positive or a negative.

Dealing with the input turned out to be not so easy. The weird thing about neural networks that I didn’t quite expect when I began was that you have to sort of encapsulate the entire scope of the data within each input. This is the case for simple neural networks anyway. First you have to convert the input string of text to a mathematical object, but you can’t just input any ol’ string. If you did, it would totally screw up the computation occurring within the network. You’d have, say, 30 input nodes for one sentence and then 100 input nodes for the next, etc. Having a dynamic input size like this is not something I’m aware of working.

Instead, you have to take the entire set of words you’ll be considering in your network, and train on those words. This means you’ll have an input node for every word in your ‘vocabulary’, aka your ‘bag of words’. So what do you do? You have to input a vector with an integer representing each of the words. If you have 20,000 different words across all of your reviews, you’ll have 20,000 words in your vocab ‘bag of words’! And if that’s the case, then you’ll have 20,000 input nodes! Wha!! Crazy!

So the solution, it turns out, is to vectorize all of the words and then either create a count representation or a binarization. In other words, we have a vector with 20,000 word positions that start at zero.

>>> bagowords = [0] * 20000   # i.e. [0, 0, 0, ..., 0]
>>> len(bagowords)
20000

For each word in our sentence we go through and flip its position to a 1 (binarized) or we add 1.

from collections import Counter
import numpy as np

# Count every word across all reviews.
total_counts = Counter()
with open(review_file, 'r') as infile:
    reviews = infile.readlines()
    for review in reviews:
        for word in review.strip().split(" "):
            total_counts[word] += 1

# Dictionary comprehension like a ninja: map each word to a fixed index.
word2index = {word: i for i, word in enumerate(total_counts.keys())}

def get_input(review):
    vector = np.zeros(len(word2index))
    for word in review.split(" "):
        idx = word2index.get(word, None)  # None if the word is not in our vocab
        if idx is not None:
            vector[idx] += 1  # or just = 1 to binarize
    return vector

With this we get a numpy array with 20,000 positions, and each time we see a word in our current review, we add 1 to that word’s position. We could just as well binarize this by setting the word’s position equal to one, effectively eliminating any count weighting from the neural network’s correlation analysis.
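For example, assuming the word ‘great’ showed up somewhere in the training reviews (so it has an entry in word2index), the count and binarized versions look like this:

>>> vector = get_input("this movie was great great fun")
>>> vector[word2index["great"]]
2.0
>>> (vector > 0).astype(np.int_)[word2index["great"]]   # the binarization one-liner again
1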

Pretty cool!


Philosophically and physically, why do proteins function?

Every so often, I cruise the forums for questions to answer. It’s usually late at night, and I usually should be working on something else. Nonetheless, I seek out questions I feel like I can answer and give them a crack. One of my favorite forums is ResearchGate, where people ask a lot of biology research questions. Today I’m posting my response to a fairly poorly worded question on why it is that proteins do what they do. I’ve linked the original and I’ve corrected the question’s English for this blog.

Philosophically and physically, why do proteins function?

Question: I’ve been doing bio research for many years; however, in terms of physics and philosophy, I’m pretty confused about why proteins function. E.g., how/why do endonucleases know to cut DNA? What is the force that makes them cut? Why do transcription factors know to bind specific DNA sequences? Again, what force makes them bind? Is someone capable of explaining it in physical or logical terms? Since I was a bio major, teachers always taught me the rules (A can do this, B can do that), but why?

Response

Hi Suipwksiow,

This is such a great question! And it is one of my favorites.

First – Ke-Wei is absolutely correct – philosophy is utterly irrelevant to this question. Philosophy refers to the synthesis of knowledge from information extracted from data. That data can come from literature, biological observations, careful observation of physical phenomena, etc. To understand your own question, it’s key to understand how all of these things/fields are related. In other words, how is biology related to chemistry, and how is that related to physics? They are all just ways that we humans have categorized our study of the same thing – the natural universe – and they are essentially different levels of focus. Here’s a completely made up story to illustrate how these fields are related…

The literature PhD studies the words created by other humans, and he is fascinated by how many different versions of the same story there could be. He thinks about it and concludes that all of these minds are similar in composition, but develop down different trajectories giving rise to the different versions of what appear to be relatively similar stories.

The neurobiologist reads a summary of this work and thinks, fascinating – but if all of these minds are truly similar in composition, then I should be able to observe similar patterns across the brain, which my predecessors showed to be comprised of neural cells. So they dissect the brain and image brain activity, and measure brain chemicals as best they can to draw conclusions about the similarities between brains. They note that certain regions of the brain exhibit frequent excitation of electrical activity which leads to the release of a variety of chemicals – neurotransmitters.

The biochemist reads the neurobiologist’s latest discoveries in a journal over lunch and begins to speak with his good friend the general chemist about the possible structures of these neurotransmitters. Together they use what they learned from the neurobiologist and propose a theory that these chemicals are comprised of a variety of different atoms and have a specific shape, which then bind to proteins on the surface of the neural cells. They are able to define with great precision the interaction between the proteins and the neurotransmitters and determine that their shape is key to the binding process.

The chemist and biochemist go on after this publication to work with a theoretical physicist to explain how these atoms bind to one another. They develop a theory based on the observations of fellow scientists that these proteins and molecules aren’t static, but vibrate with high frequency – causing the proteins to change conformation. They realize that 85% of the time the protein is in conformation A, while the other 15% of the time it is in conformation B. Astounding.

The physicist then reads this paper and thinks to himself – my goodness, these molecules behave in such an interesting way! Perhaps I can develop a theory to explain why these molecules vibrate to begin with! He starts to consider the very nature of the constituent parts of the molecules, the atoms themselves. He studies the behaviors of the electrons among all of the elements, and builds a giant magnetized ring to accelerate simple atoms close to the speed of light so that he can smash them together to see what they break apart into. This work gains traction and before long thousands of experimental and theoretical physicists are collaborating to understand what the fabric of matter is, and they realize that it is nothing more than intersecting fields produced by subatomic particle waves rippling through the very fabric of space-time.

So all of these scientists elect a spokesperson to condense all of the findings from the neurobiologist, the chemist, the biochemist, and the physicists, both experimental and theoretical, into a review article. The spokesperson brings the review article to the literature PhD and discusses it all over a cup of coffee.

So this is the end of the completely made up story – I hope you’ve made the connection.

All of these fields are studying different aspects of the same thing. 

If you can learn a little bit from each field, then the reason that macromolecules behave the way they do (for example, why transcription factors bind to DNA) becomes apparent. I’ll give you a leading summary, and with this you can start your journey to gaining a broader understanding for yourself.

The physicists and chemists taught us that each atom has a different configuration of protons and electrons. When atoms bind to one another, we call this a molecule, and we know that molecules have a huuuuge variety of charge characteristics. Some molecules are polar, some are neutral, some are negative, some are positive, etc. Some are stable, others are highly unstable; the list goes on. Proteins are just molecules. They are a combination of atoms.

But here is the clincher:

Most atoms, when bound together, will form an inflexible lattice that repeats the same pattern over and over. Metals and rocks. There is no flexibility to a repeating lattice, and these molecules either stay in that conformation forever if they are stable, or break down if they are unstable (as is the case for lattices of high-atomic-weight atoms). And actually, most of the atoms in the universe are just single-proton atoms – hydrogen. However, when stars compress low-atomic-weight atoms such as hydrogen and helium in their cores (leading to the never-ending nuclear fusion explosions that make our sun what it is), they form a variety of different atoms, now with different numbers of protons and electrons. When these stars go nova, they undergo one final massive compression, fusing together LOTS of atoms, and explode in a ridiculous display called a supernova (if the star is big enough). These explosions inevitably lead to the creation of the most important element (atom) for life. The CARBON atom.

The difference between a protein and a piece of metal is that proteins, and all other molecules that form the basis for life, revolve around the all-important carbon atom – which provides the atomic property of flexibility. Proteins are structured around the carbon atom, and carbon atoms allow a large variety of different types of atoms to come together. There is an entire field dedicated to studying this flexibility and the endless combinations of atoms that revolve around carbon. This field is called ‘Organic Chemistry’.

So when it comes to a transcription factor binding to DNA, what is REALLY happening? Well, DNA is an organic molecule, based around the carbon atom, rich in phosphorus, and therefore exhibiting an overall negative charge. But the individual nucleotides that are ordered along the phosphate backbone have their own charge domains. So you can think of a piece of DNA, a transcriptional start site for example, as a unique sequence of charges that are either exposed (euchromatic) or not exposed (heterochromatic). Let’s not forget that a transcription factor is a protein, which is also a molecule with its own unique pattern of charges across its surface. What happens when a positive charge and a negative charge come into contact? They bind. But the charge pattern across the sequence of DNA and the binding domain of the protein have to match in order for them to actually bind.

It’s almost as if each molecule were encoding some kind of charge pattern allowing it to bind…

Indeed, this is the very high-level essence of the genetic code. This molecule encodes every protein that exists in the cell – transcription factors, cell membrane signaling molecules, hell, even the histone proteins that the DNA wraps itself around are encoded. The entire cell is just a crazy complex combination of encodings that determine which molecules bind with which, when, and where. And it’s all based on the fundamental physical properties of atoms.

To be absolutely clear, the complexity of these systems arises through a natural evolutionary process. And there is plenty of evidence to show that complex systems absolutely CAN arise from simple beginnings using an evolutionary mechanism. We have demonstrated this both in living systems and computationally. Google for evidence if you are unsure whether or not to believe me.

So, in summary – why do molecules like transcription factors behave the way they do? Philosophically, that is an irrelevant question. Scientifically, that is THE question that theoretical and experimental physicists are currently trying to answer. Guys like Sheldon from ‘The Big Bang Theory’, or Stephen Hawking in real life, guys like Albert Einstein, etc. How do molecules behave the way they do? Encoded charges spread through the molecule by their fundamental physical properties. Even the behavior of proteins as complex as endonucleases can be deconstructed into sequences of behaviors that arise from their composition and fundamental physical properties.

Pretty neat!