Code Vectorization Part 1

Code vectorization is somewhat like word vectorization, and it is not a simple task. As far as I know, there are currently no algorithms that can vectorize code for input into machine learning algorithms. People may not think that having a neural network analyze their code would be a good thing, but it probably would be: a neural network might be usable for code optimization, and maybe even language-to-language conversion. That would be quite significant, because currently language conversions have to be done manually. The same could most likely be done at a higher level for framework-to-framework conversion, for example when converting a website from Web Forms to MVC.

While quite a bit of research has been done into word vectorization, there has not (as far as I know) been a lot of research into code vectorization. Since there is so much research into word vectorization, we can build on it and create methods based on those that have been used successfully there, such as the skip-gram model. Code is quite a bit different from the English language, though, and there are many problems that I would have to solve in order to make something like this work. I think I can do it, because building on existing research makes this a lot easier, but I will definitely need to make some changes in order to get code vectors to be generated properly.

Code vs the English language

Code is formatted very differently from the English language. If we assume that we will always take in code that compiles, then all of our code will have proper syntax and be formatted. English text, on the other hand, will most likely never be as formatted as code is. This is one distinction. Another is that in English there are many words that you will want to subsample. This is not really the case with code. While we may need to invent a “placeholder” technique for variable names, we will not need to subsample any terms.

Blocks vs lines vs statements

This is another distinction between code and the English language. In English, words are the most basic unit, and the second most basic unit is the sentence. However, words can carry almost as much meaning as a whole sentence. With code, this is almost the case. With code vectorization, you need to decide whether you want to vectorize statements of code, e.g. int i = 0;, lines of code, e.g. for(int i = 0; 5 > i; i++), or whole blocks of code, e.g. for(int i = 0; 5 > i; i++) { ... }. We also need to decide what constitutes a block, a statement, and a line of code. A statement might cover only simple things such as the creation of a variable, or even just the type of a variable, or it might cover a full for loop header. While all three units could be used, code mostly has to be formatted into blocks, so I think that blocks would be the way to go. However, if we use the largest unit, we are also going to have fairly large vector sizes. It might be better to vectorize only statements, because then instead of having a vector of length 50,000, we can have a vector of length 300. This will most likely be determined by experimenting with several different parameters.
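
To make the choice of unit a little more concrete, here is a minimal Python sketch of one way a C-like snippet could be cut into statement-sized units. Everything in it is an assumption made for illustration: the regular expression is a toy tokenizer and not a real parser, and the rule "split on ;, { or } outside parentheses" is just one possible definition of a statement. Under that rule a for loop header stays together as a single unit.

```python
import re

# Toy tokenizer (an assumption, not a real C parser): identifiers, integers,
# a few two-character operators, then any other single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[<>=!]=|\S")


def tokenize(source):
    """Split source text into a flat list of token strings."""
    return TOKEN_RE.findall(source)


def split_statements(tokens):
    """Group tokens into units, ending a unit at ';', '{' or '}'
    whenever we are not inside parentheses."""
    units, current, depth = [], [], 0
    for tok in tokens:
        current.append(tok)
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        elif tok in (";", "{", "}") and depth == 0:
            units.append(current)
            current = []
    if current:  # keep any trailing tokens that never hit a terminator
        units.append(current)
    return units


source = "int i = 0; for(int i = 0; 5 > i; i++) { x = x + i; }"
for unit in split_statements(tokenize(source)):
    print(" ".join(unit))
```

Vectorizing whole blocks instead would mean matching each { with its closing } and treating everything in between as one unit, which is where the much larger vocabulary would come from.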

The skip-gram model

The skip-gram model is a popular model for generating word vectors, and we may be able to apply a similar technique to generate code vectors. The model works by training a neural network to predict which words surround a given word. The network is “fake” in the sense that it is only fed one-hot vectors in order to obtain a trained weight matrix. That weight matrix is then taken out of the network, and the word vectors are obtained from it. This model is very effective, so it is used quite commonly. If we can obtain good word vectors for the English language by predicting surrounding words, we may be able to get good code vectors by predicting surrounding code units (blocks, lines, or statements). However, experimentation is needed.
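
As a rough illustration of what this could look like for code, the Python sketch below builds the (center, context) training pairs that a skip-gram style network would be trained on, using whole statements as the units. It also shows why the weight matrix can be read as a table of vectors: multiplying a one-hot vector by the matrix simply selects one row. The function names, the window size of 1, and the tiny weight matrix are all made up for the example, and no training happens here.

```python
def build_vocab(units):
    """Assign each distinct code unit an integer index, in order of first use."""
    vocab = {}
    for unit in units:
        if unit not in vocab:
            vocab[unit] = len(vocab)
    return vocab


def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def skip_gram_pairs(units, window=1):
    """Pair every unit with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(units):
        lo = max(0, i - window)
        hi = min(len(units), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, units[j]))
    return pairs


def project(one_hot_vec, weights):
    """Multiply a one-hot row vector by the weight matrix.
    Because only one entry is 1, this just picks out one row."""
    rows, cols = len(weights), len(weights[0])
    return [sum(one_hot_vec[r] * weights[r][c] for r in range(rows))
            for c in range(cols)]


units = ["int i = 0;", "i++;", "return i;"]
vocab = build_vocab(units)
pairs = skip_gram_pairs(units, window=1)

# Made-up 3x2 weight matrix: one row (one vector) per code unit.
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
vector = project(one_hot(vocab["i++;"], len(vocab)), weights)
```

In this sketch the vector for a unit is just its row in the weight matrix, so once the network is trained the multiplication can be replaced by a plain lookup.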


One thing that the authors of the skip-gram model proposed was subsampling of frequent words. This meant that words like “the” would be subsampled heavily to create better word vectors for rarer words. While subsampling will probably not be needed when creating code vectors, we will probably need to use placeholders. Placeholders would stand in for variable values and names. This would make the system more robust, because we could train our code vectorization software to recognize all integers and not just integers with one specific name. These placeholders may or may not be needed, and again experimentation is needed.
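
Here is a small Python sketch of what the placeholder idea might look like in practice. It is only one possible scheme and every detail is an assumption: the keyword list is deliberately incomplete, the tokenizer is a toy regular expression, and the placeholder spellings <VAR0> and <NUM> are invented for the example.

```python
import re

# Assumed, incomplete keyword list; anything else that looks like an
# identifier is treated as a user-chosen name.
KEYWORDS = {"int", "float", "void", "for", "while", "if", "else", "return"}

# Toy tokenizer: identifiers, integers, then any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(statement):
    """Replace variable names and integer literals with placeholders.
    Names are numbered by first appearance within the statement."""
    names = {}
    out = []
    for tok in TOKEN_RE.findall(statement):
        if tok.isdigit():
            out.append("<NUM>")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            if tok not in names:
                names[tok] = "<VAR%d>" % len(names)
            out.append(names[tok])
        else:
            out.append(tok)
    return " ".join(out)


a = normalize("int count = 0;")
b = normalize("int total = 42;")
c = normalize("x = x + y;")
```

With this scheme the first two statements normalize to the same string, so they would share a single vector, while the third keeps the information that the same variable appears on both sides of the assignment.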


These are some of my ideas for making code vectorization a reality. In order to work on this, I will need to implement a neural network that supports a softmax activation layer and can be trained with either PSO (Particle Swarm Optimization) or backpropagation. I can then experiment with the different ideas that I have listed here.
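
For reference, the softmax layer mentioned above turns a list of raw output scores into probabilities that sum to 1. The Python sketch below is a plain standalone version of that function, not part of any finished network; subtracting the largest score before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)  # shift so the biggest exponent is exp(0) = 1
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
```

In a skip-gram style network, each score would correspond to one code unit in the vocabulary, and the resulting probabilities would be compared against the one-hot vector of the unit that actually appeared nearby.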