Suppose we want to predict the next word of a sentence.
“Oh, I really love ________”
“The Netherlands is _______”
“The final episode of Game of Thrones was ______”
There are several ways to approach this; let’s discuss three of them:
Statistical models and N-Grams
Naive one-hot encoding
Learned embeddings, such as word2vec
An n-gram is a subsequence of n items from a given sequence.
input = ‘My dog is beautiful and funny’
Suppose we want to predict which word x should follow word y.
We condition the likelihood of x occurring on its context; with bigrams:
P(dog | my) = C(my dog) / C(my)
Basically, P(w_n | w_{n-1}), or more generally, P(w_n | w_1, w_2, …, w_{n-1}).
Similar to Markov models, which also assume that one can predict the future without looking too far into the past!
Read Introduction to N-gram models, by James Pustejovsky, 2015 edition of the Computational Linguistics course at Brandeis University.
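As a concrete illustration of the counting above, here is a minimal Python sketch (the toy corpus and helper name are made up for illustration) that estimates bigram probabilities from raw counts:

```python
from collections import Counter

# A toy corpus; in practice this would be a large text collection.
corpus = [
    "my dog is beautiful and funny",
    "my dog is small",
    "my cat is funny",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """P(word | prev) = C(prev word) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# P(dog | my) = C(my dog) / C(my) = 2 / 3
print(bigram_prob("my", "dog"))
```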
“Suits without Mike and Harvey? Give me a break!”
Suits = [1, 0, 0, 0, 0, 0, 0, 0, 0], without = [0, 1, 0, 0, 0, 0, 0, 0, 0], Mike = [0, 0, 1, 0, 0, 0, 0, 0, 0], …, break = [0, 0, 0, 0, 0, 0, 0, 0, 1]
(But for all the words in all the sentences in our dataset)
Use the one-hot-encoded words to train the model.
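A minimal sketch of what this looks like, assuming PyTorch and a toy vocabulary built from the example sentence (punctuation dropped for simplicity):

```python
import torch
import torch.nn.functional as F

# Toy vocabulary built from the example sentence.
tokens = "Suits without Mike and Harvey give me a break".split()
vocab = {word: idx for idx, word in enumerate(tokens)}

# One row per word; each row has a single 1 at the word's index.
indices = torch.tensor([vocab[w] for w in tokens])
one_hot = F.one_hot(indices, num_classes=len(vocab))

print(one_hot[vocab["Suits"]])  # tensor([1, 0, 0, 0, 0, 0, 0, 0, 0])
print(one_hot[vocab["break"]])  # tensor([0, 0, 0, 0, 0, 0, 0, 0, 1])
```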
Lack of context!
The transformation is not learned via supervision.
Words are represented by ‘random vectors,’ which have no real meaning.
Similar words are not necessarily close to each other.
We transform the words to vectors in a supervised manner.
Words with similar meaning would be grouped together.
(Do the “animal example” on the board)
Canonical example: king - man + woman = queen
Somewhat unintuitively, the model learns not only about words that co-occur, but also about words that never co-occur.
There are many techniques to learn vector representations from words:
Word2Vec (via CBOW and skip-grams)
FastText
…
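A small sketch of the word2vec idea using gensim on a toy corpus (the corpus and hyperparameters are made up for illustration; on such a tiny corpus the analogy query will not reliably return “queen” — meaningful analogies need large training data or pretrained vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: in practice, train on millions of sentences or load pretrained vectors.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "his", "dog"],
    ["the", "woman", "walks", "her", "dog"],
]

# sg=1 selects skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Dense vector for a word, and the classic analogy query king - man + woman:
print(model.wv["king"][:5])
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```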
Really popular in translation tasks, e.g., translating from one language to another.
RNNs (we learned about them before)
Input -> Encoder -> Context -> Decoder -> Output
See the nice examples by Jay Alammar
Attention mechanisms: Attention allows the model to focus on the relevant parts of the input sequence as needed.
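A minimal, illustrative encoder-decoder sketch in PyTorch (vocabulary sizes and dimensions are arbitrary, no attention, training omitted), just to show where the fixed-size context vector sits:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        _, context = self.rnn(self.embed(src))   # context: (1, batch, HID)
        return context                           # the fixed-size "context" vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt, context):             # tgt: (batch, tgt_len)
        output, _ = self.rnn(self.embed(tgt), context)
        return self.out(output)                  # scores over the target vocabulary

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (2, 7))    # two source sentences of length 7
tgt = torch.randint(0, TGT_VOCAB, (2, 5))    # two (shifted) target sentences
logits = decoder(tgt, encoder(src))
print(logits.shape)                          # torch.Size([2, 5, 1200])
```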
A traditional flow when applying NLP in SE:
Given some source code, you:
Transform it into a bag of ‘words.’ Often, the AST tokens.
Pass it through an embedding layer (just use PyTorch’s).
The outputs of the embedding layer are then inputs to the rest of your architecture.
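A minimal sketch of that flow, assuming PyTorch and a hypothetical vocabulary of AST token types:

```python
import torch
import torch.nn as nn

# Hypothetical AST-token vocabulary; a real one comes from parsing a corpus of code.
vocab = {"<unk>": 0, "MethodDeclaration": 1, "Parameter": 2, "IfStatement": 3,
         "ReturnStatement": 4, "MethodInvocation": 5}

def encode(tokens):
    """Map a 'bag of words' of AST tokens to integer indices."""
    return torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)

tokens = ["MethodDeclaration", "Parameter", "IfStatement", "ReturnStatement"]
vectors = embedding(encode(tokens))   # shape: (4, 32)

# `vectors` then feeds into the rest of the architecture (an RNN, attention layer, ...).
print(vectors.shape)
```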
In future lectures: code2vec
When transforming the AST into a ‘list of words,’ can you see a challenge (one that is much less pronounced in natural language)?
The course contents are copyrighted (c) 2018 - onwards by TU Delft and their respective authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.