Imaginary Problem

Suppose we want to predict the next word of a sentence.

How can we model the problem?

Let’s discuss three approaches:

N-Grams

An n-gram is a subsequence of n items from a given sequence.

input = ‘My dog is beautiful and funny’
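As a minimal sketch of the definition above, the n-grams of the example input can be enumerated by sliding a window of size n over the words. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration only.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "My dog is beautiful and funny"
tokens = sentence.split()  # naive whitespace tokenization (assumption)

bigrams = ngrams(tokens, 2)   # pairs of adjacent words
trigrams = ngrams(tokens, 3)  # triples of adjacent words

print(bigrams)
print(trigrams)
```

A sentence with 6 words yields 5 bigrams and 4 trigrams, since each window must fit entirely inside the sequence.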

N-Grams are Markov Chain models

Suppose we want to predict which word x should follow word y.
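One way to sketch this is a bigram model: count how often each word x follows each word y, then estimate P(x | y) from those counts. The toy corpus and the function names `predict_next` and `next_word_probability` below are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, not taken from any real dataset.
corpus = [
    "my dog is beautiful and funny",
    "my dog is funny",
    "my cat is beautiful",
]

# follow_counts[y][x] = number of times word x directly follows word y.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for y, x in zip(words, words[1:]):
        follow_counts[y][x] += 1


def predict_next(y):
    """Return the word seen most often after y, or None if y was never followed."""
    counts = follow_counts.get(y)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def next_word_probability(y, x):
    """Estimate P(x | y) as count(y followed by x) / count(y followed by anything)."""
    counts = follow_counts.get(y)
    if not counts:
        return 0.0
    return counts[x] / sum(counts.values())


print(predict_next("my"))
print(next_word_probability("my", "dog"))
```

The prediction depends only on the previous word, which is the Markov assumption in its simplest form; a longer history would require counting trigrams or higher-order n-grams.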

One-hot encoding
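A minimal sketch of one-hot encoding, assuming the vocabulary is just the lowercased words of the earlier example sentence. The names `vocabulary`, `index`, and `one_hot` are illustrative assumptions.

```python
sentence = "My dog is beautiful and funny"

# Build a vocabulary and give every word a fixed position.
vocabulary = sorted(set(sentence.lower().split()))
index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's position."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector


print(vocabulary)
print(one_hot("dog"))
```

Each vector is as long as the vocabulary itself, and every word gets exactly one non-zero entry.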

The problem?

Embeddings

Example

Vector Representation of words. Source: https://www.tensorflow.org/tutorials/representation/word2vec
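To illustrate how such vectors can be compared, here is a small sketch using cosine similarity. The three-dimensional vectors are hand-made, hypothetical values for illustration only; they are not trained embeddings and do not come from the source above.

```python
import math

# Hypothetical hand-made vectors, NOT trained embeddings.
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
dog_car = cosine_similarity(embeddings["dog"], embeddings["car"])
print(dog_cat, dog_car)
```

With these made-up values, "dog" ends up closer to "cat" than to "car"; in a trained model, that kind of closeness is learned from data rather than set by hand.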

Word2Vec and fastText

Seq2Seq

For you, in practice

A traditional flow when applying NLP in SE:

Given some source code, you:

Discussion

When transforming the AST into a ‘list of words,’ can you see a challenge (that does not happen so intensively in natural language)?
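To make the question concrete, here is a minimal sketch of flattening an AST into a ‘list of words’, assuming Python source code and Python's built-in `ast` module. The function name `ast_to_words`, the sample function `add`, and the choice of emitting node type names plus identifiers are all assumptions for illustration; other flattening strategies are possible.

```python
import ast

# Hypothetical sample program used only for illustration.
source = """
def add(first_number, second_number):
    return first_number + second_number
"""


def ast_to_words(code):
    """Walk the AST and emit node type names plus any identifiers found."""
    words = []
    for node in ast.walk(ast.parse(code)):
        words.append(type(node).__name__)
        if isinstance(node, ast.FunctionDef):
            words.append(node.name)   # function name
        elif isinstance(node, ast.arg):
            words.append(node.arg)    # parameter name
        elif isinstance(node, ast.Name):
            words.append(node.id)     # variable reference
    return words


words = ast_to_words(source)
print(words)
```

Looking at the printed list, compare the words that come from the language itself (node types such as `FunctionDef` or `BinOp`) with the words chosen freely by the developer.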

Bibliography