Suppose we want to predict the next word of a sentence.
“Oh, I really love ________”
“The Netherlands is _______”
“The final episode of Game of Thrones was ______”
There are several ways to approach this; let’s discuss three of them:
Statistical models and N-Grams
Naive one-hot encoding
Learned embeddings, such as word2vec
An n-gram is a subsequence of n items from a given sequence.
input = ‘My dog is beautiful and funny’
Suppose we want to predict which word x should follow word y.
We condition the likelihood of x occurring on its context; with bigrams:
P(dog | my) = C(my dog) / C(my)
Basically, P(w_n | w_{n-1}), or more generally, P(w_n | w_1, w_2, …, w_{n-1}).
Similar to Markov models, which also assume that one can predict the future without looking too far into the past!
Read Introduction to N-gram models, by James Pustejovsky, 2015 edition of the Computational Linguistics course at Brandeis University.
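As a concrete illustration of the counting above, here is a minimal Python sketch (the toy corpus and helper name are made up for illustration) that estimates bigram probabilities from raw counts:

```python
from collections import Counter

# A toy corpus; in practice this would be a large text collection.
corpus = [
    "my dog is beautiful and funny",
    "my dog is small",
    "my cat is funny",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """P(word | prev) = C(prev word) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# P(dog | my) = C(my dog) / C(my) = 2 / 3
print(bigram_prob("my", "dog"))
```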
“Suits without Mike and Harvey? Give me a break!”
Suits = [1, 0, 0, 0, 0, 0, 0, 0, 0], without = [0, 1, 0, 0, 0, 0, 0, 0, 0], Mike = [0, 0, 1, 0, 0, 0, 0, 0, 0], …, break = [0, 0, 0, 0, 0, 0, 0, 0, 1]
(But for all the words in all the sentences in our dataset)
Use the one-hot-encoded words to train the model.
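A minimal sketch of what this looks like, assuming PyTorch and a toy vocabulary built from the example sentence (punctuation dropped for simplicity):

```python
import torch
import torch.nn.functional as F

# Toy vocabulary built from the example sentence.
tokens = "Suits without Mike and Harvey give me a break".split()
vocab = {word: idx for idx, word in enumerate(tokens)}

# One row per word; each row has a single 1 at the word's index.
indices = torch.tensor([vocab[w] for w in tokens])
one_hot = F.one_hot(indices, num_classes=len(vocab))

print(one_hot[vocab["Suits"]])  # tensor([1, 0, 0, 0, 0, 0, 0, 0, 0])
print(one_hot[vocab["break"]])  # tensor([0, 0, 0, 0, 0, 0, 0, 0, 1])
```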
Lack of context!
The transformation is not learned via supervision.
Words are represented by ‘random vectors,’ which have no real meaning.
Similar words are not necessarily close to each other.
We transform the words to vectors in a supervised manner.
Words with similar meaning would be grouped together.
(Do the “animal example” on the board)
Canonical example: king - man + woman = queen
Somewhat unintuitively, the model learns not only about words that co-occur, but also about words that never co-occur.
There are many techniques to learn vector representations from words:
Word2Vec (via CBOW and skip-grams)
FastText
…
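A small sketch of the word2vec idea using gensim on a toy corpus (the corpus and hyperparameters are made up for illustration; on such a tiny corpus the analogy query will not reliably return “queen” — meaningful analogies need large training data or pretrained vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: in practice, train on millions of sentences or load pretrained vectors.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "his", "dog"],
    ["the", "woman", "walks", "her", "dog"],
]

# sg=1 selects skip-gram; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Dense vector for a word, and the classic analogy query king - man + woman:
print(model.wv["king"][:5])
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```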
Really popular in translation tasks, e.g., translating from one language to another.
RNNs (we learned about them before)
Input -> Encoder -> Context -> Decoder -> Output
See the nice examples by Jay Alammar
Attention mechanisms: Attention allows the model to focus on the relevant parts of the input sequence as needed.
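A minimal, illustrative encoder-decoder sketch in PyTorch (vocabulary sizes and dimensions are arbitrary, no attention, training omitted), just to show where the fixed-size context vector sits:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        _, context = self.rnn(self.embed(src))   # context: (1, batch, HID)
        return context                           # the fixed-size "context" vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt, context):             # tgt: (batch, tgt_len)
        output, _ = self.rnn(self.embed(tgt), context)
        return self.out(output)                  # scores over the target vocabulary

encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (2, 7))    # two source sentences of length 7
tgt = torch.randint(0, TGT_VOCAB, (2, 5))    # two (shifted) target sentences
logits = decoder(tgt, encoder(src))
print(logits.shape)                          # torch.Size([2, 5, 1200])
```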
A traditional flow when applying NLP in SE:
Given some source code, you:
Transform it into a bag of ‘words.’ Often, the AST tokens.
Pass it through an embedding layer (just use PyTorch’s).
The outputs of the embedding layer are then inputs to the rest of your architecture.
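A minimal sketch of that flow, assuming PyTorch and a hypothetical vocabulary of AST token types:

```python
import torch
import torch.nn as nn

# Hypothetical AST-token vocabulary; a real one comes from parsing a corpus of code.
vocab = {"<unk>": 0, "MethodDeclaration": 1, "Parameter": 2, "IfStatement": 3,
         "ReturnStatement": 4, "MethodInvocation": 5}

def encode(tokens):
    """Map a 'bag of words' of AST tokens to integer indices."""
    return torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)

tokens = ["MethodDeclaration", "Parameter", "IfStatement", "ReturnStatement"]
vectors = embedding(encode(tokens))   # shape: (4, 32)

# `vectors` then feeds into the rest of the architecture (an RNN, attention layer, ...).
print(vectors.shape)
```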
In future lectures: code2vec
When transforming the AST into a ‘list of words,’ can you see a challenge (one that is much less pronounced in natural language)?
The course contents are copyrighted (c) 2018 - onwards by TU Delft and their respective authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.