In this example, we will analyze movie discussions and build a simple synonym engine. Our engine is based on Word2Vec, a family of shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. In essence, Word2Vec attempts to capture the meaning of words and the semantic relationships between them.
We will be using the Spark machine learning package to implement our synonym service. Spark machine learning comes in two flavours: the original MLlib API, which is RDD-based (org.apache.spark.mllib), and the newer SparkML API, which is DataFrame-based (org.apache.spark.ml).
Since Spark 2.0, MLlib has been in maintenance mode, meaning that no new features are implemented for it; for new projects it should therefore be avoided. That said, some MLlib features are yet to be ported to SparkML, and the documentation is still more complete for MLlib.
For the remainder of the tutorial, we will be using the SparkML variant.
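To make the distinction concrete, here is a minimal sketch of the two namespaces side by side; the renames are only there to let the two classes coexist in one scope:
// RDD-based API (in maintenance mode since Spark 2.0)
import org.apache.spark.mllib.feature.{Word2Vec => MLlibWord2Vec}
// DataFrame-based API, the one used throughout this tutorial
import org.apache.spark.ml.feature.{Word2Vec => SparkMLWord2Vec}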
The dataset we will be using comes from Kaggle; the full dataset is available at this location.
We load the data as an RDD. As the data contains HTML markup, we need to strip it out; we also remove punctuation marks and lowercase all words. This keeps our input vocabulary, and therefore the model Word2Vec needs to learn, much smaller.
val path = "../datasets/imdb.csv"
val data = sc.textFile(path).
  // Remove HTML tags (the regex spares anchor tags), string escapes and punctuation
  map(w => w.replaceAll("""<(?!\/?a(?=>|\s.*>))\/?.*>""", "")).
  map(w => w.replaceAll("""[\…\”\'\’\`\,\(\)\"\\]""", "")).
  // Make lowercase
  map(w => w.trim.toLowerCase).
  // Word2Vec works at the sentence level, so split on sentence delimiters
  flatMap(c => c.split("[.?!;:]"))
Let's check what our raw data looks like:
data.take(3).foreach(l => println(" R:" + l))
R:x
R:jennifer ehle was sparkling in pride and prejudice
R: jeremy northam was simply wonderful in the winslow boy
Since SparkML is based on DataFrames, we need to convert our source RDD to a suitable DataFrame. To do so, we first create a schema, consisting of a sequence of fields that contain Arrays of Strings :-)
Remember that Word2Vec treats text as a bag of words; a bag-of-words representation on a computer is simply an Array of Strings.
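As a toy illustration (reusing a sentence from our sample output above), a bag of words is nothing more than the result of a split:
// One "document" as a bag of words: ordering and grammar are discarded,
// only the tokens themselves are kept
val sentence = "jennifer ehle was sparkling in pride and prejudice"
val bagOfWords: Array[String] = sentence.split(" ")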
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Convert data from RDD[String] to a DataFrame with a single Array[String] column
val schema = StructType(Seq(StructField("text", ArrayType(StringType, true), true)))
var documentDF = spark.createDataFrame(data.map(r => Row(r.split(" "))), schema)
documentDF.take(2).foreach(println)
[WrappedArray(x)]
[WrappedArray(jennifer, ehle, was, sparkling, in, pride, and, prejudice)]
In the DataFrame above, we have lots of words that keep repeating: think, for example, of articles ('a', 'the'), prepositions ('at', 'on', 'in') and so on. Those words do not add much information to our dataset. You can get an intuitive understanding of this fact by trying to remove those words from everyday sentences: for example, "a cat is under the table" can be reduced to "cat is under table", or even to "cat is table", and you still get the idea.
To increase the information density of our vectors, we can remove stopwords with the StopWordsRemover transformer. We do so in a non-destructive manner: we add a new column to our DataFrame in which the contents of our input text have been processed to remove stopwords.
import org.apache.spark.ml.feature.StopWordsRemover
// Remove stop words into a new column, keeping the original text intact
val stopWordsRemover = new StopWordsRemover().
  setInputCol("text").
  setOutputCol("nostopwords")
documentDF = stopWordsRemover.transform(documentDF)
documentDF.take(2).foreach(println)
[WrappedArray(x),WrappedArray(x)]
[WrappedArray(jennifer, ehle, was, sparkling, in, pride, and, prejudice),WrappedArray(jennifer, ehle, sparkling, pride, prejudice)]
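As a side note, if we wanted to see exactly which words get dropped, or to extend the list with dataset-specific noise, StopWordsRemover lets us load the default per-language lists and supply our own. A small sketch; the extra tokens "x" and "br" are just hypothetical examples of noise one might find in this dataset:
// Inspect the default English stop word list
val defaults = StopWordsRemover.loadDefaultStopWords("english")
println(defaults.take(10).mkString(", "))
// A customized remover that also drops some dataset-specific tokens
val customRemover = new StopWordsRemover().
  setInputCol("text").
  setOutputCol("nostopwords").
  setStopWords(defaults ++ Array("x", "br"))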
We are now ready to train our model! To do so, we set the vector size parameter to 200; this means that each word in our vocabulary will be represented by a 200-dimensional vector.
To exclude the long tail of words that do not appear frequently, we also drop words with fewer than 10 appearances in our dataset, using the minCount parameter.
import org.apache.spark.ml.feature.Word2Vec
// Learn a mapping from words to vectors, training on the stopword-filtered column
val word2Vec = new Word2Vec().
  setInputCol("nostopwords").
  setOutputCol("result").
  setVectorSize(200).
  setMinCount(10)
val model = word2Vec.fit(documentDF)
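Before we start querying, we can sanity-check what was learned; the fitted Word2VecModel exposes its vocabulary and vectors as a DataFrame through getVectors:
// Each row holds a word and its learned 200-dimensional vector
model.getVectors.show(3)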
Out of the box, the Word2Vec API only allows us to look up related terms for a single word. Let's give it a try:
// Find synonyms for a single word
model.findSynonyms("pitt", 10).collect.foreach(println)
[dourif,0.6937394142150879]
[shin,0.6449180841445923]
[reservoir,0.6232109665870667]
[driver,0.5692991614341736]
[neve,0.5690980553627014]
[garrett,0.5564441084861755]
[freeman,0.5445412397384644]
[pitts,0.5419933199882507]
[clooney,0.5413327217102051]
[brad,0.5296022891998291]
What we see is that Word2Vec actually managed to uncover some related terms, given a popular name in the dataset. What is more interesting, however, is to see whether we can extract meaningful terms for a whole phrase. For this, we need to use Word2Vec's findSynonyms(vec: Vector, num: Int) function.
To do so, we first define a function toDF that converts an input string to a form suitable for searching; it basically just tokenizes the input string and converts it to a Spark DataFrame (hence the name).
def toDF(s: String) =
  spark.createDataFrame(Seq(
    s.trim.
      toLowerCase.
      split(" ")
  ).map(Tuple1.apply)).
  // The column name must match the model's input column
  toDF("nostopwords")
toDF("James Bond").collect.foreach(println)
[WrappedArray(james, bond)]
We then call the model's transform method on the created DataFrame; this converts our query into a vector representation built from the same vocabulary as our corpus.
val q = model.transform(toDF("James Bond"))
q.printSchema
root
 |-- nostopwords: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- result: vector (nullable = true)
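To confirm that the query was embedded in the same 200-dimensional space as our corpus, we can pull the vector out of the result column of the q DataFrame computed above:
// The averaged query vector; its size matches setVectorSize above
val queryVec = q.first.getAs[Vector]("result")
println(queryVec.size)  // 200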
To automate the steps above, we create a method that takes a query (as a String) and prints the 10 most relevant terms in our model, excluding terms that are part of the query itself.
def query(s: String) = {
  val q = model.transform(toDF(s))
  val qTokens = s.toLowerCase.split(" ")
  model.
    findSynonyms(q.first.getAs[Vector]("result"), 10).
    // Do not report terms that appear in the query itself
    filter(r => !qTokens.contains(r(0))).
    collect.
    foreach(println)
}
query("Movie")
[film,0.8394587635993958]
[flick,0.5519292950630188]
[documentary,0.5171377062797546]
[cartoon,0.49328547716140747]
[mini-series,0.48773887753486633]
[show,0.4875704050064087]
[flic,0.48537948727607727]
[picture,0.4752477705478668]
[stinker,0.4617299735546112]
query("brad pitt is a great actor")
[lew,0.6632834672927856]
[freeman,0.6589218378067017]
[shue,0.6399856209754944]
[morrow,0.6370916366577148]
[schoelen,0.6293715238571167]
[suchet,0.6270862221717834]
[elisabeth,0.6265118718147278]
[anita,0.625586748123169]
[mathis,0.6212812662124634]
[goldman,0.6208907961845398]
One of the nice side effects of being able to uncover latent meaning with tools like Word2Vec is that we can also solve analogy problems. In the original Word2Vec paper, the authors show that, when trained on a sufficiently large corpus (billions of words), Word2Vec models can uncover relationships such as:
v(king) - v(man) + v(woman) =~ v(queen)
or, otherwise put: man is to king what woman is to queen (the underlying relationship being gender). This works simply by performing algebraic vector operations on the transformed vector representations of words.
To check whether our model can uncover such relationships as well, we first implement a few simple vector operations.
import org.apache.spark.ml.linalg.DenseVector
import math._
// Element-wise difference of two vectors
def vectorDiff(xs: Vector, ys: Vector): Vector =
  new DenseVector((xs.toArray zip ys.toArray).map { case (x, y) => x - y })
// Euclidean distance between two vectors
def vectorDistance(xs: Vector, ys: Vector) =
  sqrt((xs.toArray zip ys.toArray).map { case (x, y) => pow(y - x, 2) }.sum)
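A quick toy check of the two helpers on small, hand-made vectors:
val a = new DenseVector(Array(1.0, 2.0))
val b = new DenseVector(Array(0.0, 2.0))
println(vectorDiff(a, b))      // [1.0,0.0]
println(vectorDistance(a, b))  // 1.0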
Then, we implement our analogy function: it computes the difference vector of each word pair and prints the Euclidean distance between the two differences; the smaller the distance, the better the analogy holds.
def analogy(x: String, isToY: String, likeZ: String, isToA: String): Unit = {
  val q = model.transform(toDF(x))
  val w = model.transform(toDF(isToY))
  val m = model.transform(toDF(likeZ))
  val k = model.transform(toDF(isToA))
  // "x is to isToY like likeZ is to isToA": v(x) - v(isToY) should be close to v(likeZ) - v(isToA)
  val left = vectorDiff(q.first.getAs[Vector]("result"), w.first.getAs[Vector]("result"))
  val right = vectorDiff(m.first.getAs[Vector]("result"), k.first.getAs[Vector]("result"))
  println(vectorDistance(left, right))
}
analogy("king","man","queen","woman")
4.755026306703955
analogy("soldier","army","sailor","navy")
2.109409555544533
analogy("Athens","Greece","Paris","France")
2.082910303399422
analogy("brother","sister","grandson","grandaughter")
1.7550715431266148
// The dataset is from the mid-00s :-)
analogy("brad pitt","angelina jolie","Leonardo DiCaprio", "Gisele Bundchen")
1.5404416252080781
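As a closing sketch, instead of comparing distances we can also query the model directly with a composed vector, in the spirit of v(king) - v(man) + v(woman). The vectorAdd and wordVec helpers below are our own additions, and whether "queen" actually surfaces among the neighbours depends on the corpus:
// Element-wise sum, the counterpart of vectorDiff above
def vectorAdd(xs: Vector, ys: Vector): Vector =
  new DenseVector((xs.toArray zip ys.toArray).map { case (x, y) => x + y })
// Shorthand: the model's vector for a query string
def wordVec(s: String): Vector = model.transform(toDF(s)).first.getAs[Vector]("result")
// v(king) - v(man) + v(woman): ideally "queen" shows up among the neighbours
val composed = vectorAdd(vectorDiff(wordVec("king"), wordVec("man")), wordVec("woman"))
model.findSynonyms(composed, 5).collect.foreach(println)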