In the context of some of the Twitter research I’ve been doing, I decided to try out a few natural language processing (NLP) techniques. So far, word2vec has produced perhaps the most meaningful results. Wikipedia describes word2vec very precisely:
“Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.”
During the two weeks leading up to the January 2018 Finnish presidential elections, I performed an analysis of user interactions and behavior on Twitter, based on search terms relevant to that event. During the course of that analysis, I also dumped each Tweet’s raw text field to a text file, one item per line. I then wrote a small tool designed to preprocess the collected Tweets, feed that processed data into word2vec, and finally output some visualizations. Since word2vec creates multidimensional tensors, I’m using T-SNE for dimensionality reduction (the resulting visualizations are in two dimensions, compared to the 200 dimensions of the original data.)
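Before we get to the full tool, here's a minimal, self-contained sketch of the core idea, using a tiny made-up corpus rather than the election dataset. The parameter names follow the older gensim API used throughout this post (newer gensim releases renamed size to vector_size and iter to epochs):

import gensim.models.word2vec as w2v

# A tiny, made-up corpus: each "sentence" is already a list of tokens.
toy_corpus = [
  ["presidentti", "voitti", "vaalit"],
  ["ehdokas", "voitti", "vaalit"],
  ["presidentti", "piti", "puheen"],
  ["ehdokas", "piti", "puheen"],
]

# Train a small skip-gram model; each word gets a 10-dimensional vector.
model = w2v.Word2Vec(toy_corpus, sg=1, seed=1, size=10, min_count=1, window=2, iter=50)

# Words that appear in similar contexts end up with similar vectors.
print(model.wv["presidentti"].shape)
print(model.wv.most_similar("presidentti", topn=2))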
The rest of this blog post will be devoted to listing and explaining the code used to perform these tasks. I’ll present the code as it appears in the tool. The code starts with a set of functions that perform processing and visualization tasks. The main routine at the end wraps everything up by calling each routine sequentially, passing artifacts from the previous step to the next one. As such, you can copy-paste each section of code into an editor, save the resulting file, and the tool should run (assuming you’ve pip installed all dependencies.) Note that I’m using two spaces per indent purely to allow the code to format neatly in this blog. Let’s start, as always, with importing dependencies. Off the top of my head, you’ll probably want to install tensorflow, gensim, six, numpy, matplotlib, and sklearn (although I think some of these install as part of tensorflow’s installation).
# -*- coding: utf-8 -*-
from tensorflow.contrib.tensorboard.plugins import projector
from sklearn.manifold import TSNE
from collections import Counter
from six.moves import cPickle

import gensim.models.word2vec as w2v
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import multiprocessing
import os
import sys
import io
import re
import json
The next listing contains a few helper functions. In each processing step, I like to save the output. I do this for two reasons. Firstly, depending on the size of your raw data, each step can take some time. Hence, if you’ve performed a step once and saved the output, it can be loaded from disk on subsequent passes to save time. The second reason for saving each step is so that you can examine the output and check that it looks like what you want. The try_load_or_process() function attempts to load the previously saved output of a function; if it doesn’t exist, it runs the function and then saves the output. Note also the rather odd-looking implementation of save_json(). This is a workaround for the fact that json.dump() errors out on certain non-ASCII characters when paired with io.open().
def try_load_or_process(filename, processor_fn, function_arg):
  load_fn = None
  save_fn = None
  if filename.endswith("json"):
    load_fn = load_json
    save_fn = save_json
  else:
    load_fn = load_bin
    save_fn = save_bin
  if os.path.exists(filename):
    return load_fn(filename)
  else:
    ret = processor_fn(function_arg)
    save_fn(ret, filename)
    return ret

def print_progress(current, maximum):
  sys.stdout.write("\r")
  sys.stdout.flush()
  sys.stdout.write(str(current) + "/" + str(maximum))
  sys.stdout.flush()

def save_bin(item, filename):
  with open(filename, "wb") as f:
    cPickle.dump(item, f)

def load_bin(filename):
  if os.path.exists(filename):
    with open(filename, "rb") as f:
      return cPickle.load(f)

def save_json(variable, filename):
  with io.open(filename, "w", encoding="utf-8") as f:
    f.write(unicode(json.dumps(variable, indent=4, ensure_ascii=False)))

def load_json(filename):
  ret = None
  if os.path.exists(filename):
    try:
      with io.open(filename, "r", encoding="utf-8") as f:
        ret = json.load(f)
    except:
      pass
  return ret
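As a quick illustration of how these helpers are meant to be used, here’s a short sketch (the file names simply mirror those used in the main routine at the end of this post):

# First run: process_raw_data() is executed and its output cached to
# analysis/data.json. Subsequent runs just reload the cached JSON from disk.
filename = os.path.join("analysis", "data.json")
processed = try_load_or_process(filename, process_raw_data, "data/tweets.txt")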
Moving on, let’s look at the first preprocessing step. This function takes the raw text strings dumped from Tweets, removes unwanted characters and features (such as user names and URLs), removes duplicates, and returns a list of sanitized strings. Here, I’m not using string.printable for the list of characters to keep, since Finnish includes additional letters that aren’t part of the English alphabet (äöåÄÖÅ). The regular expressions used in this step have been somewhat tailored to the raw input data, so you may need to tweak them for your own input corpus.
def process_raw_data(input_file):
  valid = u"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ#@.:/ äöåÄÖÅ"
  url_match = "(https?:\/\/[0-9a-zA-Z\-\_]+\.[\-\_0-9a-zA-Z]+\.?[0-9a-zA-Z\-\_]*\/?.*)"
  name_match = "\@[\_0-9a-zA-Z]+\:?"
  lines = []
  print("Loading raw data from: " + input_file)
  if os.path.exists(input_file):
    with io.open(input_file, 'r', encoding="utf-8") as f:
      lines = f.readlines()
  num_lines = len(lines)
  ret = []
  for count, text in enumerate(lines):
    if count % 50 == 0:
      print_progress(count, num_lines)
    text = re.sub(url_match, u"", text)
    text = re.sub(name_match, u"", text)
    text = re.sub("\&\;?", u"", text)
    text = re.sub("[\:\.]{1,}$", u"", text)
    text = re.sub("^RT\:?", u"", text)
    text = u''.join(x for x in text if x in valid)
    text = text.strip()
    if len(text.split()) > 5:
      if text not in ret:
        ret.append(text)
  return ret
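To illustrate what this step does, here’s a made-up example line (not taken from the actual dataset):

# Hypothetical raw Tweet text, before cleaning:
raw = u"RT: @someuser Nyt on hyvä hetki käydä äänestämässä ennakkoon https://example.com/vaalit\n"
# After stripping the URL, the @-mention and the leading "RT:", and filtering
# through the "valid" character whitelist, the sanitized string is roughly:
#   u"Nyt on hyvä hetki käydä äänestämässä ennakkoon"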
The next step is to tokenize each sentence (or Tweet) into words.
def tokenize_sentences(sentences):
  ret = []
  max_s = len(sentences)
  print("Got " + str(max_s) + " sentences.")
  for count, s in enumerate(sentences):
    tokens = []
    words = re.split(r'(\s+)', s)
    if len(words) > 0:
      for w in words:
        if w is not None:
          w = w.strip()
          w = w.lower()
          if w.isspace() or w == "\n" or w == "\r":
            w = None
          if len(w) < 1:
            w = None
          if w is not None:
            tokens.append(w)
    if len(tokens) > 0:
      ret.append(tokens)
    if count % 50 == 0:
      print_progress(count, max_s)
  return ret
The final text preprocessing step removes unwanted tokens. This includes numeric data and stop words. Stop words are the most common words in a language. We omit them from processing in order to bring out the meaning of the text in our analysis. I downloaded a json dump of stop words for all languages from here, and placed it in the same directory as this script. If you plan on trying this code out yourself, you’ll need to perform the same steps. Note that I included extra stopwords of my own. After looking at the output of this step, I noticed that Twitter’s truncation of some tweets caused certain word fragments to occur frequently.
def clean_sentences(tokens):
  all_stopwords = load_json("stopwords-iso.json")
  extra_stopwords = ["ssä", "lle", "h.", "oo", "on", "muk", "kov", "km", "ia", "täm", "sy", "but", ":sta", "hi", "py", "xd", "rr", "x:", "smg", "kum", "uut", "kho", "k", "04n", "vtt", "htt", "väy", "kin", "#8", "van", "tii", "lt3", "g", "ko", "ett", "mys", "tnn", "hyv", "tm", "mit", "tss", "siit", "pit", "viel", "sit", "n", "saa", "tll", "eik", "nin", "nii", "t", "tmn", "lsn", "j", "miss", "pivn", "yhn", "mik", "tn", "tt", "sek", "lis", "mist", "tehd", "sai", "l", "thn", "mm", "k", "ku", "s", "hn", "nit", "s", "no", "m", "ky", "tst", "mut", "nm", "y", "lpi", "siin", "a", "in", "ehk", "h", "e", "piv", "oy", "p", "yh", "sill", "min", "o", "va", "el", "tyn", "na", "the", "tit", "to", "iti", "tehdn", "tlt", "ois", ":", "v", "?", "!", "&"]
  stopwords = None
  if all_stopwords is not None:
    stopwords = all_stopwords["fi"]
    stopwords += extra_stopwords
  ret = []
  max_s = len(tokens)
  for count, sentence in enumerate(tokens):
    if count % 50 == 0:
      print_progress(count, max_s)
    cleaned = []
    for token in sentence:
      if len(token) > 0:
        if stopwords is not None:
          for s in stopwords:
            if token == s:
              token = None
        if token is not None:
          if re.search("^[0-9\.\-\s\/]+$", token):
            token = None
        if token is not None:
          cleaned.append(token)
    if len(cleaned) > 0:
      ret.append(cleaned)
  return ret
The next function creates a vocabulary from the processed text. A vocabulary, in this context, is basically a list of all unique tokens in the data. This function creates a frequency distribution of all tokens (words) by counting the number of occurrences of each token. We will use this later to “trim” the vocabulary down to a manageable size.
def get_word_frequencies(corpus):
  frequencies = Counter()
  for sentence in corpus:
    for word in sentence:
      frequencies[word] += 1
  freq = frequencies.most_common()
  return freq
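The returned value is a list of (token, count) tuples sorted by descending count. For example, it might look something like this (the words and counts here are made up, not actual results):

frequencies = get_word_frequencies(cleaned)
# Hypothetical contents, most frequent tokens first:
#   [(u"vaalit", 1500), (u"presidentti", 1200), (u"suomi", 950), ...]
# The vocabulary-trimming step near the end of the post keeps item[0] for
# every entry whose count item[1] meets a minimum-frequency threshold.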
Now that we’re done with all the preprocessing steps, let’s get into the more interesting analysis functions. The following function accepts the tokenized and cleaned data generated by the steps above, and uses it to train a word2vec model. The num_features parameter sets the number of features each word is assigned (and hence the dimensionality of the resulting tensor). It is recommended to set it between 100 and 1000. Naturally, larger values take more processing power and memory/disk space to handle. I found 200 to be enough, but I normally start with a value of 300 when looking at new datasets. The min_count variable passed to word2vec designates how to trim the vocabulary. For example, if min_count is set to 3, all words that appear in the data set fewer than 3 times will be discarded from the vocabulary used when training the word2vec model. In the dimensionality reduction step we perform later, large vocabulary sizes cause T-SNE iterations to take a long time. Hence, I tuned min_count to generate a vocabulary of around 10,000 words. Increasing the value of sample will cause word2vec to randomly omit words with high frequency counts. I decided that I wanted to keep all of those words in my analysis, so it’s set to zero. Increasing epoch_count will cause word2vec to train for more iterations, which will, naturally, take longer. Increase this if you have a fast machine or plenty of time on your hands.
def get_word2vec(sentences):
  num_workers = multiprocessing.cpu_count()
  num_features = 200
  epoch_count = 10
  sentence_count = len(sentences)
  w2v_file = os.path.join(save_dir, "word_vectors.w2v")
  word2vec = None
  if os.path.exists(w2v_file):
    print("w2v model loaded from " + w2v_file)
    word2vec = w2v.Word2Vec.load(w2v_file)
  else:
    word2vec = w2v.Word2Vec(sg=1,
                            seed=1,
                            workers=num_workers,
                            size=num_features,
                            min_count=min_frequency_val,
                            window=5,
                            sample=0)
    print("Building vocab...")
    word2vec.build_vocab(sentences)
    print("Word2Vec vocabulary length:", len(word2vec.wv.vocab))
    print("Training...")
    word2vec.train(sentences, total_examples=sentence_count, epochs=epoch_count)
    print("Saving model...")
    word2vec.save(w2v_file)
  return word2vec
Tensorboard has some good tools to visualize word embeddings in the word2vec model we just created. These visualizations can be accessed using the “projector” tab in the interface. Here’s code to create tensorboard embeddings:
def create_embeddings(word2vec):
  all_word_vectors_matrix = word2vec.wv.syn0
  num_words = len(all_word_vectors_matrix)
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim = word2vec.wv[vocab[0]].shape[0]
  embedding = np.empty((num_words, dim), dtype=np.float32)
  metadata = ""
  for i, word in enumerate(vocab):
    embedding[i] = word2vec.wv[word]
    metadata += word + "\n"
  metadata_file = os.path.join(save_dir, "metadata.tsv")
  with io.open(metadata_file, "w", encoding="utf-8") as f:
    f.write(metadata)

  tf.reset_default_graph()
  sess = tf.InteractiveSession()
  X = tf.Variable([0.0], name='embedding')
  place = tf.placeholder(tf.float32, shape=embedding.shape)
  set_x = tf.assign(X, place, validate_shape=False)
  sess.run(tf.global_variables_initializer())
  sess.run(set_x, feed_dict={place: embedding})

  summary_writer = tf.summary.FileWriter(save_dir, sess.graph)
  config = projector.ProjectorConfig()
  embedding_conf = config.embeddings.add()
  embedding_conf.tensor_name = 'embedding:0'
  embedding_conf.metadata_path = 'metadata.tsv'
  projector.visualize_embeddings(summary_writer, config)

  save_file = os.path.join(save_dir, "model.ckpt")
  print("Saving session...")
  saver = tf.train.Saver()
  saver.save(sess, save_file)
Once this code has been run, tensorflow log entries will be created in save_dir. To start a tensorboard session, run the following command from the directory you ran this script in:
tensorboard --logdir=save_dir
You should see output like the following once you’ve run the above command:
TensorBoard 0.4.0rc3 at http://node.local:6006 (Press CTRL+C to quit)
Navigate your web browser to localhost:<port_number> to see the interface. From the “Inactive” pulldown menu, select “Projector”.
Once you’ve selected “projector”, you should see a view like this:
Tensorboard’s projector view allows you to interact with word embeddings, search for words, and even run t-sne on the dataset.
There are a lot of things to play around with in this view. You can search for words, fly around the embeddings, and even run t-sne (on the bottom left) on the dataset. If you get to this step, have fun playing with the interface!
And now, back to the code. One of word2vec’s most interesting functions is finding similarities between words. This is done via the word2vec.wv.most_similar() call. The following function calls word2vec.wv.most_similar() for a word and returns num_similar similar words. The returned value is a list containing the queried word and a list of similar words ( [queried_word, [similar_word1, similar_word2, …]] ).
def most_similar(input_word, num_similar):
  sim = word2vec.wv.most_similar(input_word, topn=num_similar)
  output = []
  found = []
  for item in sim:
    w, n = item
    found.append(w)
  output = [input_word, found]
  return output
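For example, a call might produce something like the following (the words shown are made up, not actual results from the dataset):

# Hypothetical call and return value:
#   most_similar(u"vaalit", 3)
#   => [u"vaalit", [u"presidentinvaalit", u"vaalipäivä", u"äänestys"]]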
The following function takes a list of words to be queried, passes them to the above function, saves the output, and also passes the queried words to t_sne_scatterplot(), which we’ll show later. It also writes a csv file (associations.csv) which can be imported into Gephi to generate graph visualizations. You can see some Gephi-generated visualizations in the accompanying blog post.
I find that manually viewing the word2vec_test.json file generated by this function is a good way to read the list of similarities found for each word queried with wv.most_similar().
def test_word2vec(test_words):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  output = []
  associations = {}
  test_items = test_words
  for count, word in enumerate(test_items):
    if word in vocab:
      print("[" + str(count+1) + "] Testing: " + word)
      if word not in associations:
        associations[word] = []
      similar = most_similar(word, num_similar)
      t_sne_scatterplot(word)
      output.append(similar)
      for s in similar[1]:
        if s not in associations[word]:
          associations[word].append(s)
    else:
      print("Word " + word + " not in vocab")
  filename = os.path.join(save_dir, "word2vec_test.json")
  save_json(output, filename)
  filename = os.path.join(save_dir, "associations.json")
  save_json(associations, filename)
  filename = os.path.join(save_dir, "associations.csv")
  handle = io.open(filename, "w", encoding="utf-8")
  handle.write(u"Source,Target\n")
  for w, sim in associations.iteritems():
    for s in sim:
      handle.write(w + u"," + s + u"\n")
  handle.close()
  return output
The next function implements standalone code for creating a scatterplot from the output of T-SNE on a set of data points obtained from a word2vec.wv.most_similar() query. The scatterplot is visualized with matplotlib. Unfortunately, my matplotlib skills leave a lot to be desired, and these graphs don’t look great. But they’re readable.
def t_sne_scatterplot(word):
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  dim0 = word2vec.wv[vocab[0]].shape[0]
  arr = np.empty((0, dim0), dtype='f')
  w_labels = [word]
  nearby = word2vec.wv.similar_by_word(word, topn=num_similar)
  arr = np.append(arr, np.array([word2vec[word]]), axis=0)
  for n in nearby:
    w_vec = word2vec[n[0]]
    w_labels.append(n[0])
    arr = np.append(arr, np.array([w_vec]), axis=0)

  tsne = TSNE(n_components=2, random_state=1)
  np.set_printoptions(suppress=True)
  Y = tsne.fit_transform(arr)
  x_coords = Y[:, 0]
  y_coords = Y[:, 1]

  plt.rc("font", size=16)
  plt.figure(figsize=(16, 12), dpi=80)
  plt.scatter(x_coords[0], y_coords[0], s=800, marker="o", color="blue")
  plt.scatter(x_coords[1:], y_coords[1:], s=200, marker="o", color="red")
  for label, x, y in zip(w_labels, x_coords, y_coords):
    plt.annotate(label.upper(), xy=(x, y), xytext=(0, 0), textcoords='offset points')
  plt.xlim(x_coords.min()-50, x_coords.max()+50)
  plt.ylim(y_coords.min()-50, y_coords.max()+50)
  filename = os.path.join(plot_dir, word + "_tsne.png")
  plt.savefig(filename)
  plt.close()
In order to create a scatterplot of the entire vocabulary, we need to perform T-SNE over that whole dataset. This can be a rather time-consuming operation. The next function performs that operation, attempting to save and re-load intermediate steps (since some of them can take over 30 minutes to complete).
def calculate_t_sne():
  vocab = word2vec.wv.vocab.keys()
  vocab_len = len(vocab)
  arr = np.empty((0, dim0), dtype='f')
  labels = []
  vectors_file = os.path.join(save_dir, "vocab_vectors.npy")
  labels_file = os.path.join(save_dir, "labels.json")
  if os.path.exists(vectors_file) and os.path.exists(labels_file):
    print("Loading pre-saved vectors from disk")
    arr = load_bin(vectors_file)
    labels = load_json(labels_file)
  else:
    print("Creating an array of vectors for each word in the vocab")
    for count, word in enumerate(vocab):
      if count % 50 == 0:
        print_progress(count, vocab_len)
      w_vec = word2vec[word]
      labels.append(word)
      arr = np.append(arr, np.array([w_vec]), axis=0)
    save_bin(arr, vectors_file)
    save_json(labels, labels_file)

  x_coords = None
  y_coords = None
  x_c_filename = os.path.join(save_dir, "x_coords.npy")
  y_c_filename = os.path.join(save_dir, "y_coords.npy")
  if os.path.exists(x_c_filename) and os.path.exists(y_c_filename):
    print("Reading pre-calculated coords from disk")
    x_coords = load_bin(x_c_filename)
    y_coords = load_bin(y_c_filename)
  else:
    print("Computing T-SNE for array of length: " + str(len(arr)))
    tsne = TSNE(n_components=2, random_state=1, verbose=1)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    print("Saving coords.")
    save_bin(x_coords, x_c_filename)
    save_bin(y_coords, y_c_filename)
  return x_coords, y_coords, labels, arr
The next function takes the data calculated in the above step, and data obtained from test_word2vec(), and plots the results from each word queried on the scatterplot of the entire vocabulary. These plots are useful for visualizing which words are closer to others, and where clusters commonly pop up. This is the last function before we get onto the main routine.
def show_cluster_locations(results, labels, x_coords, y_coords):
  for item in results:
    name = item[0]
    print("Plotting graph for " + name)
    similar = item[1]
    in_set_x = []
    in_set_y = []
    out_set_x = []
    out_set_y = []
    name_x = 0
    name_y = 0
    for count, word in enumerate(labels):
      xc = x_coords[count]
      yc = y_coords[count]
      if word == name:
        name_x = xc
        name_y = yc
      elif word in similar:
        in_set_x.append(xc)
        in_set_y.append(yc)
      else:
        out_set_x.append(xc)
        out_set_y.append(yc)
    plt.figure(figsize=(16, 12), dpi=80)
    plt.scatter(name_x, name_y, s=400, marker="o", c="blue")
    plt.scatter(in_set_x, in_set_y, s=80, marker="o", c="red")
    plt.scatter(out_set_x, out_set_y, s=8, marker=".", c="black")
    filename = os.path.join(big_plot_dir, name + "_tsne.png")
    plt.savefig(filename)
    plt.close()
Now let’s write our main routine, which will call all the above functions, process our collected Twitter data, and generate visualizations. The first few lines take care of our three preprocessing steps, and generation of a frequency distribution / vocabulary. The script expects the raw Twitter data to reside in a relative path (data/tweets.txt). Change those variables as needed. Also, all output is saved to a subdirectory in the relative path (analysis/). Again, tailor this to your needs.
if __name__ == '__main__':
  input_dir = "data"
  save_dir = "analysis"
  if not os.path.exists(save_dir):
    os.makedirs(save_dir)

  print("Preprocessing raw data")
  raw_input_file = os.path.join(input_dir, "tweets.txt")
  filename = os.path.join(save_dir, "data.json")
  processed = try_load_or_process(filename, process_raw_data, raw_input_file)
  print("Unique sentences: " + str(len(processed)))

  print("Tokenizing sentences")
  filename = os.path.join(save_dir, "tokens.json")
  tokens = try_load_or_process(filename, tokenize_sentences, processed)

  print("Cleaning tokens")
  filename = os.path.join(save_dir, "cleaned.json")
  cleaned = try_load_or_process(filename, clean_sentences, tokens)

  print("Getting word frequencies")
  filename = os.path.join(save_dir, "frequencies.json")
  frequencies = try_load_or_process(filename, get_word_frequencies, cleaned)
  vocab_size = len(frequencies)
  print("Unique words: " + str(vocab_size))
Next, I trim the vocabulary and save the resulting list of words. This allows me to look over the trimmed list and ensure that the words I’m interested in survived the trimming operation. Due to the nature of the Finnish language (and Twitter), the vocabulary of our “cleaned” set, prior to trimming, was over 100,000 unique words. After trimming, it ended up at around 11,000 words.
  trimmed_vocab = []
  min_frequency_val = 6
  for item in frequencies:
    if item[1] >= min_frequency_val:
      trimmed_vocab.append(item[0])
  trimmed_vocab_size = len(trimmed_vocab)
  print("Trimmed vocab length: " + str(trimmed_vocab_size))
  filename = os.path.join(save_dir, "trimmed_vocab.json")
  save_json(trimmed_vocab, filename)
The next few lines do all the compute-intensive work. We’ll create a word2vec model with the cleaned token set, create tensorboard embeddings (for the visualizations mentioned above), and calculate T-SNE. Yes, this part can take a while to run, so go put the kettle on.
print print("Instantiating word2vec model") word2vec = get_word2vec(cleaned) vocab = word2vec.wv.vocab.keys() vocab_len = len(vocab) print("word2vec vocab contains " + str(vocab_len) + " items.") dim0 = word2vec.wv[vocab[0]].shape[0] print("word2vec items have " + str(dim0) + " features.")print(“Creating tensorboard embeddings”)
create_embeddings(word2vec)print(“Calculating T-SNE for word2vec model”)
x_coords, y_coords, labels, arr = calculate_t_sne()
Finally, we’ll take the top 50 most frequent words from our frequency distribution, query each of them for the 40 most similar words, and plot both labelled graphs of each set and a “big plot” of that set over the entire vocabulary.
  plot_dir = os.path.join(save_dir, "plots")
  if not os.path.exists(plot_dir):
    os.makedirs(plot_dir)

  num_similar = 40
  test_words = []
  for item in frequencies[:50]:
    test_words.append(item[0])
  results = test_word2vec(test_words)

  big_plot_dir = os.path.join(save_dir, "big_plots")
  if not os.path.exists(big_plot_dir):
    os.makedirs(big_plot_dir)
  show_cluster_locations(results, labels, x_coords, y_coords)
And that’s it! Rather a lot of code, but it does quite a few useful tasks. If you’re interested in seeing the visualizations I created using this tool against the Tweets collected from the January 2018 Finnish presidential elections, check out this blog post.