2. please go to http://ADDRESS
and enter a sentence
interesting relationships?
gensim generated the data for those visualizations
by computing the semantic similarity of the input
3. who am I?
William Bert
developer at Carney Labs (teamcarney.com)
user of gensim
still new to world of topic modelling,
semantic similarity, etc
4. gensim: “topic modeling for humans”
topic modeling attempts to uncover the
underlying semantic structure of text by
identifying recurring patterns of terms
in a set of data (topics).
topic modelling
does not parse sentences,
does not care about word order, and
does not "understand" grammar or syntax.
6. gensim isn't about topic modeling
(for me, anyway)
It's about similarity.
What is similarity?
Some types:
• String matching
• Stylometry
• Term frequency
• Semantic (meaning)
7. Is
A seven-year quest to collect samples from the
solar system's formation ended in triumph in a
dark and wet Utah desert this weekend.
similar in meaning to
For a month, a huge storm with massive
lightning has been raging on Jupiter under the
watchful eye of an orbiting spacecraft.
more or less than it is similar to
One of Saturn's moons is spewing a giant plume
of water vapour that is feeding the planet's
rings, scientists say.
?
8. Who cares about semantic similarity?
Some use cases:
• Query large collections of text
• Automatic metadata
• Recommendations
• Better human-computer interaction
9. gensim.corpora
TextCorpus and other kinds of corpus classes
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1), ...]]
corpus = stream of vectors of document
feature ids
for example, words in documents are
features (“bag of words”)
10. gensim.corpora
TextCorpus and other kinds of corpus classes
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1), ...]]
Dictionary class
>>> print corpus.dictionary
Dictionary(8472 unique tokens)
dictionary maps features (words) to feature ids
(numbers)
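The feature-id mapping on this slide can be sketched in a few lines of plain Python. This is a toy stand-in, not gensim's Dictionary class; the class name and example documents are invented for illustration:

```python
from collections import Counter

class ToyDictionary:
    """Toy stand-in for gensim's Dictionary: maps tokens to integer feature ids."""
    def __init__(self, documents):
        self.token2id = {}
        for doc in documents:
            for token in doc:
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)

    def doc2bow(self, document):
        """Turn a tokenized document into a sparse (feature_id, count) vector."""
        counts = Counter(self.token2id[t] for t in document if t in self.token2id)
        return sorted(counts.items())

docs = [["saturn", "moon", "water"], ["jupiter", "storm", "storm"]]
d = ToyDictionary(docs)
print(d.doc2bow(["storm", "storm", "moon"]))  # -> [(1, 1), (4, 2)]
```

Unknown tokens are simply dropped, which matches how a fixed dictionary treats unseen words.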
11. need massive collection of documents that
ostensibly has meaning
sounds like a job for wikipedia
>>> wiki_corpus = WikiCorpus(articles)  # articles is a Wikipedia text dump bz2 file. several hours.
>>> wiki_corpus.dictionary.save("wiki_dict.dict")  # persist dictionary
>>> MmCorpus.serialize("wiki_corpus.mm", wiki_corpus)  # uses numpy to persist corpus in Matrix Market format. several GBs. can be bz2'ed.
>>> wiki_corpus = MmCorpus("wiki_corpus.mm")  # revive a corpus
12. gensim.models
transform corpora using models classes
for example, term frequency/inverse document
frequency (TFIDF) transformation
reflects importance of a term, not just
presence/absence
13. gensim.models
>>> tfidf_trans = models.TfidfModel(wiki_corpus, id2word=dictionary)  # TFIDF computes frequencies of all document features in the corpus. several hours.
TfidfModel(num_docs=3430645, num_nnz=547534266)
>>> tfidf_trans[documents]  # emits documents in TFIDF representation. documents must be in the same BOW vector space as wiki_corpus.
[[(40, 0.23), (6, 0.12), (78, 0.65)], [(39, ...]]
>>> tfidf_corpus = MmCorpus(corpus=tfidf_trans[wiki_corpus], id2word=dictionary)  # builds a new corpus by iterating over documents transformed to TFIDF
16. topics again for a bit
• SVD decomposes a matrix into three simpler matrices
• full rank SVD would be able to recreate the underlying
matrix exactly from those three matrices
• lower-rank SVD provides the best (least square error)
approximation of the matrix
• this approximation can find interesting relationships among
data
• it preserves most information while reducing noise and
merging dimensions associated with terms that have similar
meanings
17. topics again for a bit
• SVD:
alias-i.com/lingpipe/demos/tutorial/svd/read-me.html
•Original paper:
www.cob.unt.edu/itds/faculty/evangelopoulos/dsci5910/LSA
_Deerwester1990.pdf
• General explanation:
tottdp.googlecode.com/files/LandauerFoltz-Laham1998.pdf
• Many more
18. gensim.models
>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_features=400, decay=1.0, chunksize=20000)  # creates LSI transformation model from the TFIDF corpus representation
>>> print lsi_trans
LsiModel(num_terms=100000, num_topics=400, decay=1.0, chunksize=20000)
19. gensim.similarities
(the best part)
>>> index =
Similarity(corpus=lsi_transformation[tfidf_trans
formation[index_corpus]], num_features=400, outp
ut_prefix=”/tmp/shard”)
>>>
index[lsi_trans[tfidf_trans[dictionary.doc2bow(to
kenize(query))]]] # similarity of each document
in the index corpus to a new query document
>>> [s for s in index] # a matrix of each
document’s similarities to all other documents
[array([ 1. , 0. , 0.08, 0.01]),
array([ 0. , 1. , 0.02, -0.02]),
array([ 0.08, 0.02, 1. , 0.15]),
array([ 0.01, -0.02, 0.15, 1. ])]
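The matrix above is just pairwise cosine similarity between document vectors. A minimal numpy sketch of that computation (illustrative only; gensim's Similarity class adds disk shards and constant-memory iteration on top of this idea):

```python
import numpy as np

def similarity_matrix(vectors):
    """All-pairs cosine similarity: L2-normalize the rows, then one matrix product."""
    v = np.asarray(vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    return unit @ unit.T

# three invented document vectors (real LSI vectors would have ~400 dimensions)
sims = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(np.round(sims, 2))  # diagonal is 1.0: every document matches itself exactly
```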
20. about gensim
four additional models available
dependencies: numpy, scipy
optional: Pyro, Pattern
created by Radim Rehurek
•radimrehurek.com/gensim
•github.com/piskvorky/gensim
•groups.google.com/group/gensim
21. thank you
example code, visualization code, and ppt:
github.com/sandinmyjoints
interview with Radim:
williamjohnbert.com
23. gensim.models
• term frequency/inverse document frequency
(TFIDF)
• log entropy
• random projections
• latent dirichlet allocation (LDA)
• hierarchical dirichlet process (HDP)
• latent semantic analysis/indexing (LSA/LSI)
24. slightly more about gensim
Dependencies: numpy and scipy, and optionally
Pyro for distributed and Pattern for lemmatization
data from Lee 2005 and other papers is available
in gensim for tests
Hi everyone, thanks for coming. I’m going to start off with a quick demo app. Please go to the address you see up there. Hookbox sometimes takes several seconds to connect to the channel, so give it some time if it’s red. It will turn green. You’ll be invited to submit a sentence, particularly a statement or fact that has a mixture of nouns, verbs, and adjectives. There are some examples of the kinds of sentences that might work well with this demo, but take a moment to think of a sentence of your own and go ahead and submit it. We should see them pop up here on the visualization screen. The idea is to provide a bit of concrete grounding for the talk, and this will serve as an example of one thing you can do with gensim, or at least with the data generated by it. What do we see? We have a table comparing a number of submitted sentences with what I’m going to call similarity scores between them. The darker the green, the higher the score. Are there any interesting results? Here we also have some clustering visualizations that attempt to group together the inputs that were found to have the highest scores. How do they cluster together? [click] Hopefully that worked and showed some interesting relationships among the input. (If not, well, I blame the input.) gensim, which I’ll be talking about today, generated all the underlying similarity scores, measuring how similar each sentence was to the others. I’m going to explain how to get results like this from gensim.
A few quick words about me: William Bert. Developer at Carney Labs for about seven months. Carney Labs is basically a startup wholly owned by a larger company in Alexandria called Team Carney. I use gensim at work developing a conversational tutoring web app. Topic modelling is still pretty new to me and I’m constantly learning more about it, so my knowledge is still growing, but I’m really fascinated by it and trying to learn more by working with it a lot, and by doing things like this presentation.
gensim is a free Python framework for doing topic modelling. I’m going to blaze through a quick overview of topic modelling, then discuss how gensim uses it to do semantic similarity, generating data like what we saw. Topic modelling attempts to uncover the underlying semantic structure of text (or other data) by using statistical techniques to identify abstract, recurring patterns of terms in a set of data. These patterns are called topics. They may or may not correspond to our intuitive notion of a topic. Topic modelling models documents as collections of features, representing the documents as long vectors that indicate the presence or absence of important features, for example, the presence or absence of words in a document. We can use those vectors to create spaces, plot the locations of documents in those spaces, and use that as a kind of proxy for their meaning. What isn’t topic modelling? Topic modelling does not parse sentences; in fact, it knows nothing about word order. It makes no attempt to "understand" grammar or language syntax. What does a topic look like?
Let’s take a quick look at some topics now, and we’ll come back to them again after I walk through how to generate them. [click] These three abbreviated topics were extracted from a large corpus of texts by gensim using a technique called latent semantic analysis (LSA). Just quickly note how they are collections of words that don’t necessarily or intuitively seem to belong together. There are also positive and negative scalar factors for each word, which get smaller in magnitude as the topic goes on. We don’t see it here, but each of these topics actually has thousands more terms; these are just the first ten. So that’s what a topic looks like when I talk about topics, but the truth is...
gensim isn’t really about topic modeling, for me anyway. [click] It’s really about similarity. Topics are a means to an end. [click] A few words about similarity, because it can be elusive. [click] There are different kinds of similarity. String matching: how many characters strings have in common. Stylometry: similarity of style, which looks at, say, length of words or sentences, use of function words, ratio of nouns to verbs, etc. Used to identify authors, for example. Term frequency: do the documents use the same words the same number of times (when scaled and normalized)? The kind of similarity I’m interested in is semantic similarity, similarity of meanings. But what is that?
Take a moment to read these three sentences. They might be said to share certain elements: non-earth planets, weather, duration, research and data collection. How would you quantify their similarity? How would you decide that two are more similar to each other than to the third? A study done in Australia in 2005 skipped over the question of defining semantic similarity formally and abstractly and instead defined it as what a sample of Australian college students think is similar. They had students read hundreds of paired short excerpts from news articles (these sentences are excerpted from some of those) and rank the pairwise similarity on a scale. They then examined all the classifications and found that they had a correlation of 0.6. That’s obviously a positive correlation, but not terribly high. So humans don’t necessarily agree with each other about semantic similarity; it’s kind of a fuzzy notion. That said, we’re going to try to put a number on it. In fact, the study I mentioned found that a particular topic modelling technique called latent semantic analysis (LSA) could also achieve a 0.6 correlation with the human ratings, correlating with the study participants’ choices about as well as they correlated with each other.
Why do we care about semantic similarity? Some use cases for document similarity comparison: Traditionally it has been used on large document collections, such as legal discovery, and for answering questions or aiding search over huge corpora like government regulations, manuals, patent databases, etc. Automatic metadata: a system can intelligently suggest tags and categories for documents based on other documents they’re similar to. Something that came up recently on the gensim Google group: in a CMS, when a user creates a new post, they want to see posts that may be similar in content. We can do that with semantic similarity, and in fact someone actually made this into a plugin for Plone using gensim. Also recommendations, plagiarism detection, and exam scoring. And there are a number of other use cases. There are some new and fast online algorithms that work in realtime, whereas previously work was often done in batches. This brings up another potential use: better HCI. Matching on similarity rather than on words or with regexes allows us to accept broader ranges of input, in theory. So to make it happen, enter... gensim.
To get the topics that we can then use to compute semantic similarity, we’re going to start by turning a large set of documents, which we’ll call a training/background corpus, into numeric vectors. We’ll use the tools in gensim’s corpora package. When I say document, a document can be as short as one word, or as long as many pages of text, or anywhere in between. My examples and the demo app mostly use sentence-size documents. In gensim, a corpus is an iterable that returns its documents as sparse vectors. (A sparse vector is just a compact way of storing large vectors that are mostly zeroes.) A corpus can be made from a file, a database query, a network stream, etc., as long as you can stream the documents and emit vectors. gensim iterates over the documents in a corpus with generators, so it uses constant memory, which means you can have enormous corpora and indexes. How you generate those vectors from the documents is up to you. (So a corpus isn’t inherently tied to words or even language; it could be constructed from features of anything, such as music or video, if you can figure out what to use as features [like amplitude or frequencies?] and how to extract them.) If your features are the presence or absence of words, your corpus is in what’s called "bag of words" (BOW) format. gensim provides a convenience class called TextCorpus for creating such a corpus from a text file. So here we have a list of documents; each document is a list of (feature id, count) tuples. Feature #40 appears one time in document #0, etc. For BOW, we also need a dictionary...
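The constant-memory streaming described above can be sketched with a toy corpus class. Everything below (the file contents, the whitespace tokenizer, the token2id table) is invented for illustration; it shows only the iterator pattern, not gensim's real corpus classes:

```python
import os
import tempfile
from collections import Counter

class StreamedCorpus:
    """Reads one document (line) at a time, so memory stays constant
    no matter how large the underlying file is."""
    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                counts = Counter(self.token2id[t] for t in line.split()
                                 if t in self.token2id)
                yield sorted(counts.items())

# tiny demo file standing in for a multi-gigabyte dump
path = os.path.join(tempfile.mkdtemp(), "docs.txt")
with open(path, "w") as f:
    f.write("saturn moon moon\njupiter storm\n")

corpus = StreamedCorpus(path, {"saturn": 0, "moon": 1, "jupiter": 2, "storm": 3})
print([doc for doc in corpus])  # -> [[(0, 1), (1, 2)], [(2, 1), (3, 1)]]
```

Because `__iter__` reopens the file each time, the corpus can be iterated over repeatedly, which a bare generator could not.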
Dictionary maps feature ids back to features (words). The corpus class will generate this for me.So the vectors indicate the presence of words in particular documents, and the resulting matrix containing these vectors will represent all the words appearing in all the documents.
To do interesting and useful things with semantic similarity, we need a good training or background corpus. Finding a good training corpus is something of an art. You want a large collection of documents (at least tens of thousands) that are representative of your problem domain. It can be difficult to find or build such a corpus. [click] Or you can just use Wikipedia. Helpfully for experimenting, gensim comes with a WikiCorpus class and other code for building a corpus from a Wikipedia article dump. [click] WikiCorpus makes two passes: one to extract the dictionary, and another to create and store the sparse vectors. It takes about 10 hours on an i7 to generate and serialize the corpus and dictionary, though it uses constant memory. The resulting output vectors are about 15 GB uncompressed, about 5 GB compressed. So after these operations, wiki_corpus is now a BOW vector space representation of Wikipedia, embodied in a large corpus file in the Matrix Market format (a popular matrix file format) and a several-megabyte dictionary mapping ids to tokens (words). What can we do with our corpus?
We can transform corpora from one vector space to another using models. Transformations can bring out hidden structure in the corpus, such as revealing relationships between words and documents. They can also represent the corpus in a more compact way, preserving much of the information while consuming fewer resources. A gensim "transformation" is any object which accepts a sparse document via dictionary notation and returns another sparse document. One useful transformation that we can generate from our BOW corpus is term frequency/inverse document frequency (TFIDF). Instead of a count of word appearances in a document, we get a score for each word that also takes into account the global frequency of that word. So a word’s TFIDF value in a given document increases proportionally to the number of times the word appears in that particular document, but is offset by the frequency of the word in the entire corpus, which helps to control for the fact that some words are generally more common than others.
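The weighting just described can be written out in a few lines. This sketch uses the classic log2 inverse document frequency and skips the per-document vector normalization that gensim's TfidfModel applies, so treat it as a simplified illustration rather than gensim's exact arithmetic:

```python
import math
from collections import Counter

def tfidf(bow_corpus):
    """Weight each (feature_id, count) by log2(num_docs / doc_freq).
    Features that appear in every document get weight 0 and drop out."""
    num_docs = len(bow_corpus)
    doc_freq = Counter(fid for doc in bow_corpus for fid, _ in doc)
    idf = {fid: math.log2(num_docs / df) for fid, df in doc_freq.items()}
    return [[(fid, count * idf[fid]) for fid, count in doc if idf[fid] > 0]
            for doc in bow_corpus]

# feature 0 occurs in both documents, features 1 and 2 in only one each
print(tfidf([[(0, 2), (1, 1)], [(0, 1), (2, 3)]]))  # -> [[(1, 1.0)], [(2, 3.0)]]
```

Note how the globally common feature 0 vanishes entirely: presence everywhere carries no information about any one document.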
Transformations are initialized with a training corpus, so we realize a TFIDF transformation from the corpus we just generated, wiki_corpus. This also takes several hours to generate for Wikipedia. num_docs is the number of documents in the corpus. num_nnz is the number of non-zeroes in the matrix. [click] Once our model is generated, we can transform documents represented in one vector space model (the wiki_corpus BOW space) and emit them in another (the wiki_corpus TFIDF space) as (word_id, word_weight) tuples, where weight is a positive, normalized float. These documents can be anything, new and unseen, as long as they have been tokenized and put into the BOW representation using the same tokenizer and dictionary word->id mappings that were used for the wiki corpus. [click] We can emit these new representations right into a fresh MmCorpus, which could also be serialized and persisted on disk (also requiring several GBs). However,
the TFIDF corpus itself is not all that interesting except as a stepping stone to another model called LSI. Latent semantic indexing/analysis (LSI/LSA) is the granddaddy of topic modelling similarity techniques. The original paper is from 1990. It produced the results we saw in the visualization. We can generate an LSI model almost the same way we did the TFIDF model, but we do need to provide an extra parameter called num_features, which brings us back to topics...
num_features is a parameter to LSI telling it how many topics to make. Here again are the topics we saw. These were generated by LSI from Wikipedia articles. What are these topics? It’s hard to say, exactly. The "themes" are unclear. But they are in some sense the corpus’s "principal components". (In fact, principal component analysis is similar, if you know what that is.) Here’s a brief rundown of how LSI works to calculate these topics...
LSI uses a technique called singular value decomposition (SVD) to reduce the number of dimensions of the original term/document matrix while keeping the most information for a given number of topics. I understand the technique conceptually, but I’m not going to try to get into the math behind it because I don’t really understand it well enough to explain it, and there are plenty of resources online that explain it in great and accurate detail. Nonetheless, I’ll at least describe it briefly: SVD decomposes the word/document matrix into three simpler matrices. Full-rank SVD will recreate the underlying matrix exactly, but LSA uses lower-rank SVD, which provides the best (in the sense of least square error) approximation of the matrix at lower dimensions. By lowering the rank, dimensions associated with terms that have similar meanings are merged together. This preserves the most important semantic information in the text while reducing noise, and can uncover interesting relationships among the data of the underlying matrix. Still, the meaning of the terms and topics is not really apparent to us. This is because a single LSI topic is not about a single thing; the topics work as a set. The topics contain both positive and negative values, which cancel each other out delicately when generating vectors for documents. This is one of the reasons LSI topics are hard to interpret. --- The original matrix can be too large for the computing resources; in this case, the approximated low-rank matrix is interpreted as an approximation (a "least and necessary evil"). The original matrix can be noisy: for example, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a de-noisified matrix (a better matrix than the original). The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix.
That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document, generally a much larger set due to synonymy. The consequence of the rank lowering is that some dimensions are combined and depend on more than one term: {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}. This mitigates the problem of identifying synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also mitigates the problem of polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. Conversely, components that point in other directions tend either to simply cancel out or, at worst, to be smaller than components in the directions corresponding to the intended sense. --- A = m*n matrix. A = U * S * V^T, where U is an m*k matrix, V is an n*k matrix, S is a k*k matrix, and k is the rank of the matrix A.
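A small numpy example makes the rank-lowering concrete (the toy term/document matrix is invented; real matrices have millions of cells):

```python
import numpy as np

# toy term/document matrix: rows are terms (car, truck, flower), columns are documents
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, A)  # full-rank SVD recreates A exactly

# keep only the k largest singular values: the best rank-k approximation
# in the least-squares sense (Eckart-Young theorem)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
error = np.linalg.norm(A - A_k)  # Frobenius error equals the dropped singular value
print(round(error, 2))
```

The "car" and "truck" rows, which co-occur, end up sharing directions in the truncated factors, which is the merging of similar-meaning dimensions described above.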
As I said, there are plenty of resources online to explain it, so for now I will direct your questions there.
So we generate our somewhat mysterious LSI model, asking for 400 topics. Interesting note: no one has really figured out how to determine the best number of topics for a given corpus for LSI, but experimentally people have found good results between 200 and 500. This will also take several hours. LSI model generation can be distributed to multiple CPUs/machines through a library called Python Remote Objects (Pyro), leading to faster model generation times. When it’s done, we have an LsiModel with 100,000 terms; that’s the size of the dictionary we created. The decay parameter gives more emphasis to new documents if any are added to the model after initial generation. Because the SVD algorithm is incremental, the memory load is constant and can be controlled by a chunksize parameter that says how many documents are to be loaded into RAM at once. Larger chunks speed things up, but also require more RAM.
Now we get to the best part. With our LSI transformation, we can now use the classes in gensim.similarities to create an index of all the documents that we want to compare subsequent queries against. The Similarity class uses fixed memory by splitting the index across shards on disk and mmap'ing them in as necessary. output_prefix is for the filenames of the shards. What is index_corpus? It could be my original training/universe corpus, Wikipedia. Then the index would tell me which Wikipedia document any new queries are most similar to. But the index corpus could also be a set of entirely different documents, for example arbitrary sentences typed in by a group of Python programmers, and the index will determine which of those documents my query is most similar to. You can even add new documents to an index in realtime. [click] So to calculate the similarity of a query, we tokenize and preprocess the query the same way we treated the wiki corpus, then convert to BOW, then do the TFIDF transform, then the LSI transform, give that to the index, and we’ll get a list of similarity scores between the query and each document in the index. [click] You can also calculate the similarity scores between all documents in the index and get back a 2-dimensional matrix, which is what the visualization app was doing every time a new document was added to the index. This is what makes realtime similarity comparisons possible, for some value of similar.
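The scoring step at the end of that pipeline, comparing the query's transformed vector against every indexed vector, is cosine similarity. A sketch with invented two-dimensional vectors (real LSI vectors would have num_features dimensions, and gensim does this against mmap'ed shards rather than an in-memory array):

```python
import numpy as np

def query_scores(query_vec, index_vectors):
    """Cosine similarity of one query against every indexed document,
    the core of what a similarity index computes per lookup."""
    index = np.asarray(index_vectors, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    index_unit = index / np.linalg.norm(index, axis=1, keepdims=True)
    return index_unit @ (q / np.linalg.norm(q))

index_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # pretend LSI space
scores = query_scores([1.0, 0.0], index_vectors)
print(scores.argmax())  # index of the most similar document
```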
A few more things about gensim before I wrap up: TFIDF and LSI are only two of the six models it implements. There are a couple more weighting models and a couple more dimensionality reduction models, each with different properties, but I haven’t had a chance to work with those very much. [click] [click] gensim’s dependencies are numpy and scipy, and optionally Pyro for distributed model generation and Pattern for additional input processing. [click] To give credit where credit’s due, I want to say a few words about where gensim comes from. It was created by a Czech developer named Radim Rehurek, and his work to make the algorithms scalable and online contributed to his PhD thesis. He is an active developer and is very helpful on the mailing list. He’s working hard to build a community around gensim and make it into a robust open source project (LGPL license). I asked him a few questions about his work on gensim; the questions and the answers are up on my personal site, williamjohnbert.com. Radim says: "Gensim has no ambition to become an all-encompassing production level tool, with robust failure handling and error recoveries." But in my experience it has performed well, and Radim mentioned several commercial applications that are using it, in addition to universities.
Thanks for listening. This presentation, some sample code, and the demo app are available on my github page, github.com/sandinmyjoints. I should note that the demo web app and the visualization are actually not part of gensim. gensim generated the data, but the app is Flask and hookbox, the clustering is scipy and scikit-learn, and the visualization is d3. Questions?
In addition to TFIDF, gensim implements several VSM algorithms, most of which I know nothing about, but to do justice to gensim’s capabilities:
- TFIDF: weights tokens according to importance (local vs. global). Preserves dimensionality.
- Log entropy: another term weighting function, using log entropy normalization. Preserves dimensionality.
- Random projections: approximates TFIDF distances but is less computationally expensive. Reduces dimensionality.
- Latent Dirichlet allocation (LDA): a generative model that produces more human-readable topics. Reduces dimensionality.
- Hierarchical Dirichlet process (HDP): very new, first described in a paper from 2006; not all operations are fully implemented in gensim yet.
- Latent semantic indexing/analysis (LSI/LSA): the granddaddy of topic modelling similarity techniques. Reduces dimensionality. The original paper is Deerwester et al. 1990. I have used it most.
Dependencies: numpy and scipy, and optionally Pyro for distributed processing and Pattern for lemmatization. Data from Lee 2005 and other papers is available in gensim for tests.
These recurring patterns called topics may or may not correspond to our intuitive notion of a topic. The abbreviated ones printed here were extracted from a large corpus of texts by gensim using a technique called latent Dirichlet allocation (LDA), which actually does tend to produce human-readable topics (but not all the techniques do that, and latent semantic analysis, which is what the demo app used and what we’ll be looking at soon, does not). There are some themes: they appear to be "about" something, with the terms having a decreasing weighting.