Daniel Tunkelang argues that knowledge representation is overrated for AI systems and computation is underrated. He discusses past attempts at knowledge representation like Cyc and Freebase, and how today's data-driven approaches using large datasets have proven more effective than rule-based systems for tasks like machine translation and question answering. Tunkelang advocates for semi-structured data and data-driven recommendations and queries to empower users and fill gaps in systems' knowledge. He concludes that communication is both the problem and solution, and systems should leverage users as intelligent partners rather than relying solely on perfect schemas or vocabularies.
14. Plain Old Search Engines are Pretty Good Too
http://blog.stephenwolfram.com/2011/01/jeopardy-ibm-and-wolframalpha/
15. The Unreasonable Effectiveness of Data
§ simple models + lots of data >>
elaborate models + less data
§ machine translation: parallel corpora >>
elaborate rules for syntactic and semantic patterns
§ semantic web formalism just means semantic
interpretation on shorter strings between angle brackets
Alon Halevy, Peter Norvig, and Fernando Pereira (2009)
16. Today’s Challenge
1. Knowledge representation is overrated.
2. Computation is underrated.
3. We have a communication problem.
18. Semi-structured Data at LinkedIn
Summary (free text): "I lead a data science team at LinkedIn, which analyzes terabytes of data to produce products and insights that serve LinkedIn's members. Prior to LinkedIn, I led a local search quality team at Google and was a founding employee of faceted search pioneer Endeca (acquired by Oracle in 2010), where…"
Alongside the free text, the profile's structured fields:
<person>
  <id />
  <first-name />
  <last-name />
  <location>
    <name />
    <country>
      <code />
    </country>
  </location>
  <industry />
  …
</person>
20. Another Example: Helping a Friend
Dear Daniel,
I'm attaching the resume of an old friend who just moved up
to the Bay Area.
He has a very strong background in:
§ mobile / wireless applications
§ start-ups and new product launches
§ international expansion
Best regards,
XXX
24. Data-Driven Computation Serves Communication
for i in [1..n]
    s ← w1 w2 … wi
    if Pc(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]
        for b in B[j]
            s ← wj+1 … wi
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob * Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
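The pseudocode above translates directly into a short Python sketch. The phrase probabilities Pc below are made-up values for illustration (the real ones come from corpus statistics), and k is the beam width.

```python
from dataclasses import dataclass

# Hypothetical phrase probabilities: Pc(s) > 0 only for known phrases.
PC = {
    "new": 0.3, "york": 0.2, "times": 0.2,
    "new york": 0.4, "york times": 0.1, "new york times": 0.25,
}

def Pc(s):
    return PC.get(s, 0.0)

@dataclass
class Segment:
    segs: tuple = ()
    prob: float = 1.0

def segment(words, k=3):
    """Beam search over segmentations, keeping the top-k per prefix."""
    n = len(words)
    B = [[] for _ in range(n + 1)]  # B[i]: best segmentations of words[0:i]
    for i in range(1, n + 1):
        s = " ".join(words[0:i])    # the whole prefix as one segment
        if Pc(s) > 0:
            B[i].append(Segment(segs=(s,), prob=Pc(s)))
        for j in range(1, i):       # extend each kept segmentation of words[0:j]
            for b in B[j]:
                s = " ".join(words[j:i])
                if Pc(s) > 0:
                    B[i].append(Segment(segs=b.segs + (s,), prob=b.prob * Pc(s)))
        B[i].sort(key=lambda a: a.prob, reverse=True)
        del B[i][k:]                # truncate the beam to size k
    return B[n]

best = segment(["new", "york", "times"])  # best[0].segs == ("new york times",)
```

Truncating each B[i] to size k keeps the computation linear in k rather than exponential in the number of segmentations.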
25. Recommendations Leverage Semi-structured Data
[Diagram: the recommendation pipeline matches job fields against user fields, using corpus statistics and the user base.
§ Job fields: title, industry, geo, description, company, functional area, …
§ User fields: general profile (expertise, specialties, education, headline, geo, experience, …) and current position (title, summary, tenure length, industry, functional area)
§ Corpus stats: transition probabilities, years of experience to reach a title, education needed for a title
§ User base: connectivity, filtered transition probabilities, candidate similarity
§ Derived matching features: binary exact matches (geo, industry, …); soft similarity, e.g. (candidate expertise, job description) 0.56, (candidate specialties, job description) 0.2; transition probability (candidate industry, job industry) 0.43; title similarity 0.8; similarity (headline, title) 0.7]
29. There is no perfect schema or vocabulary.
§ And even if there were, not everyone would use it.
§ Knowledge representation has only succeeded within
narrow scope.
§ Brute force is surprisingly effective but does not leverage
the user as an intelligent partner.
30. Communication is the problem and the solution.
§ Rich communication channel fills gaps in system’s
knowledge representation and in user’s knowledge.
§ Use data science to make the system smart, but be
humble and empower the human user.
You've got the brawn
I've got the brains
Let's make lots of money
Pet Shop Boys, “Opportunities”
Two icons of artificial intelligence from science fiction: the HAL 9000 computer from 2001: A Space Odyssey and the android Data from Star Trek: The Next Generation. Both exceed human beings in their ability to assimilate knowledge and to reason using that knowledge. Both interact with human beings in natural language. Despite all of our technological advances, the closest we have come to this vision is talking to Siri. An improvement on the 1960s ELIZA program, for sure, but still a baby step.
In 1945, Vannevar Bush put forth his vision of a memex (a portmanteau of "memory" and "index") as a device in which individuals would compress and store all of their books, records, and communications, "mechanized so that it may be consulted with exceeding speed and flexibility". The memex would provide an "enlarged intimate supplement to one's memory". The concept of the memex influenced the development of hypertext systems, eventually leading to the creation of the World Wide Web and personal knowledge base software.
A pure embodiment of the AI vision: Cyc was started in 1984 as an artificial intelligence project that attempts to assemble a comprehensive ontology and knowledge base of everyday common-sense knowledge, with the goal of enabling AI applications to perform human-like reasoning. Typical pieces of knowledge represented in the database are "Every tree is a plant" and "Plants die eventually". When asked whether trees die, the inference engine can draw the obvious conclusion and answer the question correctly. The knowledge base contains over one million human-defined assertions, rules, or common-sense ideas, formulated in a language based on predicate calculus.
Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members. It is an online collection of structured data harvested from many sources, including individual 'wiki' contributions. Freebase aims to create a global resource that allows people (and machines) to access common information more effectively. Freebase is a wonderful resource, and search engines are starting to use it as a structured data source. But using Freebase for structured queries is a lot trickier than using Google for free-text queries, largely because Freebase is incomplete in unpredictable ways. In particular, Freebase has difficulties with null, unknown, or N/A values. For example, in the results for "fires of unknown cause", there is no way to tell whether the cause of a fire is really unknown or the data is simply missing.
Wolfram Alpha is an answer engine developed by Wolfram Research. It is an online service that answers factual queries directly by computing the answer from structured data. Wolfram Alpha is impressive. It's no wonder that it serves as the back end for many Siri queries. Unfortunately, its natural-language interface is brittle. As we can see from these two queries, it can roughly report the number of software engineers in the San Francisco Bay Area, but not the number of software companies. Nobody is perfect. But what is disconcerting is that the system does nothing to suggest that the latter answer is less reliable than the former. Does the system know how to answer the second question? There is no way for the user to be sure, other than perhaps by trial and error eventually leading to resolution or frustrated resignation. This is a communication problem.
Deep Blue was a chess-playing computer developed by IBM. In 1997, the machine defeated world champion Garry Kasparov in a match. What was its secret sauce? Could it think? Did it learn to play chess and represent that wisdom in a knowledge base? Not really – to borrow a line from Toy Story, it won by using brute force with style. It was a massively parallel system (by 1997 standards) made with special-purpose chips.
A decade later, IBM did it again. IBM researchers decided to build a system to beat humans at a more modern game than chess – namely the Jeopardy! television quiz show, featuring trivia in history, pop culture, sports, etc. Moreover, many Jeopardy! questions (or "answers", since the gimmick of the game is that the question-answer process is inverted) involve word play, which would seem particularly challenging for a machine. Like Deep Blue, Watson is all about computation. Its knowledge base is mined from 200 million pages of structured and unstructured content consuming four terabytes of disk storage, including the full text of Wikipedia. It uses a server cluster with 720 cores and relies on parallel processing to parse questions and search its knowledge base for candidate responses. In February 2011, Watson defeated former Jeopardy! champions Ken Jennings and Brad Rutter in a televised match.
Watson's achievement was impressive. But let's put things in perspective. Even plain old search engines do pretty well at Jeopardy. The comparison isn't entirely fair: in judging the search engines, we only require that they return pages on which the answer should appear, not that they give specific actual answers. One can try various simple strategies for going further, like taking the answer from the title of the first hit – which with the top search engines actually succeeds about 20% of the time. Still, the point should be clear. None of these strategies uses sophisticated semantic representations. Computation – brute force with style – is the big winner.
In 2009, Google researchers Alon Halevy, Peter Norvig, and Fernando Pereira wrote a popular article entitled "The Unreasonable Effectiveness of Data". It has often been paraphrased as "better data beats clever algorithms". But for our purposes we can interpret it as celebrating the triumph of computation over knowledge representation as a means to produce semantic or intelligent behavior.
Let's take stock of what kind of data we have. Most of our data is semi-structured data – the broad space that lives in between structured data (the rigid schemas we associate with relational database systems) and unstructured data (e.g., the free text indexed by search engines). The structure in semi-structured data takes the form of tags and structural elements without a rigid schema (e.g., XML).
LinkedIn has one of the largest and richest collections of semi-structured data on the consumer internet. Here you can see how our people data combines free text, a connection network, and a collection of structured tags. And these aren’t the only entities – we have companies, jobs, etc.
Here I’m searching for people I know in the Bay Area who have “data” anywhere in their profiles and currently work at Google, Yahoo!, or Twitter. Maybe I should look at my Facebook connections too. Did I mention that I’m hiring? The power of such a search is incredible, and the experience is highly intuitive even for a user who has no idea that either the data or the search query is “semi-structured”. The interaction revolves around facets that are well represented in both the data and the user’s mental model.
True story, redacted only to protect my friend's privacy.
Of course I know that the first place to research companies is LinkedIn. So I started with a generic company search for "mobile". The results are reasonable, given the query's lack of specificity. But clearly I needed to be more specific.
Here is my revised query: small mobile companies headquartered in the Bay Area in software-related industries. This may not have been exactly what my friend was looking for, but it was a great starting point. Specifically, the system helped me map his information need to a query that captured its spirit.
Computation is powerful – especially at our scale of data and users. Applying machine learning allows us to produce recommendations for job matching, content, community, etc. And of course it drives the feature LinkedIn is most famous for: People You May Know.
One of the steps in processing search queries is to parse them and establish query interpretations – in this case, that “linkedin” refers to a company and “ceo” refers to a job title. We do so using a hidden Markov model (HMM) trained on our corpus statistics and search logs. This allows us to handle word-sense ambiguity, e.g., “dell” as a first name, last name, or company name.
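A toy version of such a tagger can be sketched with Viterbi decoding over a hand-built HMM. The tag set, transition, and emission probabilities below are invented for illustration; LinkedIn's actual model is trained on corpus statistics and search logs.

```python
import math

# Hypothetical tag set and hand-set probabilities, for illustration only.
TAGS = ["COMPANY", "TITLE", "NAME"]
START = {"COMPANY": 0.4, "TITLE": 0.3, "NAME": 0.3}
TRANS = {                       # P(tag_i | tag_{i-1})
    "COMPANY": {"COMPANY": 0.2, "TITLE": 0.6, "NAME": 0.2},
    "TITLE":   {"COMPANY": 0.5, "TITLE": 0.2, "NAME": 0.3},
    "NAME":    {"COMPANY": 0.3, "TITLE": 0.3, "NAME": 0.4},
}
EMIT = {                        # P(word | tag); note "dell" is ambiguous
    "COMPANY": {"linkedin": 0.5, "dell": 0.3, "ceo": 0.01},
    "TITLE":   {"ceo": 0.6, "dell": 0.01, "linkedin": 0.01},
    "NAME":    {"dell": 0.4, "linkedin": 0.01, "ceo": 0.01},
}

def viterbi(words):
    """Return the most likely tag sequence for the query words."""
    # V[i][t] = (best log-probability of tagging words[0:i+1] ending in t, path)
    V = [{t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], 1e-6)), [t])
          for t in TAGS}]
    for w in words[1:]:
        row = {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: V[-1][p][0] + math.log(TRANS[p][t]))
            score = (V[-1][prev][0] + math.log(TRANS[prev][t])
                     + math.log(EMIT[t].get(w, 1e-6)))
            row[t] = (score, V[-1][prev][1] + [t])
        V.append(row)
    return max(V[-1].values())[1]

tags = viterbi(["linkedin", "ceo"])  # ["COMPANY", "TITLE"]
```

The same decoder, given "dell" in different contexts, would pick COMPANY, NAME, or TITLE-adjacent readings depending on the surrounding words – which is exactly the word-sense ambiguity the note describes.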
In order to evaluate a job-candidate pair, we first use common-sense filtering to determine if the candidate is even plausible, e.g., we don't need fancy algorithms to determine that a sales executive in Turkey isn't a good match for a software engineering job in Mountain View. After this filtering, we take the two bags of features and create a single set of features for the pair to represent the matching. The matching features can be binary (e.g., is the candidate in the same industry as the job?), softer (e.g., based on the transition probability between the candidate's current job and the potential new one), and textual (we can use standard information retrieval methods to compare documents). Combining all of these using weights learned through regression, we can assign scores to matches. Note again that scale matters -- our corpus statistics are essential to computing the above features without falling victim to sparsity.
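A minimal sketch of that scoring step: the feature names, helper inputs, and weights below are hypothetical stand-ins for the regression-learned model, combined here with a logistic function.

```python
import math

def match_features(candidate, job, transition_prob, text_sim):
    """Build pairwise matching features (names and inputs are illustrative)."""
    return {
        "same_industry": 1.0 if candidate["industry"] == job["industry"] else 0.0,
        "same_geo": 1.0 if candidate["geo"] == job["geo"] else 0.0,
        "title_transition": transition_prob,  # P(candidate title -> job title)
        "text_similarity": text_sim,          # e.g. cosine over tf-idf vectors
    }

# Weights of the kind a (logistic) regression would learn; made up here.
WEIGHTS = {"same_industry": 1.2, "same_geo": 0.8,
           "title_transition": 2.0, "text_similarity": 1.5}
BIAS = -2.0

def score(features):
    """Weighted sum of matching features squashed to a match probability."""
    z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

cand = {"industry": "software", "geo": "bay area"}
job = {"industry": "software", "geo": "bay area"}
feats = match_features(cand, job, transition_prob=0.43, text_sim=0.56)
p = score(feats)
```

The common-sense filtering happens before this step, so the scorer only ever sees plausible pairs.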
If the value of your network reflects the saying that "you are who you know", Skills offers the complementary "you are what you know". Skills are diverse -- ranging from Ballet to Hadoop. In order to identify the set of skills, we turn again to the unreasonable effectiveness of data. Many of our 160M+ users have a Specialties section where they list their skills as free text. By mining these sections and other profile elements, we generated a set of potential skills for our entire corpus. Bootstrapping on that list, we implemented a suggested skills feature that is leading to increasing adoption of our controlled vocabulary.
Skills is still in beta. But here you see how related skills – which are derived by mining our corpus – can increase recall on a search for people who have expertise in WordNet, a lexical database developed at Princeton. We can’t rely on people to mention WordNet in their profiles. But we can expand our search to include related skills like ontologies and semantic search. Of course it’s a precision / recall tradeoff – but one that is completely transparent to the user.
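The expansion step can be sketched as follows; the related-skills table and the profiles are hypothetical stand-ins for what is actually mined from the corpus.

```python
# Hypothetical co-occurrence-derived related-skills table; the real one
# is mined from LinkedIn's corpus.
RELATED = {
    "wordnet": ["ontologies", "semantic search", "nlp"],
}

# Toy profile corpus for illustration.
PROFILES = [
    {"name": "A", "skills": {"wordnet", "java"}},
    {"name": "B", "skills": {"ontologies", "semantic search"}},
    {"name": "C", "skills": {"sales"}},
]

def search(skill, expand=False):
    """Return profiles matching the skill, optionally expanded to related skills."""
    wanted = {skill} | (set(RELATED.get(skill, ())) if expand else set())
    return [p["name"] for p in PROFILES if p["skills"] & wanted]

exact = search("wordnet")             # precision-oriented: ["A"]
expanded = search("wordnet", True)    # recall-oriented: ["A", "B"]
```

This makes the precision/recall tradeoff concrete: expansion recovers profile B, which never mentions WordNet but lists related skills.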
The same technique can be used to disambiguate a query like [owl]. If you’re looking for OWL specialists rather than ornithologists, then it’s helpful to require some supporting evidence, such as expertise in the semantic web or RDF.
Knowledge representation isn’t the answer. Computation is great. But with semi-structured data and data-driven computation, we can get even further.
To achieve the best results, we have to exploit the strengths of both people and machines. That means using computation to support communication.
Web search is beginning to embrace semi-structured data – using the unreasonable effectiveness of data to exploit the structure it has and derive latent structure where possible. The result is more user control and a more intuitive communication between the user and the system. What was once exotic is rapidly becoming mainstream.
At this year's Strata conference, my colleague Monica Rogati one-upped Norvig et al.'s argument about the unreasonable effectiveness of data: not all data is created equal, and quality trumps quantity. This is a teaser – I recommend you watch her talk, "The Model and the Train Wreck".