Entity Linking, the task of mapping ambiguous Named Entities to unique identifiers in a knowledge base, is a cornerstone of many Information Retrieval and Text Analysis pipelines. So far, no single entity linking algorithm has been able to offer the accuracy and scalability required to deal with the ever-increasing amount of data on the web and become a de-facto standard. In this work, we propose a framework for entity linking that
leverages graph embeddings to perform collective disambiguation. This framework supports pluggable algorithms for generating embeddings and ranking the candidates. Moreover, we use our framework to evaluate a reference pipeline that uses DBpedia as knowledge base and leverages specific algorithms for fast candidate search and high-performance state-space search optimization. Compared to existing solutions, our pipeline offers state-of-the-art accuracy on a variety of datasets without any supervised training, and provides real-time execution even when processing documents with dozens of Named Entities. Additionally, the flexibility of our framework allows adapting to a multitude of scenarios by balancing accuracy and execution time.
Alberto Parravicini, 23/05/2019
Component of applications that require
high-level representations of text:
1. Search Engines, for semantic search
2. Recommender Systems, to retrieve documents
similar to each other
3. Chat bots, to understand intents and entities
Use Cases
An EL system requires 2 steps:
1. Named Entity Recognition (NER): spot
mentions (a.k.a. Named Entities)
● High accuracy in the state of the art[1]
The EL Pipeline (1/2)
[1] Huang, Zhiheng, Wei Xu, and Kai Yu. "Bidirectional LSTM-CRF models for sequence tagging."
An EL system requires 2 steps:
2. Entity Linking: connect mentions to entities
The EL Pipeline (2/2)
● Novel unsupervised framework for EL
● No dependency on NLP
● First EL algorithm to use graph embeddings
● Accuracy similar to supervised SoA techniques
● Highly scalable, with real-time execution
● < 1 sec to process text with 30+ mentions
Our contributions
Our EL Pipeline
[Pipeline diagram: input text with NER → (1) Graph Building → (2) Embeddings → (3) Candidate Finder → (4) Disambiguation → output; steps 1-2 form the Graph Learning phase, steps 3-4 the Execution phase]
We obtain a large graph from DBpedia
● All the information of Wikipedia, stored as triples
● 12M entities, 170M links
Step 1/4
Graph Creation
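The graph-creation step above can be sketched as follows. This is a minimal, hypothetical example of turning DBpedia-style (subject, predicate, object) triples into an adjacency structure; the entity names are illustrative, and the real pipeline processes 12M entities and 170M links.

```python
from collections import defaultdict

# Hypothetical DBpedia-style triples; names are illustrative only.
triples = [
    ("dbr:Donald_Trump", "dbo:party", "dbr:Republican_Party"),
    ("dbr:Hillary_Clinton", "dbo:party", "dbr:Democratic_Party"),
    ("dbr:Donald_Trump", "dbo:opponent", "dbr:Hillary_Clinton"),
]

# Build an undirected adjacency list: each triple links two entities.
graph = defaultdict(set)
for subj, _pred, obj in triples:
    graph[subj].add(obj)
    graph[obj].add(subj)

print(len(graph))  # number of entities seen in the triples
```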
● Graph embeddings encode vertices as vectors
● “Similar” vertices have “similar” embeddings
● Idea: entities with the same context should have
low embedding distance
Embeddings Creation (1/2)
Step 2/4
● In our work, we use DeepWalk[1]
● Like word2vec[2], it leverages random walks (i.e.
vertex sequences) to create embeddings
● Embedding size 170, walk length 8
● DeepWalk uses only the graph topology
● Simple baseline; better algorithms could also
leverage graph features
Embeddings Creation (2/2)
[1] Bryan Perozzi et al. 2014. DeepWalk: Online Learning of Social Representations.
[2] Tomas Mikolov et al. 2013. Distributed Representations of Words and Phrases and their Compositionality.
Step 2/4
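The random walks that DeepWalk feeds to a skip-gram model can be sketched as below. This is a toy, stdlib-only illustration under assumed names: the graph and seeds are made up, the walk length of 8 matches the slides, and the real pipeline runs on the full DBpedia graph before training word2vec-style embeddings on the walks.

```python
import random

# Toy graph; vertex names are illustrative assumptions.
graph = {
    "Trump": ["Clinton", "USA"],
    "Clinton": ["Trump", "USA"],
    "USA": ["Trump", "Clinton"],
}

def random_walk(graph, start, length=8, rng=random):
    # A walk is a vertex sequence; DeepWalk treats it like a sentence.
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

random.seed(0)
# Several walks per vertex; these would be fed to a skip-gram model.
walks = [random_walk(graph, v) for v in graph for _ in range(10)]
print(len(walks), len(walks[0]))  # 30 walks, each of length 8
```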
Candidates Finder
Idea: for each mention, select a few candidate
vertices with index-based string similarity
● Solve ambiguity by following redirect and
disambiguation edges
Step 3/4
Disambiguation (1/3)
● We want to pick the “best” candidate for
each mention
● In a “good” solution, candidates are related to
each other (e.g. Donald Trump, Hillary Clinton)
● Observation: a good tuple of candidates has
embeddings close to each other
● Evaluating all combinations is infeasible
● 10 mentions with 100 candidates each → 100^10 combinations
Step 4/4
● We use a heuristic state-space search
algorithm to maximize the sum of string
similarities plus the sum of embedding cosine
similarities w.r.t. the tuple mean
Step 4/4
Disambiguation (2/3)
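The objective above can be sketched as a scoring function over one tuple of candidates (one per mention): string similarities plus the cosine similarity of each candidate embedding to the tuple's mean embedding. This is a minimal sketch, not the paper's implementation; the field names `string_sim` and `embedding` are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tuple_score(candidates):
    # Sum of string similarities, plus sum of cosine similarities
    # between each candidate embedding and the tuple mean embedding.
    dim = len(candidates[0]["embedding"])
    mean = [sum(c["embedding"][i] for c in candidates) / len(candidates)
            for i in range(dim)]
    string_part = sum(c["string_sim"] for c in candidates)
    coherence_part = sum(cosine(c["embedding"], mean) for c in candidates)
    return string_part + coherence_part

# A coherent tuple (embeddings pointing the same way) should outscore
# an incoherent one with identical string similarities.
coherent = [{"string_sim": 1.0, "embedding": [1.0, 0.0]},
            {"string_sim": 1.0, "embedding": [1.0, 0.1]}]
incoherent = [{"string_sim": 1.0, "embedding": [1.0, 0.0]},
              {"string_sim": 1.0, "embedding": [-1.0, 0.5]}]
print(tuple_score(coherent) > tuple_score(incoherent))  # True
```

The heuristic search then explores candidate tuples to maximize this score instead of enumerating all 100^10 combinations.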
● We compared against 6 SoA EL algorithms on 5
datasets
● Our Micro-averaged F1 score is comparable with
SoA supervised algorithms
Results: accuracy
● Different settings enable real-time EL, with
minimal loss in accuracy
● E.g. number of iterations, early stop
Results: exec. time
● Execution time is split fairly evenly between
the Candidate Finder and Disambiguation
Results: exec. time of
single steps
Thank you!
Fast Entity Linking via Graph Embeddings
● Novel unsupervised framework for EL
● First EL algorithm to use graph embeddings
○ Accuracy similar to supervised SoA techniques
● Real-time execution
Alberto Parravicini, alberto.parravicini@polimi.it
Rhicheek Patra
Davide Bartolini
Marco Santambrogio
2019-05-{17-31}, NGCX
●Turn topology and properties of each vertex
into a vector
●“Similar” vertices have “similar” embeddings
Embeddings
Step 2/4
[Figure: the topology and properties of a vertex are combined into its embedding vector]
Hamilton, William L., Rex Ying, and Jure Leskovec.
"Representation learning on graphs: Methods and
applications." arXiv preprint arXiv:1709.05584 (2017).
Candidates Finder
Idea: for each mention, select a small number of
candidate vertices with string similarity
● We use a simple index-based string search
● Fuzzy matching with 2-grams and 3-grams
● This provides a simple baseline (60-70%
accuracy)
Step 3/4
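The fuzzy matching above can be sketched with character 2-grams and 3-grams. This is an illustrative baseline, not the exact index used in the pipeline: the entity list is made up, and ranking candidates by Jaccard similarity of n-gram profiles is an assumption standing in for the real index-based search.

```python
def ngrams(text, n):
    # Set of character n-grams of the lowercased text.
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def profile(text):
    # Combined 2-gram and 3-gram profile, as in the slides.
    return ngrams(text, 2) | ngrams(text, 3)

def find_candidates(mention, entities, k=3):
    # Rank entities by Jaccard similarity of their n-gram profiles
    # and keep the top-k as candidates for the mention.
    m = profile(mention)
    scored = []
    for e in entities:
        p = profile(e)
        jaccard = len(m & p) / len(m | p)
        scored.append((jaccard, e))
    scored.sort(reverse=True)
    return [e for _, e in scored[:k]]

# Hypothetical entity labels for illustration.
entities = ["Donald_Trump", "Donald_Duck", "Hillary_Clinton"]
print(find_candidates("Trump", entities, k=2))
```

In practice the n-gram profiles would live in an inverted index so that candidates are retrieved without scanning all 12M entities.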