Data Works MD June 2021 - https://www.meetup.com/DataWorks/events/277780086/
Video - https://www.youtube.com/watch?v=SSVTFwaVlH8&ab_channel=DataWorksMD
-------------------------------------------------
Graph Analytics: Rich Relationships = Powerful Insights
Demo-driven exploration of graph analytics to identify criminals, discover trolls, analyze social networks, and more, using community detection, centrality, link prediction, and graph embedding for incorporation into machine learning models, along with creating graphs from Wikipedia via Wikidata. Includes all you need to get started, plus a review and use of two graph query languages, Cypher and SPARQL, across multiple environments including Neo4j, Amazon Neptune, and Nvidia CUDA graphs for large-scale graph processing using GPUs. Relationships are what it is all about.
-------------------------------------------------
John Hebeler, Fellow for Lockheed Martin, is a developer of large-scale, data-driven solutions using machine learning, graph analytics, and high-speed messaging across computer resources that reach from the clouds to the edge. Along the way he writes, presents, and teaches (mostly learns and plays). He holds a PhD in Information Systems, an MBA, and a BSEE.
2. All Data is really a Graph...
• Classics
• Seven Bridges of Königsberg
• Traveling Salesman
• "Data" Networks
• Computer
• Social
• Maps
• Internet
• Value
• Sometimes relationships alone provide insights
• Reveals powerful patterns
• Analytics enrich the possibilities
• Extends all the way to deep machine learning
• Let's Get Started...
(c) John Hebeler 2021
3. Graph Components
• Node (Label)
• Country
• Hierarchy
• Edge (Relationship)
• LOCATED_IN
• Can have Direction
• Properties
• StartDate
• Node and/or Edge Resident
• Metadata
[Diagram: a Node with a Name, an Edge with direction, and a StartDate Property]
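The three components above can be sketched in plain Python. This is only an illustration: the class names `Node` and `Edge` and the example values (an office located in a country, the start date) are hypothetical and do not come from any graph database API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node carries one or more labels plus optional properties."""
    name: str
    labels: tuple
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed relationship from `source` to `target` with its own properties."""
    source: Node
    relationship: str
    target: Node
    properties: dict = field(default_factory=dict)


# Hypothetical example data: an Office node located in a Country node.
office = Node("Acme Office", labels=("Office",))
canada = Node("Canada", labels=("Country",))

# The StartDate property lives on the edge here; it could also live on a node.
located_in = Edge(office, "LOCATED_IN", canada, {"StartDate": "2019-05-01"})

# Direction matters: the edge reads source -> target.
summary = (
    f"({located_in.source.name})"
    f"-[:{located_in.relationship}]->"
    f"({located_in.target.name})"
)
print(summary)
```

The printed form borrows Cypher's arrow notation only to make the direction visible.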
7. Property vs Knowledge Graphs
Property Graph
• Basic Node-Edge-Property
• No Formal Schema
• Cypher, Tinkerpop/Gremlin
Knowledge Graph
• Schema
• Can describe real-world entities
• Hierarchical Classes, Containment,
Constraints, Rules, Equivalence...
• Ontologies (RDF/OWL)
• Incrementally Expressive
• SPARQL Query Language
• Select and Construct
• Reasoner-Enabled
• Derives new assertions
• Validates current assertions
• Enables entry verification
[Diagram, knowledge graph: Car and Truck are each a "Type of" Vehicle; Car ID 567 "Is A" Car, with Car model Honda Accord and VIN 1234]
[Diagram, property graph: one node with Labels: Vehicle, Car and Properties Type: Honda Accord, VIN: 1234]
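To make "derives new assertions" concrete, here is a toy sketch in plain Python using the slide's vehicle example. It is not an RDF/OWL reasoner: the predicate names `type_of` and `is_a`, the identifier `Car567`, and the function `derive_types` are assumptions made for illustration, and only one rule is applied.

```python
# Assertions as (subject, predicate, object) triples.
triples = {
    ("Car", "type_of", "Vehicle"),
    ("Truck", "type_of", "Vehicle"),
    ("Car567", "is_a", "Car"),
}


def derive_types(assertions):
    """Apply one rule until nothing new appears:
    if X is_a C and C type_of D, then X is_a D."""
    known = set(assertions)
    changed = True
    while changed:
        changed = False
        for (x, p1, c) in list(known):
            if p1 != "is_a":
                continue
            for (c2, p2, d) in list(known):
                is_parent = p2 == "type_of" and c2 == c
                if is_parent and (x, "is_a", d) not in known:
                    known.add((x, "is_a", d))
                    changed = True
    return known


# Only the assertions that were not stated explicitly.
inferred = derive_types(triples) - triples
print(sorted(inferred))
```

A property graph would typically store the same fact directly, by putting both labels (Vehicle, Car) on the node, rather than deriving it from a schema.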
16. Getting Started with Neo4j
• Open Source Community Version with Data Science Extensions
• Neo4j: https://neo4j.com/
• Download Container: https://hub.docker.com/_/neo4j
$ docker pull neo4j
$ docker run \
    --publish=7474:7474 --publish=7687:7687 \
    --volume=$HOME/neo4j/data:/data --volume=$HOME/neo4j/import:/import \
    neo4j
# Other host ports can be used too; 7687 interferes with VMware.
• Add GDS (https://neo4j.com/download-center/#algorithms) and APOC Methods
• Load Functionality
• $ cp neo4j-graph-data-science-x.x.x.jar $NEO4J_HOME/plugins/
• $ cp $NEO4J_HOME/labs/apoc-4.2.0.1-core.jar $NEO4J_HOME/plugins/
• Update Configuration in $NEO4J_HOME/conf/neo4j.conf ($NEO4J_HOME is /var/lib/neo4j)
• dbms.memory.heap.initial_size=512m
• dbms.memory.heap.max_size=5G
• dbms.security.procedures.unrestricted=gds.*,apoc.*
• dbms.security.procedures.whitelist=gds.*,apoc.*
• Cypher Graph Query Language: https://neo4j.com/developer/cypher/intro-cypher/
• Import from popular formats (csv, json, …)
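As a rough sketch of the CSV import path, the Python below builds a small CSV and a matching Cypher `LOAD CSV` statement. The file name `located_in.csv`, the column names, the `City` label, and the sample rows are all hypothetical. The sketch assumes the file is saved into the import directory mounted in the `docker run` command above and that the statement is then run from the Neo4j Browser on port 7474.

```python
import csv
import io

# Hypothetical rows: each row becomes one LOCATED_IN relationship.
rows = [
    {"city": "Baltimore", "country": "United States"},
    {"city": "Paris", "country": "France"},
]


def to_csv_text(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["city", "country"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows)

# Cypher to paste into the Neo4j Browser once the file is in the import folder.
# MERGE avoids creating duplicate nodes when a name appears in several rows.
LOAD_STATEMENT = """
LOAD CSV WITH HEADERS FROM 'file:///located_in.csv' AS row
MERGE (c:City {name: row.city})
MERGE (n:Country {name: row.country})
MERGE (c)-[:LOCATED_IN]->(n)
""".strip()

print(csv_text)
print(LOAD_STATEMENT)
```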
17. Getting Started with Wikidata and SPARQL
• All (almost) of Wikipedia as a Graph
• https://www.wikidata.org/wiki/Wikidata:Main_Page
• SPARQL Overview
• https://www.w3.org/TR/2013/REC-sparql11-overview-20130321/
• https://query.wikidata.org/ (use the query helper)
• Presidential Demonstration
FILTER: instance of human; position held President of the United States
SHOW: date of birth, child, spouse
ADD: ORDER BY DESC(?date_of_birth)
• Can export findings to common formats (csv, json,…)
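The presidential demonstration can also be written out by hand. The sketch below is an assumption about what the query helper produces, not a copy of it: the identifiers (P31 instance of, Q5 human, P39 position held, Q11696 President of the United States, P569 date of birth, P40 child, P26 spouse) and the `build_request_url` helper are filled in for illustration. The code only builds the request URL and does not contact the service.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# The wd:, wdt:, wikibase: and bd: prefixes are predefined by the query service.
QUERY = """
SELECT ?person ?personLabel ?date_of_birth ?child ?spouse WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P39 wd:Q11696 .
  OPTIONAL { ?person wdt:P569 ?date_of_birth . }
  OPTIONAL { ?person wdt:P40 ?child . }
  OPTIONAL { ?person wdt:P26 ?spouse . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?date_of_birth)
""".strip()


def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET URL that asks for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(QUERY)
print(url)
```

The same query text can be pasted into https://query.wikidata.org/ to run it interactively. The OPTIONAL blocks keep presidents who have no recorded child or spouse in the results.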
18. Getting Started with Nvidia cuGraph (RAPIDS)
• Manages large data sets across one or more GPUs
• Contains all major graph analytic libraries
• Basic numeric graph data (must preprocess most data)
1 2
1 3
2 4
• Download Cuda Container
• https://ngc.nvidia.com/catalog/containers/nvidia:rapidsai:rapidsai
$ docker pull nvcr.io/nvidia/rapidsai/rapidsai:21.06-cuda11.0-runtime-ubuntu18.04
$ docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    nvcr.io/nvidia/rapidsai/rapidsai:21.06-cuda11.0-runtime-ubuntu18.04
• Contains working notebooks in the cugraph folder
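The "must preprocess most data" step usually means turning named vertices into integer IDs. Below is a minimal CPU-only sketch of that step in plain Python; the function `encode_edges` and the user names are hypothetical, and IDs start at 1 only to match the sample edge list above.

```python
def encode_edges(edges):
    """Give each distinct vertex name a consecutive integer ID (starting at 1)
    and return (id_by_name, numeric_edges)."""
    id_by_name = {}
    numeric_edges = []
    for source, target in edges:
        for name in (source, target):
            if name not in id_by_name:
                id_by_name[name] = len(id_by_name) + 1
        numeric_edges.append((id_by_name[source], id_by_name[target]))
    return id_by_name, numeric_edges


# Hypothetical input: named edges pulled from some raw data source.
raw_edges = [("alice", "bob"), ("alice", "carol"), ("bob", "dave")]

ids, numeric = encode_edges(raw_edges)

# Space-separated "source destination" lines, one edge per line.
lines = "\n".join(f"{s} {t}" for s, t in numeric)
print(lines)
```

Keeping `ids` around lets the numeric results of an analysis be mapped back to the original names afterwards.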
19. Getting Started with AWS Neptune
• Obtain an AWS Account
• Become Familiar with TinkerPop/Gremlin
• https://tinkerpop.apache.org/
• Create Neptune Database from AWS Console
• Interact with the SPARQL endpoint or the Apache TinkerPop™ Gremlin WebSockets server
20. Getting Started with Dask
• Manages large data sets across multiple CPUs/Cores
• Python Library
• Install with Anaconda or directly with pip install
• Start up GUI
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='6GB')
Available on port 8787 by default
• Read in Large Data File
• import dask.dataframe as dd
• df = dd.read_csv(...)
• df.x.sum().compute() # This uses the single-machine scheduler by default
• from dask.distributed import Client
• client = Client(...) # Connect to distributed cluster and override default
• df.x.sum().compute() # This now runs on the distributed system
• Filter Data Set
• reducedData = dataInput[dataInput['user'].isin(searchList)] # isin() alone returns a boolean mask; index with it to keep the matching rows
• Simplify/Restructure the data
• from urllib.parse import urlparse
• http_admin['urlbrief'] = http_admin['url'].map(lambda x: urlparse(x).netloc, meta=('urlbrief', 'object'))
• http_admin.compute()
• http_admin.head()
• Drop unwanted columns
• http_small = http_admin.drop(['url', 'activity', 'content'], axis=1)
• # create one large csv file with the listed columns
• http_small[['user', 'pc', 'urlbrief']].to_csv("http_admin_brief2.csv", single_file=True)
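Before mapping the URL simplification over a large Dask column, it can help to check the filter and `urlparse` logic on a few rows in plain Python. The sketch below does that without Dask; the sample records, the `search_list` values, and the helper `url_brief` are hypothetical.

```python
from urllib.parse import urlparse


def url_brief(url):
    """Return only the network location (host, plus port if present)."""
    return urlparse(url).netloc


# Hypothetical records shaped like the user/pc/url columns used above.
records = [
    {"user": "u1", "pc": "PC-1", "url": "http://example.com/a/b?q=1"},
    {"user": "u2", "pc": "PC-2", "url": "https://news.example.org/story"},
    {"user": "u3", "pc": "PC-3", "url": "http://example.com/other"},
]
search_list = ["u1", "u3"]

# Same two steps as above: keep only the listed users, then shorten the URL.
brief = [
    {"user": r["user"], "pc": r["pc"], "urlbrief": url_brief(r["url"])}
    for r in records
    if r["user"] in search_list
]
print(brief)
```

Collapsing each URL to its host is what turns many distinct page visits into a small set of site nodes that users can share edges with.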
21. References
• Books
• Graph Databases: Ian Robinson...
• Graph Algorithms: Mark Needham
• Graph Analytics with Neo4j: E Scifo
• Sites
• neo4j.com (also https://sandbox.neo4j.com/?usecase=graph-data-science)
• aws.amazon.com/neptune
• tinkerpop.apache.org
• wikidata.org
• dask.org