This document summarizes Max De Marzi's presentation on ETL (extract, transform, load) processes for loading data into Neo4j. It discusses using the Neo4j REST API, Gremlin and Groovy, and the Neo4j Batch Importer for ETL. It also provides an example of ETL from a SQL database by identifying relationships between rows and importing the data without node IDs.
2. About Me
Built the Neography Gem (Ruby wrapper to the Neo4j REST API)
Playing with Neo4j since 10/2009
• My Blog: http://maxdemarzi.com
• Find me on Twitter: @maxdemarzi
• Email me: maxdemarzi@gmail.com
• GitHub: http://github.com/maxdemarzi
3. Agenda
• ETL your mind
• ETL with Batch and the REST API
• ETL with Gremlin and Groovy
• ETL with the Batch Importer
• ETL from SQL
10. Relational tables:
• Language: language_code, language_name, word_count
• LanguageCountry: language_code, country_code, primary
• Country: country_code, country_name, flag_uri
Graph model:
• Language node: name, code, word_count
• Country node: name, code, flag_uri
• IS_SPOKEN_IN relationship (Language to Country): as_primary
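A minimal sketch of the mapping on this slide: each row of the LanguageCountry join table becomes one IS_SPOKEN_IN relationship, and the `primary` column becomes the `as_primary` property. The function name, the dict layout, and the Language-to-Country direction are assumptions made for illustration.

```python
def join_rows_to_relationships(language_country_rows):
    """Turn LanguageCountry join-table rows into relationship descriptions.

    Each input row is a dict with the join table's columns:
    language_code, country_code, primary.
    The output layout (start/type/end/properties) is hypothetical.
    """
    relationships = []
    for row in language_country_rows:
        relationships.append({
            "start": ("Language", row["language_code"]),
            "type": "IS_SPOKEN_IN",
            "end": ("Country", row["country_code"]),
            # The join table's extra column moves onto the relationship.
            "properties": {"as_primary": row["primary"]},
        })
    return relationships
```

The join table itself disappears from the graph model: its foreign keys become the two ends of the relationship, and anything else it carried becomes a relationship property.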
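12. Denormalized Country table: name, flag_uri, language_name, number_of_words, yes_in_language, no_in_language, currency_code, currency_name
Graph model:
• Country node: name, flag_uri
• Language node: name, number_of_words, yes, no
• Currency node: code, name
• SPEAKS relationship (Country to Language)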
14. Batch command from REST API
Great for importing Facebook/Twitter friends
Keep each request under 10k commands
Preferably send a request every 2k to 5k commands
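A rough sketch of how friend imports could be split into requests of a few thousand commands each. The job layout (`method`, `to`, `body`, `id`, and `{id}` back-references) follows the REST batch format as I understand it; the `name` property, the FRIENDS relationship type, and the function name are assumptions. Each friend costs two commands, and a friend's pair is never split across requests, because `{id}` references only resolve within one request.

```python
def friend_batches(user_node_id, friends, max_commands=4000):
    """Yield lists of batch jobs, each list sized for a single request.

    user_node_id: id of an already existing node (hypothetical input).
    friends: list of friend names to create and connect.
    """
    per_request = max(1, max_commands // 2)  # two commands per friend
    for start in range(0, len(friends), per_request):
        jobs = []
        for friend in friends[start:start + per_request]:
            node_job_id = len(jobs)
            # Command 1: create the friend node.
            jobs.append({
                "method": "POST",
                "to": "/node",
                "body": {"name": friend},
                "id": node_job_id,
            })
            # Command 2: connect the existing user node to the node
            # created by command 1, referenced as {node_job_id}.
            jobs.append({
                "method": "POST",
                "to": "/node/%d/relationships" % user_node_id,
                "body": {"to": "{%d}" % node_job_id, "type": "FRIENDS"},
                "id": len(jobs),
            })
        yield jobs
```

Each yielded list would then be serialized to JSON and sent as one POST to the server's batch endpoint.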
16. Why Batch
Transactional: if any command fails,
nothing is committed.
Ordered: responses are guaranteed
to be in the same order as the commands sent.
Continuous: load or update
nodes and relationships in
spurts or as a stream.
18. Commit every 1000 changes or so, and make sure to stop the transaction at the very
end to commit the last few changes.
Look into auto-indexing to make life easier.
It is disabled by default. See the docs for a trick to make it a full-text
index instead of an exact index.
http://docs.neo4j.org/chunked/milestone/auto-indexing.html
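The commit pattern can be sketched in a language-neutral way as follows. `apply_change` and `commit` are hypothetical callbacks standing in for whatever the graph API provides (for example, creating a vertex and stopping the current transaction); they are not part of any real library.

```python
def load_in_batches(changes, apply_change, commit, batch_size=1000):
    """Apply changes one by one, committing every batch_size changes.

    Returns the number of commits performed.
    """
    pending = 0
    commits = 0
    for change in changes:
        apply_change(change)
        pending += 1
        if pending == batch_size:
            commit()
            commits += 1
            pending = 0
    # Final commit so the last few changes are not lost.
    if pending:
        commit()
        commits += 1
    return commits
```

The trailing `if pending` branch is the step that is easy to forget: without it, up to `batch_size - 1` changes at the end of the input are never committed.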
19. Crazy Format is ok
Id :: Title :: Genre|Genre|Genre
But it’s preferable to stay clear of
characters that need escaping, like “|”
String location of data file, converted to URL, then processed one line at a time.
Movie vertex created, genre vertex created unless it exists (index lookup), edge
from movie to genre is created.
Full walk-through on http://maxdemarzi.com/2012/01/13/neo4j-on-heroku-part-one/
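The line-by-line load described above can be sketched in plain Python with in-memory dicts standing in for the graph and its index. This is not the Gremlin/Groovy script from the walk-through; the HAS_GENRE label and both function names are assumptions. Python's `str.split` treats "|" literally, whereas regex-based splitters (such as `String.split` in Java and Groovy) need it escaped, which is presumably why the slide advises against it.

```python
def parse_movie_line(line):
    """Parse one 'Id :: Title :: Genre|Genre|Genre' line."""
    movie_id, title, genres = [part.strip() for part in line.split("::")]
    return movie_id, title, [g.strip() for g in genres.split("|")]


def load_movies(lines):
    """Build movie vertices, genre vertices, and movie-to-genre edges."""
    movies, genres, edges = {}, {}, []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        movie_id, title, genre_names = parse_movie_line(line)
        movies[movie_id] = {"title": title}
        for name in genre_names:
            # Create the genre vertex only if the lookup finds nothing.
            genres.setdefault(name, {"name": name})
            edges.append((movie_id, "HAS_GENRE", name))
    return movies, genres, edges
```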
30. What about multiple types of nodes?
No problem, just add the MAX(node_id) from the first table to the IDs of the second table.
Full walk-through at:
http://maxdemarzi.com/2012/02/28/batch-importer-part-2/
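A small sketch of the offset idea, assuming both tables use positive integer IDs that would otherwise collide: the first table keeps its IDs, and the second table's IDs are shifted by the first table's maximum so every node ID stays unique. The function name and return layout are hypothetical.

```python
def assign_node_ids(first_table_ids, second_table_ids):
    """Map each table's row IDs to graph node IDs that do not overlap.

    Returns two dicts: row ID -> node ID, one per table.
    """
    offset = max(first_table_ids)  # the MAX(node_id) of the first table
    first = {row_id: row_id for row_id in first_table_ids}
    second = {row_id: row_id + offset for row_id in second_table_ids}
    return first, second
```

Relationships that point at rows of the second table need the same offset applied to their end node IDs, otherwise they would connect to nodes from the first table.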
Need help? E-mail me, catch me on Google chat or Skype.
Please don’t be shy…. and read my blog:
http://maxdemarzi.com