Last week I attended the NYC Neo4j Meetup on Data Modeling & Natural Language Search. Even better, I happened to win tickets to GraphConnect NYC, which took place just a few days later. Rather than just bullet-pointing the content, I decided to write a brief synopsis of each of the forums I attended so you know which slides to review or video to check out first. This should be a pretty decent place to start if you're just looking to get your bearings when it comes to graph DBs and Neo4j. There are also a bunch of videos available at http://watch.neo4j.org/.
Intro/GraphDBs and Data Complexity
Adoption rates of graph databases continue to increase, with telecom, financial services, and web/ISV firms driving the bulk of the growth. This is primarily due to the evolution of data and data connectedness (note: coming from a relational database model, you can think of connectedness as a data model with lots of joins). Although graph databases and graph modeling can seem intimidating, a graph is actually a very natural way to model the real world.
For example, compare the complexity of answering the below questions in their respective domains in a relational DB versus a graph DB.
- Subway or Transit Systems: "How do I get from point A to point B?"
- Data-Centers or Networks: "If switch X goes down, what systems are impacted?"
- Recommendation Engines: "What did customers who bought this product also buy?"
- Social Networks: "How many of my friends know person X?"
These can certainly be modeled in a relational world, and the questions answered using appropriate algorithms, but much of that extra work becomes unnecessary if you use a graph DB instead. Adoption rates seem to indicate this assessment is correct, especially in products where determining correlations and causality is a key feature.
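To make the transit question concrete, here is a minimal Python sketch of "How do I get from point A to point B?" as a plain breadth-first traversal. The station names and the `SUBWAY` map are made up for illustration, and this is not how Neo4j implements path finding; it only shows how directly the question maps onto a graph.

```python
from collections import deque

# Hypothetical transit map: each station lists its directly connected stations.
SUBWAY = {
    "A": ["C", "D"],
    "B": ["E"],
    "C": ["A", "E"],
    "D": ["A"],
    "E": ["C", "B"],
}

def shortest_route(graph, start, goal):
    """Breadth-first search: return the route with the fewest hops, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no route exists

print(shortest_route(SUBWAY, "A", "B"))
```

The same traversal answers the data-center question if the edges mean "depends on" instead of "connects to".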
Graph DBs, Features and Benefits
In addition to being intuitive and natural to model, graph DBs offer query times that stay roughly constant even as the data set grows, at least for queries that start at a known node and traverse outward. That is, you only pay for the nodes you actually traverse (compare this to outer joining on a large table). Additionally, since graph DBs have non-rigid structures, it is easier to evolve the structure of a graph DB than that of a relational DB.
Querying a Graph DB
A graph consists of nodes (entities) and edges (lines to other nodes), which Neo4j calls relationships. Each relationship can be labeled with a descriptor of its type. Additionally, each node can have any number of properties.
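As a rough illustration of that structure, the sketch below keeps nodes with arbitrary properties and labeled relationships in memory. The class names (`Node`, `Relationship`, `PropertyGraph`) and the sample data are assumptions made for this example, not part of any Neo4j API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)  # any number of properties

@dataclass
class Relationship:
    start: int
    rel_type: str  # the relationship descriptor, e.g. "WORKS_FOR"
    end: int

class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.outgoing = {}  # node_id -> relationships leaving that node

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.outgoing.setdefault(node_id, [])
        return self.nodes[node_id]

    def relate(self, start, rel_type, end):
        rel = Relationship(start, rel_type, end)
        self.outgoing[start].append(rel)
        return rel

    def neighbors(self, node_id, rel_type):
        """Follow only relationships of the given type from one node."""
        return [self.nodes[r.end] for r in self.outgoing[node_id]
                if r.rel_type == rel_type]

graph = PropertyGraph()
graph.add_node(1, name="Pam", kind="Person")
graph.add_node(2, name="Acme", kind="Company")
graph.relate(1, "WORKS_FOR", 2)
print([n.properties["name"] for n in graph.neighbors(1, "WORKS_FOR")])
```

Because each node holds its own list of outgoing relationships, looking up a node's neighbors does not require scanning the whole data set.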
Cypher is the language Neo4j provides to describe and query the graph. Cypher queries can provide the engine the following data:
- A starting node
- What relationships to traverse
- What data to return
- What property values to match
- How to order the data
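The five pieces of information listed above can be mirrored in a small Python function. This is only an analogy for what a query tells the engine, not Cypher and not a Neo4j API; `run_query`, `NODES`, and `RELS` are hypothetical names and the data is invented.

```python
# Hypothetical data: node id -> properties, plus (start, type, end) triples.
NODES = {
    "ian": {"name": "Ian", "age": 40},
    "pam": {"name": "Pam", "age": 31},
    "bob": {"name": "Bob", "age": 25},
    "sue": {"name": "Sue", "age": 35},
}
RELS = [
    ("ian", "KNOWS", "pam"),
    ("ian", "KNOWS", "bob"),
    ("ian", "KNOWS", "sue"),
    ("ian", "WORKS_WITH", "bob"),
]

def run_query(start, rel_type, where, returns, order_by):
    # Starting node + which relationships to traverse
    reached = [NODES[end] for (s, t, end) in RELS if s == start and t == rel_type]
    # Which property values to match
    matched = [node for node in reached if where(node)]
    # How to order the data
    matched.sort(key=lambda node: node[order_by])
    # What data to return
    return [node[returns] for node in matched]

result = run_query("ian", "KNOWS",
                   where=lambda node: node["age"] > 30,
                   returns="name", order_by="age")
print(result)
```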
Not unlike the relational world, it's critical to model data correctly when building an application that is backed by a graph DB. It's helpful to know the questions you are going to ask of your data when you are building your data-model.
Given a system that is used for knowledge management and exploring one's professional social network, a typical question asked of the system might be "Which people who work for the same company have the same skills as I do?" Working from that question, modeling involves:
- Identifying entities (People, Companies, Skills)
- Identifying relationships (A person WORKS_FOR a company, a person HAS_SKILL)
Convert these relationships and entities into Cypher, and examine how easy (or difficult) it is to answer your sample question(s). It may also be helpful to create a sample graph and ask the question(s) with the graph model in front of you (see slide 14).
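One way to sanity-check a model like this before writing any Cypher is to answer the sample question against a tiny hand-built graph. The Python sketch below does that with plain dictionaries; the people, companies, and skills are invented, and the two dictionaries simply stand in for the WORKS_FOR and HAS_SKILL relationships.

```python
# Hypothetical sample graph: person -> company, and person -> set of skills.
WORKS_FOR = {"me": "Acme", "ann": "Acme", "raj": "Acme", "lee": "Globex"}
HAS_SKILL = {
    "me": {"java", "neo4j"},
    "ann": {"neo4j", "python"},
    "raj": {"cobol"},
    "lee": {"java", "neo4j"},
}

def colleagues_with_shared_skills(person):
    """Which people at the same company share at least one skill with `person`?"""
    company = WORKS_FOR[person]
    my_skills = HAS_SKILL[person]
    matches = {}
    for other, employer in WORKS_FOR.items():
        if other == person or employer != company:
            continue  # skip myself and people at other companies
        shared = my_skills & HAS_SKILL.get(other, set())
        if shared:
            matches[other] = sorted(shared)
    return matches

print(colleagues_with_shared_skills("me"))
```

If the question is awkward to answer even on a toy graph like this, that is usually a hint the entities or relationships need another pass.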
Tests for a graph-backed application should be similar to the tests you'd have in an application with a relational store. In addition, you should be able to test that your data model is adequate, and ensure that the queries you designed return the correct data. Of course, these tests also serve as documentation and act as regression tests.
In Java, there is an entirely in-memory version of Neo4j that should only be used for testing. This engine can be injected via dependency injection (as you would do with other dependencies). In doing so, you should also include a small, static data-set. This should provide you the ability to easily execute your queries with full knowledge of what the correct response should be. The exact strategy (how this gets injected, etc.) will depend on your architecture.
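The talk's examples are in Java; the sketch below shows the same testing pattern in Python so it can run without any dependencies. `InMemoryStore` is a hypothetical stand-in for the in-memory engine, `SkillFinder` is an invented application class that receives the store by injection, and `FIXTURE` is the small static data set.

```python
import unittest

class InMemoryStore:
    """Hypothetical test double standing in for an in-memory graph engine."""
    def __init__(self, triples):
        self.triples = list(triples)

    def related(self, start, rel_type):
        return [end for (s, t, end) in self.triples if s == start and t == rel_type]

class SkillFinder:
    """Application code; the store is injected rather than created here."""
    def __init__(self, store):
        self.store = store

    def skills_of(self, person):
        return sorted(self.store.related(person, "HAS_SKILL"))

# Small static data set, so the correct answers are known in advance.
FIXTURE = [
    ("pam", "HAS_SKILL", "neo4j"),
    ("pam", "HAS_SKILL", "java"),
    ("pam", "WORKS_FOR", "acme"),
]

class SkillFinderTest(unittest.TestCase):
    def setUp(self):
        self.finder = SkillFinder(InMemoryStore(FIXTURE))

    def test_returns_known_skills(self):
        self.assertEqual(self.finder.skills_of("pam"), ["java", "neo4j"])

    def test_unknown_person_has_no_skills(self):
        self.assertEqual(self.finder.skills_of("nobody"), [])
```

Saved as a module, this can be run with `python -m unittest`. In production the same `SkillFinder` would receive a store backed by the real database.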
Neo4j Application/Platform Choices
Neo4j can be deployed in three ways:
- Embedded: hosted in a Java process; provides native access to the Neo4j Java API
- Server: wraps the embedded instance; provides a JSON REST interface
- Server Extensions: provide a way to execute custom [Java] code on the server; provide a hook to control the HTTP request and response format, basically a hook into the HTTP pipeline (if you request "/get_skills/", handle it like this...)
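The "if you request /get_skills/, handle it like this" idea can be pictured as a lookup from request path to handler. The Python sketch below is only a conceptual illustration; real Neo4j server extensions are written in Java, and the `route` decorator, `handle` function, and `SKILLS` data here are all hypothetical.

```python
import json

# Hypothetical data the custom handler reads from.
SKILLS = {"pam": ["java", "neo4j"]}
ROUTES = {}

def route(path):
    """Register a handler for a request path (illustrative only)."""
    def register(handler):
        ROUTES[path] = handler
        return handler
    return register

@route("/get_skills/")
def get_skills(params):
    # Custom code controls exactly what the response looks like.
    person = params["person"]
    return {"person": person, "skills": SKILLS.get(person, [])}

def handle(path, params):
    """Dispatch a request to its registered handler and serialize the result."""
    handler = ROUTES.get(path)
    if handler is None:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(handler(params))

print(handle("/get_skills/", {"person": "pam"}))
```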
There are specific code examples for writing unit tests of embedded code as well as Server Extensions starting at slide 30 of Ian's presentation.
Marvel has a significant inventory of stories, sprawling storylines, authors, and characters. Additionally, stories from the Marvel Universe don't obey the rules that we are used to modeling. For example, characters can have a property or attribute for a very limited time, die and come back to life, exist in multiple places at once... there are basically no rules whatsoever. This would be a relational modeling nightmare.
Graph DBs are being used to solve a number of challenges associated with this universe. For example, given this scenario: "A character is a superhero at a time T, which occurs in issue I. This character is also related to an overarching theme in the Marvel history, and is part of a team."
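A rough way to picture that scenario is as a handful of relationships, with the time-limited role stored as a property on the relationship rather than on the character. The Python sketch below uses placeholder names ("Character C", "Issue I", and so on) and invented relationship types; it is not Marvel's actual model.

```python
from collections import namedtuple

# start node, relationship type, end node, relationship properties
Fact = namedtuple("Fact", "start rel end props")

# Placeholder data: the role lives on the relationship, so it can change per issue.
FACTS = [
    Fact("Character C", "APPEARS_IN", "Issue I", {"role": "superhero", "time": "T"}),
    Fact("Character C", "APPEARS_IN", "Issue J", {"role": "civilian", "time": "T+1"}),
    Fact("Character C", "MEMBER_OF", "Team M", {}),
    Fact("Character C", "PART_OF", "Theme H", {}),
]

def roles_over_time(character):
    """List (time, role, issue) for every appearance of the character."""
    return [(f.props["time"], f.props["role"], f.end)
            for f in FACTS
            if f.start == character and f.rel == "APPEARS_IN"]

def connections(character, rel):
    """Follow one relationship type from the character."""
    return [f.end for f in FACTS if f.start == character and f.rel == rel]

print(roles_over_time("Character C"))
print(connections("Character C", "MEMBER_OF"))
```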
The purpose of a visualization is to provide the user with a better understanding of the structure and relationships of the data.
Visualizing Nodes: Dos and Don'ts
- Use images and icons (as opposed to just circles with words in them)
- Use colors and/or gradients (gradients can be useful in expressing a numeric value, but don't get too fancy or combine colors and expect the user to know what you are expressing)
Visualizing Edges: Dos and Don'ts
- Use colors to indicate categories of relationships
- Use width to indicate strength of a connection
Interactivity with the Graph: Dos and Don'ts
- It is important to allow the user to easily add and remove filters
- Provide an easy way to combine qualities and aggregate them under a node
- Animation: If you are going to animate, make sure you do it in a way that does not cause the user to lose track of individual nodes (Note, out of the box D3 does not smoothly animate)
- Layout: Don't force the user into the layout you think is best for the data. The user might be looking for something you were not looking for when you designed the application. Provide multiple layouts for the same data-set.
- Explore: One of the biggest advantages of graph DBs is the ability to discover correlations and causality you did not intend on finding. It is important you design visualizations with this in mind. This also means you don't need to show all the data you have at once, because it may not all be relevant. You can work around this by aggregating data under a node (think "click to expand"). Aggregations can be done by grouping on common node properties, for example.
- Uniqueness: Not every relationship is useful (Why do all nodes tie back to one node? This can cause a lot of noise on a graph visualization and provides little to no value most of the time)
- Structural Rigidity: Don't assume that how you store your data is how users want to analyze/explore/think about the data.
- Don't overwhelm the user with data
- Don't try to be too clever. Make everything intuitive.
- Using 3D (causes occlusion problems, is hard to navigate, hard to print)
- Bad color choices
- No labels (What does this edge actually mean?)
- No legend
- No tooltips on hover
- Complexity (Not providing filtering)
- No emphasis on important nodes
- Black backgrounds
- Bad navigation
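The "click to expand" aggregation mentioned under Explore above could be sketched as grouping nodes on a shared property, so the visualization draws one collapsed node per group. The `department` property and the sample nodes in this Python sketch are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical nodes as they might be handed to a visualization layer.
PEOPLE = [
    {"id": 1, "name": "Pam", "department": "Sales"},
    {"id": 2, "name": "Raj", "department": "Engineering"},
    {"id": 3, "name": "Ann", "department": "Sales"},
    {"id": 4, "name": "Lee"},  # no department recorded
]

def aggregate(nodes, prop):
    """Collapse nodes sharing a property value into one summary node each."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.get(prop, "unknown")].append(node["id"])
    # One collapsed node per value; `members` is what "click to expand" reveals.
    return [{"label": f"{value} ({len(ids)})", "members": ids}
            for value, ids in sorted(groups.items())]

for summary in aggregate(PEOPLE, "department"):
    print(summary)
```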
In a graph DB, you can search by traversing paths, collecting every path that satisfies your criteria (whatever they may be), and returning those matches. Note that, unlike in a relational store, you only pay the cost of the traversal (the total number of nodes has little to no impact).
It is also possible (and useful) to model time in a graph DB. Rather than modeling specific times and dates, Kenny explains that, depending on the task at hand, it might be easier to model moments or snapshots in time. See this graph gist for more details, but essentially modeling
T0 > T1 > T2
...is helpful in the context of natural language search (or search in general) because you can model out things like:
At T0, Pam viewed Product X. At T1, Pam viewed Product Y. At T2, Pam purchased Product X.
With a larger sample set, it is possible to make useful deductions from this data, for example suggesting that people who are viewing Product Y take a look at Product X. Kenny suggests computing all these path traversals offline (not in real time), storing them in a JSON-friendly store, and then at runtime extracting the relevant traversals based on some action (e.g., viewing Product Y). Note that extrapolating path traversals can be compute intensive, so you may want to distribute the work with map-reduce.
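Here is a minimal Python sketch of that offline-then-runtime split, assuming each person's events are already ordered in time (T0, T1, T2, ...). The `SESSIONS` data and the function names are invented, and a JSON string stands in for the JSON-friendly store.

```python
import json

# Hypothetical event timelines, one ordered list of (action, product) per person.
SESSIONS = {
    "pam": [("viewed", "X"), ("viewed", "Y"), ("purchased", "X")],
    "raj": [("viewed", "Y"), ("viewed", "X"), ("purchased", "X")],
    "lee": [("viewed", "Y"), ("purchased", "Z")],
}

def build_suggestions(sessions):
    """Offline step: for each viewed product, count later purchases of other products."""
    table = {}
    for events in sessions.values():
        for i, (action, product) in enumerate(events):
            if action != "viewed":
                continue
            for later_action, later_product in events[i + 1:]:
                if later_action == "purchased" and later_product != product:
                    counts = table.setdefault(product, {})
                    counts[later_product] = counts.get(later_product, 0) + 1
    return table

def suggest(table, viewed_product):
    """Runtime step: the product most often bought after viewing this one, if any."""
    counts = table.get(viewed_product)
    if not counts:
        return None
    return max(counts, key=counts.get)

stored = json.dumps(build_suggestions(SESSIONS))  # stand-in for the JSON store
print(suggest(json.loads(stored), "Y"))
```

With a real data set, the `build_suggestions` step is the part you would distribute, since it touches every timeline.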