Graph Databases

We have first cited the graph data model in the Introduction to Big Data note. Until now, we have explored many aspects of relational data bases, but now we are changing the data model completely. The main reason driving this discussion are the limitations of classical relational databases: queries like traversal of a high number of relationships, reverse traversal requiring also indexing foreign keys (need double index! Index only work in one direction for relationship traversal, i.e. if you need both direction you should build an index both for the forward key and backward key), looking for patterns in the relationships, are especially expensive when using normal databases. We have improved over the problem of joining with relational database using Document Stores with three data structure, but these cannot have cycles. We call index-free adjacency: we use physical memory pointers to store the graph.

Graph Data Models

Graph databases are more generic than standard relational databases. One of the main use cases that it attempts to solve is the cost of joins in standard relational databases, the problem that Wide Column Storage was attempting to solve too!

Design objectives

Graphs are variegate: it is not feasible to think of a single data model that can fit all the possible use cases. For this reason, we list here many choices of possible models and properties of databases that should be considered when designing a database.

Read or Write intensive: we have discussed in Introduction to Big Data that it is important to consider if the load is read or write intensive (OLAP or OLTP).
Local or distributed: it is important to consider if the data is stored in a single machine or in many machines.
Native or non-native: if the database is native, it means that it is designed to store graphs, otherwise it is non-native (perhaps the underlying layer has a different data model).

Triple Stores

This data model is commonly known as RDF (Resource Description Framework), which is W3C standard. It is a simple model that consists of triples, which are made of a subject, a predicate and an object (note: these are all labels, not properties!). The subject is the entity, the predicate is the property and the object is the value of the property. This is a very simple model, but it is not very efficient for querying. We have explored this model in Metadati web e web semantico. This kind of graph data model is usually very easy to implement with relational databases with tables made of three columns.

In triple stores, not all labels are compatible to subject object or properties.

	Subject	Property	Object
IRI
Literal
Blank Node

Labeled property graph model

This is the most natural model for us Computer Scientists. We take the classical notion of Graphs (See Grafi). And extend it with properties and labels. Properties are flat-maps in nodes and edges. Each property is a map from strings to values that represents a certain property of the node or edge. Labels are some strings attached to nodes, they are commonly called types. There is no limit for the number of labels and properties for each.

Each node and each edge can be associated with a map from strings to values, which represents its properties. So we can have as many properties as we would like for nodes and edges.

Labels can be seen as table names, nodes as records, and properties as the attribute values for the records.

Graph Databases-20241207175432993 — Comparison between some nodes and relational databases from the Textbook

Labels can be stores as properties, this is what Triple Store people could argue.

In a labeled property graph, nodes can have one or more labels and zero or more properties, edges are directed and have exactly one label(s) and zero or more properties.

Meaning: Every node and edge is labelled!

Neo4j

The nodes in a query that map to specific real nodes are called anchors/bound nodes as they anchor the query to real nodes.

Graph databases maintain fast query times even when storing vast amounts of data. Learning to trust our graph database is important when learning to structure our graphs without denormalizing them.

Data Types

These are the classical types.

Neo4j type	JSONiq/JSound/XML Schema type
STRING BOOLEAN INTEGER FLOAT NULL DATE TIME DATETIME ZONED DATETIME ZONED TIME LOCAL DATETIME LOCAL TIME DURATION LIST	string boolean long double null date time dateTime dateTimeStamp time (time zone mandatory) dateTime (no time zone) time (no time zone) duration array
But we have some extensions for graph types too!

Point for (GIS) data types
Paths
Nodes
Edges.

Operations

With Clause

This is very similar to the functional let clause in Querying Denormalized Data. Example:

WITH 1 as x
RETURN x + 1

Set Clause

This used to add some labels to the nodes:

MATCH (a {y:1}) WHERE a.x=1 or a.x=8 SET a: yourRook
MATCH (a {y:1}) WHERE a.x=2 or a.x=7 SET a: yourKnight
MATCH (a {y:1}) WHERE a.x=3 or a.x=6 SET a: yourBishop
MATCH (a {a.x:4, y:8}) SET a: theirKing

UNWIND Clause

This is used to transform a list into a table. It is very similar to the for loop in JSONiq:

UNWIND [1, 2, 3] as x
RETURN x + 1

UNWIND range(1, 8) AS y
UNWIND range(1, 8) AS x   
CREATE (:Field {x: x, y:y})

MATCH Clause

This is the most powerful clause in Cypher. It can used to traverse the graph. It is a quite natural way to find for patterns in a graph database

MATCH (n:Person)-[:KNOWS]->(m)
RETURN n, m

CREATE Clause

This is used to create nodes and relationships in the graph. It is quite similar to the INSERT statement in SQL.

CREATE (einstein:Scientist {name: "Albert Einstein"}),
     	 (tesla:Scientist {name: "Nikola Tesla"}),
        (einstein)-[:KNOWS]->(tesla)

MERGE Clause

This is used to create a node if it does not exist, otherwise it returns it. It is similar to the UPSERT statement in SQL.

MATCH (n:Person {name: "Albert Einstein"})
MERGE (n)-[:KNOWS]->(tesla:Scientist {name: "Nikola Tesla"})

MATCH (a: Field), (b: Field)
WHERE (a.x=b.x OR a.y=b.y) AND (a<>b)
MERGE (a)-[:rook]->(b)

Standards

Two main standards:

SQL/PGQ (we use relational query like, but extend it with graph queries, one simple use case is PostgresSQL with Graph extension)
GQL (Graph Query Language) Mainly people have agreed on these two possible query standards.

Physical Architecture

Neo4j can handle graphs with tens of billions of nodes, relationships, and properties. Apparently it could handle the Facebook graph.

Sharding

We would like to be able to store graph data on different machines, but this is difficult when we have links between different data on another place! This was a difficult problem but has been solved in recent years. Very recently so to say.

Graph in memory

We store the nodes, relationships , properties and labels separately on the disk. But when the database is up, everything is stored in the memory (index-free adjacency map). Usually using adjacency lists (doubly linked lists on the edges for each node).

Stores

There exists a store for nodes, relationships, properties, and labels. Storing the data for example for nodes and properties separately makes the underlying model slightly different from the view exposed to the user, however this greatly increases traversal performance.

Stores for nodes, for example have a size of exactly 9 bytes.

First byte: usage data
next four: ID of the relationship connected for the node
next five: ID of the first property for the node
Final byte is for future use.

Resource Description Framework

See Metadati web e web semantico for a comparison on this topic. RDFs stores every possible property or label as a tuple of three. These are directional and are called:

Subject
Predicate
Object. All of these are identified by URI or some literals or blank nodes (see the linked resource for cases where we have blank nodes, it has to do with reification, which is an added problem due to the model’s simplicity).

URI can be used in each of these three parts, this is very important because it allows to create a tree of data that looks like a taxonomy. Literals can be used for the object part, and they are the values of the properties. Blank nodes are used for the subject or object part, and they are used to represent the absence of it, but not on the relation.

Three Syntaxes

RDF/XML
Turtle
JSON-LD See Metadati web e web semantico, they are the most common syntaxes to represent semantic web.

Graph Data Models#

Design objectives#

Triple Stores#

Labeled property graph model#

Neo4j#

Data Types#

Operations#

With Clause#

Set Clause#

UNWIND Clause#

MATCH Clause#

CREATE Clause#

MERGE Clause#

Standards#

Physical Architecture#

Sharding#

Graph in memory#

Stores#

Resource Description Framework#

Three Syntaxes#