We have first cited the graph data model in the Introduction to Big Data note.
Until now, we have explored many aspects of relational data bases, but now we are changing the data model completely. The main reason driving this discussion are the limitations of classical relational databases: queries like traversal of a high number of relationships, reverse traversal requiring
also indexing foreign keys (need double index! Index only work in one direction for relationship traversal, i.e. if you need both direction you should build an index both for the forward key and backward key), looking for patterns in the relationships, are especially expensive when using normal databases.
We have improved over the problem of joining with relational database using Document Stores with three data structure, but these cannot have cycles.
We call index-free adjacency: we use physical memory pointers to store the graph.
Graph databases are more generic than standard relational databases. One of the main use cases that it attempts to solve is the cost of joins in standard relational databases, the problem that Wide Column Storage was attempting to solve too!
Graphs are variegate: it is not feasible to think of a single data model that can fit all the possible use cases. For this reason, we list here many choices of possible models and properties of databases that should be considered when designing a database.
Read or Write intensive: we have discussed in Introduction to Big Data that it is important to consider if the load is read or write intensive (OLAP or OLTP).
Local or distributed: it is important to consider if the data is stored in a single machine or in many machines.
Native or non-native: if the database is native, it means that it is designed to store graphs, otherwise it is non-native (perhaps the underlying layer has a different data model).
This data model is commonly known as RDF (Resource Description Framework), which is W3C standard. It is a simple model that consists of triples, which are made of a subject, a predicate and an object (note: these are all labels, not properties!). The subject is the entity, the predicate is the property and the object is the value of the property. This is a very simple model, but it is not very efficient for querying.
We have explored this model in Metadati web e web semantico.
This kind of graph data model is usually very easy to implement with relational databases with tables made of three columns.
In triple stores, not all labels are compatible to subject object or properties.
This is the most natural model for us Computer Scientists. We take the classical notion of Graphs (See Grafi). And extend it with properties and labels. Properties are flat-maps in nodes and edges. Each property is a map from strings to values that represents a certain property of the node or edge. Labels are some strings attached to nodes, they are commonly called types. There is no limit for the number of labels and properties for each.
Each node and each edge can be associated with a map from strings to values, which represents its properties. So we can have as many properties as we would like for nodes and edges.
Labels can be seen as table names, nodes as records, and properties as the attribute values for the records.
Comparison between some nodes and relational databases from the Textbook
Labels can be stores as properties, this is what Triple Store people could argue.
In a labeled property graph, nodes can have one or more labels and zero or more properties, edges are directed and have exactly one label(s) and zero or more properties.
The nodes in a query that map to specific real nodes are called anchors/bound nodes as they anchor the query to real nodes.
Graph databases maintain fast query times even when storing vast amounts of data. Learning to trust our graph database is important when learning to structure our graphs without denormalizing them.
MATCH (a {y:1}) WHERE a.x=1 or a.x=8 SET a: yourRook
MATCH (a {y:1}) WHERE a.x=2 or a.x=7 SET a: yourKnight
MATCH (a {y:1}) WHERE a.x=3 or a.x=6 SET a: yourBishop
MATCH (a {a.x:4, y:8}) SET a: theirKing
We would like to be able to store graph data on different machines, but this is difficult when we have links between different data on another place! This was a difficult problem but has been solved in recent years. Very recently so to say.
We store the nodes, relationships , properties and labels separately on the disk.
But when the database is up, everything is stored in the memory (index-free adjacency map). Usually using adjacency lists (doubly linked lists on the edges for each node).
There exists a store for nodes, relationships, properties, and labels. Storing the data for example for nodes and properties separately makes the underlying model slightly different from the view exposed to the user, however this greatly increases traversal performance.
Stores for nodes, for example have a size of exactly 9 bytes.
First byte: usage data
next four: ID of the relationship connected for the node
See Metadati web e web semantico for a comparison on this topic.
RDFs stores every possible property or label as a tuple of three. These are directional and are called:
Subject
Predicate
Object.
All of these are identified by URI or some literals or blank nodes (see the linked resource for cases where we have blank nodes, it has to do with reification, which is an added problem due to the model's simplicity).
URI can be used in each of these three parts, this is very important because it allows to create a tree of data that looks like a taxonomy.
Literals can be used for the object part, and they are the values of the properties.
Blank nodes are used for the subject or object part, and they are used to represent the absence of it, but not on the relation.