Document stores provide a native database management system for semi-structured data. They scale to gigabytes or terabytes of data, and typically millions or billions of records (a record being a JSON object or an XML document).
Introduction to Document Stores
A document store, unlike a data lake, manages the data directly and the users do not see the physical layout.
Unlike data lakes, document stores prevent us from breaking data independence by reading the data files directly: they offer a managed service for semi-structured data that we need to write and read quickly.
In this case, we would like an efficient way to store documents, as the original relational databases did, while offering a relaxed version of the integrity properties.
The documents in a document store are organized into collections. Each collection can have millions or billions of documents, while each single document weighs no more than 16 MB. Collections need not have a schema, but they can.
On the need for an unstructured document manager
In some cases, it is quite easy to convert a tree given in XML or JSON format into a structured format. The problem comes when we have nestedness: for these kinds of documents, it is sometimes possible to insert them by spreading them over different tables. It is sometimes possible to insert heterogeneous data into relational databases too, but we need to insert a lot of NULLs to make up for the heterogeneity. This creates a kind of impedance mismatch, a difference between the format in which we would like to handle the data and the format in which we actually store it, which adds overhead. In this context, document stores like MongoDB come naturally into play.
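For instance, a made-up customer document like the one below mixes nesting and heterogeneity; in a relational database it would be spread over several tables with NULLs for the missing fields, while a document store ingests it as-is (the collection and field names are hypothetical):
db.customers.insertOne({
    "name": "Alice",
    "address": {"city": "Zurich", "zip": "8001"},  // nested object: a separate table in a relational schema
    "orders": [                                    // array of heterogeneous objects: another table, with NULLs
        {"item": "book", "price": 12.5},
        {"item": "pen"}                            // no price here, and no NULL needed either
    ]
})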
Size of the documents
A collection can have millions or billions of documents, while each single document weighs no more than 16 MB.
This is why document stores can scale up to gigabytes or terabytes of data. This is somewhat similar to HDFS, where each block has a maximum size of 128 MB.
Enforcing Schemas
Document stores leverage the so-called schema-on-read pattern: the schema is enforced when the data is read, not when it is written. This is a relaxed, more flexible version of the schema-on-write pattern used in relational databases. In document stores, the schema is often discovered: they offer query functionality to find out which keys appear in the data, what kind of value is associated with each key, and so on, or even functionality that directly infers a schema.
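For instance, MongoDB (presented below) lets us probe which keys appear and which types their values have via the $exists and $type operators; a minimal sketch with made-up field names:
db.collection.find({"field": {$exists: true}})   // which documents contain this key?
db.collection.find({"field": {$type: "string"}}) // which documents hold a string under it?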
MongoDB
MongoDB is one of the most famous document storage engines. This database management system is closer to standard database management systems, where data is first loaded into the system (ETL), than to read-from-file paradigms like Hadoop or Spark.
The Architecture
Replication
If we store petabytes of data, replication is a necessary feature. In MongoDB, collections are partitioned into shards, and each shard has a Replica Set. Each replica set has a primary server and a set of secondary servers. The primary server is the one that accepts write operations, while the secondary servers replicate the data from the primary. Usually a MongoDB cluster does not have many machines.
When writing to the primary server, a simple optimization is the following: clients may wait only for a certain number of acknowledgements from the secondary servers, while the remaining replicas catch up asynchronously. This makes writes faster for the user.
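In MongoDB this number of acknowledgements is controlled by the write concern, passed as an option to write operations; a minimal sketch (document and field made up):
db.collection.insertOne(
    {"field": "value"},
    {writeConcern: {w: 2}}  // return as soon as the primary plus one secondary have acknowledged
)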
Physical Storage
Documents in MongoDB are usually stored in BSON format, a binary equivalent of JSON (see Markup) that is usually more efficient.
Indices
While Spark would scan all the data in parallel to find a single data point for a certain query, the main advantage of the ETL approach (and thus of MongoDB) is the possibility of creating indices. With indices we can look up data much faster. We have studied something similar for relational databases in Index, B-trees and hashes.
Types of indices
There are mainly two types of indices, hash indices and B-trees; both have been extensively studied in the node Index, B-trees and hashes, and the ideas here are exactly the same.
Drawbacks of hash indices
- They cannot answer range queries.
- They require extra space to handle collisions. The upside is that these indices are quite fast: with no collisions, a lookup costs $\mathcal{O}(1)$.
While B-trees are slower, they can answer range queries. Their complexity is $\mathcal{O}(\log_{n}(N))$, where $n$ is the branching factor and $N$ the number of indexed entries.
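For instance, with a branching factor of $n = 100$ and $N = 10^{9}$ entries, a lookup costs about $\log_{100}(10^{9}) = 4.5$, i.e. at most $5$ node accesses.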
Default indices
MongoDB creates by default a unique index on the primary key _id.
Queries that benefit
Certain queries benefit considerably from having indices:
- Point queries (a hash index makes these quite fast, while trees make them logarithmic).
- Range queries (which only B-tree indices can answer).
Creating an Index
If we want a tree index:
db.collection.createIndex({"field": 1})
If we want a hash index:
db.collection.createIndex({"field": "hashed"})
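To check which indices exist on a collection (including the default one on _id mentioned above), the shell offers getIndexes():
db.collection.getIndexes()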
Post-filtering
If a query involves some keys that have an index and others that do not, MongoDB will first look up the indexed keys, and then post-filter on the remaining keys, which still makes the query faster. Under the hood, it creates many plans, estimates the cost of each plan and then executes one.
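We can inspect which plan MongoDB chose with the explain method on a query; a sketch, assuming field is indexed and field2 is not:
db.collection.find({
    "field": "value",    // served by the index
    "field2": "value2"   // post-filtered after the index lookup
}).explain("executionStats")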
Operations
Similarly to HTTP APIs, MongoDB offers CRUD (Create, Read, Update, Delete) operations. It is important to know the syntax of these operations.
Read operations
For example:
db.collection.find({
    "field": "value",
    $or: [
        {"field2": "value2"},
        {"field3": "value3"}
    ],
    "field4": {$gte: 10},
    "field5.subfield": "value4"
}, {"field": 1}).sort({"field": 1}).skip(10).limit(10)
We filter the collection by the field values, and we project only the field we are interested in. Having AND conjunctions is easy: just put several conditions in the same object. Keys that start with “$” are treated differently by MongoDB and encode different operators. Other dollar operators are
- $or, $in, $nin
- $gte
- $lt
And similar.
In some drivers a project method also exists on cursors; in the shell, as above, the projection is passed to find as its second argument.
We can index into arrays and objects using dot notation inside the find query. To query nested structures we need this dot syntax: if we instead put a whole dictionary in the place of the value, MongoDB will match the dictionary exactly.
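A sketch with a made-up document: suppose the collection contains {"field5": {"subfield": "value4", "extra": 1}}.
db.collection.find({"field5.subfield": "value4"})      // matches: dot notation reaches inside the object
db.collection.find({"field5": {"subfield": "value4"}}) // no match: field5 must equal exactly this dictionary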
Aggregate queries
An example aggregation query:
db.collection.aggregate([
{$match: {"field": "value"}},
{$group: {_id: "$field", "count": {$sum: 1}}},
{$sort: {"count": -1}},
{$limit: 10}
])
This is very similar to a Spark RDD pipeline.
Insertion, Update and Delete
Insertion is quite straightforward, with insertOne:
db.collection.insertOne({"field": "value"})
insertMany:
db.collection.insertMany([{"field": "value"}, {"field": "value2"}])
There are equivalent operations for updates and deletes, sketched below.
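A minimal sketch of these operations (filters and fields made up):
db.collection.updateOne({"field": "value"}, {$set: {"field": "newValue"}})
db.collection.updateMany({"field": "value"}, {$set: {"flag": true}})
db.collection.deleteOne({"field": "value"})
db.collection.deleteMany({"field": "value"})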
In practice, we then try to create indices for the most common queries. Which indices to create often depends on the query code and on the traffic (if it is the database behind a website). But there is no advantage if the indices do not fit in the random access memory. Another disadvantage is slower updates: we need to update the index on every write to the collection. In exchange, a single indexed lookup takes just about 2 ms, which is much faster than HDFS and Spark.