A data model is an abstract view over the data that hides the way it is stored physically.
The same idea appears in (Codd 1970) [1]: we should not modify data directly, but pass through an abstraction that maintains the properties of that specific data model.
Data Models
Tree view
We can view all JSON and XML data, as presented in HTML and Markup, as trees. This structure is usually quite evident, as it is inherent in their design. Converting the in-memory tree into its textual representation is known as serialization, while the reverse process is called parsing.
A key distinction is that JSON models label edges, while XML models label nodes. Additionally, XML accessors are bidirectional (each node carries a parent pointer), whereas JSON lacks this feature.
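The parsing/serialization round trip can be sketched with the Python standard library; the document contents here are made up for illustration:

```python
import json
import xml.etree.ElementTree as ET

# Parsing: from the textual representation to an in-memory tree.
doc = json.loads('{"person": {"name": "Ada", "age": 36}}')

# In JSON, edges are labeled: we navigate by key.
assert doc["person"]["name"] == "Ada"

# Serialization: from the in-memory tree back to text.
text = json.dumps(doc)

# In XML, nodes are labeled: each node carries its tag name.
root = ET.fromstring("<person><name>Ada</name></person>")
assert root.tag == "person"
assert root[0].tag == "name" and root[0].text == "Ada"
```

Note that `xml.etree.ElementTree` nodes do not expose a parent pointer either, so a full bidirectional XML accessor requires a richer in-memory model than this one.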
XML Information Set
This is how an XML document is actually represented in memory once it has been parsed. It is also quite easy to implement after a standard programming course. The information set has many other parts that we will not cover in this course (e.g. namespaces and comments).
Document Information Item
The document information item is the root node of the tree view. Its children include the top-level element of the XML file, which must always be present, and the doctype declaration, which is optional.
Element Information Item
This is just the element tag, as explained briefly in HTML and Markup. Its parent can be the document information item or another element information item. It has attributes, children, and a local name (the name of the tag).
Attribute Information Item
Here we have a key (called local name), the value, and the parent, which is the element that owns this attribute.
Text Information Item
We just have the text value and the parent element that owns it.
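The information items above can be inspected with a standard XML parser; a small sketch using Python's `xml.etree.ElementTree`, with a made-up document:

```python
import xml.etree.ElementTree as ET

# A document whose parse tree contains an element item,
# an attribute item, and a text item.
root = ET.fromstring('<book lang="en">Moby Dick</book>')

assert root.tag == "book"           # element item: its local name
assert root.attrib["lang"] == "en"  # attribute item: local name -> value
assert root.text == "Moby Dick"     # text item: the character content
```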
Validation
Even with a well-formed JSON or XML document, the structure can still be quite disorganized. Imagine a document where values are scattered haphazardly. Our goal should be to impose some order on these documents by providing them with an abstract model of the data. We say that a JSON or XML document adheres to a model when it is correctly validated against that model. In this section, we will explore specific validation modes available for JSON and XML. Usually this validation step is performed after the file has been correctly parsed into memory.
Validation also brings some advantages; one is processing speed. With normalized data, we can use specific data structures tailored to that format and store the data more efficiently.
Atomic Types
We need to remember that the data model has some default atomic types, which go beyond the four atomic types available in merely well-formed JSON:
- Strings
- Numbers
- Booleans
- Null values
- Dates and times (be careful with the month/day division: the number of days depends on the month and, for February, on the year)
- Time intervals
- Binary
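The month/day pitfall mentioned above can be demonstrated with Python's standard `datetime` module, which rejects lexically plausible but logically invalid dates:

```python
from datetime import date

# "2024-02-29" denotes a valid date only because 2024 is a leap year.
assert date(2024, 2, 29).isoformat() == "2024-02-29"

# 2023 is not a leap year, so the same month/day combination is rejected.
try:
    date(2023, 2, 29)
except ValueError:
    pass  # invalid day-of-month for that year
```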
These atomic types are common almost everywhere, but their names can vary. The following table summarizes the names used by the different technologies.
Lexical and Value space
All atomic types have in common that they have a logical value space and a lexical value space. The logical value space corresponds to the actual logical abstraction of the type (e.g., a mathematical set of integers), while the lexical value space corresponds to the representation of logical values in a textual format (such as the decimal, or binary, or hexadecimal representation of an integer).
So the value space is what the type means, while the lexical space is how its values are actually written down and stored.
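The distinction can be made concrete with integers, where several lexical representations denote the same logical value:

```python
# Three different lexical representations of the same logical value 42.
assert int("42") == int("2A", 16) == int("101010", 2) == 42

# Conversely, one logical value can be serialized into many lexical forms.
assert format(42, "x") == "2a"      # hexadecimal
assert format(42, "b") == "101010"  # binary
```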
Static and Dynamic schemas
Dynamic schemas allow for more flexibility, while static schemas are more rigid. The former are more common in JSON, while XML Schemas are static. For example, a dynamic schema can allow keys that are not defined in the schema if additionalProperties is set to true.
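A minimal sketch of such an open object type in JSON Schema syntax (the property names are made up for illustration):

```json
{
  "type": "object",
  "properties": {
    "name": { "type": "string" }
  },
  "additionalProperties": true
}
```

With `"additionalProperties": false` instead, an instance like `{"name": "x", "extra": 1}` would fail validation because of the undeclared key.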
JSON Validation
There are some specific types in JSON, and we need to know them well. One of the most famous validation frameworks for JSON is JSound (developed at ETH for teaching purposes); you should know its syntax very well for the scope of the course (see the slides). Another validator is JSON Schema, which is standardized at the IETF.
Structured Types
In this case, there are no additional types beyond the standard array and map values typically found in JSON. The only distinction is that we might need to impose a cardinality constraint.
There are many ways to describe requirements on the data model; some common ones to keep in mind are:
- Cardinality: How many times a specific element can appear.
- Requiring presence of a key
- Open and closed object types (difference between JSound and JSON schema).
- Primary key constraints, nullable and required values (in JSound these use the special characters ?, = and @)
- Any values
- Type unions and intersections, negation, and exclusive combinations.
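A sketch of how the special characters mentioned above could appear in a compact JSound schema; the field names are invented, and the exact syntax should be checked against the course slides:

```json
{
  "product": {
    "@id": "integer",
    "name": "string",
    "category": "string=uncategorized",
    "price": "decimal?"
  }
}
```

Here `@` is intended to mark the primary key, `=` a default value, and `?` a field that may be null.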
XML Validation
XML Schema is one of the most famous ways to validate XML. This is an example of an XML Schema:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="foo" type="xs:string"/>
</xs:schema>
```
We see that the declarations live inside a namespace; then there are special tags for elements, types, and attributes.
We can also control the number of times an element appears in the document.
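Cardinality is controlled with the `minOccurs` and `maxOccurs` attributes; a sketch with invented element names:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <!-- "book" must appear at least once, with no upper bound -->
        <xs:element name="book" type="xs:string"
                    minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```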
XML Types
Here are some types taken from the book.
- Strings: xs:string, xs:anyURI (for strings containing a URI)
- Numbers: xs:decimal, xs:integer, xs:float, xs:double, xs:long, xs:int, xs:short, xs:byte, xs:negativeInteger, xs:positiveInteger, xs:nonNegativeInteger, xs:nonPositiveInteger, xs:unsignedByte, xs:unsignedShort, xs:unsignedInt, xs:unsignedLong
- Dates and times: xs:date, xs:time, xs:dateTime, xs:gYearMonth, xs:gYear, xs:gMonth, xs:gDay, xs:gMonthDay, xs:dateTimeStamp
- Time intervals: xs:duration, xs:yearMonthDuration, xs:dayTimeDuration
- Binary types: xs:hexBinary, xs:base64Binary
- Booleans: xs:boolean
- Nulls: does not exist as a type in XML Schema (JSON-specific).
Normally one can also define custom atomic types! In this case we use keywords like xs:simpleType together with xs:restriction, or xs:complexType for more complex types (nested elements or sequences).
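A sketch of a custom atomic type built with `xs:restriction`; the type name and bounds are invented for illustration:

```xml
<xs:simpleType name="grade">
  <xs:restriction base="xs:decimal">
    <!-- restrict the value space of xs:decimal to [1, 6] -->
    <xs:minInclusive value="1"/>
    <xs:maxInclusive value="6"/>
  </xs:restriction>
</xs:simpleType>
```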
Dataframes
What are data frames?
Data frames are a generalization of (normalized) relational tables allowing for (organized and structured) nestedness.
Dataframes are a way to represent data in tabular form, with rows and columns. We can interpret dataframes over whatever data model we want; the only requirement is that the data is valid against a schema that we define. These schemas must not allow additional fields and must not allow "any" values, only specific types. A relational table is a dataframe, but the converse is not necessarily true.
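A hypothetical, minimal sketch of the requirements just described: every row must match one closed schema, with no extra fields and no untyped "any" values (the schema and rows are made up):

```python
# One closed schema shared by all rows of the dataframe.
schema = {"name": str, "age": int}

def is_valid_row(row, schema):
    if set(row) != set(schema):  # closed: no missing and no extra fields
        return False
    # every value must have the specific declared type (no "any")
    return all(isinstance(row[k], t) for k, t in schema.items())

rows = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]
assert all(is_valid_row(r, schema) for r in rows)          # homogeneous
assert not is_valid_row({"name": "Grace", "extra": 1}, schema)
```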
Data Formats
Some formats are highly optimized for storage, e.g. Parquet, Avro, ROOT, and Google's Protocol Buffers. We gain space and time efficiency: since rows are homogeneous, lookups become more efficient because entire groups of rows can be skipped. We can also apply compression and columnar storage.
Parquet is called that way because the data is stored in a columnar fashion, grouping all the values (across objects) together when they are associated with the same key, which might remind you of the parquet floor of your living room.
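The columnar idea can be sketched in plain Python by pivoting a row-oriented collection into a column-oriented one (the records are made up):

```python
# Row-oriented: one record per object, keys repeated in every record.
rows = [
    {"name": "Ada",  "age": 36},
    {"name": "Alan", "age": 41},
]

# Column-oriented (the Parquet idea): each key stored once,
# with all of its values grouped together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

assert columns == {"name": ["Ada", "Alan"], "age": [36, 41]}

# Reading a single column no longer touches the other fields.
ages = columns["age"]
assert ages == [36, 41]
```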
Homogeneous vs Heterogeneous
Homogeneous collections are collections of documents that all follow the same data model, with the typing rules allowed for dataframes, while heterogeneous collections do not have these rules. Another important distinction is classifying structures as flat or nested.
Classification of data formats
We have three main categories of data formats:
- Binary: e.g. Parquet, Avro, ORC, Protocol Buffers (not CSV, JSON, XML, YAML)
- Nestedness: e.g. JSON, XML, Parquet, Avro (not CSV)
- Schema: if they need a valid data frame, e.g. Avro, Protocol Buffers, XML Schema, JSON Schema (not CSV, JSON or YAML)
Columnar Storage
Dataframes are more efficient than wide column stores in some respects because each key only needs to be described once, allowing us to store keys efficiently. Another reason: if we only need a single column in an Apache Spark task, we can fetch just that column and ignore the others. In Apache Spark, dataframes are represented by an internal DataFrame type.
References
[1] E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, Vol. 13(6), pp. 377–387, 1970.