Wednesday, April 1, 2015

What do good data models look like?

So you have a new database to look at, and you want a quick take on what you are in for.

Assuming that the database is meant for online transaction processing (OLTP), here's a very rough way to look at it.

Baaaaad Data Model

The picture below represents, in my mind, the prototypical low-quality data model. It has only a few tables, but each table is "wide" - it has a lot of fields in it. There are also only a few ways that the tables are clearly related. In short, the bad data model:
  • has few tables 
  • the tables have lots of columns in them
  • the tables are not tightly related

Baaaad Data Model

Goooood Data Model

So what does a good data model look like? Lots of tables, not very many columns per table, with lots of relationships. In the picture below, the "not very many columns" is represented by the narrow rectangles.

In short, a high-quality data model has:

  • lots of tables
  • few columns per table
  • lots of relationships 

Gooood Data Model
Why?

The OLTP  use case has a few common characteristics:

  • it supports multiple systems and many users concurrently
  • it doesn't know when work will arrive - episodic traffic rather than periodic
  • it doesn't know what work will arrive - traffic is made up of all operations: SELECT, INSERT, UPDATE, DELETE
  • data quality is of high importance

Relational database management systems were created to directly address the OLTP use case, and they do it well.

The characteristics above imply the characteristics that a high-quality data model should have:

  • it should be clear and descriptive of the data
  • it should enforce data quality
  • it should be flexible

Clear and Descriptive

A high-quality data model should be a "picture that is worth a thousand words". You should be able to look at the data model, and figure out what the organization does. You should be able to see every person, place, thing, idea, event, and transaction that is important to the organization. The names of tables should be consistent with the language of the organization. If the organization refers to a "agreement", the table representing it should not be called "contract", "lease",  "arrangement", "accord", "pact", etc.

High-quality data models have a table for each entity (lots of tables). So each entity is identified separately and given a name. Low-quality data models have multiple entities stored in a single table (few tables). So multiple entities are combined into a single table, and hidden. They have no name, and are undocumented.

Data Quality

A high-quality data model should help to enforce that data is reasonable as it is enters the database. In a high-quality data model, each relationship represents a referential integrity CHECK constraint. This is one way that data is checked for reasonableness as it enters the database. So, a data model that has lots of relationships is checking for reasonableness in many places, and making sure "stupid errors" don't occur. 

A high-quality data model that is clear and descriptive also clarifies what accurate data should look like. A single-entity table clarifies what a reasonable record should look like. It makes it easy to define other CHECK constraints. 

A low-quality data model with few relationships has not captured all the different referential integrity CHECK constraints that could be used to maintain data quality. Furthermore, a large table that represents multiple entities makes it difficult to know what a reasonable record should look like. 

Flexibility

This is probably the most subtle, and hardest to grok without experience. 

A high-quality data model should be flexible. It should work well with all current and future systems. It should accept new features without excessive effort. It should accept bug fixes and modifications elegantly. It should not be difficult to work with from one perspective, but easy from another. In short, it should not suffer from program-data dependence. 

High-quality data models store the data in a form uninfluenced by any particular systems' usage of the data. Clear single-entity tables describe and store the data, rather than promote a particular usage of the data. Relationships describe how entities relate, regardless of systems' usage.

Low-quality data models typically impose a bias on usage of the data. When multiple entities are in a single table, the data is stored pre-JOINed in a certain manner. It is difficult to JOIN in other ways. It is also more difficult to separate one entity from an entity that it is pre-JOINed with.

Qualifications

Your mileage may vary according to your situation. Data models are not really good/bad. They have advantages and disadvantages, depending on situations.  

This is a high-level way to get a quick read on a database before digging into the details. I think it's also a good way to remain grounded in the objectives of clarity, data quality, and flexibility.



Followers