Thursday, 8 March 2012

Data modelling and ORM

Have you ever thought of what a database really is? Wikipedia says that: "A database is an organised collection of data for one or more purposes, usually in digital form." Almost all services on the Internet today has a database and I like to look at the database as the foundation of any service. It's the most crucial part. I'm not saying that nothing else is important. But if you were to lose all your code, and still have your database, you can rebuild your service. On the other hand, if you lose your database and all your data, then you have a problem, and it's not a small one. A well known example of what can happen when data goes missing is from Japan's Pension Service in 2007. In an attempt to migrate data, key information was lost, resulting in 50 million pension accounts being corrupted. The entire pension database in Japan was useless. Suddenly there was no way to identify what person had what pension account. All data recorded from 1979 till 1997, corrupted. Some data was recovered, or recollected. But the result was that the ruling party lost the election and I bet some heads were, literally, rolling as well.

When you have data collected you quickly get the need of analysing and updating the information. You could store everything in a text document and still argue it's a database. However, it would become hard to manage. In other words, you do need a good system for organising and storing your data. A wise man, with name Edgar F. Codd, came up with such a system, namely the relational model. The relational model is based on first order logic. An example of such logic is the expression

hospital(ą) → doctor(ą)

which means that ą is a hospital and that ą has a doctor. In other words we're saying that all hospitals has got a doctor. Should be a reasonable assumption even though I'm sure there are some examples of hospitals without doctors as well.

In a relational model relations can be expressed in the following way:

(src : Wikipedia)

In the diagram above we're saying that Samuel Clemens has got login "mark" and phone number "555.555.5555".

Data programs, or data systems, for storing and managing data using the relational model are called RDBMS or relational database management systems. Mr Codd made a set of 13 rules that is required from a database management system for it to be considered rational. Funnily enough, because we start with rule 0, these rules are called Codd's 12 rules.

There exists several RDBMSs. Some of the well known are Oracle, PostgreSQL, MySQL, Sybase and SQL Server (MS SQL). However there are alternatives to the relational database. You've got NoSQL datases, XML-databases and other object oriented databases. Relational databases are the most common though. They've been around since the 70s and has proven very efficient for storing and managing data.

Many common programming languages today are object-oriented (OO). An OO-world is quite different to a relational world. You've got concepts like interface, classes and inheritance. You have different datatypes, like references. And you have different structures, like one object that that consists of other objects. An example of an object model can be:

It's not the best model, but for illustration purposes it should do. In the above example you have a movie type called "Sci-fi". Sci-fi movies has got both space ships and aliens. As well as a title and a year (release year) that they inherited from their parent "Movie" object. You have two types of sci-fi movies, these are short film and feature film.

This might be a good structure in an OO-world. However it's less probable that it's wise to structure the data in the database in the exact same way. It's no problem using a relation database for storing data for an OO-model though. By using an extra transformation layer of logic in between the database and the OO-model all data can be mapped, even though the data model is not at all similar to the object model. What is more challenging is to try to store representations of the objects itself directly in a relational database.

Due to the increased use of object oriented programming languages, and the use of relational databases, there has been developed a method of converting data from the OO-world to a relational world. This method is called Object-relational mapping (ORM). The technique automatically converts objects to relations, and creates a "semi object-oriented" database model. This way you can easily store "objects" in a relational database. All code, for updating, inserting and deleting data is also automatically generated. There has been developed a lot of ORM-tools, where Hibernate is a commonly used for Java projects. As a developer you do not have to directly deal with the database at all when using such a tool.

Problems with ORM
It's slightly flippant but anyhow interesting to look at the meaning of the word hibernate.


intr.v. hi·ber·nat·ed, hi·ber·nat·ing, hi·ber·nates
1. To pass the winter in a dormant or torpid state.
2. To be in an inactive or dormant state or period.
(src :

Not sure why that name was picked for an ORM-tool, but it might be because you are supposed to be given an impression that you can put your brain on idle when using it. That you can just press a generate button and moments later you'll get an automatically generated database and all the SQL code needed for manipulating the data. It's all there, in Java, almost like magic. But there are problems with having your brain on idle mode, and there is no magic that automatically creates a perfect database for you. There are drawbacks, and by my opinion they should be taken seriously.

: normalisation :
When you do data modelling for a relational database a key principle is to normalise the data model. It means to organise the data in such a way that redundancies and dependencies are minimised. One of the effects of a normalised database is that the efficiency is better compared to a database that is not normalised. In a normalised database you often have smaller tables (less columns), hence only parts of the database is affected during modification of the data (DML). In an object oriented world normalisation is not even a subject, hence, when converting using ORM you'll very likely to end up with a database structure that is far from normalised. OK, there might be ORM-tools that allow you to work more in detail on what the physical data model will look like. So with some tools you should be able to normalise. However an effect of auto-generated databases seems very often to be models that are poorly normalised.

: gap in between developer and database :
When you model a database you should think of things like what parts of the database that will be accessed a lot, what information will be updated frequently and what will be selected often. You, as a developer, need to have a relation with the database to make sure that your data is stored in an optimal way. A relation that easily goes missing when you'll get a model automatically generated, where these things has not been considered. When using an ORM-tool there is created a huge gap in between the developer and the database. Many developers have an attitude that they don't need to think of the database, they believe it is enough just dealing with the objects in the object world. Other argues that you can always just move your data to another model later on, so it doesn't matter   how it is structured. Personally I prefer doing things right in the first place, in stead of having to go back and redo everything. I believe that the distance created in between the developer and the database is the key reason why ORM-generated RDBMS' to me seems to be of low quality.

: SQL code :
An ORM-tool also generates all the SQL code needed for manipulating the data. Next to having a non optimal database structure you'll as well get DML queries that are not of best quality. SQL statements tends to become heavy and inefficient.

: modularity :
One advantage of working closer with the physical data model is that you as well more easily can structure it in a way that becomes easy to read. You can achieve a modular database, that is easy to grasp when you'll at a later stage will need to work directly with it. It is harder to achieve this using ORM.

: business logic :
ORM-generated databases tend to have all business logic in the Java code. Hence there is no logic in the database that ensures good data quality. This is OK as long as you only access and manipulate data through the Java layer, but when you access the database directly you more easily corrupt data since there are no triggers, constraints, procedures, functions or what have you got, to ensure the business logic is followed. Some argue that there is no need to work directly with the database, as long as you have automatically ORM-generated code on top of it. I will argue they are wrong. For important and long living systems there is very likely there come a day where you'll need to work with the database directly.

: documentation and naming :
When there is such a big gap in between the developer and the database you'll also find poorly documented database structures. Table comments and column comments can become crucial in the future. When everything is automatically generated naming also often becomes hard to understand. You might get references like "REF_812373272362" in stead of "REF_TABLE1_TABLE2".

: and more :
There are many more problems with ORM. The method is controversial and searching on the Internet you'll find many example of problematic issues when using ORM. On Wikipedia you can read this:
"Both object-oriented programming and relational database management systems (RDBMSs) are extremely common in software today. Since relational databases don't store objects directly (though some RDBMSs have object-oriented features to approximate this), there is a general need to bridge the two worlds. The problem of bridging object-oriented programming accesses and data patterns with relational databases is known as Object-Relational impedance mismatch."

Why ORM?
But when there are so many problems with ORM, why is it still so widely adapted? Despite of not having any done any empirical analysis I have some thoughts on the subject. I reckon it's due to two things:
- lack of knowledge
- lack of time

There are surprisingly many developers with little or no database experience. When these  developers are put to develop a database they prefer pressing a "generate button" in stead of learning how to do it the hard way. In my opinion many of the same developers also have got the wrong idea of what a database is. They look at a database as nothing more than a "stupid" program to store their data. They believe there shall be no logic, whatsoever, in the database. And as mentioned, the strategy of not having any business logic in the database is only a great strategy for in the long rung getting poor quality data.

Initially is it very time consuming being able to use a tool that does the work for you. It takes heaps more time to model a good structured database yourself and write all the SQL code needed for DML operation. However the time you gained in the first place is quickly lost when you start running into performance problems and development problems.

The more complex the solution, and the more data, the more problems you'll run into if you use ORM. However, in the long run, no matter the size or the complexity of the system, if data quality, performance and having the best possible data model is important ORM should not be used.


No comments:

Post a Comment

Allowed HTML tags:
<a href="">hyperlink</a>

Please, show the courtesy of identifying yourself when adding a comment. Anonymous comments will, most likely, be removed.