There was a movement in Java over the past several years to get away from complex inheritance and the implementation of byzantine interfaces; specifically, the complexity of EJBs, where you had to implement at least six methods that you, personally, likely never cared about.
The idea was to separate out the details of persistence, transactions, and other logic irrelevant to the business domain. The movement called for POJOs, or Plain Old Java Objects. Whether that succeeded is a matter for another debate; in a language that generally calls for you to implement clone(), Serializable, toString(), GUIDs, etc., I'll let you decide.
Many developers have moved away from Java (or in the newer generation, never taken it up) in favor of more agile, dynamic languages such as Ruby and Python, which are still object-oriented, but dynamically typed, interpreted, and more concise in syntax.
Frameworks like Rails have led the way in making web development much easier and more enjoyable. But unfortunately, the number of frameworks (most inspired by, if not directly copied from, Rails) has multiplied exceedingly. Sadly, many of these frameworks require specific formatting of database tables. Often, they dismiss the importance of SQL and tend to consider data storage a mere “implementation detail”. Some even make it difficult or impossible for other applications to access the data.
That completely misses the point of a database. If you are confident that your application will be the only one using the data, by all means store it however you like. Store it as serialized objects, in a key-value store, in flat files, or cache as much as you can. But the reason you probably chose to use a database as your “implementation detail” is that you had some inkling you’d like to share this data at some point.
Java probably started it. ORMs like Hibernate wanted to write your SQL queries for you. The type of person who wanted this either didn’t like typing “SELECT * FROM FOO” or just didn’t know how. Unfortunately, this type of person probably didn’t know anything else about databases either, and so tended to want to keep everything in one big, ugly table. The true benefit of ORM wasn’t really in the SQL generation anyway; mapping tables to objects can be tedious, but it isn’t complex. What really got ORMs to catch on was built-in connection pooling and, later, recordset caching. All of a sudden, you could keep your recordset in memory as long as it was valid, or at least as long as you had spare memory. And even when you needed a change, your database connection was already open. Performance soared, and even the hypothetical database-agnostic application became possible.
But at the cost of the database. In order to keep track of how “fresh” data was, you had to store its state somewhere. And what better place than right there in the database itself! And thus once-clean (hypothetically) tables got polluted with all sorts of metadata that wasn’t what you wanted to store at all; it was merely an “implementation detail” of your ORM and its caching mechanism.
Agile, dynamic, lightweight systems followed this lead. Because they were truly lightweight frameworks in dynamic languages, they needed to store metadata even more than a complex, statically typed language like Java did. And because they could only count on the least common denominator of compatibility between databases, still more metadata was needed to keep track of abstractions that could be taken for granted in any one database.
Now a table called Employees has more than your name, salary, and manager (and maybe a sequential id for a primary key); it has things like “created_by” and “date_last_modified”, and maybe application-specific information like “is_active” or “version”.
What I’m calling for are POSTs, or Plain Old SQL Tables. Let’s get rid of as much application implementation cruft from our domain tables as possible.
I don’t think we can get rid of ids, but I do think we can move access control, caching, and versioning info out of the main tables. I realize that denormalization can sometimes be a good thing for performance, but now is not the time to optimize for performance.
I spent some time yesterday thinking about how I could move historical data out of a table, so that I can do a SELECT * FROM EMPLOYEES and get only the relevant information.
Maybe SELECT * isn’t the answer, but I think we can do better: make our tables more readable now, and find ways to improve performance later.
I came up with a generic CHANGES table that keeps track of the last_modified for all tables that want it, and associated join tables such as EMPLOYEES_CHANGES that map to it. There could then also be an EMPLOYEES_REVISIONS table that could store previous revisions if needed, which essentially mirrors EMPLOYEES, but has a REVISION_ID that can be used as a foreign key by CHANGES.
+------------+
| Employees  |
+------------+
| id         |
| name       |
| salary     |
| manager_id |
+------------+

+-------------------+
| employees_changes |
+-------------------+
| employee_id       |
| change_id         |
+-------------------+

+-------------+
| changes     |
+-------------+
| id          |
| timestamp   |
| editor_id   |
| revision_id |
+-------------+

+---------------------+
| employees_revisions |
+---------------------+
| id                  |
| employee_id         |
| name                |
| salary              |
| manager_id          |
+---------------------+
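As a rough sketch, the diagram above could be declared like this (table and column names come from the diagram; the column types are my assumptions, in generic SQL):

```sql
-- Domain table: only business data, no change-tracking cruft.
CREATE TABLE employees (
    id         INTEGER PRIMARY KEY,
    name       VARCHAR(100),
    salary     DECIMAL(10, 2),
    manager_id INTEGER REFERENCES employees(id)
);

-- Generic change log, shared by every table that wants history.
CREATE TABLE changes (
    id          INTEGER PRIMARY KEY,
    timestamp   TIMESTAMP,
    editor_id   INTEGER,
    revision_id INTEGER
);

-- Join table mapping employees to their changes.
CREATE TABLE employees_changes (
    employee_id INTEGER REFERENCES employees(id),
    change_id   INTEGER REFERENCES changes(id)
);

-- Mirror of employees that holds prior revisions.
CREATE TABLE employees_revisions (
    id          INTEGER PRIMARY KEY,
    employee_id INTEGER,
    name        VARCHAR(100),
    salary      DECIMAL(10, 2),
    manager_id  INTEGER
);
```

The point is that employees itself stays a Plain Old SQL Table; everything else is bookkeeping that lives off to the side.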
Of course, all change information could be stored in the REVISIONS table, but that is another pollution, and separating interesting change information from the bulk of the data is also a performance improvement. The downside is that the join to get commonly used change information like “last_modified” becomes somewhat complex and costly. Perhaps there is a way to improve database optimization for such joins, or the caching used by ORMs could be used to keep track of it.
SELECT *
  FROM changes
 WHERE timestamp = (SELECT MAX(timestamp)
                      FROM changes
                     WHERE id IN (SELECT change_id
                                    FROM employees_changes
                                   WHERE employee_id = 1));
That’s not the best query, but it illustrates the process.
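One flatter way to write it, assuming the schema above, is a join ordered by timestamp (the LIMIT syntax varies by database; some use TOP or FETCH FIRST instead):

```sql
-- Latest change for employee 1, without the nested subqueries.
SELECT c.*
  FROM changes c
  JOIN employees_changes ec ON ec.change_id = c.id
 WHERE ec.employee_id = 1
 ORDER BY c.timestamp DESC
 LIMIT 1;
```

This is the query an ORM would presumably cache, so in practice you’d only pay for it on a miss.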
I’m not saying this is a good implementation, just that the concept of keeping your database tables clean is probably at least as worthwhile as keeping your objects clean.
It doesn’t really eliminate the problem of cross-application compatibility, because just as different frameworks expect a ‘last_modified’ or ‘last_changed’ field, they could likewise have different conventions for implementing multi-table revision information.
In fact, because the data is separated, there is a real temptation for one framework to disregard another framework’s caching implementation and implement its own. Then you run the risk of having invalid data (because the frameworks aren’t using the database’s built-in table integrity).
But tables aren’t meant to keep track of whether they’re cached or not. By shoehorning that data into the table, you’re using the database’s integrity to do it for you. Not a terrible decision, but what do you do when different frameworks use different names?
At least in that case it would fail with a missing column name (expected last_modified but got last_edited).
How could you implement a proper failure, other than “just knowing”, with a multi-table solution? Obviously with triggers or procedures, if you have them, but what if you don’t?
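For databases that do have triggers, a sketch of what that enforcement might look like (roughly PostgreSQL-flavored; the function name is mine, the trigger syntax varies widely between databases, and this assumes changes.id is auto-generated, e.g. a SERIAL column):

```sql
-- Log every insert/update on employees into the shared changes table,
-- so no framework can forget to do its bookkeeping.
CREATE FUNCTION log_employee_change() RETURNS trigger AS $$
DECLARE
    new_change_id INTEGER;
BEGIN
    INSERT INTO changes (timestamp, editor_id)
         VALUES (now(), NULL)       -- editor unknown at the trigger level
      RETURNING id INTO new_change_id;
    INSERT INTO employees_changes (employee_id, change_id)
         VALUES (NEW.id, new_change_id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER employees_change_log
    AFTER INSERT OR UPDATE ON employees
    FOR EACH ROW EXECUTE FUNCTION log_employee_change();
```

If you don’t have triggers, you’re back to convention, and conventions are exactly where the frameworks disagree.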