Background

There are many reasons for designing and implementing data base schemas that are portable – that is – data that can easily be moved from one database instance to another.

Why do we want to write portable data?

The first big reason is for application scalability. As data sets grow there is an increasing need to shard data into different data stores. Sharding is a way to partition data to allow for better scalability. More on this later…

The second reason is for  application availability. When something catastrophic occurs to an application with just one database, the database restore may be the only way to get the system back up and running. An application with one large data source could take more time to restore than your business can afford. However, if an application’s data is partitioned and stored in several databases and one database has a problem and needs to be restored, that restore would be both: much faster and (depending on your partitioning) affect less customers.

Functional Partitioning

It is a good engineering practice to keep like functionality together and different functionality apart. Decoupling functional areas in your application as well as your database can allow for better scalability of your system. Each functional partition can be configured and tuned so each can reach it’s optimum performance and scalability.

Horizontal Partitioning “If you can’t split it, you can’t scale it”

Functional partitioning by itself is not enough to achieve high scalability and availability. As a functional area grows it could at some point become too large to be scalable and allow for high availability. The workload needs to split up into smaller loads. Horizontal partitioning (known as sharding), is a way to split up a functional area into many partitions. Sharding algorithms could be based on customer id or some other identifier that allows a functional area to be spilt into smaller units.

Writing Portable Data

OK, so now we know some good reasons to write data that can be split and moved. So here are a few things I’ve seen that make that task easier:

  1. Ensure that the identity of the atomic data does not need to be changed when moved. It is much easier to move data from one place to another if the primary key values do not need to be changed. It is much harder to move data that needs to assign a new primary key when moved. Any references to it (foreign keys) will also need to be “fixed-up”. To avoid this problem use one of the following: shared sequencers for assigning identity, UUIDs for identity, or compound primary keys that ensure universal uniqueness.
  2. Include a partition column in each database table. For Rally, we use the subscription id as a way of partitioning our data. We also use a workspace OID column in all of our database tables for data scoped to a workspace.
  3. Avoid table ordering dependencies like foreign keys and triggers that include dependencies on other tables. These require that you will need to move data in a certain order which adds complexity and handling of “special cases”. OK, so we all know that we need foreign keys for referential integrity reasons, so we really can’t do without them in many cases, so make all of your foreign key constraints deferrable so that you can move your table in any order and then when the transaction commits it will check the constraints for violations. Now on to triggers… Triggers are nasty (for portability) when they depend or manipulate data in other tables since they most likely create table ordering dependencies. Try to avoid these at all costs. Find ways that your application can do it or write the trigger in a way it doesn’t need to depend on another table when the data is being moved. You could also disable the trigger during a move operation but that would require you to do the move “off-line” so that no other table access occurs during the disablement.

Well, these are a few of my ideas about writing portable data and I’m sure that you have some good ideas as well. So, please share them by you submitting a comment!