When an amazing feat of engineering becomes commonplace, we tend to take it for granted. Like airplane wi-fi. Or bridges. Or the original iPhone. It’s a weird twist on Arthur C. Clarke’s adage that “any sufficiently advanced technology is indistinguishable from magic”. Any sufficiently mainstream technology is doomed to lose its sense of wonder.
Relational databases are like that. I have used them for over 10 years now, but during a recent conversation with a colleague about NoSQL, I couldn’t sufficiently explain why we should not just give up on trusty old RDBMSs for the new hotness. Aside from pointing him to gar1t’s famous video.
You know when gar1t says:
“You read the latest post on highscalability.com and think you are a f__ing Google architect and parrot slogans like ‘web scale’ and ‘sharding’ but you have no idea what the f__ you are talking about.”
Yeah, that was me. As a young developer I once spent 8 months writing increasingly complex SQL queries for reports. The client couldn’t afford a BI tool but they could afford me, so there I was, trying to run Accounts Receivables in under 10 minutes against hundreds of millions of database records. (The CIO had written the previous version that took 3 hours to finish and nearly brought Production to a halt, so I still felt accomplished.) The senior architect told me to paste this line on top of every query:
“SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;”
And so I did. When asked why, the answer was “so that you don’t lock the table for other operations.” That was enough to feel knowledgeable; so I didn’t dig deeper. For years.
It seems to be a trap we frequently fall into: there’s so much to know and so many new things coming up constantly, that we skim a topic and move on thinking we are experts.
So. Back to basics. Do you know how databases engines work? Do you know comparative tradeoffs between various NoSQL offerings? You knew “MongoDB is Web Scale” is a joke, but why?
Over the next few posts I’ll try to dive deep into understanding how databases work, especially as it pertains to distributed systems.
(note: there is nothing new I’m going to say here - it’s from reading what others have written. Links will take you to the original source. This is a Cliffs Notes for myself of things I already knew and things I have only begun to understand better now.)
For the rest of this post, I’ll do a brief recap of the basics before we get into the interesting stuff later.
ACID: Atomicity, Consistency, Isolation, Durability
Since we use databases as the ultimate source of truth for our data, we expect them to meet certain SLAs. These have been boiled down to 4 key properties, labeled ACID. A DBMS either meets all of them, or has explicit tradeoffs around one or more of those properties.
In order of importance, here is a brief explanation of the four properties:
Transactions that have committed must survive permanently, even if the system crashes. This is usually assured by writing transactions into a log before acknowledging the commit. The log is on non-volatile storage and can be used to recreate the system state prior to failure (often automatically).
The db must treat each transaction as “all or nothing”; if any part fails, the whole transaction must be rolled back as if it never happened. This must hold even in the case of power failures, disk crashes etc.
Atomicity relies on durability; a system that cannot guarantee durability cannot, therefore, claim atomicity.
A catch-all term that means “all the rules defined in the database must be followed” when committing a transaction. The end state, at the end of the transaction, must be valid. Rules here can mean data constraints, cascades, triggers, etc.
Consistency relies on atomicity, in that if there is a violation, we rely on the system’s ability to roll back changes.
At a high level, isolation means that the result of 2 concurrent operations should be the same as if they occurred sequentially. e.g. If you looked up your account balance in the middle of a funds transfer, the result should look to you as if the transfer is already complete.
Sounds easy, but it leads into a whole topic of concurrency control. This feature is where tradeoffs most frequently occur between scale (number of concurrent users who can access a resource) and integrity (is the resource you’re reading the final committed version?)
Remember that line my architect told me to use everywhere:
“SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;”
What does it do? It tells the database what level of integrity is acceptable on this transaction.
Read uncommitted: lowest level. Dirty reads are allowed, meaning one transaction can read uncommitted changes made by another transaction. Locks are only held on the currently running statements, and then released immediately.
Read committed: prevents dirty reads, i.e. any data being read is guaranteed to be committed data. This doesn’t guarantee repeatable reads though; the same query, run within the same transaction, may return different data. Write locks are kept until the end of the transaction; but read locks and range-locks are released as soon as the statement is done.
Repeatable read: unlike the above level, both write and read locks are maintained until the transaction ends. Range-locks are only maintained on the current statement.
Serializable: highest possible level. Read, write and range-locks (if there is a WHERE clause in the query) are obtained and maintained throughout the transaction. This prevents phantom reads. This level is the one that meets the criterion set by the definition; i.e. one transaction ends before another can begin, in a serial manner.
Isolation relies on atomicity, in that if there is a deadlock, we rely on the system’s ability to roll back changes.
ACID becomes interesting in the land of distributed systems. A simple way to understand why is the Two Generals Problem. 2PC and 3PC are common protocols here, worth understanding in detail. An interesting read in this area is how redis distributed locks work. We will dive deeper into distributed transactions in a later post.