Monday, 4 June 2012

Why NOSQL for logging

There were three problems with the use of a relational database table for logging.

First, there was a blocking insert problem.  Whenever something noteworthy happened on my systems (e.g., a sign on, a file upload, an administrative configuration change, etc.) I logged it.  As long as I didn’t have a busy system things were generally fine, but if a couple of different people hit me with extended periods of rapid file uploads, sign-in/offs from unthrottled API clients then my software would shudder and sometimes thrash.

Second, there was an oversubscription problem, where I added even more load onto the log database by using it heavily for common, interactive queries, such as looking back across the log for recent sign-ons.  While that sounds like a good idea because there would only be one authoritative set of records to check, it also magnified the effect of my blocking insert problem.  (e.g., if I got hit with a lot of sign-ons, the act of recording the sign-on in the log would block and slow other sign-ons too.)

Finally, the most serious problem occurred when it was time to upgrade.  My upgrades often involved a schema change in the log database, and that meant I needed to lock the database and update all the log records – often tens of millions of records. This was too frequently an operation that could take hours and often took 100x more time to complete than all other upgrade operations combined.


So…what should I have done?  One answer would have been to look at non-relational NoSQL database technology (such as that available in Apache Cassandra) for my log tables instead.  That would have addressed:
  • the blocking insert problem: nosql databases, especially distributed nosql databases like Cassandra, do not wait for inserts.
  • the oversubscription problem: without delays due to blocking inserts, the problem of lots of reads waiting on blocking inserts goes away
  • schema changes: NoSQL datasets support data of various formats, allowing old and new schema data to live next to each other and preventing outages caused by touching all existing data. (The multiple schemas put a little more burden on the application to keep these straight, but it allows the application to handle multiple versions and/or upgrade old ones in the background without downtime.)
.



CREATE KEYSPACE Mathblaster with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'and strategy_options = [{replication_factor:1}];

Use Jumpstart;

CREATE COLUMN FAMILY users
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
{column_name: full_name, validation_class: UTF8Type}
{column_name: email, validation_class: UTF8Type}
{column_name: state, validation_class: UTF8Type}
{column_name: gender, validation_class: UTF8Type}
{column_name: birth_year, validation_class: LongType}
];


CREATE COLUMN FAMILY blog_entry
WITH comparator = TimeUUIDType
AND key_validation_class=UTF8Type
AND default_validation_class = UTF8Type;

No comments:

Post a Comment