Monday, 16 June 2014

Amazon Dynamo DB



What’s Amazon DynamoDB?
DynamoDB is one of the most recent services offered by Amazon.com. Announced on January 18, 2012, it is a fully managed NoSQL database service that provides fast and predictable performance along with excellent scalability. Let’s quickly analyze its positive and negative aspects in the lists below:
PROS:
  • Scalable
  • Simple
  • Hosted by Amazon
  • Good SDK
  • Free account for small amount of reads/writes
  • Pricing based on throughput
CONS:
  • Poor documentation
  • Limited data types
  • Poor query comparison operators
  • Unable to do complex queries
  • 64KB limit on row size
  • 1MB limit on querying

Pros of using DynamoDB

      • The major advantage of DynamoDB over other NoSQL counterparts is amazons infrastructure support
      • Key + columns data model
      • Composite key support
      • Tuneable consistency
      • Distributed counters
      • Largest value supported 64KB
      • Conditional updates
      • Hadoop integration M/R, Hive
      • Monitorable
      • Backups Low-impact – opeartes manually with EMR

Cons of using DynamoDB

      • Deployable Only on AWS
      • Indexes on column values is not supported
      • Integrated caching is not well explained in the document
      • 64KB limit on row size
      • Limited Data Types
      • 1MB limit on Querying and Scanning
      • Limited Capability of Query’s Comparision Operators
Advantages
      • Automatic data replication accross multiple AWS availability zones to protect data and provide high uptime
      • Scalable
      •  Average service-side latencies in single-digit milliseconds
      •  Fully Managed Service
      •  Flexible
      •  Single API call allows you to atomically increment or decrement numerical attributes
      •  Plans are cheap – a free tier allows more than 40 million database operations/month
      •  Uses secure algorithms to keep your data safe
      •  AWS Management Console monitors your table operational metrices
      •  Tightly integrated with Amazon Elastic MapReduce (Amazon EMR)
  • scalability and simplicity of NoSQL
  • consistent performance
  • low learning curve
  • it comes with a decent object mapper for Java
Flaws

DynamoDB's not ideal for storing events

Like most websites, we store a variety of user events. A typical event has an event ID, a user ID, an event type and other attributes that describe actions performed by users. At Dailycred, we needed an event storage that is optimized for reads. For our dashboard we need to quickly filter events by type, sort events by time and group events by user. However, we don't need to record events as soon as they happen. A second of delay is fine.
Using a relational database, we can store events in a denormalized table. Add indices to columns that answer query predicates. In this setup, writes are not very fast, but reads are extremely fast. We can use the richness of SQL to query events easily.
A big limitation of DynamoDB and other non-relational database is the lack of multiple indices. In an events table, we could use the event ID as the hash key, the event time as the range key. This schema would enable us to retrieve most recent events, but we can't filter event by type without doing a full table scan. Scans are expensive in DynamoDB. You could store events in many tables, partitioned by event type or by user ID. Perhaps for each predicate, create an index table with the predicate's value as the key and the event ID as an attribute. Is the added complexity worth it? Probably not. DynamoDB is great for lookups by key, not so good for queries, and abysmal for queries with multiple predicates.
SQL was a better tool in this case, so we decided not to use DynamoDB at all for storing events.

DynamoDB overhead (compared to SQL)

DynamoDB supports transactions, but not in the traditional SQL sense. Each write operation is atomic to an item. A write operation either successfully updates all of the item's attributes or none of its attributes. There are no multi-operation transactions. For example, you have two tables, one to store orders and one to store the user-to-order mapping. When a new order comes in, you write to the order table first, then the mapping table. If the second write fails due to network outage, you are left with an orphaned item. Your application has to recognize orphaned data. Periodically, you will want to run a script to garbage collect those data, which in turn involve a full table scan. The complexity doesn't end here. Your script might need to increase the read limit temporarily. It has to wait long enough between rounds of scan to stay under the limit.

One strike, and you are out

While DynamoDB's provisioned throughput lets you fine tune the performance of individual tables, it doesn't degrade gracefully. Once you hit the read or write limit, your requests are denied until enough time has elapsed. In a perfect world, your auto-scaling script will adjust throughput based on anticipated traffic, increasing and decreasing limits as necessary, but unexpected traffic spikes is a fact of life. Say you bump up the limits as soon as DynamoDB throws a ProvisionedThroughputExceededException, the process could take a minute to complete. Until then, you are at the mercy of retries, a feature that is thankfully enabled by default by the official SDK.

Unit tests: slow or expensive (pick one)

We also run a lot of tests that use DynamoDB, which means a lot of items are written and read very quickly. We run the tests several times a day, which means our development database tables are sitting completely idle most of the time, only to be hammered with reads and writes when we run our unit tests. From a cost perspective, this isn't ideal. However, it's even worse if a developer has to wait extra time for his unit tests to complete.
You can then read data in 1 of three ways. Simple:
  • You read a single row by unique key access. If you have a composite hey, provide both the hash-key and the range-key, else provide just the hash key.
  • You scan the whole table.
  • If you have a composite key, access by the hash-key part and scan (you may filter, but in essence, this is still a scan) on the range key.
There is nothing else you can do, and note that unless doing a full table scan, you must always provide the hash-key, i.e. if you do not know the exact hash key for the row to get, you have to do a full table scan. There is just no other way.


2 down vote
We have just migrated all of our DynamoDB tables to RDS MySQL.
While using DynamoDB for specific tasks may make sense, building a new system on top of DynamoDB is really a bad idea. Best laid plans etc., you always need that extra flexibility from your DB.
Here are our reasons for moving to RDS:
  1. Indexing - Changing or adding keys on-the-fly is impossible without creating a new table.
  2. Queries - Querying data is extremely limited. Especially if you want to query non-indexed data. Joins are of course impossible so you have to manage complex data relations on your code/cache layer.
  3. Backup - Such a tedious backup procedure is a disappointing surprise compared to the slick backup of RDS
  4. GUI - bad UX, limited search, no fun.
  5. Speed - Response time is problematic compared to RDS. You find yourself building elaborate caching mechanism to compensate for it in places you would have settled for RDS's internal caching.
  6. Data Integrity - While the concept of fluid data structure sounds nice to begin with, some of your data is better "set in stone". Strong typing is a blessing when a little bug tries to destroy your database. With DynamoDB anything is possible and indeed anything that can go wrong does.
We now use DynamoDB as a backup for some systems and I'm sure we'll use it in the future for specific, well defined tasks. It's not a bad DB, it's just not the DB to serve 100% of your core system.
As far as advantages go, I'd say Scalability and Durability. It scales incredibly and transparently and it's (sort of) always up. These are really great features, but they do not compensate in any way for the downside aspects.

The supported datatypes aren't overly exciting either: Number, String and a Set of Number and String. The string type is UTF-8 and the Number is a signed 38 precision number. Other notable limits is that there is a max of 64 K per row limit, and that a scan will only scan up to a max of 1Mb of data. Note that there is no binary datatype (we have binary data in out MongoDB setup and use base64 encoding on that in DynamoDB).

Pricing is interesting. What you pay for is throughput and storage, which is pretty different from what you may be used to. Throughput may adjusted to what you need, and it's calculated in kb of row data per second, i.e. a table with rows of up to 1Kb in size that with a requirement of 10 reads per second will mean you need 10 units of read capacity (there is a similar throughput number for write capacity). Read more on pricing here.
We are still testing, but so far I am reasonably happy with DynamoDB, despite the issues listed above. The lack of tools (no, there are no DynamoDB tools. At all. No backup tool, no import / export, nothing) means that a certain amount of app development is necessary to access it, even for the simplest of things. Also, there is no Backup, but I am sure this will be fixed soon.

1) 64KB limit on row size: We have many records in our HBase with data much more than 64KB in one row. We even had some records as big as 100MB. Limiting row size to such a tiny amounts rules out lots of use cases for us instantly. In one of the user forum threads here is what one of the Amazon guys say about the limit :
“64KB cumulative attribute size limit, which isn’t too hard to reach if you have a set of Number values for a single attribute When you have so many values for a set, you should consider switching your table to a hash-range schema.”
I think what he is missing is that there could be genuine cases where switching your table to hash range schema may not be possible. In some cases  single piece of data is bigger than 64KB. You cannot simply change your schema to use more rows instead of more columns in these cases. For example a page crawler may store the entire page content in one field. Somebody might want to store an Image as a field in your NoSQL database. You simply cannot use DynamoDB for such use cases. This is my biggest complaint.
2) Limited Data Types: It doesn’t accept binary data. You have to have strings or numbers or sets of strings or numbers. You cannot store images, byte arrays etc. You can get around it by encoding everything in string using base64 encoding, but base64 encoding produces bigger data. And you are counting your bytes as you cannot hit the 64KB limit!
3) 1MB limit on Querying and Scanning: You cannot get a result bigger than 1MB from Query or Scan operations. You are required to make LastEvaluatedKey call to start from wherever you stopped in the earlier scan request. This is not that bad, but it doesn’t allow you to optimize it for your use cases. In most of our use cases, making one trip for 1MB of data could be too much. Amazon should think about increasing this limit or allowing clients to specify this limit.
DynamoDB is supposed to be scalable. I think these limitation seriously challange the scalability claim. It makes me feel like Amazon cannot make it scalable without imposing these limitations.
4) Limited Capability of Query’s Comparision Operators: You cannot use CONTAINS, NOT_NULL and some other operators when you use Query features of DynamoDB. And the documentation may be wrong! Please read this thread for more information:
https://forums.aws.amazon.com/thread.jspa?threadID=85511&tstart=0
You can always use ‘Scan’ instead of ‘Query’ but then you will be forced to go through each and every record. It’s not necessarily any worse than any of the existing NoSQL solution. But since they offered Query mechanism (in additon to Scan) operation, I was little disapointed.
5) Time Required for Creation of a Table or API Call to know when the table is ready: When you create a table programatically (or even using AWS Console), the table doesn’t become available instantly. The call returns before the table is ready. This means you cannot create a table and use it instantly. Sometimes, we use dynamically created tables. I can undestand why it may take time, but it would be nice if they have an api call that can tell us when the table is ready.
Overall, I am really impressed by the simplicity of DynamoDB. The APIs (even though I don’t like the way they are designed) are pretty simple and schema modeling is also very simple. The forums have started buzzing and I think more and more people are trying DynamoDB out. What I will be watching is whether the points discussed above are preventing people from switching their existing NoSQL solution to Amazon’s managed DynamoDB. At GumGum, the first three issues are blockers and unless they are resolved, we are less likely to switch from HBase to DynamoDB.
Dynamo is an expensive, extremely low latency solution.  If you are trying to store more than 64KB per item, you're doing it wrong, and will end up paying through the nose for your read/write throughput anyway.  If you have data that large, take the latency hit and store it in S3 as others have suggested.
CONS :
  • In MySQL you'll get ACID guarantee, but in Dynamo-db there is no such guarantee.
  • Also in MySQL you can write complex while in Dynamo-db you can't write complex queries.
PROS :
  • It has the property of distributed hash tables hence more performance booster as compared to MySQL
Name
DynamoDB  X
Microsoft SQL Server  X
Description
Hosted, scalable database service by Amazon
Microsofts relational DBMS
Website
Technical documentation
Developer
Amazon
Microsoft
Initial release
2012
1989
License
n.a.
commercial http://db-engines.com/info.png
Implementation language

C++
Server operating systems
hosted
Windows
Database model
Key-value store
Relational DBMS
Data scheme
schema-free
yes
Typing http://db-engines.com/info.png
yes
yes
Secondary indexes
no http://db-engines.com/info.png
yes
SQL
no
yes
APIs and other access methods
RESTful HTTP API
OLE DB
Tabular Data Stream (TDS)
ADO.NET
JDBC
ODBC
Supported programming languages
.Net
ColdFusion
Erlang
Groovy
Java
JavaScript
Perl
PHP
Python
Ruby
.Net
Java
PHP
Python
Ruby
Visual Basic
Server-side scripts http://db-engines.com/info.png
no
Transact-SQL and .NET languages
Triggers
no
yes
Partitioning methods http://db-engines.com/info.png
Sharding
tables can be distributed across several files (horizontal partitioning), but no sharding
Replication methods http://db-engines.com/info.png
yes
yes, but depending on the SQL-Server Edition
MapReduce
no http://db-engines.com/info.png
no
Consistency concepts http://db-engines.com/info.png
Eventual Consistency
Immediate Consistency http://db-engines.com/info.png

Foreign keys http://db-engines.com/info.png
no
yes
Transaction concepts http://db-engines.com/info.png
no http://db-engines.com/info.png
ACID
Concurrency http://db-engines.com/info.png
yes
yes
Durability http://db-engines.com/info.png
yes
yes
User concepts http://db-engines.com/info.png
Access rights for users and roles can be defined via the AWS Identity and Access Management (IAM)
Users with fine-grained authorization concept
Specific characteristics
Data stored in Amazon cloud
Is one of the "Big 3" commercial database management systems besides Oracle and DB2
















So, what should you do when you’re building a new application and looking for the right database option? My recommendation is as follows: Start by looking at DynamoDB and see if that meets your needs. If it does, you will benefit from its scalability, availability, resilience, low cost, and minimal operational overhead. If a subset of your database workload requires features specific to relational databases, then I recommend moving that portion of your workload into a relational database engine like those supported by Amazon RDS. In the end, you’ll probably end up using a mix of database options, but you will be using the right tool for the right job in your application.





References

6 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Those guidelines additionally worked to become a good way to
    recognize that other people online have the identical fervor like mine
    to grasp great deal more around this condition.


    AWS Training in Bangalore


    AWS Training in Bangalore

    ReplyDelete
  3. Super Blog! very useful information to the users to get the certificate. For more knowledge on AWS AWS Online Training Hyderabad

    ReplyDelete

  4. Thank you for sharing!! This is very helpfull Blog..
    Amazon Web Services Online Training

    ReplyDelete