Since Google’s LevelDB showed up in Riak, and after cases like Berkeley DB at Yammer, we developers have realized that we can take trivial building blocks and build something really awesome out of them. One of the great lessons I’ve learned from what I’ve seen in the wild (LevelDB and Pinterest, just to name a few) can be summarized as:
If at all possible, try to minimize the length (in bytes) of the values you save (or write) to a data store.
Developers usually don’t pay much attention to the size of the values they store in a data store, which in turn results in performance issues. In this post I will try to convince you to keep an eye on the average length of your values, and for the sake of a demo (which also throws out a new combination of tools, Tokyo Cabinet and LZ4) I will run a benchmark.
As an example, look at the JSON from Twitter’s REST API: a single tweet (see this for example) can be around 3K-4K, and even longer in the case of retweets. Let’s take Tokyo Cabinet and LZ4 and run some benchmarks to see how much difference we can make! One of the key points here is to choose a library optimized for speed rather than for compressed size, because we don’t want to spend too much time compressing the data. LevelDB uses Snappy to exploit the same principle and speed itself up.
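One practical detail when storing LZ4-compressed values: a raw LZ4 block does not record the original length, and LZ4_decompress_safe needs to know how large the output buffer must be. A minimal sketch of one way to handle that, assuming a hypothetical record layout of my own (a 4-byte little-endian length header followed by the compressed bytes). The names frame_value and unframe_value are illustrative and are not taken from the benchmark source; the compressed bytes themselves would come from something like LZ4_compress_default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout (an assumption, not the benchmark's format):
 *   [4-byte little-endian original length][compressed bytes]            */
#define VALUE_HEADER_SIZE 4u

/* Writes header + compressed payload into out.
 * Returns the total record size, or 0 if out is too small. */
size_t frame_value(uint32_t original_len,
                   const unsigned char *compressed, size_t compressed_len,
                   unsigned char *out, size_t out_cap)
{
    assert(out != NULL && (compressed != NULL || compressed_len == 0));
    if (out_cap < VALUE_HEADER_SIZE ||
        compressed_len > out_cap - VALUE_HEADER_SIZE) {
        return 0;
    }
    /* Store the length byte by byte so the record is endian-independent. */
    out[0] = (unsigned char)(original_len & 0xFFu);
    out[1] = (unsigned char)((original_len >> 8) & 0xFFu);
    out[2] = (unsigned char)((original_len >> 16) & 0xFFu);
    out[3] = (unsigned char)((original_len >> 24) & 0xFFu);
    if (compressed_len > 0) {
        memcpy(out + VALUE_HEADER_SIZE, compressed, compressed_len);
    }
    return VALUE_HEADER_SIZE + compressed_len;
}

/* Reads the header of a stored record.
 * Returns a pointer to the compressed payload, or NULL if the record
 * is too short to contain a header. */
const unsigned char *unframe_value(const unsigned char *record, size_t record_len,
                                   uint32_t *original_len, size_t *compressed_len)
{
    assert(record != NULL && original_len != NULL && compressed_len != NULL);
    if (record_len < VALUE_HEADER_SIZE) {
        return NULL;
    }
    *original_len = (uint32_t)record[0]
                  | ((uint32_t)record[1] << 8)
                  | ((uint32_t)record[2] << 16)
                  | ((uint32_t)record[3] << 24);
    *compressed_len = record_len - VALUE_HEADER_SIZE;
    return record + VALUE_HEADER_SIZE;
}
```

On the write path you would compress the tweet, call frame_value, and hand the resulting record to the data store; on the read path unframe_value tells you how big a buffer to allocate before decompressing. The four extra bytes are negligible next to a 3K-4K tweet.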
For the benchmark I took some of my tweets (11 to be precise) and dumped them into files. The program then reads those tweets back in random order and uses them as the values to store, to emulate some real-world data; here is the link to the gist of the source code that I used. The last flag (true/false) passed to the function put_in_db on line 183 turns compression on or off. Compiling (with the command gcc -O2 testcab.c lz4.c -ltokyocabinet -o test) and running on my desktop machine gives me:
Without compression:
Total time consumed for 100000 entries 503514(ms)
File size (dump.tcb): 382.5 MB (382,522,880 bytes)
With LZ4 compression:
Total time consumed for 100000 entries 244250(ms)
File size (dump.tcb): 197.5 MB (197,468,416 bytes)
Amazed? The 100,000 entries are inserted in almost random order due to CRC32. Do notice the time and size difference: the compressed run takes about HALF the space and HALF the time of the uncompressed tweets. The lesson to learn from this technique is to place our bet on the more powerful unit of the machine (the CPU) rather than on its weak spot (the HDD). It is safe, and it simply works!
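To put exact numbers on "half", here is a small sketch that computes the ratios from the two runs reported above. The figures are copied from the benchmark output; the helper names ratio and print_benchmark_ratios are mine, and the bytes-per-entry figure is only approximate because the file size also includes keys and B+ tree overhead.

```c
#include <assert.h>
#include <stdio.h>

/* Returns after/before as a fraction; before must be positive. */
double ratio(double after, double before)
{
    assert(before > 0.0 && after >= 0.0);
    return after / before;
}

/* Prints size and time ratios for the two benchmark runs. */
void print_benchmark_ratios(void)
{
    /* Numbers copied from the benchmark output in this post. */
    const double entries    = 100000.0;
    const double size_plain = 382522880.0;  /* bytes, compression off */
    const double size_lz4   = 197468416.0;  /* bytes, compression on  */
    const double time_plain = 503514.0;     /* as reported by the program */
    const double time_lz4   = 244250.0;

    printf("size ratio (lz4/plain): %.3f\n", ratio(size_lz4, size_plain));
    printf("time ratio (lz4/plain): %.3f\n", ratio(time_lz4, time_plain));
    printf("approx. bytes per entry, plain: %.0f\n", size_plain / entries);
    printf("approx. bytes per entry, lz4:   %.0f\n", size_lz4 / entries);
}
```

Calling print_benchmark_ratios() gives a size ratio of about 0.516 and a time ratio of about 0.485, so the compressed run needs a little over half the disk space and a little under half the time.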
Taking good care of value size can give us good numbers, and the idea applies almost everywhere. Compression makes even more sense when you have smartphone applications: in that case you can simply pull the compressed data out of your data store and stream it to the smartphone, which does the rest of the job. Saving both space and time is a rare combination, and it is possible if you use compression wisely.