
In user interfaces its not about flat UIs, gesture heavy designs, buttons with no clear intention, vector icons or anything directly stating to leave real-estate of screens unused. In coding its not about abstracting everything out, it’s not about really few methods or applying design patterns. In your home its not about keeping your rooms empty!
Minimality has one solid purpose avoid distraction and stay focused on what’s important
I’ve many times covered how your SQL databases can be smartly used as schemaless store, today we will a specifically target a use-case and see how we can play a little smart to change SQLite3 into a schemaless store. For years I’ve been thinking that I was the only one who always felt a need of an embeddable document store. Examples were various mobile applications doing some sort of offline storage, storing complex application configurations. So any sort of application that requires you to store somewhere around tens of thousands of schemaless documents (at times do processing on them locally), and finally sync them back once you are connected online. Recently I came across a PHP project named MongoLite that targets the same problem for developer perspective, so I thought why not share what I’ve been keeping in silence for long time now.
Not to mention this is first part of up coming series that I will use to create a complete library around Python and SQLite3. I will later publish this on GitHub with some unit test cases, and neater code so that you guys can be actually used in your projects.
SQLite3 has this awesome thing called create_function. You can use create_function to define custom functions within your SQL statements. As an example from python documentation:
def md5sum(t):
return md5.md5(t).hexdigest()
con = sqlite3.connect(":memory:")
con.create_function("md5", 1, md5sum)
cur = con.cursor()
cur.execute("select md5(?)", ("foo",))
So create_function basically allows you to define any arbitrary function, that can take parameters and return values. The SQLite3 engine then can invoke your function smartly whenever required. Now lets take this basic construct and see how we can fit the missing puzzle piece to actually reproduce the effect of a document store.
So imagine a table with just two columns id that is INT, and a document field that is BLOB (will allow us to use various formats like msgpack, or cPickle rather than only JSON); storage part is breeze. Your dictionary or document can simply be serialized (in whatever format you choose) and simply inserted into table as row, so your statement:
collection.insert({'foo': 'bar', 'say': True})
will be effectively translate into (assuming I am dumping into JSON):
sqlite_cursor.execute('INSERT INTO doc_table(document) VALUES(?)',
(buffer(json.dumps({'foo': 'bar', 'say': True})),)
)
Pretty straight right? But now since you have stored everything as plain BLOB, SQLite has no idea how to look into field if you ask something like:
collection.find({'say': True}) #All documents with say = True
But we can use custom functions here to do that job for us. Look closely at problem and you see the SQLite3 doesn’t know how to parse your JSON and match the fields of JSON, but we can ask SQLite3 to invoke our function that in turn can do the comparison for SQLite. So imagine SQL statement:
SELECT * FROM doc_table WHERE json_matches(document, "{'say':true}");
Now we can implement the compare_match function in python, that can actually do the comparison and return True or False according to if we want the current (passed as parameter) document to be part of result set or not. Let’s have a quick look at a very simpler version of this registered function:
def doc_matches(self, doc, query):
try:
q = json.loads(query)
d = json.loads(doc)
return self.__is_sub_dict(d, q)
except Exception as e:
print e
def __is_sub_dict(self, d, q):
for k, q_val in q.items():
if not k in d:
return False
if q_val != d[k]:
return False
return True
Remember its really basic for giving you an idea, in reality this matching function will be much more complex. Now we can register the doc_matches function and actually do the comparison using python statement like this:
cursor.execute('SELECT * FROM doc_store WHERE json_matches(document, ?)',
[json.dumps(query_object)])
Simple and basic yet working. I’ve already done it in this gist. I’ve also implemented various serialization formats that you can try out your self by passing a different adapter to the constructor of the store (see gist’s main to see how I am doing that, by default it uses JSON as serialization format).
So our new toy document store and it works fine but it suffers with the great dread of complete table scan every time you do a query. We can avoid this nightmare by using indexes of SQLite and this is exactly what we will be doing in our next blog iteration.
Till then chill out and let me know your comments and suggestions.
Yesterday I did a post on the idea that Redis now needs a binary protocol. Seems like people are listening actively and I was followed up with a reply (which I am not ready to believe) in this post saying:
We actually found at Tumblr that the memcache binary protocol was less performant than the text protocol. I also know a number of other large web properties that use the memcache text protocol. I don’t think the ‘benefits’ of a binary protocol are super clear cut for applications like this.
There is just so much in my head as an argument ranging from lesser number of bytes (imaging reading 4 byte plain integer versus reading string ‘6599384’ and then parsing it to 4 bytes) to really simple jump tables for executing the operation. Then I thought let the evidence speak for itself and wrote some Go code available in this Gist to simply benchmark 100K, and 1M get operations (also tried various numbers various machines). For testing purposes I used Go ; with various reason like saving myself from complex configurations, avoid any VMs, simulate the complete client library written in purely in same language Go in this case (assuming the implementations were decent enough) and get the coding part done really quick.
It’s a really simple benchmark always getting same key (to remove any other variables) and simply discarding the results trying to benchmark pure get/parse time. Results were just what I expected; with 100K gets took 6.560225s in ASCII protocol and 5.288102s in binary protocol. As I scaled the number of gets up to 1M the time grew linearly (see gist).
In closing I think an ASCII protocol can never beat binary protocol (assuming you have not designed a stupid protocol). To me it sounds like interpreting bytecodes vs lines of code. There is a reason why many of NoSQL stores (e.g. Riak on Protobufs and HTTP) ship purely on binary protocol or an alternative binary protocol. I would love to know the libraries and methods Tumblr used to communicate with memcached over binary protocol. I am not convinced! If readers of this post have done some benchmarks, or anything that brings up a valid argument be free to share!
A news today was the post saying that Yahoo is using Redis as secondary index for serving its front-page. I think Redis is getting all the attention it deserves, and it’s time now for Redis to think in terms of binary protocol. Just like Memcached with its binary protocol Redis will have several advantages. As far as I know Redis community is working on it’s cluster support and introducing a Binary protocol won’t only reduce network traffic for Sentinel and upcoming cluster features; but it will also require less parsing time and performance optimizations (both in server and client libraries). We can either inspire our protocol from Memcached or something existing (msgpack). What do you think?
In my previous post I explored how to replace the storage engine of Whoosh to Redis. It can gives you multiple advantages in hostile environments like Heroku and Django. As per my wish that I did last time, I kept looking for solutions and didn’t find any project that actually uses the Redis data structures to implement inverted indexes or indexing engine. Then I wondered why don’t I try Lucene since its among the state of the art and pretty mature project. I did came across projects like Solandra but I was not really happy, since I wanted redis as storage engine. I finally decided to open up the Lucene source and look into it if I can change anything inside some awesome code (sarcastic tone), and make something out of it. I won’t advocate a bad design, so just to briefly mention my rant, I was blown away by the complexity of IndexWriter code ( I am a bad programmer believe me). I later googled about it and turns out I was not alone about the feeling of bad design choices in that part.
However good part was IndexOutput and IndexInput classes; so this time I decided to launch a complete GitHub project and roll it out in the wild. I’ve also done some heavy data loading and basic testing with sharding. I took the famous PirateBay data dump and indexed it on Lucene.
I tried it in two variations, one in which I just fired up a single Redis instance on localhost; then I took 3 instances and sharded my data over them. The source code used can be located in this Gist. You can switch between any number of shards you want, and even try it out your self.
On a single instance the database dump file was some where about 200MB. On 3 instance shard the size was 46MB, 83MB, and 104MB respectively. Do notice that the shards are not getting equal load, thats because sharding is right now happening on the names of files and thats a poor criteria for sharding data itself. In next iteration I am taking the sharding to file’s block level. Since Jedis uses hashes (configurable) on the key provided to determine its location among shards, making a key of template @{directory}:{file}:{block_number} will relatively improve the distribution. Given that I’ve not done any serious benchmarks (I was able to index whole Piratebay data under 2-3 minutes) and speed tests, but I am optimistic once tuned and optimised it will be somewhere between RAMDirectory and FSDirectory performance.
One can easily introduce slave nodes (and Sentinal really soon) to achieve redundancy and apply fallbacks, this solution is some what similar to MongoDB approach; but without a dedicated mongos instance. The clients themselves are intelligent and handle sharding.
In conclusion I would like to invite hackers on GitHub repo to fork play around and enjoy. I would be happy to accept pull requests. Happy hacking!
Update: I’ve just enabled Solr support checkout GitHub repo for details on how to set it up.
After my previous article the PostgreSQL landscape has changed totally. We have a total new version (9.2) with lots of speed optimisations and new features. That brings me out of my cave to visit the (promised) FriendFeed’s schema-less data in MySQL casestudy. I found FriendFeed technique to be one of the best examples of knowing your tools, and not following the buzz; even with very simple software stack they were able to create some great marvels (remember Python Tornado now being used by Facebook?). I will try to keep the whole thing really scientific and re-imagine how Postgres could have change the scenario completely. I will briefly explain the solution from FriendFeed, and then show how I would have done same thing using Postgres.
I am in no way trying to do a comparison of MySQL and PostgreSQL features. I am not trying to prove which RDBMS is better and which one you should choose over the other. It’s just imagining a solution of a problem with a different tool!
So just to revisit briefly FriendFeed was facing issues when adding new features due to its increasing user base, one of the biggest issue was schema changes. “In particular, making schema changes or adding indexes to a database with more than 10 - 20 million rows completely locks the database for hours at a time”, nobody would like to have Facebook account blocked just because a new timeline was introduced! The solution they produced was nifty, incremental and very inspiring. They stored JSON entities with 16-byte UUID, now JSON due to its schema-less nature can dynamically add or drop values from the entity (JSON object). You can simply choose a BLOB (even TEXT for simplicity) to store the JSON. For each property (JSON property) in entity that requires to be indexed; they created a separate table with primary key of {user_id, property_of_entity}; and this rule can be applied vitally everywhere. This allowed them to dynamically create and drop indexes on different fields of an entity. Since tables can be shraded, each index table can be sharded.
Now we can do the exact same thing PostgreSQL (we can choose to completely stick to FriendFeed solution and don’t anything special/specific to Postgres); but the good news is that Postgres has some really neat features that can help us improve, and remove extra coding overhead! Here is entity table’s schema redefined for PostgreSQL:
Notice nothing changed much except the BINARY going to BYTEA, and HSTORE for body. Yep you guessed it I am going to use HSTORE to actually store the body. Now if you don’t know what HSTORE is you can refer to my previous article. So representing an entity can actually be quite simple, consider entity (taken directly from blogpost):
Can be represented as
Inserting this entity into table is pretty straight forward:
INSERT INTO entities (id, updated, body)
VALUES(
’\x71f0c4d2291844cca2df6f486e96e37c’::bytea,
’2012-09-28T03:42:29.655011’::timestamp,
hstore(
ARRAY[‘updated’, ‘user_id’, ‘title’, ‘feed_id’, ‘link’, ‘published’, ‘id’],
ARRAY[
‘1235697046’,
’\xf48b0440ca0c4f66991c4d5f6a078eaf’,
’We just launched a new backend system for FriendFeed!’,
’\xf48b0440ca0c4f66991c4d5f6a078eaf’,
’http://friendfeed.com/e/71f0c4d2-2918-44cc-a2df-6f486e96e37c’,
’1235697046’,
’\x71f0c4d2291844cca2df6f486e96e37c’]
)
)
In Python using psycopg2 its even more pythonic. Here is just an example script to do that:
One can use SQLAlchemy to create even better looking code (using adapter like in this gist). Also you can use sharding just like FriendFeed guys, and yes SQLAlchemy supports sharding. That takes care of two major issues:
Now the last part is the indexing on the fields of our “structure free” body. FriendFeed used separate tables to do this and separate code in application code was maintaining that. Well good news PostgreSQL can do that form me without additional tables:
CREATE INDEX entities_index_user_id
ON entities USING BTREE(((body->’user_id’)::bytea));
There are different options for types of index you can create (GIN, GIST, HASH, and BTREE), and the best part is usual rules for functional indexes apply on these indexes as well. I won’t go into details but you can look into documentation, and have a detailed look what does this precisely mean. Creating index like this will usually cause the same lock issue on complete table; and this is where PostgreSQL shines again. You can create index concurrently by simply adding CONCURRENTLY in your CREATE INDEX statement:
CREATE INDEX CONCURRENTLY entities_index_user_id
ON entities USING BTREE(((body->’user_id’)::bytea));
Now what about the case when I don’t want to index a field any more? FriendFeed did it by dropping the index table and updating the code for not hitting the indexing table. What about PostgreSQL? You can either disable the index, or simply get rid of it and drop it:
DROP INDEX CONCURRENTLY IF EXISTS entities_index_user_id;
The above technique gives me multiple advantages; it helps me reduce code complexity by a huge margin! Imagine the pain of every time updating your code and dealing the complex architecture just because you introduced a field that needs indexing. Consider it against a simple CREATE INDEX statement. For me simplicity matters the most, and these builtin features are convincing enough for me to use PostgreSQL (even migrate to it). Secondly, if you look closely the index will lie on same shard where the tuple lies; this simply removes the possibility of accidentally moving the table used for indexing of field to a different shard (this may be done at times but it would compromise atomicity). Using the index created simply gives me all the powers that I would have on normal column index, which means inserting a row automatically hits the index removing the possibility of row being skipped due to buggy code. Schema updates become simpler and easier to maintain. Not to state the obvious its ACID! You can use modern JSON support to serialize your output directly to JSON. I am leaving the disadvantages portion to the audience.
Friendfeed is not the only case where such structure would have been required. At numerous occasions I’ve seen developers make a choice for the tools they know best, and same is true for Friendfeed.
Before I begin let me make one thing clear, I was just playing around and the code you are about to see is just something that came to me out of random,it’s not meant for production but its an idea I want to shootout. For me Redis can turn out to be an excellent inverted index storage system due to its available data-structures. I won’t be implementing anything from scratch or anything; but just to kick off and get some more minds into circle, I took whoosh (no particular reason) and tried to modify its storage for Redis, so that it becomes a semi equivalent of RamStorage. So without further delay here is the gist for my code:
Just to try it I’ve also made a python script to index the Piratebay’s data dump [Complete Gist is here]. Now just to mention I’ve indexed the Piratebay’s dump before in Lucene and it was disk stored index. I am not going to compare them today as I am not satisfied with the speed yet and I am planning to write my own IndexWrite and IndexReader classes. Let’s see if somebody can pull similar solution for Lucene in the mean time.
In my previous post I’ve shown how can you squeeze more out of your datastores by using compression. Similar rules apply to bandwidth and data transmission over the wire. There are various serialization formats you can choose from Protobuf, Thrift, Avro just to name few. I always prefer schema free serialization formats in case of key/value stores, this gives me error free data loading even when your data changes over time. The most favorite format for structure free schema is obviously JSON. However the new kind on the block is MsgPack. I’ve been playing around for a while with MsgPack in few projects of mine and it is actually good, compact, well documented and supports lots of platforms. But how does it compare to JSON (not in terms of performance or speed)? We all have a huge trust on JSON, its already a superhero who slayed XML despite its weaknesses. In this little adventure I set out to compare JSON vs MsgPack in terms of bytes when compressed! Lets get straight down to the business, here is the source code I used:
I am simply loading about 200 random tweets, then encoding those tweets to JSON, MsgPack, with Gzip and LZ4 compression. Results are pretty disturbing in case of GZip:
Now LZ4 looks quite normal and just as we expected with but GZIP just in 200 tweets MsgPack takes 189057 bytes and JSON takes only 177976 bytes. Bingo! now this is what I call a smart combination. You get 2 standard components that’s not only available for native applications you can write; but they are also available in modern browsers you are using today! You can use them in Javascript too with no special decoder to load data and simply use it. Now some of you may be wondering what’s the big deal? Here is the deal, if you can detect your browser supports GZIP Content-Encoding over XHR, you can serve gzipped JSON directly out of the data store to your clients (i.e. no fetch encode to JSON and stream Gzipped). You can use similar technique for cache systems like NGINX + Memcache [using HttpMemcachedModule] to serve some of your REST calls really quick (user profile, user info, etc).
One of the most awesome things about python is its ability to pass parameters by name to a function, for those of you who don’t know it look something like this:
my_foo_function(param_name="value", another_param_name="another value")
Today I wanted to do same in PHP 5.4 (can be easily ported to 5.3), so I thought of making a call_user_func_named just like the PHP native one call_user_func_array, here is the gist, you are open to contribute, modify or reuse in whatever way you like. Have fun!
Update: thanks to some nice contributors here are improvements in existing code
Since the usage Google’s LevelDB in Riak, and cases like BerkleyDB at Yammer we (developers) have realized how we can take a trivial building blocks, and build something really awesome out of them. One of the great lessons that I’ve learned from my experience of the wild (LevelDB, Pinterest just to name a few) can be summarized as
If possible at all try to minimize the length (bytes) of values you save (or write) in a data store.
Developers usually don’t care much about the values they are storing in a data store, which in turn results in performance issues. In this post I would be trying to convince you about keeping an eye on average length of values and for sake of a demo (also throwing out a new combination of tools Tokyocabinet and LZ4) I would be doing a benchmark.
As an example if we look at the JSON from twitter’s REST API; a tweet can be (see this for example) almost around 3K-4K (even longer in case of retweets). Let’s take Tokyocabinet and LZ4, and do some benchmarks proving how much difference we can make! One of the key points here is to choose speed efficient library rather than space (compressed size) efficient library (we don’t want to spend too much time on compressing the data). LevelDB uses Snappy to exploit the same principle for speeding itself up.
So for benchmarks I took some of my tweets (11 to be precise), dumped them into files. Later read those tweets in random order as value for storage to emulate some real world data, here is the link to gist of source code that I used. The last flag (true/false) to function put_in_db line 183 turns compression on/off. Compiling (command gcc -O2 testcab.c lz4.c -ltokyocabinet -o test) and running on my desktop machine gives me:
Uncompressed:
Total time consumed for 100000 entries 503514(ms)
File size (dump.tcb): 382.5 MB (382,522,880 bytes)Compressed:
Total time consumed for 100000 entries 244250(ms)
File Size (dump.tcb): 197.5 MB (197,468,416 bytes)
Amazed? The 100,000 random entries are almost random order due to CRC32. Do notice the time and size difference; it takes us HALF the size and speed of uncompressed tweets. Lesson to be learned from this technique is to put our bet on a more powerful unit (CPU), rather than a weak-spot (HDD) of our machine. Which is safe and it simply works!
Taking a good care on size can give us good numbers, and they are applicable mostly everywhere. Compression can make even better sense when you have smartphone applications. In that case you can simply pull out data from your data-store and stream it to your smartphone to do the rest of the job. Saving space and time both is a rare combination and it can be possible if you use the compression technique wisely.