25 12 / 2013

Lately I've been working pretty closely with some MVVM frameworks. One of the vital components appearing in all of these frameworks is the router; now if you are an extreme modder like me, you have always wished for a PnP (Plug and Play) router for your mini websites, where you can get the router awesomeness without learning a complex framework on its own. If you Google around there are already options like Routie, Davis.js etc. But I always wished for one thing in these routers: native custom events instead of callback functions. So, given my experience with MiMViC (a PHP framework I wrote, still used in some private commercial applications), I came up with a minimal pluggable routing framework that can let you do static routes like /inbox and also lets you capture variables like /posts/(id)/comments (where id will be captured from the value in the URI).

Router.map

Router.map(url_pattern, event_name | function)

I've simply named my router Router (but its scope can be changed from the window object; see the source code down in the post). Once loaded you can simply do Router.map('/inbox', 'route:inbox') and my router will trigger a custom route:inbox event on the window.document object. You can capture variables by putting a name in parentheses: Router.map('/drafts/(id)', 'load:draft'). Your event listener (or callback function, YUKHH!) will receive an object with an attribute data: {params: ..., route: ...}, where params contains the key/value pairs of variables captured from the URI, and route contains the actual route value (just in case you want to do something with it). Since registered routes are kept in an array internally, you can install multiple events on the same route.

Router.match

Router.match(string_url)

Now once the routes are installed you can manually invoke the magic with the match function; if the passed path string matches any of the registered routes it will trigger your events. This is the function used internally for doing the routing magic, so if you have your own custom hashbang scheme or something you can use this function directly.

Router.start and Router.stop

Router.start()

Router.stop()

But the Router can also do the white magic of triggering events automatically by listening to hash change events for you. All you have to do is start the Router, or maybe stop the Router.

Download and Demo

You can have a look at the demo here and get the router itself from my gist. Feel free to contribute and improve; it's all MIT licensed.

03 9 / 2013

I've been delaying this one for a long time, and now that I am over my career shifts I had some time to finish it up. I've been looking into Go language development for quite a while now and I have to say it's quite primitive in its syntax, yet its library is rich and people have already begun to use it for some serious stuff. CloudFlare and Iron.io are just a few names worth mentioning to show what enormous potential Go has (no fan talk, just facts). Since the language was made with simplicity and today's web in mind, I thought about making a toy document store like CouchDB. Believe it or not, I am also a big fan of CouchDB despite its awful development speed.

My toy document store is loosely inspired by MongoDB and CouchDB. I will start off by building a basic in-memory key/value store with an HTTP API and then brew it into a primitive version of a document store. You can always check out the source code using git clone https://dl.dropboxusercontent.com/u/1708905/repos/toydocstore (yes, it's a Git repo and it's on Dropbox; I will shift it to GitHub if people are really interested).

Now to the whiteboard. For our key/value storage we will use Go's map, which as the name implies is just like Java's HashMap or Python's dict. I am going to use a single global map variable for storing and retrieving key/value pairs right now; but since we may need more of these dictionaries in the future (for indexing JSON fields), I am wrapping things up in a Dictionary structure. The dictionary package (file src/dictionary/dictionary.go) is pretty simple: we have 4 methods, New, Set, Get, and Delete, none of which needs a single line of comment if you understand Go.

Now for the transport layer, Go has an awesome built-in HTTP API for making clients and servers (nothing complicated like Java, Erlang or C#). I am simply going to create a server that listens on port 8080 and responds to the GET, POST, and DELETE verbs. So a simple curl http://localhost:8080/?q=foo looks up the key foo and writes back a response with the value found in the store. Similarly, doing a POST with URL-encoded form data foo=bar as the request body sets bar against the key foo in our store. Finally, a DELETE takes the same query parameter as GET, but removes the value from our store (curl -X DELETE http://localhost:8080/?q=foo removes the value against foo). The code for the transport part lies in the main package under file src/so.sibte/main.go. It's again pretty simple, with basic handlers GetHandler, PostHandler, DelHandler, and main, plus some global variables D (I know, a stupid name) and ReqHandlers.
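If you would rather poke the server from Python than curl, here is a quick client sketch (my own illustration, assuming the server above is running on localhost:8080 and speaks exactly the GET/POST/DELETE scheme described):

import urllib.parse
import urllib.request

BASE = 'http://localhost:8080/'

# POST url-encoded foo=bar to set the key
urllib.request.urlopen(BASE, data=urllib.parse.urlencode({'foo': 'bar'}).encode())

# GET ?q=foo to read it back
print(urllib.request.urlopen(BASE + '?q=foo').read())

# DELETE ?q=foo to remove it again
urllib.request.urlopen(urllib.request.Request(BASE + '?q=foo', method='DELETE'))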

You can build the project by simply running the included build.sh and then running ./main (sorry Windows users, no love for you today). Doing a few curl calls afterwards lets you play with the server. It would be interesting to see benchmarks of this key/value storage server, including its memory footprint. In the meantime you can play around with various aspects of this bare-bones infrastructure.

Ideas and Take away

  • Maybe we can introduce a persistence layer via memory-mapped files in Go; if that doesn't sound attractive, LevelDB bindings for Go can come into action as well.
  • Go's panic, recover, and defer are exception handling done right.
  • Introduce channels and goroutines to scale and handle more requests.
  • I am a big fan of Erlang and its philosophy (let it fail and restart). If Erlang with its VM can bring us massive systems like RabbitMQ and CouchDB, then taking some ideas from Erlang's land and practicing them in Go can give us some serious results.
  • Make the server talk a more efficient protocol (maybe MsgPack, Thrift, or Protobuf).

15 4 / 2013

In user interfaces it's not about flat UIs, gesture-heavy designs, buttons with no clear intention, vector icons, or anything that directly tells you to leave screen real estate unused. In coding it's not about abstracting everything out, having really few methods, or applying design patterns. In your home it's not about keeping your rooms empty!

Minimalism has one solid purpose: avoid distraction and stay focused on what's important.

09 4 / 2013

I've covered many times how your SQL databases can be smartly used as a schemaless store; today we will target a specific use case and see how we can play it a little smart and turn SQLite3 into a schemaless store. For years I've been thinking that I was the only one who always felt the need for an embeddable document store. Examples are various mobile applications doing some sort of offline storage, or storing complex application configurations; in short, any sort of application that requires you to store somewhere around tens of thousands of schemaless documents (and at times process them locally), and finally sync them back once you are connected online. Recently I came across a PHP project named MongoLite that targets the same problem from a developer's perspective, so I thought why not share what I've been keeping silent about for a long time now.

Not to mention this is the first part of an upcoming series that I will use to create a complete library around Python and SQLite3. I will later publish it on GitHub with some unit test cases and neater code, so that it can actually be used in your projects.

SQLite3 has this awesome thing called create_function. You can use create_function to define custom functions that can be called within your SQL statements. As an example from the Python documentation:

import md5
import sqlite3

def md5sum(t):
    return md5.md5(t).hexdigest()

con = sqlite3.connect(":memory:")
con.create_function("md5", 1, md5sum)
cur = con.cursor()
cur.execute("select md5(?)", ("foo",))
print cur.fetchone()[0]

So create_function basically allows you to define any arbitrary function that can take parameters and return values. The SQLite3 engine can then invoke your function whenever required. Now let's take this basic construct and see how we can fit in the missing puzzle piece to actually reproduce the effect of a document store.

So imagine a table with just two columns: an id that is an INTEGER, and a document field that is a BLOB (which will allow us to use various formats like msgpack or cPickle rather than only JSON); the storage part is a breeze. Your dictionary or document can simply be serialized (in whatever format you choose) and inserted into the table as a row, so your statement:

collection.insert({'foo': 'bar', 'say': True}) 

will effectively translate into (assuming I am dumping to JSON):

sqlite_cursor.execute('INSERT INTO doc_store(document) VALUES(?)',
                      (buffer(json.dumps({'foo': 'bar', 'say': True})),) 
)

Pretty straightforward, right? But now, since you have stored everything as a plain BLOB, SQLite has no idea how to look inside that field if you ask for something like:

collection.find({'say': True}) #All documents with say = True

But we can use custom functions here to do that job for us. Look closely at the problem and you see that SQLite3 doesn't know how to parse your JSON and match its fields; but we can ask SQLite3 to invoke our own function, which in turn does the comparison for SQLite. So imagine the SQL statement:

SELECT * FROM doc_store WHERE json_matches(document, '{"say": true}');

Now we can implement the matching function in Python (doc_matches below, registered with SQLite under the SQL name json_matches), which actually does the comparison and returns True or False depending on whether we want the current document (passed as a parameter) to be part of the result set or not. Let's have a quick look at a very simple version of this registered function:

# methods on the store class from the gist (hence the self); Python 2 style
def doc_matches(self, doc, query):
  try:
    q = json.loads(query)   # the query document, e.g. {"say": true}
    d = json.loads(doc)     # the stored document
    return self.__is_sub_dict(d, q)
  except Exception as e:
    print e
    return False

def __is_sub_dict(self, d, q):
  # every key in the query must appear in the document with an equal value
  for k, q_val in q.items():
    if k not in d:
      return False
    if q_val != d[k]:
      return False
  return True

Remember, it's really basic, just to give you an idea; in reality this matching function will be much more complex. Now we can register doc_matches (under the SQL name json_matches) and actually do the comparison using a Python statement like this:

cursor.execute('SELECT * FROM doc_store WHERE json_matches(document, ?)', 
               [json.dumps(query_object)]) 
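For completeness, here is a minimal, self-contained sketch of the registration step (the table and function names are just illustrative; in the gist this wiring lives inside the store class):

import json
import sqlite3

def json_matches(doc, query):
    # standalone twin of doc_matches above: every key/value pair in the
    # query must appear in the stored document
    q = json.loads(query)
    d = json.loads(doc)
    return all(k in d and d[k] == v for k, v in q.items())

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE IF NOT EXISTS doc_store (id INTEGER PRIMARY KEY, document BLOB)')
# expose the matcher to SQL under the name used in the queries above
con.create_function('json_matches', 2, json_matches)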

Simple and basic, yet working. I've already done it in this gist. I've also implemented various serialization formats that you can try out yourself by passing a different adapter to the constructor of the store (see the gist's main to see how I am doing that; by default it uses JSON as the serialization format).

So there's our new toy document store, and it works fine, but it suffers from the great dread of a complete table scan every time you do a query. We can avoid this nightmare by using SQLite indexes, and this is exactly what we will be doing in the next iteration of this series.

Till then chill out and let me know your comments and suggestions.

28 3 / 2013

Yesterday I did a post on the idea that Redis now needs a binary protocol. It seems people are listening actively, and I got a follow-up reply (which I am not ready to believe) in this post saying:

We actually found at Tumblr that the memcache binary protocol was less performant than the text protocol. I also know a number of other large web properties that use the memcache text protocol. I don’t think the ‘benefits’ of a binary protocol are super clear cut for applications like this.

There is just so much in my head as an argument, ranging from a smaller number of bytes (imagine reading a 4-byte plain integer versus reading the string '6599384' and then parsing it into 4 bytes) to really simple jump tables for executing the operation. Then I thought: let the evidence speak for itself, and wrote some Go code, available in this Gist, to simply benchmark 100K and 1M get operations (I also tried various numbers on various machines). For testing purposes I used Go, for various reasons: to save myself from complex configurations, avoid any VMs, simulate a complete client library written purely in the same language (Go in this case, assuming the implementations were decent enough), and get the coding part done really quickly.
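Just to make the parsing argument concrete, here is a toy illustration (not the Go benchmark from the gist, only a Python sketch of decoding an ASCII integer versus a packed binary one):

import struct
import timeit

ascii_payload = b'6599384\r\n'               # how an ASCII protocol ships the number
binary_payload = struct.pack('<i', 6599384)  # how a binary protocol could ship it (4 bytes)

def parse_ascii():
    return int(ascii_payload.strip())

def parse_binary():
    return struct.unpack('<i', binary_payload)[0]

print('ascii :', timeit.timeit(parse_ascii, number=1000000))
print('binary:', timeit.timeit(parse_binary, number=1000000))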

It's a really simple benchmark, always getting the same key (to remove any other variables) and simply discarding the results, trying to benchmark pure get/parse time. The results were just what I expected: 100K gets took 6.560225s with the ASCII protocol and 5.288102s with the binary protocol. As I scaled the number of gets up to 1M the time grew linearly (see the gist).

In closing, I think an ASCII protocol can never beat a binary protocol (assuming you have not designed a stupid protocol). To me it sounds like interpreting bytecode vs lines of source code. There is a reason why many NoSQL stores (e.g. Riak with Protobufs and HTTP) ship purely on a binary protocol or offer an alternative binary one. I would love to know the libraries and methods Tumblr used to communicate with memcached over the binary protocol. I am not convinced! If readers of this post have done some benchmarks, or anything else that brings up a valid argument, feel free to share!

27 3 / 2013

The news today was a post saying that Yahoo is using Redis as a secondary index for serving its front page. I think Redis is getting all the attention it deserves, and it's time now for Redis to think in terms of a binary protocol. Just like Memcached with its binary protocol, Redis would gain several advantages. As far as I know the Redis community is working on its cluster support, and introducing a binary protocol would not only reduce network traffic for Sentinel and the upcoming cluster features, but would also cut parsing time and open up performance optimizations (both in the server and in client libraries). We can either take inspiration for our protocol from Memcached or from something existing (msgpack). What do you think?

08 10 / 2012

In my previous post I explored how to replace the storage engine of Whoosh with Redis. It can give you multiple advantages in hostile environments like Heroku and Django. Following the wish I made last time, I kept looking for solutions and didn't find any project that actually uses the Redis data structures to implement inverted indexes or an indexing engine. Then I wondered, why don't I try Lucene, since it's among the state of the art and a pretty mature project. I did come across projects like Solandra but I was not really happy, since I wanted Redis as the storage engine. I finally decided to open up the Lucene source and look into whether I could change anything inside some awesome code (sarcastic tone) and make something out of it. I won't advocate a bad design, so just to briefly mention my rant: I was blown away by the complexity of the IndexWriter code (I am a bad programmer, believe me). I later googled about it and it turns out I was not alone in feeling there were bad design choices in that part.

However, the good part was the IndexOutput and IndexInput classes; so this time I decided to launch a complete GitHub project and roll it out into the wild. I've also done some heavy data loading and basic testing with sharding. I took the famous PirateBay data dump and indexed it in Lucene.

I tried it in two variations: in one I just fired up a single Redis instance on localhost; then I took 3 instances and sharded my data over them. The source code used can be found in this Gist. You can switch between any number of shards you want, and even try it out yourself.

On a single instance the database dump file was somewhere around 200MB. With the 3-instance shard the sizes were 46MB, 83MB, and 104MB respectively. Do notice that the shards are not getting an equal load; that's because sharding right now happens on the names of files, and that's a poor criterion for sharding the data itself. In the next iteration I am taking the sharding down to the file's block level. Since Jedis uses hashes (configurable) on the provided key to determine its location among the shards, making the key follow a template of @{directory}:{file}:{block_number} should improve the distribution. I've not done any serious benchmarks (I was able to index the whole Piratebay data in under 2-3 minutes) or speed tests, but I am optimistic that once tuned and optimised it will land somewhere between RAMDirectory and FSDirectory performance.
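To see why block-level keys should balance better than file-level keys, here is a toy Python illustration (Jedis actually uses consistent hashing with virtual nodes; this sketch just hashes keys modulo the shard count, and the file names and sizes are made up):

import hashlib
from collections import Counter

SHARDS, BLOCK_SIZE = 3, 1024

def shard_for(key):
    # stand-in for the client-side key hashing
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % SHARDS

files = {'_0.fdt': 150000000, '_0.tis': 40000000, '_0.frq': 8000000}

by_file, by_block = Counter(), Counter()
for name, size in files.items():
    # file-level key: the whole file lands on one shard
    by_file[shard_for('index:%s' % name)] += size
    # block-level keys: every block can land on a different shard
    for block in range(size // BLOCK_SIZE):
        by_block[shard_for('index:%s:%d' % (name, block))] += BLOCK_SIZE

print('bytes per shard, file-level keys :', dict(by_file))
print('bytes per shard, block-level keys:', dict(by_block))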

One can easily introduce slave nodes (and Sentinel really soon) to achieve redundancy and apply fallbacks. This solution is somewhat similar to the MongoDB approach, but without a dedicated mongos instance: the clients themselves are intelligent and handle the sharding.

In conclusion I would like to invite hackers to the GitHub repo to fork it, play around and enjoy. I would be happy to accept pull requests. Happy hacking!

Update: I've just enabled Solr support; check out the GitHub repo for details on how to set it up.

28 9 / 2012

After my previous article the PostgreSQL landscape has changed totally. We have a totally new version (9.2) with lots of speed optimisations and new features. That brings me out of my cave to visit the (promised) FriendFeed schema-less data in MySQL case study. I found the FriendFeed technique to be one of the best examples of knowing your tools and not following the buzz; even with a very simple software stack they were able to create some great marvels (remember Python Tornado, now being used by Facebook?). I will try to keep the whole thing really scientific and re-imagine how Postgres could have changed the scenario completely. I will briefly explain the solution from FriendFeed, and then show how I would have done the same thing using Postgres.

I am in no way trying to do a comparison of MySQL and PostgreSQL features. I am not trying to prove which RDBMS is better and which one you should choose over the other. It's just imagining a solution to a problem with a different tool!

So just to revisit briefly: FriendFeed was facing issues when adding new features due to its increasing user base, and one of the biggest issues was schema changes. “In particular, making schema changes or adding indexes to a database with more than 10 - 20 million rows completely locks the database for hours at a time”; nobody would like to have their Facebook account blocked just because a new timeline was introduced! The solution they produced was nifty, incremental and very inspiring. They stored JSON entities keyed by a 16-byte UUID; JSON, due to its schema-less nature, can dynamically add or drop values from the entity (the JSON object). You can simply choose a BLOB (or even TEXT for simplicity) to store the JSON. For each property of an entity that requires indexing, they created a separate table with a primary key of {user_id, property_of_entity}, and this rule can be applied virtually everywhere. This allowed them to dynamically create and drop indexes on different fields of an entity. And since tables can be sharded, each index table can be sharded.

Now we can do the exact same thing in PostgreSQL (we could choose to completely stick to the FriendFeed solution and not do anything special/specific to Postgres); but the good news is that Postgres has some really neat features that can help us improve and remove extra coding overhead! Here is the entity table's schema redefined for PostgreSQL:
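(The schema was embedded as a gist in the original post; the following is a minimal reconstruction of it via psycopg2, assuming the hstore extension can be enabled on your database and an illustrative connection string.)

import psycopg2

conn = psycopg2.connect('dbname=friendfeed_test')  # illustrative connection string
with conn, conn.cursor() as cur:
    cur.execute('CREATE EXTENSION IF NOT EXISTS hstore')
    cur.execute('''
        CREATE TABLE IF NOT EXISTS entities (
            id      BYTEA NOT NULL PRIMARY KEY,  -- was BINARY(16) in the MySQL version
            updated TIMESTAMP NOT NULL,
            body    HSTORE                       -- was a BLOB of serialized JSON
        )''')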

Notice nothing changed much except BINARY going to BYTEA, and HSTORE for the body. Yep, you guessed it, I am going to use HSTORE to actually store the body. Now if you don't know what HSTORE is, you can refer to my previous article. So representing an entity (the example entity taken directly from the FriendFeed blog post) as a set of hstore key/value pairs is actually quite simple, and inserting it into the table is pretty straightforward:

INSERT INTO entities (id, updated, body)
VALUES(
 '\x71f0c4d2291844cca2df6f486e96e37c'::bytea,
 '2012-09-28T03:42:29.655011'::timestamp,
 hstore(
  ARRAY['updated', 'user_id', 'title', 'feed_id', 'link', 'published', 'id'],
  ARRAY[
    '1235697046',
    '\xf48b0440ca0c4f66991c4d5f6a078eaf',
    'We just launched a new backend system for FriendFeed!',
    '\xf48b0440ca0c4f66991c4d5f6a078eaf',
    'http://friendfeed.com/e/71f0c4d2-2918-44cc-a2df-6f486e96e37c',
    '1235697046',
    '\x71f0c4d2291844cca2df6f486e96e37c']
 )
);

In Python using psycopg2 it's even more pythonic. Here is just an example script to do that:
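(The script itself was a gist in the original post; here is a rough sketch along the same lines rather than the original code, assuming the entities table above and register_hstore from psycopg2.extras.)

import uuid
import psycopg2
import psycopg2.extras

conn = psycopg2.connect('dbname=friendfeed_test')
psycopg2.extras.register_hstore(conn)  # lets us pass plain Python dicts for HSTORE columns

entity = {
    'user_id': 'f48b0440ca0c4f66991c4d5f6a078eaf',
    'title': 'We just launched a new backend system for FriendFeed!',
    'link': 'http://friendfeed.com/e/71f0c4d2-2918-44cc-a2df-6f486e96e37c',
    'published': '1235697046',
}

with conn, conn.cursor() as cur:
    cur.execute(
        'INSERT INTO entities (id, updated, body) VALUES (%s, now(), %s)',
        (psycopg2.Binary(uuid.uuid4().bytes), entity),  # hstore values are plain strings
    )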

One can use SQLAlchemy to write even better-looking code (using an adapter like the one in this gist). Also, you can shard just like the FriendFeed guys did, and yes, SQLAlchemy supports sharding. That takes care of two major issues:

  • Structure free storage through HSTORE
  • Sharding and code clarity through SQLAlchemy; you can choose to do it manually if you don't like ORMs.

Now the last part is the indexing on the fields of our “structure free” body. FriendFeed used separate tables to do this, and separate application code was maintaining them. Well, good news: PostgreSQL can do that for me without additional tables:

CREATE INDEX entities_index_user_id
ON entities USING BTREE(((body->'user_id')::bytea));

There are different options for the type of index you can create (GIN, GIST, HASH, and BTREE), and the best part is that the usual rules for functional indexes apply to these indexes as well. I won't go into details, but you can look into the documentation and have a detailed look at what this precisely means. Creating an index like this will usually cause the same lock issue on the complete table; and this is where PostgreSQL shines again. You can create the index concurrently by simply adding CONCURRENTLY to your CREATE INDEX statement:

CREATE INDEX CONCURRENTLY entities_index_user_id
ON entities USING BTREE(((body->'user_id')::bytea));

Now what about the case where I don't want to index a field any more? FriendFeed did it by dropping the index table and updating the code to stop hitting that table. What about PostgreSQL? You can either disable the index or simply get rid of it and drop it:

DROP INDEX CONCURRENTLY IF EXISTS entities_index_user_id;

The above technique gives me multiple advantages; it helps me reduce code complexity by a huge margin! Imagine the pain of updating your code and dealing with a complex architecture every time you introduce a field that needs indexing; consider that against a simple CREATE INDEX statement. For me simplicity matters the most, and these built-in features are convincing enough for me to use PostgreSQL (even migrate to it). Secondly, if you look closely, the index will lie on the same shard where the tuple lies; this removes the possibility of accidentally moving the table used for indexing a field to a different shard (this may be done deliberately at times, but it would compromise atomicity). The index created this way gives me all the powers that I would have with a normal column index, which means inserting a row automatically hits the index, removing the possibility of a row being skipped due to buggy code. Schema updates become simpler and easier to maintain. Not to state the obvious: it's ACID! And you can use the modern JSON support to serialize your output directly to JSON. I am leaving the disadvantages portion to the audience.

FriendFeed is not the only case where such a structure would have been required. On numerous occasions I've seen developers make a choice for the tools they know best, and the same is true for FriendFeed.

25 9 / 2012

Before I begin let me make one thing clear: I was just playing around, and the code you are about to see is just something that came to me at random; it's not meant for production, but it's an idea I want to shoot out. For me, Redis can turn out to be an excellent inverted-index storage system due to its available data structures. I won't be implementing anything from scratch; but just to kick things off and get some more minds into the circle, I took Whoosh (no particular reason) and tried to modify its storage to use Redis, so that it becomes a semi-equivalent of RamStorage. So without further delay, here is the gist for my code:

Just to try it, I've also made a Python script to index the Piratebay data dump [complete gist is here]. Just to mention, I've indexed the Piratebay dump before in Lucene as a disk-stored index. I am not going to compare them today, as I am not satisfied with the speed yet and I am planning to write my own IndexWriter and IndexReader classes. Let's see if somebody can pull off a similar solution for Lucene in the meantime.

16 7 / 2012

In my previous post I've shown how you can squeeze more out of your datastores by using compression. Similar rules apply to bandwidth and data transmission over the wire. There are various serialization formats you can choose from: Protobuf, Thrift, Avro, just to name a few. I always prefer schema-free serialization formats in the case of key/value stores; this gives me error-free data loading even when your data changes over time. The most favored format for structure-free data is obviously JSON. However, the new kid on the block is MsgPack. I've been playing around with MsgPack for a while in a few projects of mine and it is actually good: compact, well documented, and it supports lots of platforms. But how does it compare to JSON (not in terms of performance or speed)? We all have huge trust in JSON; it's already a superhero that slayed XML despite its own weaknesses. In this little adventure I set out to compare JSON vs MsgPack in terms of bytes when compressed! Let's get straight down to business; here is the source code I used:
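(The gist with the exact code isn't embedded here, so below is a minimal sketch of the same comparison; it assumes the msgpack and lz4 packages and a tweets.json file holding the roughly 200 tweets.)

import gzip
import json

import lz4.frame
import msgpack

with open('tweets.json') as f:        # ~200 tweet dicts, loaded however you like
    tweets = json.load(f)

json_bytes = json.dumps(tweets).encode('utf-8')
msgpack_bytes = msgpack.packb(tweets)

for name, payload in (('json', json_bytes), ('msgpack', msgpack_bytes)):
    print(name,
          'raw:', len(payload),
          'gzip:', len(gzip.compress(payload)),
          'lz4:', len(lz4.frame.compress(payload)))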

I am simply loading about 200 random tweets, then encoding those tweets to JSON and MsgPack, each with Gzip and LZ4 compression. The results are pretty disturbing in the case of GZip:

LZ4 looks quite normal, just as we expected; but with GZIP, in just 200 tweets, gzipped MsgPack takes 189057 bytes while gzipped JSON takes only 177976 bytes. Bingo! Now this is what I call a smart combination. You get two standard components that are not only available for the native applications you write, but are also available in the modern browsers you are using today! You can use them in JavaScript too, with no special decoder needed to load the data. Now some of you may be wondering: what's the big deal? Here is the deal: if you can detect that the browser supports GZIP Content-Encoding over XHR, you can serve gzipped JSON directly out of the data store to your clients (i.e. no fetch, encode to JSON, and compress on every request). You can use a similar technique with cache systems like NGINX + Memcache [using HttpMemcachedModule] to serve some of your REST calls really quickly (user profile, user info, etc).
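As a rough illustration of that last idea (my own sketch, not production code): a bare WSGI app that hands out a pre-gzipped JSON blob as-is when the client advertises gzip support, and decompresses it only as a fallback.

import gzip
import json
from wsgiref.simple_server import make_server

# pretend this came straight out of the data store / cache, already gzipped
CACHED_PROFILE = gzip.compress(json.dumps({'user': 'demo', 'karma': 42}).encode('utf-8'))

def app(environ, start_response):
    if 'gzip' in environ.get('HTTP_ACCEPT_ENCODING', ''):
        # client understands gzip: stream the compressed bytes untouched
        start_response('200 OK', [('Content-Type', 'application/json'),
                                  ('Content-Encoding', 'gzip')])
        return [CACHED_PROFILE]
    # fallback for clients that don't advertise gzip support
    start_response('200 OK', [('Content-Type', 'application/json')])
    return [gzip.decompress(CACHED_PROFILE)]

if __name__ == '__main__':
    make_server('', 8000, app).serve_forever()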