28/9/2012

After my previous article the PostgreSQL landscape has changed completely. We have a brand new version (9.2) with lots of speed optimisations and new features. That brings me out of my cave to visit the (promised) case study of FriendFeed's schema-less data in MySQL. I find FriendFeed's technique to be one of the best examples of knowing your tools and not following the buzz; even with a very simple software stack they were able to create some great marvels (remember Python Tornado, now being used by Facebook?). I will try to keep the whole thing really scientific and re-imagine how Postgres could have changed the scenario completely. I will briefly explain the solution from FriendFeed, and then show how I would have done the same thing using Postgres.

I am in no way trying to compare MySQL and PostgreSQL features. I am not trying to prove which RDBMS is better or which one you should choose over the other. This is just imagining a solution to a problem with a different tool!

So, just to revisit briefly: FriendFeed was facing issues when adding new features due to its increasing user base, and one of the biggest issues was schema changes. “In particular, making schema changes or adding indexes to a database with more than 10 - 20 million rows completely locks the database for hours at a time”. Nobody would like to have their Facebook account blocked just because a new timeline was introduced! The solution they produced was nifty, incremental and very inspiring. They stored JSON entities keyed by a 16-byte UUID; JSON, thanks to its schema-less nature, can dynamically add or drop values from the entity (JSON object), and you can simply choose a BLOB (or even TEXT for simplicity) to store it. For each property of an entity that needs to be indexed, they created a separate table with a primary key of {property, entity_id} (for example, {user_id, entity_id}); this rule can be applied virtually everywhere. It allowed them to dynamically create and drop indexes on different fields of an entity. And since tables can be sharded, each index table can be sharded too.
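
As a sketch, the per-property index table for user_id would look roughly like this on their MySQL setup (modelled on the original blog post; application code keeps it in sync with the entities table):

CREATE TABLE index_user_id (
    user_id BINARY(16) NOT NULL,
    entity_id BINARY(16) NOT NULL UNIQUE,
    PRIMARY KEY (user_id, entity_id)
) ENGINE=InnoDB;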

Now we can do the exact same thing in PostgreSQL (we could choose to stick completely to FriendFeed's solution and not do anything special/specific to Postgres); but the good news is that Postgres has some really neat features that can help us improve on it and remove extra coding overhead! Here is the entity table's schema redefined for PostgreSQL:
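
A minimal sketch, assuming a layout that mirrors FriendFeed's original entities table (hstore ships as an extension, so it has to be installed first):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE entities (
    added_id SERIAL NOT NULL PRIMARY KEY,  -- insertion order, as in the original design
    id BYTEA NOT NULL UNIQUE,              -- 16-byte UUID
    updated TIMESTAMP NOT NULL DEFAULT now(),
    body HSTORE
);

CREATE INDEX entities_updated_index ON entities (updated);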

Notice that nothing much changed except BINARY becoming BYTEA, and HSTORE for the body. Yep, you guessed it: I am going to use HSTORE to actually store the body. If you don't know what HSTORE is, you can refer to my previous article. Representing an entity can actually be quite simple; consider this entity (taken directly from the blog post):
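
(Reconstructed here from the values used in the INSERT statement below:)

{
    "id": "71f0c4d2291844cca2df6f486e96e37c",
    "user_id": "f48b0440ca0c4f66991c4d5f6a078eaf",
    "feed_id": "f48b0440ca0c4f66991c4d5f6a078eaf",
    "title": "We just launched a new backend system for FriendFeed!",
    "link": "http://friendfeed.com/e/71f0c4d2-2918-44cc-a2df-6f486e96e37c",
    "published": 1235697046,
    "updated": 1235697046
}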

This can be represented as:
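
One possible sketch of the stored hstore value (hstore is a flat map of text keys to text values, so the numbers become strings too):

"id"=>"71f0c4d2291844cca2df6f486e96e37c",
"user_id"=>"f48b0440ca0c4f66991c4d5f6a078eaf",
"feed_id"=>"f48b0440ca0c4f66991c4d5f6a078eaf",
"title"=>"We just launched a new backend system for FriendFeed!",
"link"=>"http://friendfeed.com/e/71f0c4d2-2918-44cc-a2df-6f486e96e37c",
"published"=>"1235697046",
"updated"=>"1235697046"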

Inserting this entity into the table is pretty straightforward:

INSERT INTO entities (id, updated, body)
VALUES(
 '\x71f0c4d2291844cca2df6f486e96e37c'::bytea,
 '2012-09-28T03:42:29.655011'::timestamp,
 hstore(
  ARRAY['updated', 'user_id', 'title', 'feed_id', 'link', 'published', 'id'],
  ARRAY[
    '1235697046',
    '\xf48b0440ca0c4f66991c4d5f6a078eaf',
    'We just launched a new backend system for FriendFeed!',
    '\xf48b0440ca0c4f66991c4d5f6a078eaf',
    'http://friendfeed.com/e/71f0c4d2-2918-44cc-a2df-6f486e96e37c',
    '1235697046',
    '\x71f0c4d2291844cca2df6f486e96e37c']
 )
);

In Python using psycopg2 it's even more pythonic. Here is just an example script to do that:
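
A minimal sketch, assuming a local database named friendfeed with the table above already created (register_hstore teaches psycopg2 to adapt Python dicts to hstore and back):

import datetime
import uuid

import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=friendfeed")  # hypothetical DSN, adjust to taste
psycopg2.extras.register_hstore(conn)         # Python dict <-> hstore adaptation

entity_id = uuid.uuid4()
body = {
    'user_id': 'f48b0440ca0c4f66991c4d5f6a078eaf',
    'title': 'We just launched a new backend system for FriendFeed!',
    'link': 'http://friendfeed.com/e/71f0c4d2-2918-44cc-a2df-6f486e96e37c',
    'published': '1235697046',  # hstore values are always text
}

cur = conn.cursor()
cur.execute(
    "INSERT INTO entities (id, updated, body) VALUES (%s, %s, %s)",
    (psycopg2.Binary(entity_id.bytes), datetime.datetime.utcnow(), body),
)
conn.commit()
cur.close()
conn.close()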

One can use SQLAlchemy to write even better looking code (using an adapter like the one in this gist). You can also shard just like the FriendFeed guys did, and yes, SQLAlchemy supports sharding; a sketch follows the list below. That takes care of two major issues:

  • Structure free storage through HSTORE
  • Sharding and code clarity through SQLAlchemy; you can choose to do it manually if you don't like ORMs.
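
As a rough sketch of the SQLAlchemy side (the shard names, connection strings, and the mapped Entity class with a 16-byte id attribute are all hypothetical; routing here is by the first byte of the id):

from sqlalchemy import create_engine
from sqlalchemy.ext.horizontal_shard import ShardedSession
from sqlalchemy.orm import sessionmaker

# two hypothetical Postgres shards
shards = {
    'shard0': create_engine('postgresql://localhost/entities_0'),
    'shard1': create_engine('postgresql://localhost/entities_1'),
}

def shard_chooser(mapper, instance, clause=None):
    # route an entity to a shard by the first byte of its 16-byte id
    return 'shard%d' % (bytearray(instance.id)[0] % 2)

def id_chooser(query, ident):
    # a lookup by primary key can be routed the same way
    return ['shard%d' % (bytearray(ident[0])[0] % 2)]

def query_chooser(query):
    # with no usable predicate, fan the query out to every shard
    return list(shards)

Session = sessionmaker(
    class_=ShardedSession,
    shards=shards,
    shard_chooser=shard_chooser,
    id_chooser=id_chooser,
    query_chooser=query_chooser,
)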

Now the last part: indexing on the fields of our “structure free” body. FriendFeed used separate tables for this, with separate application code to maintain them. Well, good news: PostgreSQL can do that for me without additional tables:

CREATE INDEX entities_index_user_id
ON entities USING BTREE ((body->'user_id'));

There are different types of index you can create (GIN, GIST, HASH, and BTREE), and the best part is that the usual rules for functional indexes apply to these as well. I won't go into the details, but you can look into the documentation for precisely what this means. Creating an index like this would normally cause the same lock issue on the complete table; and this is where PostgreSQL shines again. You can build the index concurrently by simply adding CONCURRENTLY to your CREATE INDEX statement:

CREATE INDEX CONCURRENTLY entities_index_user_id
ON entities USING BTREE ((body->'user_id'));

Now what about the case when I don't want to index a field any more? FriendFeed did it by dropping the index table and updating the code to stop hitting that table. What about PostgreSQL? You can either disable the index or simply get rid of it and drop it:

DROP INDEX CONCURRENTLY IF EXISTS entities_index_user_id;

The above technique gives me multiple advantages. It helps me reduce code complexity by a huge margin: imagine the pain of updating your code and dealing with a complex architecture every time you introduce a field that needs indexing, and consider it against a simple CREATE INDEX statement. For me simplicity matters the most, and these built-in features are convincing enough for me to use PostgreSQL (or even migrate to it). Secondly, if you look closely, the index lies on the same shard as the tuple; this removes the possibility of accidentally moving the table used for indexing a field to a different shard (this may be done deliberately at times, but it would compromise atomicity). The index behaves just like a normal column index, which means inserting a row automatically hits the index, removing the possibility of a row being skipped due to buggy code. Schema updates become simpler and easier to maintain. Not to state the obvious, but it's ACID! Finally, you can use the modern JSON support to serialize your output directly to JSON. I am leaving the disadvantages portion to the audience.
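
For instance, a sketch using row_to_json (available in 9.2; note that without an explicit conversion the hstore body comes out as its text representation inside the JSON):

SELECT row_to_json(e)
FROM entities e
WHERE e.id = '\x71f0c4d2291844cca2df6f486e96e37c'::bytea;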

FriendFeed is not the only case where such a structure is required. On numerous occasions I've seen developers make a choice for the tools they know best, and the same was true for FriendFeed.