What would it take to fold the ad-hoc sharding that people do with postgres into postgres? Or, what would it take to make postgres scale like riak and cassandra?

  • A shard routing server kind of like mongos. This could be implemented as a foreign data wrapper that holds a connection pool and routes queries to shards based on qualifiers it has with respect to the column being sharded on. It could get fancy by supporting scatter-gather aggregation and joins but that probably isn’t necessary because by the point you need to shard doing aggregation and joins on the production database is already too dangerous (I assert). Updates to the hash ring happen over paxos (or zab or raft or whatever).
  • A server responsible for annointing new users and assigning them to shards. This could happen at random or it could be aware of geographic locality and things.
  • The multicast views described below

Multicast Views

Problem: You’re tumblr. You have your data sharded on user id (I think they call these shards “pods”). When a request comes in it has a single user associated with it; let’s call this user “the client”. Each user has a feed of the k most recent posts by a set of users they choose to follow. You could, when the client asks for this feed, ask all the shards that are responsible for users that the client is subscribed to for the k most recent posts, and merge the posts together sorting by timestamp. But this is problematic because the distribution of subscribers per user is highly skew (inverse exponential, even), so some shards are going to get a lot more requests than others, and fall over.

So what you want to do is have the client’s shard keep a materialized view of their feed, which is updated by pulling updates off some kind of message bus. Whenever a node recieves a new post from one of its clients, it puts that post on the bus so all the other nodes can see it and possibly update some of their views. This bus could be a central message broker (eg: rabbitmq), or a gossip protocol. Let’s say you can’t afford the central choke-point of a message broker so you use gossip.

Sending every post to every shard results in a lot of network traffic, especially if the posts are big (tumblr doesn’t have a character limit like twitter does). So you would like to send them only to the nodes that want them. The definition of a user’s feed is something like select * from posts where author_id in (select author_id from subscriptions where subscriber_id = :client_id) order by created_on desc limit 1000. The order by created_on part we get basically for free by virtue of nodes only broadcasting new posts. So we have to worry about two things: materializing the sets of users that all of this shard’s clients are subscribed to, and sending new posts to as few nodes as possible that don’t have the author in one of their subscription sets.

Here’s the property we want: When a node gets a new post, it looks at its neighbors, can tell which neighbors (and the spanning trees rooted at them) don’t care about it, and forwards it only to the ones that might. Ideally we’d like to send the update to at most log n nodes that don’t care about it, where n is the number of nodes in the network.

Implementation: This is some kind of multicast. Assuming we can’t use IP multicast (which a lot of networks don’t support), we need to build an overlay. There has been some work on using Kademlia for this, and an implementation based on Pastry. So to maintain a materialized view:

  • Turn query qualifiers into hashes
  • Subscribe in the multicast system to the set of hashes for the qualifiers our materialized view uses
  • On insert, translate the row into the hashes of its columns and brodcast it once for each of them
  • When a node recieves a message, it re-checks the qualifiers for each materialized view, and inserts it into the view if they match

I think that the vast majority of web applications could work under this regime, and many are already implementing this in an ad-hoc way.


  • How to translate qualifiers into hashes without enumerating entire ranges of values (is this necessary? could we just enumerate all the possibilities?)
  • This system doesn’t say anything about order or convergence; it’s assumed that these views are sets and that insertion is commutative, associative, and idempotent
  • Deletion needs to work by inserting tombstones, which has a well-known set of (solvable) problems. Edit: We’re only allowing the creator of a row to delete it, so no, we can actually just delete it
  • A given row can only be updated by the shard that created it or we don’t converge. If updates can only be updated by the shard that created them then we just need to annotate each row with author-shard and a version counter and we’re good. I think this is fine.
  • Do we need to support aggregations in the queries that generate these views? If yes, can we require the aggregation functions to be commutative, associative, or idempotent? If we do and we can’t then we need synchronous replication and we’re screwed.