Apr 05, 2018
- I want to buy something from someone, which requires sending them money
- I don't know them, and they don't know me
- They're too far away for me to walk over and hand them cash (or rice or a goat or whatever)
- Neither they nor I have a credit card or a bank account
- I can't rely on having access to a central broker or ledger. Even if I had access to such a thing I wouldn't trust it. For example, I might live in a place where all the central banking systems are controlled by the government and have problems with graft and rent-seeking.
In other words, I need an actually distributed payment system. Bitcoin won't work for the obvious reason that it relies on a single ledger that every participant in the system can read.
So, can we solve this? Well, first let's see if we can solve an easier sub-problem and then build up from there.
- I want to buy something from someone, which requires sending them money
- I know the person I want to send money to
- Because I know them they have a pretty good idea of how creditworthy I am
- We're physically close enough that I can hand them cash
- I can't go over to their house and hand them cash right now, because I'm busy
I think the solution to this is pretty obvious. If they trust me enough they can accept an IOU where I agree to pay them in cash the next time I'm in the neighborhood, as long as I promise that this'll be within a certain time window (eg: a month). In other words, we're separating payment and settlement. I'm paying them for the thing now, that payment is taking the form of a debt on my balance sheet and an asset on theirs, and that debt will get settled at a specified time (unless I default). This is how a lot of the grown-up banking system works.
What if the person I'm paying is a computer?
They're not a computer, but this blog-post is building towards an automated payment system that people don't have to think too much about, so we'd like for things like creditworthiness to be determined automatically. One way to automatically determine my creditworthiness is to use Bayesian updating. Let's define creditworthiness mathematically as the inverse of the probability of default. We start with a prior probability of how likely people we know nothing about are to default. Then, when we extend credit to someone, we update that probability for them based on whether they paid or not. The more of the time they pay, the more confident we are that they will pay if we lend money to them again. If the probability of default is above some threshold, we ask them to pay extra (a risk premium) so that, in a pool of risky people, the extra money we get from the ones who pay makes up for the money we lose from the ones who don't. If the probability of default is above some higher threshold we refuse to lend them money.
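The Bayesian updating described above can be sketched as a Beta-Bernoulli model. This is a minimal illustration, not part of any real system; the prior parameters and decision thresholds here are made-up assumptions.

```python
# A sketch of Bayesian creditworthiness tracking with a Beta-Bernoulli model.
# The prior and the thresholds are illustrative assumptions.

class CreditScore:
    def __init__(self, prior_defaults=1.0, prior_repayments=9.0):
        # Beta prior: assume strangers default about 10% of the time
        self.defaults = prior_defaults
        self.repayments = prior_repayments

    def record(self, repaid):
        # Bayesian update: each observed outcome shifts the posterior
        if repaid:
            self.repayments += 1
        else:
            self.defaults += 1

    def p_default(self):
        # Posterior mean of the default probability
        return self.defaults / (self.defaults + self.repayments)

    def decision(self, premium_threshold=0.15, refuse_threshold=0.4):
        p = self.p_default()
        if p >= refuse_threshold:
            return "refuse"
        if p >= premium_threshold:
            return "charge risk premium"
        return "lend"
```

The more repayments we observe, the further the posterior moves below the thresholds; a run of defaults pushes it toward refusal.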
What if the person I'm paying doesn't know me?
If the person I'm paying doesn't know anything about me, then they don't know whether I'm going to give them the money or not. However, if they know someone who knows me, that person can "vouch for" me by passing along their estimate of my probability of default. The person I'm paying can get probability estimates about me from everyone they know who has them, take an average of those estimates weighted by how reliable the person that gave the estimate is, and use that as the probability I will default. And if someone they know doesn't know me, they can ask the people they know about me in turn. This will get information about me if it exists in the network even if it's far away. This is similar to the web of trust concept in cryptography, and to this paper in p2p file-sharing networks.
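The reliability-weighted averaging of vouches can be written down in a few lines. This is just a sketch of the arithmetic; how reliability scores themselves get computed is left open, as in the text.

```python
# A sketch of combining vouches: each acquaintance reports an estimated
# default probability, weighted by how reliable we think they are.

def combined_default_estimate(vouches):
    """vouches: list of (estimated_p_default, reliability) pairs.
    Returns the reliability-weighted average, or None if no one knows
    anything (in which case we'd fall back to the prior)."""
    total_weight = sum(reliability for _, reliability in vouches)
    if total_weight == 0:
        return None
    return sum(p * r for p, r in vouches) / total_weight
```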
This system has a bootstrapping problem: what do you do if no one in the network knows you? I don't have a good general solution to this, but since anyone can generate probabilities of default any way they want, you could have participants in the network that acted like rating agencies. If you want to join the network you can pay them, and they'll use outside information to provide a probability of default. Since you're paying them they have an incentive to give you a good rating, but that incentive is balanced by the reliability rating the rest of the network places on them. If a rating agency consistently gives good ratings to people who end up defaulting the rest of the network won't count their ratings for very much.
What if the person I'm paying is too far away to deliver cash to them?
It's likely that if I don't know someone, someone I know knows someone who knows someone who knows them. By "knows" I mean both knows and is physically close enough to hand-deliver cash to. Another way of saying this is that it's likely that there's a path in the network between me and the person I'm transacting with. If that's the case, then all we have to do is find one of those paths, have each party along the path pass the debt on to the next step in the path, and pay each of them something to make it worth the default risk. This works as follows:
- I ask everyone I know to transfer a certain amount of money to a certain recipient
- If any of them knows the recipient, they send back a bid for what fee they want to charge to make the transfer.
- The ones that don't know the recipient send the same transfer request to everyone they know, and send back a bid which is the lowest bid they got plus some premium for themselves. Because steps 2 and 3 are recursive they amount to doing a search across the whole network.
- I get back bids from the people I know and pick the lowest one.
- The information that I accepted that bid travels along the path in the network it represents and reaches the person I want to pay.
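The recursive bid search above can be sketched as a depth-first walk over the social graph. In reality each step would be a message between peers; here the whole graph and everyone's premiums are visible to one function, purely for illustration.

```python
# A toy sketch of the bid protocol: find the cheapest chain of
# intermediaries from sender to recipient. Names and premiums are made up,
# and the graph is centralized here only for illustration.

def best_bid(network, premiums, sender, recipient, visited=None):
    """Return (total_fee, path_of_intermediaries) or None if no path."""
    visited = (visited or set()) | {sender}
    bids = []
    for friend in network[sender]:
        if friend in visited or friend == recipient:
            continue
        if recipient in network[friend]:
            # This friend knows the recipient: they bid their own premium
            bids.append((premiums[friend], [friend]))
        else:
            # Otherwise they forward the request and add their premium on top
            sub = best_bid(network, premiums, friend, recipient, visited)
            if sub is not None:
                fee, path = sub
                bids.append((fee + premiums[friend], [friend] + path))
    return min(bids) if bids else None
```

The `visited` set keeps the recursion from looping; in a real peer-to-peer protocol each node would track which requests it has already seen.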
What about systemic risk?
Remember the 2008 financial crisis? That was partly caused by banks borrowing money with the intention of paying the loans back with money that they were owed. Enough banks were doing this that when a few big banks defaulted, the banks they owed money to couldn't pay their debts, and this created a chain reaction of defaults. The standard solution to this in a traditional banking system is for a state regulator to require that banks hold a certain percentage of the money they lend out as cash reserves, so even if a large percentage of the loans they made go bad they'll still be solvent. Doing this requires being able to verify the amount of money that people have. I don't know how you'd do that in the sort of system I'm proposing, especially because the system doesn't say how settlements should happen; anything that's acceptable to both parties is acceptable to the system. So I don't have a good solution to this. Suggestions welcome.
With the pieces I described above we have a sketch of a system that solves the problem posed at the beginning. This system is completely peer-to-peer; every participant in the system has the same status as every other participant. It works over large distances while still being based on physical currency. It doesn't require everyone in the network to have information about everyone else. It doesn't have a central ledger.
Apr 20, 2017
"Federation" is a word people use when they mean interoperability in the context of internet communication. It means that you can talk to people who aren't in the same network as you. If I use fastmail and you use gmail, I can still email you.
Most social networks aren't federated. If I'm mutuals with you on twitter that doesn't mean I can message you on facebook, and you don't see my instagram posts unless I explicitly cross-post them.
There's a pretty obvious incentive for companies that run these networks not to federate. Their primary asset is their users and the network effects that keep them from leaving. If the only social network you use is facebook and I want to read what you post online I need to make a facebook account and read your posts on facebook. Once I've done that facebook can show me ads, generating revenue for themselves, and show me other content to try to get me to spend more time on facebook. If I could follow your facebook account from twitter, facebook would have no way of making money off of me, and I would have little attachment to facebook.
Those same network effects also make it difficult for people to start new social networks. A user gets nothing out of a social network unless there are other people on that social network to talk to. If a new social network can't interoperate with existing social networks then it has to have a pool of users who want to talk to each other on day one. This is hard to do.
You can also try to piggy-back on top of one of the larger social networks by making it easy for your users to share content on those networks and having links back to you. But this is awkward because each of the larger networks knows that you're a potential threat to them, as they were to networks that came before them, and they also know that at the moment they're in a position of power over you. So as soon as it looks like you might be growing in a way that could cause a problem, they change the rules. The feed ranking algorithm changes. Your media embed or autoplay privileges are revoked. The way your feed items are displayed is changed so they're not as prominent. The larger network usually justifies these changes as combating spam and obnoxious content, and they partly are, but there's also a clear competitive incentive behind them.
The result of this difficulty is that there aren't that many social networks. If you were on the internet in the early-to-mid 2000's you probably remember phpbb and vbulletin forums. These forums typically had tiny numbers of users, maybe in the thousands, and there were a lot of them. Because the software that powered the forums was easy to modify each one was hacked in some way that made it a little different from the default installation. Custom moderation tools would get built in response to things that happened on the forum; tools that only make sense in the context of the people who use that particular forum. In-jokes, memes, and local flavor would get added to the software. Features would get built that catered to things people did on that forum which were different from what people did on other forums.
Contrast this with networks like facebook, twitter, or snapchat. A centralized design team decides what is or isn't worthy to get built for everyone who uses the network. This means that they have to cater to the average of all their millions of users. Of course everyone is different from that average in some way, so really that means that it's an equally awkward fit for everyone.
If there's a critical mass of users that use social networks that other networks can interoperate with, then the micro-localization (localization down to the level of a social clique, not just a country) of the old forum model becomes viable again. The problem of getting a critical mass of users goes away because your users can talk to the critical mass that already exists in the rest of the federation.
Various attempts at this have been made, with varying levels of failure. identi.ca was the first one that I know of (unless you want to count old-school things like Usenet). identi.ca evolved into gnusocial, which evolved into qvitter and postactiv. mastodon is a new social network that's interoperable with gnusocial and friends with a slicker, more dynamic interface. Diaspora is probably the server that's gotten the most press, though it's only interoperable with other Diaspora installations.
I'm interested in this now because mastodon is the first of these that enough people in my social circle started using for me to get any use out of it. It's still tiny; as of this writing there are fewer than 400k mastodon accounts registered on all public servers combined, and I don't know what the number of MAUs is but it's going to be significantly less than even that.
By the numbers I shouldn't care about this thing. It is growing exponentially at the moment, but that's only been happening for the past month or so, and it could run out of gas at any point (that's just how these things are). But there's something about being able to fix the house you live in that makes you want to do it, even if you're never going to make that investment back.
And I think micro-localized, custom-tailored social networks are something worth having. It's fun for tech-savvy people in developed countries to be able to customize their social media, but it's absolutely essential for the countries that are just now joining the internet. As I said in this post, taking facebook and just translating the strings isn't going to cut it when you're in a country that doesn't just speak a different language but has different cultural assumptions and norms. Nigeria, Vietnam, and Bangladesh need their own social networks just as China, Russia, and Japan have theirs. If history repeats itself those networks will be isolated islands cut off from the rest of the world.
If Nigerians prefer Nigerian VK over Facebook (and Americans prefer American social networks over Nigerian VK), and Nigerian VK allows federation, then we'll be in a new equilibrium where it makes more sense to federate than it does not to, because there are big pockets of users that you can't capture even if you're the biggest player.
Do I think that will happen? Not really, no. It didn't in China, Russia, or Japan, and history repeats itself more often than not. But the fact that it could happen is making me excited about the internet again in a way that I haven't been in five years.
Jan 15, 2017
From this twitter thread, which people seemed to like.
a lot of the way our technology works is because it was developed with department of defense research grants
the "technology industry" (read: internet) was never about technology, it's about developing new markets
Milton Friedman was a lot farther to the left than his popular image would suggest
there's a huge amount of value to be unlocked from getting people to stop buying new products and to buy used instead
telepresence is a psychology problem, not a tech problem. we don't know what a representation of another person has to have to be seamless
nobody's done anything really new with a computer since around 2010
globalization has pulled hundreds of millions of people out of poverty, and it can pull a billion more out
if there's going to be a "next big thing" in terms of hardware it's going to be Raman spectrometers integrated into phone cameras
the single best thing the government could do for the economy is making it low-risk for low-income people to start businesses
since you can get a consistent investment return higher than inflation wealth will consolidate unless the government intervenes
there's more funding for cancer research than there are good opportunities to spend that funding so spending more money doesn't help much
the current arrangement of american firms designing products and chinese firms making them only has a few years left
the only sector that the "sweatshop" image still applies to is textiles; electronics has been very automated for years
what culture you grew up in, what language you speak, and how much money your parents have matter more for diversity than race or gender
most online ad spending that isn't going to google or facebook is advertisers experimenting with new things and is temporary
sneakernet is seriously underrated
moving lifestyle good consumption from physical to digital goods would be hugely beneficial to the environment
ground robots with locomotion systems that can go up and down stairs make more sense for urban environments than drones
alphago is just a souped up version of td-gammon from 1992
most people don't want immersive media experiences; they want to be able to have IRL conversations at the same time
augmented reality should go back to its airforce roots and try to provide "situational awareness" tools (letting you know what's behind you)
q-learning is a lot more useful for ad targeting than it is for robots or playing video games (which is why google/facebook are funding it)
SQL is great. I love SQL.
nuclear is safer than coal even if you ignore climate change
driving a car is borderline immoral
regulatory capture is a big enough problem that liability is a more effective way of imposing regulation than professional regulators
all of the "machine learning"/"algorithms" that it's sensical to talk about being biased are rebranded actuarial science
despite the fact that on paper the web (as a platform) did everything wrong it's faster to build things in than any other major platform
making robots look/act like humans is silly. if you're making a robot it's because you don't want a person, you want a thing that does a job
capital gains from appreciation on the value of land (not improvements built on land) should be taxed at 100%
inheritance should also be taxed at 100% (though this would be hard to do because people would just put their money in trusts)
compound interest is extremely powerful. debt is someone else's compound interest that you're paying for.
nobody would borrow money on a credit card if they actually understood how much it cost
internet connectivity is probably never going to be cheaper than it is now
there are very few applications where computing power is a constraint, so moore's law doesn't matter much
tech company obsession with "culture" is cultish and creepy
someone needs to rethink textile manufacturing from the ground up so that it's automatable
Jan 12, 2017
I got into a discussion about epistemology on twitter (this is a lot more fun than it sounds), and that made me want to write down my epistemological framework. So here it is, in five bullet points:
- There is a black box ultimate reality, but we have no way to directly observe it
- We have indirect observations that we get through our senses, augmented with tools we make
- Since the only tools/senses we have to observe reality are contaminated with humanness, and the thinking tools we might try to use to minimize the contamination are also contaminated with humanness, we can't say anything about ultimate reality
- The job of science is to make models that take observations of the present and the past and predict observations they weren't given
- There are two quality metrics for models: accuracy and usefulness. Accuracy is the inverse of the difference between predictions and observations (eg: RMSE). Usefulness is the number of times someone asks the model for a prediction for something other than to test the model.
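The accuracy metric in the last bullet can be made concrete. This is just a sketch of RMSE between a model's predictions and held-out observations; lower values mean higher accuracy.

```python
# Accuracy in the sense above, sketched as RMSE between predictions and
# observations (lower RMSE = more accurate model).

import math

def rmse(predictions, observations):
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predictions, observations))
        / len(observations)
    )
```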
This is influenced by American Pragmatism (Dewey, William James, Popper, Rorty) and statistics (especially George E P Box).
Aug 23, 2016
I've seen a lot of attempts to explain neural networks to general audiences lately, and I've been disappointed with the amount of mysticism in them. People talk about "brains" and "artificial intelligence" as though deep learning were summoning some dark force from the netherworld to rise up into our plane of existence and take all our jobs.
But neural nets aren't magic.
Neural networks, like all machine learning techniques, are a way to write computer programs where you don't know exactly all the rules. Normally when you program a computer you have to write down exactly what you want it to do in every contingency. "Generate a random number and put it in a memory cell called random_number. Print 'Guess a number' on the screen. Make a memory cell called user_input. Read the characters the user typed until they press enter, convert that to a number, and save it in user_input. If user_input equals random_number, print 'Good job', otherwise print 'Guess again'."
Everything has to be spelled out precisely, and no background knowledge can be assumed. That's fine if you (the programmer) know exactly what you want the computer to do. If you're writing a piece of banking software you know exactly what the rules for tabulating balances are so programming is just a matter of writing them down. If you're doing something like route planning you might have to think a little harder about what the rules should be, but a lot of smart people have spent a long time thinking up rules that you can look up on wikipedia. All is well.
But what if you want the computer to do something where nobody knows the rules, and the best minds have failed in their attempts to come up with them? Machine translation is a great example. As much as you might admire Chomsky there still isn't a complete system of rules to parse and translate between languages, despite decades of research.
So, being a pragmatic engineer with computing power to burn, what do you do? Well, while you might not know what the rules to translate text are, you do know how to write a computer program that can tell whether another computer program is able to translate text. You write some software that goes out on the web and slurps down millions of web pages that have been translated into multiple languages, and pulls out all the text. Then, you write a program that tests a translation program by giving it text it saw in one language, and comparing the translation program's output to the translation of that text from the webpage. The more of the words that are the same between the program output and the human translation, the better the program is doing. (This is a massive oversimplification and sentence alignment is really tricky but let's ignore that for the purposes of this post.) In the jargon this test program is called a loss function.
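The word-overlap scoring described above (itself an oversimplification, as noted; real systems use metrics like BLEU) can be sketched as a loss function: the fewer words the candidate shares with the human translation, the higher the loss.

```python
# A sketch of the word-overlap loss described above: a crude stand-in for
# real translation metrics like BLEU. Lower loss = better translation.

def translation_loss(candidate, reference):
    cand_words = candidate.lower().split()
    ref_words = reference.lower().split()
    if not ref_words:
        return 1.0
    matched = sum(1 for w in cand_words if w in set(ref_words))
    return 1.0 - matched / max(len(cand_words), len(ref_words))
```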
There's another thing that works to your advantage. While computers might be simplistic, they're blindingly fast. And what's a technique that gets you a solution if you're simple but fast? Trial and error! So you need two pieces; you need a way for the computer to make tweaks to a broken program, and you need a way for it to keep the tweaks that made it work better and throw away the ones that made it work worse.
The algorithm for doing that tweaking and checking is called Stochastic Gradient Descent. You represent your program as a massive grid of millions of numbers (called a weight matrix). You can think of it as a giant grid of knobs arranged in layers, where each layer is wired into the next one. Input goes into the bottom layer, and signals flow through until they get to the top, which is where you see the output. The positions of the knobs determine how the input gets transformed into the output. If you have enough knobs (and you have millions of them) that grid could do some very complicated things.
You start by setting the knobs to random positions. You run the loss function on the knobs and get a score. You make a random tweak to some of the knobs, run it again and get another score. You then compare the new score to the old score. If the score is better (I'm saying "better" and not "higher" because lower scores are better and that's confusing), you tweak more in the same direction as last time. If the score is worse, you go in the opposite direction. In math terms you move iteratively along the gradient of the loss with respect to the weight matrix.
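The tweak-and-check loop above can be sketched with a single knob and a made-up loss. Real SGD computes the gradient analytically over millions of weights, but the shape of the loop is the same: try a change, keep it if the score improves, reverse if it doesn't.

```python
# A toy version of the tweak-and-check loop, with one "knob" and a made-up
# loss (lower is better). The loss function here is purely illustrative.

def loss(knob):
    return (knob - 3.0) ** 2  # best possible score at knob = 3

def tune(knob=0.0, step=0.1, rounds=1000):
    for _ in range(rounds):
        before = loss(knob)
        after = loss(knob + step)
        if after < before:
            knob += step          # the tweak helped: keep going this way
        else:
            step = -step * 0.5    # it hurt: reverse and take smaller steps
    return knob
```

Starting from zero, the knob walks toward 3 and then oscillates in ever-smaller steps around it, which is roughly what a learning-rate schedule does for you in real training.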
You have your computer (or computers, as the case may be) make many many many tweaks and if you're lucky you wind up with a program that does what you want.
That's really it. Neural networks are just computer programs that computers come up with by trial and error given a testing program written by a programmer. No magic here.
If you want a more in-depth discussion of neural nets that goes into the math I highly recommend this lecture.
Jul 10, 2016
Bad Reasons to Start A Company
- You want people to like you / want to be in charge of people / think you're a "leader"
- You think you're smarter than everyone else
- You think you have some technical "secret sauce" that no one else has and that this has intrinsic value
- You think your PhD thesis is a product
- Mummy and Daddy are giving you a seed round because they want to get you out of the house
- You live in LA or New York and you're jealous of San Francisco
- You heard that $fad (Big Data/IoT/Adtech/Fintech/whatever) was big
- *handwave* machine learning
- You want to write code and not talk to people
- You want to change the world
Good Reasons to Start A Company
- You know someone with purchasing power at a big company or government who has a problem you can solve and who is willing to pay you real money to solve it. Building the product is essentially doing consulting for them.
- You know someone on the CorpDev team at a big company who says something like "we'd really like to buy a company like X but there just aren't any we can buy"
- You see a market niche that either nobody else sees or everybody else undervalues (eg: because you have experience that's unusual among the sort of people that start companies), and you can build something yourself that fills this niche. Note: don't look for a technical cofounder, become one. Even if you only learn a little of the technical stuff it will make finding people who know more than you much easier.
- You made something as a hobby project that a lot of people are using, and at least some of these people have money
- Someone (either a VC or the government) is willing to give you money to screw around, and you have nothing to lose. That someone should not be family, because you might not be on speaking terms with them in a few years.
- You started doing consulting, and then you started hiring people because you were getting more work than you could do yourself, and one day you realized that you were running a company.
Feb 15, 2016
The internet is now a mature industry in developed countries. By that I mean that the number of people on the internet, the amount of time spent on the internet, and the amount of money spent on and through the internet, is sigmoiding out. We're moving from the pioneering days of Alexander Graham Bell to the Bell Telephone Company. That means that the frontier spirit that made a lot of people, including myself, interested in the internet, no longer fits the needs of the industry.
But that doesn't mean there isn't still unexplored territory that software can reach. Here is a list of things that I think will positively transform the way most people live and work. Not all of these are entirely software, but all of them have a significant software or data modeling component.
Making the internet work for the whole world
This problem has two pieces: connectivity, and localization.
There are some groups with very deep pockets trying to solve the connectivity piece, but their interests are misaligned with the interests of the people they're serving. Both Google and Facebook make their money from controlling near-monopolies, and we're already seeing them try to cement these monopolies early in new markets. However, there are long-established technologies, deployed for 14 years now, that solve this problem without central coordination (and without central control). These could be built into the low-cost android tablets being made by small producers in China.
The second piece is localization. The only way that this has been shown to work reliably is for the software to be made in-country. Yandex displaces Google in Russia, Qzone and Weibo displace Facebook in China, etc. This is something that will happen organically if the people in the countries have the tools to make their own tools, and the distribution channels to give them to others. App stores pose an obvious barrier here because the biggest app stores are run by American companies, and the scenario I'm describing completely cuts them out of these markets. But the hardware makers have no interest in ceding control over what people can run to Google, so I hope that they'll make their own app stores.
In terms of stuff to build: make a variant of android specifically designed for people who don't have reliable connectivity, who don't have reliable or cheap electricity (eg: they have to walk up the road to the guy with the car-battery-recharging business to get power), and who don't speak English. That means real localization (not just translating strings), mesh networking, applications designed to work with unreliable connectivity (maybe with CRDTs), local app stores.
There's also the problem of how to incentivize people to connect their local mesh to the broader internet (especially in places where that connection is very expensive). That might be something that happens by people self-organizing and pooling money, but it would be nice if there was a mechanism that made that happen automatically.
Self-driving cars (and, more importantly, trucks)
The benefits of being able to transport cargo over roads (paved or otherwise) without requiring human labor are obvious, and there are loads of people working on making that happen. This is important but I don't have anything to say about it that you haven't already heard.
Seamless machine translation
People have been trying to do machine translation for as long as computers have existed, and Google Translate and Fanyi kind-of work, so why is this interesting now? Recurrent neural networks (LSTM and GRU) just got cheap to train, and just started getting used for sequence-to-sequence learning. This allows building much richer language models than the ngram models people were using before; models which capture linguistic structure and not just surface tokens.
And a nice thing about neural nets is that you can take nets trained on different datasets for different tasks, stack them on top of each other, and train that new net on a dataset that represents the two simpler tasks composed. A great illustration of this is sticking a neural net trained to identify objects in images into a neural net trained to compress text and training it to generate captions for images. You can do the same trick with a net trained to do speech recognition, a net trained to do translation, and a net trained to do text-to-speech to get an automated version of the headphones they have at the UN.
I think improved machine translation is going to be the single best result of the recent progress in deep neural networks (more than computer vision, which is where the most action has been so far).
Making virtual organizations (or at least telecommuting) the default
One of the great dreams when the internet was first starting to become widespread in the 1990s was that everyone doing knowledge work would be working from home, saving on transport costs and pollution and giving people in rural Idaho or India the same opportunities as people in San Francisco.
Related to that was the following argument:
- If information and coordination costs did not exist, then it would be more profitable for people to offer their services on an open market than to be an employee of a company
- But information and coordination costs do exist, and companies exist because they lower them for their employees by gathering information and using it to coordinate workers
- Communication technologies lower information and coordination costs
- Therefore, it will be more profitable to be a free agent in a computer-mediated market than to be an employee, and corporations will disappear
This hasn't happened.
I don't know whether telecommuting technology that actually worked would end the 20th-century-style corporation. Corporations certainly wouldn't completely disappear, and they almost certainly would be displaced some, but how much I don't know.
But that's hypothetical because we don't have telecommuting technology that actually works. We have text chat and email, but they miss nuance and social cues. We have videoconferencing, but that still misses nuance and when you're on a videoconference you're very much aware that you're on a videoconference. We have telepresence robots but they're very cumbersome to move around.
I think the best hope we have for solving this problem is figuring out what important information the communication media we have now are missing, and then trying to build media that capture it. This is as much a psychology problem as a technical one. Maybe we need to capture people's faces and joint movements in three dimensions and project them into the room (through AR or something else). Maybe we need 3D sound. I don't know what a solution would look like, but solving this will be hugely beneficial.
Automated synthetic chemistry
Modern NLP techniques should make it possible to extract chemical syntheses from academic papers as statements of the form: "mixing X with Y produces Z". Once you have a database of those statements, figuring out how to make something is just graph search. Once that tool exists, anyone can be a synthetic chemist without knowing what they're doing. People are working on this but it seems like a backwater compared to genomics.
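To make the graph-search claim concrete, here's a minimal sketch. The reaction database is entirely hypothetical (placeholder chemicals A through F), and I'm using simple forward chaining rather than anything clever; a real tool would also want costs, yields, and conditions on the edges.

```python
# Hypothetical reaction database extracted from papers: each entry
# means "mixing the inputs produces the output".
REACTIONS = [
    ({"A", "B"}, "C"),
    ({"C", "D"}, "E"),
    ({"B", "D"}, "F"),
]

def synthesis_plan(target, stock):
    """Forward-chaining search: repeatedly fire any reaction whose
    inputs are all on hand until the target is produced or nothing
    new can be made. Returns the list of reactions used, or None."""
    have = set(stock)
    plan = []
    progress = True
    while progress and target not in have:
        progress = False
        for inputs, output in REACTIONS:
            if output not in have and inputs <= have:
                have.add(output)
                plan.append((inputs, output))
                progress = True
    return plan if target in have else None

# Starting from A, B, and D, E is reachable via the intermediate C.
plan = synthesis_plan("E", ["A", "B", "D"])
```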
Feb 05, 2016
Shown here is a chart of the top ten thousand websites linked to on twitter. You can play with an interactive version of it here.
Before I get into the details of how I made the chart, here are some interesting things that fall out from it:
- Spambot networks are really obvious. It's pretty embarrassing for twitter that having such blatant spambot networks is viable
- Conservative twitter is much more densely connected than liberal twitter. That means that people who tweet links to places like The Drudge Report are less likely to tweet links to other sorts of sites than people who tweet links to places like The Daily Kos
- Liberal twitter is closer to mainstream twitter than conservative twitter is. This might just be a consequence of conservative twitter being more densely connected
- Japanese social media twitter (which I'm labelling as "2ch", though it's not just 2ch) is almost completely distinct from what I'm calling "upstanding japanese twitter" (links to mainstream news sites like news24)
- Hacker News is in the 2ch cluster. It's much closer to nicovideo.jp and pixiv.net than it is to techcrunch
- Quran quotes are very popular on twitter, but the quote sites are frequented by a relatively tight-knit corner of twitter
- archive.org (not including the wayback machine, which is web.archive.org) is in the cluster with the Quran quote sites
- What I'm calling "mommy blog twitter" (containing sites like bayareamommy.net and everythingmommyhood.com) is pretty isolated from the rest of twitter. People who tweet links to mommy blogs are less likely to tweet links to other things.
- Livejournal is (or was) a popular place to host spam blogs to link to from your twitter spam bots
Here's how to make the chart for those wanting to play along at home:
- Record the twitter sample stream over the course of about a year (in 2013 and 2014 in my case)
- For each tweet that contains a link, pull out the website that the link was pointing to and the user id of the tweet author
- For each pair of websites, count the number of people that tweeted at least one link to both of them
- Take that co-occurrence matrix, treating each row as a point in a high-dimensional space, and run it through t-SNE to generate a 2d embedding that minimizes the change in the distances between the points from high dimensions to two
- That embedding is what you're looking at
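The counting steps above can be sketched in a few lines. The (user, domain) pairs below are stand-ins for what you'd extract from the sample stream (the domain names are made up), and the resulting co-occurrence counts are what you'd feed to a t-SNE implementation:

```python
from collections import defaultdict
from itertools import combinations

# Toy stand-in for the recorded stream: (user_id, domain) pairs
# pulled from tweets containing links. Domains are hypothetical.
links = [
    ("u1", "drudge.example"), ("u1", "foxnews.example"),
    ("u2", "drudge.example"), ("u2", "foxnews.example"),
    ("u3", "dailykos.example"), ("u3", "nytimes.example"),
]

# Which domains each user has linked to at least once.
domains_by_user = defaultdict(set)
for user, domain in links:
    domains_by_user[user].add(domain)

# For each pair of domains, count the users who tweeted at least
# one link to both.
cooc = defaultdict(int)
for domains in domains_by_user.values():
    for a, b in combinations(sorted(domains), 2):
        cooc[(a, b)] += 1
```

Each row of the resulting matrix is then a point in a high-dimensional space, ready to hand to t-SNE for the 2d embedding.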
Feb 04, 2016
This is a map of the ten thousand most popular domains on twitter as of 2013. Each point is a domain. Points that are closer together are tweeted more often by the same people. Mouse over a point to see what domain it represents.
Jan 28, 2016
What would it take to fold the ad-hoc sharding that people do with postgres into postgres? Or, what would it take to make postgres scale like riak and cassandra?
- A shard routing server kind of like mongos. This could be implemented as a foreign data wrapper that holds a connection pool and routes queries to shards based on their qualifiers over the column being sharded on. It could get fancy by supporting scatter-gather aggregation and joins, but that probably isn't necessary because by the time you need to shard, doing aggregation and joins on the production database is already too dangerous (I assert). Updates to the hash ring happen over paxos (or zab or raft or whatever).
- A server responsible for anointing new users and assigning them to shards. This could happen at random or it could be aware of geographic locality and things.
- The multicast views described below
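For the routing piece, the core data structure is a consistent-hash ring mapping shard keys to shards. A minimal sketch (shard names are placeholders; in the real system updates to the ring would go through paxos/raft rather than local mutation):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance.
    Every router holding the same ring routes a given key to the
    same shard."""
    def __init__(self, shards, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{s}:{i}"), s)
            for s in shards for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def shard_for(self, key):
        # First virtual node clockwise from the key's hash.
        i = bisect.bisect(self.keys, self._hash(str(key))) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["pod-a", "pod-b", "pod-c"])
# ring.shard_for(user_id) is deterministic, so every router agrees.
```

Adding or removing a shard only moves the keys adjacent to its virtual nodes, which is what makes resharding tractable.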
Problem: You're tumblr. You have your data sharded on user id (I think they call these shards "pods"). When a request comes in it has a single user associated with it; let's call this user "the client". Each user has a feed of the k most recent posts by a set of users they choose to follow. You could, when the client asks for this feed, ask all the shards that are responsible for users that the client is subscribed to for the k most recent posts, and merge the posts together sorting by timestamp. But this is problematic because the distribution of subscribers per user is highly skewed (inverse exponential, even), so some shards are going to get a lot more requests than others, and fall over.
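The scatter-gather version is a one-liner merge, which is what makes it tempting. A sketch, assuming each shard replies with its posts sorted newest-first:

```python
import heapq
import itertools

def naive_feed(shard_replies, k):
    """Merge each shard's k most recent posts (each reply sorted
    newest-first by timestamp) and keep the newest k overall. This
    is the approach that falls over under skewed subscriber counts,
    since hot shards answer a merge query for every reader."""
    merged = heapq.merge(*shard_replies, key=lambda p: -p["ts"])
    return list(itertools.islice(merged, k))

shard_a = [{"ts": 9, "body": "a9"}, {"ts": 3, "body": "a3"}]
shard_b = [{"ts": 7, "body": "b7"}, {"ts": 5, "body": "b5"}]
feed = naive_feed([shard_a, shard_b], 2)
```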
So what you want to do is have the client's shard keep a materialized view of their feed, which is updated by pulling updates off some kind of message bus. Whenever a node receives a new post from one of its clients, it puts that post on the bus so all the other nodes can see it and possibly update some of their views. This bus could be a central message broker (eg: rabbitmq), or a gossip protocol. Let's say you can't afford the central choke-point of a message broker so you use gossip.
Sending every post to every shard results in a lot of network traffic, especially if the posts are big (tumblr doesn't have a character limit like twitter does). So you would like to send them only to the nodes that want them. The definition of a user's feed is something like
select * from posts where author_id in (select author_id from subscriptions where subscriber_id = :client_id) order by created_on desc limit 1000. The order by created_on part we get basically for free by virtue of nodes only broadcasting new posts. So we have to worry about two things: materializing the sets of users that all of this shard's clients are subscribed to, and sending new posts to as few nodes as possible that don't have the author in one of their subscription sets.
Here's the property we want: When a node gets a new post, it looks at its neighbors, can tell which neighbors (and the spanning trees rooted at them) don't care about it, and forwards it only to the ones that might. Ideally we'd like to send the update to at most log n nodes that don't care about it, where n is the number of nodes in the network.
Implementation: This is some kind of multicast. Assuming we can't use IP multicast (which a lot of networks don't support), we need to build an overlay. There has been some work on using Kademlia for this, and an implementation based on Pastry. So to maintain a materialized view:
- Turn query qualifiers into hashes
- Subscribe in the multicast system to the set of hashes for the qualifiers our materialized view uses
- On insert, translate the row into the hashes of its columns and broadcast it once for each of them
- When a node receives a message, it re-checks the qualifiers for each materialized view, and inserts the row into the view if they match
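The steps above can be sketched with an in-process stand-in for the overlay. Everything here is illustrative: the `Overlay` class fakes Pastry/Kademlia-style topic delivery, and the only qualifier supported is equality on `author_id`:

```python
import hashlib
from collections import defaultdict

def topic(column, value):
    """Hash a (column, value) qualifier into a multicast topic id."""
    return hashlib.sha1(f"{column}={value}".encode()).hexdigest()

class Overlay:
    """Stand-in for the multicast overlay: delivers a message to
    every node subscribed to its topic."""
    def __init__(self):
        self.subs = defaultdict(set)
    def subscribe(self, node, t):
        self.subs[t].add(node)
    def publish(self, t, msg):
        for node in self.subs[t]:
            node.receive(msg)

class Shard:
    def __init__(self, overlay):
        self.overlay = overlay
        self.view = []        # materialized feed view
        self.wanted = set()   # author_ids this shard's clients follow
    def follow(self, author_id):
        # Subscribe to the hash of the qualifier "author_id = X".
        self.wanted.add(author_id)
        self.overlay.subscribe(self, topic("author_id", author_id))
    def insert(self, row):
        # On insert, broadcast the row once per (column, value) hash.
        self.overlay.publish(topic("author_id", row["author_id"]), row)
    def receive(self, row):
        # Re-check the qualifier before materializing.
        if row["author_id"] in self.wanted:
            self.view.append(row)

overlay = Overlay()
a, b = Shard(overlay), Shard(overlay)
a.follow(7)                                   # a's clients follow author 7
b.insert({"author_id": 7, "body": "hello"})   # lands in a's view
b.insert({"author_id": 8, "body": "other"})   # no subscribers; goes nowhere
```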
I think that the vast majority of web applications could work under this regime, and many are already implementing this in an ad-hoc way.
Open problems:
- How to translate qualifiers into hashes without enumerating entire ranges of values (is this necessary? could we just enumerate all the possibilities?)
- This system doesn't say anything about order or convergence; it's assumed that these views are sets and that insertion is commutative, associative, and idempotent
- Deletion needs to work by inserting tombstones, which has a well-known set of (solvable) problems. Edit: We're only allowing the creator of a row to delete it, so no, we can actually just delete it
- A given row can only be updated by the shard that created it or we don't converge. If rows can only be updated by the shard that created them then we just need to annotate each row with its author shard and a version counter and we're good. I think this is fine.
- Do we need to support aggregations in the queries that generate these views? If yes, can we require the aggregation functions to be commutative, associative, and idempotent? If we do and we can't then we need synchronous replication and we're screwed.