The other day Twitter was down and I had no place to comment on Twitter being down. This got us talking about scaling at work. I was reminded of the recent slides posted from Instagram about their scaling journey. They are great slides. There is only one problem I have with them. They are essentially the same slides about scaling that you would have found in 2000.

I have to say, I like Instagram. My daughter has something like 1,000 followers on Instagram. And good for them for being bought by Facebook for a bajillion dollars. This is not a dig on them really. This is a dig on our industry. Why did Instagram have to learn the hard way how to scale their app? I want to point out some of their issues and discuss why it's silly they had to learn this the hard way.

Single machine somewhere in LA

Why would anyone, in this day and age, deploy an app to the app store when the backend is all on one server? I am not a big proponent of the cloud, but that has to be better than a single server in a rack somewhere. And L.A.? Go for Dallas or somewhere geographically neutral.

favicon.ico

So, this is one of the biggest mistakes I see in all of web application development. People use Apache or Nginx or whatever and have a mod_rewrite command that sends ALL requests into their application stack. They do this because they are lazy. They want to write whatever code they want later and just have the request picked up without any work. At dealnews, we don't do that. We have controllers. But, we specify in our Apache config what paths are routed to those controllers. The most general rule we have is:

    # Route only requests ending in /, .html, or .php to the controller
    RewriteCond %{REQUEST_URI} (\/|\.html|\.php)$
    RewriteRule ^.+$ /torpedo.php/$1 [L]

So, if the request ends with / or .html or .php, send it to the controller. This is the controller our proxy servers use. Any other requests like robots.txt, images, etc. are all served off disk by Apache. No sense having a PHP process handle that. It's crazy. This is not Instagram's fault. They likely followed examples that other developers before them put on the internet. Why are we doing this to people? Is it some rite of passage? I suspect that the problem is actually that 90% of the web does not require any scaling. Only 10% of the sites and services out there have to actually worry about load. So, this never comes up.
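For what it's worth, the same whitelist idea translates to Nginx, where the default behavior is already to serve files off disk and only the paths you match get handed to the application. A sketch, with an illustrative socket path and script name:

```nginx
# Only requests ending in /, .html, or .php reach the application;
# everything else (images, robots.txt, css) is served straight off disk.
location ~ (/|\.html|\.php)$ {
    include        fastcgi_params;
    fastcgi_pass   unix:/var/run/php-fpm.sock;  # illustrative socket path
    fastcgi_param  SCRIPT_FILENAME /var/www/torpedo.php;
}
```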

So, the next part of their slides basically glorifies the scaling process. This is another problem with people in our field. We thrive on the chaos. These are our glory days. Let's face it, most geeks never won the high school football championship. The days when we are faced with a huge scaling challenge are our glory days. I know I have had that problem. And I played sports as a youth. But, nothing is better than my 2006 war story about a Yahoo front page link. Man, I rocked that shit. But, you know what? The fact that I had to struggle through that means I did not do my job. We should have never been facing that issue. It should have just all worked. That is what it does now. It just works. We don't even know when we get a spike now. It just works. The last thing I want in my life now is an unexpected outage that I have to RAGE on to get the site up again. That leaves me feeling like a failure.

Then they realize they are out of their depth. They need to do things they never thought they would need to do. Why? Why did they not know they would need to do these things? Unexpected growth? Maybe. Why is this not common knowledge though? With all the talk of the cloud being awesome for scaling, why was this not a button on the AWS dashboard that said [SCALE] that you just push? That is what the cloud does, right?

In the end, Instagram learned, the hard way again, that you have to build your own architecture to solve your problem. I learned it the hard way. LiveJournal learned it the hard way. Facebook, Twitter, etc. etc. They have all learned the hard way that there is no single solution for massive scale. You have to be prepared to build the architecture that solves your problem in a way that you can manage. But, there are basic building blocks of all scaling that need to be in place. You should never, ever, ever start with an application on a single server that reads and writes directly to the database with no cache in place. Couch, Mongo, blah blah blah. Whatever. They all need cache in front of them to scale. Just build it in from the start.

Instagram was storing images. Why were they surprised when they ran out of room for the images? I just can't fathom that. It's images. They don't compress any further. You have to put them somewhere. This has to be an education issue. LiveJournal solved this in 2003 with MogileFS and Gearman. Why did they not build their architecture on top of that to start with? Poor education, that is why.
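The core idea behind a MogileFS-style setup is simple enough to sketch: deterministically spread files across storage hosts so that capacity scales by adding machines instead of filling one disk. A toy version in Python (the host names are made up; a real MogileFS cluster tracks hosts and replicas in its tracker daemons rather than a hard-coded list):

```python
import hashlib

# Hypothetical storage hosts; a real cluster would track these
# dynamically and keep multiple replicas per file.
STORAGE_HOSTS = ["store1.example.com", "store2.example.com", "store3.example.com"]

def host_for(image_key):
    # Hash the key so the same image always maps to the same host,
    # and keys spread roughly evenly across the fleet.
    digest = hashlib.md5(image_key.encode()).hexdigest()
    return STORAGE_HOSTS[int(digest, 16) % len(STORAGE_HOSTS)]
```

When a disk fills up, you add a host to the pool instead of being surprised that images take space.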

One thing they bring up that has bugged me for a while is monitoring. There is no good solution for this. Everyone ends up rolling their own solution using a few different tools. The tools are often the same, but the metrics, and how they are reported and monitored, are all different. I think there is a clear need in the industry for some standards in this area.

if you’re tempted to reinvent the wheel ... don't

I did find this slide funny. Reading these slides is, for me, like watching them reinvent the entire wheel that is scaling. This has all been done before. Why are they having to learn it the hard way?

don’t over-optimize or expect to know ahead of time how site will scale

I take exception to this slide, however. You should have some idea how to scale your app before you deploy it. It is irresponsible and cowboy to deploy and think "oh, we will fix it later". That is true no matter what you are doing. Don't give me the lines about being a start-up and all that. It is just irresponsible to deploy something you know won't hold up. For what it is worth, I think these guys really had no clue. And that is because we, as an industry, did not make it known to them what they were up against.

How do we fix this?