Taking Your Eyes Off The Road

Source: http://www.flickr.com/photos/viernest/3380560365/

About a year ago, I had a wreck. I totaled my car. I took my eyes off the road to look at my son in the back seat. I put two of my children in danger. Luckily, everything turned out alright.

This week, I have attended the Velocity Conference. It’s not my first time. I have attended all of them but the one last year. Velocity is all about Web Performance and Operations. I attended mostly web and mobile performance tracks. I was quickly reminded (like, first day, first session) of many things I have been wanting to implement to help me know how DealNews.com is doing performance wise. So, like I often do at conferences, I started hacking. This was Tuesday.

By Wednesday morning, I had some stats. Those stats led to more questions. I refactored some of the stats I was collecting. By dinner, I had good data about our page performance. I was pissed.... at myself. As I said before, I didn’t attend Velocity in 2012. In 2012, I attended other things not related to web performance. In doing so, I took my eye off the road. Or in this case, off the performance of DealNews.com.

Now, we still get an A from WebPageTest for first byte. We don’t get any bad scores really. We aren’t doing poorly. The site performance is just nowhere near where I want it to be. And it is nowhere near where I have been telling people it was. We deliver the first byte in around 500ms for a request that can use cache well. We draw the above-the-fold content in about 1.5 seconds. I have seen way worse sites out there. But, sometimes, it’s about yardsticks.

Source: http://www.flickr.com/photos/billhd/3048457153/

You see, if you are measuring using a broken, worn-out yardstick, it may not be an actual yard. You need to measure using the latest, greatest, laser-cut yardstick. So, when I compare DealNews performance with others, I look to the best of the best.

, ShopZilla, and others have openly talked about performance and business success being directly correlated. If that is true for DealNews, there is low hanging fruit to improve our business. And apparently that fruit is rotting, it has been hanging so long.

I have already found 480ms in the header I can trim down. I am not sure yet how much I can reduce it, but it can be faster. I am hoping I can get it down to 100ms. That would be a huge savings as our header currently finishes in about 980ms on average. That would be cutting more than 25% of our header load time completely out. And that is just the first thing I have found.

I saw other good talks that will help me get back on track as well. One talked about premature optimization. Before I put in the new metrics, I had a theory on what was taking up that time. I was wrong. Not totally wrong. That thing is still taking 150ms, so it is next on the list. But, the other issue is clearly more problematic to me since I assumed it was a non-issue and it caught me by surprise.

If you are asking “Brian, how are you doing this?” I am glad you asked. I am using the window.performance.timing object available in new browsers. After the onload event fires, a script gathers up this data and sends it back to our servers in an XHR request. Server side code then takes that data, does a little math where needed and sends it all through StatsD which in turn shoves it in Graphite. That lets me build graphs and get the data as JSON. That second part is key as I will want to put some automated monitoring on this data to keep an eye on when it may go bad again. There were a lot of talks this week about detecting faults or detecting anomalies as well. So, I will put that to good use with the help of a coworker who loves the hard math problems. If you don't have those things in your stack already, SOASTA mPulse appears to be a good option. I was impressed with Philip Tellis from SOASTA in his talk about JavaScript load blocking. Since the mPulse code runs in a JavaScript tag, I was happy to hear he was so concerned with how it affected their users' performance.
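The "little math" step on the server can be sketched roughly like this. This is not our production code (that is PHP), just a minimal Python sketch. The field names come from the Navigation Timing API, the `name:value|ms` line is the standard StatsD timer format, and the metric names, the `web.perf` prefix, the two minute sanity cap and the host and port are all assumptions made up for the example.

```python
import socket

# Hypothetical sanity cap: ignore timings longer than two minutes.
MAX_MS = 120000


def timing_to_metrics(timing):
    """Turn raw window.performance.timing timestamps (epoch ms) into durations."""
    start = timing["navigationStart"]
    metrics = {
        "first_byte": timing["responseStart"] - start,
        "dom_ready": timing["domContentLoadedEventEnd"] - start,
        "onload": timing["loadEventStart"] - start,
    }
    # A field the browser never set is reported as 0, which yields a negative
    # duration here. Drop those along with absurdly large values.
    return {name: ms for name, ms in metrics.items() if 0 <= ms < MAX_MS}


def statsd_lines(metrics, prefix):
    """Format each duration as a StatsD timer line: <name>:<value>|ms"""
    return ["%s.%s:%d|ms" % (prefix, name, ms) for name, ms in metrics.items()]


def send_to_statsd(lines, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send. Host and port are assumptions for this sketch."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
```

The XHR handler would decode the posted JSON, call `timing_to_metrics()` and pass the result of `statsd_lines()` to `send_to_statsd()`. Keeping the math in a plain function makes it easy to test without a StatsD daemon running.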

I will post anything I think is useful to the general public. Right now, it looks like code and feature bloat. That is not all that interesting.

Being a Better Manager: Communication

As I have worked on being a better manager, I have been trying to determine what our strengths and weaknesses are as a development team. Communication is one place I think we can do a better job.

The problem we have had was how other departments communicated with the development team. The company had been small and now it is large. When the company was small, anyone could talk to anyone. If Bob knows that Tom works on his revenue report, he could just ask him about it. The problem comes when there are 10 Bobs and 5 Toms and people are switching roles in the development team and in other teams. Bob is working on something else and is still coming to Tom with his problems. Tom has no idea how to help him. But, he is a good employee so he tries. He gets distracted from his project and probably does not help Bob all that much in the end. No one is at fault here really. Bob just wants to get his job done. Tom just wants to help a coworker.

When teams start to grow, communication needs to be directed. It’s not some hard and fast rule and people are not punished for talking outside of the chain. However, it helps that when Tom in marketing has a problem, he knows who to talk to every time. By default that is me. It is then my job to know what development resource to tap to solve it. On the other side, the developers have to feel comfortable with telling people “I don’t know. Will you file a ticket or talk to Brian about that?” People are generally helpful. They want to help. I have tried to let people know it is OK to not help if they can’t help or perhaps they are worried about the scope of the problem. I have found that some people click and natural connections will happen. I have no problem with that. There are some people that I expect to reach out to the person they are ultimately writing code for to get feedback. I also stress to them that if the scope of what the person needs changes, we need to talk about it.

From my position, I have to be ready to listen. I want the communication going through me. If I am a jerk, don’t reply to email, or tell people no all day long, that will probably not help me achieve my goals or help the people that need development resources. I have learned to keep a more open mind. I have learned to not say no, but instead say “I don’t think this is a good idea because of these reasons.” And if I think I can modify the idea to something that is workable, I will offer that as an alternative. I feel like this is working well for me and I have been told by others that it is well received.

One place I cannot seem to get right is quarterly managers’ meetings. Everyone takes a turn talking about what is going on in their department. When marketing talks, everyone is really interested. Same with financial and sales. When it comes time for me to talk, everyone seems to glaze over. I firmly believe it is the content that is the problem. I am going to try a new tactic as recommended by our CEO. Rather than talk about what we did, talk about why we did it. For example, instead of saying we made a change to our code to do X, I should talk about the business reasons we spent time on making that change. Instead of “We rolled a new release of the app.” I tell them “Our new app is catching up with the features of the web site.” I am still working on it. Even that still sounds boring to me. The only good news is that technical operations follows me. It is really hard to make the fact that we didn’t go down and that we installed new servers sound interesting to non-geeks.

Test Driven Development Conversion

The concept of unit testing is not new. However, it has only gained popularity in the last 10 years or so. When I was asked about it for the first time, I was unsure where it would fit into my development life. I dismissed it for a long time. I have begun to come around, however. I wanted to share how and why.

I have been a professional software developer since 1996. Neither in school nor in practice in the early days of my career was I exposed to the concept of writing tests to confirm portions of my code worked as expected. Testing was always done via user testing. You or someone else would use the program to ensure it produced the expected result. That is how I learned to do my job. And I am good at my job.

My initial reaction to unit testing was negative. I know professional developers that sit in cubicles all day long and write code without any knowledge of the overall application. They are asked to write a function that takes a certain input and returns a certain output. They neither have nor desire additional knowledge about the use of their code. I find those people to be very unattractive as coworkers. They are often out of touch with technology. When I told someone in this line of work I used a Mac, he asked "Do they still make software for those?" He was dead serious. This was in 2005. My initial feelings about unit testing reminded me of these people. Just sitting in cubes, all day, writing functions that took X and returned Y. They required no knowledge about where X came from or where Y was going. Furthermore, I felt like people who need that kind of structure in place to do their jobs must not be capable of understanding the big picture. If they did understand the big picture, they would not need to have these things. They should somehow KNOW how the application worked.

The language I use most (and have for the last 17 years) is PHP. PHP development has always had a haphazard culture around it. I don’t see that as a bad thing. Lots of people all doing things their way. That is often a criticism. People in other language communities where things are a bit more dictated and regimented find it disturbing. I find it refreshing. It’s more real world than the rest to me. But, that is a different blog post.

PHP didn’t come with any sort of testing framework out of the box. And none of the core developers put out a way to test your PHP code. The first tests of any kind that I was aware of for PHP were those built by the PHP QA team. It was targeted at testing the PHP engine by writing PHP code that used the engine. That made sense to me since there are so many moving parts in there and you have lots of people all making changes. It makes sense to have some sanity checks.

The first time I heard about testing your own PHP code was PHPUnit. My first impression was that it was an imitation of JUnit which is the de facto standard way to test Java. I don’t really care for Java. I don’t care for the language. I don’t care for the culture. I don’t care for how a java daemon behaves on my server. In general, I don’t care for the platform. It is my personal preference. It is not right or wrong. It is just how I feel. So, when I see something being copied from Java into PHP, I retreat. That probably makes me a bad person. It is a personality flaw. The end result is that I ignored PHPUnit no matter who said they were using it or why. And the only thing I ever heard about when people talked about testing PHP applications was using PHPUnit.

So, how does someone like me start advocating test driven development as a useful tool? It snuck up on me. A few things had to happen.

The first thing that happened was that we started growing our team. In 2006, we were back down to a two person development team. We had been as large as 5 during the dotcom boom. But, we slowly dwindled back down to two people doing development and systems administration every day as their full time jobs. When there are only two of you, you kind of have to know how everything works. If you don’t, you can’t keep running. As we started to add developers, I realized it was increasingly harder for other people to grok the whole application the way I did. And at first, I thought that was just a matter of time. After a year or so they would get it all. So, we hired more people. We are now at six developers, plus myself and one new one starting in two weeks. On top of that we are actively looking to hire two more as soon as possible. Along the way some others have come and gone as well. Now, to people in the VC bubbles, this may sound boring. I know places where the team grows 20x in 6 months. I can’t imagine. Our application and code base is huge though. Experienced engineers that are now on the team have told me they have never worked with a code base this large.

In the last couple of rounds of hiring, I was asked about testing. My standard answer was that if you wanted to write tests for your code and could develop a sane strategy for deploying those tests, I had no problem with it. As a team however, there was no official policy or systems in place for testing. One of the new hires pointed me to something that piqued my interest. It was doctest for Python. Essentially, you put a bit of Python code in the doc block for the code you are writing. doctest will scan files for these tests and run them. Partly out of curiosity and partly to hopefully satisfy the guys wanting some way to test their code, I wrote a version of doctest for PHP. To prove it worked, I added a test block to all the methods of an array manipulation class I had worked on recently. I was still not convinced that this was a good use of my time however.
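For anyone who has not seen it, this is roughly what Python's doctest looks like. The `pluck` function is a made-up stand-in for the kind of array manipulation helper I mentioned, not code from our class. The `>>>` lines in the doc block are the tests and the lines right after them are the expected output.

```python
import doctest


def pluck(rows, key):
    """Collect one field from each row of a list of dicts.

    >>> pluck([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], "id")
    [1, 2]
    >>> pluck([], "id")
    []
    """
    return [row[key] for row in rows]


if __name__ == "__main__":
    # testmod() scans this module's docstrings and runs every >>> example,
    # reporting any whose actual output differs from the expected output.
    print(doctest.testmod())
```

The appeal is that the test lives right next to the code it covers, so there is no separate test file to keep in sync.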

The second thing that happened was I met up with Chris Hartjes (@grmpyprogrammer). Chris is an avid supporter of test driven development. I had heard the term test driven development before meeting Chris, but my testing bias had shielded me from really understanding the concept. He and I talked and later I was interviewed by Chris and Ed Finkler (@funkatron) on the /dev/hell podcast. On that podcast we talked about testing as well as other things. Looking back on those two conversations, I now know what Chris did that most changed how I looked at people who are adamant believers in dynamic unit testing. Chris didn’t make it a personal thing. He acknowledged my perspective. We do have a code base and business that has been working for 10+ years. We do have extensive monitoring and graphing in place that tells us when things are going wrong. He told me that if that was working for me, more power to me. Wait, wasn’t he supposed to tell me how I was doing it wrong? Wasn’t he supposed to tell me I was a cowboy and how unprofessional I am? He did none of those things. So, you know what happened? I kept listening to him.

The third thing that happened was also because of Chris (just so you know now, most of this blog post is indirect or direct praising of Chris). Once I started listening, there was something else I was not hearing from him. I was not hearing how I should be using PHPUnit or any particular product for that matter. He only said that code should be written to satisfy a test. He didn’t seem to care how it was tested. He only cared that it was tested. This was again a breath of fresh air. In this business where every group likes to tell the other group how wrong they are doing things, having someone that didn’t seem to care how you did it, as long as you did it, was awesome. So, again, I kept listening to him.

The fourth thing that happened was when Chris made me realize that I already was a test driven developer. I don't think he realized that he did it. He likely does not remember the conversation. And I probably remember it poorly. For all I know, there were beers and/or whiskey involved. It went something like “So, how do you test your code?” said Chris. I replied “Well, if the web page loads, it worked.” To which he asked “So, you don’t test changes as you make them?” I said “Well, yeah, I write some test code and run it till it works while I am making changes. I just don’t save that code. Once it’s working I don’t need it.” I don’t think it hit me right then. I think it took days or weeks to sink in. Mother F--ker, I already write code in a test driven manner a lot of the time. I have a test.php script in my home directory on my development server. I reuse that thing all the time to test the code I am working on. Instead of assertions, lots of the time I will just print_r() the output and assert with my eyes. But, the end result is the same. So, why not take that extra step? I had the doctest stuff in place. I could just add my test code there and run the test over and over until it worked like I wanted it to. And what do you know, I kind of liked it.

And what I think was the last straw was when I was merging code onto staging and a test failed. Oh my gosh! I made a change to our code base and my all knowing, all seeing eye did not realize that my changes were going to break something. How is this possible? How did I not see this coming? I know everything in the application don’t I? I architected the whole thing. The reality is, this likely (and by likely I mean, it did) happened before. However, sometime very soon, either a monitor or a human would have noticed and a bug report would have been filed. I would have then fixed it. I would have likely justified the bug as the cost of progress. Now I have a way to help prevent these small bugs from rolling out in the first place. And now that I have this tool, there is no excuse. The only reason this should happen now is that we don’t have enough tests.

We are working on improving our test coverage. I am not to the point yet where I require tests for all code. Perhaps we will get to that point. I don’t know. My team is aware of the tool now. And they are aware that I use it and think it is wise to use it. I told them recently that I am not telling them to use it. But, if they roll a bug, and there was not a test, the fix should probably be written using test driven development.

I have intentionally not talked about what we use other than the aforementioned doctest. I hope the message that you take away from what I have learned is that you should be using SOMETHING to test code. That is more important than how you test your code. I also have not talked about how you write code that can be tested. That is another hurdle I have had to get over (and am still getting over). I thought a lot of our code was not testable. Because you know, we are solving problems that no one ever in the history of the world has had to solve, like retrieving a web page, hard stuff. There are better resources to learn how to write code to be more testable, especially for PHP. I highly recommend everything Chris Hartjes has ever said or written about the subject. You can find his thoughts on the topic at http://grumpy-learning.com/, http://grumpy-phpunit.com/, and https://leanpub.com/grumpy-testing.

We will be improving doctest.php. We have lots of ideas. It just fits better for us. If PHPUnit works for you, great, use it. Just search for `unit testing PHP` and find something that works for you. In addition, we are starting to work with Selenium for interactive testing of our site and some commercial products that are really good at testing APIs. I am getting a hard time from some guys at work. I have been critical of testing in the past. I just needed to understand that I already worked this way or thought this way.

College Basketball in 2013

I am a huge sports fan. I particularly love college sports. We don't have any pro sports teams in Alabama. So, we take our college sports very seriously. Like many sports fans I have watched the NCAA tournament this year. I have to say, if the current trend continues, I don't think I will in the future. I don't like the product.

In particular, I pretty much hate the way Louisville plays basketball. Yes, they won. Kudos to them. I don't blame them really. No, the real problem is the lack of offensive foul calls. Players push and shove each other. Guys going up for a layup from under the backboard are free to bully their way up to the rim, including jumping backwards into a well placed defender. The only thing that matters to the NCAA is scoring, at any cost.

Luke Hancock stealing the game from Wichita State by fouling although it was called a jump ball. / Jaime Green/MCT via freep.com

So, getting back to Louisville, Rick Pitino and his players have figured out just how far they can push these limits. They are almost playing hockey. But, in basketball, there is no penalty box. You get 5 fouls. If you have a deep bench like Pitino, you just plug in another guy and they keep fouling. On the other side you have Luke Hancock, named MVP of the Final Four (geez, talk about bad role models), throwing a pump fake and then leaning into a player in a direction that is not toward the basket or any kind of natural shooting motion, just to draw the foul. Those should be no calls or offensive fouls, IMO.

I coach youth basketball. I teach my kids to not foul. I teach my kids to not touch each other when at all possible. On defense, they need to be in position and ready to move their feet. On offense, don't try and run over people. If you have to foul to stop someone, then we need a better game plan. Or maybe they are just better than we are. I have on occasion praised a kid for being aggressive which led to a foul. These are eight year olds. And some of them are still in a shell about being athletic. But, I never say "foul him". And I never have my kid try and draw a foul the way Luke Hancock does. That is just dirty basketball. I will tell my kids to drive the lane and go for the goal. And "if" you get fouled it's OK. But, never go looking for the foul.

In contrast, I watched some of the New York Knicks vs. Oklahoma City Thunder game Sunday. These guys hardly ever touch each other. They play within what I believe are the real rules of basketball. The game is a little fast and there is a lot of one on one play that can be tedious. But, it was more fun to watch than NCAA basketball. Maybe all the good basketball players go to the NBA and we are left with scrubs in the NCAA. They are too small (size wise) or too slow to play cornerback in football so they end up playing basketball as an "athlete". I hope something is done about this. The game is just getting trashy.

The Web We Lost

I was reading Chris Shiflett's blog and he mentioned a blog post about the web the way it used to be before Facebook and Twitter. I almost tweeted it. Then I thought, nah, a good old-fashioned blog post that linked to it was way better.

The Web We Lost - Anil Dash

The tech industry and its press have treated the rise of billion-scale social networks and ubiquitous smartphone apps as an unadulterated win for regular people, a triumph of usability and empowerment. They seldom talk about what we've lost along the way in this transition, and I find that younger folks may not even know how the web used to be.

It is a good read for all us "old timers". He ironically uses Facebook for comments. One really good one that I can't actually link to afaik because... Facebook said:

No sarcasm here.. I legitimately miss the "web-rings" of old. With niche interests, it was a great way to find like minded sites, and I found many of my still bookmarked favorites with those old crappy left and right arrows :)

Yeah, you know, Web Rings weren't that bad. I had sites on a couple of those things. I think they are still around. And I think you could still pull them off and not worry about Google juice and all that. 302 redirects don't pass that crap. Maybe I should write up some quick JavaScript that indexes the Planet PHP blogs and adds a Planet PHP web ring to your page. Hmmmm.

Reliable Delivery

There is plenty to read about continuous delivery in terms of rolling out code. In my journey to be a better leader and manager, I have realized there is something we are doing badly. While we are continuously integrating, testing, and deploying code to production, we are not reliably delivering a product for our client. I use client loosely here. The client for the development team at dealnews is the company itself. With every project however, there are people interested in its progress. We were (and to some extent still are) underserving those people. Those interested often have no clue when something will be done. Sometimes things are done and they do not know it. So, how do we solve this problem?

There was a time when we were a team of two people. If some server issue popped up, it totally derailed whatever project was being worked on. So, we grew accustomed to missing deadlines out of necessity. Now that we are a real team, that excuse is no longer valid.

Part of the problem also lies in the team’s (i.e. me) OSS roots. I really got started doing web development by writing Phorum. In fact, my first job when I was hired full time for, then, dealmac.com was to update the Phorum software to scale better. Banner ads were at an all time high in the late 90s. We made a lot of money off those page views. In OSS, the answer to “When will X be done?” is often “it’s done when it’s done”. Blizzard Entertainment, creators of World of Warcraft, have been quoted as saying that about their products.

Telling someone it will be done when it is done sounds really cool. I feel like a bad ass. It’s art! It’s not about a timeline. I can’t be bothered with your pesky expectations. Except, that is bullshit. In reality, there are people depending on me and my team to get work done. So, that was one of the first things I wanted to change. We have gotten better. Here are some things I have learned.

Think before speaking. If I am in a meeting and talking about a new task or project, I try not to throw out a time frame for completion. I tell them I will get back to them. I try and tell them when I will get back with them. I then gather anyone on the team that needs to have input and evaluate the changes. Once I have a solid answer, I report back to the other department. Very often I worry people will be mad when I say “two weeks” so I say “one week” and hope for a miracle. But if it is going to be two weeks, I need to tell them that. That may be too late. Or it may be not worth it to them to take that much of our time. Of course not all tasks need this kind of time commitment to deciding a time frame. Deciding which do and which don’t is tough sometimes.

I wanted to start communicating deadlines to our developers. We had never had a ticketing system that supported due dates. We had the “done when done” philosophy. I was really worried about adding them. I didn’t want people to feel like they were being micro-managed. After a couple of months, everyone is much happier. Turns out developers really like knowing when things are expected to be done. It also helps to prioritize different tasks. If a developer has 5 things assigned to them, they can look at the due dates to decide which is more important. Because, you know, they are all marked “highest” priority. Developers have the freedom to speak up and say “there is no way I can finish this by that date”. It’s possible I completely misjudged the scale of the change. It is also possible I wrote a horrible ticket and the developer is confused by my 2AM stream of consciousness.

There is another hurdle for me. I have gotten better at managing expectations of other departments and helping developers know what is expected of them. It is better. It is not perfect. It may never be. I am still struggling with doing the same with my own development tasks. I catch myself thinking “Well, that is just how it is when I have to manage and develop.” But that is a total excuse and a cop out. I have to learn to do that better. Managing my own time may be the toughest of all. If and when I figure something out, I will write about it.

Becoming a Better Manager

I have typically blogged on this site about things I have learned in the web
application world that may help others. In the last year or so, I have been
learning a lot of new things. Most of them are not technical in nature however.
You see, I have moved into a role of being a manager. I am a developing manager.
I still write code. And a lot of my time is dedicated to management as well. I
think this has caused me to stop blogging as much. My mind didn't see these
topics as interesting to what I perceive to be the audience of my blog as things
I have blogged about in the past. The problem is, I miss blogging.

So, going forward, there may be some non-technical things on this blog. My hope
is that someone out there finds them as useful as some of my more technical
blog posts have been.

Developers and Entropy

This is a selfish blog post. I read a great blog post titled "Why You Need To Hire Great Developers" but I could not find it in my browser history or chat history. It talks about entropy creators versus entropy reducers and how bad we are at knowing which one someone is during the hiring process. I wanted to mention it here so my followers could read it and so I could find it again when I was looking for it.

Lock Wait Timeout Errors or Leave Your Data on the Server

If you use MySQL with InnoDB (most everyone) then you will likely see this error at some point. There is some confusion sometimes about what this means. Let me try and explain it.

Let's say we have a connection called A to the database. Connection A tries to update a row. But, it receives a lock wait timeout error. That does not mean that connection A did anything wrong. It means that another connection, call it B, is also updating a row that connection A wants to update. But, connection B has an open transaction that has not been committed yet. So, MySQL won't let you update that row from connection A. Make sense?

The first mistake people may make is looking at the code that throws the error to find a solution. It is hardly ever the code that throws the error that is the problem. In our case, it was code that was doing a simple insert into a table. I had a look at our processing logs around the time that the errors were thrown and I found a job that was running during that time. I then looked for code in that job that updates the table that was locked. This was where the problem lay.

So, why does this happen? Well, there can be very legitimate reasons. There can also be very careless reasons. The genesis of this blog post was some code that appeared to be legitimate at first, but upon further inspection was careless. This is basically what the code did.

  1. Start Transaction on database1
  2. Clear out some old data from the table
  3. Select a bunch of data from database2.table
  4. Loop in PHP, updating each row in its own query to update one column
  5. Select a bunch of data from database2.other_table
  6. Loop in PHP, updating each row in its own query to update another column
  7. Commit database1

This code ran in about 20 minutes on the data set we had. It kept a transaction open the whole time. It appeared legit at first because you can't join the data as there are sums and counts going on that have a one to many relationship which would cause some duplication of the sums and counts. It also looks legit because you are having to pull data from one database into another. However, there is a solution for this. We need to stop pulling all this data into PHP land and let it stay on the server where it lives. So, I changed it to this.

  1. Create temp table on database2 to hold mydata
  2. Select data from database2.table into my temp table
  3. Select data from database2.other_table into my temp table
  4. Move my temp table using extended inserts via PHP from database2 to database1
  5. Start Transaction on database1
  6. Clear out some old data from the table
  7. Do a multi-table bulk update of my real table using the temp table
  8. Commit database1

This runs in 3 minutes and only requires a 90 second transaction lock. Our lock wait timeout on this server is 50 seconds though. However, we have a 3 time retry rule for any lock wait timeout in our DB code. So, this should allow for our current workload to be processed without any data loss.
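The numbers work out if the retry rule means three attempts in total: each attempt waits up to 50 seconds for the lock, so a query can ride out roughly 150 seconds of blocking, which covers the 90 second transaction. A retry rule like that can be sketched as below. This is a Python illustration, not our actual DB code. 1205 is MySQL's error code for a lock wait timeout, while the `DatabaseError` class and `run_with_retry` are hypothetical names invented for the example.

```python
import time

# MySQL error code for "Lock wait timeout exceeded; try restarting transaction"
LOCK_WAIT_TIMEOUT = 1205


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver exception that carries a MySQL error code."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


def run_with_retry(query_fn, attempts=3, pause=0.0):
    """Call query_fn, retrying only when it fails with a lock wait timeout."""
    for attempt in range(1, attempts + 1):
        try:
            return query_fn()
        except DatabaseError as err:
            # Any other error, or a timeout on the final attempt, goes to the caller.
            if err.code != LOCK_WAIT_TIMEOUT or attempt == attempts:
                raise
            # The failed attempt already waited the full lock timeout on the
            # server, so a long sleep here adds little.
            time.sleep(pause)
```

Only the lock wait timeout is retried. Anything else is a real error and should surface right away.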

So, why did this help so much? We are not moving data from MySQL to PHP over and over. This applies to any language, not just PHP. The extended inserts for moving the temp table from one db to another really help. That is the fastest part of the whole thing. It moves about 2 million records from one to the other in about 1.5 seconds.
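If you have not used extended inserts before, the idea is just one `INSERT` statement carrying many rows instead of one statement per row. Here is a minimal sketch of building them in batches. The function name, the batch size and the `?` placeholder style are my assumptions for the example (MySQL drivers typically use `%s`), and the table and column names are assumed to come from trusted code, not user input.

```python
def extended_inserts(table, columns, rows, batch_size=1000):
    """Yield (sql, params) pairs, each inserting up to batch_size rows in one statement.

    `rows` is a list of tuples; `table` and `columns` must be trusted identifiers.
    """
    col_list = ", ".join(columns)
    one_row = "(" + ", ".join(["?"] * len(columns)) + ")"
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One statement with a VALUES group per row in the batch.
        sql = f"INSERT INTO {table} ({col_list}) VALUES " + ", ".join([one_row] * len(batch))
        # Flatten the row tuples so they line up with the placeholders.
        params = [value for row in batch for value in row]
        yield sql, params
```

Each yielded pair is one round trip to the server, so 2 million rows at 1,000 rows per statement is 2,000 statements instead of 2 million.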

So, if you see a lock wait timeout, don't think you should sleep longer between retries. And don't dissect the code that is throwing the error. You have to dig in and find what else is running when it happens. Good luck.

Bonus: If you have memory issues in your application code, these techniques can help with those too.

Scaling 101 - We are Failing the Next Generation

The other day Twitter was down and I had no place to comment on Twitter being down. This got us to talking about scaling at work. I was reminded of the recent slides posted from Instagram about their scaling journey. They are great slides. There is only one problem I have with them. They are just the same slides that you would find from 2000 about scaling.

I have to say, I like Instagram. My daughter has something like 1,000 followers on Instagram. And good for them for being bought by Facebook for a bajillion dollars. This is not a dig on them really. This is a dig on our industry. Why did Instagram have to learn the hard way how to scale their app? I want to point out some of their issues and discuss why it's silly they had to learn this the hard way.

Single machine somewhere in LA

Why would anyone deploy an app to the app store when the backend is all on one server in this day and age? I am not a big proponent of the cloud, but that has to be better than a single server in a rack somewhere. And L.A.? Go for Dallas or somewhere geographically neutral.


So, this is one of the biggest mistakes I see in all of web application development. People use Apache or Nginx or whatever and have a mod_rewrite rule that sends ALL requests into their application stack. They do this because they are lazy. They want to write whatever code they want later and just have the request picked up without any work. At dealnews, we don't do that. We have controllers. But, we specify in our Apache config what paths are routed to those controllers. The most general we have is:

    RewriteCond %{REQUEST_URI} (\/|\.html|\.php)$
    RewriteRule ^.+$ /torpedo.php/$1 [L]

So, if the request ends with / or .html or .php, send it to the controller. This is the controller our proxy servers use. Any other requests like robots.txt, images, etc. are all served off disk by Apache. No sense having a PHP process handle that. It's crazy. This is not Instagram's fault. They likely followed some examples they found that other developers before them put on the internet. Why are we doing this to people? Is it some rite of passage? I suspect that the problem is actually that 90% of the web does not require any scaling. Only 10% of the sites and services out there have to actually worry about load. So, this never comes up.
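To make the routing decision concrete, here is the same condition as a small Python check. It is only an illustration of which paths the condition matches, with the dots escaped so they match literal periods; `goes_to_controller` is a name I made up and Apache does this work itself.

```python
import re

# Same idea as the RewriteCond above: the path must end with "/", ".html" or ".php".
CONTROLLER_PATTERN = re.compile(r"(/|\.html|\.php)$")

def goes_to_controller(path):
    """Return True if the request path would be handed to the PHP controller."""
    return CONTROLLER_PATTERN.search(path) is not None
```

So `/` and `/deals/index.html` go to the controller, while `/robots.txt` and `/images/logo.png` never reach PHP at all.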

So, the next part of their slides basically glorify the scaling process. This is another problem with people in our field. We thrive on the chaos. These are our glory days. Let's face it, most geeks never won the high school football championship. The days when we are faced with a huge scaling challenge are our glory days. I know I have had that problem. And I played sports as a youth. But, nothing is better than my 2006 war story about a Yahoo front page link. Man, I rocked that shit. But, you know what? The fact that I had to struggle through that means I did not do my job. We should have never been facing that issue. It should have just all worked. That is what it does now. It just works. We don't even know when we get a spike now. It just works. The last thing I want in my life now is an unexpected outage that I have to RAGE on to get the site up again. That leaves me feeling like a failure.

Then they realize they are out of their depth. They need to do things they never thought they would need to do. Why not? Why did they not know they would need to do these things? Unexpected growth? Maybe. Why is this not common knowledge though? With all the talk of the cloud being awesome for scaling, why was this not a button on the AWS dashboard that said [SCALE] that you just push? That is what the cloud does, right?

In the end, Instagram learned, the hard way again, that you have to build your own architecture to solve your problem. I learned it the hard way. LiveJournal learned it the hard way. Facebook, Twitter, etc. etc. They have all learned the hard way that there is no single solution for massive scale. You have to be prepared to build the architecture that solves your problem in a way that you can manage. But, there are basic building blocks of all scaling that need to be in place. You should never, ever, ever start with an application on a single server that reads and writes directly to the database with no cache in place. Couch, Mongo, blah blah blah. Whatever. They all need cache in front of them to scale. Just build it in from the start.
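For anyone who has not built this before, "cache in front of the database" can start out as small as the sketch below. It is a toy in-process version with assumed names (`ReadThroughCache`, `loader`, `ttl`); in production you would more likely put something like memcached behind the same interface. The shape is what matters: reads go to the cache first and only fall through to the database on a miss.

```python
import time

class ReadThroughCache:
    """Toy read-through cache: check memory first, hit the database only on a miss."""

    def __init__(self, loader, ttl=60.0, clock=time.monotonic):
        self._loader = loader    # function that fetches a value from the database
        self._ttl = ttl          # seconds an entry stays fresh
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]                      # fresh hit: no database work
        value = self._loader(key)                # miss or expired: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key after a write so the next read reloads it."""
        self._entries.pop(key, None)
```

If every read in the application goes through `get` from day one, swapping the dictionary for a shared cache later is a change in one place instead of a rewrite.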

Instagram was storing images. Why were they surprised when they ran out of room for the images? I just can't fathom that. It's images. They don't compress. You have to put them somewhere. This has to be an education issue. LiveJournal solved this in 2003 with MogileFS and Gearman. Why did they not build their arch on top of that to start with? Poor education, that is why.

One thing they bring up that has me bugged is monitoring. There is no good solution for this. Everyone ends up rolling their own solutions using a few different tools. The tools are often the same, but the metrics, and how they are reported and monitored, are all different. I think there is a clear need in the industry for some standards in this area.

if you’re tempted to reinvent the wheel ... don't

I did find this slide funny. Reading these slides for me is like seeing them reinvent the whole wheel that is scaling. This has all been done before. Why are they having to learn it the hard way?

don’t over-optimize or expect to know ahead of time how site will scale

I take exception to this slide, however. You should have some idea how to scale your app before you deploy it. It is irresponsible and cowboy to deploy and think "oh, we will fix it later". That is true no matter what you are doing. Don't give me the lines about being a startup and all that. It is just irresponsible to deploy something you know won't hold up. For what it is worth, I think these guys really had no clue. And that is because we, as an industry, did not make it known to them what they were up against.

How do we fix this?