HTML5 Experiment

I was looking through dealnews.com browser stats the other day to see how many of our visitors had browsers that could use CSS3. This was how it broke down.



58% of dealnews visitors have browsers that support some CSS3 elements like border-radius and box-shadow. This includes recent Webkit, Firefox 3.5+, Opera 10+ and IE9. Awesome! Another 37% fully support CSS2. This includes IE7, IE8 and Firefox 2 - 3.0.x.

We have started to sneak some CSS3 into the CSS on dealnews.com. In places where rounded corners are optional (like the lightbox I created), we use border-radius instead of a lot of images. The same goes for shadows. We have started using box-shadow instead of, well, nothing. We just don't have shadows if the browser does not support box-shadow. In our recent redesign of our mobile site, we used CSS3 for all the shadows, corners and gradients. But these were all places where things were optional.

The header of dealnews.com, on the other hand, requires a certain consistency. It is the first thing you see. If it looks largely different on your work computer running IE7 than it does on your Mac at home using Safari, that may be confusing. So, we have stuck with images for rounded corners, shadows and gradients. The images have problems though. On the iPad, for instance, the page starts zoomed out a bit. The elements holding the rounded corner images don't always line up well at different zoom levels. It's a math thing. When zooming, you end up with an odd number of pixels at times, which causes pixels to shift. So, we get gaps in our tabs or buttons. Not pretty. This has been bugging me, but it's really just the iPad. It's not mission critical to fix a pixel on the iPad. Armed with the numbers above, I decided to try to reproduce the dealnews.com header using the most modern techniques possible and see how gracefully I could degrade it in older browsers.

Here is a screen shot in Firefox 4 of the current dealnews header. (You can click any of these images to see them full size.)



Now, here is the HTML5/CSS3 header in all the browsers I tried it in. I developed in Firefox 4 and tested/tweaked in others.

Firefox 4 (Mac)


Firefox 3.6 (Windows)


Chrome 10


Opera 11


Internet Explorer 9


Internet Explorer 8 (via IE9 Dev Tools)


Internet Explorer 7 (via IE9 Dev Tools)


Internet Explorer 6 (ZOMG!!1!)


Wow! This turned out way better than I expected. Even IE6 renders nicely. I did have to degrade in some older versions of Internet Explorer. This gets me 97% coverage for all the dealnews.com visitors. And, IMO, degrading this way is not all that bad. I am seeing it more and more around the internet. CNET is using border-radius and CSS3 gradients in their header. In Internet Explorer you see square corners and no gradient. Let's look a little deeper into what I used here.

Rounded corners

For all the rounded corners, I used a border-radius of 5px. I used CSS that would be most compatible. For border-radius that means a few lines just to get the one effect across browsers. The CSS for the tabs looks like this.

-moz-border-radius-topleft: 5px;
-moz-border-radius-topright: 5px;
border-top-left-radius: 5px;
border-top-right-radius: 5px;


The -moz is to support Firefox 3.5+. Everything else that supports border-radius recognizes the non-prefixed style.

Older browsers just get square corners. I think they still look nice. The one thing we lose with this solution is the styled gradient border on the brown tabs. It fades to silver near the bottom of the tabs. There is no solution for that in CSS at this time. That is a small price to pay to skip loading all those images, IMO.

Shadows

For the shadows on the left and right side of the page (sorry, they are not very visible in the screen shots), I used box-shadow. This requires CSS such as:

-webkit-box-shadow: 0 0 0 transparent, 0 2px 2px #B2B1A6;
-moz-box-shadow: 0 0 0 transparent, 0 2px 2px #B2B1A6;
box-shadow: 0 0 0 transparent, 0 2px 2px #B2B1A6;


Again, -moz for Mozilla browsers and -webkit for WebKit based browsers. Now, there are some gotchas with box-shadow. The first is that Firefox requires a color. The other supporting browsers will simply use an alpha blended darker gradient. This makes it a little more work to get it all right in Firefox. The other tricky part was getting the vertical shadow to work. The shadow you make actually curves around the entire element. It has the same z-index as the element itself. So, a shadow on an element will appear on top of other elements around it with a lower z-index. I had issues keeping the lower shadow (the vertical one) from appearing above the element and on top of the tabs. If you look really, really, really close where the vertical and horizontal shadows meet, you will see a tiny gap in the color. Luckily for me, it works due to all the blue around it. That helps mask that small flaw.

For the older IE browsers, I used conditional IE HTML blocks and added a light gray border to the elements. It is a bit more degraded than I like, but as time passes, those browsers will stop being used.
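A minimal sketch of what that conditional comment fallback could look like (the #page selector and the exact border color here are just for illustration, not the real dealnews CSS):
<!--[if lt IE 9]>
<style type="text/css">
/* No box-shadow support, so fake the page edges with a light gray border */
#page { border-left: 1px solid #ccc; border-right: 1px solid #ccc; }
</style>
<![endif]-->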

Gradient Backgrounds

Completing the holy trinity of CSS3 features that make life easier is gradient backgrounds. This is the least unified and most complex of the three features I have used in this experiment. For starters, no two browsers use the same syntax for gradient backgrounds. Firefox does use the W3C recommended syntax. WebKit uses something they came up with and Internet Explorer uses its super proprietary filter CSS property and not the background property. The biggest problem with the filter property is that it makes IE9 not work with border-radius. The tabs could not use a gradient in IE9 because it applied a rectangular gradient to the element that exceeded the bounds of the rounded top corners. Bummer. Once the standard is set this should all clear itself up. As things stand today, I had to use the following syntax for the gradients.

background: #4b4ba8; /* old browsers */
background: -moz-linear-gradient(top, #4b4ba8 0%, #3F3F9A 50%, #303089 93%); /* firefox */
background: -webkit-gradient(linear, left top, left bottom, color-stop(0%,#4b4ba8), color-stop(50%,#3F3F9A), color-stop(93%,#303089)); /* webkit */
filter: progid:DXImageTransform.Microsoft.gradient( startColorstr='#4b4ba8', endColorstr='#303089',GradientType=0 ); /* ie */


As you can see, this is pretty complex. See the next section for a quick way to configure that block of CSS.

Another thing that makes working with these gradients more complex than images is a lack of rendering control. When you are making images, you can control the exact RGB values of the colors in the images. When you leave it up to the browser to render the gradient, sometimes they don't agree. I had to do a lot of fiddling with the RGB values to get all the browsers to render the gradient just right. Some of this had to do with the short vertical area I am using in the elements. That limits the number of colors that can be used to make the transition. So, just be careful when you are lining up two elements with gradients that transition from one element to the other like I have here with the blue tab and the blue navigation bar. 

CSS3 Online tools

Remembering all the CSS3 syntax is a little daunting. Luckily, there are some cool online tools to generate some of this stuff. Here are a few I have used.


HTML5 Elements

To make this truly an HTML5 page, I wanted to use the new doctype and some of the new elements. I make use of the <header> and <nav> tags in this page. The <header> tag is just what you would think it is. It surrounds your header. This is all part of the new semantic logic behind HTML5. The <nav> tag surrounds major site navigation. Not just any navigation, major navigation. The HTML5 Doctor has more on that.

To make IE support them, I used a bit of javascript I found on the internet along with some CSS to set them to display as block level elements. The CSS is actually used by older Firefox versions as well.
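The technique, roughly, is to create the unknown elements with JavaScript so old IE will style them, and to tell other browsers to treat them as block level elements. A stripped down sketch (not the exact script I used) looks like this:
<!--[if lt IE 9]>
<script type="text/javascript">
// Creating the elements once is enough to make old IE willing to style them
document.createElement("header");
document.createElement("nav");
</script>
<![endif]-->
<style type="text/css">
/* Older Firefox and friends need these set to block */
header, nav { display: block; }
</style>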

A couple of non-CSS3 techniques

I did a couple of things that are not CSS3 or HTML5 in this page. One is that I put the CSS into the page and not in its own file. With modern broadband, the biggest issue in delivering pages fast is the number of HTTP requests, not (always) the total size of the data. The more HTTP requests required to start rendering your page, the longer it will take. Your optimization goals will determine if this technique is right for you. I currently include all CSS needed to render the header in the page so that the rendering can start without any additional HTTP requests. The CSS for the rest of the page is included via a link tag.

The other technique I used in this page is not new, but it is not widely used either. Again, depending on your needs, it may be a possible win when trying to reduce HTTP requests. I used embedded image data URIs in the CSS for the three images I still needed for this page. Basically, you base64 encode the actual image file and put it into the CSS. The benefit is that this page is fully rendered with just one HTTP request. The downside is that the base size of the page (or CSS if it is external) that has to be downloaded every time is much larger. Probably a good compromise would be to put the CSS into an external file. That would mean just two HTTP requests would be needed and the CSS could be cached. For IE6, I just used conditional HTML to include an actual URL to the background images.
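Generating the data URIs is basically a one-liner in PHP. A quick sketch (the file name is just an example):
<?php
// Turn an image file into a data URI we can paste into the CSS
$img = file_get_contents("images/header-logo.png");
echo "background: url(data:image/png;base64," . base64_encode($img) . ") no-repeat;\n";
?>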

The data URI technique is a little bit of a mysterious one in that it does make for a larger page, but it can help with render time on first load. This really comes down to what you are optimizing for. If you are optimizing for repeat visitors, it may be that the images are better off being separate requests. If you are optimizing for new visitors, this technique will yield a faster rendering page. In Chrome and IE9, the onLoad event fired much sooner (in as little as half the time) using this technique than having the images as a separate request. In Firefox, something else is going on. I am not sure what. The onLoad event still fires sooner, but not a whole lot sooner. The DOMContentLoaded event in Firefox, however, fires later with this technique than with the images in a separate request. Firefox was the only browser that showed this pattern.

OOCSS

It is not really HTML5 or CSS3 related, but I do want to give some credit to OOCSS. I used it for much of the layout. It makes laying out elements in different browsers very easy. I am using it in the current site as well as the HTML5 experiment. You should use it. It is awesome.

Conclusion

HTML5 and CSS3 have a lot to offer. And if you have a user base that is fairly modern, you can start using things now. While I may not redo the current dealnews web site using HTML5 and CSS3, our next redesign and any upcoming new designs will definitely include aspects of HTML5 and CSS3 where we can. It can save time and resources when you use these new techniques.

Here is a link to the HTML5 source. I also created a stripped down version of the HTML4 in use on the site now.

Lies, Damned Lies and Google Analytics

Google Analytics (GA) has changed the world of web analytics. It used to be that you only had applications that analyzed logs from your web server. Those were OK for the first few years. But, with bots (especially ones that lied about not being a bot) and more complex web architectures, those logs became less useful for understanding how your users used your web site.

One such product was Urchin. Years ago, we were users of Urchin, the product and company that Google purchased to create GA. It was the last really good log analyzer out there. We were kind of excited when Google bought them as we were looking at going to their hosted solution, which eventually became GA. At the time, however, they were young and we decided to go with Omniture, the 800 pound gorilla in the space. However, Omniture's prices were such that increased traffic meant increased cost with no additional value in their numbers and no new features from their product. So, we left them. We turned back to GA. In addition, we started using Yahoo's recently acquired product, Yahoo Analytics (formerly IndexTools). They were both free, so we figured why not use them both. All this was over 2 years ago.

The one big thing lacking for us in all these tools was the ability to tie actual user activity in the analytics to revenue. You see, dealnews does some affiliate based business. This means that we get a percentage of sales after the transactions are all closed. So, we may not see revenue for a user action for days or even weeks. All the analytics packages are made for shopping carts. In a shopping cart, you know at checkout how much the user spent, so you can inject that into the system along with their page view. We don't have that luxury. To try and fill this gap, we started keeping our own javascript based session data in logs for various purposes in addition to the two analytics systems.

Recently we decided to try and use the GA API to get page views, visits and unique visitor data about parts of our site so we could, at least at a high level, tie the numbers all together in one report. This gets me to the meat of this post. What we found was quite disturbing. This is a query for a single day for a particular segment of our site. We used the Data Feed Query Explorer to craft our queries for the data we wanted. This was the result.



As you can see, some pages received visitors but no visits. Pages consistently received more visitors than visits, which I find quite odd. Sometimes we found that our internal logging would show activity on pages and Google would simply not show any activity for that page for an entire month. Now, I know they have a cutoff. I believe it is still 10,000, the old Urchin number. (C'mon Google, you are Google. You still have this?) They will only store up to 10,000 unique items (page URLs in this case) for a given segment of their reporting data. So, 10k referring domains, 10k pages, etc. etc. Perhaps that is what happened? We dug in and found that to not be the case. And besides, it shows visitors but not visits. What is up with that? Maybe it's a date thing. We should look at more days. This is an entire month for the same filters.



Whoa, we still have visitors and no visits. In addition, there are lots of multiples of 17 there. 17 and 34 both appear a lot. We found that littered throughout our results when filtering on the page path. This has to be made up data or something. I don't see how it could be based in any reality.

The odd data was not limited to this apparent missing data. There is apparently extra data in there too. In another discussion about the impact of recent social marketing efforts, we went to the analytics to see what kind of traffic social networking sites were sending our way. When comparing GA to YA and our internal numbers, we were left baffled. Sorry, no numbers allowed here - super corporate secrets and all that. But, I can tell you that Google Analytics claims that several large, well known referring sites send us 5x to 10x the traffic that our other tracking reports. I even wrote custom code to try to follow a user through our logs, to see if maybe they were attributing a visitor's second visit in a day to the referrer of the earlier visit, but I could never find any pattern that matched their data. There simply is no way this is right.

There is good news. If you stick with the high level numbers like overall page views, visitors, visits and things like new vs. returning visitors, the numbers appear to line up well. In all cases, the differences between all three of our sources are within acceptable ranges. But, apparently, drilling down too deep in the data with GA will yield some very unsavory answers.

Questions for Git users

So, we have been using Subversion for a long time. We are very comfortable with it. We know how it works. Even when things go sideways, we have people here that understand it and can set things straight again. But, we do find some things lacking. I had an SVN conflict today where the diff showed the problem to be that I was replacing nothing with something. That was a conflict. Sigh. Also, it is getting slower and slower as time passes. Committing or updating my code base can take long enough that I start checking email and forget I was committing something. So, we are thinking about a switch. I use GitHub for some of the OSS software I have released. I like GitHub a lot. I find the Git command line tool a little confusing (commit -a seems silly, for example). But, I figure I can get over that. I am going to end up wrapping it up in our merge and deploy tools anyway. But, I have some questions for people that use Git on a daily basis. I have searched around and there is either more than one answer or no answer to these things.

How big is your repository?

We have 1,610 directories and 10,215 files in our code library. How does Git deal with repositories that size?

What are you using for a Git server?

From what I can tell, there is no server component to Git. I think I understand that you can use a shared Git user and allow ssh keys to access it that way? Or there are some other 3rd party projects that act as a Git server. We currently use Apache+SVN, which lets us A) use our LDAP server and B) control access to parts of the repository using Apache configuration. It looks like we would lose the LDAP for sure, but maybe something like Gitolite would allow us to do some auth/acl stuff. Maybe we could dump our LDAP data on a schedule to something it can read.

What is your commit, push, deploy procedure?

This is really aimed at the guys doing continuous deployment in a web application with Git. Our current methodology is that each developer has a branch. They commit to their branch. When a set of commits is ready for release, they push it to trunk. In a staging environment, the changeset in trunk is merged with the production branch. That branch is then rolled to the servers. We like having the 3 layers. It lets us review changes in one place (trunk) for large commits. Meaning there can be 10 commits in a developer's branch, but when all the changes are merged into trunk (using one SVN merge of course) you get one nice neat commit in the trunk branch that can easily be evaluated. It also lets users collaborate easily by committing changes to trunk that others may need. Other users can just merge back from trunk to get the changes. From what I have read or heard people talking about, this seems to fly in the face of how Git works. Or, at least what Git is good at. Also, I think the way Git works, 10 commits in a branch would merge as 10 commits in trunk. So, we would lose the unification of a lot of little changes getting merged into one change. Some of our developers will use their branch to commit things in progress and not at a finished point. So, by the time they are done, we will have 10+ commits for one task. Any input here?
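For what it is worth, from the reading I have done, a squash merge looks like it might handle the "many little commits become one commit in trunk" part. Something like this (the branch names are made up) supposedly takes every commit on a developer branch and lands it on trunk as a single commit, but I would love confirmation from daily Git users:
git checkout trunk
# Stage all of the changes from the developer branch without their individual commits
git merge --squash dev-bob
# One tidy commit on trunk, however many commits dev-bob had
git commit -m "Task 1234 from dev-bob"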

What are you using for visualization?

We use Trac. We love and hate Trac. It's good enough. I know Redmine supports Git natively and I know there is a Trac Hack for making Trac support it. Anything else? Any comments on Redmine?

Any tips for importing from Subversion?

I tried importing our Subversion repos into Git using svn2git. It got hung up on some circular branch reference or something. The error message was confusing. So, I guess if we do migrate, we will be starting with little to no history. Perhaps we just import our current production branch and start from there. Any tips?

What is next for message board software?

When I was hired at dealnews.com in 1998, my primary focus was to get our message board (Phorum) up to speed. I had written the first version as a side project for the site. Message boards were a lot simpler back then. Matt's WWWBoard was the gold standard of the time. And really, the functionality has been only evolutionary since. We added attachments to Phorum in 2003 or something. That was a major new feature. In Phorum 5 we added a module system that was awesome. But, that was just about the admin and not the user. From the user's perspective, message boards have not changed much since 1997. I saw this tweet from Amy Hoy and it got me to thinking about how message boards work. Here is the typical user experience:
  1. Go to the message board
  2. See a list of categories
  3. Drill down to the category they want to read
  4. Scroll through a list of messages that are in reverse chronological order by original post date or most recent post date
  5. Click a message and read it.
  6. Go to #3, repeat
Every message board software package pretty much works like that and has for over 10 years. And it kind of sucks. What a user would probably rather experience is:
  1. Go to the message board
  2. The most interesting things (to this user) are listed right there on the page. No drill down needed.
  3. Click one and read it.
  4. Goto #2, repeat.
Sounds easy? That #2 is easy to type but very hard to accomplish. I think it is conceivably doable if you are running a site that has all the data. Stackoverflow comes close. When you land on the site, they default the page to the "interesting" posts. However, they are not always interesting to me. They are making general assumptions about their audience. For example, right now, the first one is tagged "delphi". I could not care less about that language or any posts about it. It's a good try, but misses by oh so far. This is not a Stackoverflow hate post. They are doing a good job. So, what do I do when I land there? I ignore the front page and click Tags (#2 in the first list), then pick a tag I want to read about (#3 in the first list). Lo and behold, the page I get is "newest". So, I end up doing exactly what is in the first list I mentioned. They do offer other sort options. But, they chose newest as the default. And from years of watching user behavior, 80% - 90% of people go with the good ol' default. This kind of brings me to another point though about the types of message boards there are.

Stackoverflow is a classic example of a help message board. People come there and ask a question. Other people come along and answer the question. Then more people come along and vote on whether the answers (and questions) are any good. This is one really nice feature that I think will have to become a core feature in any message board of the future. The signal to noise ratio can get so out of whack, you need human input to help decide what is good and what is noise. I think the core of the application has to rely on that if we are ever going to achieve the desired experience.

The second type of message board is a conversational system. It is almost like a delayed chat room. People come to a message board and post about their cat or ask who watched a TV show, that kind of thing. This has a completely different dynamic to it than the help message board. You can't really vote on whether a post is good or bad. The obvious exception is spam, which would of course need to be recognized and dealt with.

So, how do you know what content is desirable for the user that is entering the site right now? This concept has already been laid out for us: the social graph. You have to give users a way to associate with other users. If Bob really likes Tom's posts, he is probably more interested in reading Tom's post from 30 minutes ago than a post from some new guy that just joined the site and posted 1 minute ago. The challenge here is getting people to interconnect... but not too much. Everyone has that aunt on Facebook that follows you, your roommate and anybody else she can. She would follow your dog if he had a Facebook account. So, those people would still get a crappy experience if the whole system relied on the social graph. The other side is the people that will never "follow", "like" or whatever you call it, another person. Their experience would suffer as well. One key ingredient here is that you need to own this data. You can't just throw Like buttons and Facebook Connect on your message board and think you can leverage that data. That data is for Facebook, not you. I think the help message boards could benefit from the social graph as well.

Another aspect of what is most important to a user is discussions they are involved in. That could mean ones they started, ones they have replied to or simply ones they have read. Which of those aspects are more important than the others? Clearly if you started a discussion and someone has replied, that is going to interest you. If you posted a reply, you may be done with the topic or you may be waiting on a response. It would take some serious natural language algorithms to decide which is the case. For things you have read, I think you have to consider how many times the user has read the discussion. If every time it is updated they read it, they probably will want to read it again the next time it is updated. If they have only read it once, maybe they are not as interested.

The last aspect of message boards is grouping things. This is the part I actually struggle with the most. The easy first answer is tagging. Don't force the user down a single trail; let them tag posts instead of only posting them in one neat contained area. That gets you halfway there. Let's use Stackoverflow (I really do like the site) as an example again. The first thing I do is go to Tags and click on PHP. I like helping people with PHP problems. So, is that really any different from categorization? Sure, there could be someone out there that really likes helping with Javascript. And if the same post was tagged with both tags, then their coverage of potential help is larger. But, some of the time those posts with more than one tag are tagged wrong. The problem they need help with is either PHP or Javascript, most likely not both. They just don't know what they are doing. For example, there is this post on Stackoverflow. The user tagged it PHP and database-design. There is no PHP in the question. I am guessing he is using PHP for the app. But, it really never comes up and he is only talking about database design. So, who did the PHP tag help there? I don't think it helped him. And it only wasted my time. Having written all that, a free-for-all approach where there is no filtering sucks too. ARGH! It just all sucks. That brings us back to what Amy said in a way. Perhaps moderated tagging is an answer. I have not seen a way on Stackoverflow to untag a post. That would let people correct others. I am gonna write that down. If you work at Stackoverflow and are reading this, you can use that idea. Just put a comment in the code about how brilliant I am or something that aliens will find one day.

So, I am done. I know exactly what to do, right? I just have to make code that does everything I put in the previous paragraphs. Man, I wish it were that easy. When you want to write a distributed application to do it, the task is even more daunting. If I controlled the data and the servers and the code, I could do crazy things that would make great conference talks. But, it kind of falls apart when I want to give this code to a 60 year old retired guy that is starting a hobby site for watching hummingbirds on a crappy GoDaddy account. Yeah, he is not installing Sphinx or HandlerSocket or Gearman. Those are all things I would want to use to solve this problem in a scalable fashion. At that point you have two choices. Aim for the small time or the big time. If you aim for the small time, you may get lots of installs, but you will be hamstrung. If you aim for the big time, you may be the only guy that ever uses the code. That is a tough decision.

What have I missed? I know I missed something. Are there other types of message boards? I can definitely see some sub-types. Perhaps a board where ideas instead of help messages are posted. Or maybe the conversations are more show-off based, as in a user posting pictures or videos for comment. Is there already something out there doing this and I have just missed it? Let me know what I have missed please.

How does a geek end up doing public speaking?

I just finished reading Confessions of a Public Speaker (O'Reilly) by Scott Berkun. It is the first book I have read cover to cover in 16 years. I even read the colophon (which I had never seen in a book before). I found the entire book to be very engaging. While reading it, I kept reviewing every mistake I had ever made in every talk I have given. Even if you are not a regular public speaker, there is great stuff in the book that can help you with talking to your coworkers, boss, etc. There is a chapter that is more about teaching than speaking that I found really good. I can really apply a lot of that knowledge to working with my team on new technologies.

Public speaking was never something I thought I would do when I was younger. I was not on the debate team in high school. I was never the leader of any clubs. The most I ever did was run for class president my senior year. I did not win. The eventual valedictorian, who had been the president of our class every year of high school, won. That was probably for the best. In fact, I later discovered I was considered a "rebellious", quiet outsider. Ha, a rebel? Me? That seems quite far-fetched.

The first time I felt like I wanted to share my knowledge with my peers was when I went to Apachecon in 1999. We were a 4 or 5 person company at the time. But, our CTO had been to regular conferences in his past jobs and found value in the experience. So, he and I took off for Orlando, FL. At the time, Apachecon was really the only well rounded web conference that existed (I find it is not as much this way now). I was excited by the environment I found at the conference. The speakers were not talking heads for faceless corporate giants. They were guys that, like us, were trying to make the web work. Also like us, a lot of them did not have massive investment cash, but were operating on shoestring budgets. Performance and availability mattered. Coming up with new ideas really mattered. We found that we had ideas that others had concurrently. And in some cases, we had ideas that some had not thought about. I would wait another two years to have my turn on stage.

The web dev/ops team had grown to 5 people. I received an email for the CFP (Call for Proposals) for Apachecon 2001. I thought I had a good idea for a session. The reaction we received from people when we talked about our caching scheme was interesting. Most people had one of two reactions. They either thought the idea was very interesting or they thought we were just dumb. But, the key was that almost everyone had a reaction. So, I thought it would make a good talk. I made every rookie mistake there was. I made my slides the night before the talk. I did not go to the room I was speaking in until it was time to talk. You name it, I did it for the actual presentation. The talk felt like a train wreck. The only good thing about it was, at that time, Apachecon required the presenters to submit a full written version of the session for printing in a book that attendees received. So, we (I was doing the work along with a coworker) had become very familiar with the content. The response we received from the people that attended was incredible. They probably had no idea that the slides were written the night before. But, I have to say, from that point on, I was kind of hooked.

It would be a few years before I spoke on that type of stage again (MySQL Conference, 2008). I seem to have found a home at O'Reilly conferences. Like Apachecon, they embrace people that are doing the work in the field. Sure, you see some talks that feel like sales pitches, but the community quickly exposes those sessions. In the last 5 years, I have spoken at several O'Reilly conferences, another Apachecon and several regional conferences.

All of this public speaking helped me in a way I never expected. In late 2009, my grandfather passed away. My grandmother, his wife, passed away in 1995. I remember at the time having the urge to get up and talk about her. She was a fascinating woman. And there were so many things in my head. But, I lacked the confidence to just get up and say them. So, when my grandfather passed away, I had a completely different mindset. I made it known that I would like to say some things about him. When trying to decide what to say, I found myself using techniques I use for preparing a presentation. I started with several points to make and worked on the right order for the points. Then I went back through and filled in the details of each point. Practicing that "talk" was the hardest practice I have ever had to do for public speaking. No practice or preparation has seemed hard since. It was an odd clash of my worlds. But, I am so happy I did it. I still have the notes on my iPhone. When I miss him, I get them out and read them.

Lately, I have branched out from talking to geeks. O'Reilly's Ignite series of events leaps out of the geek culture and into your community. We are lucky to have a very good group of people organizing Ignite Birmingham in my home town. And while my talks thus far with Ignite Birmingham have had a tech slant, they were not for geeks. I was speaking to a room full of smart, but not necessarily technical, people. It's fun to get outside of my comfort zone. Also, they video all the talks. So, I get to see myself and judge my performance. It's great. Speaking to a wider audience is a fun journey that I am finding very exciting and look forward to exploring more. My next challenge for myself is to find a completely non-technical topic for Ignite Birmingham.

Sharing gotchas on Facebook and Twitter

I have been working on adding some sharing features to dealnews.com. Dealing with Facebook and Twitter has been nothing if not frustrating. Neither one seems to understand how to properly deal with escaping a URL. At best they do it one way, but not all ways. At worst, they flat out don't do it right. I thought I would share what we found out so that someone else may be helped by our research.

Facebook

Facebook has two main ways to encourage sharing of your site on Facebook. The older way is to "Share" a page. The second, newer, cooler way to promote your page/site on Facebook is with Facebook's Like Button. Both have the same bug. I will focus on Share as it is easier to show examples of sharing. To do this, you make a link and send it to a special landing page on Facebook's site. But, let's say my URL has a comma in it. If it does, Facebook just blows up in horrible fashion. The users of Phorum have run into this problem too. In Phorum, we dealt with register_globals in a unique way long ago. We just don't use traditional query strings on our URLs. Instead of the traditional var1=1&var2=2 format, we decided to use a comma delimited query string. 1,2,3,var4=4 is a valid Phorum URL query string.

According to RFC 3986, a query string is made up of:
query = *( pchar / "/" / "?" )
where pchar is defined as:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
and finally, sub-delims is defined as:
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / 
"*" / "+" / "," / ";" / "="
That is RFC talk for "A query string can have an un-encoded comma in it as a delimiter." So, in Phorum we have URLs like http://www.phorum.org/phorum5/read.php?61,145041,145045. That is the post in Phorum talking about Facebook's problem. It is a valid URL. The commas do not need to be escaped. They are delimiters much like an & would be in a traditional URL. So, what happens when you share this URL on Facebook? Well, a share link would look like http://www.facebook.com/share.php?u=http%3A%2F%2Fwww.phorum.org%2Fphorum5%2Fread.php%3F61%2C146887%2C146887. If I go to that share page and then look in my Apache logs I see this:
66.220.149.247 - - [18/Nov/2010:00:47:51 -0600] "GET /phorum5/read.php?61%2C146887%2C146887 HTTP/1.1" 302 26 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
Facebook sent %2C instead of a comma? It decoded the other stuff in the URL. The slashes, the question mark, all of it. So, what is their deal with commas? Well, maybe I can hack Facebook and not send an encoded URL to the share page. Nope, same thing. So, they are proactively encoding commas in URL query strings.

This has two effects. The first is that the share app attempts to pull in the title, description, etc. from the page. In this case, we redirect the request as the query string is invalid for a Phorum message page. So, they end up getting the main Phorum page. In the case of dealnews, we usually throw a 400 HTTP error when we get invalid query strings. Neither of these gets the user what he wanted. The second problem is that the URL that is clickable once the user has shared it is not valid. So, the whole thing was just a huge waste of time.

I have submitted this to the Facebook Bugzilla. The only workaround is to use a URL shortener or to not use commas in your URLs. Just make sure the shortener does not use commas. I guess you could use special URLs for Facebook that used something besides a comma and that are then redirected to the real URL with commas. I don't know what that character would be, I am just guessing.

Twitter

Twitter's issues deal with their transition from their old interface to their new interface. Twitter is in the process of (or is done with) rolling a new UI on their site. The link in the old site to share something on Twitter was something like: http://twitter.com/home?status=[URL encoded text here]. This worked pretty darn well. You could put any valid URL encoded text in there and it worked. However, that now redirects you to their new interface's way of updating your status and they don't encode things right.

If I want to tweet "I love to eat pork & beans" I would make the URL http://twitter.com/home?status=I+love+to+eat+pork+%26+beans. Twitter then takes that, decodes the query string and redirects me to http://twitter.com/?status=I%20love%20to%20eat%20pork%20&%20beans. The problem is that they did not re-encode the &. It is in the bare URL. So, when I land on my Twitter page, my status box just says "I love to eat pork ". Which, while true, is not what I meant to tweet. This bug has been submitted to Twitter, but has yet to be fixed.

The second problem is with the new site and how they deal with validly encoded spaces. Spaces can be escaped two ways in a URL. The first, older way (which the PHP function urlencode uses) is to encode spaces as a plus (+) sign. This comes from the standard for how forms submit (or used to submit) data. It is understood by all browsers. The second way comes from the later RFCs written about URLs. They state that spaces in a URL should be escaped like other characters, by replacing a space with %20. The old Twitter UI would accept either one just fine. And, if you send that to the old status update URL, it will redirect you (see above) with %20 in the URL instead of +. However, if you send + to the new Twitter UI, as above, you get "I+love+to+eat+pork+&+beans" in your status box. The only solution is to not send + as an encoding for space to Twitter. In PHP you can use the function rawurlencode to do this. It conforms to the RFC(s) on URL encoding. Doing so with the new linking pattern generates the URL http://twitter.com/?status=I%20love%20to%20eat%20pork%20%26%20beans, which works great. This was also reported to Twitter as a bug by our team.
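In PHP terms, the difference between the two encodings looks like this (using the tweet from above):
<?php
$text = "I love to eat pork & beans";
// urlencode() turns spaces into +, which the new Twitter UI leaves in the status box
echo "http://twitter.com/home?status=" . urlencode($text) . "\n";
// rawurlencode() turns spaces into %20, which works with both the old and new UI
echo "http://twitter.com/home?status=" . rawurlencode($text) . "\n";
?>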

So, maybe that will help someone out who is having issues with sharing their site on the two largest social networks. Good luck with your social media development.

Monitoring PHP Errors

PHP errors are just part of the language. Some internal functions throw warnings or notices and seem unavoidable. A good case is parse_url. The point of parse_url is to take apart a URL and tell me the parts. Until recently, the only way to validate a URL was a regex. You can now use filter_var with the FILTER_VALIDATE_URL filter. But, in the past, I would use parse_url to validate the URL. It worked, as the function returns false if the value is not a URL. But, if you give parse_url something that is not a URL, it throws a PHP Warning. The result is I would use the evil @ to suppress errors from parse_url. Long story short, you get errors on PHP systems. And you should not just ignore them.
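Here is what that looks like in practice. This is just a tiny illustration, not our production code:
<?php
// Some untrusted value that may or may not be a URL
$url = $_GET["url"];

// The old way: parse_url() returns false for bad values, but it can also
// throw a PHP Warning, hence the evil @
$parts = @parse_url($url);

// The newer way: filter_var() just returns false, nothing to suppress
if (filter_var($url, FILTER_VALIDATE_URL) === false) {
    echo "Not a valid URL";
}
?>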

In the past, we just logged them and then had the log emailed daily. On some irregular schedule we would look through them and fix stuff to be more solid. But, when you start having lots of traffic, one notice error on one line could cause 1,000 error messages in a short time. So, we had to find a better way to deal with them.

Step 1: Managing the aggregation

The first thing we did was modify our error handler to write a human readable log to the defined PHP error log and to write a serialized (json_encode actually) version of the error to a second file. This second file makes parsing the error data super easy. Now we could aggregate the logs on a schedule, parse them and send reports as needed. We have fatal errors monitored more aggressively than non-fatal errors. We also get daily summaries. The one gotcha is that PHP fatal errors do not get sent to the error handler. So, we still have to parse the human readable file for fatal errors. A next step may be to have the logs aggregated to a central server via syslog or something. Not sure where I want to go with that yet.
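A stripped down sketch of that kind of error handler (the file paths and field names here are made up, and as noted, fatal errors still bypass it):
<?php
// Log each error twice: once human readable, once as JSON for easy parsing
function dual_log_error($errno, $errstr, $errfile, $errline) {
    $human = date("c") . " [$errno] $errstr in $errfile on line $errline\n";
    error_log($human, 3, "/var/log/php/php_errors.log");

    $record = array(
        "time"  => time(),
        "errno" => $errno,
        "error" => $errstr,
        "file"  => $errfile,
        "line"  => $errline,
    );
    file_put_contents("/var/log/php/php_errors.json", json_encode($record) . "\n", FILE_APPEND);

    // Returning false lets PHP's normal error handling continue as well
    return false;
}
set_error_handler("dual_log_error");
?>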

Step 2: Monitoring

Even with the logging, I had no visibility into how often we were logging errors. I mean, I could grep logs, do some scripting to calculate dates, etc. Bleh, lots of work. We recently started using Circonus for all sorts of metric tracking. So, why not PHP errors? Circonus has a data format that allows us to send any random metrics we want to track to them. They will then graph them for us. So, in my error handler, I just increment a counter in memcached every time I log an error. When Circonus polls me for data, I just grab that value out of memcached and give it to them. I can then have really cool graphs that show me my error rate. In addition, I can set alerts up. If we start seeing more than 1 error per second, we get an email. If we start seeing more than 5 per second, we get paged. You get the idea.
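The counter itself is only a few lines. A rough sketch using the pecl memcached extension (the key name, server and output format are placeholders, not exactly what Circonus expects):
<?php
$mc = new Memcached();
$mc->addServer("127.0.0.1", 11211);

// In the error handler: bump the counter every time we log an error
if ($mc->increment("php_error_count") === false) {
    // Key did not exist yet
    $mc->set("php_error_count", 1);
}

// In the page Circonus polls: hand the current count back
echo json_encode(array("php_errors" => (int) $mc->get("php_error_count")));
?>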

The Twitterverse:

I polled my friends on Twitter and two different services were mentioned. I don't know that I would actually send my errors to a service, but it sounds like an interesting idea. One was HopToad. It appears to be specifically for application error handling. The other was Loggly, which appears to be a more generic log service. Both are very interesting concepts. The control freak in me would have issues with it. But, that is a personal problem.

Most others either manually reviewed things or had some similar notification system in place. Anyone else have a brilliant solution to monitoring PHP errors? Or any application level errors for that matter?

P.S. For what it is worth, our PHP error rate is about .004 per second. This includes some notices. Way too high for me. I want to approach absolute zero. Let's see if all these new tools can help us get there.

Boycott NuSphere. They are spammy spammers

It took a lot for me to finally write this post. I tweeted about it a while back. It seems NuSphere needs more business. They have resorted to spamming people to promote their PhpED product. I have gotten emails to email addresses that:
  1. I know are not on any mailing list
  2. Are on web pages as plain mailto: anchor tags for good reasons.
In one case, it was the security@phorum.org address we have on the site to make it easy for people to report any security related issues. We get all kinds of spam because of this, but it is worth it to have an easy access address for security issues.

In the other case, the email addresses used were on the dealnews.com jobs page. It was the addresses that are used to accept resumes.

Now, today, I started getting them to my personal inboxes.



Apparently, they sent it out the first time with a misspelling, so they had to send it out again!?!?

Now, the NuSphere people posted a tweet that claims the emails are not coming from them. That it's an independent marketer that "have permission to sell our products". First, the links in the email go to the NuSphere site, not a 3rd party. I would think an independent would want to take me to their site. Second, if that is the case, revoke the permission you have given them to sell your products. Don't use a marketing/sales agreement as a shield to allow people in the PHP Community to be spammed in your name.

PHP Community, please do not support NuSphere. They are spammers, directly or indirectly. If they do this to promote their product, how will they find a way to hose their user base later on?

PHP 5.3 and mysqlnd - Unexpected results

I have started seriously using PHP 5.3 recently due to it finally making it into Portage. (Gentoo really isn't full of bleeding edge packages, people.) I have used mysqlnd a little here and there in the past, but until it was really coming to my servers I did not put too much time into it.

What is mysqlnd?

mysqlnd is short for MySQL Native Driver. In short, it is a driver for MySQL for PHP that uses internal functions of the PHP engine rather than the externally linked libmysqlclient that has been used in the past. There are two reasons for this. The first reason is licensing. MySQL is a GPL project. The GPL and the PHP License don't play well together. The second is better memory management and hopefully more performance. Being a performance junkie, this is what piqued my interest. Enabling mysqlnd means it is used by the older MySQL extension, the newer MySQLi extension and the MySQL PDO driver.

New Key Feature - fetch_all

One new feature of mysqlnd was the fetch_all method on MySQLi Result objects. At both dealnews.com and in Phorum I have written a function to simply run a query and fetch all the results into an array and return it. It is a common operation when writing API or ORM layers. mysqlnd introduces a native fetch_all method that does this all in the extension. No PHP code needed. PDO already offers a fetchAll method, but PDO comes with a little more overhead than the native extensions and I have been using mysql functions for 14 years. I am very happy using them.
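For reference, the fetch_all version of that kind of helper is only a few lines. This is a sketch, not the exact function from dealnews or Phorum:
<?php
// Run a query and return all rows as an array of associative arrays
function fetch_all_rows(mysqli $db, $sql) {
    $res = $db->query($sql);
    if ($res === false) {
        return false;
    }
    // With mysqlnd, this work happens inside the extension, not in PHP code
    return $res->fetch_all(MYSQLI_ASSOC);
}

$db   = new mysqli("localhost", "user", "password", "test");
$rows = fetch_all_rows($db, "SELECT * FROM some_table");
?>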

Store Result vs. Use Result

I have spoken in the past (see my slides and interview: MySQL Tips and Tricks) about using mysql_unbuffered_query or using mysqli_query with the MYSQLI_USE_RESULT flag. Without going into a whole post about that topic, it basically allows you to stream the results from MySQL back into your PHP code rather than having them buffered in memory. In the case of libmysqlclient, they could be buffered twice. So, my natural thought was that using MYSQLI_USE_RESULT with fetch_all would yield the most awesome performance ever. The data would not be buffered and it would get put into a PHP array in C instead of in PHP code. The code I had hoped to use would look like:
$res = $db->query($sql, MYSQLI_USE_RESULT);
$rows = $res->fetch_all(MYSQLI_ASSOC);
But, I quickly found out that this does not work. For some reason, it is not supported. fetch_all only works with the default, which is MYSQLI_STORE_RESULT. I filed a bug, which was marked bogus. I put it back to new because I really don't see a reason this should not work other than a complete oversight by the mysqlnd developers. So, I started doing some tests in hopes I could show the developers how much faster using MYSQLI_USE_RESULT could be. What happened next was not expected. I ended up benchmarking several different options for fetching all the rows of a result into an array.

Test Data

I tested using PHP 5.3.3 and MySQL 5.1.44 using InnoDB tables. For test data I made a table that has one varchar(255) column. I filled that table with 30k rows of random lengths between 10 and 255 characters. I then selected all rows and fetched them using 4 different methods.
  1. mysqli_result::fetch_all*
  2. PDOStatement::fetchAll
  3. mysqli_query with MYSQLI_STORE_RESULT followed by a loop
  4. mysqli_query with MYSQLI_USE_RESULT followed by a loop
In addition, I ran this test with mysqlnd enabled and disabled. For mysqli_result::fetch_all, only mysqlnd was tested as it is only available with mysqlnd. I ran each test 6 times and threw out the worst and best result for each test. FWIW, the best and worst did not show any major deviation for any of the tests. For measuring memory usage, I read the VmRSS value from Linux's /proc data. memory_get_usage() does not show the hidden memory used by libmysqlclient and does not seem to show all the memory used by mysqlnd either.
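If you want to recreate the test table and data, something along these lines will do it (the table and column names are made up, not my actual test script):
<?php
// Create a one column InnoDB table and fill it with 30k random length strings
$db = new mysqli("localhost", "user", "password", "test");
$db->query("CREATE TABLE bench_strings (str varchar(255) NOT NULL) ENGINE=InnoDB");
for ($i = 0; $i < 30000; $i++) {
    $len = mt_rand(10, 255);
    $str = substr(str_repeat(md5(mt_rand()), 8), 0, $len);
    $db->query("INSERT INTO bench_strings (str) VALUES ('" . $db->real_escape_string($str) . "')");
}
?>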



So, that is what I found. The memory usage graphs are all what I thought they would be. PDO has more overhead by its nature. Storing the result always uses more memory than using it. mysqli_result::fetch_all uses less memory than the loop, but more than directly using the results.

There are some very surprising things in the timing graphs, however. First, the tried and true method of using the result followed by a loop is clearly still the right choice with libmysqlclient. However, it is a horrible choice for mysqlnd. I don't really see why this is so. It is nearly twice as slow. There is something really, really wrong with MYSQLI_USE_RESULT in mysqlnd. There is no reason it should ever be slower than storing the result and then reading it again. This is also evidenced in the poor performance of PDO (since even PDO uses mysqlnd when it is enabled). PDO uses an unbuffered query for its fetchAll method and it too got slower. It is noticeably slower than libmysqlclient. The good news, I guess, is that if you are using mysqlnd, the fetch_all method is the best option for getting all the data back.

Next Steps

My next steps from here will be to find some real workloads that I can test this on. Phorum has several places where I can apply real world page loads to these different methods and see how they perform. Perhaps the test data is too small. Perhaps the number of columns would have a different effect. I am not sure.

If you are reading this and have worked on or looked at the mysqlnd code and can explain any of it, please feel free to comment.