Google is breaking the Web

So, Google announced today that they would start doing a couple of new things. First, they are going to start sending all logged-in users to their SSL-enabled search page. Second, they claim they are going to stop sending search terms to sites in the referring URL. They claim all of this is for your security. The first one I buy. SSL is better for people on public wifi, no doubt. The second, I don't think so.

Let's back up a step. For those that don't know how the web works, here is a quick lesson. When you click a link on a site, your browser connects to the new site to get the page. Part of that communication is telling the new site what page you were on when you clicked the link. This is a good thing. No information is ever passed between Google and your site. It's all between the browser on your computer and the site you are asking your computer to load. It helps site owners know who is linking to them. In the case of search engines, the referring URLs often contain the search terms someone typed in to find the site. That is helpful for lots of reasons, and none of them involve a user's security.
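
To make that concrete, here is a minimal PHP sketch of what a site could do with the referring URL. The q parameter is where Google puts the search terms (you will see it in the URLs below); the log message is just for illustration.

$referrer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : "";
if ($referrer) {
    // Break the referring URL into host, path, query string, etc.
    $parts = parse_url($referrer);
    if (!empty($parts["host"]) && strpos($parts["host"], "google.") !== false) {
        $query = array();
        if (!empty($parts["query"])) {
            parse_str($parts["query"], $query);
        }
        if (!empty($query["q"])) {
            // This is what the visitor typed into Google, e.g. "dealnews"
            error_log("Search terms from Google: " . $query["q"]);
        }
    }
}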

OK, so, Google claims they are going to remove your search terms. But my tests show they are removing the whole referring URL. That's right: you will not even know which users are coming from Google. Let me show you. This is what I did:
  1. I typed http://www.google.com/ into my browser
  2. I searched for dealnews
  3. I clicked on the first link, which is the dealnews.com front page.
Using a tool called HTTPFox, I am able to see what information is being passed between my computer and the web sites. This is what I see:
  1. http://www.google.com/ with no referring URL
  2. http://www.google.com/#hl=en&sugexp=kjrmc&cp=5&gs_id=j&xhr=t&q=dealnews&qe=ZGVhbG4&qesig=YtB_HodN2qCOIiqwx_wetA&pkc=AFgZ2tlle01GJ99f38Ol-HvrY0sbiq4vzJfAPDSXGQ2js5QqyHGJ9-5HIgoFXbUujrU81pfyhEVO8jpmFouC09MG1fRbqd0GVA&pf=p&sclient=psy-ab&site=&source=hp&pbx=1&oq=dealn&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=7b65204da701ddb7&biw=1295&bih=1406 with no referring URL, because Google uses JavaScript to load the search results.
  3. http://www.google.com/url?sa=t&source=web&cd=1&sqi=2&ved=0CCwQFjAA&url=http%3A%2F%2Fdealnews.com%2F&rct=j&q=dealnews&ei=EPOdTtaUN4XOiAKZlIntCQ&usg=AFQjCNEN2YJ8XgSAJm6FOUqK2PuBUOkfxA&sig2=N2jBSsJb8sgPsrTkGgFCfw&cad=rja with a referring URL of http://www.google.com/
  4. http://dealnews.com/ with a referring URL of http://www.google.com/url?sa=t&source=web&cd=1&sqi=2&ved=0CCwQFjAA&url=http%3A%2F%2Fdealnews.com%2F&rct=j&q=dealnews&ei=EPOdTtaUN4XOiAKZlIntCQ&usg=AFQjCNEN2YJ8XgSAJm6FOUqK2PuBUOkfxA&sig2=N2jBSsJb8sgPsrTkGgFCfw
As you can see, the request to http://dealnews.com/ was sent a URL by my browser telling that site that Google was linking to dealnews. In that URL you will see q=dealnews. That is the search term I typed into Google. Now, let's see what happens when I do the same thing over SSL.
  1. https://www.google.com/ with no referring URL
  2. Redirected to https://encrypted.google.com/ with no referring URL
  3. https://encrypted.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=f&xhr=t&q=dealnews&tok=wzChADhZTTjwPuXR1iOwSA&pf=p&sclient=psy-ab&site=&source=hp&pbx=1&oq=dealnews&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=47f2f62d0e6da959&biw=1295&bih=1406 with no referring URL, because Google uses JavaScript to load the search results.
  4. https://encrypted.google.com/url?sa=t&source=web&cd=1&sqi=2&ved=0CCsQFjAA&url=http%3A%2F%2Fdealnews.com%2F&rct=j&q=dealnews&ei=x_edTvjlGeKviQKzmdHqCQ&usg=AFQjCNEN2YJ8XgSAJm6FOUqK2PuBUOkfxA&sig2=OEhW8Z_BhHcCboIzu_Z2zQ with a referring URL of https://encrypted.google.com/
  5. http://dealnews.com/ with no referring URL.
So, if dealnews.com does not get a referring URL, how do we know this visit came from Google? Here is the quote from the Google blog post:
When you search from https://www.google.com, websites you visit from our organic search listings will still know that you came from Google, but won't receive information about each individual query.
I ask you: how will a site know that if there is no referring URL? Referring URLs are a fundamental part of the web. If Google wants to strip data off the URL, that is one thing. It is not great, IMO, but whatever. But not sending referrers at all is just wrong and should be changed.

If you care, please share this post. Tweet it, +1 it, whatever. This is just bad news for the web.

Edit: I wanted to make sure everyone knew that I observed the same behavior in both Firefox 7 and the latest Google Chrome.

Edit 2: I have also confirmed with the Apache access logs that no referring URL was sent.

Sharing gotchas on Facebook and Twitter

I have been working on adding some sharing features to dealnews.com. Dealing with Facebook and Twitter has been nothing if not frustrating. Neither one seems to understand how to properly deal with escaping a URL. At best, they do it one way, but not all ways. At worst, they flat out don't do it right. I thought I would share what we found out so that someone else may be helped by our research.

Facebook

Facebook has two main ways to encourage sharing of your site on Facebook. The older way is to "Share" a page. The second, newer, cooler way to promote your page/site on Facebook is with Facebook's Like button. Both have the same bug. I will focus on Share, as it is easier to show examples of sharing. To share, you make a link and send the user to a special landing page on Facebook's site. But let's say my URL has a comma in it. If it does, Facebook just blows up in horrible fashion. The users of Phorum have run into this problem too. In Phorum, we dealt with register_globals in a unique way long ago: we just don't use traditional query strings in our URLs. Instead of the traditional var1=1&var2=2 format, we decided to use a comma-delimited query string. 1,2,3,var4=4 is a valid Phorum URL query string.

According to RFC 3986, a query string is made up of:
query = *( pchar / "/" / "?" )
where pchar is defined as:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
and finally, sub-delims is defined as:
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / 
"*" / "+" / "," / ";" / "="
That is RFC talk for "A query string can have an un-encoded comma in it as a delimiter." So, in Phorum we have URLs like http://www.phorum.org/phorum5/read.php?61,145041,145045. That is the post in Phorum talking about Facebook's problem. It is a valid URL. The commas do not need to be escaped. They are delimiters much like an & would be in a traditional URL. So, what happens when you share this URL on Facebook? Well, a share link would look like http://www.facebook.com/share.php?u=http%3A%2F%2Fwww.phorum.org%2Fphorum5%2Fread.php%3F61%2C146887%2C146887. If I go to that share page and then look in my Apache logs I see this:
66.220.149.247 - - [18/Nov/2010:00:47:51 -0600] "GET /phorum5/read.php?61%2C146887%2C146887 HTTP/1.1" 302 26 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
Facebook sent %2C instead of a comma? It decoded the other stuff in the URL: the slashes, the question mark, all of it. So, what is their deal with commas? Well, maybe I can get around it by not sending an encoded URL to the share page. Nope, same thing. So, they are proactively encoding commas in URL query strings.

This has two effects. The first is that the share app attempts to pull in the title, description, etc. from the page. In this case, we redirect the request, as the query string is invalid for a Phorum message page. So, Facebook ends up getting the main Phorum page. In the case of dealnews, we usually throw a 400 HTTP error when we get invalid query strings. Neither of these gets the user what he wanted. The second problem is that the URL that is clickable once the user has shared it is not valid. So, the whole thing was just a huge waste of time.

I have submitted this to the Facebook Bugzilla. The only workaround is to use a URL shortener or to not use commas in your URLs. Just make sure the shortener does not use commas. I guess you could use special URLs for Facebook that use some character besides a comma and are then redirected to the real URL with commas. I don't know what that character would be; I am just guessing.
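
Another option, if you control the server, is to undo Facebook's over-encoding before your application ever parses the query string. This is only a sketch of that idea, not code we actually run:

// Hypothetical workaround: turn %2C back into commas and redirect to
// the proper URL so only one version of the page gets requested.
if (isset($_SERVER["QUERY_STRING"]) && stripos($_SERVER["QUERY_STRING"], "%2C") !== false) {
    $fixed = str_ireplace("%2C", ",", $_SERVER["QUERY_STRING"]);
    $path  = strtok($_SERVER["REQUEST_URI"], "?"); // the URI without the query string
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: http://" . $_SERVER["HTTP_HOST"] . $path . "?" . $fixed);
    exit();
}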

Twitter

Twitter's issues stem from the transition from their old interface to their new one. Twitter is in the process of (or is done with) rolling out a new UI on their site. The link on the old site to share something on Twitter was something like http://twitter.com/home?status=[URL encoded text here]. This worked pretty darn well. You could put any valid URL-encoded text in there and it worked. However, that URL now redirects you to the new interface's way of updating your status, and they don't encode things right.

If I want to tweet "I love to eat pork & beans", I would make the URL http://twitter.com/home?status=I+love+to+eat+pork+%26+beans. Twitter then takes that, decodes the query string, and redirects me to http://twitter.com/?status=I%20love%20to%20eat%20pork%20&%20beans. The problem is that they did not re-encode the &. It sits bare in the URL. So, when I land on my Twitter page, my status box just says "I love to eat pork ". Which, while true, is not what I meant to tweet. This bug has been submitted to Twitter, but it has yet to be fixed.

The second problem is with the new site and how it deals with validly encoded spaces. Spaces can be escaped two ways in a URL. The first, older way (which the PHP function urlencode uses) is to encode spaces as a plus (+) sign. This comes from the standard for how forms submit (or used to submit) data. It is understood by all browsers. The second way comes from the later RFCs written about URLs. They state that a space in a URL should be escaped like any other character, by replacing it with %20. The old Twitter UI would accept either one just fine. And if you send + to the old status update URL, it will redirect you (see above) with %20 in the URL instead of +. However, if you send + to the new Twitter UI, as above, you get "I+love+to+eat+pork+&+beans" in your status box. The only solution is to not send + as an encoding for space to Twitter. In PHP, you can use the function rawurlencode to do this. It conforms to the RFCs on URL encoding. Doing so with the new linking pattern generates the URL http://twitter.com/?status=I%20love%20to%20eat%20pork%20%26%20beans, which works great. This was also reported to Twitter as a bug by our team.
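
Here is a quick way to see the difference between the two PHP functions:

$status = "I love to eat pork & beans";

// Form-style encoding: spaces become +, which the new Twitter UI mangles
echo "http://twitter.com/?status=" . urlencode($status) . "\n";
// http://twitter.com/?status=I+love+to+eat+pork+%26+beans

// RFC-style encoding: spaces become %20, which works
echo "http://twitter.com/?status=" . rawurlencode($status) . "\n";
// http://twitter.com/?status=I%20love%20to%20eat%20pork%20%26%20beans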

So, maybe that will help someone out who is having issues with sharing their site on the two largest social networks. Good luck with your social media development.

ob_start and HTTP headers

I was helping someone in IRC deal with some "headers already sent" issues and told them to use ob_start. Very diligently, the person went looking for why that was the right answer. He did not find a good explanation. I looked around, and I did not find one either. So, here is why this happens and why ob_start can fix it.

How HTTP works

HTTP is the communication protocol spoken between your web server and the user's browser. Without going into too much detail, the data is broken into two pieces: headers and the body. The body is the HTML you send. But before the body is sent, the HTTP headers are sent. Here is an example of an HTTP response, including headers:
HTTP/1.1 200 OK
Date: Fri, 29 Jan 2010 15:30:34 GMT
Server: Apache
X-Powered-By: PHP/5.2.12-pl0-gentoo
Set-Cookie: WCSESSID=xxxxxxxxxxxxxxxxxxxxxxxxxxxx; expires=Sun, 28-Feb-2010 15:30:34 GMT; path=/
Content-Encoding: gzip
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Ramblings of a web guy</title>
.
.
So, all those lines before the HTML starts have to come first. HTTP headers are where things like cookies and redirection happen. Once a PHP script starts to send HTML out to the browser, the headers are finished and the body begins. When your code tries to set a cookie after this has started, you get the "headers already sent" error message.
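
Here is the classic way to trigger the error. The echo starts the body, so the setcookie call that follows is too late:

echo "Hello, world!";          // output starts; the headers go out now
setcookie("WCSESSID", "1234"); // Warning: Cannot modify header information -
                               // headers already sent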

How ob_start works

So, how does ob_start help? The ob in ob_start stands for output buffering. ob_start will buffer the output (HTML) until the page is completely done. Once the page is done, the headers are sent, followed by the buffered output. This means any calls to setcookie or the header function will not cause an error, and the headers will be sent to the browser properly. You do need to call ob_start before any output occurs. If output has already started, it is too late.
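
The same code from above works fine once output buffering is turned on:

ob_start();                    // start buffering all output
echo "Hello, world!";          // goes into the buffer, not to the browser
setcookie("WCSESSID", "1234"); // no problem; no headers have been sent yet
ob_end_flush();                // send the headers, then the buffered output

You don't even have to call ob_end_flush yourself; PHP flushes any open output buffers when the script ends.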

The downside

The downside of doing this is that the output is buffered and sent all at once. That means the time between the user's request and the time the first byte gets back to the user is longer than it has to be. However, in modern PHP application design, this is often already the case. An MVC framework, for example, does all of its data gathering before any presentation happens. So, your application may not have any issue with this.

Another downside is that you (or someone) could get lazy and start throwing setcookie calls in any old place. This should be avoided. It is simply not good programming practice. In a perfect world, we would not need output buffering to solve this problem for us.

Enhance your 404 Pages with Google

I was just poking around in Google Webmaster Tools and came across a neat idea. Google has a snippet of JavaScript you can put on your 404 error pages that will give the user more information based on Google's index of your site. It can offer the user things like the closest match to the URL they were trying to find, an alternative URL, a link to the site's site map and, last but not least, a Google search box that searches the site in question. You have to have your site set up in Webmaster Tools to use it.

ForceType for nice URLs with PHP

This has been covered before, but I was just setting up a new ForceType on our servers and thought I would mention it for the fun of it. You see lots of stuff about using mod_rewrite to make friendly or SEO-friendly URLs. But if you are using PHP (and, I assume, other Apache modules), you can do it without mod_rewrite. We have been doing this for a while at dealnews, even before SEO was an issue.

Setting up Apache

From the docs, the ForceType directive "forces all matching files to be served as the content type given by media type." Here is an example configuration:

<Location /deals>
ForceType application/x-httpd-php
</Location>


Now any URL like http://dealnews.com/deals/Cubicle-Warfare/186443.html will attempt to run a file called deals that is in your document root.

Making the script

First, save a file called deals, without the .php extension, in your document root. Modern editors will look for the <?php tag at the top and will highlight the code correctly. Normally, you take input into your PHP scripts with the $_SERVER["QUERY_STRING"] or $_GET variables. But in this case, those are not filled by the URL above. They will still be filled if there is a query string, but the path part is not included. We need to use $_SERVER["PATH_INFO"]. For the URL above, $_SERVER["PATH_INFO"] will be filled with /Cubicle-Warfare/186443.html. So, you will have to parse the data yourself. In my case, all I need is the numeric ID toward the end.

// basename() reduces "/Cubicle-Warfare/186443.html" to "186443.html" and
// the (int) cast keeps only the leading digits.
$id = (int)basename($_SERVER["PATH_INFO"]);

Now I have an id that I can use to query a database or whatever to get my content.

Avoid "duplicate content"

The bad part of my use case is that any URL that starts with /deals/ and ends in 186443.html will work. So, now we have duplicate content on our site. You may have a more exact URL pattern and not have this issue. But to work around this in my case, we should verify that $_SERVER["PATH_INFO"] is the proper data for the content requested. This code will vary depending on your URLs. In my code, I generate the URL for the content and see if it matches the request. Kind of a reverse lookup on the URI. If it does not match, I issue a 301 redirect to the proper location.

header("HTTP/1.1 301 Moved Permanently");
header("Location: $new_url");
exit();
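
Pulled together, the whole check looks something like this. make_deal_url is a made-up name standing in for however your application generates its canonical URLs:

$id = (int)basename($_SERVER["PATH_INFO"]);

// Rebuild the one true URL for this id and compare it to the request.
$new_url = make_deal_url($id); // e.g. /deals/Cubicle-Warfare/186443.html

if ("/deals" . $_SERVER["PATH_INFO"] !== $new_url) {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: $new_url");
    exit();
}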


Returning 404

Now, you have to be careful to always return meaningful status codes when using this technique. Search engines won't like you if you return a 200 status for every possible random URL that falls under /deals. I know that Yahoo! will put random things on your URLs to see if you are doing the right thing. So, if you get your ID and decide the URL is not valid, you can return a 404. In my case, I have a 404 file in my document root, so I just send the proper headers and include my regular 404 page.

header('HTTP/1.1 404 Not Found');
header('Status: 404 Not Found');
include $_SERVER["DOCUMENT_ROOT"]."/404.html";
exit();