HTML vs. XHTML and validation

Tue, Mar 10, 2009 06:08 PM
There is no shortage of pages on the internet that talk about HTML vs. XHTML.  The vast majority of them (at least in the first few pages of Google results) seem to favor XHTML.  I don't really have an agenda, so I thought I would post my thoughts on the topic.

I have stated on this blog that I use HTML 4.01 Transitional.  I do so because it is easiest for me.  Some people argue that XHTML is easier because there are set rules and, if you violate those rules, the document will not render.  Is that a good thing?  Perhaps my time in the late 90s has made my mind work differently than newcomers to the World Wide Web.

The browser wars were ugly.  And I mean literally ugly.  If you wanted to do anything fancy, it required lots of images or compromise.  I learned early on that it was OK that the spacing in IE on my PC was larger than in IE on the Mac.  The fonts were all different sizes from browser to browser and OS to OS.  I learned that graceful fallback was part of the web.  Even now, dealnews.com looks "adequate" in IE 6.  I could make it look perfect.  But the declining traffic from IE 6 does not merit the time it would take to fix its rendering errors.

So, when I start thinking about HTML vs. XHTML, I want the more flexible of the two.  I find syntax like nowrap='nowrap' very annoying in XHTML, especially since I can't write nowrap='yeswrap' and have it mean anything.  nowrap=1 I could handle.  But no, it has to be nowrap='nowrap'.  Geez.
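
For the uninitiated, here is the difference I am grumbling about (a made-up snippet, not from any real page).  HTML lets you minimize the boolean attribute; XHTML makes the attribute repeat its own name as the value:

    <!-- HTML 4.01: the minimized form is fine -->
    <td nowrap>no line breaks here</td>

    <!-- XHTML 1.0: the attribute has to be spelled out -->
    <td nowrap="nowrap">no line breaks here</td>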

OK, OK, this is turning into an XHTML hate post.  I don't want to do that.  There are some things about XHTML that I do like.  I like the self-closing tags.  My OCD (which I have brought up before) has never liked having an open tag without a closing tag, so the <br /> format is appealing to me in that sense.  I love that XHTML elements should always be lower case.  I hate upper case HTML.  It just reads funny, like camel case function names.  Some folks on our content team used to use Adobe PageMaker to write up deals.  They would copy and paste the HTML from there into our CMS.  The output would be pretty ugly.
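
Here is a made-up before-and-after of what I mean (not actual PageMaker output):

    <!-- the upper case, never-closed style that reads funny to me -->
    <P>First deal<BR>
    Second deal

    <!-- the XHTML style: lower case, everything closed -->
    <p>First deal<br />
    Second deal</p>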

So, I like parts of both.  What is interesting to me is the fact that the "big sites" on the internet don't seem concerned with document types or validation.

Site                  | DocType                 | Validates
Google                | None                    | No
Yahoo                 | HTML 4.01 Strict        | No
Live.com (Microsoft)  | XHTML 1.0 Transitional  | No
MSN.com               | XHTML 1.0 Strict        | Yes
Facebook              | XHTML 1.0 Strict        | No
eBay                  | HTML 4.01 Transitional  | No
YouTube               | HTML 4.01 Transitional  | No
Amazon.com            | None                    | No
Wikipedia             | XHTML 1.0 Strict        | Yes
MySpace               | XHTML 1.0 Transitional  | No

So, of the 10 most popular sites on the internet (according to Compete.com), two don't include a document type on their front page at all.  Only two of the sites validate according to the W3C.  MSN and Wikipedia both validated on their front page with XHTML 1.0 Strict.  However, neither is sending a Content-Type of application/xhtml+xml.  According to this page, that is a bad thing.  And the search results page for XHTML on MSN.com did not validate.  Kudos to Wikipedia.  Their page on XHTML does validate.  Interestingly, they switch to XHTML 1.0 Transitional for that page.
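
For reference, here is the shape of that difference (a sketch, not a capture of what any of those sites actually send).  The doctype lives in the markup; the Content-Type is an HTTP response header:

    What I serve (HTML 4.01 Transitional over text/html):

        Content-Type: text/html; charset=utf-8

        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
            "http://www.w3.org/TR/html4/loose.dtd">

    What "real" XHTML 1.0 Strict is supposed to look like:

        Content-Type: application/xhtml+xml; charset=utf-8

        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">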

So, is the internet broken?  No.  The most important validation is that of your users.  Can they use the site?  Does the site look right in their browser?  Most sites have much bigger navigation and content issues than they do document structure.

So, my idea of validation is this: does the page render the same (or damn near) in the browsers that cover 90% of internet users?  If so, then your page validates.  And the only way to check that (short of SkyNet) is the human eye.
6 comments

Daniel Laughland Says:

The nowrap command should be in your CSS anyways. ;-)

Chris Shiflett Says:

One good reason to adhere to standards is that modern browsers will render your pages using standards mode, making things like XSS so much easier to protect against. Almost all of the weird browser bugs that make XSS such a difficult problem are only applicable when a page is rendered in quirks mode.

Also, this is our craft, so we should be good at it. Markup is pretty easy, so I've never understood why people feel the need to justify not learning it. It's probably easier to adhere to standards than it is to explain why you don't. :-)

Brian Moon Says:

@Daniel: get cellspacing into CSS and then you can pry table attributes from my hands.
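
For the record, I know the CSS versions exist.  A sketch (the selectors are made up, and browser support is exactly my gripe):

    /* what Daniel means: the CSS replacement for the nowrap attribute */
    td.deal-title { white-space: nowrap; }

    /* the CSS replacement for cellspacing="0" -- if the browser honors it */
    table.deals { border-collapse: separate; border-spacing: 0; }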

@Chris: Yes, it is a craft and not a perfect science. Unfortunately, while XSS may be helped by standards mode, browser rendering is not. Markup is not easy. Personal web sites are easy. Brochure sites with 5 pages are easy. Combining the HTML hackings of 30 lay HTML writers (whose primary focus is getting content out, not your sacred document type) with 11 years of a code base (that you would love to redo if only time allowed) is not easy. I would love to start fresh every year on a new code base. Contractors and consultants have that luxury. Lead architects who have to answer to a CEO and ultimately to shareholders do not. I think you see that same sentiment reflected in the list of sites I mention above. The for-profit sites are getting stuff done and not worrying about validation. The hobbyist site has perfect validation (but I know the shoestring they run those servers on).

FWIW, FF does use Standards Compliance Mode for dealnews.com. So, I am not saying one should not have goals. But, it is not as simple in the real world as wanting it. Damn, I wish it was.

Chris Shiflett Says:

I don't disagree with what you're saying, except I do think markup is easy to get right.

I try to instill a sense of responsibility in the people I work with by setting a good example. Making sure a page validates is no more difficult than making sure code compiles. Markup isn't my primary skill, and even I can manage. It makes browsers render using standards mode, which is easier for both the security-conscious developer and the designer striving for consistency.

True expertise is required to maintain consistency among the various rendering engines while adhering to best practices and standards. (I would argue that it also takes true expertise to be a good PHP developer.) I admire such expertise and will never belittle it in the guise of pragmatism.

Fixing legacy markup is just as hard as fixing legacy code, so no arguments there. However, consider this one final thought:

Would you be as adamant against someone fixing notices in their PHP code as you are against someone fixing errors in their markup? Isn't it the least we can do, barring obstacles that require an undesirable concession?

Brian Moon Says:

I am in no way against any of it. Perhaps I came off that way. I seem to have a knack for making people think I am against something. I need to work on that. As you can see, I just committed validation fixes for this very blogging application: http://code.google.com/p/wordcraft/source/detail?r=158.

As for PHP notice errors... uh, let's just not go there. I keep swearing that by PHP 6, dealnews will be PHP notice free. Again, years and years of coding and a culture that for a long time thought notices were dumb. You like that technical term? Dumb?

My only point is that the quest for HTML/XHTML perfection can, like all passions, lead people away from the importance of getting things done. Sometimes getting things done is more important. I really meant the post as more of an observation than an opinion piece. But, my dumb ass opinion always pops its head in. I swore 2009 was going to be the kinder, gentler Brian. I am failing so far.

In a perfect world, all the code I touch would be PHP notice free and validate perfectly. FWIW, my new CMS interface for the dealnews writers will use YUI's editor, which outputs HTML 4.01, and I will run its output through Tidy for good measure. So, that problem will finally go away. The notice errors from years of ignoring notices will take time, however. I am slowly turning them on in the Apache configs for new projects throughout our codebase.
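
The Tidy pass will be something like this (just a sketch; tidy_repair_string and these options are standard PHP Tidy, but the exact config is a guess at what we will settle on):

    <?php
    // Clean up whatever the editor hands us before it goes into the CMS.
    $config = array(
        'output-html'    => true,           // plain HTML, not XHTML
        'doctype'        => 'transitional', // match our HTML 4.01 Transitional doctype
        'wrap'           => 0,              // don't re-wrap long lines
        'show-body-only' => true,           // we only store the fragment
    );

    $clean = tidy_repair_string($dirty_markup, $config, 'utf8');
    ?>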

Ben Ramsey Says:

The major religious wars I know of don't really center on HTML vs. XHTML but rather on the media types text/html vs. application/xhtml+xml. So, you'll see people refer to "real" XHTML as XHTML that is properly served as application/xhtml+xml. The main arguments against serving this media type are these: 1) Internet Explorer will choke on application/xhtml+xml and 2) if the browser supports this media type but the XML is not well-formed, then the browser will choke because this media type ensures strict handling of the mark-up.

Maintaining well-formed XHTML is no easy task... especially if you're running any site that allows user input, and this includes your own blog, forums, CMSes, etc.

It is possible, though, to serve XHTML properly. Sam Ruby does it every day from his blog. However, it's very difficult to do by concatenating strings, which is how most of us build HTML pages. The best way to generate well-formed XHTML is to serialize it, but there are no good XHTML serializers, and serializing means the content must either be written in HTML by the authors, parsed, and then re-serialized, or the authors must use some other kind of mark-up that is parsed and then serialized as XHTML.

In short, almost every site claiming to validate as XHTML is really serving its content as text/html because of these problems. Despite this, I agree with Chris that we should strive to create well-formed mark-up to spec, and that means writing well-formed XHTML.
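
The usual workaround for the IE problem is content negotiation on the Accept header, roughly along these lines (just a sketch, not how Sam Ruby or anyone else actually does it):

    <?php
    // Send "real" XHTML only to browsers that say they accept it;
    // everyone else (notably IE) gets text/html.
    $accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';

    if (strpos($accept, 'application/xhtml+xml') !== false) {
        header('Content-Type: application/xhtml+xml; charset=utf-8');
    } else {
        header('Content-Type: text/html; charset=utf-8');
    }
    ?>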

The Habari Project has a good write-up on the text/html vs. application/xhtml+xml issue:
http://wiki.habariproject.org/en/XHTML_vs_HTML
