Best practices for escaping HTML

Fri, Mar 20, 2009 10:55 PM
I am working on Wordcraft, trying to get the last annoying HTML validation errors worked out. Things like ampersands in URLs. In doing so, I am asking myself where the escaping should take place. In the case of Wordcraft, there are several parts to it.
  1. The code that pulls data from the database.  Obviously not the right place.
  2. The code that formats data like dates and such.  It also organizes data from several data sources into one nice tidy array.  Hmm, maybe.
  3. The parts of the code that set up the output data for the templates.
  4. The templates themselves.
Now, I am sure 1 is not the place.  And I really would not want 4 to be the place.  That would make for some ugly templating.  Plus, the templates, IMO, should assume the data is ready to be output.  So, that leaves the code that does the formatting and the code that does the data setup.

Of those two, I guess the place to do this job is in the data setup.  Wordcraft has a $WCDATA array that is available in the scope of the templates.  I suppose anything that goes into that array should be escaped as appropriate.
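As a rough illustration of escaping at the data setup step, here is a minimal sketch in Python rather than Wordcraft's PHP. The function name `build_template_data` and the sample `post` data are made up for the example; the idea is only that every string is escaped once, as it goes into the array the templates can see.

```python
from html import escape


def build_template_data(raw):
    """Return a copy of raw with every string HTML-escaped.

    Recurses into dicts and lists; non-string values pass through unchanged.
    """
    if isinstance(raw, str):
        return escape(raw, quote=True)
    if isinstance(raw, dict):
        return {key: build_template_data(value) for key, value in raw.items()}
    if isinstance(raw, (list, tuple)):
        return [build_template_data(item) for item in raw]
    return raw


# Hypothetical data as it might come out of the formatting layer.
post = {"title": "Tom & Jerry <3", "tags": ["a&b"], "views": 12}

# Stand-in for the $WCDATA array: the only data the templates get to see.
wcdata = build_template_data(post)
print(wcdata["title"])
```

The trade-off is that everything in the array is now HTML-specific, so any other output format has to start from the raw data instead.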

I largely wrote this blog post as a teddy bear exercise.  But, I am curious.  Where and when do you escape your data for use in HTML documents?
16 comments

Kyrre Says:

I would fix issues with strings before they go into the database in the first place. One needs to be sure the fix is the correct one though. Changing the function later will not help historical data in any way.

Adam Jacob Says:

This really depends on what kind of escaping, and what other formats the data is going to be displayed in. Ideally, you will have a model that allows for an object to be rendered in multiple different formats, from the same raw data. That means you shouldn't escape the object going in to the database, or in the model (where you are pulling it out). Doing it in the templates themselves is tempting, but I think you are correct in saying it's wrong (for example, what if you are going to display the data as a PDF? Where is the template now?)

That leaves two places you discussed, 'the code that does the formatting' and 'the code that sets up the template'. The code that does the formatting seems like a plausible case, except that we're back to the 'what am I rendering this as' problem, and ideally you wouldn't know anything about that at the layer where you are putting together the data structure.

So that leaves us with the location you came up with - 'the place where the data is set up for rendering'. In the nicest of frameworks I've seen, this is actually between the controller and the view. You create an object that responds to the content type you are about to render and knows how to deal with escaping the data on the way out. Just my $0.02 :)

Andrei Says:

I like the Django way of solving this issue. All variables that are used in the template have a default escape modifier. This is OK because most of the time you want to escape your output.

You can read more about autoescape at http://docs.djangoproject.com/en/dev/ref/templates/builtins/?from=olddocs#autoescape

Andrei Says:

Not post related: I believe you have a bug in your URL validation code.

Roland Bouman Says:

Hi!

"trying to get the last annoying HTML validation errors worked out. Things like ampersands in URLs."

I am trying to figure out what is special about ampersands in URLs with regard to escaping.

I mean this: if you have a proper URL the only ampersands in there appear in the "query" part as parameter delimiters. Any literal ampersands in the path, resource, query and fragment should already be "percent hex hex" escaped (else, it would not be a valid URL).

Now, in the (X)HTML output, the attribute value should simply conform to (X)HTML syntax, which means you are always safe when you turn ampersands into entity references ("ampersand amp semi-colon").

So, you should take care to store only valid URLs, including proper "percent hex hex" escaping for literal ampersands, and preserve the meaningful ampersands that delimit parameters. This way you only have to worry about making whatever output you have conform to (X)HTML.
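The two layers can be sketched like this, in Python for brevity rather than PHP. The path and parameters are made up; `quote`, `urlencode` and `escape` are standard library calls. Literal ampersands are percent-encoded while the URL is built, and the delimiting ampersands become entity references only when the URL goes into (X)HTML.

```python
from html import escape
from urllib.parse import quote, urlencode

# Hypothetical search page: a path segment and a parameter value that both
# contain a literal ampersand.
params = {"q": "fish & chips", "page": "2"}

# Step 1: build a valid URL. Literal ampersands become %26; the ampersand
# between parameters stays, because it is a meaningful delimiter.
url = "/search/" + quote("R&D") + "?" + urlencode(params)

# Step 2: only when writing the URL into (X)HTML, turn the remaining
# ampersands into entity references.
href = escape(url, quote=True)

print(url)
print(href)
```

The stored value is `url`; `href` exists only for the HTML output and would be wrong in an email or a plain-text file.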

As for the choice between 2 and 3: it seems to me that things like "formatting dates" have to do with creating an acceptable human readable representation. It could for example change according to user preference, locale etc. The escaping of ampersands to conform to the syntax rules of the output format is in a different league. Suppose your output would not be (X)HTML, but PDF or text, then you still want to format dates and such in 2, but you definitely do not want to include the "technical" formatting that happens to be required for the output format. So, I would certainly try to not mix these different kinds of "formatting".

(In other words escaping ampersands has nothing to do with formatting, it has to do with conforming to the pertinent data exchange format. If the data exchange format changes, the formatting should remain the same)

So I guess that if you want a choice between 2 and 3, and 2 is where you do your "formatting", it should be in 3.

kind regards,

I hope it helps.

David Says:

I'd go with 4. The template is really the only part that knows what format the output will be in. This could be applied globally to all values passed to the template (maybe this is the 3 you are referring to?), or done on an individual basis. In reference to ampersands, they should be escaped as HTML entities. Yes, even the ampersands in the query string.
see: http://www.htmlhelp.com/tools/validator/problems.html#amp

gasper_k Says:

3 or 4. Escaping, just like date and number preparation, belongs in the View layer, because it completely depends on what kind of output you are doing. In your case, it's obviously only HTML, but in cases like this, you should always think: What if I change the template to output, e.g., a CSV table? Would changing the template suffice? If you have escaping in the template, then yes, otherwise you'd have to change your other code, too, which is a code smell. And the same goes for date/number formatting, because they're output-dependent. If you're outputting dates/numbers in HTML for a user, you'd format them one way, but you'd do it differently for RSS, CSV, or whatever else a computer has to understand. And you should be able to keep the same business logic in both cases.

In Symfony, you pass unescaped variables to the template, but the framework gives you escaped ones within the template. This way, it's hidden from the developer, and it works very well.
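To show the general idea of escape-by-default, here is a small Python sketch. It is not Symfony's or Django's actual API; the `EscapedVars` class and its `raw` method are invented for illustration. The controller hands over raw values, and the template only ever sees escaped ones unless it explicitly asks otherwise.

```python
from html import escape


class EscapedVars:
    """Wrap template variables so that lookups return HTML-escaped strings."""

    def __init__(self, variables):
        self._variables = variables

    def __getitem__(self, name):
        value = self._variables[name]
        # Strings are escaped on the way out; other types pass through.
        return escape(value, quote=True) if isinstance(value, str) else value

    def raw(self, name):
        """Explicit opt-out for the rare value that is already safe markup."""
        return self._variables[name]


# The controller passes unescaped data...
view = EscapedVars({"comment": "<b>hi</b> & bye"})

# ...and the template gets the escaped version by default.
line = "<p>{}</p>".format(view["comment"])
print(line)
```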

PaulG Says:

Andrei previously said
>I like the django way of solving this issue. All variables that are used in the template have a default escape modifier. This is ok because most of the time you want to escape your output.

There is a port of the Django template language to PHP called H2O

http://www.h2o-template.org

Although, IIRC, you have to use it like this: {{ myvar|e }}

Marc Gear Says:

I simply always escape output as late as possible, immediately before it is going to be used - in your case I would do it in the template layer. If you want to absolutely guarantee that the code injection attack the user tried is never going to make it onto the page, no matter how that template is used in the future, you have to escape it as you echo/print it.

This prevents you from escaping your output the wrong way - or from assuming that it has already been escaped.

Les Says:

> The template is really the only part that knows what format the output will be in...

Do not put this requirement in your presentation layer; the formatting of your data prior to presentation can easily be done on the immediate layer sitting above your model, just before the data goes to your presentation layer.

The individual who has suggested you use the presentation layer has absolutely no idea of what s/he is talking about - they have their feet well within the Smarty camp based upon that suggestion, as that is just how Smarty does it and it breaks once you come to changing to another method to present your data...

That is why Smarty is bad for PHP and bad for web development, but then again, no real world developer would ever think of using Smarty anyways - it's only 'pretend developers' who use Smarty.

Christ, that sucks :(

Richard Harrison Says:

I escape at #4, although it's a little bit like #3 (I escape in the template, but I don't escape at the point where the variable is output within the template).

My templates are PHP files and let's say the template has access to $vars (no other variables are in the local scope), which stores all the variables that it needs. At the top of the template file I handle all escaping and assign escaped variables to the local scope, i.e. $content = escape($vars['content']). Later on I'll "echo $content" or similar.

This approach helps me assert whether any variable within the scope of the local view (apart from $vars) has been escaped or not. If the template code is dealing with a "local" variable, then the assumption is that it's been escaped; if it's dealing with $vars['foo'] then it's raw, unescaped data.
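The same convention, sketched in Python instead of a PHP template file. The function `render_comment` and the markup are made up for the example; the point is the split between the raw `template_vars` and the escaped locals.

```python
from html import escape


def render_comment(template_vars):
    """Render one comment; template_vars holds raw, unescaped data."""
    # "Top of the template": escape once and assign to local names.
    author = escape(template_vars["author"], quote=True)
    content = escape(template_vars["content"], quote=True)

    # From here on only the locals are used, so anything that is not
    # template_vars[...] can be assumed to be escaped already.
    return '<div class="comment"><h4>{}</h4><p>{}</p></div>'.format(author, content)


html_out = render_comment({"author": "A & B", "content": "<script>x</script>"})
print(html_out)
```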



Mark R Says:

Definitely 4. The view side (template, whatever you call it) is the only thing which can know the context the strings are being output in, therefore the only thing which can escape it correctly.

If you escape it earlier, you won't get it right for every context, so the template developer will then have to un-escape it, and re-escape it for use in a different context from the one the controller developer imagined - or change the controller but then that would break other stuff.

HTML requires things to be escaped differently depending on the context.
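A short Python sketch of what "differently depending on the context" can look like. The sample string is made up, and the JavaScript case is one common approach rather than the only one; all functions used are from the standard library.

```python
import json
from html import escape
from urllib.parse import quote

name = 'Tom & "Jerry" </script>'

# Text between tags: &, < and > are the characters that matter.
html_text = escape(name, quote=False)

# Inside a quoted attribute value: quotes must be escaped as well.
html_attr = escape(name, quote=True)

# As a single URL component: percent-encoding, not entities.
url_part = quote(name, safe="")

# As a string literal inside a <script> block: JSON-encode it, and also hide
# "<" so the value cannot close the script element early.
js_literal = json.dumps(name).replace("<", "\\u003c")

for label, value in [("text", html_text), ("attr", html_attr),
                     ("url", url_part), ("js", js_literal)]:
    print(label, value)
```

One raw value ends up with four different encodings, which is why a single early escape cannot be right for all of them.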

kais Says:

I read this headline differently, and I suppose my best practice suggestion would be "move up into management"...
Needless to say, I have nothing useful to contribute.
Interesting comments, though.

Roland Bouman Says:

Hi Brian!

just curious...did you decide on something in particular? I think the comments provide some interesting views.

rgds, Roland

Brian Moon Says:

@Roland: Well, I decided on a mixture of things really. The way I build URLs in Wordcraft, for example, kind of dictated that I do the escaping on demand in the function that builds the URLs. So, the code that needs the URL tells wc_get_url() if it wants it escaped for use in HTML or not. I sometimes send the URLs out in emails. So, it would not be appropriate to HTML encode the URLs in that case.
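A sketch of that on-demand idea in Python. This is not the real wc_get_url() signature; the function `get_url`, its `for_html` flag and `BASE_URL` are assumptions made up for the example.

```python
from html import escape
from urllib.parse import urlencode

BASE_URL = "http://example.com"  # hypothetical site root


def get_url(path, params=None, for_html=False):
    """Build a URL; HTML-escape it only when the caller asks for that."""
    url = BASE_URL + path
    if params:
        url += "?" + urlencode(params)
    return escape(url, quote=True) if for_html else url


# Plain form, e.g. for the body of an email.
email_link = get_url("/post.php", {"id": 5, "page": 2})

# Escaped form, e.g. for an href attribute in a template.
html_link = get_url("/post.php", {"id": 5, "page": 2}, for_html=True)

print(email_link)
print(html_link)
```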

For most everything else, I escape the data when it goes into the $WCDATA array which is the array the template system has available in local scope. The only non-HTML output other than email is the RSS feeds. Those scripts output valid XML that is escaped at output time within those scripts. They are not part of the template system since they never really change.
