One common concern new users have is that data is not updated the second it goes into the database. That is how things worked before they were caching, and they are used to it working that way. So, when they start caching, they miss that instant gratification. We went through this at dealnews. Our content team could write a deal, go to the front page, and see it right then. They could then move on with their lives. As we grew, it became apparent that we could not do that anymore, and we had to make changes. Did the front page really have to be updated the second we wrote a deal? We discovered the answer was no.
We use three primary techniques to keep our cache as up to date as it needs to be.
With tools like memcached, you can provide a TTL (time to live) for each cached item. This ensures that a particular piece of cache will no longer be used after a given time. We cache data for our front page for 2 minutes. That does not sound like a lot, I know. But we have an 84% hit rate on that cache. So the data is never really that old, and the cache does a wonderful job. For other content that hardly changes, we use a TTL of an hour or even a day. You have to decide, per object, whether this is the right thing to do for your application. TTLs are best, IMO, because you know what to expect. A surge in traffic cannot force the cache to expire more quickly.
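To make the TTL idea concrete, here is a tiny in-memory sketch of a TTL-based cache. It stands in for memcached; the class and key names are made up for illustration, and the TTL values mirror the ones mentioned above:

```python
import time


class TTLCache:
    """Minimal in-memory stand-in for a TTL-based cache like memcached."""

    def __init__(self):
        self._store = {}  # key -> (value, absolute expiration time)

    def set(self, key, value, ttl):
        # Record when this entry stops being valid.
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Past its TTL: treat it as a miss so it gets rebuilt.
            del self._store[key]
            return None
        return value


cache = TTLCache()
cache.set("front_page", "<html>...</html>", ttl=120)      # 2-minute TTL
cache.set("site_settings", {"theme": "default"}, ttl=86400)  # daily TTL
```

Because expiry is a fixed deadline, a traffic spike cannot make entries expire any faster; the worst case is simply more hits on already-cached data.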
If you can't use a TTL, removing, or better yet updating, the cache via code is another option. If we have objects that need to be updated, we will usually update the cache rather than simply expiring it. We usually have a function that returns an object from cache; if it's not there, the function makes the queries and creates the cache. The function will generally have a force option that recreates the cache for the item even if the cache is found. We gave a talk at ApacheCon and wrote a paper that covered this topic in 2001 (see Caching in the Real World on that page). The basics in that paper still hold true for caching today. WARNING! There are a couple, actually. First, if your data is updated constantly and you are doing this on every single insert/update to your database, you are wasting your time. You have to use your cache wisely. Ask yourself, "Does this data have to be real time?" Second, when you come under high load, one expiring item on your page can cause thousands of queries to be run. We experienced a little of this when Yahoo linked to us.
We have been using this method for a while in our ad serving software, and we are now using it more and more. IMO, it's the most surefire way to handle increased load. Basically, the pages of your web site never query the live SQL data when no cache is found. That is what I call a pulled cache. Instead, you push the data from your primary database into some cache (or even another, optimized SQL server) for your web site to use. We are actually using MySQL Cluster for this purpose on our web site. The forward-facing web site hits only the MySQL Cluster. If the data is not there, it's just not there. We have processes on our backend that gather data from our primary database, assemble it for presentation, and populate the cluster. The queries the web site uses to access the cluster are highly optimized. You could do the same with memcached, but memcached is volatile. With Cluster, we have high availability and get about the same performance as we did with a fully cached page.
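The push model splits into two roles, sketched below with a plain dict standing in for the read store (MySQL Cluster in our case). Every name here is illustrative; the point is the shape, not the details:

```python
# The read store the web tier queries. In production this would be
# MySQL Cluster (or memcached, with the volatility caveat noted above).
read_store = {}


def fetch_from_primary():
    # Stand-in for queries against the primary database.
    return [{"id": 1, "title": "Deal one"}, {"id": 2, "title": "Deal two"}]


def populate_read_store():
    """Backend process: gather, assemble for presentation, push."""
    for deal in fetch_from_primary():
        # Pre-assemble exactly what the page will render, so the web
        # tier's reads stay trivial and highly optimized.
        read_store[f"deal:{deal['id']}"] = f"<h2>{deal['title']}</h2>"


def render_deal(deal_id):
    """Web tier: reads only the store. If the data is not there,
    it's just not there -- no fallback query to the primary."""
    return read_store.get(f"deal:{deal_id}", "")


populate_read_store()           # runs on the backend, on its own schedule
html = render_deal(1)           # pre-assembled markup
missing = render_deal(99)       # empty result, zero load on the primary
```

The key property is that front-end load can never reach the primary database: a miss costs one lookup against the read store and nothing more.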
Of course, there are exceptions. Forums are a good example. For them to be useful, it's kind of hard to cache a lot of their content. With Phorum, we do cache things like user profiles, forum settings, and other slow-changing items. But caching messages for any amount of time usually has a low ROI. They update so fast, and if users don't see updates, they lose interest.