In the past we just logged them and then had the log emailed daily. On some unregular schedule we would look through them and fix stuff to be more solid. But, when you start having lots of traffic, one notice error on one line could cause 1,000 error messages in a short time. So, we had to find a better way to deal with them.
Step 1: Managing the aggregation
The first thing we did was modify our error handler to log both human readable logs to the defined php error log and to write a serialzed (json_encode actually) version of the error to a second file. This second file makes parsing of the error data super easy. Now we could aggregate the logs on a schedule, parse them and send reports as needed. We have fatal errors monitored more aggressively than non-fatal errors. We also get daily summaries. The one gotcha is that PHP fatal errors do not get sent to the error handler. So, we still have to parse the human readable file for fatal errors. A next step may be to have the logs aggregated to a central server via syslog or something. Not sure where I want to go with that yet.
Step 2: Monitoring
Even with the logging, I had no visibility into how often we were logging errors. I mean, I could grep logs, do some scripting to calculate dates, etc. Bleh, lots of work. We recently started using Circonus for all sorts of metric tracking. So, why not PHP errors? Circonus has a data format that allows us to send any random metrics we want to track to them. They will then graph them for us. So, in my error handler, I just increment a counter in memached every time I log an error. When Circonus polls me for data, I just grab that value out of memcached and give it to them. I can then have really cool graphs that show me my error rate. In addition I can set alerts up. If we start seeing more than 1 error per second, we get an email. If we start seeing more than 5 per second, we get paged. You get the idea.
I polled my friends on Twitter and two different services were mentioned. I don't know that I would actually send my errors to a service, but it sounds like an interesting idea. One was HopToad. It appears to be specifically for application error handling. The other was loggly which appears to be a more generic log service. Both very interesing concepts. The control freak in me would have issues with it. But, that is a personal problem.
Most others either manually reviewed things or had some similar notification system in place. Anyone else have a brilliant solution to monitoring PHP errors? Or any application level errors for that matter?
P.S. For what it is worth, our PHP error rate is about .004 per second. This includes some notices. Way too high for me. I want to approach absolute zero. Lets see if all these new tools can help us get there.