Ramblings of a web guy (Tag: PHP)

Using socket_connect with a timeout

brianlmoon — Mon, 09 Mar 2015 18:04:15 -0500

TL;DR

I was having trouble with socket connections timing out reliably. Sometimes, my timeout would be reached. Other times, the connect would fail after three to six seconds. I finally figured out it had to do with trying to connect to a routable, non-localhost address. This function is what I finally ended up with that reliably connects to a working server, fails quickly for a server that has an address/port that is not reachable and will reach the timeout for routable addresses that are not up.

I have put a version of my final function into a Gist on Github. I hope someone finds it useful.

Full Story

So, it seems that when you try and connect to an IP that is routable on the network, but not answering, the TCP stack has some built in timeouts that are not obvious. This differs from trying to connect to an IP address that is up, but not listening on a given port. We took a Gearman server down for maintenance and I noticed our warning logs were showing a 3 to 7 second delay between the attempt to queue jobs and the warning log. The timeout we had set was only 100ms. So, this seemed odd.

After a lot of messing around, a coworker pointed out that in production, the failures were happening for an IP that was routable on the network, but that had no host listening on the IP. I had been using localhost and some foreign port for my "failed" server. After using an IP that was local to our LAN but had no host listening on the IP, I was able to recreate it on a dev server. I figured out that if you set the send and receive timeouts really low before calling connect, you can loop while calling connect. You check the error state and timeout. As long as the error is an acceptable one and the timeout is not reached, keep trying until it connects. It works like a charm.

I found several similar examples to this on the web. However, none of them mixed all these techniques.

You can simply set the send and receive timeouts to your actual timeout and it will return quicker. However, the timeouts apply to the packets. And there are retry rules in place. So, I found that a 100ms timeout for each send and receive would wind up taking 500ms or so to actually fail. This was not what I wanted. I wanted more control. So, I set a 100 microsecond timeout during connect. This makes socket_connect return quickly. As long as the socket error is 115 (in progress) or 114 (already trying), we keep calling it. Unless of course our timeout is reached. Then we fail.

It works really well. Should help for doing server maintenance on our Gearman servers.

Lock Wait Timeout Errors or Leave Your Data on the Server

brianlmoon — Wed, 27 Jun 2012 01:12:34 -0500

If you use MySQL with InnoDB (most everyone) then you will likely see this error at some point. There is some confusion sometimes about what this means. Let me try and explain it.

Let's say we have a connection called A to the database. Connection A tries to update a row. But, it receives a lock wait timeout error. That does not mean that connection A did anything wrong. It means that another connection, call it B, is also updating a row that connection A wants to update. But, connection B has an open transaction that has not been committed yet. So, MySQL won't let you update that row from connection A. Make sense?

The first mistake people may make is looking at the code that throws the error to find a solution. It is hardly ever the code that throws the error that is the problem. In our case, it was code that was doing a simple insert into a table. I had a look at our processing logs around the time that the errors were thrown and I found a job that was running during that time. I then looked for code in that job that updates the table that was locked. This was where the problem lied.

So, why does this happen? Well, there can be very legitimate reasons. There can also be very careless reasons. The genesis of this blog post was some code that appeared to be legitimate at first, but upon further inspection was careless. This is basically what the code did.

Start Transaction on database1
Clear out some old data from the table
Select a bunch of data from database2.table
Loop in PHP, updating each row in its own query to update one column
Select a bunch of data from database2.other_table
Loop in PHP, updating each row in its own query to update another column
Commit database1

This code ran in about 20 minutes on the data set we had. It kept a transaction open the whole time. It appeared legit at first because you can't join the data as there are sums and counts going on that have a one to many relationship which would cause some duplication of the sums and counts. It also looks legit because you are having to pull data from one database into another. However, there is a solution for this. We need to stop pulling all this data into PHP land and let it stay on the server where it lives. So, I changed it to this.

Create temp table on database2 to hold mydata
Select data from database2.table into my temp table
Select data from database2.other_table into my temp table
Move my temp table using extended inserts via PHP from database2 to database1
Start Transaction on database1
Clear out some old data from the table
Do a multi-table bulk update of my real table using the temp table
Commit database1

This runs in 3 minutes and only requires a 90 second transaction lock. Our lock wait timeout on this server is 50 seconds though. However, we have a 3 time retry rule for any lock wait timeout in our DB code. So, this should allow for our current workload to be processed without any data loss.

So, why did this help so much? We are not moving data from MySQL to PHP over and over. This applies to any language, not just PHP. The extended inserts for moving the temp table from one db to another really help. That is the fastest part of the whole thing. It moves about 2 million records from one to the other in about 1.5 seconds.

So, if you see a lock wait timeout, don't think you should sleep longer between retries. And don't dissect the code that is throwing the error. You have to dig in and find what else is running when it happens. Good luck.

Bonus: If you have memory issues in your application code, these techniques can help with those too.

Stop comparing stuff you don't understand

brianlmoon — Mon, 25 Jun 2012 23:09:28 -0500

I normally don't do this. When I see someone write a blog post I don't agree with, I often just dismiss it and go on. But, this particular one caught my attention. It was titled PHP vs Node.js: Yet Another Versus. The summary was:

Node.js = PHP + Apache + Memcached + Gearman - overhead

What the f**k? Are you kidding me? Clearly this person has NEVER used memcached or Gearman in a production environment that had any actual load.

Back in the day, when URLs and filesystems had a 1:1 mapping, it made perfect sense to have a web server separate from the language it is running. But, nowadays, any PHP app with attractive URLs running behind the Apache web server is going to need a .htaccess file, which tells the server a regular expression to check before serving up a file. Sound complex and awkward with unnecessary overhead? That’s because it is.

Node has a web server built in. Some people call this a bad thing, I call those people crazy. Having the server built in means that you don’t have the awkward .htaccess config thing going on. Every request is understood to go through the same process, without having to hunt through the filesystem and figure out which script to run.

He believes that PHP inside Apache requires a .htaccess file. Welcome to 1999. I have not used a .htaccess file since then. Anyone that cares at all about scaling Apache would disable .htaccess files. And as for running regexs, how does he propose you decide your code path in the controller of his Node.js code? Something somewhere has to decide what code is going to answer a given request. mod_rewrite is wire speed fast and compiled in C. Javascript nor PHP code could ever beat that.

The official website is quite ugly and outdated.

Really? You choose your tools based on that? I don't know what to say.

Since a PHP process starts, does some boilerplate work, performs the taks the user actually wants, and then dies, data is not persistent in memory. You can keep this data persistent using third party tools like Memcache or traditional database, but then there is the overhead of communicating with those external processes.

He clearly has no understanding of how memcached is supposed to be used. You don't put things in Memcached so you can use it on the next request on this server. You put things in memcached so you can use it in any request on any server in your server pool. If you just have one web server, you can write Perl CGI scripts. Performance and up time is not important to you. If you want to share things across requests in PHP, APC and XCache fill the need very well.

The number one bottleneck with web apps is not the time it takes to calculate CPU hungry operations, but rather network I/O. If you need to respond to a client request after making a database call and sending an email, you can perform the two actions and respond when both are complete.

This sums up the mythical magic of Node.js. People think just because your code is not "running" that somehow the server is not doing anything. The process does not get to go do other shit. No, that would be multi-threaded. And Node.js is not multi-threaded. It is single threaded. That means the process can only be doing one thing at a time. If you are waiting on a DB call, you are waiting. I don't care what world you think you live in. You are waiting on that DB call. How your code is structured is irrelevant to how computers actually work. The event driven nature that is Node.js is much like OOP. You are abstracting yourself from how computers really work. The further you get from that, the less you will be able to control the computer.

Node.js is a very new, unstable, untested platform. If you are going to be building a large corporate scale app with a long lifetime, Node.js is not a good solution. The Node API’s are changing almost daily, and large/longterm apps will need to be rewritten often.

So, if I plan on making a living, don't use Node.js. Got it. We finally agree on something.

Being so new, it doesn’t have a lot of baggage leftover from days of old. Having a server built in, the stack is a lot simpler, there are less points of failure, and there is more control over what you can do with HTTP responses (ever try overwriting the name of the web server using PHP?).

No, why would you? It is not the web server. Apache or nginx would be your web server.

Are you building some sort of daemon? Use Node.
Are you making a content website? Use PHP.
Do you want to share data between visitors? Use Node.
Are you a beginner looking to make a quick website? Use PHP.
Are you going to run a bunch of code in parallel? Use Node.
Are you writing software for clients to run on shared hosts? Use PHP.
Do you want push events from the server to the client using websockets? Use Node.
Does your team already know PHP? Use PHP.
Does your team already know frontend JavaScript? Node would be easier to learn.
Are you building a command line script? Both work.

Yes, of course you would not build a daemon in PHP. Do you plan to share that same data across servers? He already told us Node is not multi-threaded, so how can it run code in parallel? Websockets have a ton of their own pain to deal with that is not even related to Node vs. PHP. Things like proxies. If I was building a command line tool and wanted to use Javascript, I would just use V8.

Listen, I write code in PHP and JavaScript all day. I also use some Ruby, Lua and even dabble in C. I am not a language snob. Use what works for you. I do however take exception when people write about things they clearly have no idea about. He claims to have written a lot of PHP. He clearly has never deployed a lot of PHP in environments that matter. If you are building small sites that don't have a lot of traffic, you can use anything. If you are building massive sites that have to scale, any technology is going to require a full understanding of what it takes to scale it out. I leave you with this wisdom that I am reminded of by his blog post.

PHP Coding Standards

brianlmoon — Fri, 25 May 2012 19:27:56 -0500

Update: Matthew Weier O'Phinney, one of the core members of the group, has cleared up the naming history in the comments.

During the /dev/hell podcast at Tek12, someone asked the guys their opinion about PSR. I did not know what PSR was by that name. A quick search lead me to the Google Group named PHP Standards Working Group. I had vaguely remembered a consortium of frameworks, libraries and applications that were organizing to attempt to make their projects cooperate better. But, this did not sound like the same project. Another search and I found the PHP Framework Interoperability Group on Github. A bit more searching led me to a post where apparently the PHP FIG changed their name at some point citing people not knowing what FIG meant. But, this is not a history post. The group had done some work on setting a standard for auto loaders in PHP. This is a very good thing and much needed. That is a real thing that impacts real developers.

The person asking the question had asked about PSR1 and PSR2. These are the first two standards proposals in the group and they deal with coding standards. There were mixed feelings in the room about the proposals. I asked (being me, probably with very little tact) why in 2012 were a group of really smart people still discussing coding standards such as tabs vs. spaces. Because this is what immediately came to mind for me.

Source: http://xkcd.com/927/

There are already coding standards for PHP and any other language out there. Why does anyone need to make a new one? For Phorum we chose the PEAR standard (ok, with 2 minor modifications). On top of that, every one of the projects in this group already have coding standards. Why not just pick one of those? Are 10 projects that currently have their own standards going to actually all change to something else? I highly doubt it. My guess is that, at best, they will all end up with a modified version of the groups standards.

This reminds me a lot of Open Source licenses. There are tons of these things. And in the end, most (GPL has its issues I know) of the open source licenses represent the same idea. I suppose you could say that most of all of them fall into GPL like or BSD like. Anyhow, I quit worrying about having my own license years ago. I now just use a BSD style license that you can generate with several online BSD license generators.

When I voiced my concern about what is, in my opinion, a waste of very smart people's time, my good friend Cal Evans (He has bled in my car. So, I think he is my friend. And I hope he feels the same.) said that I was misunderstanding the point of the group. It was a group of projects that were collaborating to try and use similar standards and practices to make the PHP OSS community better. And that is exactly what I thought PHP FIG was. However, the group name is now "PHP Standards Working Group". That reminds me of the W3C HTML Working Group. And in my mind that means a group that is deciding the future of a technology. In addition the proposal being discussed is titled "PSR-1, a standard coding convention for PHP". If you pair that with the name of the group, it sounds very authoritative. And I don't think that is by accident. If I was heading up such an effort, I would hope that every PHP developer on the planet would follow it too. If you saw Terry Chay's keynote at the PHP Community Conference last year, he talked about frameworks and platforms. He pointed out that the reason people like Facebook were sharing their data center technology was in hopes that people would start using it and it would become common. Thus meaning the equipment they are custom building would be cheaper and people they hire would already be familiar with it. But, if the point of the group is *only* cooperation between lage OSS PHP projects, I wish they would pick a name that is more indicative of that. As it stands, when I landed on the page, my immediate assumption was that this groups intention was to dictate to the rest of the PHP world how to write their PHP code.

In the end, cooperation is good. And if these guys want to cooperate I say more power to them. I just hope they get into really good things soon. Like, can we talk about a maximum number of files, functions or classes used for any one single page execution? *That* would be valuable to the PHP community. I can deal with funny formatting. I can't deal with poorly performing code that his dragged down in abstracton and extension. Or how about things like *never* running queries inside loops that are reading results from another query. That would be a great thing to make examples of and show people the best practices. Tabs vs. spaces? That should have been solved 10 years ago. When in doubt, PHP code should do what the PHP core does. This is PHP we are talking about. Would it not make sense to have the people who write PHP writing code that is somewhat similar in style to those that make PHP? C and PHP syntax are very, very similar. So, why don't we all just refer to the PHP CODING_STANDARDS file when in doubt and not even worry with the little stuff that does not affect performance?

So, as of a few minutes ago, I have joined the group. If for no other reason, just to see what is discussed. Perhaps I should follow the advice I give people when they ask for features in my projects and do something about my issues and worries.

Living in the Prove It Culture

brianlmoon — Tue, 06 Mar 2012 22:45:18 -0600

Engineering cultures differ from shop to shop. I have been in the same culture for 13 years so I am not an expert on what all the different types are. Before that I was living in Dilbert world. The culture there was really weird. The ideas were never yours. It was always some need some way off person had. A DBA, a UI "expert" and some product manager would dictate what code you wrote. Creativity was stifled and met with resistance.

I then moved to the early (1998) days of the web. It was a start up environment. In the beginning there were just two of us writing code. So, we thought everything we did was awesome. Then we added some more guys. Lucky for us we mostly hired well. The good hires where type A personalities that had skills we didn't have. They challenged us and we challenged them. On top of that, we had a CEO who had been a computer hacker in his teens. So, he had just enough knowledge to challenge us as well. Over the years we kept hiring more and more people. We always asked in the interview if the person could take criticism and if they felt comfortable defending their ideas. We decided to always have a white board session. We would ask them questions and have them work it out on a white board or talk it out with us in a group setting. The point of this was not to see if they always knew the answer. The point was to see how they worked in that setting. Looking back, the hires that did not work out also did not excel in that phase of the interview. The ones that have worked out always questioned our methods in the interview. They did not belittle our methods or dismiss them. They just asked questions. They would ask if we had tried this or that. Even if we could quickly explain why our method was right for us, they still questioned it. They challenged us.

When dealing with people outside the engineering team, we subconsciously applied these same tactics. The philosophy came to be that if you came to us with an idea, you had to throw it up on the proverbial wall. We would then try to knock it down. If it stuck, it was probably a good idea. Some people could handle this and some could not. The ones that could not handle that did not always get their ideas pushed through. It may not mean they were bad ideas. And that is maybe the down side of this culture. But, it has worked pretty well for us.

We apply this to technology too. My first experience on Linux was with RedHat. The mail agent I cut my teeth on was qmail. I used djbdns. When Daniel Beckham, our now director of operations, came on, he had used sendmail and bind. He immediately challenged qmail. I went through some of the reasons I prefered it. He took more shots. In the end, he agreed that qmail was better than sendmail. However, his first DNS setup for us was bind. It took a few more years of bind hell for him to come around to djbdns.

When RedHat splintered into RedHat Enterprise and Fedora, we tried out Fedora on one server. We found it to be horribly unstable. It got the axe. We looked around for other distros. We found a not very well known distro that was known as the ricer distro of the Linux world called Gentoo. We installed it on one server to see what it was all about. I don't remember now whose idea it was. Probably not mine. We eventually found it to be the perfect distro for us. It let us compile our core tools like Apache, PHP and MySQL while at the same time using a package system. We never trusted RPMs for those things on RedHat. Sure, bringing a server online took longer but it was so worth it. Eventually we bought in and it is now the only distro in use here.

We have done this over and over and over. From the fact that we all use Macs now thanks to Daniel and his willingness to try it out at our CEO's prodding to things like memcached, Gearman, etc. We even keep evaluating the tools we already have. When we decided to write our own proxy we discounted everything we knew and evaluated all the options. In the end, Apache was known and good at handling web requests and PHP could do all we needed in a timely, sane manner. But, we looked at and tested everything we could think of. Apache/PHP had to prove itself again.

Now, you might think that a culture of skepticism like this would lead to new employees having a hard time getting any traction. Quite the opposite. Because we hire people that fit the culture, they can have a near immediate impact. We have a problem I want solved and a developer that has been here less than a year suggested that Hadoop may be a solution, but was not sure we would use it. I recently sent this in an email to the whole team in response to that.

The only thing that is *never* on the table is using a Windows server. If you can get me unique visitors for an arbitrary date range in milliseconds and it require Hadoop, go for it.

You see, we don't currently use Hadoop here. But, if that is what it takes to solve my problem and you can prove it and it will work, we will use it.

Recently we had a newish team member suggest we use a SAN for our development servers to use as a data store. Specifically he suggested we could use it to house our MySQL data for our development servers. We told him he was insane. SANs are magical boxes of pain. He kept pushing. He got Dell to come in and give us a test unit. Turns out it is amazing. We can have a hot copy of our production database on our dev slices in about 3 minutes. A full, complete copy of our production database in 3 minutes. Do you know how amazing that is? Had we not had the culture we do and had not hired the right person that was smart enough to pull it off and confident enough to fight for the solution, we would not have that. He has been here less than a year and has had a huge impact to our productivity. There is talk of using this in production too. I am still in the "prove it" mode on this. We will see.

I know you will ask how our dev db works, here you go:

1. Replicate production over VPN to home office

2. Write MySQL data on SAN

3. Stop replication, flush tables, snapshot FS

4. Copy snapshot to a new location

5. On second dev server, umount SAN, mount new snapshot

6. Restart MySQL all around

7. Talk in dev chat how bad ass that is

We had a similar thing happen with our phone system. We had hired a web developer that previously worked for a company that created custom Asterisk solutions. When our propietary PBX died, he stepped up and proved that Asterisk would work for us. Not a job for a web developer. But he was confident he could make it work. It now supports 3 offices and several home bound users world wide. He also had only been here a short time when that happened.

Perhaps it sounds like a contradiction. It may sound like we just hop on any bandwagon technology out there. But no. We still use MySQL. We are still on 5.0 in fact. It works. We are evaluating Percona 5.5 now. We tried MySQL 5.1. We found no advantage and the Gentoo package maintainer found it to be buggy. So, we did not switch. We still use Apache. It works. Damn well. We do use Apache with the worker MPM with PHP which is supposedly bad. But, it works great for us. But, we had to prove it would work. We ran a single node with worker for months before trusting it. Gearman was begrudgingly accepted. The idea of daemonized PHP code was not a comforting one. But once you write a worker and use it, you feel like a god. And then you see the power. Next thing you know, it is a core, mission critical part of your infrastructure. That is how it is with us now. In fact, Gearman has went from untrusted to the go to tech. When someone proposes a solution that does not involve Gearman, someone will ask if part of the problem can be solved using Gearman and not whatever idea they have. There is then a discussion about why it is or is not a good fit. Likewise, if you want to a build a daemon to listen on a port and answer requests, the question is "Why can't you just use Apache and a web service?" And it is a valid question. If you can solve your problem with a web service on already proven tech, why build something new?

This culture is not new. We are not unique. But, in a world of "brogramming" where "engineers" rage on code that is awesome before it is even working and people are said to be "killing it" all the time, I am glad I live in a world where I have to prove myself everyday. I am the most senior engineer on the team. And even still I get shot down. I often pitch an idea in dev chat and someone will shoot it down or point out an obvious flaw. Anyone, and I mean anyone, on the team can question my code, ideas or decisions and I will listen to them and consider their opinion. Heck, people outside the team can question me too. And regularly do. And that is fine. I don't mind the questions. I once wrote here that I like to be made to feel dumb. It is how I get smarter. I have met people that thought they were smarter than everyone else. They were annoying. I have interviewed them. It is hard to even get through those interviews.

Is it for everyone? Probably not. It works for us. And it has gotten us this far. You can't get comfortable though. If you do foster this type of culture, there is a risk of getting comfortable. If you start thinking you have solved all the hard problems, you will look up one day and realize that you are suffering. Keep pushing forward and questioning your past decisions. But before you adopt the latest and greatest new idea, prove that the decisions your team makes are the right ones at every step. Sometimes that will take a five minute discussion and sometimes it will take a month of testing. And other times, everyone in the room will look at something and think "Wow that is so obvious how did we not see it?" When it works, it is an awesome world to live in.

/dev/hell Podcast Episode #5

brianlmoon — Fri, 03 Feb 2012 12:12:10 -0600

I was privileged to be invited to be a part of the /dev/hell podcast this week. Thanks to Chris and Ed for having me on. Check it out. And subscribe to their podcast.

Errors when adding/subtracing dates using seconds

brianlmoon — Mon, 16 Jan 2012 15:55:49 -0600

This just came up today again for me. I have said it before, but even I get lazy and forget. When doing math with dates such as adding days it is really quick to think this works:




$date = "2011-11-01";



// add 15 days



$new_date = date("Y-m-d", strtotime($date) + (86400 * 15));



// $new_date should be 2011-11-16 right?



echo $new_date;



?>

This yields `2011-11-15`. The problem with this is that it assume that there are only 86400 seconds in every day. There are in fact not. On days when the clocks change for daylight savings time, there are either 1 hour more than that or 1 hour less than that. In addition, there are also leap seconds put into our time system to keep us in line with the sun. There is one this year, 2012, on June 30th in fact. Since they don't happen with the regularity that daylight savings time does, it may be easy to forget those. Luckily, for this problem, the solution is the same. You have two choices. And the solution you choose depends on the particular problem you have. For the simple problem above, you can simply let strtotime take care of it for you.

$new_date = date("Y-m-d", strtotime($date." +15 days"));

strtotime() is the most awesome date/time related function in all of computer programming. I have written about it before. It handles all those nasty weird seconds issues. But, if you are not solving a problem this simple or you are reading this and need help with another language, there is another solution. You do all your date math at noon. Simply only run code that does date math during lunch time. No, just joking. That would be silly. What I mean is to adjust all your date/time variables to represent the time at 12:00 hours.

$new_date = date("Y-m-d", strtotime($date." 12:00:00") + 86400 * 15);

Now that you are doing date math at noon, you will be safe for most any date range you are doing math on. Daylight savings time always gives and hour and takes an hour every year, so those cancel each other out. It would take a whole lot of leap seconds to cause the offset from the start date to the end date to shift enough to make this technique no longer work, but it is technically feasible. So, if you are adding hundreds of years to dates, this won't work for you. There, disclaimer added.

UPDATE: I should point out that the examples in this post only apply to US time zones that recognized Daylight Savings Time and the rules for DST as they apply after the changes in 2007.

Check for a TTY or interactive terminal in PHP

brianlmoon — Thu, 01 Sep 2011 23:42:54 -0500

Many UNIX tools do different things if they are connected to an interactive terminal, also called a TTY. This can be handy for lots of reasons. I had a use case today that prompted me to find out how to do it in PHP.

Here is the situation. We log errors to the PHP error log. We then have processes that monitor that error log and alert us about any uncaught exceptions or fatal errors very quickly so we can address issues. We also monitor non-fatal errors and alert on those on a less frequent schedule. However, this can be annoying if a user is running some code on a terminal that is generating errors. Let's say I am trying to find out why some file import did not happen. Running the job that is supposed to do it may yield an error. Maybe it was a file permission issue or something. There are other people watching the alerts. What they don't know is that I am running the code and looking at these errors in real time. So, they may start digging into the issue when I am the one causing it and can see it happening already. So, I thought it would be nice, if in my error handler, I could not send errors to the error log that are being sent to an interactive terminal. A few quick searches for "php check for tty" did not find anything. In the end, a coworker cracked open a book to see how it was done in C. That got me on the right path to finding two PHP functions: posix_isatty and posix_ttyname. These seem to do the trick. They take a file descriptor like STDOUT and will tell you if that is an interactive terminal and what the tty name is if it is one.

It took me a few tries to get the full effect I wanted in PHP. The thing I always forget is that fatal errors don't use my error handler. I understand why. The engine is in an unknown state when that happens, so it can't keep running code. In the end I added this code to my auto prepend file which is where my error handler, auto loader and other start up stuff is defined for all PHP code we run, both CLI and Apache. Since PHP will send errors to STDERR if it is defined, that is what we want to check for. If it is defined and it is a TTY, we just disable error logging. I should note that I already check the log_errors ini setting in my error handler before I even call the error_log function.

if(defined("STDERR") && posix_isatty(STDERR)){

    ini_set("log_errors", false);

    ini_set("error_log", null);

}

Hopefully anyone else searching for "php check for tty" or "php interactive terminal" will find this blog post and it will help them out.

Talking about Gearman at Etsy Labs

brianlmoon — Fri, 05 Aug 2011 08:11:16 -0500

I find myself flying to New York on Monday for some dealnews related business. Anytime I travel I try and find something fun to do at night. (Watching a movie by myself in Provo, Utah was kinda not that fun.) So, this week I asked on Twitter if anything was happening while I would be in town. Anything would do. A meetup of PHP/MySQL users or some design/css/js related stuff for example. Pretty much anything interesting. Well, later that day I received an IM from the brilliant John Allspaw, Senior VP of Technical Operations at Etsy. He wanted me to swing by the Etsy offices and say hi. Turns out it is only a block away from where I would be. Awesome! He also mentioned that he would like to have me come and speak at their offices some time. That would be neat too. I will have to plan better next time I am traveling up there.

Fast forward another day. I get an email from Kellan Elliott-McCrea, CTO of Etsy wanting to know if I would come to the Etsy offices and talk about Gearman. At first I thought "That is short notice, man. I don't know that I can pull that off." Then I remembered the last time I was asked to speak at an event on short notice based off a recommendation from John Allspaw.

It was in 2008 for some new conference called Velocity. That only turned out to be the best conference I have ever attended. I have been to Velocity every year since and this year took our whole team. In addition, I spoke again in 2009 at Velocity, wrote a chapter for John's book Web Operations that was released at Velocity in 2010 and was invited to take part in the Velocity Summit this year (2011) which helps kick off the planning for the actual conference. The moral of that story for me is: when John Allspaw wants you to take part in something, you do it.

In reality, it was not that tough a decision. Even without John's involvement, I love the chance to talk about geeky stuff. The Etsy and dealnews engineering teams are like two twins separated at birth. Every time we compare notes, we are doing the same stuff. For example, we have been trading Open Source code lately. They are using my GearmanManager and we just started using their statistics collection daemon, statsd. So, speaking to their people about what we do seem like a great opportunity to share and get input.

The event is open to the public. So, if you use Gearman, want to use Gearman, or just want to hear how we use Gearman at dealnews, come here me ramble on about how awesome it is Tuesday night in Dumbo at Etsy Labs. You can RSVP on the event page.

Best Practices for Gearman by Brian Moon
Etsy Labs
55 Washington St. Ste 712
NY 11222

Tuesday, August 09, 2011 from 7:00 PM - 10:00 PM (ET)

PHP Frameworks

brianlmoon — Mon, 25 Apr 2011 10:00:00 -0500

Last week I spoke at and attended the first ever PHP Community Conference. It was very good. It was also very different from my normal conference. I usually go for very technical stuff. I don't often stop and smell the roses of the community or history of my chosen (by me or it I am not sure sometimes) profession. There was a lot of community at this one.

One thing that seemed to be a hot topic at the conference was frameworks. CakeDC, the money behind CakePHP was the platinum sponsor. I chatted with Larry Masters the owner of CakeDC for a bit while walking one night. Great guy. Joël Perras gave a tutorial about Lithum. I attended most of this one. He did very well. Joël was frank and honest about the benefits and problems with using frameworks, including having to deal with more than one at a time. There was also a tutorial about Zend Framework patterns by Matthew Weier O'Phinney. I missed this one. On the second day, things were different. Rasmus Lerdorf warned about the bloat in most of the PHP frameworks and expressed hope that they would do better in the newer versions. I received several questions about frameworks in my session. I also spoke out about them a bit. Terry Chay wrapped up the day with his closing keynote and touched on them again. More on that later. I want to kind of summarize what I said (or meant to say).

PHP is a framework

In my session, I talked about the history of Phorum. One of the things I covered was the early days of PHP. Back in the 90s, before PHP, most dynamic web work was done in C or Perl. At that time, in those worlds, you had to do all the HTTP work yourself. If you wanted a content type of text/html, you had to set it, in code, on every single response. Errors in CGI scripts would often result in Apache internal error pages and made debugging very hard. All HTML work had to be done by writing to output. There was no embedding code with HTML even as a templating language. PHP changed all that. You had a default content type of text/html. You had automatic handling of request variables. Cookies were easily ingested and output. You could template your HTML with script instead of having to write everything out via print or stdout. It was amazing. Who could ask for more?

Frameworks as a tool

Well, apparently a world that I honestly don't work in could ask for more. There are three major segments of web developers these days. There are developers that work for a company that has a web site, but its business is not the web site. Maybe it is a network hardware company or some other industry where their business merits having a staff to run their site, but it is not their core business. Then there are developers like myself that work for a company where the web site is the business. Everything about the business goes through the web. We have the public web site and the internal web site our team uses. It is everything. The last type are those developers that are constantly building new sites or updating existing sites for clients. I will be honest, this is not a segment I have considered much in the past when writing code or blog posts. But, I met more of those people at this conference than any of the other two types. They seem to be the ones that are motivated and interested. Or at least, because PHP and the web are their business, they sent their people to the conference.

You see, I have spoken out about frameworks. Not very publicly, but those that know me have heard me comment about them. I have never really seen the point. Why start with something generic that will most likely not fit your ultimate need when you need to scale or expand beyond its abilities? Well, for thousands of web sites, that are likely being built by agencies, that time never occurs. Most likely, before that happens, the site will be redesigned and completely replaced. So, if you spend every day building a new site, why do all that groundwork every time?

In addition, why have to deal with every different client's needs? I often say that Apache is my controller. I don't like to use PHP as my controller. But, if I was deploying a site every week to a different stack, I can't rely on Apache with mod_rewrite or whatever things I rely on in my job today. So, you need to have full control in the application. What database will the client this week use? I don't care, the framework abstracts that for me. These are all very good reasons to use a framework.

Framework Trade-Off

There are some trade-offs though. The biggest one I see is the myriad of choices. Several of the pro-framework people even mentioned that there are a lot out there. And it seems that someone is making a new one everyday. With all these choices, it is likely that some of the benefit you get from a framework could be lost. If a client already has a site based on CakePHP and your agency uses Lithium what do you do? Say no to the work or have to deal with the differences? Some of them are big enough to be a real issue. Some are so small, you may not notice them until it's too late. That is a tough place to be.

The other issue is performance. Frameworks are notoriously inefficient. It has just been their nature. The more you abstract away from the core, the less efficient you are. This is even true with PHP. Terry Chay pointed out that PHP is less efficient than Java or C in his keynote. But, you gain power with PHP in way of quicker development cycles. Frameworks have that same benefit. But, have not solved this issue any better than PHP has over C. They abstract away the low level (for PHP at least) stuff that is going on. And that means loss of efficiency. This can be solved or at least worked on, however, and I hope it is.

Frameworks as a Commodity

So, this gets me back to something Terry Chay said. He talked about the motivation of companies to open source their technology. He used Facebook's Open Compute Project as an example. He pointed out that a major reason Facebook would open up this information would be in hopes that others would do the same in their data centers. If that happened, it would be easier for Facebook to move to a new data center because it was already mostly setup the way they like it.

Transitioning this same thought frameworks, the commoditization here, that I see, is in the interest of developers. If the framework you support becomes the de facto standard, then all those developers working in agencies using it are now ready to come to work for you. Plus, if you are the company behind it, there are opportunities for books, conferences, training, support, and all the other peripherals that come from the commercial/open source interaction. Need proof of that? Look no further than the "PHP Company", Zend. They could have committed developers to PEAR, but instead created Zend Framework. I see job listings very often for Zend Framework experience. Originally Zend tried to monetize the server with their optimizers and Zend Server. They had moderate success. The community came up with APC and XCache that sort of stole their thunder. I feel they have had much better success with Zend Framework in terms of market penetration. The money is with the people that write the code, not run the servers.

Frameworks are EVERYWHERE

I will close with something else that Terry Chay said. This was kind of an aha! moment for me. Terry pointed out that frameworks are everywhere. Wordpress, Drupal and even my project, Phorum, are frameworks. You can build a successful site using just those applications. It is not just the new breed code libraries that can be viewed as frameworks. In fact, Phorum's very own Maurice Makaay is building his new web site using only Phorum 5.3 (current development branch). Phorum offers easier database interaction, output handling, templating, a pluggable module system and even authentication and group based permissions. Wow, I have always kind of shunned this idea. In fact, when Maurice first showed me his site, I kind of grimaced. Why would you want to do that? You know why? Because the main thing that drives his site is Phorum. His users come to the site for Phorum. So, why would he want to install Phorum, invest in making it all it can be and then have to start from scratch for all the other parts of the site that are not part of the message board. Duh, I kind of feel stupid for never looking at things from this perspective before. Feeling dumb is ok. I get smarter when I feel dumb. New ideas make me a better developer. And I hope that is what comes out of this experience for me. You never know, I may throw my name in this hat and see how Phorum's groundwork could be useful outside of the application itself.

What is next for message board software?

brianlmoon — Thu, 24 Feb 2011 07:00:00 -0600

When I was hired at dealnews.com in 1998, my primary focus was to get our message board (Phorum) up to speed. I had written the first version as a side project for the site. Message boards were a lot simpler back then. Matt's WWWBoard was the gold standard of the time. And really, the functionality has been only evolutionary since. We added attachments to Phorum in 2003 or something. That was a major new feature. In Phorum 5 we added a module system that was awesome. But, that was just about the admin and not the user. From the user's perspective, message boards have not changed much since 1997. I saw this tweet from Amy Hoy and it got me to thinking about how message boards work. Here is the typical user experience:

Go to the message board
See a list of categories
Drill down to the category they want to read
Scroll through a list of messages that are in reverse cronological order by original post date or most recent post date
Click a message and read it.
Go to #3, repeat

Every message board software package pretty much works like that and has for over 10 years. And it kind of sucks. What a user would probably rather experience is:

Go to the message board
The most interesting things (to this user) are listed right there on the page. No drill down needed.
Click one and read it.
Goto #2, repeat.

Sounds easy? That #2 is easy to type but very hard to accomplish. I think it is conceivably doable if you are running a site that has all the data. Stackoverflow comes close. When you land on the site, they default the page to the "interesting" posts. However, they are not always interesting to me. They are making general assumptions about their audience. For example, right now, the first one is tagged "delphi". I could care less about that language and any posts about it. Its a good try, but misses by oh so far. This is not a Stackoverflow hate post. They are doing a good job. So, what do I do when I land there? I ignore the front page and click Tags (#2 in the first list), then pick a tag I want to read about (#3 in the first list). Low and behold the page I get is "newest". So, I end up doing exactly what is in the first list I mentioned. They do offer other sort options. But, they chose newest as the default. And from years of watching user behavior, 80% - 90% of people go with the good ol' default. This kind of brings me to another point though about the types of message boards there are.

Stackoverflow is a classic example of a help message board. People come there and ask a question. Other people come along and answer the question. Then more people come along and vote on whether the answers (and questions) are any good. This is one really nice feature that I think will have to become a core feature in any message board of the future. The signal to noise ratio can get so out of whack, you need human input to help decide what is good and what is noise. I think the core of the application has to rely on that if we are ever going to achieve the desired experience.

The second type of message board is a conversational system. It is almost like a delayed chat room. People come to a message board and post about their cat or asking who watched a TV show, that kind of thing. This has a completely different dynamic to it than the help message board. You can't really vote if a post is good or bad. The obvious exception being spam would of course want to be recognized and dealt with.

So, how do you know what content is desirable for the user that is entering the site right now? This concept has already been laid out for us: the social graph. You have to give users a way to associate with other users. If Bob really likes Tom's posts, he is probably more interested to read Tom's post from 30 minutes ago than some new guy that just joined the site and posted 1 minute ago. The challenge here is getting people to interconnect...but not too much. Everyone has that aunt on Facebook that follows you, your roommate and anybody else she can. She would follow your dog if he had a Facebook account. So, those people would still get a crappy experience if the whole system relied on the social graph. The other side is the people that will never "follow", "like" or whatever you call it another person. Their experience would lack as well. One key ingredient here is that you need to own this data. You can't just throw like buttons and Facebook connect on your message board and think you can leverage that data. That data is for Facebook, not you. I think the help message boards could benefit from the social graph as well.

Another aspect of what is most important to a user is discussions they are involved in. That could mean ones they started, ones they have replied to or simply ones they have read. Which of those aspects are more important than the others? Clearly if you started a discussion and someone has replied, that is going to interest you. If you posted a reply, you may be done with the topic or you may be waiting on a response. It would take some serious natural language algorithms to decide which is the case. For things you have read, I think you have to consider how many times the user has read the discussion. If every time it is updated they read it, they probably will want to read it again the next time it is updated. If they have only read it once, maybe they are not as interested.

The last aspect of message boards is grouping things. This is the part I actually struggle with the most. The easy first answer is tagging. Don't force the user down a single trail, let them tag posts instead of only posting them in one neat contained area. That gets you half way there. Let's use Stackoverflow (I really do like the site) as an example again. The first thing I do is go to Tags and click on PHP. I like helping people with PHP problems. So, is that really any different from categorization? Sure, there could be someone out there that really likes helping with Javascript. And if the same post was tagged with both tags then their coverage of potential help is larger. But, some of the time those tags are wrong when they tag it with more than one tag. The problem they need help with is either PHP or Javascript, most likely not both. They just don't know what they are doing. For example, there is this post on Stackoverflow. The user tagged it PHP and database-design. There is no PHP in the question. I am guessing he is using PHP for the app. But, it really never comes up and he is only talking about database design. So, who did the PHP tag help there? I don't think it helped him. And it only wasted my time. Having written all that, a free-for-all approach where there is no filtering sucks too. ARGH! It just all sucks. That brings us back to what Amy said in a way. Perhaps moderated tagging is an answer. I have not seen a way on Stackoverflow to untag a post. That would let people correct others. I am gonna write that down. If you work at Stackoverflow and are reading this, you can use that idea. Just put a comment in the code about how brilliant I am or something that aliens will find one day.

So, I am done. I know exactly what to do right? I just have to make code that does everything I put in the previous paragraphs. Man I wish it were that easy. When you want to write a distributed application to do it, the task is even more daunting. If I controlled the data and the servers and the code, I could do crazy things that would make great conference talks. But, it kind of falls apart when I want to give this code to a 60 year old retired guy that is starting a hobby site for watching humming birds on a crappy GoDaddy account. Yeah, he is not installing Sphinx or HandlerSocket or Gearman. Those are all things I would want to use to solve this problem in a scalable fashion. At that point you have two choices. Aim for the small time or the big time. If you aim for the small time, you may get lots of installs, but, you will be hamstrung. If you aim for the big time, you may be the only guy that ever uses the code. That is a tough decision.

What have I missed? I know I missed something. Are there other types of message boards? I can definitely see some sub-types. Perhaps a board where ideas instead of help messages are posted. Or maybe the conversations are more show off based as in a user posting pictures or videos for comment. Is there already something out there doing this and I have just missed it? Let me know what I have missed please.

Sharing gotchas on Facebook and Twitter

brianlmoon — Thu, 18 Nov 2010 07:00:00 -0600

I have been working on adding some sharing features to dealnews.com. Dealing with Facebook and Twitter has been nothing if not frustrating. Neither one seems to understand how to properly deal with escaping a URL. At best they do it one way, but not all ways. At worst, they flat out don't do it right. I thought I would share what we found out so that someone else my be helped by our research.

Facebook

Facebook has two main ways to encourage sharing of your site on Facebook. The older way is to "Share" a page. The second, newer, cooler way to promote your page/site on Facebook is with Facebook's Like Button. Both have the same bug. I will focus on Share as it is easier to show examples of sharing. To do this, you make a link and send it to a special landing page on Facebook's site. But, lets say my URL has a comma in it. If it does, Facebook just blows up in horrible fashion. The users of Phorum have run into this problem too. In Phorum, we dealt with register_globals in a unique way long ago. We just don't use traditional query strings on our URLs. Instead of the traditional var1=1&var2=2 format, we decided to use a comma delimited query string. 1,2,3,var4=4 is a valid Phorum URL query string.

According to RFC 3986, a query string is made up of:

query = *( pchar / "/" / "?" )

where pchar is defined as:

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

and finally, sub-delims is defined as:

sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / 
 "*" / "+" / "," / ";" / "="

That is RFC talk for "A query string can have an un-encoded comma in it as a delimiter." So, in Phorum we have URLs like http://www.phorum.org/phorum5/read.php?61,145041,145045. That is the post in Phorum talking about Facebook's problem. It is a valid URL. The commas do not need to be escaped. They are delimiters much like an & would be in a traditional URL. So, what happens when you share this URL on Facebook? Well, a share link would look like http://www.facebook.com/share.php?u=http%3A%2F%2Fwww.phorum.org%2Fphorum5%2Fread.php%3F61%2C146887%2C146887. If I go to that share page and then look in my Apache logs I see this:

66.220.149.247 - - [18/Nov/2010:00:47:51 -0600] "GET /phorum5/read.php?61%2C146887%2C146887 HTTP/1.1" 302 26 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

Facebook sent %2C instead of comma? It decoded the other stuff in the URL. The slashes, the question mark, all of it. So, what is their deal with commas? Well, maybe I can hack Facebook and not send an encoded URL to the share page. Nope, same thing. So, they are proactively encoding commas in URL's query strings.

This has two effects. The first is that the share app attempts to pull in the title, description, etc. from the page. In this case, we redirect the request as the query string is invalid for a Phorum message page. So, they end up getting the main Phorum page. In the case of dealnews, we usually throw a 400 HTTP error when we get invalid query strings. Neither of these get the user what he wanted. The second problem is that the URL that is clickable when the user has shared the URL is not valid. So, the whole thing was just a huge waste of time.

I have submitted this to the Facebook Bugzilla. The only work around is to use a URL shortener or don't use commas in your URLs. Just make sure the shortener does not use commas. I guess you could use special URLs for Facebook that used something besides comma that are then redirected to the real URL with commas. I don't know what that character is, I am just guessing.

Twitter

Twitter's issues deal with their transition from their old interface to their new interface. Twitter is in the process of (or is done with) rolling a new UI on their site. The link in the old site to share something on Twitter was something like: http://twitter.com/home?status=[URL encoded text here]. This worked pretty darn well. You could put any valid URL encoded text in there and it worked. However, that now redirects you to their new interface's way of updating your status and they don't encode things right.

If I want to tweet "I love to eat pork & beans" I would make the URL http://twitter.com/home?status=I+love+to+eat+pork+%26+beans. Twitter then takes that, decodes the query string and redirects me to http://twitter.com/?status=I%20love%20to%20eat%20pork%20&%20beans. The problem is that they did not re-encode the &. It is in the bare URL. So, when I land on my twitter page, my status box just says "I love to eat pork ". Which while true, is not what I mean to tweet. This bug has been submitted to Twitter, but has yet to be fixed.

The second problem is with the new site and how they deal with validly encoded spaces. Spaces can be escaped two ways in a URL. The first, older way (which the PHP function urlencode uses) is to encode spaces as a plus (+) sign. This comes from the standard for how forms submit (or used to submit) data. It is understood by all browsers. The second way comes from the later RFC's written about URLs. They state that spaces in a URL should be escape like other characters by replacing a space with %20. The old Twitter UI would accept either one just fine. And, if you send that to the old status update URL it will redirect you (see above) with %20 in the URL instead of +. However, if you send + to the new Twitter UI, as above, you get "I+love+to+eat+pork+&+beans" in your status box. The only solution is to not send + has an encoding for space to Twitter. In PHP you can use the function rawurlencode to do this. It conforms to the RFC(s) on URL encoding. Doing so, with thew new linking pattern generates the URL http://twitter.com/?status=I%20love%20to%20eat%20pork%20%26%20beans which works great. This was also reported to Twitter as a bug by our team.

So, maybe that will help someone out that is having issues with sharing your site on the two largest social networks. Good luck with your social media development.

Monitoring PHP Errors

brianlmoon — Mon, 08 Nov 2010 08:00:00 -0600

PHP errors are just part of the language. Some internal functions throw warnings or notices and seem unavoidable. A good case is parse_url. The point of parse_url is to take apart a URL and tell me the parts. Until recently, the only way to validate a URL was a regex. You can now use filter_var with the FILTER_VALIDATE_URL filter. But, in the past, I would use parse_url to validate the URL. It worked as the function returns false if the value is not a URL. But, if you give parse_url something that is not a URL, it throws a PHP Warning error message. The result is I would use the evil @ to suppress errors from parse_url. Long story short, you get errors on PHP systems. And you don't need to ignore them.

In the past we just logged them and then had the log emailed daily. On some unregular schedule we would look through them and fix stuff to be more solid. But, when you start having lots of traffic, one notice error on one line could cause 1,000 error messages in a short time. So, we had to find a better way to deal with them.

Step 1: Managing the aggregation

The first thing we did was modify our error handler to log both human readable logs to the defined php error log and to write a serialzed (json_encode actually) version of the error to a second file. This second file makes parsing of the error data super easy. Now we could aggregate the logs on a schedule, parse them and send reports as needed. We have fatal errors monitored more aggressively than non-fatal errors. We also get daily summaries. The one gotcha is that PHP fatal errors do not get sent to the error handler. So, we still have to parse the human readable file for fatal errors. A next step may be to have the logs aggregated to a central server via syslog or something. Not sure where I want to go with that yet.

Step 2: Monitoring

Even with the logging, I had no visibility into how often we were logging errors. I mean, I could grep logs, do some scripting to calculate dates, etc. Bleh, lots of work. We recently started using Circonus for all sorts of metric tracking. So, why not PHP errors? Circonus has a data format that allows us to send any random metrics we want to track to them. They will then graph them for us. So, in my error handler, I just increment a counter in memached every time I log an error. When Circonus polls me for data, I just grab that value out of memcached and give it to them. I can then have really cool graphs that show me my error rate. In addition I can set alerts up. If we start seeing more than 1 error per second, we get an email. If we start seeing more than 5 per second, we get paged. You get the idea.

The Twitterverse:

I polled my friends on Twitter and two different services were mentioned. I don't know that I would actually send my errors to a service, but it sounds like an interesting idea. One was HopToad. It appears to be specifically for application error handling. The other was loggly which appears to be a more generic log service. Both very interesing concepts. The control freak in me would have issues with it. But, that is a personal problem.

Most others either manually reviewed things or had some similar notification system in place. Anyone else have a brilliant solution to monitoring PHP errors? Or any application level errors for that matter?

P.S. For what it is worth, our PHP error rate is about .004 per second. This includes some notices. Way too high for me. I want to approach absolute zero. Lets see if all these new tools can help us get there.

Boycott NuSphere. They are spammy spammers

brianlmoon — Thu, 28 Oct 2010 00:41:12 -0500

It took a lot for me to finally write this post. I tweeted about it a while back. It seems NuSphere needs more business. They have resorted to spamming people to promote their PhpED product. I have gotten emails to email addresses that:

I know are not on any mailing list
Are on web pages as plain mailto: anchor tags for good reasons.

In one case, it was the security@phorum.org address we have on the site to make it easy for people to report any security related issues. We get all kind of spam because of this, but it is worth it to have an easy access address for security issues.

In the other case, the email addresses used were on the dealnews.com jobs page. It was the addresses that are used to accept resumes.

Now, today, I started getting them to my personal inboxes.

Apparently, they sent it out the first time with a misspelling, so they had to send it out again!?!?

Now, the NuSphere people posted a tweet that claims the emails are not coming from them. That it's an independent marketer that "have permission to sell our products". First, the links in the email to go the NuSphere site, not a 3rd party. I would think an independent would want to take me to their site. Second, if that is the case, revoke the permission you have given them to sell your products. Don't use a marketing/sales agreement as a shield to allow people in the PHP Community to be spammed in your name.

PHP Community, please do no support NuSphere. They are spammers, directly or indirectly. If they do this to promote their product, how will they find a way to hose their user base later on?

PHP 5.3 and mysqlnd - Unexpected results

brianlmoon — Tue, 03 Aug 2010 08:00:00 -0500

I have started seriously using PHP 5.3 recently due to it finally making it into Portage. (Gentoo really isn't full of bleeding edge packages people.) I have used mysqlnd a little here and there in the past, but until it was really coming to my servers I did not put too much time into it.

What is mysqlnd?

mysqlnd is short for MySQL Native Driver. In short, it is a driver for MySQL for PHP that uses internal functions of the PHP engine rather than using the externally linked libmysqlclient that has been used in the past. There are two reasons for this. The first reason is licensing. MySQL is a GPL project. The GPL and the PHP License don't play well together. The second is better memory management and hopefully more performance. Being a performance junky, this is what peaked my interests. Enabling mysqlnd means it is used by the older MySQL extension, the newer MySQLi extension and the MySQL PDO driver.

New Key Feature - fetch_all

One new feature of mysqlnd was the fetch_all method on MySQLi Result objects. At both dealnews.com and in Phorum I have written a function to simply run a query and fetch all the results into an array and return it. It is a common operation when writing API or ORM layers. mysqlnd introduces a native fetch_all method that does this all in the extension. No PHP code needed. PDO already offers a fetchAll method, but PDO comes with a little more overhead than the native extensions and I have been using mysql functions for 14 years. I am very happy using them.

Store Result vs. Use Result

I have spoken in the past (see my slides and interview: MySQL Tips and Tricks) about using mysql_unbuffered_query or using mysqli_query with the MYSQLI_USE_RESULT flag. Without going into a whole post about that topic, it basically allows you to stream the results from MySQL back into your PHP code rather than having them buffered in memory. In the case of libmysqlclient, they could be buffered twice. So, my natural thought was that using MYSQLI_USE_RESULT with fetch_all would yield the most awesome performance ever. The data would not be buffered and it would get put into a PHP array in C instead of native code. The code I had hoped to use would look like:

$res = $db->query($sql, MYSQLI_USE_RESULT);
$rows = $res->fetch_all(MYSQLI_ASSOC);

But, I quickly found out that this does not work. For some reason, this is not supported. fetch_all only works with the default which is MYSQLI_STORE_RESULT. I filed a bug which was marked bogus. Which I put back to new because I really don't see a reason this should not work other than a complete oversight by the mysqlnd developers. So, I started doing some tests in hopes I could show the developers how much faster using MYSQLI_USE_RESULT could be. What happened next was not expected. I ended up benchmarking several different options for fetching all the rows of a result into an array.

Test Data

I tested using PHP 5.3.3 and MySQL 5.1.44 using InnoDB tables. For test data I made a table that has one varchar(255) column. I filled that table with 30k rows of random lengths between 10 and 255 characters. I then selected all rows and fetched them using 4 different methods.

In addition, I ran this test with mysqlnd enabled and disabled. For mysqli_result::fetch_all, only mysqlnd was tested as it is only available with mysqlnd. I ran each test 6 times and threw out the worst and best result for each test. FWIW, the best and worst did not show any major deviation for any of the tests. For measuring memory usage, I read the VmRSS value from Linux's /proc data. memory_get_usage() does not show the hidden memory used by libmysqlclient and does not seem to show all the memory used by mysqlnd either.

So, that is what I found. The memory usage graphs are all what I thought they would be. PDO has more overhead by its nature. Storing the result always uses more memory than using it. mysqli_result::fetch_all uses less memory than the loop, but more than directly using the results.

There are some very surprising things in the timing graphs however. First, the tried and true method of using the result followed by a loop is clearly still the right choice in libmysqlclient. However, it is a horrible choice for mysqlnd. I don't really see why this is so. It is nearly twice as slow. There is something really, really wrong with MYSQLI_USE_RESULT in mysqlnd. There is no reason it should ever be slower than storing the result and then reading it again. This is also evidenced in the poor performance of PDO (since even PDO uses mysqlnd when enabled). PDO uses an unbuffered query for its fetchAll method and it too got slower. It is noticably slower than libmysqlclient. The good news I guess is that if you are using mysqlnd, the fetch_all method is the best option for getting all the data back.

Next Steps

My next steps from here will be to find some real workloads that I can test this on. Phorum has several places where I can apply real world pages loads to these different methods and see how they perform. Perhaps the test data is too small. Perhaps the number of columns would have a different effect. I am not sure.

If you are reading this and have worked on or looked at the mysqlnd code and can explain any of it, please feel free to comment.

PHP and Memcached: The state of things

brianlmoon — Wed, 23 Jun 2010 11:48:31 -0500

Memcached is the de facto standard for caching in dynamic web sites. PHP is the one of the most widely used languages on the web. So, naturally there is lots of interest in using the two together. There are two choices for using memcached with PHP: PECL/memcache and PECL/memcached. Great names huh? But as of this writing there are issues with the two most popular Memcached libraries for PHP. This is a summary of those issues that I hope will help people being hurt by them and may bring about some change.

PECL/memcache

This is the older of the two and was the first C based extension for using memcached with PHP. Before this all the options were native PHP code. While they worked, they were slower of course. C > PHP. That is just fact. However, there has not been much active development on this code in some time. Yes, they have fixed bugs, but support for new features in the memcached server have not been added to this extension. Newer version of the server suppot a more efficient binary protocol and many new features. In addition, there are some parts of the extension that simply don't work anymore.

The most glaring one is the delete() function. It takes a second parameter that is documented as: "the item will expire after timeout seconds". In fact that was never a feature of memcached. It was completely misunderstood by the original extension authors. When that parameter was supported, it locked the key for timeout seconds and would not allow a new add operation on that key. Second, this feature was completely removed in memcached 1.4.0. So, if you send a timeout to the delete function, you simply get a failure. This is creating a huge support issue in the memcached community (not the PHP/PECL community) with people not being able to delete keys because of bad documentation and unsupported behavior. I sent a documentation patch for the this function to the PHP Docs list. I then modified it based on feedback. But since I have heard nothing about it getting merged. I have a PHP svn account, but even if I do have karma on the docs repository, I don't want to hijack them without the support of the people that work on the docs all the time. If you are reading this and can change the docs, please make the documentation for that function say "DON'T USE THIS PARAMETER, IT HAS BEEN DEPRECATED IN THE MEMCACHED SERVER!" or something.

Not too long ago the extension was officially abandoned by its original maintainers. Some people stepped up and claimed they wanted to see it continue. But, since that time, there have been no releases. There are bugs being closed though so maybe there is good things coming.

A very problematic issue with this extension is with the 3.0 beta release. It needs to just die. It has huge bugs that, IMO, were introduced by the previous maintainers in an effort to bring it up to speed, but never saw them through to make sure the new code worked. In their defense, it is marked as beta on PECL. But, thanks to Google, people don't see the word beta anymore. There are lots of people using this version and when they get bad results they blame the whole memcached world. Really, the new maintainers would do the world a favor if they just removed the 3.x releases from the PECL site.

PECL/memcached

This extension was started by Andrei Zmievski while working at Digg.com as an open source developer. It uses libmemcached, a C++ memcached library that does all the memcached work. This made it quite easy to support the new features in the memcached daemon as it as it was being developed at the same time as the new server. However, it has not had a stable release on PECL in nearly a year except for release to make it compatible with new versions of libmemcached. No bug fixes and no new features. There are currently 28 open bugs on PECL for this extension. Not all of which are bugs. Some are feature requests. The ironic thing is that the GitHub repository for this extension has seen a lot of development. But, none of these bug fixes have made it into the official PECL channel. And some of these bugs are major and others are just huge WTF for a developer.

The most major bug is one that I found. If you use persistent connections with this extension, it basically leaks those connections, not reusing an existing connection but also not closing the ones already made. This uses up the memory in your processes until they crash and it creates an exponential number of connections to your memcached server until it has no more connections available. Andrei does report in the bug that it is fixed in hist GitHub. But, for now, you can't use persistent connections.

The big WTF for a developer is that some functions don't take a numeric key and use it properly.


$mc = new Memcached();

$mc->addServer("localhost", 11211);

$mc->set(123, "yes!");

var_dump($mc->getMulti(array(123)));

?>

The above code should generate:

array(1) {
  [123]=>
  string(4) "yes!"
}

But, in reality, it generates:

bool(false)

The set succeeds, but the getMulti fails. Again, this is being fixed in GitHub, but is not available for release on PECL.

Compatibility

One big issue with these two extensions is that they are not drop in replacements for each other. If you want to move from one to the other, you have to take into consideration some time to convert your code. For instance, the older extension's set() function takes flags as the third parameter. The newer extension does not take a flags parameter at all. The older uses get() with an array to do a multi-get and the newer extension has a separate method for that called getMulti(). There are others as well.

Summary

So, what should you do as a PHP developer? If you are deploying memcached today, I would use the 2.2.x branch of PECL/memcache. It is the most stable. Just avoid the delete bug. It is not as fast and does not have the features. But, it is very reliable for set, get, add.... the basics of memcached. For the long term, it is a bit unclear. PECL/memcached looks to fix a lot of things in 2.0. But, I have not used it yet. In addition the long term growth of the project is a bit in question. Will there be a 2.1, 2.2, etc? I hope so. The other unknown is if the those people fixing the PECL/memcache bugs will keep it up and release a good stable product that supports new features of the server. Again, I hope so. The best scenario would be to have a choice between two fully compatible and feature rich extensions. Keep your fingers crossed.

PHP generated code tricks

brianlmoon — Fri, 18 Jun 2010 13:56:07 -0500

Something that is great about PHP is that you can write code that generates more PHP code to be used later. Now, I am not saying this a best practice. I am sure it violates some rule in some book somewhere. But, sometimes you need to be a rule breaker.

A simple example is taking a database of configuration information and dumping it to an array. We do this for each publication we operate. We have a publication table. It contains the name, base URL and other stuff that is specific to that publication. But, why query the database for something that only changes once in a blue moon? We could cache it, but that would still require an on demand database hit. The easy solution is to just dump the data to a PHP array and put it on disk.


$sql = "select * from publications";

$res = $mysqli->query($sql);

while($row = $res->fetch_assoc()){

    $pubs[$row["publication_id"]] = $row;

}

$pubs_ser = str_replace("'", "\\'", serialize($pubs));

$php_code = "";

file_put_contents("/some/path/publications.php", $php_code);

?>

Now you can include the publications.php file and have a global variable named $PUBLICATIONS that holds the publication settings. But, how do we load a single publication without knowing numeric ids? Well, you could make some constants.


$sql = "select * from publications";

$res = $mysqli->query($sql);

while($row = $res->fetch_assoc()){

    $pubs[$row["publication_id"]] = $row;

    $constants[$row["publication_id"]] = strtoupper($row["name"]);

}

$pubs_ser = str_replace("'", "\\'", serialize($pubs));

$php_code = "
$php_code.= "global \$PUBLICATIONS;\n";

$php_code.= "\$PUBLICATIONS = unserialize('$pubs_ser');\n";

foreach($constants as $id=>$const){

    $php_code.= "define('$const', $id);\n";

}

$php_code.= "?>";

file_put_contents("/some/path/publications.php", $php_code);

?>

So, now, we have constants. We can do stuff like:


//load a publication

require_once "publications.php";

echo $PUBLICATIONS[DEALNEWS]["name"];

?>

But, how about autoloading? It would be nice if I could just autoload the constants.


$sql = "select * from publications";

$res = $mysqli->query($sql);

while($row = $res->fetch_assoc()){

    $pubs[$row["publication_id"]] = $row;

    $constants[$row["publication_id"]] = strtoupper($row["name"]);

}

$pubs_ser = str_replace("'", "\\'", serialize($pubs));

$php_code = "
$php_code.= "class PUB_DATA {\n";

foreach($constants as $id=>$const){

    $php_code.= " const $const = $id;\n";

}

$php_code.= "    protected \$pubs_ser = '$pubs_ser';\n";

$php_code.= "}";

$php_code.= "?>";

file_put_contents("/some/path/pub_data.php", $php_code);

?>

Then we create a class in our autoloading directory that extends that object.


require_once "pub_data.php";

class Publication extends PUB_DATA {

    private $pub;

    public function __construct($pub_id) {

        $pubs = unserialize($this->pubs_ser);

        $this->pub = $pubs[$pub_id];

    }

    public function __get($var) {

        if(isset($this->pub[$var])){

            return $this->pub[$var];

        } else {

            // Exception

        }

    }

}

?>

Great, now we can do things like:

$pub = new Publication(Publication::DEALNEWS);

echo $pub->name;

The only problem that remains is dealing with getting the generated code to all your servers. We use rsync. It works quite well. You may have a different solution for your team. Back when we ran our own in house ad server we did all the ad work this way. None of the ad calls ever hit the database to get ads. We stored stats on disk in logs and processed them on a schedule. It was a very solid solution.

One more benefit of using generated files on disk is that they can be cached by APC or XCache. This means you don't have to actually hit disk for them all the time.

MySQL Conference Review

brianlmoon — Sat, 17 Apr 2010 10:54:52 -0500

I am back home from a good week at the 2010 O'Reilly MySQL Conference & Expo. I had a great time and got to see some old friends I had not seen in a while.

Oracle gave the opening keynote and it went pretty much like I thought it would. Oracle said they will keep MySQL alive. They talked about the new 5.5 release. It was pretty much the same keynote Sun gave last year. Time will tell what Oracle does with MySQL.

The expo hall was sparse. Really sparse. There were a fraction of the booths compared to the past. I don't know why the vendors did not come. Maybe because they don't want to compete with Oracle/Sun? In the past you would see HP or Intel have a booth at the conference. But, with Oracle/Sun owning MySQL, why even try. Or maybe they are not allowed? I don't know. It was just sad.

I did stop by the Maatkit booth and was embarrassed to tell Baron (its creator) I was not already using it. I had heard people talk about it in the past, but never stopped to see what it does. It would have only saved me hours and hours of work over the last few years. Needless to say it is now being installed on our servers. If you use MySQL, just go install Maatkit now and start using it. Don't be like me. Don't wait for years, writing the same code over and over to do simple maintenance tasks.

Gearman had a good deal of coverage at the conference. There were three talks and a BoF. All were well attended. Some people seemed to have an AHA! moment where they saw how Gearman could help their architecture. I also got to sit down with the PECL/gearman maintainers and discuss the recent bug I found that is keeping me from using it.

I spoke about Memcached as did others. Again, there was a BoF. It was well attended and people had good questions about it. There seemed to be some FUD going around that memcached is somehow inefficient or not keeping up with technology. However, I have yet to see numbers or anything that proves any of this. They are just wild claims by people that have something to sell. Everyone wants to be the caching company since there is no "Memcached, Inc.". There is no company in charge. That is a good thing, IMO.

That brings me to my favorite topic for the conference, Drizzle. I wrote about Drizzle here on this blog when it was first announced. At the time MySQL looked like it was moving forward at a good pace. So, I had said that it would only replace MySQL in one part of our stack. However, after what, in my opinion, has been a lack of real change in MySQL, I think I may have changed my mind. Brian Aker echoed this sentiment in his keynote address about Drizzle. He talked about how MySQL AB and later Sun had stopped focusing on the things that made MySQL popular and started trying to be a cheap version of Oracle. That is my interpretation of what he said, not his words.

Why is Drizzle different? Like Memcached and Gearman, there is no "Drizzle, Inc.". It is an Open Source project that is supported by the community. It is being supported by companies like Rackspace who hired five developers to work on it. The code is kept on Launchpad and is completely open. Anyone can create a branch and work on the code. If your patches are good, they will be merged into the main branch. But, you can keep your own branch going if you want to. Unlike the other forks, Drizzle has started over in both the code and the community. I personally see it as the only way forward. It is not ready today, but my money is on Drizzle five or ten years from now.

DevOps at dealnews.com

brianlmoon — Mon, 05 Apr 2010 08:00:00 -0500

I was telling someone how we roll changes to production at dealnews and they seemed really amazed by it. I have never really thought it was that impressive. It just made sense. It has kind of happened organically here over the years. Anyhow, I thought I would share.

Version Control

So, to start with, everything is in SVN. PHP code, Apache configs, DNS and even the scripts we use to deploy code. That is huge. We even have a misc directory in SVN where we put any useful scripts we use on our laptops for managing our code base. Everyone can share that way. Everyone can see what changed when. We can roll things back, branch if we need to, etc. I don't know how anyone lives with out. We did way back when. It was bad. People were stepping on each other. It was a mess. We quickly decided it did not work.

For our PHP code, we have trunk and a production branch. There are also a couple of developers (me) that like to have their own branch because they break things for weeks at a time. But, everything goes into trunk from my branch before going into production. We have a PHP script that can merge from a developer branch into trunk with conflict resolution assistance built in. It is also capable of merging changes from trunk back into a branch. Once it is in trunk we use our staging environment to put it into production.

Staging/Testing

Everything has a staging point. For our PHP code, it is a set of test staging servers in our home office that have a checkout of the production branch. To roll code, the developer working on the project logs in via ssh to a staging server as a restricted user and uses a tool we created that is similar to the Python based svnmerge.py. Ours is written in PHP and tailored for our directory structure and roll out procedures. It also runs php -l on all .php and .html files as a last check for any errors. Once the merge is clean, the developer(s) use the staging servers just as they would our public web site. The database on the staging server is updated nightly from production. It is as close to a production view of our site as you can get without being on production. Assuming the application performs as expected, the developer uses the merge tool to commit the changes to the production branch. They then use the production staging servers to deploy.

Rolling to Production

For deploying code and hands on configuration changes into our production systems, we have a staging server in our primary data center. The developer (that is key IMO) logs in to the production staging servers, as a restricted user, and uses our Makefile to update the checkout and rsync the changes to the servers. Each different configuration environment has an accompanying nodes file that lists the servers that are to receive code from the checkout. This ensures that code is rolled to servers in the correct order. If an application server gets new markup before the supporting CSS or images are loaded onto the CDN source servers, you can get an ugly page. The Makefile is also capable of copying files to a single node. We will often do this for big changes. We can remove a node from service, check code out to it, and via VPN access that server directly to review how the changes worked.

For some services (cron, syslog, ssh, snmp and ntp) we use Puppet to manage configuration and to ensure the packages are installed. Puppet and Gentoo get along great. If someone mistakenly uninstalls cron, Puppet will put it back for us. (I don't know how that could happen, but ya never know). We hope to deploy more and more Puppet as we get comfortable with it.

Keeping Everyone in the Loop

Having everyone know what is going on is important. To do that, we start with Trac for ticketing. Secondly, we use OpenFire XMPP server throughout the company. The devops team has a channel that everyone is in all day. When someone rolls code to production, the scripts mentioned above that sync code out to the servers sends a message via an XMPP bot that we wrote using Ruby (Ruby has the best multi-user chat libraries for XMPP). It interfaces with Trac via HTTP and tells everyone what changesets were just rolled and who committed them. So, in 5 minutes if something breaks, we can go back and look at what just rolled.

In addition to bots telling us things, there is a cultural requirement. Often before a big roll out, we will discuss it in chat. That is the part than can not be scripted or programmed. You have to get your developers and operations talking to each other about things.

Final Thoughts

There are some subtle concepts in this post that may not be clear. One is that the code that is written on a development server is the exact same code that is used on a production server. It is not massaged in any way. Things like database server names, passwords, etc. are all kept in configuration files on each node. They are tailored for the data center that server lives in. Another I want to point out again is that the person that wrote the code is responsible all the way through to production. While at first this may make some developers nervous, it eventually gives them a sense of ownership. Of course, we don't hire someone off the street and give them that access. But it is expected that all developers will have that responsibility eventually.

Logging with MySQL

brianlmoon — Wed, 24 Mar 2010 00:20:12 -0500

I was reading a post by Dathan Vance Pattishall titled "Cassandra is my NoSQL solution but..". In the post, Dathan explains that he uses Cassandra to store clicks because it can write a lot faster than MySQL. However, he runs into problems with the read speed when he needs to get a range of data back from Cassandra. This is the number one problem I have with NoSQL solutions.

SQL is really good at retrieving a set of data based on a key or range of keys. Whereas NoSQL products are really good at writing things and retrieving one item from storage. When looking at redoing our architecture a few years ago to be more scalable, I had to consider these two issues. For what it is worth, the NoSQL market was not nearly as mature as it is now. So, my choices were much more limited. In the end, we decided to stick with MySQL. It turns out that a primary or unique key lookup on a MySQL/InnoDB table is really fast. It is sort of like having a key/value storage system. And, I can still do range based queries against it.

But, back to Dathan's problem: clicks. We store clicks at dealnews. Lots of clicks. We also store views. We store more views than we do clicks. So, lots of views and lots of clicks. (Sorry for the vague numbers, company secrets and all. We are a top 1,000 Compete.com site during peak shopping season.) And we do it all in MySQL. And we do it all with one server. I should disclose we are deploying a second server, but it is more for high availability than processing power. Like Dathan, we only use about the last 24 hours of data at any given time. There are three keys for us doing logging like this in MySQL.

Use MyISAM

MyISAM supports concurrent inserts. Concurrent inserts means that inserts can add rows to the end of a table while selects are being performed on other parts of the data set. This is exactly the use case for our logging. There are caveats with range queries as pointed out by the MySQL Performance Blog.

Rotating tables

MySQL (and InnoDB in particular) really sucks at deleting rows. Like, really sucks. Deleting causes locks. Bleh. So, we never delete rows from our logging tables. Instead, nightly we rotate the tables. RENAME TABLE is an (near) atomic process in MySQL. So, we just create a new table.

create table clicks_new like clicks;
rename table clicks to clicks_2010032500001, clicks_new to clicks;

Tada! We now have an empty table for today's clicks. We now drop any table with a date stamp that is longer than x days old. Drops are fast, we like drops.

For querying these tables, we use UNION. It works really well. We just issue a SHOW TABLES LIKE 'clicks%' and union the query across all the tables. Works like a charm.

Gearman

So, I get a lot of flack at work for my outright lust for Gearman. It is my new duct tape. When you have a scalability problem, there is a good chance you can solve it with Gearman. So, how does this help with logging to MySQL? Well, sometimes, MySQL can become backed up with inserts. It happens to the best of us. So, instead of letting that pile up in our web requests, we let it pile up in Gearman. Instead of having our web scripts write to MySQL directly, we have them fire Gearman background jobs with the logging data in them. The Gearman workers can then write to the MySQL server when it is available. Under normal operating procedure, that is in near real time. But, if the MySQL server does get backed up, the jobs just queue up in Gearman and are processed when the MySQL server is available.

BONUS! Insert Delayed

This is our old trick before we used Gearman. MySQL (MyISAM) has a neat feature where you can have inserts delayed until the table is available. The query is sent to the MySQL server and it answers with success immediately to the client. This means your web script can continue on and not get blocked waiting for the insert. But, MySQL will only queue up so many before it starts erroring out. So, it is not as fool proof as a job processing system like Gearman.

Summary

To log with MySQL:

Use MyISAM with concurrent inserts
Rotate tables daily and use UNION to query
Use delayed inserts with MySQL or a job processing agent like Gearman

Happy logging!

PS: You may be asking, "Brian, what about Partitioned Tables?" I asked myself that before deploying this solution. More importantly, in IRC I asked Brian Aker about MySQL partitioned tables. I am paraphrasing, but he said that if I ever think I might alter that table, I would not trust it with the partitions in MySQL. So, that kind of turned me off of them.

PHP command line progress bar

brianlmoon — Wed, 10 Mar 2010 11:33:34 -0600

Was just looking through some code and came across this function I wrote some time ago. If you do a lot of your processing scripts in PHP like we do, you probably need to know what is going on sometimes. So, I made a progress bar for use on the cli. I thought I would share it. Here is a video of it in action. And the code can be found here.

ob_start and HTTP headers

brianlmoon — Fri, 29 Jan 2010 10:00:00 -0600

I was helping someone in IRC deal with some "headers already sent" issues and told them to use ob_start. Very diligently, the person went looking for why that was the right answer. He did not find a good explination. I looked around and I did not either. So, here is why this happens and why ob_start can fix it.

How HTTP works

HTTP is the communication protocol that happens between your web server and the user's browser. Without too much detail, this is broken into two pieces of data: headers and the body. The body is the HTML you send. But, before the body is sent, the HTTP headers are sent. Here is an example of an HTTP request response including headers:

HTTP/1.1 200 OK
Date: Fri, 29 Jan 2010 15:30:34 GMT
Server: Apache
X-Powered-By: PHP/5.2.12-pl0-gentoo
Set-Cookie: WCSESSID=xxxxxxxxxxxxxxxxxxxxxxxxxxxx; expires=Sun, 28-Feb-2010 15:30:34 GMT; path=/
Content-Encoding: gzip
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8




 
 Ramblings of a web guy
.
.

So, all those lines before the HTML starts have to come first. HTTP headers are where things like cookies and redirection occur. When a PHP script starts to send HTML out to the browser, the headers are stopped and the body begins. When your code tries to set a cookie after this has started, you get the "headers already sent" error message.

How ob_start works

So, how does ob_start help? The ob in ob_start stands for output buffering. ob_start will buffer the output (HTML) until the page is completely done. Once the page is completely done, the headers are sent and then the output is sent. This means any calls to setcookie or the header function will not cause an error and will be sent to the browser properly. You do need to call ob_start before any output occurs. If you start output, it is too late.

The down side

The down side of doing this is that the output is buffered and sent all at once. That means that the time between the user request and the time the first byte gets back to the user is longer than it has to be. However, in modern PHP application design, this is often already the case. An MVC framework for example would do all the data gathering before any presentation is done. So, your application may not have any issue with this.

Another down side is that you (or someone) could get lazy and start throwing setcookie calls in any old place. This should be avoided. It is simply not good programming design. In a perfect world, we would not need output buffering to solve this problem for us.

Using ini files for PHP application settings

brianlmoon — Tue, 19 Jan 2010 17:24:35 -0600

At dealnews we have three tiers of servers. First is our development servers, then staging and finally production. The complexity of the environment increases at each level. On a development server, everything runs on the localhost: mysql, memcached, etc. At the staging level, there is a dedicated MySQL server. In production, it gets quite wild with redundant services and two data centers.

One of the challenges of this is where and how to store the connection information for all these services. We have done several things in the past. The most common thing is to store this information in a PHP file. It may be per server or there could be one big file like:








if(DEV){


    $server = "localhost";


} else {


    $server = "10.1.1.25";


}





?>

This gets messy quickly. Option two is to deploy a single file that has the settings in a PHP array. And that is a good option. But, we have taken that one step further using some PHP ini trickeration. We use ini files that are loaded at PHP's startup and therefore the information is kept in PHP's memory at all times.

When compiling PHP, you can specify the --with-config-file-scan-dir to tell PHP to look in that directory for additional ini files. Any it finds will be parsed when PHP starts up. Some distros (Gentoo I know) use this for enabling/disabling PHP extensions via configuration. For our uses we put our custom configuration files in this directory. FWIW, you could just put the above settings into php.ini, but that is quite messy, IMO.

To get to this information, you can't use ini_get() as you might think. No, you have to use get_cfg_var() instead. get_cfg_var returns you the setting, in php.ini or any other .ini file when PHP was started. ini_get will only return values that are registered by an extension or the PHP core. Likewise, you can't use ini_set on these variables. Also, get_cfg_var will always reflect the initial value from the ini file and not anything changed with ini_set.

So, lets look at an example.



; db.ini


[myconfig]


myconfig.db.mydb.db     = mydb


myconfig.db.mydb.user   = user


myconfig.db.mydb.pass   = pass


myconfig.db.mydb.server = host

This is our ini file. the group in the braces is just for looks. It has no impact on our usage. Because this is parsed along with the rest of our php.ini, it needs a unique namespace within the ini scope. That is what myconfig is for. We could have used a DSN style here, but it would have required more parsing in our PHP code.










/**


 * Creates a MySQLi instance using the settings from ini files


 *


 * @author     Brian Moon 


 * @copyright  1997-Present dealnews.com, Inc.


 *


 */





class MyDB {





    /**


     * Namespace for my settings in the ini file


     */


    const INI_NAMESPACE = "dealnews";





    /**


     * Creates a MySQLi instance using the settings from ini files


     *


     * @param   string  $group  The group of settings to load.


     * @return  object


     *


     */


    public static function init($group) {





        static $dbs = array();





        if(!is_string($group)) {


            throw new Exception("Invalid group requested");


        }





        if(empty($dbs["group"])){





            $prefix = MyDB::INI_NAMESPACE.".db.$group";





            $db   = get_cfg_var("$prefix.db");


            $host = get_cfg_var("$prefix.server");


            $user = get_cfg_var("$prefix.user");


            $pass = get_cfg_var("$prefix.pass");





            $port = get_cfg_var("$prefix.port");


            if(empty($port)){


                $port = null;


            }





            $sock = get_cfg_var("$prefix.socket");


            if(empty($sock)){


                $sock = null;


            }





            $dbs[$group] = new MySQLi($host, $user, $pass, $db, $port, $sock);





            if(!$dbs[$group] || $dbs[$group]->connect_errno){


                throw new Exception("Invalid MySQL parameters for $group");


            }


        }





        return $dbs[$group];





    }





}





?>

We can now call DB::init("myconfig") and get a mysqli object that is connected to the database we want. No file IO was needed to load these settings except when the PHP process started initially. They are truly constant and will not change while this process is running.

Once this was working, we created separate ini files for our different datacenters. That is now simply configuration information just like routing or networking configuration. No more worrying in code about where we are.

We extended this to all our services like memcached, gearman or whatever. We keep all our configuration in one file rather than having lots of them. It just makes administration easier. For us it is not an issue as each location has a unique setting, but every server in that location will have the same configuration.

Here is a more real example of how we set up our files.



[myconfig.db]


myconfig.db.db1.db         = db1


myconfig.db.db1.server     = db1hostname


myconfig.db.db1.user       = db1username


myconfig.db.db1.pass       = db1password





myconfig.db.db2.db         = db2


myconfig.db.db2.server     = db2hostname


myconfig.db.db2.user       = db2username


myconfig.db.db2.pass       = db2password





[myconfig.memcache]


myconfig.memcache.app.servers    = 10.1.20.1,10.1.20.2,10.1.20.3


myconfig.memcache.proxy.servers  = 10.1.20.4,10.1.20.5,10.1.20.6





[myconfig.gearman]


myconfig.gearman.workload1.servers = 10.1.20.20


myconfig.gearman.workload2.servers = 10.1.20.21

Autoloading for fun and profit

brianlmoon — Fri, 08 Jan 2010 09:44:47 -0600

So, as I stated in a previous post, the code base here at dealnews is going through some changes. I have been working heavily on those changes since then. Testing and benchmarking to see what works best. One of those changes is the heavy use of autoloading.

During the holidays, I always like to read the PHP Advent. One of the posts this year was by Marcel Esser titled You Don’t Need All That. It was a great post and echos many things I have said about PHP and web development. In his post, Marcel benchmarks the difference between using an autoloader and using straight require statements. I was not surprised by the result. The autoloading overhead and class overhead is well known to me. It is one thing that kept me from using any heavy OOP (we banned classes on our user facing pages for a long time) in PHP 4 and PHP 5.0. It has gotten a lot better however. Class overhead is very, very small now. Especially when using it as a namespacing for static functions. However, there is one word of caution I would like to add to his statements. When you use require/include statements instead of autoloading, you end up with a file like this:





require "DB.php";

require "Article.php";





function myfunc1() {

    DB::somemethod();

}



function myfunc2(){

    Article::somemethod();

}



?>

That file needs to require two files, but each one is only needed by one function in the file. This is the dilemma we have found ourselves in. We have a file that is filled with functions for building, taking apart, repairing, fetching, or anything else you can think to do with a URL. So, at the top of that file are 13 require statements. This all happened organically over time. But, now we are in a situation where we load lots of files that may not even be needed to begin with. By moving to an autloading system, we will only include the code we need. This saves cycles and file IO.

Again, Marcel's post was dead on. Our front end is written with multiple entry points. We use auto_prepend files to do any common work that all requests need (sessions, loading the application settings, etc). The front page of dealnews.com is about 600 lines of PHP that does the job as quickly as possible. But, we are moving it to use autoloaded code because its require list has grown larger than I would like and some of those requires are not always "required" to generate the page.

One point Marcel did make was about how modern MVC systems are part of the problem. We don't have a highly structured traditional MVC layout of our code. We use a very wide, flat directory structure rather than deep nesting of classes and objects. We also don't do a lot of overloading. Maybe 1% of our classes extend another and never more than one deep. All that makes for much quicker loading objects vs. some of the packaged frameworks like Zend Framework.

So, as Marcel warned, be aware of both your use of autoloading and require statements. Both can be bad when used the wrong way.

Code Organization Dilemma

brianlmoon — Wed, 18 Nov 2009 08:00:00 -0600

So, we have been building up our code library at dealnews for 9 years. It was started at the end of PHP3 and the beginning of PHP4. So, we did not have autoloading, static functions, and all that jazz. Classes had lots of overhead in early PHP4 so we started down a pure procedural road in 2000. And for a long time, it was very maintainable. We had 2 or 3 developers for most of this time. We now have 5 or 6 depending on whether we have contractors. There are starting to be too many files and too many functions. We find ourselves adding new files when some new function is created instead of adding it to an existing file because we don't want to have huge files with 100 functions in them. File names and function names are getting longer and more ambiguous. For example, we have a file called url_functions.php. It contains functions to generate URLs for different types of pages on the site, functions to fetch URLs from the web and functions to parse URLs from an article. Those probably don't all belong in one file. But, they got nickle and dimed in there over time. So, now, we are inclined to not add anything to that file and make new files for new semi-URL related functions. Ugh.

It is time to start thinkinb about a reorganization. There are 1,900+ functions in 400+ files in our code library. This is just our library. This does not include the code that actually builds a page and generates output. It does not include our cron jobs or system administration scripts. Yeah, that is a lot. So, where do we go from here? Some things are easy to do. For example, we have a file called string.php. Most all the functions in that file can easily be moved a String class with static functions that can be accessed via an autoloader.

Then we have the various ways we deal with the articles on the web site. I have written about our front end vs. back end system before. What this means for our code base is that we have two ways to deal with an article. One is in our highly relational backend system. The other is in our optimized front end database servers. So, one Article object won't really do. We already have an Article object that serves as an ORM interface for the backend. To access the front end data, we currently have a library of functions (fetch_article for a single, fetch_articles for a set, etc.) but it does not fit with an autoloading environment. It also is not related to the object (the article) and is associated with where the data is stored. New developers don't grok the server infrastructure, so the code organization may not make sense to them. We have about 10 different objects that need both a back end and front end interface.

On the other hand, I really don't want to end up with a class named FrontEndArticle and BackEndArticle. Much less do we want to have stuff like BackEnd_Article where the file is actually in BackEnd/Article.php somewhere. The verbosity becomes overwhelming and hard to read, IMO.

So, what are others doing with huge code bases? I see lots of projects with 100 or so functions/methods in 20-30 files. Frameworks have it easy because they don't have a CEO that wants something on this one page to be different than it is on every other page where that data is used. We have to deal with those types of hacks in an elegant way that can be maintained.

Forums are crap. Can we get some help?

brianlmoon — Mon, 12 Oct 2009 10:01:44 -0500

Amy Hoy has written a blog post about why forums are crap. And she is right. Forum software does not always do a good job of helping people communicate. I have worked with Amy. She did a great analysis of dealnews.com that led to our new design. So, she is not to be ignored.

However, as a software developer (Phorum), I see a lot of problems and no answers. And it is not all on the software. Web site owners use forums to solve problems that they really, really suck at. Ideally, every web site would be very unique for their audience. They would use a custom solution that fits a number of patterns that best solves their problem. However, most web site owners don't want to take the time to do such things. They want a one stop, drop in solution. See the monolith that is vBulletin, scary.

And what if a forum is the best solution? Well, software developers, in general, are not good designers. They don't think like normal people. And they don't see their applications as a whole, but as pieces that do jobs. The forum software market has been run by software developers for over 10 years. Most of them all are still copies of what UBB was 13 years ago. And software (like Phorum) that has tried to be different is shunned by the online communities of the world because they don't work/look/feel like every other forum software on the planet.

So, as software developers, what are we to do? We want to make great software. We want to help our users help their users. But, what we have been doing for 10+ years has only been adequate. As the leader of an open source forum software project, I am open to any and all ideas.

Memcached: What is it and what does it do?

brianlmoon — Wed, 30 Sep 2009 11:49:39 -0500

I spoke at CodeWorks in Atlanta, GA this week. I totally dropped the ball promoting it on my blog. It was a neat venue. Rather than a large conference they are doing a traveling show. Seven cities in 14 days. Many of the presenters are working in every city. Crazy. I was just in Atlanta. It is close to home and easy for me to get to.

I spoke about memcached. I tried to dig a bit deeper into how memcached works. On the mailing list we get a lot of new people that make assumptions about memcached. Most talks I have seen focus on why caching is good, how to use memcached, the performance gain. I kind of assumed everyone knew that stuff already. I guess you could say I gave a talk that was the real FAQs of the project.

Here are the slides. Derick Rethans took video of the talk. When he gets that online I will add it to this post.

Memcached: What is it and what does it do?

View more documents from Brian Moon.

Wordcraft 0.10 available

brianlmoon — Mon, 10 Aug 2009 00:12:20 -0500

The latest package of Wordcraft, the PHP/MySQL based blog software that runs this site, is available for download from Google Code. Just some minor bug fixes and cosmetic stuff. Its getting a little use in the wild. That is always fun to see.

Forking PHP!

brianlmoon — Thu, 23 Jul 2009 13:17:42 -0500

We use PHP everywhere in our stack. For us, it makes sense because we have hired a great staff of PHP developers. So, we leverage that talent by using PHP everywhere we can.

One place where people seem to stumble with PHP is with long running PHP processes or parallel processing. The pcntl extension gives you the ability to fork PHP processes and run lots of children like many other unix daemons might. We use this for various things. Most notably, we use it run Gearman worker processes. While at the OReilly Open Sourc Convention in 2009, we were asked about how we pulled this off. So, we are releasing the two scripts that handle the forking and some instructions on how we use them.

This is not a detailed post about long running PHP scripts. Maybe I can get to the dos and don'ts of that another time. But, these are the scripts we use to manage long running processes. They work great for us on Linux. They will not run on Windows at all. We also never had any trouble running them on Mac OS X.

The first script, prefork.php, is for forking a given function from a given file and running n children that will execute that function. There can be a startup function that is run before any forking begins and a shutdown function to run when all the children have died.

The second script, prefork_class.php, uses a class with defined methods instead of relying on the command line for function names. This script has the added benefit of having functions that can be run just before each fork and after each fork. This allows the parent process to farm work out to each child by changing the variables that will be present when the child starts up. This is the script we use for managing our Gearman workers. We have a class that controls how many workers are started and what functions they provide. I may release a generic class that does that soon. Right now it is tied to our code library structure pretty tightly.

We have also included two examples. They are simple, but do work to show you how the scripts work.

You can download the code from the dealnews.com developers' page.

UPDATE: I have released a Gearman Worker Manager on Github.

Building PECL/memcache on Mac OS X

brianlmoon — Sun, 31 May 2009 22:23:29 -0500

My coworker Rob ran into an issue building the PECL/memcache extension on his Mac. He did find the solution however. You can read and leave comments on his blog.