What is next for message board software?

When I was hired at dealnews.com in 1998, my primary focus was to get our message board (Phorum) up to speed. I had written the first version as a side project for the site. Message boards were a lot simpler back then. Matt's WWWBoard was the gold standard of the time. And really, the functionality has been only evolutionary since. We added attachments to Phorum in 2003 or something. That was a major new feature. In Phorum 5 we added a module system that was awesome. But, that was just about the admin and not the user. From the user's perspective, message boards have not changed much since 1997. I saw this tweet from Amy Hoy and it got me to thinking about how message boards work. Here is the typical user experience:
  1. Go to the message board
  2. See a list of categories
  3. Drill down to the category they want to read
  4. Scroll through a list of messages that are in reverse cronological order by original post date or most recent post date
  5. Click a message and read it.
  6. Go to #3, repeat
Every message board software package pretty much works like that and has for over 10 years. And it kind of sucks. What a user would probably rather experience is:
  1. Go to the message board
  2. The most interesting things (to this user) are listed right there on the page. No drill down needed.
  3. Click one and read it.
  4. Goto #2, repeat.
Sounds easy? That #2 is easy to type but very hard to accomplish. I think it is conceivably doable if you are running a site that has all the data. Stackoverflow comes close. When you land on the site, they default the page to the "interesting" posts. However, they are not always interesting to me. They are making general assumptions about their audience. For example, right now, the first one is tagged "delphi". I could care less about that language and any posts about it. Its a good try, but misses by oh so far. This is not a Stackoverflow hate post. They are doing a good job. So, what do I do when I land there? I ignore the front page and click Tags (#2 in the first list), then pick a tag I want to read about (#3 in the first list). Low and behold the page I get is "newest". So, I end up doing exactly what is in the first list I mentioned. They do offer other sort options. But, they chose newest as the default. And from years of watching user behavior, 80% - 90% of people go with the good ol' default. This kind of brings me to another point though about the types of message boards there are.

Stackoverflow is a classic example of a help message board. People come there and ask a question. Other people come along and answer the question. Then more people come along and vote on whether the answers (and questions) are any good. This is one really nice feature that I think will have to become a core feature in any message board of the future. The signal to noise ratio can get so out of whack, you need human input to help decide what is good and what is noise. I think the core of the application has to rely on that if we are ever going to achieve the desired experience.

The second type of message board is a conversational system. It is almost like a delayed chat room. People come to a message board and post about their cat or asking who watched a TV show, that kind of thing. This has a completely different dynamic to it than the help message board. You can't really vote if a post is good or bad. The obvious exception being spam would of course want to be recognized and dealt with.

So, how do you know what content is desirable for the user that is entering the site right now? This concept has already been laid out for us: the social graph. You have to give users a way to associate with other users. If Bob really likes Tom's posts, he is probably more interested to read Tom's post from 30 minutes ago than some new guy that just joined the site and posted 1 minute ago. The challenge here is getting people to interconnect...but not too much. Everyone has that aunt on Facebook that follows you, your roommate and anybody else she can. She would follow your dog if he had a Facebook account. So, those people would still get a crappy experience if the whole system relied on the social graph. The other side is the people that will never "follow", "like" or whatever you call it another person. Their experience would lack as well. One key ingredient here is that you need to own this data. You can't just throw like buttons and Facebook connect on your message board and think you can leverage that data. That data is for Facebook, not you. I think the help message boards could benefit from the social graph as well.

Another aspect of what is most important to a user is discussions they are involved in. That could mean ones they started, ones they have replied to or simply ones they have read. Which of those aspects are more important than the others? Clearly if you started a discussion and someone has replied, that is going to interest you. If you posted a reply, you may be done with the topic or you may be waiting on a response. It would take some serious natural language algorithms to decide which is the case. For things you have read, I think you have to consider how many times the user has read the discussion. If every time it is updated they read it, they probably will want to read it again the next time it is updated. If they have only read it once, maybe they are not as interested.

The last aspect of message boards is grouping things. This is the part I actually struggle with the most. The easy first answer is tagging. Don't force the user down a single trail, let them tag posts instead of only posting them in one neat contained area. That gets you half way there. Let's use Stackoverflow (I really do like the site) as an example again. The first thing I do is go to Tags and click on PHP. I like helping people with PHP problems. So,  is that really any different from categorization? Sure, there could be someone out there that really likes helping with Javascript. And if the same post was tagged with both tags then their coverage of potential help is larger. But, some of the time those tags are wrong when they tag it with more than one tag. The problem they need help with is either PHP or Javascript, most likely not both. They just don't know what they are doing. For example, there is this post on Stackoverflow. The user tagged it PHP and database-design. There is no PHP in the question. I am guessing he is using PHP for the app. But, it really never comes up and he is only talking about database design. So, who did the PHP tag help there? I don't think it helped him. And it only wasted my time. Having written all that, a free-for-all approach where there is no filtering sucks too. ARGH! It just all sucks. That brings us back to what Amy said in a way. Perhaps moderated tagging is an answer. I have not seen a way on Stackoverflow to untag a post. That would let people correct others. I am gonna write that down. If you work at Stackoverflow and are reading this, you can use that idea. Just put a comment in the code about how brilliant I am or something that aliens will find one day.

So, I am done. I know exactly what to do right? I just have to make code that does everything I put in the previous paragraphs. Man I wish it were that easy. When you want to write a distributed application to do it, the task is even more daunting. If I controlled the data and the servers and the code, I could do crazy things that would make great conference talks. But, it kind of falls apart when I want to give this code to a 60 year old retired guy that is starting a hobby site for watching humming birds on a crappy GoDaddy account. Yeah, he is not installing Sphinx or HandlerSocket or Gearman. Those are all things I would want to use to solve this problem in a scalable fashion. At that point you have two choices. Aim for the small time or the big time. If you aim for the small time, you may get lots of installs, but, you will be hamstrung. If you aim for the big time, you may be the only guy that ever uses the code. That is a tough decision.

What have I missed? I know I missed something. Are there other types of message boards? I can definitely see some sub-types. Perhaps a board where ideas instead of help messages are posted. Or maybe the conversations are more show off based as in a user posting pictures or videos for comment. Is there already something out there doing this and I have just missed it? Let me know what I have missed please.

Forums are crap. Can we get some help?

Amy Hoy has written a blog post about why forums are crap. And she is right. Forum software does not always do a good job of helping people communicate. I have worked with Amy. She did a great analysis of dealnews.com that led to our new design. So, she is not to be ignored.

However, as a software developer (Phorum), I see a lot of problems and no answers.  And it is not all on the software.  Web site owners use forums to solve problems that they really, really suck at.  Ideally, every web site would be very unique for their audience.  They would use a custom solution that fits a number of patterns that best solves their problem.  However, most web site owners don't want to take the time to do such things.  They want a one stop, drop in solution. See the monolith that is vBulletin, scary.

And what if a forum is the best solution? Well, software developers, in general, are not good designers. They don't think like normal people. And they don't see their applications as a whole, but as pieces that do jobs. The forum software market has been run by software developers for over 10 years. Most of them all are still copies of what UBB was 13 years ago. And software (like Phorum) that has tried to be different is shunned by the online communities of the world because they don't work/look/feel like every other forum software on the planet.

So, as software developers, what are we to do? We want to make great software. We want to help our users help their users. But, what we have been doing for 10+ years has only been adequate. As the leader of an open source forum software project, I am open to any and all ideas.

Did you know I am going to be at Velocity?

Well, neither did I until today. HA!

Velocity is a new O'Reilly conference dedicated to "Optimizing Web Performance and Scalability".  It starts next Monday.  Yesterday I was contacted by Adam Jacobs of HJK Solutions about taking part in a panel discussion about what happens when success comes suddenly to a web site.  I think he thought I was in the bay area.  Little did he know I am in Alabama.  But, amazingly, I was able to work it all out so I can be there.  I wish I had known about this conference ahead of time.  It sounds really awesome.  Performance has always been something I focus on.  I hope to share some and learn at the same time.

So, if you are going to be there, come see our panel.

P.S. Thanks to John Allspaw of Flickr for recommending me to Adam.

MemProxy 0.1

MemProxy 0.1 is out!  It has taken me a while, but I have finally gotten around to releasing the code that I credited with saving us during a Yahoo! mention.  It is a caching proxy "server" that uses memcached for storing the cache.  I put server in quotes because it is really just a PHP script that handles the caching and talking to the application servers.  Apache and other HTTP servers already do a good job talking HTTP to a vast myriad of clients.  I did not see any reason to reinvent the wheel.  Here are some of the features that make it different from anything I could find:

  • Uses memcached for storage

  • Serves cache headers to clients based on TTL of cached data

  • Uses custom headers to assemble multiple pieces of cache into one object

  • Minimal dependencies.  Only PHP and pecl/memcached needed.

  • Small code base.  It is just two files, one when settings are cached.

  • Application agnostic.  If the backend is hosted on an HTTP server this can cache it.


Some other things it does that you might expect:

  • Handles HTTP 1.1 requests to the backend

  • Allows TTLs set by the standard Cache-Control header

  • Appears transparent to the client.

  • Sends proper HTTP error codes relating to proxies/gateways

  • Allows pages to be refreshed or removed from cache

  • Allows a page to be viewed from the application server without caching it

  • more....


You can find the code on Google Code.  The code (or something like it rather) has been in use at dealnews for well over a year.  But, this is a new code base.  It had to be refactored for public consumption.  So, there may be bugs.

in_array is quite slow

So, we had a cron job hanging for hours.  No idea why.  So, I started debugging.  It all came down to a call to in_array().  See, this job is importing data from a huge XML file into MySQL.  After it is done, we want to compare the data we just added/updated to the data in the table so we can deactivate any data we did not update.  We were using a mod_time field in mysql in the past.  But, that proved to be an issue when we wanted to start skipping rows from the XML that were present but unchanged.  Doing that saved a lot of MySQL writes and sped up the process.

So, anyhow, we have this huge array of ids accumulated during the import.  So, an in clause with 2 million parts would suck.  So, we suck back all the ids in the database that exist and stick that into an array.  We then compared the two arrays by looping one array and using in_array() to check if the value was in the second array.  Here is a pseudo example that shows the idea:

[sourcecode language='php']

foreach($arr1 as $key=>$i){

if(in_array($i, $arr2)){

unset($arr1[$key]);

}
}

[/sourcecode]

So, that was running for hours with about 400k items.  Our data did not contain the value as the key, but it could as the value was unique.  So, I added it.  So, now, the code looks like:

[sourcecode language='php']

foreach($arr1 as $key=>$i){

if(isset($arr2[$i])){

unset($arr1[$key]);

}
}

[/sourcecode]

Yeah, that runs in .8 seconds.  Much better.

So, why were we using in_array to start with if in_array is clearly not the right solution to this problem?  Well, it was basic code evolution.  Originally, these imports would be maybe 100 items.  But, things changed.

FWIW,  I tried array_diff() as well.  It took 25 seconds.  Way better than looping and calling in_array, but still not as quick as a simple isset check.  There was refactoring needed to put the values into the keys of the array.

UPDATE: I updated this post to properly reflect that there is nothing wrong with in_array, but simply that it was not the right solution to this problem.  I wrote this late and did not properly express this.  Thanks to all those people in the comments that helped explain this.

Thoughts on the 2008 MySQL Conference and Expo

Well, it has been almost a month.  I know I am late to the blogosphere on my thoughts.  Just been busy.

Again this year, the Phorum team was invited to be a part of the DotOrg Pavilion.  What is that?  Basically they just give expo floor space to open source projects.  It is cool.  We had a great location this year.  We were right next to the area where they served food and drinks during the breaks.  We had lots of traffic and met some of our power users.  IMVU.com is getting 1.5 million messages per month in their Phorum install.  They did have to customize it to fit into their sharding.  But, that is expected.  A guy (didn't catch his name) from Innobase came by and told us that they just launced InnoDB support forums on their site using Phorum.  Cool.  So now MySQL and Innobase use Phorum.  I am humbled by the message that sends to me about Phorum.

Speaking of our booth, we were right next to the phpMyAdmin guys.  Wow, that product has come a long way.  I was checking out the visual database designer they have now.  It was neat.  I also met the Gentoo MySQL package maintainer.  He was in the phpMyAdmin booth.

I was interviewed by WebDevRadio as I already posted.  I was also asked to do a short Q&A with the Sun Headlines video team.  They used one part of my clip.  I won't link to that.  No, if you find it good for you.  I need to be interviewed some more or something.  I did not look comfortable at all.

There were lots of companies with open in their name or slogan.  I guess this is expected pandering.

I attended part of the InnoDB talk given by Mark Callaghan of Google.  It appears that Google is serious about improving InnoDB on large machines.  That is, IMO, good news for anyone that likes InnoDB.  If I counted right, they had more than 5 people who at least part of their job is to improve InnoDB.

I gave my two talks.  The first had low attendance, but the feedback was nice.  It was just after the snack break in the expo hall and I was in the farthest room from the expo hall.  That is what I keep telling myself. =)  The second was better attended and the feedback seemed good there.  I was told by Maurice (Phorum Developer) that I talked too fast and at times sounded like Mr. Mackey from South Park by repeating the word bad a lot.  I will have to work on that in the future.  I want to do more speaking.

On the topic of my second talk, there seemed to be a lot of "This is how we scaled our site" talks.  I for one found them all interesting.  Everyone solves the problem differently.

Next year I am thinking about getting more specific with my talk submissions.  Some ideas include: PHP, MySQL and Large Data Sets, When is it ok to denormalize your data?, Using memcached (not so much about how it works), Index Creation (tools, tips, etc.).

In closing, I want to give a big thanks to Jay Pipes and Lenz Grimmer from MySQL.  Despite Jay's luggage being lost he was still a big help with some registration issues among other things.  Both of them helped out the Phorum team a great deal this year.  Thanks guys.

Interview with WebDevRadio

While I was at the MySQL Conference, I sat down with Michael Kimsal of WebDevRadio and recapped the two talks that I gave at the conference.  I have uploaded the slides so you can follow along if you want.

One to a Cluster - The evolution of the dealnews.com architecture.

MySQL Tips and Tricks - Some simple tips and some of the more advanced SQL we use in Phorum.

Thanks Michael.  Any time you need a guest, just let me know.

MySQL Conference Swag

I was reading a post about The Swag Report and realized that I stayed so busy at the Phorum booth (and a little at the memcached booth) and preparing for my talks, I did not bother to go around and collect any swag from the conference.  So, if you are a vendor and want to mail me some swag that I missed, you can send it to: Brian Moon, 198 S. Hillcrest Rd., Odenville, AL  35120.  Of course, I expect nothing.  But, ya never know what product I might pimp because of a t-shirt. =)

2008 MySQL Conference, part 1

It is always surprising what I learn when I go to a conference these days. Years ago, I could go to any talk and just suck it all in. Now, it is the little nuggets. The topics as a whole do more to confirm what I have already developed while running the Phorum project and building the infastructure for dealnews.com. That confirmation is still nice. You know you are not the only one that thought a particular solution was a good idea.

One of the confirmations I have had is that the big sites like Flickr, Wikipedia, Facebook and others don't use exotic setups when it comes to their hardware and OS. During a keynote panel, they all commented that they did not do any virtualization on their servers. Most did not use SANs. Some ran older MySQL versions but some were running quite recent versions. I have kept thinking that I did not have the desire to get to fancy with that stuff and clearly I am not the only one.

One of the little nuggets that will likely change my world is index_merge in MySQL. I feel silly as this has been around since 5.0.3 but I was not aware of it. Basically MySQL will now use more than one key to resolve a where clause and possibly an order by depending on the query. This could lead to me removing several keys from tables in both Phorum and at dealnews.

There were others, but I am tired and trying to get OpenID into the Phorum trunk right now so I will have to think of more later.

Phorum turns 10

So, I am at the MySQL Conference this week with my Phorum co-developers. We got to talking last night about how old Phorum is. We knew it was about 10 years. We pulled up some old archived zip file of version 1.5 and found in the this in the comment block.


* Created 04/16/1998


Whoa! That means that yesterday was the 10th birthday of the Phorum project. I would guess that is the date I originally put the code up on my personal web site for people to download. I remember sending that email to the PHP General mailing list. I told people they could have the code if they would help debug it. Later I officially made a GPL license and then a BSD style license as I became more knowledgeable about the open source and free software world.

So, for kicks we decided to install version 1.6 on the phorum.org site. Keep in mind the release date for that was March 30, 1999. The only hurdles were a default value on an auto increment column in the .sql file, needing register_globals and adding .php3 to be parsed as PHP. That got it up and running. I had hoped to post the URL for fun, but sadly, 5 lines in were sql injection vulnerabilties. Ah, the good ol' days.

Sadly, I don't have my emails from 1998. I lost everything in 2001 due to either a hard drive crash or some shady deal I had with someone hosting the Phorum site at the time. I can't remember. If anyone happens to have UseNet archives or mailing list archives of the PHP General list from April 1998, please let me know. I would love to have that old stuff.