Bing ads are such bull crap

Have you seen these Bing ads? They insinuate that web search gives you a lot of crap, but that Bing can cure that problem. Really? One simple question, Bing:

How big is the sun?

Google

Google tells me the mass right off the bat. Neat. The following links are all very relevant too.



Yahoo

Yahoo gives me an interesting alternate query. The links are all relevant too.



Bing

Bing starts well by suggesting some queries.



But, it soon falls apart IMO.



Big Red Sun? A garden design company? That is the most relevant? Really? The third result is way off too. Their related searches clearly seem to indicate that Bing understands what I am looking for, but their results fail badly. Get the log out of your eye, Bing.

O'Reilly Open Source Conference Day Two

So, day two was the cool keynote day.  Day one keynotes were from Tim O'Reilly (not that he is not cool) and the vendor sponsors.  The Intel building blocks stuff was neat, but most of it was vendor stuff IMO.

Today we had the "cool thing to hear and see, but I probably won't use it" keynote.  It was The Processing Development Environment.  It was really cool.  You can read more about it at processing.org.

The next keynote was hard for me to follow.  There were no slides, and he stood behind the podium the whole time.  Gnat seemed to love it, as he told us all in IRC.  You can read the guy's blog at overcomingbias.com.  It was basically about overcoming the biases you have.... I think.

Interestingly (speaking of bias), the next keynote was from Microsoft.  Coincidence?  According to the speaker, MS (or at least this guy) is really trying to make some Open Source stuff.  Time will tell.  Also, they are "working" with the OSI to get their licensing approved as Open Source licenses.  As someone in IRC said, it's a win/win for them.  If they don't get approved, they can just blame the OSI for being inflexible.  Nate kind of put him on the spot about patents after his talk.  He handled it well and kind of rode the fence.

The last keynote was, for me, the payoff keynote.  It's the one I will remember the most from this year.  It was about branding.  The poor guy did not have his slides due to technical issues and still did a great job.  You can read Steve's blog at steve-yegge.blogspot.com.  Maybe he will post the slides.

I attended a couple of good sessions today.  One was about caching, mostly with APC.  But, if you stripped down the APC stuff and just took some of his concepts, you could apply some of it to lots of caching methods.  The talk was given by Gopal Vijayaraghavan of Yahoo! I don't have a URL for the site where his slides may be.  If I find it, I will post it.

Another one was about legacy PHP code.  I didn't agree with 100% of what he was saying, but if you are in the boat he described, anything is better than where you are.  The guy's site is clintonrnixon.net.  Hopefully he will put up the slides and maybe a blog post about it.

The last talk that I want to tell you about was from Amy Hoy.  She gave the "When Interface Design Attacks!" talk again this year.  Just like last year, it was brilliant.  There were new topics like web 2.0.  I was happy to see that the Phorum 5.2 template I have been working on (emerald) already included many of her recommendations.  I guess she rubbed off on me last year.  Amy has started her own consulting company.  If we need usability and/or interface design help again (bleh, the last one was less than exciting) I will push for using her for sure.  Check out her site (linked above) for more stuff from her.

The day (and conference really) ended with parties.  We went to the Sourceforge Open Source Awards party.  phpBB won best tool for communication.  Gag me with a chicken bone.  I guess it has a large install base.  But, MySpace has lots of users too.  That does not mean it's not a black eye on the internet.  Ok, MySpace is worse than phpBB for sure.  But, c'mon, I write Phorum.  I am biased (see above keynote =).  It was a popularity contest and I guess there are more kiddies to vote for them than, say, Pidgin, which is what I voted for.  With all the trouble they have had with their name, I wonder if "Gaim" would have gotten more votes (see other keynote on branding =).  The phpBB team may need to see the branding keynote from this morning.  It talked about how it takes a generation to change perception about a brand.  Most people I talked to here have a negative reaction to the phpBB brand.

The rest of the night we just hung out at the party hosted by Jive Software.  We use Openfire from those guys.  I am not a big Java user on the server.  It's just one more different thing to admin in a company that is 99% GNU C apps on the servers.  But, Openfire does a damn good job with XMPP.

In closing, O'Reilly Open Source Convention was great.  I got some great ideas of stuff we should be doing.  I got confirmation of things that we are already doing.  And most important, IMO, we got to share with others how we solve problems.  As Gopal said in his caching talk, sometimes it is better to stop doing stuff and tell others what you are doing (paraphrase).

O'Reilly Open Source Conference Day One

Day one is complete.  Portland is great as always.  It's really day 1 1/2 since we got in at 1PM yesterday.  That allowed us to go to the MySQL/Zend party last night.  Great party by those guys.  Touched base with old friends and made some new ones.

I kind of session hopped today.  Of note, I attended Andi Gutmans' PHP Security talk, which really had little to do with PHP.  Like Larry Wall's onion metaphor, Andi presented an onion metaphor for security.  I stopped in for a while on the SOLR talk.  It looks neat.  I like that it is a REST interface to Lucene.  If we were not using Sphinx already I might take a longer look.  But we like Sphinx, and SOLR and Lucene are Java.  Not that there is anything wrong with that, we just don't use Java a lot, so it's just one more thing that would be out of the norm.  I admit I spent a good bit of time in what is being called the "hallway track" working on some code.  Work does not stop just because you are at a conference.

I got to hang out with Jay Pipes of the MySQL Community team a good bit.  We talked about the MySQL forums (which of course run Phorum) and how they want to improve them.  They would like to see tagging, user and post rating and some other things.  Some good things will come out of that.  Hopefully they have some of the tagging stuff done already at MySQL Forge and can contribute that code to Phorum, saving us time.

I hosted the Caching for fun and profit BoF.  It was not packed, but it was a good time.  The MySQL BoF was at the same time, so we lost some folks to that I am sure.  They had beer and pizza.  Brad Fitzpatrick did come by and contribute.  Thanks Brad.  It was mostly the same stuff you get on the memcached mailing list.  "How do we expire lots of cache at once?"  Questions about different clients.  Stuff like that.  It kind of turned into a memcached BoF, but I tried to share the dealnews experience with the attendees including our MySQL Cluster pushed caching.

I have met many readers of both dealnews and this blog (hi to you) while here.  Glad to know that both my professional work and my personal work are of use to folks.  The demographic at this conference is dead on for dealnews.  Maybe I can get them to sponsor it next year.  That would be cool.

I say every year that I want to present "next year".  Something always keeps me from doing it.  Usually it's just not having time to prep for it.  By the time I think about it, the call for papers has passed.  I really want to get it done this time.  We shall see I suppose.

We went to the Sun party tonight.  It was a good time.  There was beer that was free as in beer.  More hanging with friends and talking about all kinds of stuff.  Now, all you Slashdotters sit down.  I saw people from the PostgreSQL and MySQL teams drinking beer and having fun together.  OMGWTFBBQ!!!1!!  See, the people that really matter in those projects don't bicker and fight about which is better.  They just drink beer and have a good time together.

Anyhow, I will blog more after day 2.  There won't be a day 3 as I have to catch an 11:30 flight back home.  That is usually how it goes.  Not sure why they book anything on Friday really.  Even O'Reilly has its "after party" on Thursday night.  It's late, and I need sleep.

Five months with MySQL Cluster

So, the whole world changed at dealnews when Yahoo! linked us. We realized that our current infrastructure was not scaling very well. We had to make a change.

The Problem

Even though we were using all sorts of cool techniques, the server architecture was really still just a bunch of web servers all serving the same content. In addition to that, our existing systems at the time used a pull method. When a request came in, memcache was checked; if the data was not there, it was fetched from our main MySQL server. So, when there was no data in the cache or when it expired, this was very bad. Like when Yahoo! hit us. Some cache item would expire and 60,000 users would hit a page and each page would try and create the cache item.
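To make the failure mode concrete, here is a minimal sketch of the pull pattern in Python. It is not our code: `CountingDatabase`, `PullCache` and `simulate_burst` are made-up names, the cache is a plain dict standing in for memcached, and the "burst" is modeled by letting every request check the cache before any of them refills it.

```python
class CountingDatabase:
    """Stand-in for the main MySQL server; counts how often it is queried."""

    def __init__(self):
        self.queries = 0

    def fetch(self, key):
        self.queries += 1
        return f"content for {key}"


class PullCache:
    """Stand-in for memcached used in the pull style described above."""

    def __init__(self, database):
        self.items = {}
        self.database = database

    def lookup(self, key):
        # Returns None on a miss, like an expired or never-set cache item.
        return self.items.get(key)

    def rebuild(self, key):
        # On a miss, the request itself pulls from the database and stores it.
        value = self.database.fetch(key)
        self.items[key] = value
        return value


def simulate_burst(cache, key, concurrent_requests):
    """Model requests that all check the cache before any of them refills it."""
    # Step 1: every in-flight request looks in the cache at about the same time.
    misses = sum(1 for _ in range(concurrent_requests) if cache.lookup(key) is None)
    # Step 2: each request that missed goes to the database on its own.
    for _ in range(misses):
        cache.rebuild(key)
    return misses
```

With an empty cache, a burst of N simultaneous requests turns into N database queries for the same item. Once the item is cached, the same burst costs the database nothing, which is why the expiry moment is the dangerous one.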

The Solution

I was tasked with two things: find a way to handle something like the Yahoo! burst, and find a way to store the data we need to generate our web pages that is highly available and will scale. For bursting, I wrote a proxy using apache, mod_rewrite, php and memcached. I have reasons I did it this way that are not relevant to this post. Maybe more on that later.

For the data solution, I considered several things: MySQL replication, writing my own replicating memcached client, and other exotic ideas. One of the semi-exotic ideas for us was MySQL Cluster. We had not used it at all. Some things about it made us gun shy. But, we tested it and were very happy with the results.

Initial Test

With the help of Gentoo, getting a cluster up and running was really, really easy. In fact, it seemed too easy. We ran a cluster on some dev boxes at first. We did some generic testing using the PHPTestSuite from the guys at MySQL Performance Blog. What we found was that while the cluster appeared slower at low concurrent connections, it scaled much better than InnoDB (our preferred storage engine) when the concurrent connections grew.

Application Testing

So, we moved to the next step, testing our application. We discovered early on with cluster that we would have to redesign our application. Our DB was highly relational. Almost no data could be put on the site without data from other tables. We used a lot of joins. We learned (later) that joins in the cluster are not a good idea. Neither are sub-selects. So, we wrote some proof of concept scripts for our application. We were very happy. Very few issues were found. Nothing anywhere near show stopping.

Installation

We ordered our servers. Six new Dell dual-core, dual processor Opterons with a lot of memory. Two would become SQL nodes and the other four would be storage nodes. Our data set is not that large compared to a lot of companies. So, we configured the cluster with 4 replicas. Our main goal is high availability and scalability. I could find nothing in my tests or in the manual that indicated this would be bad for scalability and it should be great for HA.
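For anyone weighing the same choice, the arithmetic behind a replica count can be sketched like this. The function name and the returned keys are made up for illustration. The rule it encodes is that data nodes are split into node groups of `NoOfReplicas` nodes each, and every node in a group holds the same data.

```python
def cluster_layout(data_nodes, replicas, data_memory_gb):
    """Rough layout math for a cluster; data_memory_gb is the per-node DataMemory."""
    if data_nodes % replicas != 0:
        raise ValueError("data nodes must be a multiple of the replica count")
    # Each node group stores one slice of the data, copied to every node in it.
    node_groups = data_nodes // replicas
    return {
        "node_groups": node_groups,
        # Usable space only grows with the number of groups, not with nodes.
        "usable_data_memory_gb": node_groups * data_memory_gb,
        # A group stays available as long as one of its nodes is alive.
        "node_failures_survivable_per_group": replicas - 1,
    }
```

For four storage nodes and four replicas this gives one node group: every node holds the full data set, three of the four can die, and usable memory equals a single node's DataMemory. Two replicas on the same hardware would give two groups and twice the usable memory, at the cost of surviving only one failure per group.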

We rewrote our application (basically, our public web site) to use the new cluster and its new table design. We hit our first snag when we tried to seed the data in the cluster. We got errors from the cluster about its transaction logs not being big enough to handle the inserts. Through the manual, forum posts, the mailing list archives and some blogs I was able to find the correct settings for our needs. I remembered back when I first installed the cluster thinking it was too easy. I now realize that getting a cluster running is easy. Making it run well is a whole other story.

The second snag was with joins. Our test bed for the cluster was not a cluster. We used a group of servers using InnoDB to test against. That was a mistake. Joins did not work at all with the cluster. We had to back up, rewrite some code and redo some tables. In the end, the design is probably faster on InnoDB or cluster.

Everyday Use

We started using the cluster for everyday use about a month ago. I guess 5 months is not bad for starting from nothing to live in production. We have been slowly moving applications to it. We take care each time to monitor the cluster and see that it's not throwing new errors. So far, so good. We have about 80% of our page views (40% of our page views are our front page) and about 50% of our end user applications using the cluster now. We are doing caching at the proxy level for a lot of this. But, when tested, the new architecture is much more reliable even without the caching proxies. Some things like our forums will never translate to the cluster. But, they have their own dedicated systems already and are non-critical for our business. They could be shut down if there was a problem with them.

Administration

MySQL Cluster is a whole new animal. It's not like monitoring mysqld, apache or other stuff we already use. It took me a while to get the hang of rolling restarts, bringing nodes up and down after crashes, etc. We have had just one crashed node since we switched over to production use. The cluster stayed up and kept serving content. We have written a Nagios monitor to keep track of the nodes' status. It uses ndb_mgm and reports any problems to us.
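Our monitor is not shown here, but the idea can be sketched in Python. Everything below is an assumption for illustration: the function names are made up, and `SAMPLE_SHOW` is my approximation of what `ndb_mgm -e show` prints, which may differ between versions. The only fixed parts are the Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import re
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

# Assumed shape of `ndb_mgm -e show` output; check it against your own version.
SAMPLE_SHOW = """\
[ndbd(NDB)]     2 node(s)
id=2    @10.0.0.2  (Version: 5.0.38, Nodegroup: 0, Master)
id=3 (not connected, accepting connect from 10.0.0.3)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @10.0.0.1  (Version: 5.0.38)
"""

NODE_LINE = re.compile(r"^id=(\d+)\s+(.*)$")


def parse_show(output):
    """Map node id to True (connected) or False (not connected)."""
    nodes = {}
    for line in output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            nodes[int(match.group(1))] = "not connected" not in match.group(2)
    return nodes


def check(output):
    """Return (exit_code, message) in the form Nagios expects from a plugin."""
    nodes = parse_show(output)
    if not nodes:
        return UNKNOWN, "UNKNOWN: no nodes found in ndb_mgm output"
    down = sorted(node_id for node_id, up in nodes.items() if not up)
    if down:
        return CRITICAL, "CRITICAL: nodes not connected: " + ", ".join(map(str, down))
    return OK, f"OK: all {len(nodes)} nodes connected"


def run_show():
    """Fetch live output; needs ndb_mgm on the PATH of the monitoring host."""
    result = subprocess.run(["ndb_mgm", "-e", "show"], capture_output=True, text=True)
    return result.stdout
```

A real plugin would print the message from `check(run_show())` and exit with the code. Treating any disconnected node as CRITICAL is a choice; with four replicas you might prefer WARNING for a single node down.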

Feedback

Now, as the title says, I have only been using MySQL Cluster for 5 months. If you are reading this and have more experience and are thinking "What a moron!", please tell me. We are still learning.

Update:

Ronald Bradford had some questions on his blog for me.  I figured I would just answer them here.

You didn’t mention any specific sizes for data, I’d be interested to know, particularly growth and how you will manage that?

We currently have a DataMemory of 4GB and IndexMemory of 2GB.  Based on the crude methods we have to monitor it, I think we are at about 40% capacity.  We are using MySQL Cluster purely as a data store for content on our web site.  So, we can trim the data store down significantly.  If it does not appear on the site, it's not in the cluster.

You also didn’t mention anything about Disk? MySQL Cluster may be an in-memory database but it does a lot of disk work, and having appropriate disk is important. People overlook that.

Yes, we have U320 15k SCSI drives.  We do use RAID 1 on our servers contrary to some opinions.  We see a lot of drive failures.  About one every 4 months.  Sucks to lose a whole machine just because a $200 drive failed.

You didn’t mention anything about timings? Like how does backups for example compare now to previously.

Well, we don't currently back up the cluster data as it is being copied from our main database already.  Maybe that is a mistake, I don't know.  But, I can't come up with a reason to back up data that is just a copy of another database server.  Also, I have written a PHP class that does parallel writing to multiple servers using transactions.  Everything we write to the cluster also gets written to an "oh shit" mysql server that uses InnoDB.  So, in the event we have a total cluster failure, F5 BIG-IP load balancers will send mysql traffic to the InnoDB server.
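The PHP class itself is not shown here. As a rough illustration of the idea only, this Python sketch uses in-memory `sqlite3` databases as stand-ins for the cluster and the InnoDB fallback. The `deals` table and all function names are made up.

```python
import sqlite3


def make_store():
    """Create an in-memory database standing in for one MySQL server."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE deals (id INTEGER PRIMARY KEY, title TEXT)")
    conn.commit()
    return conn


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM deals").fetchone()[0]


def write_everywhere(connections, sql, params=()):
    """Run one write on every server; commit only if all of them accepted it."""
    try:
        for conn in connections:
            # sqlite3 opens a transaction implicitly before an INSERT.
            conn.execute(sql, params)
    except sqlite3.Error:
        # One server refused the write, so undo it on all of them.
        for conn in connections:
            conn.rollback()
        raise
    for conn in connections:
        conn.commit()
```

Usage would look like `write_everywhere([cluster, fallback], "INSERT INTO deals (id, title) VALUES (?, ?)", (1, "example"))`. Note that this is not a true two-phase commit: if a server dies between the first and second `commit()`, the copies can still diverge, so it narrows the window rather than closing it.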

You didn’t mention version? 5.1 (not GA) is significant improvement in memory utilization due to true varchar support, saving a lot of memory, but as I said not yet production software.

Yeah, I am drooling over 5.1.  But, we are using current Gentoo stable, 5.0.38 I believe.  5.1 looks superior in many many ways.  I can't wait to upgrade.

Getting all SOAPY

So, we (dealnews.com) rolled out a new site this month, metaprice.com.  It's young and lacking features of many of the other price comparison sites, but it has great potential.  Our hope is to bring together the best features of all the other players in the market in one great application.

Part of this project required using web services with several different data suppliers.  Most support simple REST and SOAP, but some only offer SOAP.  So, given that, I bit the bullet and enabled the SOAP extension for PHP5.  Wow!  I was happily surprised.  The last SOAP code I had looked at was the old PEAR code.  It was not that attractive to me.  It required a lot of work IMO to talk SOAP.

Now, with just 3 lines of code, I can get back a nice object that has all the data I need.  Kudos to Brad Lafountain, Shane Caraveo, Dmitry Stogov and anyone else that worked on this extension.  It definitely made my life easier.  It's so easy, I am actually looking forward to making a SOAP server with some of our data.

On another note, I have been a little disappointed with the MySQL FullText relevance matching.  I know that single term searches are not really easy to deal with.  But, sometimes, even multiword searches don't yield what I would hope.  For example, a search for Windows XP yields 2 systems that include Windows XP as 2 of the top 4 matches.  There are two other matches there that are good matches.  And, yes, I do have my min length set to 2.  I am thinking about giving Sphinx a shot to see if its relevance ranking is any better.

Anyone have a good home grown algorithm for relevance?

Feeling Lucky with Yahoo!

So, one of the things I always use in Firefox is the auto Google Feeling Lucky feature. If you don't know what I mean, then load Firefox and type php in the address bar. You will most likely go to http://www.php.net or a mirror of it. What is happening is Firefox is sending the words you type to the Google Feeling Lucky URL. That feature at Google then redirects you to the top ranked site for those terms. It's really handy for the PHP Manual for example. Just type php strtotime. You should be taken to the PHP Manual page for strtotime.

The only problem I have is that I like to use Yahoo! for search. There are a couple of reasons. One of the biggest is because they support the PHP community so much. Yahoo has Instant Search. They consider it to be an answer for Feeling Lucky, but it does me no good in this case.

I thought I was stuck until I discovered that Yahoo! offers RSS versions of searches. So, I wrote yahoo_luck.php. It's a simple little script that uses SimpleXML to grab a Yahoo! RSS search result and forward you on. I did put some backup in there for times when, for some reason, Yahoo! would not answer. Perhaps some of you Yahoo! guys in the PHP world can poke someone about that.

The one downside is that you will need your own web server. I tried to come up with some sort of javascript to do this. But, alas, I am not the JS wizard I wish I was. I am not even sure if the keyword.URL setting in Firefox could use JS.

To use this, just stick it on a server and replace keyword.URL in your Firefox about:config with http://www.example.com/yahoo_luck.php?s=. Enjoy.
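For the curious, the core logic is small. The sketch below is in Python rather than PHP and is not the script itself: the function names, the sample feed and the fallback URL are made up. It also assumes the "backup" simply sends you to an ordinary results page when the feed cannot be used.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

# Hypothetical RSS 2.0 search feed, trimmed to the parts the logic needs.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Search results</title>
    <item><title>strtotime</title><link>https://www.php.net/strtotime</link></item>
    <item><title>Another hit</title><link>https://www.example.com/other</link></item>
  </channel>
</rss>"""


def first_result_link(rss_text):
    """Return the link of the first item in an RSS 2.0 feed, or None."""
    try:
        root = ET.fromstring(rss_text)
    except ET.ParseError:
        return None
    link = root.findtext("./channel/item/link")
    return link.strip() if link and link.strip() else None


def lucky_redirect(query, rss_text, fallback_base):
    """Pick the URL to redirect to: top result, else a plain search page."""
    target = first_result_link(rss_text) if rss_text else None
    if target is None:
        # The feed was missing, broken or empty, so fall back to normal search.
        target = fallback_base + quote_plus(query)
    return target
```

In a web script, `rss_text` would come from fetching the search feed for the `s` parameter, and the return value would go into a `Location:` header.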

Phorum + Sphinx = really fast

Thomas wrote the Sphinx search module nearly a month ago. I have just now gotten to looking at it. From the things I have read, Sphinx looks really cool. I am considering using it in some other sites.

Sphinx is a "standalone search engine, meant to provide fast, size-efficient and relevant fulltext search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Currently built-in data sources support fetching data either via direct connection to MySQL, or from an XML pipe."

(Originally posted on the Phorum web site)