Replication is much better than cold backups

Sun, Aug 17, 2008 02:58 AM
So, I wrote about the beginning of our wild database issues. Since then, I have been fighting a cold, coaching little league football, and trying to help get our backup solutions working in top shape. That does not leave much time for blogging.

Never again will we have ONLY a cold backup of anything.  We had been moving nightly full database dumps and hourly backups of critical tables over to that box all along.  Well, when the filesystem fails on both the primary database server and your cold backup server, you question everything.  A day after my marathon drive to fix the backup server and get it up and running, the backup MySQL server died again with RAID errors.  I guess that was the problem all along.  In the end, we had to put a whole new RAID subsystem in our backup database server.  So, my coworker headed over to the data center to pull an all-nighter getting the original, main database server up and running.  Its filesystem was completely shot.  ReiserFS failed us miserably.  It will no longer be used at dealnews.

Well, today at 6:12 PM, the main database server stops responding again.  ARGH!!  Input/output errors.  Based on last week's experience, that means RAID.  We reboot it.  It reports memory or battery errors on the RAID card.  So, I call Dell.  Our warranty on these servers includes four-hour onsite service.  They are that important.  While on the phone with Dell, I run the Dell diagnostic tool on the box.  During the diagnostic test, the box shuts down.  Luckily, the Dell service tech had heard enough.  He orders a whole new RAID subsystem for this one as well.

There is one cool thing about the PERC4 (a.k.a. LSI MegaRAID) RAID cards in these boxes.  They write the RAID configuration to the drives as well as to the card.  So, when a new, blank RAID card is installed, it finds the RAID config on the drives and boots the box up.  Neato.  I am sure all the latest cards do this.  It was just nice to see it work.

So, the box came up, but this time we had InnoDB corruption.  XFS did a fine job of keeping the filesystem intact.  So, we had to restore from backups.  But this time, we had a live replicated database that we could just dump and restore.  We should have had it all along, but in the past (i.e. before widespread InnoDB use) we were gun-shy about replication.  We had large MyISAM tables that would constantly get corrupted on the master or slave and would halt replication on a weekly basis.  It was just not worth the hassle.  But, we have used it for over a year now on our front-end database servers with an all-InnoDB data set.  As of now, only two tables in our main database are not InnoDB.  And I am trying to drop the need for a Full-Text index on those right now.
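For anyone wanting to do the same, the dump-from-a-slave step looks roughly like this. This is a sketch, not our exact commands: the host, user, and database names are placeholders, and it assumes MySQL's standard mysqldump options.

```shell
# Dump a consistent snapshot from the replicated slave instead of the
# (dead) master.  Names below are made-up placeholders.
#
# --single-transaction: consistent snapshot without locking, but only
#   for InnoDB tables -- one more reason to get everything onto InnoDB.
# --master-data=2: writes the binlog coordinates into the dump as a
#   comment, handy for re-seeding replication later.
mysqldump --single-transaction --master-data=2 \
    -h slave-db.example.com -u backup_user -p \
    main_db | gzip > /backups/main_db-$(date +%Y%m%d).sql.gz
```

Note that --single-transaction's consistency guarantee does not cover MyISAM tables, so a mixed data set still needs locking or downtime to dump safely.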

So, here is to hoping our database problems are behind us.  We have replaced almost everything in one server except the chassis.  The other has had every internal part replaced but the motherboard.  Kudos to Dell's service: the tech was done with the repair in under four hours.  Glad to have that service.  I recommend it to anyone that needs it.

erin oneill Says:

I'm surprised about the MyISAM replication problems. At my last job we had nothing but problems with InnoDB and replication (we kept hitting a known bug). The MyISAM master with many slaves never crashed, nor did the slaves, and we never had corrupt data. Now, I did run a nightly script to ANALYZE the MyISAM tables.

The dataset was chunked (vertically partitioned) enough that our main db was still small enough to have an import take roughly 40 minutes. It probably would have taken less time if they'd given me 16GB or 32GB of RAM (we had 8GB).

I love replication...


Peter Zaitsev Says:

When designing a deployment, I prefer to look at it as a "redundant array of inexpensive servers": any server can die with all the data it holds, and that should not be catastrophic. Replication is a real help with this, of course.

I also prefer to keep a separate standby server so you can recover data to it without depending on Dell or whomever to fix the server in a short time frame.

When a server dies, it comes out of production and can be fixed and pass proper QA before it is put back.


Brian Moon Says:

We are still doing everything we were before. Nightly backups moved to other servers and offsite. Hourly backups are still made and copied to other servers. I will look into a delayed slave as well. Sounds like an interesting idea.


Xaprb Says:

When you said "Never again will we have a cold backup of anything" I took it to mean you were getting rid of actual backups ;-) I guess you meant "never again will we have ONLY...."


Xaprb Says:

A replication slave will not save you from a malicious or accidental DROP DATABASE or DELETE WHERE 1=1. Replication != backup.

You need both: slaves for hot failovers, and backups for actual recovery.

A delayed slave can get you some good benefits too. That way you have N minutes to notice the DROP DATABASE and stop the delayed slave before it's too late. mk-slave-delay ;-)
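For reference, a mk-slave-delay invocation looks something like the sketch below. The host, user, and password values are made up, and the DSN syntax is as I recall it from Maatkit; check the mk-slave-delay documentation for your version before relying on it.

```shell
# Keep a slave roughly one hour behind its master by starting and
# stopping the slave's SQL thread (mk-slave-delay is part of Maatkit).
# DSN fields (h=host, u=user, p=password) are placeholder values.
mk-slave-delay --delay 1h --interval 60 h=delayed-slave,u=maatkit,p=secret
```

The delayed slave still receives relay logs in near-real time; only the application of them is held back, which is what gives you the window to catch a bad statement.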
