Most epic ticket of the day

Mon, Feb 2, 2015 10:02 PM
UPDATE: I should clarify. This is an internal ticket at DealNews. It is about what the defaults on our servers should be, not what the defaults should be in MySQL. The frustration that MySQL's utf8 charset only supports up to 3 bytes per character is quite real.

 This epic ticket of the day is brought to you by Joe Hopkinson.

#7940: Default charset should be utf8mb4
------------------------------------------------------------------------
 The RFC for UTF-8 states, AND I QUOTE:

 > In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
 accessible range) are encoded using sequences of 1 to 4 octets.

 What's that? You don't believe me?! Well, you can read it for yourself
 here!

 What is an octet, you ask? It's a unit of digital information in computing
 and telecommunications that consists of eight bits. (Hence, __oct__et.)

 "So what?", said the neck bearded MySQL developer dressed as Neo from the
 Matrix, as he smugly quaffed a Surge and settled down to play Virtua
 Fighter 4 on his dusty PS2.

 So, if you recall from your Pre-Intro to Programming, 8 bits = 1 byte.
 Thus, the RFC states that the maximum storage requirement for a
 multibyte character is 4 bytes.
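
 To make the 1-to-4-octet range concrete, here is a small Python sketch that reports how many bytes a character needs in UTF-8. The sample characters are illustrative picks, not taken from the ticket. Anything above U+FFFF, which covers most emoji, needs the fourth byte that a 3-byte charset has no room for.

```python
# Illustrative sample characters (assumptions, not from the ticket).
SAMPLES = ["A", "é", "€", "😀"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in SAMPLES:
    # Print the code point, the character, and its encoded size.
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} byte(s)")
```

 The last sample is the interesting one: it encodes to 4 bytes, so it only fits in a column that uses utf8mb4.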

 I know that RFCs are more of a GUIDELINE, right? It's not like they could be
 considered a standard or anything! It's not like there should be an
 implicit contract when an implementor decides to use a label like "UTF-8",
 right?

 Because of you, we have to strip our readers' carefully crafted emoji.
 Because of you, our search term data will never be exact. Because of you,
 we have to spend COUNTLESS HOURS altering every table that we have (which
 is a lot, by the way) to make sure that we can support a standard that was
 written in 2003!
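
 For a rough idea of what "stripping" means here, the sketch below drops every code point above U+FFFF, since those are the ones that take 4 bytes in UTF-8. This is a guess at the general approach, not the code DealNews actually runs; both function names are hypothetical.

```python
def fits_three_byte_utf8(text: str) -> bool:
    """True if every character encodes to at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def strip_four_byte_chars(text: str) -> str:
    """Drop every code point above U+FFFF (the 4-byte UTF-8 range)."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


comment = "great deal 😀!"
if not fits_three_byte_utf8(comment):
    # The emoji is silently lost, which is the data loss the ticket laments.
    comment = strip_four_byte_chars(comment)
print(comment)
```

 The stored text no longer matches what the reader typed, which is also why search term data can never be exact.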

 A cursory search shows that shortly after 2003, MySQL release quality
 started to tank. I can only assume that was because of you.

 Jerk.

 * The default charset should be utf8mb4.
 * Alter and test critical business processes.
 * Change OrderedFunctionSet to generate the appropriate tables.
 * Generate ptosc or propagator scripts to update everything else, as needed.
 * Curse the MySQL developer who caused this.
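
 As a sketch of the "update everything else" step, the Python below generates one conversion statement per table. The table names and the choice of `utf8mb4_unicode_ci` as the collation are assumptions for illustration; the ticket does not say which collation is wanted, and the real scripts would go through ptosc or the propagator rather than plain `ALTER TABLE`.

```python
def convert_statement(table: str, collation: str = "utf8mb4_unicode_ci") -> str:
    """Build an ALTER TABLE statement that converts one table to utf8mb4."""
    # Backtick-quote the identifier, doubling any embedded backticks.
    quoted = "`" + table.replace("`", "``") + "`"
    return (
        f"ALTER TABLE {quoted} "
        f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
    )


# Hypothetical table names, not DealNews's real schema.
TABLES = ["deals", "search_terms"]

for name in TABLES:
    print(convert_statement(name))
```

 On large tables a direct `ALTER TABLE ... CONVERT TO CHARACTER SET` typically rebuilds the whole table, which is why an online schema change tool is the more realistic route.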

4 comments

Justin Swanhart Says:

For technical performance reasons the default should be latin1. Utf8mb4 uses four bytes per character in sorting and grouping, regardless of the character. If you want a default of utf8mb4, then create your database with that character set.

Brian Moon Says:

Yes, Justin. And we will be converting all of our tables. The point is that MySQL (AB? Sun? Oracle?) knowingly made a character set available named "utf8" that was not actually capable of storing all UTF-8 data. It can only store a subset of UTF-8. And we were using utf8 in MySQL before 5.5 was released, so it was the only option.

Justin Swanhart Says:

I didn't understand from your post that you started with utf8 and need to migrate to utf8mb4. Further, when you said make utf8mb4 the default, I thought you meant the database default character set (as opposed to the current latin1) versus substituting utf8mb4 for utf8 automatically when utf8 is selected, which I think is what you are asking for?

There are online schema change tools like pt-online-schema-change that might help with your migration.

Brian Moon Says:

Yeah, Justin. This was really just a funny ticket meant to be some comic relief. We use pt-online-schema-change. It's going to take weeks to convert all the tables, so the author of the ticket was frustrated.
