This epic ticket of the day is brought to you by Joe Hopkinson.
#7940: Default charset should be utf8mb4
------------------------------------------------------------------------
The RFC for UTF-8 states, AND I QUOTE:
> In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets.
What's that? You don't believe me?! Well, you can read it for yourself
here!
What is an octet, you ask? It's a unit of digital information in computing
and telecommunications that consists of eight bits. (Hence, __oct__et.)
"So what?", said the neckbearded MySQL developer dressed as Neo from the
Matrix, as he smugly quaffed a Surge and settled down to play Virtua
Fighter 4 on his dusty PS2.
So, if you recall from your Pre-Intro to Programming, 8 bits = 1 byte.
Thus, the RFC states that the maximum storage requirement for a
multibyte character is 4 bytes.
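For the skeptics, here is a minimal Python sketch of that arithmetic. The sample characters are picked for illustration and do not come from the RFC; the helper name `utf8_octets` is made up as well.

```python
# Count the octets (bytes) a single character needs when encoded as UTF-8.
def utf8_octets(ch: str) -> int:
    return len(ch.encode("utf-8"))

# Illustrative samples, one per encoded length: A, e-acute, euro sign, grinning face.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_octets(ch)} octet(s)")
```

The last sample sits above U+FFFF, which is why it needs the fourth octet.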
I know that RFCs are more of a GUIDELINE, right? It's not like they could be
considered a standard or anything! It's not like there should be an
implicit contract when an implementor decides to use a label like "UTF-8",
right?
Because of you, we have to strip our readers' carefully crafted emoji.
Because of you, our search term data will never be exact. Because of you,
we have to spend COUNTLESS HOURS altering every table that we have (which
is a lot, by the way) to make sure that we can support a standard that was
written in 2003!
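A rough sketch of the kind of stripping involved, assuming the column uses MySQL's 3-byte utf8 charset, which only holds code points up to U+FFFF. The function name `strip_supplementary` is made up for illustration and is not our actual code.

```python
def strip_supplementary(text: str) -> str:
    """Drop every character that needs 4 octets in UTF-8 (code points above U+FFFF)."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)

# The grinning-face emoji (U+1F600) is removed; everything else survives.
print(strip_supplementary("great deal \U0001F600!"))  # prints "great deal !"
```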
A cursory search shows that shortly after 2003, MySQL release quality
started to tank. I can only assume that was because of you.
Jerk.
* The default charset should be utf8mb4.
* Alter and test critical business processes.
* Change OrderedFunctionSet to generate the appropriate tables.
* Generate ptosc or propagator scripts to update everything else, as needed.
* Curse the MySQL developer who caused this.
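The conversion step in the list above can be sketched as a small generator of `ALTER TABLE` statements. This assumes plain `ALTER TABLE ... CONVERT TO CHARACTER SET`, a hypothetical table list, and the `utf8mb4_unicode_ci` collation picked purely for illustration; the real ptosc or propagator scripts would look different.

```python
def conversion_statements(tables, charset="utf8mb4", collation="utf8mb4_unicode_ci"):
    """Return one ALTER TABLE statement per table name."""
    return [
        f"ALTER TABLE `{name}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for name in tables
    ]

# Hypothetical table names, for illustration only.
for stmt in conversion_statements(["deals", "search_terms"]):
    print(stmt)
```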
4 comments
Justin Swanhart Says:
For technical performance reasons the default should be latin1. Utf8mb4 uses four bytes per character in sorting and grouping, regardless of the character. If you want a default of utf8mb4, then create your database with that character set.
Brian Moon Says:
Yes, Justin. And we will be converting all of our tables. The point is that MySQL (AB? Sun? Oracle?) knowingly made a character set available named "utf-8" that was not actually capable of storing utf-8 data. It is only capable of storing a partial subset of utf-8. And we were using utf-8 in MySQL before 5.5 was released so it was the only option.
Justin Swanhart Says:
I didn't understand from your post that you started with utf8 and need to migrate to utf8mb4. Further, when you said make utf8mb4 the default, I thought you meant the database default character set (as opposed to the current latin1) versus substituting utf8mb4 for utf8 automatically when utf8 is selected, which I think is what you are asking for?
There are online schema change tools like pt-online-schema-change that might help with your migration.
Brian Moon Says:
Yeah Justin. This was really just a funny ticket meant to be some comic relief. We use pt-online-schema-change. It's going to take weeks to convert all the tables. So the author of the ticket was frustrated.