Long-time MySQL users will recognize that there are two varieties of utf8 support in MySQL: utf8mb3 and utf8mb4. Let me dig a little deeper into the history of the two:
- MySQL 4.1 (2004) was the first version to support character sets and collations. The default character set was latin1, but utf8[mb3] was available as an option. An optimization was chosen to limit utf8 to 3 bytes, enough to handle almost all modern languages.
- MySQL 5.5 (2010) added support for up to 4 byte utf8 using the new utf8mb4 character set.
- MySQL 5.7 (2015) added some optimizations such as a variable length sort buffer, and also changed InnoDB’s default row format to DYNAMIC. This allows for indexes on VARCHAR(255) columns with utf8mb4, which previously exceeded the index key size limit of the older row formats and made migrations more difficult.
- MySQL 8.0 (in development) vastly improves the performance of utf8mb4 and adds several new collations. utf8mb4 is now the default character set for MySQL.
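To make the VARCHAR(255) point concrete, here is a minimal sketch of the arithmetic. It assumes the commonly documented InnoDB index key prefix limits of 767 bytes for the older COMPACT row format and 3072 bytes for DYNAMIC; the helper names are hypothetical and exist only for illustration.

```python
# Assumed InnoDB index key prefix limits in bytes, by row format.
KEY_LIMIT_BYTES = {"COMPACT": 767, "DYNAMIC": 3072}

# Worst-case bytes per character for each character set.
MAX_BYTES_PER_CHAR = {"utf8mb3": 3, "utf8mb4": 4}


def index_key_bytes(charset, length):
    """Worst-case bytes needed to index a VARCHAR(length) column."""
    return MAX_BYTES_PER_CHAR[charset] * length


def fits_in_index(charset, length, row_format):
    """True if a full-column index fits under the assumed key limit."""
    return index_key_bytes(charset, length) <= KEY_LIMIT_BYTES[row_format]


if __name__ == "__main__":
    for charset in ("utf8mb3", "utf8mb4"):
        for row_format in ("COMPACT", "DYNAMIC"):
            size = index_key_bytes(charset, 255)
            ok = fits_in_index(charset, 255, row_format)
            print(f"VARCHAR(255) {charset} {row_format}: {size} bytes, fits={ok}")
```

Under these assumptions, VARCHAR(255) fits in a COMPACT index with utf8mb3 (765 bytes) but not with utf8mb4 (1020 bytes), which is why shortening columns to VARCHAR(191) was a common workaround before DYNAMIC became the default.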
Is utf8mb3 still faster?
With the original purpose of utf8mb3 being a performance optimization, the next question is: does this still hold true today? The short answer is no; the new utf8mb4-based collations are much faster than any of the old utf8mb3-based ones:
utf8mb4 shown in red. Results in transactions per second; higher is better.
We expect cases where utf8mb3 is faster to be quite rare, and any such case will be considered a bug 🙂
Making the case for utf8mb4
If the performance gains in MySQL 8.0 aren’t enough to entice you, perhaps these additional points will:
- Even for English-speaking markets, the prevalence of emojis as character input is driving adoption of utf8mb4 over utf8mb3 and latin1.
- We have improved our collations to account for a number of language-specific sorting rules. The collations for utf8mb3 are correct for the common cases, but the devil is in the details.
- The new collations also support accent and case sensitivity.
- Even in Asia, we are seeing adoption of utf8mb4 over CJK character sets, largely because it supports a superset of possible characters. We also now support a collation specific to Japanese.
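The emoji point comes down to encoded length: characters outside the Basic Multilingual Plane (code points above U+FFFF) need 4 bytes in UTF-8, so they cannot be stored in a 3-byte utf8mb3 column. The sketch below shows one way an application might check its input; the function name is hypothetical.

```python
def needs_utf8mb4(text):
    """True if any character lies above U+FFFF, i.e. needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    samples = {
        "ASCII": "hello",
        "accented Latin": "h\u00e9llo",       # U+00E9, 2 bytes in UTF-8
        "CJK": "\u65e5\u672c\u8a9e",          # 3 bytes per character
        "emoji": "hi \U0001F600",             # U+1F600, 4 bytes in UTF-8
    }
    for label, text in samples.items():
        widest = max(len(ch.encode("utf-8")) for ch in text)
        print(f"{label}: widest char = {widest} bytes, "
              f"needs utf8mb4 = {needs_utf8mb4(text)}")
```

Only the emoji sample is reported as requiring utf8mb4; the others fit within 3 bytes per character.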
Future Steps
As we no longer see a strong use case for utf8mb3, we intend to mark it as deprecated in MySQL 8.0. Because upgrading from earlier character sets requires tables to be rebuilt, we expect that it may be some time before we are able to move from deprecation to removal. However, in making this first step we are communicating that it is a legacy feature that should no longer be used in new applications.
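For existing tables, the rebuild mentioned above is typically done with MySQL's ALTER TABLE ... CONVERT TO CHARACTER SET statement. As a rough sketch only, a script could generate one statement per table to review before running; the helper and the table names here are hypothetical, and the collation is left to the server default.

```python
def convert_statement(table):
    """Build an ALTER TABLE statement converting one table to utf8mb4.

    Backticks inside the table name are doubled so the identifier
    stays correctly quoted.
    """
    quoted = "`" + table.replace("`", "``") + "`"
    return f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET utf8mb4"


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for name in ("customers", "orders"):
        print(convert_statement(name) + ";")
```

Each statement rebuilds its table, so on large tables it is worth testing on a copy and scheduling the change accordingly.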