[Repost] Migrating to utf8mb4: Things to Consider


The utf8mb4 character set is the new default as of MySQL 8.0, and this change neither affects existing data nor forces any upgrades.

Migration to utf8mb4 has many advantages including:

  • It can store more symbols, including emojis
  • It has new collations for Asian languages
  • It is faster than utf8mb3

Still, you may wonder how migration affects your existing data. This blog post covers several aspects of that question.

Storage Requirements

As the name suggests, a character in the utf8mb4 character set can take at most four bytes. This is more than in utf8mb3, where a character takes at most three bytes, and more than in many other MySQL character sets.

Fortunately, utf8mb3 is a subset of utf8mb4, and migration of existing data does not increase the size of the data stored on disk: each character takes as many bytes as needed. For example, any digit or letter in the Latin alphabet will require one byte. Characters from other alphabets can take up to four bytes. This can be verified with a simple test.
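The same behavior can be seen outside the database. Below is a minimal Python sketch, assuming that utf8mb4 stores text as standard UTF-8, so Python's own UTF-8 encoder gives the same byte counts. The sample characters and the helper name `utf8_len` are illustrative choices, not part of the original post.

```python
# Byte length of sample characters in UTF-8, the encoding behind utf8mb4.
# The samples are illustrative; any other characters can be substituted.
samples = {
    "a": "Latin letter",
    "я": "Cyrillic letter",
    "あ": "Hiragana letter",
    "😀": "emoji",
}


def utf8_len(ch: str) -> int:
    """Return the number of bytes the string takes when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    n = utf8_len(ch)
    # utf8mb3 can only hold characters that need three bytes or fewer.
    fits = "yes" if n <= 3 else "no"
    print(f"{description}: {n} byte(s), fits in utf8mb3: {fits}")
```

Only the emoji needs four bytes, so it is the only sample that utf8mb3 cannot store; the other three take the same number of bytes in both character sets.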

As a result, all your data that uses a maximum of three bytes per character does not change, and you gain the ability to store characters that require four-byte encoding.

Maximum Length of the Column

While the stored data does not change, MySQL sizes a column by the maximum amount of data it could hold, so the conversion may fail for some column size definitions that work fine for utf8mb3. For example, you can have a table with this definition:

If you decide to convert this table to use the utf8mb4 character set, the operation will fail:

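The reason for this is that the maximum number of bytes that MySQL can store in a VARCHAR column is 65,535, which is 21845 characters for the utf8mb3 character set and 16383 characters for the utf8mb4 character set. These figures are upper bounds: the effective maximum can be slightly lower, because the 65,535-byte row size limit also has to cover the column's length bytes.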

Therefore, if you have columns that could contain more than 16383 characters, you will need to convert them to the TEXT or LONGTEXT data type.
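The arithmetic behind these two limits can be sketched in a few lines of Python. This is a rough illustration under the figures quoted above; the names `max_varchar_chars` and `survives_conversion` are hypothetical helpers, and the sketch ignores the few bytes of per-row overhead.

```python
# Maximum bytes MySQL allows for a VARCHAR column, as stated above.
MAX_VARCHAR_BYTES = 65_535

# Worst-case bytes per character for each character set.
MAX_BYTES_PER_CHAR = {"utf8mb3": 3, "utf8mb4": 4}


def max_varchar_chars(charset: str) -> int:
    """Largest character count whose worst case still fits the byte limit."""
    return MAX_VARCHAR_BYTES // MAX_BYTES_PER_CHAR[charset]


def survives_conversion(declared_chars: int) -> bool:
    """True if VARCHAR(declared_chars) can stay a VARCHAR under utf8mb4."""
    return declared_chars <= max_varchar_chars("utf8mb4")


print(max_varchar_chars("utf8mb3"))  # 21845
print(max_varchar_chars("utf8mb4"))  # 16383
print(survives_conversion(16383))    # True
print(survives_conversion(20000))    # False
```

A column declared as VARCHAR(20000) is valid under utf8mb3 but fails the check for utf8mb4, so it is a candidate for conversion to TEXT or LONGTEXT.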

You can find all such columns if you run the query:

For example, in my test environment, it returns:

 

Index Storage Requirement

When you create an index, MySQL does not know in advance which characters you will store in the column. Therefore, when it calculates the storage required for the index, it assumes the maximum number of bytes per character for the chosen character set. As a result, you may hit the index storage limit when converting from another character set to utf8mb4. For InnoDB, the maximum size of an index key is 767 bytes for the REDUNDANT and COMPACT row formats, and 3072 bytes for the DYNAMIC and COMPRESSED row formats. See the MySQL Reference Manual for details.
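To make the worst-case calculation concrete, here is a minimal Python sketch for a single indexed column or prefix. It uses only the limits quoted above; the helper names (`index_key_bytes`, `fits_in_index`, `longest_safe_prefix`) are hypothetical and chosen for illustration.

```python
# InnoDB index key limits in bytes per row format, as stated above.
INDEX_LIMIT_BYTES = {
    "REDUNDANT": 767,
    "COMPACT": 767,
    "DYNAMIC": 3072,
    "COMPRESSED": 3072,
}

# Worst-case bytes per character for a few character sets.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}


def index_key_bytes(indexed_chars: int, charset: str) -> int:
    """Worst-case size in bytes of one indexed column or prefix."""
    return indexed_chars * MAX_BYTES_PER_CHAR[charset]


def fits_in_index(indexed_chars: int, charset: str, row_format: str) -> bool:
    """True if the worst-case key size stays within the row format's limit."""
    return index_key_bytes(indexed_chars, charset) <= INDEX_LIMIT_BYTES[row_format]


def longest_safe_prefix(charset: str, row_format: str) -> int:
    """Largest number of indexed characters that always fits the limit."""
    return INDEX_LIMIT_BYTES[row_format] // MAX_BYTES_PER_CHAR[charset]


# An index on a VARCHAR(255) column:
print(fits_in_index(255, "utf8mb3", "COMPACT"))   # True  (765 bytes)
print(fits_in_index(255, "utf8mb4", "COMPACT"))   # False (1020 bytes)
print(fits_in_index(255, "utf8mb4", "DYNAMIC"))   # True
print(longest_safe_prefix("utf8mb4", "COMPACT"))  # 191
```

When a column fails the check, typical options are a shorter column, an index on a prefix of the column, or a row format with the larger limit.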

That means you need to check if you have indexes that could grow to exceed these values before performing the update. You can do this with the following query:

Here is the result of running the query in my test environment:

Once you have identified such indexes, check the columns and adjust the table definition accordingly.

Note: The query uses a common table expression (CTE), available as of MySQL 8.0. If you are still on version 5.7 or earlier, you will need to rewrite the query.

Temporary Tables

One more issue you can hit after converting to the utf8mb4 character set is an increase in the size of the implicit temporary tables that MySQL creates to resolve queries. Since a utf8mb4 character can take more bytes than a character in other character sets, the columns of such implicit tables will also be bigger. To find out whether you are affected, watch the global status variable Created_tmp_disk_tables. If it starts increasing significantly after the migration, consider adding RAM to the machine and increasing the maximum size of in-memory temporary tables. Note that this issue could also be a symptom of poorly optimized queries.
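One way to judge whether the increase is significant is to compare two snapshots of the status counters, for example one taken before and one after the migration. The Python sketch below assumes the snapshots were already collected (for instance from SHOW GLOBAL STATUS) and loaded into dictionaries. It also uses Created_tmp_tables, the companion counter for all internal temporary tables. The function name `disk_tmp_share` is hypothetical and the counter values are made up for illustration.

```python
def disk_tmp_share(before: dict, after: dict) -> float:
    """Share of internal temporary tables that went to disk between two snapshots."""
    disk = after["Created_tmp_disk_tables"] - before["Created_tmp_disk_tables"]
    total = after["Created_tmp_tables"] - before["Created_tmp_tables"]
    # Avoid dividing by zero when no temporary tables were created at all.
    return disk / total if total > 0 else 0.0


# Made-up counter values for illustration only, not measurements.
before = {"Created_tmp_tables": 1000, "Created_tmp_disk_tables": 50}
after = {"Created_tmp_tables": 1400, "Created_tmp_disk_tables": 150}

share = disk_tmp_share(before, after)
print(f"{share:.0%} of new temporary tables were created on disk")
```

Comparing this share for similar workloads before and after the migration gives a rough signal; a clear jump suggests that the wider columns are pushing temporary tables to disk.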

Conclusion

Converting to the utf8mb4 character set brings better performance, a larger range of characters, including emojis, and new collations (sorting rules). This conversion comes at almost no cost and can be done smoothly.

Before migrating, ensure that:

  • You have converted all VARCHAR columns that could store more than 16383 characters to the TEXT or LONGTEXT data type
  • You have adjusted index definitions that, after migration, could take more than 767 bytes for the REDUNDANT and COMPACT row formats, or more than 3072 bytes for the DYNAMIC and COMPRESSED row formats
  • You have optimized your queries so that they do not start using internal disk-based temporary tables

Original post: 29 Mar 2022, by Sveta Smirnova.
