At work, I am in charge of upgrading my team’s database fleet from MySQL 5.7 to MySQL 8.0 within a short (3 months) timeframe before AWS charges an extra 40% to maintain our (legacy) databases.
We know that most of our SQLs are simple SELECT, UPDATE and INSERTs. Looking at the official MySQL upgrading document, it seems likely that we won’t be facing any compatibility issue during the upgrade. But how can I be sure? How can I convince our team’s stakeholders that it is the case?
Here’s a high-level plan: First, obtain a sample of database audit logs. Then, group the SQL statements into a small number (<100) of unique SQL statements. Lastly, go through these unique SQL statements and present it to stakeholders.
I first obtained a sample of database audit logs. These files come in gzipped CSV, each containing about 1 million SQL statements. I fired up DuckDB, a fast local OLAP database, and use its convenient CSV imports feature to create a table of my audit log file.
The SQL statements is stored in the query
field of the table.
Next, I will group these SQL statements. I searched around for a pre-existing tool to group these SQLs but couldn’t find any. So, let’s do it ourselves. It shouldn’t be too hard and knowing my team’s table design should help.
Before we can do the grouping however, we need to do some pre-processing of the statements. Otherwise, queries with similar structure (e.g. WHERE id = 1
and WHERE id = 2
) ended up in different groups.
Here’s the SQL I used:
The overall SQL query used to group our audit logs ended up pretty simple.
queries1
removes statements that are either operational-related, e.g. PURGE
and FLUSH
. queries2
makes use of REGEXP feature to replace string values and field names with ?
. The reason I need to remove the field names is because the SQL generated by our ORM has an unstable order, hence there are many queries are the same just with the fields shuffled around. queries2
also exploits our table design knowledge by removing ids (that I know is between 7 and 9 digits) and normalizing sharded table names by turning t_table_01, t_table_02 into t_table_XX. In essence, queries2
reverts what SQL drivers did, i.e. turning raw SQL back into prepared statements.
Running this SQL in DuckDB is really fast, and I ended up with 20-100 SQL statement groups. It is now not intractable to go through it one-by-one to check if there are any features that is deprecated in MySQL 5.7
Does this 100% guarantee that I won’t face compatibility issues during the upgrade? Definitely not. There are other ways that provides better guarantee, e.g. Facebook’s shadow replay tool, but it gets the job done.