At work, I am in charge of upgrading my team’s database fleet from MySQL 5.7 to MySQL 8.0 within a short (3 months) timeframe before AWS charges an extra 40% to maintain our (legacy) databases.

We know that most of our SQLs are simple SELECT, UPDATE and INSERTs. Looking at the official MySQL upgrading document, it seems likely that we won’t be facing any compatibility issue during the upgrade. But how can I be sure? How can I convince our team’s stakeholders that it is the case?

Here’s a high-level plan: First, obtain a sample of database audit logs. Then, group the SQL statements into a small number (<100) of unique SQL statements. Lastly, go through these unique SQL statements and present it to stakeholders.

I first obtained a sample of database audit logs. These files come in gzipped CSV, each containing about 1 million SQL statements. I fired up DuckDB, a fast local OLAP database, and use its convenient CSV imports feature to create a table of my audit log file.

CREATE TABLE testdb_00 AS
  SELECT serverhost, username, host, operation, database, query, retcode
  FROM read_csv('/Users/bennett.clement/Downloads/auditlog/testdb/000000.gz',
      compression='gzip',
      header = false,
      columns = {
          'timestamp': 'VARCHAR',
          'serverhost': 'VARCHAR',
          'username': 'VARCHAR',
          'host': 'VARCHAR',
          'connectionid': 'BIGINT',
          'queryid': 'BIGINT',
          'operation': 'VARCHAR',
          'database': 'VARCHAR',
          'query': 'VARCHAR',
          'retcode': 'BIGINT'
      });

The SQL statements is stored in the query field of the table.

Next, I will group these SQL statements. I searched around for a pre-existing tool to group these SQLs but couldn’t find any. So, let’s do it ourselves. It shouldn’t be too hard and knowing my team’s table design should help.

Before we can do the grouping however, we need to do some pre-processing of the statements. Otherwise, queries with similar structure (e.g. WHERE id = 1 and WHERE id = 2) ended up in different groups.

Here’s the SQL I used:

WITH queries1 AS (
    select query from testdb_00
    where operation='QUERY'
    and query not like '%@@%' and query not like 'PURGE BINARY LOGS %' and query not like 'FLUSH %' and query not like 'SHOW %' and query not like 'show %'
    and query not like '%_schema%' and query not like '%_SCHEMA%' and query not like '%mysql.%' and query not like 'START %'
    and query not like 'call %' and query not like 'SET time_zone=%' and query <> 'SET NAMES utf8mb4' and query not like 'set local %'
    and query not like 'SET .*' and query <> 'BEGIN' and query <> 'COMMIT'
), queries2 AS (
    select query,
    regexp_replace(
    regexp_replace(
    regexp_replace(
    regexp_replace(
    regexp_replace(query,
    '(''.*?'')', '?', 'ig'), -- remove strings
    '(\d{7,9})', '?', 'ig'), -- remove id
    '(.*(?:t_table_|db_database_))\S+(.*)', '\1XXX\2', 'i'), -- normalize sharded tables
    '(?:\w+([,)]))', '?\1', 'ig') -- remove field names
    as q from queries1
)
select q, count(1) as c from queries2 group by q order by c desc;

The overall SQL query used to group our audit logs ended up pretty simple.

queries1 removes statements that are either operational-related, e.g. PURGE and FLUSH. queries2 makes use of REGEXP feature to replace string values and field names with ?. The reason I need to remove the field names is because the SQL generated by our ORM has an unstable order, hence there are many queries are the same just with the fields shuffled around. queries2 also exploits our table design knowledge by removing ids (that I know is between 7 and 9 digits) and normalizing sharded table names by turning t_table_01, t_table_02 into t_table_XX. In essence, queries2 reverts what SQL drivers did, i.e. turning raw SQL back into prepared statements.

Running this SQL in DuckDB is really fast, and I ended up with 20-100 SQL statement groups. It is now not intractable to go through it one-by-one to check if there are any features that is deprecated in MySQL 5.7

Does this 100% guarantee that I won’t face compatibility issues during the upgrade? Definitely not. There are other ways that provides better guarantee, e.g. Facebook’s shadow replay tool, but it gets the job done.