Versioning a MySQL Database on GitHub

As part of my work with Scripta Qumranica Electronica I have been working together with Spencer Jones to get our project(s) up on GitHub in a way that others might be able to easily spin up a development environment and in such a way that our open-source data and software are freely accessible. Spencer put together a wonderful Docker container so that the a MariaDB instance can be loaded effortlessly on any developer’s computer. The task of making our unique data available in that container fell to me.

The Problem

I know very little about versioning databases, my sense is that it is not all that common, and when databases are available on GitHub they will be either an empty schema or a full database dump. I did look into Klonio a while ago, but it wasn’t yet ready for use. A database dump would have been fine, and in fact that is how we started out. But as the database size grew (now over 1GB uncompressed), it was no longer possible to fit even a compressed dump within GitHub’s file size limitation. So the first solution is to dump to individual tables to individual files, but we should also run some constraints on those dumps so that we are only uploading our own SQE data to GitHub, and not any data that the users of the database may have input. That requirement puts some limitations on the “dumping” procedure we use.

Wrangling Query Output into CSV Files

Enter SELECT INTO OUTFILE, a cool way to dump queries into text files—CSV text files by default. There are many settings that can be used to change the output, but I will just work with the defaults for now. A basic query with some filtering goes something like this:

SELECT column1, column2
INTO OUTFILE "/my/database/table.sql"
FROM table1
JOIN table2 USING(table1_id)
WHERE table2.user = 0

It is important to remember that the path you used here is relative to the server, not whatever client you may be using to execute the query. So, you will need access to the server hosting the database and the database will need to have write permissions to the folder you specify. Also, it will do nothing if such a file already exists.

Getting Automated

Ok, I am not going to write all these queries by hand. Fortunately Ingo Kottsieper, who is in charge of designing our database, has used some very good practices to regularize the tables and their relationships. Firstly the primary key for any table is labeled table_id, so joins become very easy and generally look something like:... FROM table1 JOIN table2 USING(table1_id) or ... FROM table1 JOIN table2 USING(table2_id). Also, all the tables into which users might insert data have a corresponding *_owner table, and this allows us to easily filter data that users have input from the base data that we have bulk imported.

So, since I was tired from all this research and various failed attempts, I wrote my backup script in python. Not only that, it isn’t very idiomatic python either—don’t judge. Also I don’t make use of the benefits of the mysql connection pool in the code below, it is just copied from some of my other bad python code. Anyway, we start by grabbing a list of all the tables in the database, specifying some tables that we will exclude from automated processing (stuff with user passwords, stuff that needs to be treated special, etc.), and filtering these in sets so we don’t accidentally try to back up a table twice:

import mysql.connector
from mysql.connector.pooling import MySQLConnectionPool

dbconfig = {'host': "localhost",
            'user': "username",
            'password': "password",
            'database': "database_name"

cnxpool = mysql.connector.pooling.MySQLConnectionPool(pool_name = "mypool",
                                                      pool_size = 30,

db = cnxpool.get_connection()
cursor = db.cursor()
result_set = cursor.fetchall()
path = '/Users/bronson/sqe-mysql-backup/'
owner_tables = set()
non_owner_tables = set()
exclude_tables = {'user', 'user_sessions', 'sqe_session', 'artefact', 'scroll_version',
                  'external_font_glyph', 'image_to_image_map', 'single_action', 'main_action'}
for result in result_set:
    if 'owner' in result[0]:
        owner_tables.add(result[0].replace("_owner", ""))
non_owner_tables = non_owner_tables - owner_tables

Now that I have some lists of tables, I can write a simple loop to back them all up in one fell swoop—or more properly 100 fell swoops. The variable files will collect all the filenames if you want to print them out somewhere; and 1058 is a magic number for the point where automatically created data has already stopped and user data has just begun.

files = set()
for table in owner_tables:
    if table not in exclude_tables:
        print('Exporting table: %s' % table)
        query1 = 'SELECT ' + table + '.* INTO OUTFILE "' + path + 'tables/' + table + '.sql" FROM ' + table + ' JOIN ' \
                + table + '_owner USING(' + table + '_id) WHERE ' + table + '_owner.scroll_version_id < 1058'

    print('Exporting table: %s_owner' % table)
    query2 = 'SELECT * INTO OUTFILE "' + path + 'tables/' + table + '_owner.sql" FROM ' + table + '_owner WHERE ' \
             + table + '_owner.scroll_version_id < 1058'
    files.add(path + table + '.sql')
    files.add(path + table + '_owner.sql')
for table in non_owner_tables:
    if table not in exclude_tables:
        query3 = 'SELECT ' + table + '.* INTO OUTFILE "' + path + 'tables/' + table + '.sql" FROM ' + table
        files.add(path + table + '.sql')

After this I still do need to run a couple custom query dumps (only 3). Mainly this was because some tables have Geometric data types, which are binary data types, and SELECT ... INTO OUTFILE doesn’t play well with that. So I convert those columns to ASCII representations in the exported CSV and then use a custom parser in my import script. I then gzip all the backed up tables, use bash script to compare them with the files I already have in the repository—cmp -i 8 /backup/file.gz /github/file.gz works the magic for comparing to gzipped files (it ignores the header info, but make sure your files have the same name). Finally I dump a bare schema from my DB into the GitHub project, and push it all up to the web.

Getting that Data into a New Database

Now you, the unsuspecting collaborator, pull the latest database down to your devel server. Which brings me to my even lazier import script. It is just simple bash since I wasn’t sure if everyone would have the right tools for a python script using a mysql connector. Everyone has bash right?

The bash script just uses the command line mysql client. It loads up the schema, unzips all the tables, and loads each backup file into the DB. It is important to turn off foreign key checks before running the import command and then turning them back on afterwards—I assume this is not being run on a production server. Hopefully the script does not run too fast; you do need a little time to grab a coffee. If anyone is interested, the command to load the backup files looks like this:

for file in tables/*.sql; do
    printf "\rLoading table: ${table##*/}\n"
    mysql --host=${host} --user=${user} --password=${password} --local-infile ${database} -e "SET FOREIGN_KEY_CHECKS=0;
    LOAD DATA LOCAL INFILE '$cwd/$file' INTO TABLE ${table##*/};

    pid=$! # Process Id of the previous running command

    while kill -0 $pid 2>/dev/null
      i=$(( (i+1) %4 ))
      printf "\r${spin:$i:1}"
      sleep .1

The while loop at the end just gives you a little spinner so you know your computer really is still working. Also, it provides some low-key entertainment while you sip that coffee you just got. Thankfully, Spencer Jones worked the import script right into the docker container script, so this install script meshes seamlessly with the installation of our Docker container. What to do now that you have a copy of the database working in your development environment? Finish that coffee and go get to work on some cgi scripts and queries…

Some final thoughts

If we didn’t gzip the individual table backups, we would have a fully readable versioning system for the DB on GitHub. Some folks may wish to do this if they can stay within the GitHub file limitations (100MB I think). We could not. In fact, my estimates show that one of our tables, the one with the Geometry data, will probably grow beyond 2GB by the date we get all of our data into the it. This means, I will be looking for another, better solution in the not too distant future. Helpful hints and comments are most certainly welcome! You can find the whole project, messy scripts and all, here

Pin It on Pinterest