Versioning a MySQL Database on GitHub

As part of my work with Scripta Qumranica Electronica I have been working together with Spencer Jones to get our project(s) up on GitHub in a way that lets others easily spin up a development environment and keeps our open-source data and software freely accessible. Spencer put together a wonderful Docker container so that a MariaDB instance can be loaded effortlessly on any developer’s computer. The task of making our unique data available in that container fell to me.

The Problem

I know very little about versioning databases; my sense is that it is not all that common, and when databases do show up on GitHub they are usually either an empty schema or a full database dump. I did look into Klonio a while ago, but it wasn’t yet ready for use. A database dump would have been fine, and in fact that is how we started out. But as the database grew (now over 1GB uncompressed), it was no longer possible to fit even a compressed dump within GitHub’s file size limit. So the first solution is to dump individual tables to individual files, but we also need to apply some constraints to those dumps so that we only upload our own SQE data to GitHub, and not any data that users of the database may have entered. That requirement puts some limitations on the “dumping” procedure we use.

Wrangling Query Output into CSV Files

Enter SELECT ... INTO OUTFILE, a cool way to dump query results into plain text files. The output is tab-delimited by default rather than true comma-separated CSV, and there are many settings for changing the format, but I will just work with the defaults for now. A basic query with some filtering goes something like this:

SELECT column1, column2
INTO OUTFILE "/my/database/table.sql"
FROM table1
JOIN table2 USING(table1_id)
WHERE table2.user = 0

It is important to remember that the path you use here is interpreted on the server, not on whatever client you are using to execute the query. So you will need access to the server hosting the database, the MySQL server process will need write permission for the folder you specify, and the secure_file_priv setting may further restrict where it is allowed to write. Also, the statement will simply fail if the target file already exists; it never overwrites a file.
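If you want to sanity-check those constraints before running a big export, something like the following sketch does the trick. It uses the same mysql.connector library that appears later in this post; the credentials, paths, and table names here are just placeholders, and it swaps in explicit CSV formatting options in place of the tab-delimited defaults:

import mysql.connector

# Placeholder credentials; use whatever your setup actually needs.
cnx = mysql.connector.connect(host="localhost", user="username",
                              password="password", database="database_name")
cursor = cnx.cursor()

# If secure_file_priv is set, INTO OUTFILE may only write inside that directory.
cursor.execute("SHOW VARIABLES LIKE 'secure_file_priv'")
print(cursor.fetchone())

# The same filtered export as above, but with explicit CSV formatting options.
cursor.execute("""
    SELECT column1, column2
    FROM table1
    JOIN table2 USING(table1_id)
    WHERE table2.user = 0
    INTO OUTFILE '/my/database/table.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
""")
cnx.close()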

Getting Automated

Ok, I am not going to write all these queries by hand. Fortunately Ingo Kottsieper, who is in charge of designing our database, has used some very good practices to regularize the tables and their relationships. First, the primary key of each table is named after the table itself (table_id for table), so joins become very easy and generally look something like ... FROM table1 JOIN table2 USING(table1_id) or ... FROM table1 JOIN table2 USING(table2_id). Second, every table into which users might insert data has a corresponding *_owner table, and this allows us to easily separate the data users have entered from the base data that we bulk imported.

So, since I was tired from all this research and various failed attempts, I wrote my backup script in Python. Not only that, it isn’t very idiomatic Python either, so don’t judge. I also don’t make use of the benefits of the MySQL connection pool in the code below; it is just copied from some of my other bad Python code. Anyway, we start by grabbing a list of all the tables in the database, specifying some tables to exclude from automated processing (tables with user passwords, tables that need special treatment, etc.), and sorting the rest into sets so we don’t accidentally try to back up a table twice:

from mysql.connector.pooling import MySQLConnectionPool

dbconfig = {'host': "localhost",
            'user': "username",
            'password': "password",
            'database': "database_name"
            }

# A connection pool is overkill for this script; it is just copied from other code.
cnxpool = MySQLConnectionPool(pool_name="mypool",
                              pool_size=30,
                              **dbconfig)

db = cnxpool.get_connection()
cursor = db.cursor()
sql = 'SHOW TABLES'
cursor.execute(sql)
result_set = cursor.fetchall()
path = '/Users/bronson/sqe-mysql-backup/'
owner_tables = set()
non_owner_tables = set()
exclude_tables = {'user', 'user_sessions', 'sqe_session', 'artefact', 'scroll_version',
                  'external_font_glyph', 'image_to_image_map', 'single_action', 'main_action'}
# Pair each *_owner table with its base table; everything else is a non-owner table.
for result in result_set:
    if 'owner' in result[0]:
        owner_tables.add(result[0].replace("_owner", ""))
    else:
        non_owner_tables.add(result[0])
non_owner_tables = non_owner_tables - owner_tables

Now that I have my lists of tables, I can write a simple loop to back them all up in one fell swoop (or, more properly, about a hundred fell swoops). The files variable collects all the filenames in case you want to print them out somewhere, and 1058 is a magic number: the scroll_version_id at which our automatically imported data ends and user-created data begins, hence the < 1058 filter below.

files = set()
for table in owner_tables:
    if table not in exclude_tables:
        print('Exporting table: %s' % table)
        query1 = 'SELECT ' + table + '.* INTO OUTFILE "' + path + 'tables/' + table + '.sql" FROM ' + table + ' JOIN ' \
                 + table + '_owner USING(' + table + '_id) WHERE ' + table + '_owner.scroll_version_id < 1058'
        cursor.execute(query1)
        files.add(path + 'tables/' + table + '.sql')

    print('Exporting table: %s_owner' % table)
    query2 = 'SELECT * INTO OUTFILE "' + path + 'tables/' + table + '_owner.sql" FROM ' + table + '_owner WHERE ' \
             + table + '_owner.scroll_version_id < 1058'
    cursor.execute(query2)
    files.add(path + 'tables/' + table + '_owner.sql')
for table in non_owner_tables:
    if table not in exclude_tables:
        print('Exporting table: %s' % table)
        query3 = 'SELECT ' + table + '.* INTO OUTFILE "' + path + 'tables/' + table + '.sql" FROM ' + table
        cursor.execute(query3)
        files.add(path + 'tables/' + table + '.sql')

After this I still need to run a few custom query dumps (only 3). Mainly this is because some tables have Geometry data types, which are stored as binary, and SELECT ... INTO OUTFILE doesn’t play well with that. So I convert those columns to ASCII representations in the exported CSV and then use a custom parser in my import script. I then gzip all the backed-up tables and use a bash script to compare them with the files already in the repository: cmp -i 8 /backup/file.gz /github/file.gz works the magic for comparing two gzipped files (it ignores the header info, but make sure your files have the same name). Finally, I dump a bare schema from my DB into the GitHub project and push it all up to the web.
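For the curious, the geometry workaround and the comparison step look roughly like the sketch below. It reuses the cursor and path from the script above; the table and column names are invented placeholders, and ST_AsText is the standard MySQL/MariaDB function for turning a binary geometry value into its text representation:

# Roughly what one of the hand-written geometry dumps looks like: convert the
# binary GEOMETRY column to its text (WKT) form so it survives the CSV round trip.
# shape_table and shape_polygon are placeholder names, not the real SQE schema.
geom_query = ('SELECT shape_table_id, ST_AsText(shape_polygon) '
              'INTO OUTFILE "' + path + 'tables/shape_table.sql" '
              'FROM shape_table')
cursor.execute(geom_query)

# A Python take on the cmp -i 8 trick: compare the decompressed contents so the
# gzip header (timestamp, original filename) never causes a false mismatch.
import gzip

def same_dump(backup_file, repo_file):
    with gzip.open(backup_file, 'rb') as a, gzip.open(repo_file, 'rb') as b:
        return a.read() == b.read()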

Getting that Data into a New Database

Now you, the unsuspecting collaborator, pull the latest database down to your development server, which brings me to my even lazier import script. It is plain bash, since I wasn’t sure everyone would have the right tools for a Python script using a MySQL connector. Everyone has bash, right?

The bash script just uses the command-line mysql client. It loads the schema, unzips all the tables, and loads each backup file into the DB. It is important to turn off foreign key checks before running the import and to turn them back on afterwards (I assume this is not being run on a production server). Hopefully the script does not run too fast; you do need a little time to grab a coffee. If anyone is interested, the command that loads the backup files looks like this:

# host, user, password, database, and cwd are assumed to be set earlier in the script
spin='-\|/'
i=0
for file in tables/*.sql; do
    table=${file%.sql}
    printf "\rLoading table: ${table##*/}\n"
    mysql --host=${host} --user=${user} --password=${password} --local-infile ${database} -e "SET FOREIGN_KEY_CHECKS=0;
    LOAD DATA LOCAL INFILE '$cwd/$file' INTO TABLE ${table##*/};
    SET FOREIGN_KEY_CHECKS=1;" &

    pid=$! # Process id of the mysql command we just backgrounded

    # Show a little spinner until that mysql process finishes
    while kill -0 $pid 2>/dev/null
    do
      i=$(( (i+1) %4 ))
      printf "\r${spin:$i:1}"
      sleep .1
    done
done

The while loop at the end just gives you a little spinner so you know your computer really is still working, and it provides some low-key entertainment while you sip that coffee you just got. Thankfully, Spencer Jones worked this import script right into the Docker setup, so it runs seamlessly as part of installing our Docker container. What to do now that you have a copy of the database working in your development environment? Finish that coffee and get to work on some CGI scripts and queries…

Some final thoughts

If we didn’t gzip the individual table backups, we would have a fully readable versioning system for the DB on GitHub. Some folks may wish to do this if they can stay within GitHub’s file size limit of 100MB per file. We could not. In fact, my estimates show that one of our tables, the one with the Geometry data, will probably grow beyond 2GB by the time we get all of our data into it. This means I will be looking for another, better solution in the not too distant future. Helpful hints and comments are most certainly welcome! You can find the whole project, messy scripts and all, here.

A Thought on Steps to a New Edition of the Hebrew Bible

Most people who might be reading this blog will be aware of the new edition of the Hebrew Bible now being produced under the moniker The Hebrew Bible: A Critical Edition (it started out as the Oxford Hebrew Bible). The main distinction of the project is its aim to produce eclectic editions of the books of the Hebrew Bible, rather than the now traditional diplomatic editions of Codex Leningradensis (so largely BHS/BHL) or the Aleppo Codex (so HUB). Ron Hendel, the general editor, is to be congratulated for publishing, alongside these new critical editions, a monograph discussing the theoretical framework within which the editorial work is being carried out. The book has garnered quite a bit of attention and discussion within the scholarly community since its publication.

I have been working through the book rather closely for a while now, since I am doing a large amount of text-critical research these days. One of the difficulties in designing the purview of a project like this new edition of the Hebrew Bible is to define and defend its constraints. One of the innovations of this edition is that it works to establish one or more literary archetypes, which it presents in Hebrew script and language. This is, of course, problematized by the fact that some archetypes may currently be derived only from Greek texts (or the so-called daughter versions of the Septuagint), and thus translation (back?) into Hebrew becomes necessary. A discussion of that decision notwithstanding, I would like to focus on a different problem that Hendel rightly connects to this editorial practice of trying to represent archetypes: the use of Tiberian pointing.

Hendel acknowledges the logical difficulty to be overcome as follows:

The copy-text will be L, our oldest complete manuscript of the Hebrew Bible. Since the accidentals of vocalization and accentuation in L are the product of medieval scribes, our critical text is open to the complaint of anachronism. (p. 31)

Hendel then goes on to lay out the following arguments in defence of this decision:

There are several ameliorating factors that lessen this dissonance. First, biblical scholars already know that the consonantal text is older than the medieval vocalization system. So a critical text with this overlay is not strange. Second, critical editions in other fields use anachronistic accidentals, including editions of Greek texts (including the New Testament, the LXX, and classical literature) that use rough breathings, accents, punctuation, and miniscule letters, all of which were scribal inventions of the Carolingian era (ninth century CE), roughly contemporary with the Tiberian Masoretes. Third, the phonology of the Tiberian vocalization system is not wholly or even mostly anachronistic. (p. 32)

I find this defence quite surprising, especially with regard to the Greek material. The Greek rough breathings, accents, and punctuation were most certainly not the product of the Carolingian era, though their ubiquitous use in scholarly transmission was established by that point in time. Accents, breathings, and punctuation can already be found in various usages two centuries before the common era, and they appear in our earliest Greek Bible codices and in some of the papyri as well. That means that, at least in the case of the Septuagint, some of those markings could possibly stem from the first penning of the translations, though I do not expect that to be so. Only the minuscule letters belong to the ninth century.

The second surprising claim is that “the Tiberian vocalization system is not wholly or even mostly anachronistic.” I don’t know by what measure we might judge the similarity of one phonological system to another, but even if the books of the Hebrew Bible were written over as short a time span as a few hundred years and within a small geographical region, comparative evidence would suggest that the phonology of those individual works varied. Certainly the phonology imposed by the transmitters and redactors of these works would vary as well. Some of the spellings at Qumran betray phonologies more in line with those found in the Babylonian system. Attenuation of a -> i (maqtal -> miqtal) occurred in the Byzantine era. I could go on and on about the problems of reconstructing Hebrew phonology from a diachronic perspective, but the real crux is that the whole defence seems rather ad hoc.

I have no problem with anachronistically using Tiberian phonology, as indicated by the accents, in transcriptions. Ahituv did this in his Echoes from the Past, where we find Moabite and other NWS languages/dialects presented with Tiberian pointing. The “point” is that the Tiberian phonology in that book was being used to help readers associate the “proper” lexeme and morphological parsing with any given word, not to offer a historical phonological reconstruction such as that attempted in K. Beyer’s “Die Sprache der moabitischen Inschriften,” KUSATU 11 (2010): 5–41. But I find it untenable to claim that the phonology of the Tiberian vocalization system is quite similar to the vocalization of the archetypes of all the books of the Hebrew Bible as witnessed in the manuscript tradition. Perhaps Hendel should have stopped with his first argument in favor of the policy, a statement which, contrary to the others, is both factually correct and reasonable: we Bible scholars (should) know that the Tiberian points are a late addition to traditions that had only consonants, and we are nevertheless accustomed to editions with those points. My only quibble is that I would change “the consonantal text” in his statement to the plural “the consonantal texts,” something that will certainly be clarified by the presentation of parallel versions in the HBCE and the accompanying notes (see also the discussion on p. 29).

The really nice thing about Hendel’s book is that it provides a clearly stated rationale for the editorial policies of the project and relates them to currents within the larger field. I would agree that such an enterprise is worthy of book-length treatment, which can be far more detailed than the customary introductory chapter in the editions themselves. Such an expanded discussion enables readers to use the editions much more judiciously and, at times, to infer the rationale behind certain editorial decisions in the absence of explicit commentary.
