I was sexually assaulted when I was an undergraduate. This is nothing too shocking; if we take even the shortest glance around us, our lives are permeated with this sort of thing. My grandmother was raped, my mother was raped, at least one of my sisters was raped, and my wife was sexually assaulted. Those are just the crimes I know about; what others lie behind the veil of shame?
I now have children of my own, and I fear that they too might be added to the list of family members who have suffered at the hands of sexual predators. So I must now speak up, as so many others already have. This must stop!
As I watched the dialogue around the confirmation of Brett Kavanaugh to the U.S. Supreme Court, I was reminded of the circus surrounding the Clarence Thomas nomination—Anita Hill eventually became a professor at my alma mater. I was too young then to understand what was really going on. But I am older now and have had my own life experiences to draw upon.
In truth, I don’t know what really happened in either case. I was not there, and I am not trained in that field of criminal investigation, though I have my suspicions. What concerns me most is not the past; if we have enough evidence to convict a sex offender in court, that person should pay for his or her crimes. But if the trail of evidence has gone cold, I do not know what can be done except redouble our efforts at vigilance. My greatest concern, however, is for our future, and I will use my experience as an example.
It was a late night on the dark streets of L.A. I had just spent some quality time with my fiancée, Sarah, who lived on the outskirts of Compton, and I was making my way back to my lovely Loyola Marymount University, right by the ocean in Westchester. I had made it all the way to Manchester and Sepulveda and needed to walk the rest of the way back to campus, since that bus had stopped running. It was late and had gotten cold, especially since I only had a tank top on (it had been very hot earlier in the day). I still had about a mile of walking left when a car pulled up and the man driving offered me a ride. Of course I took the offer; I was an invincible 19-year-old (white) male, what could happen?
Well, the older man who picked me up couldn’t communicate too well in English, and I didn’t know any Spanish, but he seemed to understand where I needed to go. As the drive went on, he asked me some questions I did not understand and began touching my leg. I became a bit confused, since it seemed somehow sexual, but why would one man take a sexual interest in another? Yes, I grew up rather sheltered. But as the touching became more insistent and travelled up to my crotch, I decided enough was enough and demanded to be let out. And what do you know, we were at the gates to the University. What luck!
As I walked the rest of the way to my dorm I remember being so angry at myself. I should have known better than to take a ride from a stranger. All this happened because of some really poor decisions I made. I shouldn’t have been walking home so late at night in such “skimpy” clothing. In my mind, the whole thing was my fault, and it took me years to finally realize that I had in fact experienced sexual assault. Funny thing is, if I saw the guy today, I doubt I would even recognize him.
Another side:
As a teen, I had no idea what I wanted to do with my life. As I neared completion of high school, I got the idea that music composition was my thing, and I decided I wanted to go to the New England Conservatory of Music. So, I called up my old piano teacher—I had studied piano pretty seriously as a kid. He agreed to take me on again; I had been one of his best students, after all. He even allowed me to work around the house and yard in payment for the lessons—a great deal, since I had no money. In any event, over a couple years of study, I was never able to get to the level I knew I needed (though I did work semi-professionally playing guitar at that time). Eventually I went off to University to study crazy old dead languages and what-not.
Even though I never went forward with a career in music, I remember fondly my time studying piano. My teacher had a profound impact on my life—I first learned textual criticism while working through various “versions” of Chopin’s preludes. Indeed, by the time I went off to University, we had become quite close. Though we grew apart when I was on the west coast (he lived in New Hampshire, where I grew up), I remember having a lovely evening with him and his wife shortly after Sarah and I got married.
Well, a few years later he was accused of sexually assaulting one of his students. It is hard for me to believe it could be true. He was always warm and personable. I even remember that when I would start a lesson cold and rather stressed, he could see it and would gently push my shoulders down, give my arms a little shake, and then the cares of the world melted away and we had the music… Perhaps something like that happened with someone less receptive, or maybe he was a sexual predator. All I know is that for me he was a sweet man, and I cared for him deeply. I don’t know if it matters, but he went to trial, walked out during deliberations, shot himself in the chest with a handgun, and died. Again, I don’t know if it matters, but it turned out to be a deadlocked jury, and the judge declared a mistrial. In the words of my favorite author, “so it goes.”
So what next:
When I think back on my own experiences, I wonder. Why didn’t I notify campus police immediately? Why didn’t my wife call the police on the guy that tried to rape her? He did go on to sexually abuse a young boy on the bus. Is she to blame for that too? Victims often feel that way. What did my guy go on to do? Was my piano teacher guilty? Or did he kill himself because the accusation had ruined him, or even hit a little too close to home, like Shirley MacLaine’s character in “The Children’s Hour”? I simply do not know.
We might not be able to turn back time and get the evidence to convict every sex offender in the past. If we can, we should, even if the trail grows cold on a lot of these cases. But to make things better for our children, we need to change our culture now. Why should we continue to be ashamed about being victims? That is backwards.
I don’t want to talk about our biblical culture or any odd ideas of the past. Our society is no longer one to offer a rapist the opportunity to marry his victim (Deut 22:28–29). But I wonder how far we have moved on from poor Leucothoë, buried alive for her unhappy role as the glorious sun’s rape victim. Or that jewel of a boy Hermaphroditus, violently raped by the “love-struck” nymph Salmacis, and who ceased to be a “real man” when the gods granted Salmacis her request that the two become one. Both are metaphors for how victims of sexual assault can be treated and can see themselves in our modern society. And that is a problem.
We can tell our children that they are not to blame if they should, God forbid, become victims of sexual assault. But that won’t matter when they see victims continue to be ostracized in the media and among their peers. How can they be confident in their innocence and their right to justice, when they see other victims do everything right and still fail to find justice? Or worse, see the powerful of our society bury such victims under threats to family members, blacklists, and any other form of coercion—heaping humiliation upon humiliation. These are problems we can solve, some quite easily if we just have the will.
What is more, we could be doing a much better job of demonstrating what is and is not acceptable behavior in our society. I can’t begin to count how many times I have seen or read about inappropriate behavior in recent media that is still treated as cute or funny, or, perhaps worse, as inevitably par for the course. The one surefire way to confuse the issue is … to confuse the issue.
I wouldn’t be so naïve as to claim that I know the answers here or really understand everything that is going on, but for my part I am no longer confused. And that in and of itself is part of the path forward.
Data are the lifeblood of the computational humanities. My first serious encounter with the usage of computers to benefit humanistic research was the Perseus project. It was at the turn of the new millennium and I had just begun my Classics studies at Loyola Marymount University in Los Angeles. I remember marvelling at how quickly I could look up words as I was reading Plato or the Aeneid. Not only that, Perseus seemed to have all the major works I might want to read available at my fingertips, along with Smyth’s famous grammar, the “middle” Liddell, Lewis and Short, and even the big LSJ—the only thing we need now is the OLD…or the TLL! As a poor kid in a relatively wealthy university, it was empowering to have access to all these resources which I could never dream of acquiring on my own (Sarah, my wife, did eventually buy me Smyth and I got an ancient copy of LSJ when the Semitic library at the Catholic University of America was cleaning its stacks). Something about having these works appearing on my own computer screen made me feel that I could own these monuments of western learning too, if only in the form of 1’s and 0’s flitting through the aether.
But feeling like one has ownership of those bits of data is not enough. There is often a sense that if something is on the internet, it is free for everyone to use, which is not technically correct. Licensing still applies, and this is one of the main problems plaguing computational humanities projects. With the explosion of open source everything, the world is becoming more comfortable with “open” data, and publicly funded research is thankfully beginning to be required to provide it. I see that my old mainstay the Perseus project is using a CC-BY-SA license, which is cool because it allows products like Logos to package the Perseus data into a software platform familiar to their user base. If data are the lifeblood of computational humanities, restrictive licenses, and even a failure to provide any license at all, are a death sentence. And given its etymological background, can data (think Latin do, dare) really be held back? While there are many ways to copyright data these days, the Creative Commons family of licenses provides clarity and simplicity and would be a welcome addition to any DH project. Using such a license ensures the continued usability of a project’s hard-earned data beyond the financial, institutional, and temporal bounds of the project itself, thus integrating the work of individual projects into the wider DH collective. A project’s code can also be openly licensed, for example under the MIT license, even if making the right choice for code licensing can be complicated by the use of libraries with possibly conflicting licenses (Apache, BSD, GPL, LGPL…)—a complication greatly simplified by tools like FOSSA.
Even when data are clearly licensed in a permissive form, that is only the first step to making those data productive within the computational humanities community. The data need to be accessible in some recognizable data format (CSV, JSON, SQL) and to have an understandable semantics and syntax (e.g., TEI and XML respectively). I currently use data from many sources in my professional capacity within the Scripta Qumranica Electronica project and in my own personal research. Much of my time in such endeavors is bound up in transforming the data from the format it came packaged in into a format I can compute with and that can link up with the other data sources I might be using. This may involve reading the data into objects/structs in whatever programming language I’m using (Go, Python, Perl, Javascript, etc.) for in-memory processing, parsing it into a database (PostgreSQL, MySQL, GraphDB, Jena, etc.) for logical queries, or converting it into whatever format is required for a particular processing suite that I want to experiment with (TRACER, OpenNMT, Ugarit). This is simply inefficient and results in a lot of duplicated work of inconsistent quality. What is needed is a common data interchange format and a way to provide an ontology for the data stored in it.
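To make that daily format juggling concrete, here is the kind of throwaway glue code that gets written over and over again; the words.csv file and its word, lemma, and pos columns are purely hypothetical:

import csv
import json

# Hypothetical input: a CSV with "word", "lemma", and "pos" columns.
with open("words.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Reshape into the structure the next tool expects, e.g. lemma -> attested forms.
by_lemma = {}
for row in rows:
    by_lemma.setdefault(row["lemma"], []).append(
        {"form": row["word"], "pos": row["pos"]}
    )

# Emit JSON for the next step in the chain.
with open("words.json", "w", encoding="utf-8") as f:
    json.dump(by_lemma, f, ensure_ascii=False, indent=2)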
I guess that was a pretty overt setup for JSON-LD. But let’s not get ahead of ourselves. There are many data formats to choose from, and perhaps CSV is the most versatile, since you can open it in anything, even Excel, but it can be a real pain with complex linked data. And since I am interested in simplifying the display of data as much as access to processing, CSV is a poor choice. A desire for human-readable access to primary research data has led to the popularity of XML (an acceptable, though unwieldy, format for linked data). But this confuses the matter: human readability and digital portability (in all its complexity) represent two axes, and accommodating one often comes at the expense of the other. What I would rather see is a very clear and extensible data representation that does justice to the data and provides the necessary tools to easily extract human-readable representations (along with useful forms for computation). Enter data ontology.
As a classicist, ontology is a funny-sounding word—the participle of ειμι + λογια, so the study of being? That sounds cool: we store not only our data, but also information about what our data actually is, its state of being. This reminds me of what is partially accomplished in SQL databases: we define our data types—Integer, Float, Decimal. But what is the String “לא”? Is it the Hebrew word “no”, the Aramaic word “no”, the number “31”, something else? MySQL data types like TIMESTAMP give us more information about what they are, since we know that whatever data are stored in such a format provide a UTC designation of time; that is, we get not only the data but also know exactly what they mean: a 4-byte integer representing the number of seconds since January 1, 1970. But creating a new data type for MySQL or any other database is no easy thing. An SQL database (with InnoDB tables) allows you to specify relations very easily via the schema, as do graph databases of various sorts, but the definition of data types is very closed, thus limiting the ontological horizons. So are we trying to fit a rug into a room that is too small?
Well, the semantic web developed a way to deal with this long ago. The goal there has been to store data in a very simple format and to couple that with a mechanism to describe those data. The backbone of the semantic web, which doesn’t exactly exist yet, is the triplestore (a type of graph database) conforming to the Resource Description Framework (RDF), with a separate schema (RDFS) providing the ontology, which can be chosen from a preexisting schema or described ad hoc. So what is a triplestore? It is a store of triples: person -> has a name -> name would be an example of a triple; we have a subject “person”, a predicate “has a name”, and a “name” as the object—three values, hence “triple”, in a clearly described relationship. These triples can be represented in various ways, including good old plain text. So we can link anything to anything else with this triple data structure: for instance, a “work” can “have a chapter” “chapter”, a “chapter” can “have a sentence” “sentence”, and this “sentence” can “have a word” a “word”, and that “word” can “be a part of speech” “noun”, and that “noun” can “have a syntactic role” “agent”, but it could also “have a semantic domain” “beverage”, or anything else. I should ask here, though, what is a word: is it a morpheme, a lexeme, a vocable, all the letters between spaces in a text? Is it:
Or should it be something different from either of those? Having an ontology with which to begin defining the meaning of each piece of data helps here when we want to make clear what our data mean and thus make it easier for others to know how to use them for their own purposes.
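As a bare-bones sketch of the triple idea described above, with nothing more than Python tuples standing in for a real triplestore (the identifiers simply mirror that example):

# A toy "triplestore": just (subject, predicate, object) tuples.
triples = [
    ("work1", "has a chapter", "chapter3"),
    ("chapter3", "has a sentence", "sentence12"),
    ("sentence12", "has a word", "word5"),
    ("word5", "is a part of speech", "noun"),
    ("word5", "has a syntactic role", "agent"),
    ("word5", "has a semantic domain", "beverage"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

print(query(subject="word5"))  # everything we know about word5

A real triplestore does essentially this, just with indexes, URIs, and an ontology standing behind each term.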
Let’s take names as an example. What if I want to store some information about people with names? I can use the common ontology of foaf for a person: “The Person class represents people. Something is a Person if it is a person. We don’t nitpic about whether they’re alive, dead, real, or imaginary. The Person class is a sub-class of the Agent class, since all people are considered ‘agents’ in FOAF.” I guess maybe I would have suggested that a “person” is, was, will be, or is imagined to be a member of the species homo sapiens (sorry to all you dog and cat lovers out there), but no matter, we will follow the FOAF ontology. So we have a person and we will provide data for that person’s name, whether a “family name” (foaf:familyName) or “first name” (foaf:givenName):
“The familyName property is provided (alongside givenName) for use when describing parts of people’s names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.”
“The givenName property is provided (alongside familyName) for use when describing parts of people’s names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.”
So, with Bronson Brown-deVost I would store “Bronson” as the “givenName” and “Brown-deVost” as the “familyName”, but for someone like Kim Jong-Un, I would describe “Kim Jong” as the “familyName” and “Un” as the “givenName”. In RDF/Turtle syntax that would look something like this (the prefixes and person identifiers are just placeholders):
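@prefix : <http://example.org/people/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:bronsonBrownDeVost a foaf:Person ;
    foaf:givenName "Bronson" ;
    foaf:familyName "Brown-deVost" .

:kimJongUn a foaf:Person ;
    foaf:familyName "Kim Jong" ;
    foaf:givenName "Un" .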
Even though the two names follow different conventions, the ontology of the names is made clear, or at least described in a consistent way. This ontology certainly misses that Kim is a family name and Jong is a more specific clan name, and that Brown is a family name I acquired through marriage, while deVost or Devost is a family name I acquired through birth. But if I wanted a more fine-grained ontology for personal names, I could always create one, and these data, which are probably just UTF-8 strings, could be given more distinctive data types and related in very specific ways to other data types (I could make “marriedName” and “clanName” subproperties of the foaf:familyName property).
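A rough sketch of what that refinement could look like with Python’s rdflib library (the ex: namespace, the new property names, and the person identifiers are all invented for illustration, and the JSON-LD serialization at the end assumes a recent rdflib that bundles the JSON-LD plugin):

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
EX = Namespace("http://example.org/name-ontology#")  # invented namespace

g = Graph()
g.bind("foaf", FOAF)
g.bind("ex", EX)

# Declare the finer-grained name properties as sub-properties of foaf:familyName.
g.add((EX.marriedName, RDFS.subPropertyOf, FOAF.familyName))
g.add((EX.clanName, RDFS.subPropertyOf, FOAF.familyName))

bronson = URIRef("http://example.org/people/bronsonBrownDeVost")  # placeholder URI
g.add((bronson, RDF.type, FOAF.Person))
g.add((bronson, FOAF.givenName, Literal("Bronson")))
g.add((bronson, EX.marriedName, Literal("Brown")))    # family name acquired by marriage
g.add((bronson, FOAF.familyName, Literal("deVost")))  # family name acquired by birth

kim = URIRef("http://example.org/people/kimJongUn")  # placeholder URI
g.add((kim, RDF.type, FOAF.Person))
g.add((kim, FOAF.familyName, Literal("Kim")))
g.add((kim, EX.clanName, Literal("Jong")))
g.add((kim, FOAF.givenName, Literal("Un")))

print(g.serialize(format="json-ld"))  # or format="turtle" for the notation above

Because ex:marriedName and ex:clanName are declared as sub-properties of foaf:familyName, any tool that performs RDFS inference can still treat those values as plain family names.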
Moving to the broader picture, using an ontology provides a semantic schema for data that makes it much easier to know exactly what each piece of data really is intended to represent. What is more, making those data available as JSON (here JSON-LD) makes them easily accessible and consumable for a variety of purposes. There are already software solutions that can be used today for this type of data access (think SPARQL endpoints with things like Apache Jena, etc.). But let’s push further here: we have our data, and it is curated to have a consistent schema and ontology, so we know what our data mean, or are supposed to mean. What we still lack, though, is simplified accessibility, attribution, and versioning of these data. Let me tackle these one by one.
I am very interested in thinning out the data access chain. In SQE we have a long chain of data handlers. Basically, the user of our website requests data by clicking somewhere, the browser-side JavaScript data models make a request over HTTP to our Perl CGI scripts, which query the MariaDB database, which returns the result to the Perl CGI scripts, which send those data back to the user’s browser, which updates the JavaScript data model, which the Vue.js template then automatically reacts to by changing the display the user sees on the screen. We are even thinking of throwing in another middleman like socket.io for real-time data synchronization. These are tried and true technologies, but I am still not thrilled about how many “hands” the data need to pass through between the user and the single “source of truth” (those database files on the SQE server’s disk), nor about the various transformations made along the way to the user’s browser. It is a very difficult juggling act to maintain that single “source of truth” as the data are passed back and forth between the browser and the database. One user could change some datum, and a fraction of a second later another user could try to change that same datum. The first user would receive back an all clear that their write was successful, and not learn until the next complete browser reload that the successfully written data are already out of date. And the second user would perhaps never learn of the first user’s write. I would love to have the browser interact more directly with the data, but a number of issues, most importantly security, stand in the way. No developer in her right mind would expose a database to the world wide web; we simply can’t let everyone have direct access to the datastore…or can we?
The issue of attribution, and indeed ownership, of data is a real problem. Most systems are based on trust: I create data attached to my user credentials and I trust that no one will remove it. The list of people with admin access to a database, and hence the ability to alter my data, is often small. Nonetheless, part of a robust system is not relying on trust but rather engineering security from the ground up. That is why we want the systems that handle our finances and our personal data to be open source, and thus open to public scrutiny. Why should our academic data be any different? Asymmetric encryption works with a pair of keys, one public and one private. These keys are used to alter data, one basically turning it into a mess of unintelligible characters and the other restoring it to its original form. Basically, data locked with a private key can only be opened by people holding the corresponding public key, which also allows them to verify the originator of the content. People can also encrypt data using a public key, which can then only be unlocked by the owner of the private key (we assume people know better than to give out their private key). This gives only the owner of the keypair full access to her or his data, since only that person can unlock messages in both directions. Thus private/public key encryption provides two benefits for shared data: it places control of data access back into the hands of the content originator (if you lock something with your public key, only you can access it, since that requires the corresponding private key), and it allows public verification of the content originator. Let’s try a little example slightly modified from the one jsencrypt provides in their readme:
(Interactive demo: one field displays the message after it has been encrypted, another after it has been decrypted again.)
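If you would rather poke at the same idea outside the browser, here is a rough Python equivalent of that demo using the third-party cryptography package (a stand-in sketch, not the jsencrypt code actually running on this page):

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Generate a keypair; in practice the private key never leaves its owner.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"This datum was created by its rightful owner"

# Anyone can encrypt with the public key; only the private-key holder can read it.
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(message, oaep)
assert private_key.decrypt(ciphertext, oaep) == message

# Signing goes the other way: the private key signs, and anyone holding the
# public key can verify who originated the content.
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)
signature = private_key.sign(message, pss, hashes.SHA256())
public_key.verify(signature, message, pss, hashes.SHA256())  # raises InvalidSignature if forged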
Using encryption keys like these to govern data ownership would be especially useful in a database system with uniqueness constraints, so that when someone tries to create content that already exists, the uniqueness constraint ensures that the first originator is always credited—Ingo Kottsieper has already integrated a different method for achieving this in the SQL database for the SQE project. Such constraints become complex with atomistic linked data, but they should be possible to a large extent, especially with robust ontological schemas.
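Just to illustrate the principle in miniature (a toy example, not the mechanism Ingo built for SQE), a uniqueness constraint over the content itself means that a second attempt to “create” the same datum simply falls back to the existing record and its original creator:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sign (
    sign_id INTEGER PRIMARY KEY,
    content TEXT NOT NULL UNIQUE,  -- the uniqueness constraint
    creator TEXT NOT NULL          -- the credited originator
)""")

def create_sign(content, creator):
    """Insert a new datum, or return the existing one with its original creator."""
    try:
        con.execute("INSERT INTO sign (content, creator) VALUES (?, ?)", (content, creator))
    except sqlite3.IntegrityError:
        pass  # already exists; the first originator keeps the credit
    return con.execute("SELECT content, creator FROM sign WHERE content = ?",
                       (content,)).fetchone()

print(create_sign("לא", "alice"))  # ('לא', 'alice')
print(create_sign("לא", "bob"))    # still ('לא', 'alice')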
Along with this true ownership of data comes the desire to update it. As humans, we do not have a permanent state; we change in relation to the stimuli around us, and sometimes this means changing our opinions and perceptions—in programming terms, this is mutability. Mutability is a big problem because most programs rely on a consistent state. Changing data in a program is not necessarily the problem; the issue is making sure every part of the program shares the same state at the same time. Managing this can become difficult when data is shared across many parts of a local program, across an intranet, or especially across the world wide web. What happens when someone is viewing some data locally and it gets changed on the server? Or what happens when someone loses connectivity to the network and wants to change some data they are already working with? RDBMSs have long incorporated complex routines to synchronize data between query and execution, as well as between master and slave database servers, in closed systems. But we probably also want access to the history of changes, something BigchainDB is all about.
We are now seeing the entrance of new technologies to deal with sharing the state of data across the internet in real-time such as IPFS and various other solutions built on websockets (even high level frameworks like Meteor.js and Django support real-time updates). One neat implementation I have been playing with is Gun.js, which is based on the concept of eventual consistency and cryptographic permissions. Basically data can be shared between any number of clients (ok, server memory limitations do apply, but you can always add more servers). Each user’s web app is able to store data locally when the websocket connection is not available and updates the server when a connection is established again—thus, eventually consistent. Gun.js will choose the highest update value (each time a datum, or “node”, is updated it increments an update counter), and in a race condition (two or more users are trying to update the same datum at the same time), the one with the lexically longest value wins. Hey, you have to decide somehow. Supposedly the whole update history of each datum is accessible, so a version history can be accessed.
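The conflict-resolution idea itself fits in a few lines of Python; this is a deliberately simplified illustration of last-writer-wins with a deterministic tiebreak, not Gun.js’s actual HAM algorithm:

def merge(local, incoming):
    """Keep the value with the higher update counter; break ties deterministically
    so that every peer converges on the same state."""
    if incoming["counter"] != local["counter"]:
        return incoming if incoming["counter"] > local["counter"] else local
    # Race condition: two peers updated the same datum at the same counter value,
    # so pick one deterministically (here, the lexically greater value).
    return incoming if str(incoming["value"]) > str(local["value"]) else local

a = {"value": "blue", "counter": 7}
b = {"value": "green", "counter": 7}
assert merge(a, b) == merge(b, a)  # both peers end up with the same value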
If you want to try this out, change the color in the dropdown box below and watch the background color of this paragraph change. Perhaps it has already been changing as you were reading this, since whenever anyone anywhere changes this value, it updates for everyone everywhere. Go ahead, call up your friend in Australia and start a color battle! My son and I did a little test sending the update back and forth from Germany to Australia a couple times and experienced real-time synchronisation of a second or less!
Or perhaps we could do something more graph-like that looks a bit more similar to a triplestore. Try typing a relationship into the subject, predicate, and object fields below, then make a little query to find every [predicate] of a given [subject]; the results appear underneath (reading “No search run yet.” until you try one).
(Interactive demo: subject, predicate, and object entry fields with a query form and a results list.)
You can always try a new triple of your own; just pop in any subject, predicate, and object. You can add as many objects to the same subject and predicate as you want, and they will magically appear in your search results. Or delete a triple and watch it disappear. In fact, it is never really deleted; the node is just given a value of null, or “tombstoned”, to use the technical term. We could always revive the data later or assign them new values.
In the end, what I am looking for is a distributed, real-time, semantic, versioned, licensed web of knowledge. That is, my dream system, if it could be cobbled together, would have a data set turned into a “knowledge base” via an ontological schema, licensed in a clear way, encoded via the private key of the person who generated it (anyone with that person’s public key can then access it), served dynamically from a truly distributed data store (not just master/slave), with a clear versioning record (and the versioning should maintain compatibility with the licensing). It is a lot to ask, but with technologies like websockets, RDF (and JSON-LD), blockchain, asymmetric cryptography, and Creative Commons, along with implementations like IPFS (with its OrbitDB), gun.js, and BigchainDB, most of the pieces are already here. The Wolfram data repository really nails the semantic knowledge issue along with proper attribution and clear licensing, and Knora is an impressive and similarly poised project using standardized RDF/RDFS/OWL for its knowledge base (though neither appears to be truly distributed or real-time). And gun.js is very close too, only really lacking the semantic component and perhaps simplified searching possibilities. It seems to me that we are now very close to a robust solution; it’s just a matter of time…
As part of my work with Scripta Qumranica Electronica I have been working together with Spencer Jones to get our project(s) up on GitHub in a way that lets others easily spin up a development environment and keeps our open-source data and software freely accessible. Spencer put together a wonderful Docker container so that a MariaDB instance can be loaded effortlessly on any developer’s computer. The task of making our unique data available in that container fell to me.
The Problem
I know very little about versioning databases; my sense is that it is not all that common, and when databases are available on GitHub, they are usually either an empty schema or a full database dump. I did look into Klonio a while ago, but it wasn’t yet ready for use. A database dump would have been fine, and in fact that is how we started out. But as the database size grew (now over 1 GB uncompressed), it was no longer possible to fit even a compressed dump within GitHub’s file size limitation. So the first solution is to dump individual tables to individual files, but we should also run some constraints on those dumps so that we are only uploading our own SQE data to GitHub, and not any data that the users of the database may have input. That requirement puts some limitations on the “dumping” procedure we use.
Wrangling Query Output into CSV Files
Enter SELECT INTO OUTFILE, a cool way to dump query results into delimited text files (tab-separated by default, though people loosely call them CSVs). There are many settings that can be used to change the output, but I will just work with the defaults for now. A basic query with some filtering goes something like this:
SELECT column1, column2
INTO OUTFILE "/my/database/table.sql"
FROM table1
JOIN table2 USING(table1_id)
WHERE table2.user = 0
It is important to remember that the path you use here is relative to the server, not to whatever client you may be using to execute the query. So you will need access to the server hosting the database, and the database server will need write permissions for the folder you specify (depending on the server’s secure_file_priv setting, you may also be limited to one designated export directory). Also, the query will refuse to overwrite an existing file; it simply errors out if the file is already there.
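If you are not sure where the server will let you write, you can simply ask it. A small aside using mysql-connector-python with placeholder credentials:

import mysql.connector

# Placeholder connection settings; use your own server's values.
cnx = mysql.connector.connect(host="localhost", user="root", password="secret")
cur = cnx.cursor()

# An empty value means no restriction, a path restricts OUTFILEs to that folder,
# and NULL means INTO OUTFILE is disabled entirely.
cur.execute("SHOW VARIABLES LIKE 'secure_file_priv'")
print(cur.fetchone())
cnx.close()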
Getting Automated
Ok, I am not going to write all these queries by hand. Fortunately Ingo Kottsieper, who is in charge of designing our database, has used some very good practices to regularize the tables and their relationships. Firstly the primary key for any table is labeled table_id, so joins become very easy and generally look something like:... FROM table1 JOIN table2 USING(table1_id) or ... FROM table1 JOIN table2 USING(table2_id). Also, all the tables into which users might insert data have a corresponding *_owner table, and this allows us to easily filter data that users have input from the base data that we have bulk imported.
So, since I was tired from all this research and various failed attempts, I wrote my backup script in Python. Not only that, it isn’t very idiomatic Python either—don’t judge. Also, I don’t make use of the benefits of the MySQL connection pool in the code below; it is just copied from some of my other bad Python code. Anyway, we start by grabbing a list of all the tables in the database, specifying some tables that we will exclude from automated processing (stuff with user passwords, stuff that needs special treatment, etc.), and filtering these into sets so we don’t accidentally try to back up a table twice:
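In outline, that setup looks something like the following (the connection settings, export path, and exclusion list here are only indicative):

import mysql.connector

# Indicative connection settings.
cnx = mysql.connector.connect(host="localhost", user="root",
                              password="secret", database="SQE")
cursor = cnx.cursor()

# A folder (with a tables/ subfolder) that the MariaDB server is allowed to write to.
path = "/var/lib/mysql-files/"

# Tables we never dump automatically (indicative list only).
exclude_tables = {"user", "sqe_session"}

cursor.execute("SHOW TABLES")
all_tables = {t[0] for t in cursor.fetchall()}

# Tables with a corresponding *_owner table get filtered against user data;
# everything else is dumped wholesale. Sets keep any table from being dumped twice.
owner_tables = {t[:-len("_owner")] for t in all_tables if t.endswith("_owner")}
non_owner_tables = all_tables - owner_tables - {t + "_owner" for t in owner_tables}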
Now that I have some lists of tables, I can write a simple loop to back them all up in one fell swoop—or, more properly, a hundred fell swoops. The variable files collects all the filenames, in case you want to print them out somewhere; and 1058 is a magic number marking the point where our automatically created data stops and user data begins.
files = set()

# Tables with user data: join against the *_owner table and export only rows
# belonging to scroll versions below the magic number 1058 (our bulk-imported data).
for table in owner_tables:
    if table not in exclude_tables:
        print('Exporting table: %s' % table)
        query1 = 'SELECT ' + table + '.* INTO OUTFILE "' + path + 'tables/' + table + '.sql" FROM ' + table + ' JOIN ' \
            + table + '_owner USING(' + table + '_id) WHERE ' + table + '_owner.scroll_version_id < 1058'
        cursor.execute(query1)
        print('Exporting table: %s_owner' % table)
        query2 = 'SELECT * INTO OUTFILE "' + path + 'tables/' + table + '_owner.sql" FROM ' + table + '_owner WHERE ' \
            + table + '_owner.scroll_version_id < 1058'
        cursor.execute(query2)
        files.add(path + 'tables/' + table + '.sql')
        files.add(path + 'tables/' + table + '_owner.sql')

# Tables without user data can simply be dumped wholesale.
for table in non_owner_tables:
    if table not in exclude_tables:
        query3 = 'SELECT ' + table + '.* INTO OUTFILE "' + path + 'tables/' + table + '.sql" FROM ' + table
        cursor.execute(query3)
        files.add(path + 'tables/' + table + '.sql')
After this I still need to run a couple of custom query dumps (only 3). Mainly this is because some tables have Geometry data types, which are binary, and SELECT ... INTO OUTFILE doesn’t play well with that. So I convert those columns to ASCII representations in the exported CSV and then use a custom parser in my import script. I then gzip all the backed-up tables and use a bash script to compare them with the files I already have in the repository—cmp -i 8 /backup/file.gz /github/file.gz works the magic for comparing two gzipped files (it skips the first 8 bytes of each file, which include the gzip timestamp; just make sure your files have the same name, since gzip can embed the filename as well). Finally I dump a bare schema from my DB into the GitHub project and push it all up to the web.
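For the curious, those hand-written geometry dumps look roughly like this (the table and column names are invented for illustration; path and cursor are the same as above):

# Convert the binary GEOMETRY column to its well-known text (WKT) form
# so that it survives the round trip through a plain-text dump.
geom_query = (
    'SELECT artefact_shape_id, ST_AsText(region) '
    'INTO OUTFILE "' + path + 'tables/artefact_shape.sql" '
    'FROM artefact_shape '
    'JOIN artefact_shape_owner USING(artefact_shape_id) '
    'WHERE artefact_shape_owner.scroll_version_id < 1058'
)
cursor.execute(geom_query)

On the way back in, the import side then wraps such values in something like ST_GeomFromText.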
Getting that Data into a New Database
Now you, the unsuspecting collaborator, pull the latest database down to your development server. Which brings me to my even lazier import script. It is just simple bash, since I wasn’t sure if everyone would have the right tools for a Python script using a MySQL connector. Everyone has bash, right?
The bash script just uses the command-line mysql client. It loads up the schema, unzips all the tables, and loads each backup file into the DB. It is important to turn off foreign key checks before running the import command and to turn them back on afterwards—I assume this is not being run on a production server. Hopefully the script does not run too fast; you do need a little time to grab a coffee. If anyone is interested, the command to load the backup files looks like this:
# $host, $user, $password, $database, $cwd, and the spinner string $spin
# (e.g. spin='-\|/') are assumed to be defined earlier in the script.
for file in tables/*.sql; do
    table=${file%.sql}
    printf "\rLoading table: ${table##*/}\n"
    # Run the import in the background so we can animate a spinner while it works.
    mysql --host=${host} --user=${user} --password=${password} --local-infile ${database} -e "SET FOREIGN_KEY_CHECKS=0;
    LOAD DATA LOCAL INFILE '$cwd/$file' INTO TABLE ${table##*/};
    SET FOREIGN_KEY_CHECKS=1;" &
    pid=$!  # process ID of the background mysql command
    while kill -0 $pid 2>/dev/null; do
        i=$(( (i+1) % 4 ))
        printf "\r${spin:$i:1}"
        sleep .1
    done
done
The while loop at the end just gives you a little spinner so you know your computer really is still working. Also, it provides some low-key entertainment while you sip that coffee you just got. Thankfully, Spencer Jones worked this import script right into the Docker container setup, so it meshes seamlessly with the installation of our Docker container. What to do now that you have a copy of the database working in your development environment? Finish that coffee and get to work on some CGI scripts and queries…
Some final thoughts
If we didn’t gzip the individual table backups, we would have a fully readable versioning system for the DB on GitHub. Some folks may wish to do this if they can stay within GitHub’s file size limitations (100 MB, I think). We could not. In fact, my estimates show that one of our tables, the one with the Geometry data, will probably grow beyond 2 GB by the time we get all of our data into it. This means I will be looking for another, better solution in the not too distant future. Helpful hints and comments are most certainly welcome! You can find the whole project, messy scripts and all, here.
Most people who might be reading this blog will be aware of the new edition of the Hebrew Bible now being produced under the moniker of The Hebrew Bible: A Critical Edition—it started out as the Oxford Hebrew Bible. The main distinction of the project is its aim to produce eclectic editions of books in the Hebrew Bible, rather than the now traditional diplomatic editions of Codex Leningradensis (so largely BHS/BHL) or the Aleppo Codex (so HUB). Ron Hendel, the general editor, is to be congratulated for publishing alongside these new critical editions a monograph discussing the theoretical framework within which the editorial work is being carried out. The book has garnered quite a bit of attention since its publication and has been a topic of discussion within the scholarly community.
I have been working through the book rather closely for a while now, since I am doing a large amount of text-critical research these days. One of the difficulties in designing the purview of a project like this new edition of the Hebrew Bible is to define and defend its constraints. One of the innovations of this Hebrew Bible edition is that it works to establish one or more literary archetypes, which it presents in Hebrew script and language. This is, of course, problematized by the fact that some archetypes may currently be derived only from Greek texts (or the so-called daughter versions of the Septuagint), and thus translation (back?) into Hebrew becomes necessary. A discussion of that decision notwithstanding, I would like to focus on a different problem that Hendel rightly connects to this editorial practice of trying to represent archetypes: the usage of Tiberian pointing.
Hendel acknowledges the logical difficulty to be overcome as follows:
The copy-text will be L, our oldest complete manuscript of the Hebrew Bible. Since the accidentals of vocalization and accentuation in L are the product of medieval scribes, our critical text is open to the complaint of anachronism. (p. 31)
Hendel then goes on to lay out the following arguments in defence of this decision:
There are several ameliorating factors that lessen this dissonance. First, biblical scholars already know that the consonantal text is older than the medieval vocalization system. So a critical text with this overlay is not strange. Second, critical editions in other fields use anachronistic accidentals, including editions of Greek texts (including the New Testament, the LXX, and classical literature) that use rough breathings, accents, punctuation, and miniscule letters, all of which were scribal inventions of the Carolingian era (ninth century CE), roughly contemporary with the Tiberian Masoretes. Third, the phonology of the Tiberian vocalization system is not wholly or even mostly anachronistic. (p. 32)
I find this defence quite surprising, especially in regard to the Greek material. The Greek rough breathings, accents, and punctuation were most certainly not the product of the Carolingian era, though their ubiquitous usage in scholarly transmission was established by that point in time. Accents, breathings, and punctuation can already be found in various usages two centuries before the common era, and they appear in our earliest Greek Bible codices and in some of the papyri as well. That means, at least in the case of the Septuagint, some of those markings could possibly stem from the first penning of the translations, though I do not expect that to be so. The minuscule letters alone belong to the ninth century.
The second surprising claim is that “the Tiberian vocalization system is not wholly or even mostly anachronistic.” I don’t know by what measure we might make decisions about the similarity of one phonological system to another, but even if the books of the Hebrew Bible were written over as short a time span as a few hundred years and within a small geographical region, comparative evidence would suggest that the phonology of those individual works would have varied. Certainly the phonology imposed by the transmitters, and redactors, of these works would vary as well. Some of the spellings at Qumran betray phonologies more in line with those found in the Babylonian system. Attenuation of a -> i (maqtal -> miqtal) occurred in the Byzantine era. I could go on and on discussing the problems of reconstructing Hebrew phonology from a diachronic perspective, but the real crux is that the whole defence seems rather ad hoc.
I have no problem with anachronistically using Tiberian phonology as indicated by the accents in transcriptions. Ahituv did this in his Echoes from the Past, where we find Moabite and other NWS languages/dialects presented with Tiberian pointing. The “point” is that the Tiberian phonology in that book was being used to help readers associate the “proper” lexeme and morphological parsing with any given word, not to offer a historical phonological reconstruction such as that attempted in K. Beyer’s “Die Sprache der moabitischen Inschriften,” KUSATU 11 (2010): 5–41. But I find it untenable to claim that the phonology of the Tiberian vocalization system is quite similar to the vocalization of the archetypes of all the books in the Hebrew Bible as witnessed in the manuscript tradition. Perhaps Hendel should have stopped with his first argument in favor of the policy, a statement which, contrary to the others, is both factually correct and reasonable: we Bible scholars (should) know that the Tiberian points are a late addition to traditions that only had consonants, and we are nevertheless accustomed to editions with those points. My only quibble is that I would change “the consonantal text” in his statement to a plural “the consonantal texts,” something that will certainly be clarified by the presentation of parallel versions in the HBCE and the accompanying notes (see also the discussion on p. 29).
The really nice thing about Hendel’s book is that it provides a clearly stated rationale behind the editorial policies of the project, and relates them to the currents within the larger field. I would agree that such an enterprise is worthy of book-length treatment, which can be far more detailed than the customary introductory chapter in the editions themselves. Such an expanded discussion enables the reader to use the editions much more judiciously and at times to infer the rationale behind certain editorial decisions in the absence of explicit commentary.
A colleague of mine, Spencer Alan Jones, recently made a blog post about people in the humanities entering the world of programming. I would encourage all humanists with an interest in digital humanities to give it a read. I think it clarifies in some important ways how the two fields can overlap with respect to cultivated skill sets. Read the full post here.
In the world of ancient Near Eastern studies, it can often be difficult to find fonts that properly represent the artefacts we are working with. My experience creating fonts began with a project with Doug Gropp around ten years ago. He was writing a grammar of the targumic Aramaic of Onkelos/Jonathan and wanted to use supralinear Babylonian vocalization in his text. So I prepared a workable font for him, and have been making my own fonts ever since, whenever there was a particular script I wanted to present in my work. I have begun using these custom fonts to visualize possible textual reconstructions. Feel free to try out the fonts in my repertoire here (please consider them cc-by-sa):
If you want to try your hand at creating your own fonts, the following should serve to get you started.
Creating Fonts
Creating fonts directly from images is rather trivial, though often time consuming. There are 2–3 steps involved depending on the level of detail one wishes to achieve:
Creating an inventory of characters from images in your image processor (either Gimp or Photoshop), binarizing them, and saving them together in one graphics file (png is good for this).
Loading the file created in step 1 into a font editor (Birdfont is particularly easy to use) and vectorizing each character in its appropriate unicode codepoint.
(optional) creating a kerning table (using Font Forge or Microsoft Volt)
With the exception of Photoshop, all software mentioned here is free. With the exception of Microsoft Volt, all software will run on all platforms (Windows, Mac, Linux).
Image Processing
The process of creating good black and white images for each character can be done to the level of precision desired. Making cleaner lines around the character will enhance the vectorization process and make the font clearer and faster to render, but it is really not necessary for the work we are doing to make estimates. You will find your own happy medium between time spent and level of precision.
The following video shows my process for creating a character map. The steps are essentially as follows:
load multiple images into the program as separate layers (since they all have DPI information, they should all be imported at the correct size relative to each other)
use the magic wand tool on the grayscale image to select the character
copy the character into a new layer
make guiding marks for the hanging line (I usually see these better in the color image)
repeat steps 2–4 for every character (save your work often!)
save the full project now
delete the layers containing the scroll images
arrange all the characters into some usable order
merge layers
convert to grayscale
save as png file (do not overwrite your original psd file!)
A video of the procedure in Photoshop is available here.
I will try to make a video with the procedure in Gimp ASAP, but I am not so used to that software yet.
Font Creation
Now that you have a grayscale or binarized graphics file containing all your characters, you can convert them into a font. I have used Birdfont in this demonstration, but you could also use the more advanced Font Forge.
The following video shows the font creation process. The steps are as follows:
create a new font
import as a background the graphics file you just created:
click the three parallel lines in the upper right corner > import and export > import background image (or simply Ctrl+B)
on the bottom of the left menu select “Add”
In the window that just opened, select the file you just created, then click “open” on the upper line
resize it so it fits the preset character bounds (once you have gotten it to the correct size, do not resize it again), make sure to turn on the hanging line:
On the left black menu under “background image”, click on the circle with the crossed four arrows
Use the small black triangle on the left bottom of your background image to resize
Use the horizontal line with the arrows to rotate the letters
In order to move the background image together with the guidelines, click on the hand button (in the menu under “background image”, near the round button with the arrows)
in the overview tab, select unicode, then scroll down to aleph and double click it
go to your background tab and draw a box around the aleph (or whichever character you are currently working on), then double-click within the box
click on “add to glyph” in the left-hand panel, then select aleph
now move the image into proper position using the guidelines you made in the image, rotate the image if necessary
now binarize, trace, and wait for the processing to finish
To binarize: under the “background tool” menu, click on the square divided into black and white triangles (if you close the file and reopen it, you might need to click on that button again for each letter).
To trace: under the same menu, click on the circle that is half dotted.
In case you wish to change a letter after tracing, you should move the letter. This will create another layer with the same letter. Then delete the previous layer by clicking on the x next to the word “layer” on the left.
select the guide lines you created and hit the delete key to remove them
repeat steps 4–9 for every character
set font attributes and names
export font
A video of the procedure for converting the characters in the graphics file to vector fonts in BirdFont is available here.
A video of the procedure for setting attributes and names, and exporting the font is available here.
Kerning and Metrics
After all this, you now have a font that works reasonably well. You can stop here and use the font as is (you can do kerning manually in Photoshop and other software). But if you want to make the font even more usable, then continue on to add kerning tables. You should be able to use BirdFont for kerning, and it has a very simple interface for this, but there currently seems to be a bug with left-to-right fonts (at least on my machine). So you must use Font Forge or Volt for this.
The steps for defining kerning pairs are as follows:
open your new font in Volt
Add a script; the name must be exactly “Hebrew ”, that is, Hebrew starting with a capital H and followed by one space.
Add a feature; you should label this “Kerning ”, which lets the processor know this is a kerning table.
Now add a positioning lookup by clicking “Add Positioning”. You can name this anything you want.
Now open the positioning you just created by clicking “Edit Lookup”, set it to “Pair Adjustment” and change from LTR to RTL.
Now you can click on “Edit Glyphs” to see a list of the characters you have created. Then simply drag them into either the first or second box in the Lookup Edit window.
Drag the left hand glyph to set kerning for the pair. You can also change values manually. It is best not to change values for the first glyph, only the second.
To add more kerning pairs, just select one of the glyphs in the left-hand panes and hit enter to start a new entry.
You can create character groups for kerning, and this certainly saves some time (for instance, ז, י, and ו might constitute a thin-letter group, and then you can just kern the group instead of individual letters). But I will leave groups for further discussion.
Once you are happy with all your kerning pairs, move the “New Positioning” (or whatever name you gave it) from under “Lookups” to the left, under “Kerning ”. Then click on the “Compile” button, and then click on “Proofing Tool”, which displays a window in which you can test the font by typing in the input field and turning your kerning tables on and off. Don’t forget to switch the Text Flow to RTL (Right to Left), and mark a check next to the word “kerning” on the left.
You might not be able to see the letters in Hebrew, but do it anyway. It will make the kerning work.
Click on the play button until the blue rectangle encircles all the letters (or the little squares that are displayed instead of them).
You should save regularly, and when you are finally ready to finish the font, go to “File” and then “Ship Font”, choose a name and you are done.
A video of the procedure for adding kerning tables to your new font can be found here.