Data Data Everywhere and not a Drop to Drink

Data are the lifeblood of the computational humanities. My first serious encounter with the use of computers to benefit humanistic research was the Perseus project. It was at the turn of the new millennium and I had just begun my Classics studies at Loyola Marymount University in Los Angeles. I remember marvelling at how quickly I could look up words as I was reading Plato or the Aeneid. Not only that, Perseus seemed to have all the major works I might want to read available at my fingertips, along with Smyth’s famous grammar, the “middle” Liddell, Lewis and Short, and even the big LSJ—the only thing we need now is the OLD…or the TLL! As a poor kid in a relatively wealthy university, it was empowering to have access to all these resources, which I could never have dreamed of acquiring on my own (Sarah, my wife, did eventually buy me Smyth, and I got an ancient copy of LSJ when the Semitic library at the Catholic University of America was cleaning its stacks). Something about having these works appear on my own computer screen made me feel that I could own these monuments of western learning too, if only in the form of 1’s and 0’s flitting through the aether.

But feeling like one has ownership of those bits of data is not enough. There is often a sense that if something is on the internet, it is free for everyone to use, which is not technically correct. Licensing still applies, and this is one of the main problems plaguing computational humanities projects. With the explosion of open source everything, the world is becoming more comfortable with “open” data, and publicly funded research is thankfully beginning to be required to provide it. I see that my old mainstay the Perseus project is using a CC-BY-SA license, which is cool because it allows products like Logos to package the Perseus data into a software platform familiar to their user base. If data are the lifeblood of computational humanities, restrictive licenses, or even a failure to provide any license at all, are a death sentence. And given its etymological background, can data (think Latin do, dare) really be held back? While there are many ways to copyright data these days, the Creative Commons family of licenses provides clarity and simplicity, and would be a welcome addition to any DH project. Using such a license ensures the continued usability of a project’s hard-earned data beyond the financial, institutional, and temporal bounds of the project itself, thus integrating the work of individual projects into the wider DH collective. A project’s code can also be openly licensed, for example under the MIT license, even if making the right choice for code licensing can become complicated by the usage of libraries with possibly conflicting licenses (Apache, BSD, GPL, LGPL…)—a complication greatly simplified by tools like FOSSA.

Even when data are clearly licensed in a permissive form, that is only the first step to making those data productive within the computational humanities community. The data need to be accessible in some recognizable data format (CSV, JSON, SQL) and have some sort of understandable semantics and syntax (e.g., TEI and XML respectively). I currently use data from many sources in my professional capacity within the Scripta Qumranica Electronica project and in my own personal research. Much of my time in such endeavors is bound up in transforming the data from the format it came packaged in to a format I can compute on and one that can link with the other data sources I might be using. This may involve reading the data into objects/structs in whatever programming language I’m using (Go, Python, Perl, JavaScript, etc.) for in-memory processing, parsing it into a database (PostgreSQL, MySQL, GraphDB, Jena, etc.) for logical queries, or converting it into whatever format is required for a particular processing suite that I want to experiment with (TRACER, OpenNMT, Ugarit). This is simply inefficient and results in a lot of duplicated work of inconsistent quality. What is needed is a common data interchange format and a way to provide an ontology for the data stored in it.
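
To give a concrete (if hedged) sense of the kind of glue work I mean, here is a minimal Node.js sketch that reshapes a CSV export into JSON; the file name and the naive comma splitting (no quoted fields) are invented for illustration:

const fs = require('fs');

// Read a CSV export and reshape it into an array of JSON objects.
// Note: this naive split ignores quoted fields and escaped commas.
const [header, ...rows] = fs.readFileSync('transcriptions.csv', 'utf8')
  .trim()
  .split('\n')
  .map(line => line.split(','));

const records = rows.map(cells =>
  Object.fromEntries(header.map((col, i) => [col, cells[i]])));

fs.writeFileSync('transcriptions.json', JSON.stringify(records, null, 2));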

I guess that call for a common interchange format was a pretty overt setup for JSON-LD. But let’s not get ahead of ourselves. There are many data formats to choose from, and perhaps CSV is the most versatile since you can open it in anything, even Excel, but it can be a real pain with complex linked data. And since I am interested in simplifying the display of data as much as access to processing, CSV is a poor choice. A desire for human-readable access to primary research data has led to the popularity of XML (an acceptable, though unwieldy, format for linked data). But this confuses the matter: human readability and digital portability (in all its complexity) represent two axes, and accommodating one often comes at the expense of the other. What I would rather see is a very clear and extensible data representation that does justice to the data and provides the necessary tools to easily extract human-readable representations (along with useful forms for computation). Enter data ontology.

As a classicist, ontology is a funny-sounding word—the participle of ειμι + λογια, so the study of being? That sounds cool, so we store not only our data, but also information about what our data actually is, its state of being. This reminds me of what is partially accomplished in SQL databases: we define our data types: Integer, Float, Decimal. But what is the String “לא”? Is it a Hebrew word “no”, an Aramaic word “no”, a number “31”, something else? MySQL data types like TIMESTAMP give us more information about what they are, since we know that whatever data are stored in such a format provide a UTC designation of time; that is, we get not only the data, but also know exactly what it means: a 4-byte integer representing the number of seconds since January 1, 1970. But creating a new data type for MySQL or any other database is no easy thing. An SQL database (with InnoDB tables) allows you to specify relations very easily via the schema, as do graph databases of various sorts, but the definition of data types is very closed, thus limiting the ontological horizons. So are we trying to fit a rug into a room that is too small?

Well, the semantic web long ago developed a way to deal with this. The goal there has been to store data in a very simple format and to couple that with a mechanism to describe that data. The backbone of the semantic web, which doesn’t exactly exist yet, is the triplestore (a type of graph database) conforming to the Resource Description Framework (RDF), with a separate schema (RDFS) providing the ontology, which can be chosen from a preexisting schema or described ad hoc. So what is a triple? Person -> has a name -> name would be an example: we have a subject “person”, a predicate “has a name”, and a “name” as the object—three values, hence “triple”, in a clearly described relationship (a triplestore is simply a database of such triples). These triples can be represented in various ways, including good old plain text. So we can link anything to anything else with this triple data structure: for instance, a “work” can “have a chapter” “chapter”, a “chapter” can “have a sentence” “sentence”, and this “sentence” can “have a word” “word”, and that “word” can “be a part of speech” “noun”, and that “noun” can “have a syntactic role” “agent”, but it could also “have a semantic domain” “beverage”, or anything else. I should ask here, though: what is a word? Is it a morpheme, a lexeme, a vocable, all the letters between spaces in a text? Is it:

<line>
  <word>arma</word>
  <word>virumque</word>
  <word>cano</word>
</line>

or should it be:

<line>
  <word>arma</word>
  <word>virum</word>
  <word>que</word>
  <word>cano</word>
</line>

Or should it be something different from either of those? Having an ontology to define the meaning of each piece of data begins to help here: it makes clear what our data mean and thus makes it easier for others to know how to use them for their own purposes.
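
To make the chain of statements above concrete, here is how those triples might look as plain data in JavaScript (just a sketch; the identifiers are invented):

// Each statement is just a subject, a predicate, and an object.
const triples = [
  { subject: 'work1',     predicate: 'hasChapter',        object: 'chapter1' },
  { subject: 'chapter1',  predicate: 'hasSentence',       object: 'sentence1' },
  { subject: 'sentence1', predicate: 'hasWord',           object: 'word1' },
  { subject: 'word1',     predicate: 'isPartOfSpeech',    object: 'noun' },
  { subject: 'word1',     predicate: 'hasSyntacticRole',  object: 'agent' },
  { subject: 'word1',     predicate: 'hasSemanticDomain', object: 'beverage' },
];

// A trivial "query": find every statement made about word1.
const about = subject => triples.filter(t => t.subject === subject);
console.log(about('word1'));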

Let’s take names as an example. What if I want to store some information about people with names? I can use the common ontology of foaf for a person: “The Person class represents people. Something is a Person if it is a person. We don’t nitpic about whether they’re alive, dead, real, or imaginary. The Person class is a sub-class of the Agent class, since all people are considered ‘agents’ in FOAF.” I guess maybe I would have suggested that a “person” is, was, will be, or is imagined to be a member of the species homo sapiens (sorry to all you dog and cat lovers out there), but no matter, we will follow the FOAF ontology. So we have a person and we will provide data for that person’s name, whether a “family name” (foaf:familyName) or “first name” (foaf:givenName):

  • “The familyName property is provided (alongside givenName) for use when describing parts of people’s names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.”
  • “The givenName property is provided (alongside familyName) for use when describing parts of people’s names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.”

So, with Bronson Brown-deVost I would store “Bronson” as the “givenName” and “Brown-deVost” as “familyName”, but for someone like Kim Jong-Un, I would describe “Kim Jong” as the “familyName” and “Un” as the “givenName”. So in RDF/Turtle syntax that would be:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix bbd: <https://www.brown-devost.com/> .
@prefix bbdpeople: <http://www.brown-devost.com/people/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
bbdpeople:bronson foaf:givenName "Bronson"^^xsd:string .
bbdpeople:bronson foaf:familyName "Brown-deVost"^^xsd:string .
bbdpeople:un foaf:givenName "Un"^^xsd:string .
bbdpeople:un foaf:familyName "Kim Jong"^^xsd:string .

Even though the names are written in a different syntax, the ontology of the names is made clear, or at least described in a consistent way. This ontology certainly misses that Kim is a family name and Jong is a more specific clan name, and that Brown is a family name I acquired through marriage, and deVost or Devost is a family name I acquired through birth. But if I wanted a more fine-grained ontology for personal names, I could always create one, and these data, which are probably just UTF-8 strings, can be given a more distinctive data type and related in very specific ways to other data types (I could make “marriedName” and “clanName” subproperties, via rdfs:subPropertyOf, of the foaf:familyName property).
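
The same data could also travel as JSON-LD, that is, ordinary JSON plus a context that maps the keys onto the ontology. Here is a sketch using the jsonld.js library (https://github.com/digitalbazaar/jsonld.js); the document simply mirrors the Turtle above:

// The person data as JSON-LD: plain JSON with an @context that ties
// the keys to the FOAF ontology.
const jsonld = require('jsonld');

const doc = {
  '@context': {
    foaf: 'http://xmlns.com/foaf/0.1/',
    bbdpeople: 'http://www.brown-devost.com/people/'
  },
  '@id': 'bbdpeople:bronson',
  'foaf:givenName': 'Bronson',
  'foaf:familyName': 'Brown-deVost'
};

// Expanding resolves every key to its full IRI, so any consumer knows
// exactly which ontology terms are meant.
jsonld.expand(doc).then(expanded =>
  console.log(JSON.stringify(expanded, null, 2)));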

Moving to the broader picture, using an ontology provides a semantic schema for data that makes it much easier to know exactly what each piece of data is really intended to represent. What is more, making those data available as JSON (here JSON-LD) makes them easily accessible and consumable for a variety of purposes. There are already software solutions that can be used today for this type of data access (think SPARQL endpoints with things like Apache Jena, etc.). But let’s push further here: we have our data, and it is curated to have a consistent schema and ontology, so we know what our data mean, or are supposed to mean. What we still lack, though, is simplified accessibility, attribution, and versioning of these data. Let me tackle these one by one.

I am very interested in thinning out the data access chain. In SQE we have a long chain of data handlers. Basically, the user of our website requests data by clicking somewhere, the browser-side JavaScript data models make a request over HTTP to our Perl CGI scripts, which query the MariaDB database, which returns the result to the Perl CGI scripts, which send those data back to the user’s browser, which updates the JavaScript data model, which the Vue.js template then automatically reacts to and changes the display the user sees on the screen. We are even thinking of throwing in another middleman like socket.io for real-time data synchronization. These are tried and true technologies, but I am still not thrilled about how many “hands” the data need to pass through to go between the user and the single “source of truth” (those database files on the SQE server’s disk), nor about the various transformations made along the way to the user’s browser. It is a very difficult juggling act to maintain that single “source of truth” as the data are passed back and forth between the browser and the database. One user could change some datum, and a fraction of a second later another user could try to change that same datum. The first user would receive back an all-clear that their write was successful, and not learn until the next complete browser reload that the successfully written data are already out of date. And the second user would perhaps never learn of the first user’s write. I would love to have the browser interact more directly with the data, but a number of issues, most importantly security, stand in the way. No developer in her right mind would expose a database to the world wide web; we simply can’t let everyone have direct access to the datastore…or can we?

The issue of attribution, and indeed ownership, of data is a real problem. Most systems are based on trust: I create data attached to my user credentials and I trust that no one will remove it. The list of people with admin access to a database, and hence the ability to alter my data, is often small. Nonetheless, part of a robust system is not relying on trust but rather engineering security from the ground up. That is why we want the systems that handle our finances and our personal data to be open source, and thus open to public scrutiny. Why should our academic data be any different? Asymmetric encryption works with a pair of keys, one public and one private. These keys are used to alter data, one basically turning it into a mess of unintelligible characters, and the other restoring it to its original form. Basically, data locked with a private key can only be opened with the corresponding public key, which also allows the reader to verify the originator of the content. People can also encrypt data using a public key, which can then only be unlocked by the owner of the private key (we assume people know better than to give out their private key). This gives only the owner of the keypair full access to her or his data, since only that person can unlock messages in both directions. Thus private/public key encryption provides two benefits for shared data: it places control of data access back in the hands of the content originator (if you lock something with your public key, only you can open it, since that requires the corresponding private key), and it allows public verification of the content originator. Let’s try a little example slightly modified from the one jsencrypt provides in their readme:


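Something along these lines (a sketch only; the key strings below are placeholders for real PEM-encoded keys, and in a browser JSEncrypt comes from a script include rather than require):

// Encrypt with the recipient's public key...
const { JSEncrypt } = require('jsencrypt');

const publicKeyPem = '-----BEGIN PUBLIC KEY----- ... -----END PUBLIC KEY-----';
const privateKeyPem = '-----BEGIN RSA PRIVATE KEY----- ... -----END RSA PRIVATE KEY-----';

const encryptor = new JSEncrypt();
encryptor.setPublicKey(publicKeyPem);
const encrypted = encryptor.encrypt('Some hard-won research data');

// ...and decrypt with the matching private key.
const decryptor = new JSEncrypt();
decryptor.setPrivateKey(privateKeyPem);
const original = decryptor.decrypt(encrypted);

console.log(original);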

Using encryption keys like these to govern data ownership would be especially useful in a database system with uniqueness constraints, so that when someone tries to create content that already exists, the uniqueness constraint ensures that the first originator is always credited—Ingo Kottsieper has already integrated a different method for achieving this in the SQL database for the SQE project. Such constraints become complex with atomistic linked data, but should be possible to a large extent, especially with robust ontological schemas.

Along with this true ownership of data comes the desire to update it. As humans, we do not have a permanent state; we change in relation to the stimuli around us, and sometimes this means changing our opinions and perceptions—in programming this is mutability. In programming, mutability is a big problem because most programs rely on a consistent state. Changing data in a program is not necessarily the problem; the issue is making sure every aspect of the program shares the same state at the same time. Managing this can become difficult when data are shared across many parts of a local program, across an intranet, or especially across the world wide web. What happens when someone is viewing some data locally and it gets changed on the server? Or what happens when someone loses connectivity to the network and wants to change some data they are already working with? RDBMSs have long incorporated complex routines to synchronize data between query and execution, as well as between master and slave database servers, in closed systems. But we probably also want access to the history of changes, something BigchainDB is all about.

We are now seeing the entrance of new technologies to deal with sharing the state of data across the internet in real time, such as IPFS and various other solutions built on WebSockets (even high-level frameworks like Meteor.js and Django support real-time updates). One neat implementation I have been playing with is Gun.js, which is based on the concept of eventual consistency and cryptographic permissions. Basically, data can be shared between any number of clients (ok, server memory limitations do apply, but you can always add more servers). Each user’s web app is able to store data locally when the WebSocket connection is not available and updates the server when a connection is established again—thus, eventually consistent. Gun.js will choose the highest update value (each time a datum, or “node”, is updated it increments an update counter), and in a race condition (two or more users trying to update the same datum at the same time), the lexically larger value wins. Hey, you have to decide somehow. Supposedly the whole update history of each datum is retained, so a version history can be accessed.

If you want to try this out, change the color in the dropdown box below and watch the background color of this paragraph change. Perhaps it has already been changing as you were reading this, since whenever anyone anywhere changes this value, it updates for everyone everywhere. Go ahead, call up your friend in Australia and start a color battle! My son and I did a little test sending the update back and forth from Germany to Australia a couple times and experienced real-time synchronisation of a second or less!
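
Under the hood, that little demo boils down to a few lines of Gun.js; this is a hedged sketch, and the relay peer URL and key names are invented for illustration:

// Connect to any shared Gun relay peer (in the browser, Gun comes from a
// script include rather than require).
const Gun = require('gun');
const gun = Gun(['https://some-relay.example/gun']);

// Everyone reads and writes the same node.
const bg = gun.get('color-demo').get('background');

// What the dropdown does: write the new value...
bg.put('cornflowerblue');

// ...and what the paragraph does: react whenever anyone, anywhere, changes it.
bg.on(color => {
  console.log('background is now', color);
  // e.g. document.getElementById('demo').style.background = color;
});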

Or perhaps we could do something graph-like that looks a bit more similar to a triplestore. Try typing a relationship in the following fields:
[Interactive demo: input fields for Subject, Predicate, and Object; a query form of the shape “Find every ___ of ___”; and a results area that initially reads “No search run yet.”]

You can always try a new triple of your own: just pop in any subject, predicate, and object. You can add as many objects to the same subject and predicate as you want, and they will magically appear in your search results. Or delete a triple and watch it disappear. In fact, it is never deleted; the node is just given a value of null, or “tombstoned”, to use the technical term. We could always revive the data later or assign it new values.
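
The triple demo works along the same lines; here is a hedged sketch (simplified to a single object per subject and predicate, with invented names) showing both the write and the null “tombstone”:

const Gun = require('gun');
const gun = Gun();

// Store the triple "sentence1 hasWord arma" under its subject and predicate.
gun.get('sentence1').get('hasWord').put('arma');

// Query: watch every "hasWord" of "sentence1" and log it whenever it changes.
gun.get('sentence1').get('hasWord').on(value => console.log('hasWord is now', value));

// "Delete" the triple: the node is not removed, it is just tombstoned with null...
gun.get('sentence1').get('hasWord').put(null);

// ...so it can always be revived later with a new value.
gun.get('sentence1').get('hasWord').put('arma virumque');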

In the end, what I am looking for is a distributed, real-time, semantic, versioned, licensed web of knowledge. That is, my dream system, if it could be cobbled together, would be to have a data set turned into a “knowledge base” via an ontological schema, licensed in a clear way, encoded via the private key of the person who generated it (anyone with that person’s public key can then access it), served dynamically from a truly distributed data store (not just master/slave), with a clear versioning record (and the versioning should maintain compatibility with the licensing). It is a lot to ask, but with technologies like WebSockets, RDF (and JSON-LD), blockchain, asymmetric cryptography, and Creative Commons, along with implementations like IPFS (with its OrbitDB), Gun.js, and BigchainDB, most of the pieces are already here. The Wolfram Data Repository really nails the semantic knowledge issue along with proper attribution and clear licensing, and Knora is an impressive and similarly poised project using standardized RDF/RDFS/OWL for its knowledge base (though neither appears to be truly distributed or real-time). Gun.js is also very close, really lacking only the semantic component and maybe simplified searching possibilities. It seems to me that we are now very close to a robust solution; it’s just a matter of time…

A Thought on Steps to a New Edition of the Hebrew Bible

Most people who might be reading this blog will be aware of the new edition of the Hebrew Bible now being produced under the moniker of The Hebrew Bible: A Critical Edition—it started out as the Oxford Hebrew Bible. The main distinction of the project is its aim to produce eclectic editions of books in the Hebrew Bible, rather than the now traditional diplomatic editions of Codex Leningradensis (so largely BHS/BHL) or the Aleppo Codex (so HUB). Ron Hendel, the general editor, is to be congratulated for publishing alongside these new critical editions a monograph discussing the theoretical framework within which the editorial work is being carried out. The book has garnered quite a bit of attention since its publication and has been a topic of discussion within the scholarly community.

I have been working through the book rather closely for a while now, since I am doing a large amount of text-critical research these days. One of the difficulties in designing the purview of a project like this new edition of the Hebrew Bible is to define and defend its constraints. One of the innovations of this Hebrew Bible edition is that it works to establish one or more literary archetypes which it presents in Hebrew script and language. This is, of course, problematized by the fact that some archetypes may currently be derived only from Greek texts (or the so-called daughter versions of the Septuagint), and thus translation (back?) into Hebrew becomes necessary. A discussion of that decision notwithstanding, I would like to focus on a different problem that Hendel rightly connects to this editorial practice of trying to represent archetypes: the usage of Tiberian pointing.

Hendel acknowledges the logical difficulty to be overcome as follows:

The copy-text will be L, our oldest complete manuscript of the Hebrew Bible. Since the accidentals of vocalization and accentuation in L are the product of medieval scribes, our critical text is open to the complaint of anachronism. (p. 31)

Hendel then goes on to lay out the following arguments in defence of this decision:

There are several ameliorating factors that lessen this dissonance. First, biblical scholars already know that the consonantal text is older than the medieval vocalization system. So a critical text with this overlay is not strange. Second, critical editions in other fields use anachronistic accidentals, including editions of Greek texts (including the New Testament, the LXX, and classical literature) that use rough breathings, accents, punctuation, and miniscule letters, all of which were scribal inventions of the Carolingian era (ninth century CE), roughly contemporary with the Tiberian Masoretes. Third, the phonology of the Tiberian vocalization system is not wholly or even mostly anachronistic. (p. 32)

I find this defence quite surprising, especially in regard to the Greek material. The Greek rough breathings, accents, and punctuation were most certainly not the product of the Carolingian era, though their ubiquitous usage in scholarly transmission is established at that point in time. The accents, breathings, and punctuation can already be found in various usages two centuries before the common era, and they appear in our earliest Greek Bible codices and some of the papyri as well. That means, at least in the case of the Septuagint, some of those markings could possibly stem from the first penning of the translations, though I do not expect that to be so. The minuscule letters alone belong to the 9th century.

The second surprising claim is that “the Tiberian vocalization system is not wholly or even mostly anachronistic.” I don’t know by what measure we might make decisions about the similarity of one phonological system to another, but even if the books of the Hebrew Bible were written over as short a time span as a few hundred years and within a small geographical region, comparative evidence would suggest that the phonology of those individual works would have varied. Certainly the phonology imposed by the transmitters, and redactors, of these works would vary as well. Some of the spellings at Qumran betray phonologies more in line with those found in the Babylonian system. Attenuation of a -> i (maqtal -> miqtal) occurred in the Byzantine era. I could go on and on discussing the problems of reconstructing Hebrew phonology from a diachronic perspective, but the real crux is that the whole defence seems rather ad hoc.

I have no problem with anachronistically using Tiberian phonology as indicated by the accents in transcriptions. Ahituv did this in his Echoes from the Past, where we find Moabite and other NWS languages/dialects presented with Tiberian pointing. The “point” is that the Tiberian phonology in that book was being used to help readers associate the “proper” lexeme and morphological parsing with any given word, not to offer some historical phonological reconstruction such as that attempted in K. Beyer’s “Die Sprache der moabitischen Inschriften,” KUSATU 11 (2010): 5–41. But I find it untenable to claim that the phonology of the Tiberian vocalization system is quite similar to the vocalization of the archetypes of all the books in the Hebrew Bible as witnessed in the manuscript tradition. Perhaps Hendel should have stopped with his first argument in favor of the policy, a statement which, contrary to the others, is both factually correct and reasonable: we Bible scholars (should) know that the Tiberian points are a late addition to traditions that only had consonants, and we are nevertheless accustomed to editions with those points. My only quibble is that I would change “the consonantal text” in his statement to a plural “the consonantal texts,” something that will certainly be clarified by the presentation of parallel versions in the HBCE and the accompanying notes (see also the discussion on p. 29).

The really nice thing about Hendel’s book is that it provides a clearly stated rationale behind the editorial policies of the project, and relates them to the currents within the larger field. I would agree that such an enterprise is worthy of book-length treatment, which can be far more detailed than the customary introductory chapter in the editions themselves. Such an expanded discussion enables the reader to use the editions much more judiciously and at times to infer the rationale behind certain editorial decisions in the absence of explicit commentary.
