Data Data Everywhere and not a Drop to Drink

Data are the lifeblood of the computational humanities. My first serious encounter with the usage of computers to benefit humanistic research was the Perseus project. It was at the turn of the new millennium and I had just begun my Classics studies at Loyola Marymount University in Los Angeles. I remember marvelling at how quickly I could look up words as I was reading Plato or the Aeneid. Not only that, Perseus seemed to have all the major works I might want to read available at my fingertips, along with Smyth’s famous grammar, the “middle” Liddell, Lewis and Short, and even the big LSJ—the only thing we need now is the OLD…or the TLL! As a poor kid in a relatively wealthy university, it was empowering to have access to all these resources which I could never dream of acquiring on my own (Sarah, my wife, did eventually buy me Smyth and I got an ancient copy of LSJ when the Semitic library at the Catholic University of America was cleaning its stacks). Something about having these works appearing on my own computer screen made me feel that I could own these monuments of western learning too, if only in the form of 1’s and 0’s flitting through the aether.

But feeling like one has ownership of those bits of data is not enough. There is often a sense that if something is on the internet, it is free for everyone to use, which is not technically correct. Licensing still applies, and this is one of the main problems plaguing computational humanities projects. With the explosion of open source everything, the world is becoming more comfortable with “open” data, and publicly funded research is thankfully beginning to be required to provide it. I see that my old mainstay the Perseus project is using a CC-BY-SA license, which is cool because it allows products like Logos to package the Perseus data into a software platform familiar to their user base. If data are the lifeblood of computational humanities, then restrictive licenses, and even a failure to provide any license at all, are a death sentence. And given its etymological background, can data (think Latin do, dare) really be held back? While there are many ways to copyright data these days, the Creative Commons family of licenses provides clarity and simplicity, and would be a welcome addition to any DH project. Using such a license ensures the continued usability of a project’s hard-earned data beyond the financial, institutional, and temporal bounds of the project itself, thus integrating the work of individual projects into the wider DH collective. A project’s code can also be openly licensed, for instance under the MIT license, even if making the right choice for code licensing can become complicated by the usage of libraries with possibly conflicting licenses (Apache, BSD, GPL, LGPL…), a complication greatly simplified by tools like FOSSA.

Even when data are clearly licensed in a permissive form, that is only the first step to making those data productive within the computational humanities community. The data need to be accessible in some recognizable data format (CSV, JSON, SQL) and have some sort of understandable semantics and syntax (e.g., TEI and XML respectively). I currently use data from many sources in my professional capacity within the Scripta Qumranica Electronica project and in my own personal research. Much of my time in such endeavors is bound up in transforming the data from the format it came packaged in to a format I can compute it in and a format that can link with the other data sources I might be using. This may involve reading the data into objects/structs in whatever programming language I’m using (Go, Python, Perl, JavaScript, etc.) for in-memory processing, parsing it into a database (PostgreSQL, MySQL, GraphDB, Jena, etc.) for logical queries, or converting it into whatever format is required for a particular processing suite that I want to experiment with (TRACER, OpenNMT, Ugarit). This is simply inefficient and results in a lot of duplicated work of inconsistent quality. What is needed is a common data interchange format and a way to provide an ontology for the data stored in it.
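As a small illustration of the reformatting work described above, here is a minimal Python sketch that reads CSV into typed objects and re-serializes them as JSON. The column names (`id`, `lemma`, `language`) and the sample rows are invented for the example; real sources rarely line up this neatly.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class Word:
    """One row of a hypothetical word list."""
    id: int
    lemma: str
    language: str


# Invented sample data standing in for a file delivered as CSV.
RAW_CSV = """id,lemma,language
1,logos,grc
2,verbum,lat
"""


def load_words(text: str) -> list[Word]:
    """Parse CSV text into Word objects, converting the id column to int."""
    reader = csv.DictReader(io.StringIO(text))
    return [Word(int(row["id"]), row["lemma"], row["language"]) for row in reader]


words = load_words(RAW_CSV)
as_json = json.dumps([asdict(w) for w in words], ensure_ascii=False)
print(as_json)
```

Every additional source format means another loader like `load_words`, which is exactly the duplicated effort a shared interchange format would remove.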

I guess that last sentence was a pretty overt setup for JSON-LD. But let’s not get ahead of ourselves. There are many data formats to choose from, and perhaps CSV is the most versatile since you can open it in anything, even Excel, but it can be a real pain with complex linked data. And since I am interested in simplifying the display of data as much as access to processing, CSV is a poor choice. A desire for human readable access to primary research data has led to the popularity of XML (an acceptable, though unwieldy, format for linked data). But this confuses the matter: human readability and digital portability (in all its complexity) represent two axes, and accommodating one often comes at the expense of the other. What I would rather see is a very clear and extensible data representation that does justice to the data and provides the necessary tools to easily extract human readable representations (along with useful forms for computation). Enter data ontology.

As a classicist, I find ontology a funny sounding word: the participle of ειμι + λογια, so the study of being? That sounds cool, so we store not only our data, but also information about what our data actually is, its state of being. This reminds me of what is partially accomplished in SQL databases: we define our data types, Integer, Float, Decimal. But what is the String “לא”? Is it a Hebrew word “no”, an Aramaic word “no”, a number “31”, something else? MySQL data types like TIMESTAMP give us more information about what they are, since we know that whatever data are stored in such a format provide a UTC designation of time; that is, we get not only the data, but also know exactly what it means, a 4 byte integer representing the number of seconds since January 1, 1970. But creating a new data type for MySQL or any other database is no easy thing. An SQL database (with InnoDB tables) allows you to specify relations very easily via the schema, as do graph databases of various sorts, but the definition of data type is very closed, thus limiting the ontological horizons. So are we trying to fit a rug into a room that is too small?
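To make the ambiguity concrete, here is a small Python sketch in which the same string is paired with an explicit statement of what it is. The `Annotated` class and its `kind` labels are made up for illustration; the only built-in knowledge is the traditional numeric value of the Hebrew letters.

```python
from dataclasses import dataclass

# Traditional numeric values of the 22 Hebrew letters (final forms omitted).
LETTER_VALUES = {
    "א": 1, "ב": 2, "ג": 3, "ד": 4, "ה": 5, "ו": 6, "ז": 7, "ח": 8,
    "ט": 9, "י": 10, "כ": 20, "ל": 30, "מ": 40, "נ": 50, "ס": 60,
    "ע": 70, "פ": 80, "צ": 90, "ק": 100, "ר": 200, "ש": 300, "ת": 400,
}


@dataclass(frozen=True)
class Annotated:
    """A raw string plus a label saying how it should be interpreted."""
    text: str
    kind: str  # hypothetical labels, not taken from any real vocabulary


def numeric_value(text: str) -> int:
    """Sum the letter values, as when Hebrew letters are used as numerals."""
    return sum(LETTER_VALUES[ch] for ch in text)


raw = "לא"
readings = [
    Annotated(raw, "hebrew-word"),
    Annotated(raw, "aramaic-word"),
    Annotated(raw, "hebrew-numeral"),
]

for reading in readings:
    if reading.kind == "hebrew-numeral":
        print(reading.kind, numeric_value(reading.text))
    else:
        print(reading.kind, reading.text)
```

The bytes are identical in all three cases; only the attached label tells a consumer whether to translate the string or add it up.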

Well, the semantic web has long since developed a way to deal with this. The goal there has been to store data in a very simple format and to couple that with a mechanism to describe that data. The backbone of the semantic web, which doesn’t exactly exist yet, is the triplestore (a type of graph database) conforming to the Resource Description Framework (RDF), with a separate schema (RDFS) providing the ontology, which can be chosen from a preexisting schema or described ad hoc. So what is a triplestore? It is simply a store of triples. Person -> has a name -> name would be an example of a triple: we have a subject “person”, a predicate “has a name”, and a “name” is the object, so three values, hence “triple”, in a clearly described relationship. These triples can be represented in various ways, including good old plain text. So we can link anything to anything else with this data structure, for instance a “work” can “have a chapter” “chapter”, a “chapter” can “have a sentence” “sentence”, and this “sentence” can “have a word” “word”, and that “word” can “be a part of speech” “noun”, and that “noun” can “have a syntactic role” “agent”, but it could also “have a semantic domain” “beverage”, or anything else. I should ask here, though, what is a word: is it a morpheme, a lexeme, a vocable, all letters between spaces in text? Is it:


or should it be:


Or should it be something different from either of those? Having an ontology that defines the meaning of each piece of data helps here: it makes clear what our data mean and thus makes it easier for others to know how to use them for their own purposes.
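The chain of statements about works, chapters, sentences, and words can be sketched in a few lines of Python, with each triple as a plain tuple and a query as a pattern in which `None` acts as a wildcard. The identifiers (`work1`, `hasChapter`, and so on) are invented for the example and do not come from any published vocabulary, and the example loosely attaches all the word-level properties to one word.

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]

TRIPLES: set[Triple] = {
    ("work1", "hasChapter", "chapter1"),
    ("chapter1", "hasSentence", "sentence1"),
    ("sentence1", "hasWord", "word1"),
    ("word1", "isPartOfSpeech", "noun"),
    ("word1", "hasSyntacticRole", "agent"),
    ("word1", "hasSemanticDomain", "beverage"),
}


def match(triples: set[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that agrees with the non-None parts of the pattern."""
    for s, p, o in triples:
        if subject in (None, s) and predicate in (None, p) and obj in (None, o):
            yield (s, p, o)


# Everything that is asserted about word1, sorted for stable output.
about_word1 = sorted(match(TRIPLES, subject="word1"))
print(about_word1)
```

Nothing in the tuples themselves says what `word1` is; that is the job of the ontology attached to the predicates.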

Let’s take names as an example. What if I want to store some information about people with names? I can use the common FOAF ontology for a person: “The Person class represents people. Something is a Person if it is a person. We don’t nitpic about whether they’re alive, dead, real, or imaginary. The Person class is a sub-class of the Agent class, since all people are considered ‘agents’ in FOAF.” I guess maybe I would have suggested that a “person” is, was, will be, or is imagined to be a member of the species Homo sapiens (sorry to all you dog and cat lovers out there), but no matter, we will follow the FOAF ontology. So we have a person and we will provide data for that person’s name, whether a “family name” (foaf:familyName) or “first name” (foaf:givenName):

  • “The familyName property is provided (alongside givenName) for use when describing parts of people’s names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.”
  • “The givenName property is provided (alongside familyName) for use when describing parts of people’s names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.”

So, with Bronson Brown-deVost I would store “Bronson” as the “givenName” and “Brown-deVost” as “familyName”, but for someone like Kim Jong-Un, whose family name comes first, I would describe “Kim” as the “familyName” and “Jong-Un” as the “givenName”. So in RDF/Turtle syntax that would be:

@prefix xsd: <> .
@prefix bbd: <> .
@prefix bbdpeople: <> .
@prefix foaf: <> .
bbdpeople:bronson foaf:givenName "Bronson"^^xsd:string .
bbdpeople:bronson foaf:familyName "Brown-deVost"^^xsd:string .
bbdpeople:un foaf:givenName "Jong-Un"^^xsd:string .
bbdpeople:un foaf:familyName "Kim"^^xsd:string .

Even though the names are written in a different order, the ontology of the names is made clear, or at least described in a consistent way. This ontology certainly misses finer distinctions: that the Jong in Jong-Un is a syllable he shares with his brothers’ given names, that Brown is a family name I acquired through marriage, and that deVost or Devost is a family name I acquired through birth. But if I wanted a more fine-grained ontology for personal names, I could always create one, and these data, which are probably just UTF-8 strings, can be given a more distinctive data type and related in very specific ways to other data types (I could make “marriedName” and “birthName” subproperties of the foaf:familyName property).
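The same kind of statement can be written as JSON-LD, where an `@context` maps short keys onto the FOAF property IRIs. The sketch below builds such a document with Python's standard `json` module; the FOAF namespace is the published one, while the `http://example.org/people/bronson` identifier is a placeholder invented for the example.

```python
import json

FOAF = "http://xmlns.com/foaf/0.1/"

person = {
    "@context": {
        "givenName": FOAF + "givenName",
        "familyName": FOAF + "familyName",
    },
    # Placeholder identifier, not the author's real namespace.
    "@id": "http://example.org/people/bronson",
    "givenName": "Bronson",
    "familyName": "Brown-deVost",
}


def expand_keys(doc: dict) -> dict:
    """Replace each short key with the IRI given in @context.

    This is only a toy version of JSON-LD expansion for flat documents;
    a real processor handles far more cases.
    """
    context = doc["@context"]
    return {context.get(key, key): value
            for key, value in doc.items() if key != "@context"}


print(json.dumps(person, indent=2))
print(expand_keys(person)[FOAF + "givenName"])
```

To an ordinary JSON consumer this is just an object with two name fields; a consumer that understands the context can recover the full FOAF properties.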

Moving to the broader picture, using an ontology provides a semantic schema for data that makes it much easier to know exactly what each piece of data is really intended to represent. What is more, making that data available as JSON (here JSON-LD) makes it easily accessible and consumable for a variety of purposes. There are already software solutions that can be used today for this type of data access (think SPARQL endpoints with things like Apache Jena, etc.). But let’s push further here: we have our data, and it is curated to have a consistent schema and ontology, so we know what our data means, or is supposed to mean. What we still lack, though, is simplified accessibility, attribution, and versioning of these data. Let me tackle these one by one.

I am very interested in thinning out the data access chain. In SQE we have a long chain of data handlers. Basically, the user of our website requests data by clicking somewhere, the browser-side JavaScript data models make a request over HTTP to our Perl CGI scripts, which query the MariaDB database, which returns the result to the Perl CGI scripts, which send those data back to the user’s browser, which updates the JavaScript data model, which the Vue.js template then automatically reacts to, changing the display the user sees on the screen. We are even thinking of throwing in another middleman for realtime data synchronization. These are tried and true technologies, but I still am not thrilled about how many “hands” the data need to pass through to go between the user and the single “source of truth” (those database files on the SQE server’s disk), nor about the various transformations made along the way to the user’s browser. It is a very difficult juggling act to maintain that single “source of truth” as the data are passed back and forth between the browser and the database. One user could change some datum and a fraction of a second later another user could try to change that same datum. The first user would receive back an all clear that their write was successful, and not learn until the next complete browser reload that the successfully written data is already out of date. And the second user would perhaps never learn of the first user’s write. I would love to have the browser interact more directly with the data, but a number of issues, most importantly security, stand in the way of this. No developer in her right mind would expose a database to the world wide web; we simply can’t let everyone have direct access to the datastore…or can we?
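The lost-update scenario just described is often handled with a version check, sometimes called optimistic concurrency: every write must name the version it was based on, and a stale write is refused rather than silently accepted. The sketch below is a generic in-memory illustration with invented names, not a description of how the SQE backend works.

```python
class StaleWrite(Exception):
    """Raised when a write is based on an outdated version of the datum."""


class VersionedStore:
    """A minimal in-memory store that keeps a version number per key."""

    def __init__(self) -> None:
        self._data: dict[str, tuple[int, str]] = {}

    def read(self, key: str) -> tuple[int, str]:
        """Return (version, value); unknown keys start at version 0."""
        return self._data.get(key, (0, ""))

    def write(self, key: str, value: str, based_on: int) -> int:
        """Store value only if based_on matches the current version."""
        current_version, _ = self.read(key)
        if based_on != current_version:
            raise StaleWrite(f"{key}: expected v{current_version}, got v{based_on}")
        self._data[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
version, _ = store.read("sign-42")                    # both users read version 0
store.write("sign-42", "alef", based_on=version)      # the first user succeeds
try:
    store.write("sign-42", "ayin", based_on=version)  # the second user is refused
except StaleWrite as err:
    print("rejected:", err)
```

With a check like this the second user at least learns immediately that someone else got there first, instead of overwriting the change unnoticed.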

The issue of attribution, and indeed ownership, of data is a real problem. Most systems are based on trust: I create data attached to my user credentials and I trust that no one will remove it. The list of people with admin access to a database, and hence the ability to alter my data, is often small. Nonetheless, part of a robust system is not relying on trust but rather engineering security from the ground up. That is why we want the systems that handle our finances and our personal data to be open source, and thus open to public scrutiny. Why should our academic data be any different? Asymmetric encryption works with a pair of keys, one public and one private. These keys are used to alter data, one basically turning it into a mess of unintelligible characters, and the other restoring it to its original form. Basically, when data is locked (signed) with a private key, anyone with the corresponding public key can open it, and this also allows them to verify the originator of the content. People can also encrypt data using a public key, and the result can only be unlocked by the owner of the private key (we assume people know better than to give out their private key). This gives only the owner of the keypair full access to her or his data, since only that person can unlock messages in both directions. Thus private/public key encryption provides two benefits for shared data: it places control of data access back into the hands of the content originator (if you lock something with your public key, only you can access it, since that requires the corresponding private key), and it provides public verification of the content originator. Let’s try a little example slightly modified from the one jsencrypt provides in their readme:

The message hasn’t been encrypted yet.

The message hasn’t been unencrypted yet.
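To see the two directions of a keypair without any library, here is a deliberately tiny RSA sketch in Python. The primes are far too small to be secure and there is no padding or hashing, so this is a classroom toy rather than anything resembling jsencrypt's implementation; it only shows that what one key locks, the other unlocks.

```python
# Toy RSA with textbook-sized numbers: insecure, for illustration only.
P, Q = 61, 53
N = P * Q                      # public modulus
PHI = (P - 1) * (Q - 1)
E = 17                         # public exponent, coprime with PHI
D = pow(E, -1, PHI)            # private exponent (requires Python 3.8+)


def apply_key(numbers: list[int], exponent: int) -> list[int]:
    """Raise each number to the given exponent modulo N."""
    return [pow(n, exponent, N) for n in numbers]


message = [ord(ch) for ch in "data"]   # each code point must be below N

# Direction 1: lock with the public key, unlock with the private key.
ciphertext = apply_key(message, E)
decrypted = "".join(chr(n) for n in apply_key(ciphertext, D))

# Direction 2: lock with the private key (a bare-bones signature),
# unlock with the public key to check who produced it.
signature = apply_key(message, D)
verified = apply_key(signature, E) == message

print(decrypted, verified)
```

Real systems sign a hash of the content instead of the content itself and use keys thousands of bits long, but the symmetry between the two exponents is the same.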

Using encryption keys like these to govern data ownership would be especially useful in a database system with uniqueness constraints, so when someone tries to create content that already exists, the uniqueness constraint ensures that the first originator is always credited. (Ingo Kottsieper has already integrated a different method for achieving this in the SQL database for the SQE project.) Such constraints become complex with atomistic linked data, but should be possible to a large extent, especially with robust ontological schemas.
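The first-originator idea can be sketched with SQLite's `UNIQUE` constraint from Python's standard library. The table and column names are invented here and are unrelated to the actual SQE schema; the point is only that a second attempt to create the same content leaves the original attribution in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reading ("
    " content TEXT NOT NULL UNIQUE,"   # the same content may exist only once
    " originator TEXT NOT NULL)"
)


def contribute(content: str, user: str) -> str:
    """Try to add content for user and return whoever is credited with it."""
    # INSERT OR IGNORE skips the row when the UNIQUE constraint would fail.
    conn.execute(
        "INSERT OR IGNORE INTO reading (content, originator) VALUES (?, ?)",
        (content, user),
    )
    row = conn.execute(
        "SELECT originator FROM reading WHERE content = ?", (content,)
    ).fetchone()
    return row[0]


print(contribute("a new transcription", "first_user"))
print(contribute("a new transcription", "second_user"))
```

In a keypair-based system the `originator` column could hold a public key or a signature instead of a user name, so that the credit can be checked by anyone.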

Along with this true ownership of data comes the desire to update it. As humans, we do not have a permanent state; we change in relation to the stimuli around us, and sometimes this means changing our opinions and perceptions. In programming this is mutability, and mutability is a big problem because most programs rely on a consistent state. Changing data in a program is not necessarily the problem; the issue is making sure every aspect of the program shares the same state at the same time. Managing this can become difficult when data is shared across many parts of a local program, across an intranet, or especially across the world wide web. What happens when someone is viewing some data locally and it gets changed on the server? Or what happens when someone loses connectivity to the network and wants to change some data they are already working with? RDBMSs have long incorporated complex routines to synchronize data between query and execution as well as between master and slave database servers in closed systems. But we probably also want access to the history of changes, something BigchainDB is all about.

We are now seeing the entrance of new technologies to deal with sharing the state of data across the internet in real time, such as IPFS and various other solutions built on websockets (even high level frameworks like Meteor.js and Django support real-time updates). One neat implementation I have been playing with is Gun.js, which is based on the concept of eventual consistency and cryptographic permissions. Basically data can be shared between any number of clients (ok, server memory limitations do apply, but you can always add more servers). Each user’s web app is able to store data locally when the websocket connection is not available and updates the server when a connection is established again, and is thus eventually consistent. Gun.js will choose the highest update value (each time a datum, or “node”, is updated it increments an update counter), and in a race condition (two or more users are trying to update the same datum at the same time), the one with the lexically greater value wins. Hey, you have to decide somehow. Supposedly the whole update history of each datum is accessible, so a version history can be accessed.
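The conflict rule described above can be written down in a few lines. This Python sketch follows the prose (the higher update counter wins, and ties are broken by a lexical comparison of the values) and is not Gun.js's actual code; the `Versioned` type and `merge` function are names invented for the example.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Versioned:
    """A value together with the counter of the update that produced it."""
    counter: int
    value: Any


def merge(current: Versioned, incoming: Versioned) -> Versioned:
    """Pick a winner deterministically so that every peer agrees."""
    if incoming.counter != current.counter:
        return max(current, incoming, key=lambda v: v.counter)
    # Same counter: fall back to comparing the JSON text of the values.
    return max(current, incoming, key=lambda v: json.dumps(v.value))


local = Versioned(3, "red")
remote = Versioned(3, "blue")

# The order in which updates arrive does not change the outcome.
print(merge(local, remote).value, merge(remote, local).value)
```

The tie-break is arbitrary, but because it depends only on the data and not on arrival order, every client ends up with the same value without asking a central server.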

If you want to try this out, change the color in the dropdown box below and watch the background color of this paragraph change. Perhaps it has already been changing as you were reading this, since whenever anyone anywhere changes this value, it updates for everyone everywhere. Go ahead, call up your friend in Australia and start a color battle! My son and I did a little test sending the update back and forth from Germany to Australia a couple times and experienced real-time synchronisation of a second or less!

Or perhaps we could do something graph like that looks a bit more similar to a triplestore. Try typing a relationship in the following fields:



Now let’s make a little query:
Find every

And our results are:

No search run yet.

You can always try a new triple of your own: just pop in any subject, predicate, and object. You can add as many objects to the same subject and predicate as you want, and they will magically appear in your search results. Or delete a triple and watch it disappear. In fact, it is never deleted; the node is just given a value of null, or “tombstoned” to use the technical term. We could always revive the data later or assign it new values.
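Tombstoning can be sketched as follows: deleting a triple sets its value to `None` instead of removing the entry, queries skip tombstoned entries, and writing the triple again revives it. This is a plain Python dictionary with invented names, meant to illustrate the idea rather than Gun.js's storage format.

```python
from typing import Optional

Key = tuple[str, str, str]   # (subject, predicate, object)


class TombstoneStore:
    """Triples are never removed; a deleted one is marked with None."""

    def __init__(self) -> None:
        self._nodes: dict[Key, Optional[bool]] = {}

    def put(self, subject: str, predicate: str, obj: str) -> None:
        self._nodes[(subject, predicate, obj)] = True

    def delete(self, subject: str, predicate: str, obj: str) -> None:
        # Keep the key so the deletion itself can be shared with other peers.
        self._nodes[(subject, predicate, obj)] = None

    def objects(self, subject: str, predicate: str) -> list[str]:
        """Return the live objects for a subject and predicate, sorted."""
        return sorted(o for (s, p, o), alive in self._nodes.items()
                      if alive and s == subject and p == predicate)


store = TombstoneStore()
store.put("Bronson", "likes", "coffee")
store.put("Bronson", "likes", "tea")
store.delete("Bronson", "likes", "tea")
print(store.objects("Bronson", "likes"))   # tea is hidden but still stored
store.put("Bronson", "likes", "tea")       # revive the tombstoned triple
print(store.objects("Bronson", "likes"))
```

Keeping the tombstone matters in a distributed setting: a peer that was offline during the deletion needs to receive "this is now null" rather than simply never hearing about the triple again.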

In the end, what I am looking for is a distributed, real-time, semantic, versioned, licensed web of knowledge. That is, my dream system, if it could be cobbled together, would be to have a data set turned into a “knowledge base” via an ontological schema, licensed in a clear way, encoded via the private key of the person who generated it (anyone with that person’s public key can then access and verify it), served dynamically from a truly distributed data store (not just master/slave), with a clear versioning record (and the versioning should maintain compatibility with the licensing). It is a lot to ask, but with technologies like websockets, RDF (and JSON-LD), blockchain, asymmetric cryptography, and Creative Commons, along with implementations like IPFS (with its OrbitDB), gun.js, and BigchainDB, most of the pieces are already here. The Wolfram data repository really nails the semantic knowledge issue along with proper attribution and clear licensing; Knora is an impressive and similarly poised project using standardized RDF/RDFS/OWL for its knowledge base (though neither appears to be truly distributed or real-time). And gun.js is very close also, only really lacking the semantic component and maybe simplified searching possibilities. It seems to me that we are now very close to a robust solution, it’s just a matter of time…
