Data, Data Everywhere and not a Drop to Drink

Data are the lifeblood of the computational humanities. My first serious encounter with the use of computers to benefit humanistic research was the Perseus project. It was at the turn of the new millennium and I had just begun my Classics studies at Loyola Marymount University in Los Angeles. I remember marvelling at how quickly I could look up words as I was reading Plato or the Aeneid. Not only that, Perseus seemed to have all the major works I might want to read available at my fingertips, along with Smyth’s famous grammar, the “middle” Liddell, Lewis and Short, and even the big LSJ—the only thing we need now is the OLD…or the TLL! As a poor kid in a relatively wealthy university, it was empowering to have access to all these resources, which I could never dream of acquiring on my own (Sarah, my wife, did eventually buy me Smyth, and I got an ancient copy of LSJ when the Semitic library at the Catholic University of America was cleaning its stacks). Something about having these works appear on my own computer screen made me feel that I could own these monuments of western learning too, if only in the form of 1’s and 0’s flitting through the aether.

But feeling like one has ownership of those bits of data is not enough. There is often a sense that if something is on the internet, it is free for everyone to use, which is not technically correct. Licensing still applies, and this is one of the main problems plaguing computational humanities projects. With the explosion of open source everything, the world is becoming more comfortable with “open” data, and publicly funded research is thankfully beginning to be required to provide it. I see that my old mainstay the Perseus project is using a CC-BY-SA license, which is cool because it allows products like Logos to package the Perseus data into a software platform familiar to their user base. If data are the lifeblood of computational humanities, then restrictive licenses, or even the failure to provide any license at all, are a death sentence. And given its etymological background, can data (think Latin do, dare) really be held back? While there are many ways to copyright data these days, the Creative Commons family of licenses provides clarity and simplicity, and would be a welcome addition to any DH project. Using such a license ensures the continued usability of a project’s hard-earned data beyond the financial, institutional, and temporal bounds of the project itself, thus integrating the work of individual projects into the wider DH collective. A project’s code can also be openly licensed, for instance under the MIT license, even if making the right choice for code licensing can become complicated by the use of libraries with possibly conflicting licenses (Apache, BSD, GPL, LGPL…)—a complication greatly simplified by tools like FOSSA.

Even when data are clearly licensed in a permissive form, that is only the first step to making those data productive within the computational humanities community. The data need to be accessible in some recognizable data format (CSV, JSON, SQL) and have some sort of understandable semantics and syntax (e.g., TEI and XML, respectively). I currently use data from many sources in my professional capacity within the Scripta Qumranica Electronica project and in my own personal research. Much of my time in such endeavors is bound up in transforming the data from the format it came packaged in into a format I can compute on and one that can link with the other data sources I might be using. This may involve reading the data into objects/structs in whatever programming language I’m using (Go, Python, Perl, Javascript, etc.) for in-memory processing, parsing it into a database (PostgreSQL, MySQL, GraphDB, Jena, etc.) for logical queries, or converting it into whatever format is required for a particular processing suite that I want to experiment with (TRACER, OpenNMT, Ugarit). This is simply inefficient and results in a lot of duplicated work of inconsistent quality. What is needed is a common data interchange format and a way to provide an ontology for the data stored in it.
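To give a toy sense of what that wrangling looks like, here is a minimal sketch in Python (the TEI-like fragment and the field names are invented purely for illustration) that pulls the words out of a small XML snippet and re-serializes them as JSON for some downstream tool:

import json
import xml.etree.ElementTree as ET

# An invented TEI-like fragment, standing in for whatever format the data arrived in.
tei_snippet = """
<line n="1">
  <word>arma</word>
  <word>virumque</word>
  <word>cano</word>
</line>
"""

root = ET.fromstring(tei_snippet)
# Reshape the XML into plain dictionaries that other tools can consume.
words = [{"line": root.get("n"), "position": i, "text": w.text}
         for i, w in enumerate(root.findall("word"), start=1)]

print(json.dumps(words, ensure_ascii=False, indent=2))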

I guess that talk of a common interchange format was a pretty overt setup for JSON-LD. But let’s not get ahead of ourselves. There are many data formats to choose from, and perhaps CSV is the most versatile, since you can open it in anything, even Excel, but it can be a real pain with complex linked data. And since I am interested in simplifying the display of data as much as access to processing, CSV is a poor choice. A desire for human readable access to primary research data has led to the popularity of XML (an acceptable, though unwieldy, format for linked data). But this confuses the matter: human readability and digital portability (in all its complexity) represent two different axes, and accommodating one often comes at the expense of the other. What I would rather see is a very clear and extensible data representation that does justice to the data and provides the necessary tools to easily extract human readable representations (along with useful forms for computation). Enter data ontology.

As a classicist, ontology is a funny sounding word—the participle of ειμι + λογια, so the study of being? That sounds cool: we store not only our data, but also information about what our data actually is, its state of being. This reminds me of what is partially accomplished in SQL databases: we define our data types, Integer, Float, Decimal. But what is the String “לא”? Is it a Hebrew word “no”, an Aramaic word “no”, a number “31”, something else? MySQL data types like TIMESTAMP give us more information about what they are, since we know that any data stored in such a format provides a UTC designation of time; that is, we get not only the data, but also know exactly what it means: a 4-byte integer representing the number of seconds since January 1, 1970. But creating a new data type for MySQL or any other database is no easy thing. An SQL database (with InnoDB tables) allows you to specify relations very easily via the schema, as do graph databases of various sorts, but the definition of data types is very closed, thus limiting the ontological horizons. So are we trying to fit a rug into a room that is too small?

Well, the semantic web developed a way to deal with this long ago. The goal there has been to store data in a very simple format and to couple that with a mechanism to describe those data. The backbone of the semantic web, which doesn’t exactly exist yet, is the triplestore (a type of graph database) conforming to the Resource Description Framework (RDF), with a separate schema (RDFS) providing the ontology, which can be chosen from a preexisting schema or described ad hoc. So what goes into a triplestore? Person -> has a name -> name would be an example of a triple: we have a subject “person”, a predicate “has a name”, and a “name” as the object—three values, hence “triple”, in a clearly described relationship. These triples can be represented in various ways, including good old plain text. So we can link anything to anything else with this triple data structure: for instance, a “work” can “have a chapter” “chapter”, a “chapter” can “have a sentence” “sentence”, and this “sentence” can “have a word” “word”, and that “word” can “be a part of speech” “noun”, and that “noun” can “have a syntactic role” “agent”, but it could also “have a semantic domain” “beverage”, or anything else. I should ask here, though, what is a word? Is it a morpheme, a lexeme, a vocable, all letters between spaces in a text? Is it:

<line>
  <word>arma</word>
  <word>virumque</word>
  <word>cano</word>
</line>

or should it be:

<line>
  <word>arma</word>
  <word>virum</word>
  <word>que</word>
  <word>cano</word>
</line>

Or should it be something different from either of those? Having an ontology to define the meaning of each piece of data begins to help here: we can make clear what our data means and thus make it easier for others to know how to use it for their own purposes.
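To make the triple idea concrete, here is a minimal sketch using Python’s rdflib; the example.org namespace and the property names are invented for illustration and are not drawn from any real ontology:

from rdflib import Graph, Literal, Namespace

# A made-up namespace just for this sketch.
EX = Namespace("http://example.org/aeneid/")

g = Graph()
g.add((EX.aeneid, EX.hasChapter, EX.book1))
g.add((EX.book1, EX.hasSentence, EX.sentence1))
g.add((EX.sentence1, EX.hasWord, EX.word1))
g.add((EX.word1, EX.partOfSpeech, Literal("noun")))
g.add((EX.word1, EX.text, Literal("arma", lang="la")))

# Ask the graph which words sentence1 contains and what their text is.
for word in g.objects(EX.sentence1, EX.hasWord):
    print(word, list(g.objects(word, EX.text)))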

Let’s take names as an example. What if I want to store some information about people with names? I can use the common FOAF ontology for a person: “The Person class represents people. Something is a Person if it is a person. We don’t nitpic about whether they’re alive, dead, real, or imaginary. The Person class is a sub-class of the Agent class, since all people are considered ‘agents’ in FOAF.” I guess maybe I would have suggested that a “person” is, was, will be, or is imagined to be a member of the species Homo sapiens (sorry to all you dog and cat lovers out there), but no matter, we will follow the FOAF ontology. So we have a person and we will provide data for that person’s name, whether a “family name” (foaf:familyName) or “first name” (foaf:givenName):

  • “The familyName property is provided (alongside givenName) for use when describing parts of people’s names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.”
  • “The givenName property is provided (alongside familyName) for use when describing parts of people’s names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.”

So, with Bronson Brown-deVost I would store “Bronson” as the “givenName” and “Brown-deVost” as “familyName”, but for someone like Kim Jong-Un, I would describe “Kim Jong” as the “familyName” and “Un” as the “givenName”. So in RDF/Turtle syntax that would be:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix bbd: <https://www.brown-devost.com/> .
@prefix bbdpeople: <http://www.brown-devost.com/people/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
bbdpeople:bronson foaf:givenName "Bronson"^^xsd:string .
bbdpeople:bronson foaf:familyName "Brown-deVost"^^xsd:string .
bbdpeople:un foaf:givenName "Un"^^xsd:string .
bbdpeople:un foaf:familyName "Kim Jong"^^xsd:string .

Even though the names are written in a different syntax, the ontology of the names is made clear, or at least described in a consistent way. This ontology certainly misses that Kim is a family name and Jong is a more specific clan name, and that Brown is a family name I acquired through marriage, and deVost or Devost is a family name I acquired through birth. But if I wanted a more fine-grained ontology for personal names, I could always create one, and these data, which are probably just UTF-8 strings, could be given a more distinctive data type and related in very specific ways to other data types (I could, for example, make “marriedName” and “clanName” sub-properties of foaf:familyName, as sketched below).
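A minimal sketch of what such an extension could look like with rdflib follows; the bbdn: personal-names vocabulary is hypothetical, invented only for this example, and the code assumes a reasonably recent rdflib:

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
# A hypothetical personal-names vocabulary, invented just for this sketch.
BBDN = Namespace("https://www.brown-devost.com/names/")

g = Graph()
g.bind("foaf", FOAF)
g.bind("bbdn", BBDN)

# marriedName and clanName refine foaf:familyName rather than replacing it.
for prop in (BBDN.marriedName, BBDN.clanName):
    g.add((prop, RDF.type, RDF.Property))
    g.add((prop, RDFS.subPropertyOf, FOAF.familyName))

print(g.serialize(format="turtle"))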

Moving to the broader picture, using an ontology provides a semantic schema for the data that makes it much easier to know exactly what each piece of data is really intended to represent. What is more, making those data available as JSON (here JSON-LD) makes them easily accessible and consumable for a variety of purposes. There are already software solutions that can be used today for this type of data access (think SPARQL endpoints with things like Apache Jena, etc.). But let’s push further here: we have our data, and it is curated to have a consistent schema and ontology, so we know what our data means, or is supposed to mean. What we still lack, though, is simplified accessibility, attribution, and versioning of these data. Let me tackle these one by one.
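As a quick aside before tackling those three: the little FOAF example above can be re-serialized as JSON-LD in a few lines of rdflib (assuming a recent rdflib where the JSON-LD serializer is built in; older versions need the rdflib-jsonld plugin):

from rdflib import Graph

turtle = """
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix bbdpeople: <http://www.brown-devost.com/people/> .
bbdpeople:bronson foaf:givenName "Bronson"^^xsd:string ;
                  foaf:familyName "Brown-deVost"^^xsd:string .
"""

g = Graph()
g.parse(data=turtle, format="turtle")
# The same triples, now as JSON-LD that a browser or web API can consume directly.
print(g.serialize(format="json-ld"))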

I am very interested in thinning out the data access chain. In SQE we have a long chain of data handlers. Basically, the user of our website requests data by clicking somewhere, the browser-side JavaScript data models make a request over HTTP to our Perl CGI scripts, which query the MariaDB database, which returns the result to the Perl CGI scripts, which send those data back to the user’s browser, which updates the JavaScript data model, which the Vue.js template then automatically reacts to and changes the display the user sees on the screen. We are even thinking of throwing in another middleman like socket.io for realtime data synchronization. These are tried and true technologies, but I am still not thrilled about how many “hands” the data need to pass through between the user and the single “source of truth” (those database files on the SQE server’s disk), nor about the various transformations made along the way to the user’s browser. It is a very difficult juggling act to maintain that single “source of truth” as the data are passed back and forth between the browser and the database. One user could change some datum, and a fraction of a second later another user could try to change that same datum. The first user would receive back an all clear that their write was successful, and not learn until the next complete browser reload that the successfully written data is already out of date. And the second user would perhaps never learn of the first user’s write at all. I would love to have the browser interact more directly with the data, but a number of issues, most importantly security, stand in the way. No developer in her right mind would expose a database directly to the world wide web; we simply can’t let everyone have direct access to the datastore…or can we?

The issue of attribution, and indeed ownership, of data is a real problem. Most systems are based on trust: I create data attached to my user credentials and I trust that no one will remove it. The list of people with admin access to a database, and hence the ability to alter my data, is often small. Nonetheless, part of a robust system is not relying on trust but rather engineering security from the ground up. That is why we want the systems that handle our finances and our personal data to be open source, and thus open to public scrutiny. Why should our academic data be any different? Asymmetric encryption works with a pair of keys, one public and one private. These keys are used to alter data, one basically turning it into a mess of unintelligible characters and the other restoring it to its original form. Data can be locked with a private key, and anyone with the corresponding public key can open it; this also allows them to verify the originator of the content. People can also encrypt data using a public key, and that data can then only be unlocked by the owner of the private key (we assume people know better than to give out their private key). This gives only the owner of the keypair full access to her or his data, since only that person can unlock messages in both directions. Thus private/public key encryption provides two benefits for shared data: it places control of data access back into the hands of the content originator (if you lock something with your public key, only you can access it, since unlocking requires the corresponding private key), and it allows public verification of who originated the content. Let’s try a little example slightly modified from the one jsencrypt provides in their readme:


(Interactive demo: one box displays the message once encrypted, the other once decrypted.)
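Since the live demo cannot run here, the following is a minimal sketch of the same two operations in Python using the cryptography library rather than jsencrypt; the key size and padding choices are my own assumptions, not part of the original example:

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Generate a keypair (in practice you would load existing keys from disk).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"arma virumque cano"

# Anyone can encrypt with the public key; only the private-key holder can decrypt.
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(message, oaep)
assert private_key.decrypt(ciphertext, oaep) == message

# Only the private-key holder can sign; anyone with the public key can verify the originator.
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH)
signature = private_key.sign(message, pss, hashes.SHA256())
public_key.verify(signature, message, pss, hashes.SHA256())  # raises if the signature is bad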

Using encryption keys like these to govern data ownership would be especially useful in a database system with uniqueness constraints, so that when someone tries to create content that already exists, the uniqueness constraint ensures that the first originator is always credited—Ingo Kottsieper has already integrated a different method for achieving this in the SQL database for the SQE project. Such constraints become complex with atomistic linked data, but should be possible to a large extent, especially with robust ontological schemas.
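As a toy illustration of that first-originator idea (using SQLite and invented table and column names, not Ingo’s actual MariaDB implementation):

import sqlite3

conn = sqlite3.connect(":memory:")
# A hypothetical table: each piece of content may only be recorded once.
conn.execute("CREATE TABLE reading (content TEXT NOT NULL UNIQUE, creator TEXT NOT NULL)")

conn.execute("INSERT INTO reading (content, creator) VALUES (?, ?)", ("לא", "first_user"))
try:
    conn.execute("INSERT INTO reading (content, creator) VALUES (?, ?)", ("לא", "second_user"))
except sqlite3.IntegrityError:
    # The uniqueness constraint guarantees that credit stays with the first originator.
    creator = conn.execute("SELECT creator FROM reading WHERE content = ?", ("לא",)).fetchone()[0]
    print("Already exists; credited to", creator)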

Along with this true ownership of data comes the desire to update it. As humans, we do not have a permanent state; we change in relation to the stimuli around us, and sometimes this means changing our opinions and perceptions—in programming this is mutability. In programming, mutability is a big problem because most programs rely on a consistent state. Changing data in a program is not necessarily the problem; the issue is making sure every aspect of the program shares the same state at the same time. Managing this can become difficult when data is shared across many parts of a local program, across an intranet, or especially across the world wide web. What happens when someone is viewing some data locally and it gets changed on the server? Or what happens when someone loses connectivity to the network and wants to change some data they are already working with? RDBMSs have long incorporated complex routines to synchronize data between query and execution, as well as between master and slave database servers, in closed systems. But we probably also want access to the history of changes, something BigchainDB is all about.

We are now seeing the entrance of new technologies to deal with sharing the state of data across the internet in real time, such as IPFS and various other solutions built on websockets (even high level frameworks like Meteor.js and Django support real-time updates). One neat implementation I have been playing with is Gun.js, which is based on the concepts of eventual consistency and cryptographic permissions. Basically, data can be shared between any number of clients (ok, server memory limitations do apply, but you can always add more servers). Each user’s web app is able to store data locally when the websocket connection is not available and updates the server when a connection is established again—thus, eventually consistent. Gun.js will choose the highest update value (each time a datum, or “node”, is updated it increments an update counter), and in a race condition (two or more users trying to update the same datum at the same time), the lexically larger value wins. Hey, you have to decide somehow. Supposedly the whole update history of each datum is accessible, so a version history can be reconstructed.
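To illustrate, here is a toy sketch of that resolution rule as just described; it is only an illustration, not Gun.js’s actual HAM algorithm:

def resolve(local, remote):
    """Pick the winning update for a single node.

    Each update carries a 'state' counter and a 'value'. The higher counter
    wins; if two updates race in with the same counter, a deterministic
    lexical comparison breaks the tie so every peer settles on the same value.
    """
    if remote["state"] != local["state"]:
        return remote if remote["state"] > local["state"] else local
    return remote if str(remote["value"]) > str(local["value"]) else local

# Two peers try to set the same background colour at the same moment.
print(resolve({"state": 4, "value": "teal"}, {"state": 4, "value": "tomato"}))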

If you want to try this out, change the color in the dropdown box below and watch the background color of this paragraph change. Perhaps it has already been changing as you were reading this, since whenever anyone anywhere changes this value, it updates for everyone everywhere. Go ahead, call up your friend in Australia and start a color battle! My son and I did a little test sending the update back and forth from Germany to Australia a couple times and experienced real-time synchronisation of a second or less!

Or perhaps we could do something graph-like that looks a bit more similar to a triplestore. Try typing a relationship in the following fields:
(Interactive demo: input fields for a subject, predicate, and object; a little query form, “Find every ___ of ___”; and a results box that initially reads “No search run yet.”)

You can always try a new triple of your own; just pop in any subject, predicate, and object. You can add as many objects to the same subject and predicate as you want, and they will magically appear in your search results. Or delete a triple and watch it disappear. In fact, it is never really deleted; the node is just given a value of null, or “tombstoned”, to use the technical term. We could always revive the data later or assign it new values.
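Here is a toy, in-memory version of that behaviour in Python; it is only meant to illustrate tombstoning, not how Gun.js actually stores its graph:

class TinyTripleStore:
    """A toy dict-backed triple store."""

    def __init__(self):
        # (subject, predicate) -> {object: True | None}; None marks a tombstone.
        self._data = {}

    def put(self, subject, predicate, obj):
        self._data.setdefault((subject, predicate), {})[obj] = True

    def delete(self, subject, predicate, obj):
        # Tombstone: the entry stays in the store, but its value is nulled out.
        if obj in self._data.get((subject, predicate), {}):
            self._data[(subject, predicate)][obj] = None

    def find(self, subject, predicate):
        # Tombstoned entries are hidden from queries, but could be revived later.
        return [o for o, alive in self._data.get((subject, predicate), {}).items() if alive]


store = TinyTripleStore()
store.put("Vergil", "wrote", "Aeneid")
store.put("Vergil", "wrote", "Georgics")
store.delete("Vergil", "wrote", "Georgics")
print(store.find("Vergil", "wrote"))  # ['Aeneid']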

In the end, what I am looking for is a distributed, real-time, semantic, versioned, licensed web of knowledge. That is, my dream system, if it could be cobbled together, would be to have a data set turned into a “knowledge base” via an ontological schema, licensed in a clear way, encoded via the private key of the person who generated it (anyone with that person’s public key can then access it), served dynamically from a truly distributed data store (not just master/slave), with a clear versioning record (and the versioning should maintain compatibility with the licensing). It is a lot to ask, but with technologies like websockets, RDF (and JSON-LD), blockchain, asymmetric cryptography, and Creative Commons licensing, along with implementations like IPFS (with its OrbitDB), Gun.js, and BigchainDB, most of the pieces are already here. The Wolfram Data Repository really nails the semantic knowledge issue along with proper attribution and clear licensing; Knora is an impressive and similarly poised project using standardized RDF/RDFS/OWL for its knowledge base (though neither appears to be truly distributed or real-time). And Gun.js is very close too, only really lacking the semantic component and maybe simplified searching possibilities. It seems to me that we are now very close to a robust solution; it’s just a matter of time…

Versioning a MySQL Database on GitHub

As part of my work with Scripta Qumranica Electronica I have been working together with Spencer Jones to get our project(s) up on GitHub in a way that others might be able to easily spin up a development environment and in such a way that our open-source data and software are freely accessible. Spencer put together a wonderful Docker container so that a MariaDB instance can be loaded effortlessly on any developer’s computer. The task of making our unique data available in that container fell to me.

The Problem

I know very little about versioning databases; my sense is that it is not all that common, and when databases are available on GitHub they are usually either an empty schema or a full database dump. I did look into Klonio a while ago, but it wasn’t yet ready for use. A database dump would have been fine, and in fact that is how we started out. But as the database size grew (now over 1GB uncompressed), it was no longer possible to fit even a compressed dump within GitHub’s file size limitation. So the first solution is to dump individual tables to individual files, but we should also put some constraints on those dumps so that we are only uploading our own SQE data to GitHub, and not any data that the users of the database may have input. That requirement puts some limitations on the “dumping” procedure we use.

Wrangling Query Output into CSV Files

Enter SELECT INTO OUTFILE, a cool way to dump query results straight into delimited text files (tab-separated by default, though the options let you produce proper CSV). There are many settings that can be used to change the output, but I will just work with the defaults for now. A basic query with some filtering goes something like this:

SELECT column1, column2
INTO OUTFILE "/my/database/table.sql"
FROM table1
JOIN table2 USING(table1_id)
WHERE table2.user = 0

It is important to remember that the path you use here is relative to the server, not to whatever client you may be using to execute the query. So, you will need access to the server hosting the database, and the database will need write permissions for the folder you specify. Also, the statement will fail rather than overwrite if such a file already exists.

Getting Automated

Ok, I am not going to write all these queries by hand. Fortunately Ingo Kottsieper, who is in charge of designing our database, has used some very good practices to regularize the tables and their relationships. Firstly, the primary key for any table is labeled table_id, so joins become very easy and generally look something like ... FROM table1 JOIN table2 USING(table1_id) or ... FROM table1 JOIN table2 USING(table2_id). Also, all the tables into which users might insert data have a corresponding *_owner table, and this allows us to easily filter data that users have input from the base data that we have bulk imported.

So, since I was tired from all this research and various failed attempts, I wrote my backup script in Python. Not only that, it isn’t very idiomatic Python either—don’t judge. Also, I don’t make use of the benefits of the MySQL connection pool in the code below; it is just copied from some of my other bad Python code. Anyway, we start by grabbing a list of all the tables in the database, specifying some tables that we will exclude from automated processing (stuff with user passwords, stuff that needs to be treated specially, etc.), and filtering these into sets so we don’t accidentally try to back up a table twice:

import mysql.connector
from mysql.connector.pooling import MySQLConnectionPool

# Connection settings for the database being backed up.
dbconfig = {'host': "localhost",
            'user': "username",
            'password': "password",
            'database': "database_name"
            }

cnxpool = mysql.connector.pooling.MySQLConnectionPool(pool_name = "mypool",
                                                      pool_size = 30,
                                                      **dbconfig)

db = cnxpool.get_connection()
cursor = db.cursor()

# Grab the names of all tables in the database.
sql = 'SHOW TABLES'
cursor.execute(sql)
result_set = cursor.fetchall()

# This path is on the database server and must be writable by the database.
path = '/Users/bronson/sqe-mysql-backup/'

# Split the tables into those with a corresponding *_owner table and those without,
# and list the tables that need special handling (or should not be dumped at all).
owner_tables = set()
non_owner_tables = set()
exclude_tables = {'user', 'user_sessions', 'sqe_session', 'artefact', 'scroll_version',
                  'external_font_glyph', 'image_to_image_map', 'single_action', 'main_action'}
for result in result_set:
    if 'owner' in result[0]:
        owner_tables.add(result[0].replace("_owner", ""))
    else:
        non_owner_tables.add(result[0])
# Any table with an *_owner counterpart is handled in the owner_tables loop below,
# so drop it here to avoid backing it up twice.
non_owner_tables = non_owner_tables - owner_tables

Now that I have some lists of tables, I can write a simple loop to back them all up in one fell swoop—or more properly 100 fell swoops. The variable files will collect all the filenames if you want to print them out somewhere; and 1058 is a magic number for the point where automatically created data has already stopped and user data has just begun.

files = set()
# Tables with an *_owner counterpart: filter out user-created data by scroll_version_id.
for table in owner_tables:
    if table not in exclude_tables:
        print('Exporting table: %s' % table)
        query1 = 'SELECT ' + table + '.* INTO OUTFILE "' + path + 'tables/' + table + '.sql" FROM ' + table + ' JOIN ' \
                + table + '_owner USING(' + table + '_id) WHERE ' + table + '_owner.scroll_version_id < 1058'
        cursor.execute(query1)

    # The *_owner table is dumped for every owner table, even when the base table
    # itself is excluded for special handling.
    print('Exporting table: %s_owner' % table)
    query2 = 'SELECT * INTO OUTFILE "' + path + 'tables/' + table + '_owner.sql" FROM ' + table + '_owner WHERE ' \
             + table + '_owner.scroll_version_id < 1058'
    cursor.execute(query2)
    files.add(path + 'tables/' + table + '.sql')
    files.add(path + 'tables/' + table + '_owner.sql')

# Tables without an *_owner counterpart contain no user data, so dump them whole.
for table in non_owner_tables:
    if table not in exclude_tables:
        query3 = 'SELECT ' + table + '.* INTO OUTFILE "' + path + 'tables/' + table + '.sql" FROM ' + table
        cursor.execute(query3)
        files.add(path + 'tables/' + table + '.sql')
After this I still do need to run a couple of custom query dumps (only 3). Mainly this was because some tables have Geometry data types, which are binary data types, and SELECT ... INTO OUTFILE doesn’t play well with that. So I convert those columns to ASCII representations in the exported CSV and then use a custom parser in my import script. I then gzip all the backed up tables and use a bash script to compare them with the files I already have in the repository—cmp -i 8 /backup/file.gz /github/file.gz works the magic for comparing two gzipped files (it ignores the header info, but make sure your files have the same name). Finally I dump a bare schema from my DB into the GitHub project and push it all up to the web.
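For anyone curious, a rough Python equivalent of that gzip-and-compare step might look like the following; the directory paths are invented, and comparing the decompressed contents sidesteps the gzip header differences that the cmp -i 8 trick works around:

import gzip
import pathlib

# Hypothetical locations: fresh dumps on one side, the copies already in the git repo on the other.
backup_dir = pathlib.Path('/Users/bronson/sqe-mysql-backup/tables')
repo_dir = pathlib.Path('/path/to/github/repo/tables')

for dump in sorted(backup_dir.glob('*.sql')):
    gz_path = dump.parent / (dump.name + '.gz')
    with open(dump, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
        dst.write(src.read())

    old = repo_dir / gz_path.name
    if old.exists():
        # Compare the decompressed contents so differing gzip headers don't matter.
        with gzip.open(old, 'rb') as a, gzip.open(gz_path, 'rb') as b:
            changed = a.read() != b.read()
        print(gz_path.name, 'changed' if changed else 'unchanged')
    else:
        print(gz_path.name, 'new table dump')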

Getting that Data into a New Database

Now you, the unsuspecting collaborator, pull the latest database down to your devel server. Which brings me to my even lazier import script. It is just simple bash, since I wasn’t sure everyone would have the right tools for a Python script using a MySQL connector. Everyone has bash, right?

The bash script just uses the command line mysql client. It loads up the schema, unzips all the tables, and loads each backup file into the DB. It is important to turn off foreign key checks before running the import command and then turn them back on afterwards—I assume this is not being run on a production server. Hopefully the script does not run too fast; you do need a little time to grab a coffee. If anyone is interested, the command to load the backup files looks like this:

# Connection details ($host, $user, $password, $database) and $cwd are assumed to be set earlier in the script.
spin='-\|/'  # characters for the little progress spinner
i=0

for file in tables/*.sql; do
    table=${file%.sql}
    printf "\rLoading table: ${table##*/}\n"
    # Foreign key checks are switched off so the tables can load in any order.
    mysql --host=${host} --user=${user} --password=${password} --local-infile ${database} -e "SET FOREIGN_KEY_CHECKS=0;
    LOAD DATA LOCAL INFILE '$cwd/$file' INTO TABLE ${table##*/};
    SET FOREIGN_KEY_CHECKS=1;" &

    pid=$! # Process Id of the mysql command just sent to the background

    # Show a spinner while the load runs.
    while kill -0 $pid 2>/dev/null
    do
      i=$(( (i+1) %4 ))
      printf "\r${spin:$i:1}"
      sleep .1
    done
done

The while loop at the end just gives you a little spinner so you know your computer really is still working. Also, it provides some low-key entertainment while you sip that coffee you just got. Thankfully, Spencer Jones worked this import script right into the Docker container setup, so it meshes seamlessly with the installation of our Docker container. What to do now that you have a copy of the database working in your development environment? Finish that coffee and go get to work on some CGI scripts and queries…

Some final thoughts

If we didn’t gzip the individual table backups, we would have a fully readable versioning system for the DB on GitHub. Some folks may wish to do this if they can stay within the GitHub file size limitations (100MB, I think). We could not. In fact, my estimates show that one of our tables, the one with the Geometry data, will probably grow beyond 2GB by the time we get all of our data into it. This means I will be looking for another, better solution in the not too distant future. Helpful hints and comments are most certainly welcome! You can find the whole project, messy scripts and all, here.

Custom Fonts

In the world of ancient Near Eastern studies, it can often be difficult to find fonts that properly represent the artefacts we are working with. My experience creating fonts began with a project with Doug Gropp around ten years ago. He was writing a grammar of the targumic Aramaic of Onkelos/Jonathan and wanted to use supralinear Babylonian vocalization in his text. So I prepared a workable font for him, and I have been making my own fonts ever since, whenever there was a particular script I wanted to present in my work. I have begun using these custom fonts to visualize possible textual reconstructions. Feel free to try out the fonts in my repertoire here (please consider them CC-BY-SA).

If you want to try your hand at creating your own fonts, the following should serve to get you started.

Creating Fonts

Creating fonts directly from images is rather trivial, though often time consuming. There are 2–3 steps involved depending on the level of detail one wishes to achieve:

  1. Creating an inventory of characters from images in your image processor (either Gimp or Photoshop), binarizing them, and saving them together in one graphics file (png is good for this).
  2. Loading the file created in step 1 into a font editor (Birdfont is particularly easy to use) and vectorizing each character at its appropriate Unicode codepoint.
  3. (optional) creating a kerning table (using Font Forge or Microsoft Volt)

With the exception of Photoshop, all software mentioned here is free. With the exception of Microsoft Volt, all software will run on all platforms (Windows, Mac, Linux).

Image Processing

The process of creating good black and white images for each character can be done to the level of precision desired. Making cleaner lines around the character will enhance the vectorization process and make the font clearer and faster to render, but it is really not necessary for the work we are doing to make estimates. You will find your own happy medium between time spent and level of precision.

The following video shows my process for creating a character map. The steps are essentially as follows:

  1. load multiple images into the program as separate layers (since they all have DPI information, they should all be imported at the correct size relative to each other)
  2. use the magic wand tool on the grayscale image to select the character
  3. copy the character into a new layer
  4. make guiding marks for the hanging line (I usually see these better in the color image)
  5. repeat steps 2–4 for every character (save your work often!)
  6. save the full project now
  7. delete the layers containing the scroll images
  8. arrange all the characters into some usable order
  9. merge layers
  10. convert to grayscale
  11. save as png file (do not overwrite your original psd file!)

A video of the procedure in Photoshop is available here.

I will try to make a video with the procedure in Gimp ASAP, but I am not so used to that software yet.

Font Creation

Now that you have a grayscale or binarized graphics file containing all your characters, you can convert them into a font. I have used Birdfont in this demonstration, but you could also use the more advanced Font Forge.

The following video shows the font creation process. The steps are as follows:

  1. create a new font
  2. import as a background the graphics file you just created:
    three parallel lines in the upper right corner > import and export > import background image (or simply Ctrl+B)
    on the bottom of the left menu select “Add”
    In the window just opened, select the file you just created > on the upper line, click on “open”
  3. resize it so it fits the preset character bounds (once you have gotten it to the correct size, do not resize it again), make sure to turn on the hanging line:
    On the left black menu under “background image”, click on the circle with the crossed four arrows
    Use the small black triangle on the left bottom of your background image to resize
    Use the horizontal line with the arrows to rotate the letters
    In order to move the background image together with the guidelines, click on the hand button (in the menu under “background image” near the round button with the arrows)
  4. in the overview tab, select unicode, then scroll down to aleph and double click it
  5. go to your background tab and draw a box around the aleph (or the current character you are working on), then double click within the box
  6. click on “add to glyph” in the left hand panel, then select aleph
  7. now move the image into proper position using the guidelines you made in the image, rotate the image if necessary
  8. now binarize, trace, and wait for the processing to finish
    To binarize: under the “background tool” menu click on the square divided into black and white triangles (if you close the file and reopen it, you might need to click on that button again for each letter).
    To trace: under the same menu click on the circle that is half dotted.
    In case you wish to change a letter after tracing, you should move the letter. It will create another layer of the same letter. Then, delete the previous layer by clicking on the x next to the word layer on the left.
  9. select the guide lines you created and hit the delete key to remove them
  10. repeat steps 4–9 for every character
  11. set font attributes and names
  12. export font

A video of the procedure for converting the characters in the graphics file to vector fonts in BirdFont is available here.

A video of the procedure for setting attributes and names, and exporting the font is available here.

Kerning and Metrics

After all this, you now have a font that works reasonably well. You can stop here and use the font as is (you can do kerning manually in Photoshop and other software). But if you want to make the font even more usable, then continue on to add kerning tables. You should be able to use BirdFont for kerning, and it has a very simple interface for this, but there seems to be a bug with left-to-right fonts currently (at least on my machine). So, you must use Font Forge or Volt for this.

The steps for defining kerning pairs are as follows:

  1. open your new font in Volt
  2. Add a script; this must be exactly: Hebrew . That is, Hebrew starting with a capital H followed by one space, then .
  3. Add a feature; you should label this Kerning , which lets the processor know this is a kerning table.
  4. Now add a positioning lookup by clicking “Add Positioning”. You can name this anything you want.
  5. Now open the positioning you just created by clicking “Edit Lookup”, set it to “Pair Adjustment” and change from LTR to RTL.
  6. Now you can click on “Edit Glyphs” to see a list of the characters you have created. Then simply drag them into either the first or second box in the Lookup Edit window.
  7. Drag the left hand glyph to set kerning for the pair. You can also change values manually. It is best not to change values for the first glyph, only the second.
  8. To add more kerning pairs, just select one of the glyphs in the left hand panes and hit enter to start a new entry.
  9. You can create character groups for kerning, and this certainly saves some time (for instance ז, י, ו may constitute a thin letter group, then you can just kern the group instead of individual letters). But I will leave groups for further discussions.
  10. Once you are happy with all your kerning pairs, move the “New Positioning” (or whatever name you gave it) from under the “Lookups” to the left, under the “Kerning ”. Then click on the “Compile” button, and then you should click on “Proofing Tool”, which displays a window in which you can test the font by typing in the input field and turning your kerning tables on and off. Don’t forget to switch the Text Flow to RTL (Right to Left). Put a check mark next to the word “kerning” on the left.
    You might not be able to see the letters in Hebrew, but do it anyway. It will make the kerning work.
    Click on the play button until the blue rectangle encircles all the letters (or the little squares that are displayed instead of them).
  11. You should save regularly, and when you are finally ready to finish the font, go to “File” and then “Ship Font”, choose a name and you are done.

A video of the procedure for adding kerning tables to your new font can be found here.
