The importance of using standards


Today’s post is a short rant about the importance of using standards. We have recently been commissioned to develop a website for the South African History Archive (SAHA) and the SABC about the Truth and Reconciliation Commission (TRC). The idea of the site is to make the TRC accessible by linking the actual hearing transcripts to the episodes of a special report TV series that the SABC produced at the time. The relevant part of this project for today’s rant is the stage where we have to import the transcripts of all the hearings into a database and make them searchable.

Transcripts

The transcripts come to us in an HTML format. There are several types of hearing, but the main ones are amnesty and human rights violation hearings. Altogether, there are over 3000 transcript files. This all sounds fine, right? HTML is a common format, so all we need to do is write a script that runs through the transcripts, extracts the information and inserts it into a database. Easy.
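
If the markup had been applied consistently, the whole import really could have been a loop along these lines. This is a minimal sketch only, assuming Python with BeautifulSoup and SQLite; the directory layout, database schema and column names are placeholders, not the ones we actually use.

    import sqlite3
    from pathlib import Path

    from bs4 import BeautifulSoup

    # Illustrative schema: one row per paragraph of each transcript file.
    conn = sqlite3.connect("trc.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paragraphs (file TEXT, position INTEGER, text TEXT)"
    )

    for path in Path("transcripts").rglob("*.htm"):
        soup = BeautifulSoup(path.read_text(errors="replace"), "html.parser")
        for position, p in enumerate(soup.find_all("p")):
            text = p.get_text(" ", strip=True)
            if text:
                conn.execute(
                    "INSERT INTO paragraphs (file, position, text) VALUES (?, ?, ?)",
                    (str(path), position, text),
                )

    conn.commit()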

Unfortunately not. The transcripts went through two sets of hands before they got to us, and no common standards were applied. First, the transcripts were created at the hearings by the stenographers. In most cases (but not all), general conventions were followed, like putting a speaker’s name in uppercase and putting each new speaker on a new line, but beyond that there is almost no common ground between transcripts. The second stage involved compiling the transcripts into the HTML format we have now. Again, in most cases (but not all), each new line is identified by a <P> tag, but apart from that, just about anything goes.
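
To give a feel for it, here is roughly how the common case, an uppercase speaker name at the start of a new line, could be picked out. The regular expression and the colon convention are assumptions drawn from the general description above, not a rule that every transcript actually follows.

    import re

    # An uppercase name followed by a colon at the start of a line marks a new
    # speaker in the common case; anything else is treated as a continuation.
    SPEAKER_RE = re.compile(r"^([A-Z][A-Z .'\-]+):\s*(.*)$")

    def split_turn(line):
        """Return (speaker, text) for a new speaker, or (None, text) otherwise."""
        match = SPEAKER_RE.match(line.strip())
        if match:
            return match.group(1).strip(), match.group(2)
        return None, line.strip()

    # split_turn("CHAIRPERSON: Please state your full name.")
    # -> ("CHAIRPERSON", "Please state your full name.")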

Formatting headers

The most difficult part of the process is that there is no standard format for the header of each transcript. The header should contain all the metadata about the hearing: the date, location, case number, applicants, and so on. It should also be laid out in a standard way (one header per line, for example), using the same header labels each time. Unfortunately this is not the case: many different labels are used for the same field, and most transcripts are missing some of the necessary headers entirely.
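
In practice the only workable approach is to map every label spelling we encounter onto a canonical field name. A rough sketch of that mapping, with invented label variants standing in for the real ones:

    # Every spelling of a header label we have actually seen gets mapped onto
    # one canonical field name. The variants below are invented examples of
    # the kind of thing that turns up, not a complete list.
    LABEL_MAP = {
        "date": "date",
        "date of hearing": "date",
        "hearing date": "date",
        "place": "location",
        "location": "location",
        "held at": "location",
        "case no": "case_number",
        "case number": "case_number",
        "application no": "case_number",
    }

    def parse_header_line(line):
        """Return (canonical_field, value), or None if the label is unrecognised."""
        if ":" not in line:
            return None
        label, value = line.split(":", 1)
        field = LABEL_MAP.get(label.strip().lower())
        return (field, value.strip()) if field else None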

A lot of the transcripts do not include date and location information, possibly on the assumption that this can be gleaned from the file name itself. This is a problematic approach, because it means the contents of the file are only meaningful while the file stays within the structure of the whole transcript file system. For example, a file might have the path and name “amntrans/jburg/200218.htm” (which translates as 18 February 2000, in Johannesburg). This file has no date or location information in its header, so as soon as we remove it from the file system (e.g. we send the transcript to someone who requests it, perhaps renaming it as “Joseph_Mockena_Amnesty_Transcript.htm”), there is no longer any way to identify the where or the when of the hearing.
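
For what it’s worth, the example path above can be decoded, but only because we happen to know this one naming scheme. A sketch of that decoding follows; the place abbreviations and the year handling are both assumptions based on this single example.

    from datetime import date
    from pathlib import Path

    # Place abbreviations seen in directory names; illustrative, not the full list.
    PLACES = {"jburg": "Johannesburg"}

    def metadata_from_path(path):
        """Best-effort recovery of (hearing_date, location) from one naming scheme."""
        p = Path(path)
        location = PLACES.get(p.parent.name, p.parent.name)
        stem = p.stem  # "200218" reads as 20|02|18 = 2000, February, 18th
        # Treating the leading "20" as the year 2000 only reproduces the single
        # example above; it is an assumption, not a general rule.
        year = int(stem[:2]) * 100
        return date(year, int(stem[2:4]), int(stem[4:6])), location

    # metadata_from_path("amntrans/jburg/200218.htm")
    # -> (datetime.date(2000, 2, 18), "Johannesburg")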

Handling the variations

So the scripts I’m writing have to be very long and extremely complicated, in an attempt to handle all the possible variations and to catch all the conflicts between them (e.g. one variation might use a certain string to separate sessions, while another might use the same string for a completely different purpose; when we find the string, we first have to work out which variation we’re looking at before we can decide how to handle it). This is not a problem in itself: we expected it, so the project has time and budget for it, and the challenge makes the project fun. But it does offer a small lesson for anyone setting up a project that will create and record any kind of data, especially transcripts.
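
In code, that dispatch ends up looking something like the sketch below. The clue strings and variation names are placeholders, not the real ones; the point is only the shape of the logic: classify the file first, then interpret the ambiguous marker.

    # Placeholders for the clue strings and variation names; the real ones differ.
    def classify_variant(lines):
        """Crude heuristic: decide which transcript variation a file belongs to."""
        text = "\n".join(lines).upper()
        if "AMNESTY HEARING" in text:
            return "amnesty"
        if "HUMAN RIGHTS VIOLATIONS" in text:
            return "hrv"
        return "unknown"

    def handle_marker(line, variant):
        """The same marker string means different things in different variations."""
        if variant == "amnesty":
            return ("session_break", None)   # here the string separates sessions
        return ("text", line)                # elsewhere it is just spoken text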

Defining standards for data recording

The lesson, of course, is: decide on a standard format for your generated content before you start, and stick to it. It will take a relatively short time to define and set your standards, but will save many, many hours of effort afterwards. I strongly suspect that one reason we have had to wait until 2011 to get searchable details of the South African Truth and Reconciliation Commission, which finished in 2000, is that the task of importing all the transcripts was just a little too daunting and, possibly, expensive.
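
Concretely, even something as small as an agreed list of required header fields, checked automatically when each transcript is created, would have prevented most of the problems above. A sketch, with illustrative field names:

    # An agreed set of header fields, checked before a transcript is accepted.
    # The field names here are illustrative.
    REQUIRED_FIELDS = ("date", "location", "hearing_type", "case_number")

    def validate_header(header):
        """Return a list of problems; an empty list means the header conforms."""
        return [f"missing field: {f}" for f in REQUIRED_FIELDS if not header.get(f)]

    # validate_header({"date": "2000-02-18", "location": "Johannesburg"})
    # -> ["missing field: hearing_type", "missing field: case_number"]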

One Comment so far:

  • I am so pleased to hear that this project is underway! I actually managed the website at the end of the TRC process and I do feel your pain. (Long story, original developer fired, website added to my workload as there were no resources to hire someone to maintain it). The site was a mess, and with transcripts rolling in it was all we could do to upload them in whatever form they were received before people got more incensed that they weren’t up yet. There was never time to do any more than that. The website was sadly not a priority at all for the organisation (this was the late 90s after all) so absolutely no money or resources were invested in it. It’s always been a major regret of mine that I couldn’t do more with it (particularly in terms of tying together the database of victims/amnesty applications with the transcripts and making it a major searchable resource) so I really am delighted to hear about your work.
    If there is anything I can do to assist, don’t hesitate to contact me.