Online Archives – A Coding Challenge


Every now and again I am given suggestions about what to write in these blog posts, and one suggestion comes up frequently: “Why don’t you write about the stuff you did for SAHA?” I’ll usually respond with something like, “Well, that’s quite a difficult thing to put into a blog post”, but for some reason nobody ever seems to care. So here goes…

SAHA, or the South African History Archives, is an organisation whose primary function is to collect and store historical documents related to South African history, particularly from the 20th century. The documents are all organised into collections, and archived using international standard practices. Our project was to develop a website that could make these archives (or at least the archival records) available online. The idea was not to make sure every piece of paper or photograph was available directly from the website, but to provide a list of all the things contained in the archives, so that researchers could use the website to find useful and relevant historical documents.

The first step was to import all the existing collections and convert them to the EAD (Encoded Archival Description) international standard for archival listings. The listings came to me as a series of CSV files which, unfortunately, had not been created in any consistent format. I had to write a parser script to interpret these inconsistent CSVs, extract the relevant information, and insert that information into a database I had designed. Once in the database, the information is carefully managed so that it remains EAD compatible and can easily be exported into an EAD file.
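To give a flavour of the problem, here is a minimal Python sketch of one way to tame inconsistent headers. It is not the actual import script: the column names, the alias table and the function names are all assumptions made up for illustration, since the real CSV layouts varied from file to file.

```python
import csv
import io

# Hypothetical header aliases. The real CSV columns are not listed in this
# post, so these names are purely illustrative.
HEADER_ALIASES = {
    "title": "title",
    "item title": "title",
    "ref": "reference",
    "ref no": "reference",
    "reference": "reference",
    "date": "date",
    "dates": "date",
}


def normalise_header(name):
    """Lower-case a header, collapse whitespace, drop trailing punctuation."""
    cleaned = " ".join(name.strip().lower().replace("_", " ").split())
    return cleaned.rstrip(".:")


def parse_listing(text):
    """Return a list of dicts with canonical keys, skipping blank rows."""
    reader = csv.reader(io.StringIO(text))
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return []
    # Unknown headers map to None and their columns are ignored.
    headers = [HEADER_ALIASES.get(normalise_header(h)) for h in rows[0]]
    records = []
    for row in rows[1:]:
        record = {}
        for key, cell in zip(headers, row):
            if key is not None and cell.strip():
                record[key] = cell.strip()
        if record:
            records.append(record)
    return records
```

Each file's odd headers get mapped onto one canonical set of keys, so the code that writes to the database only ever sees one shape of record.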

EAD is an XML specification that describes how all the information about a collection should be organised. That information includes a complete list of the collection's contents as well as a whole range of meta-data, and is collectively called the finding aid. An EAD finding aid is machine-readable, which makes sharing archival information far easier. The SAHA online archives are now fully EAD compatible, which means that the administrators can export any collection into EAD for backup or distribution, and can import an EAD file to create a new collection or update an existing one. Researchers can also download the EAD file for each collection, or a PDF version of the finding aid, which is formatted to be easier for humans to read.
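As a rough illustration of what an EAD export involves, the sketch below builds a bare-bones finding aid from a nested dictionary using Python's standard library. The element names (`ead`, `eadheader`, `archdesc`, `did`, `dsc`, `c` and so on) come from the EAD tag library, but the input dictionary layout and the function names are assumptions for this example, and the output leaves out many elements that a complete, schema-valid finding aid would need.

```python
import xml.etree.ElementTree as ET


def add_component(parent, node):
    """Add one <c> component (series, folder, item...) and recurse into it."""
    c = ET.SubElement(parent, "c", level=node.get("level", "file"))
    did = ET.SubElement(c, "did")
    ET.SubElement(did, "unitid").text = node["id"]
    ET.SubElement(did, "unittitle").text = node["title"]
    for child in node.get("children", []):
        add_component(c, child)


def build_ead(collection):
    """Build a minimal EAD tree from a hypothetical nested dict."""
    ead = ET.Element("ead")

    # Header: describes the finding aid itself.
    header = ET.SubElement(ead, "eadheader")
    ET.SubElement(header, "eadid").text = collection["id"]
    filedesc = ET.SubElement(header, "filedesc")
    titlestmt = ET.SubElement(filedesc, "titlestmt")
    ET.SubElement(titlestmt, "titleproper").text = collection["title"]

    # Archival description: describes the collection and its contents.
    archdesc = ET.SubElement(ead, "archdesc", level="collection")
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unitid").text = collection["id"]
    ET.SubElement(did, "unittitle").text = collection["title"]

    dsc = ET.SubElement(archdesc, "dsc")
    for node in collection.get("children", []):
        add_component(dsc, node)
    return ead
```

Calling `ET.tostring(build_ead(collection), encoding="unicode")` would give the XML text. The nesting of `c` elements mirrors the nesting of series and folders within the collection.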

The next step was to import actual digitised items. These are sometimes scanned versions of actual documents, but most often they are visual media, such as JPEG versions of posters. They may also be audio or video files, or in fact any other kind of file. Digitised items also have meta-data, and these use a different standard: Dublin Core (that’s Dublin, Ohio, not Dublin, Ireland). This is a set of meta-data elements, commonly expressed as XML, which records information about a particular digital item. Dublin Core is used in a great many different situations, as I keep discovering. The standard for embedding meta-data into JPEG images is Dublin Core compatible. One of the most common extensions to the RSS standard adds Dublin Core meta-data to items in an RSS feed.

In our case, we needed to be able to add the Dublin Core data and the actual digital file to the collection together, as a single entity. The collections are all organised as hierarchies of folders, so we provide a facility to add the item directly to a folder. The administrator can export to or import from a Dublin Core XML file, and researchers on the website can export the item meta-data in Dublin Core as well as download the actual digitised item.
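For the curious, a Dublin Core export can be sketched in a few lines of Python. The namespace URI and the fifteen element names are those of the Dublin Core element set; the `metadata` wrapper element and the function name are assumptions for this example, not necessarily what the SAHA site produces.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the simple Dublin Core element set.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def dublin_core_xml(fields):
    """Serialise a dict of Dublin Core fields to an XML string.

    Values may be a single string or a list of strings, since every
    Dublin Core element is repeatable.
    """
    ET.register_namespace("dc", DC_NS)
    # "metadata" is a hypothetical wrapper element chosen for this sketch.
    root = ET.Element("metadata")
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        values = value if isinstance(value, list) else [value]
        for v in values:
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = v
    return ET.tostring(root, encoding="unicode")
```

An import would simply run the other way: parse the file, read each `dc:` element back into a field, and store the result alongside the uploaded file in the chosen folder.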

The biggest issue we have with using these two standards is that we haven’t yet found a way to join them together. We can export a collection into EAD, which will describe the collection, but it won’t contain the Dublin Core information for items added to the collection – those need to be exported separately and individually, which is generally too time-consuming to be done. The main reason we might need this function is for backups. Currently the entire archive is backed up as a whole, but it would be very useful to be able to export an individual collection in its digital entirety. We will have to give this some thought.

We are currently working to add a full search function to the archive. This will allow researchers to do a keyword search through the entire archive, including both the finding aid data and the Dublin Core data for digitised items. This has turned out to be a lot more complicated than we had anticipated, but we should have the system up and running in the next few weeks. The details of how that all works are for another post…
