
Sustainability of word processing documents

Ian Barnes

Friday, 20 January 2006, 3:59:23 PM


Word processing formats are a major problem for digital repositories. A large fraction of the material we want to preserve is created in these formats, but they are generally not suitable for long-term preservation because:

  • they are proprietary, and the owners can change either the format or the conditions of use at any time; and

  • they are flat (rather than structured), which makes information retrieval, viewing, printing and reuse more difficult.

We need to consider converting documents into a better format for long-term storage. A suitable format should be widely used, based on a stable and recognised standard, and versatile enough to handle everything from research monographs and articles ready for submission to scholarly journals to lecture notes and the minutes of the finance committee. It should also be easy to process using standard software tools. This last requirement is a strong argument for choosing an XML format. Candidate formats are XHTML, DocBook and TEI. I believe that DocBook and TEI are better than XHTML because they are more structured.

The problem with richly structured formats like DocBook XML and TEI is that word processing documents generally do not contain enough structural information to allow for an automated conversion process. There are a few possibilities.

  • The best scenario is that the document was created using a well-designed word processor template, so that every paragraph has a style name attached to it. These style names can then be used as hooks by an automated conversion process in order to deduce structure. The prototype Digital Scholar’s Workbench described below uses this strategy.

  • For legacy documents or for authors who refuse to use a template, the word processing document will have to be edited by a digital document archivist to get it into a state where it can be converted to DocBook (or TEI). The obvious way to do this is to open it in a word processor, import the template, and then go through the document applying the template styles. With help from keyboard shortcuts and macros (available in both Word and Writer), this might not be too painful, at least for relatively simple documents. Another possibility is to create a specialised digital document archivist’s workbench application for doing this kind of work.

  • For documents that are extremely poorly formatted, or that exist only on paper, a third alternative is to send them out to be rekeyed. This is expensive, but for high-value documents it may be worth it.
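
To make the first scenario concrete, here is a minimal sketch of how paragraph style names could act as hooks for deducing structure. It is an illustration only, not the Digital Scholar's Workbench implementation. It assumes the paragraphs have already been extracted from the word processing file as (style name, text) pairs, and the style names ("Title", "Heading 1", "Body Text" and so on), the function name and the sample document are all hypothetical. The output uses a small subset of DocBook element names (article, section, title, para).

```python
import xml.etree.ElementTree as ET

# Hypothetical template style names mapped to a section nesting depth.
HEADING_STYLES = {"Heading 1": 1, "Heading 2": 2, "Heading 3": 3}


def paragraphs_to_docbook(paragraphs):
    """Build a DocBook-style <article> from (style_name, text) pairs."""
    article = ET.Element("article")
    # stack[0] is the article; later entries are the currently open sections.
    stack = [article]
    for style, text in paragraphs:
        if style == "Title":
            # Put the document title first, wherever it appeared in the input.
            title = ET.Element("title")
            title.text = text
            article.insert(0, title)
        elif style in HEADING_STYLES:
            depth = HEADING_STYLES[style]
            # Close any open sections at this depth or deeper.
            del stack[depth:]
            section = ET.SubElement(stack[-1], "section")
            ET.SubElement(section, "title").text = text
            stack.append(section)
        else:
            # Assumption: any other style is treated as an ordinary paragraph.
            ET.SubElement(stack[-1], "para").text = text
    return article


# Made-up sample input standing in for a styled word processing document.
sample = [
    ("Title", "Minutes of the finance committee"),
    ("Heading 1", "Budget"),
    ("Body Text", "The budget was approved."),
    ("Heading 2", "Travel"),
    ("Body Text", "Travel costs were reviewed."),
    ("Heading 1", "Other business"),
    ("Body Text", "None."),
]

article = paragraphs_to_docbook(sample)
print(ET.tostring(article, encoding="unicode"))
```

The sketch also shows why the second scenario needs manual work: a document in which every paragraph carries the same default style gives the converter nothing to distinguish headings from body text, so everything would come out as a flat run of para elements.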