This report contains the results of the PRESTA - PREMIS Requirements Statement project undertaken by the National Library of Australia from December 2005 to June 2006 for the Australian Partnership for Sustainable Repositories (APSR).
APSR aims to establish a centre of excellence for the management of scholarly assets in digital format.
It has an overall focus on the critical issues of access continuity and the sustainability of digital collections. It is building on a base of demonstrators in developmental repositories within partner institutions. It is contributing to national strength in this area by encouraging the development of skills and expertise and providing coordination throughout the sector. It is actively providing international linkages and national services.
APSR is supported by the Systemic Infrastructure Initiative as part of the Australian Government's Backing Australia's Ability - An Innovative Action Plan for the Future. The current partners are Australian National University, National Library of Australia, University of Queensland, University of Sydney, University of Melbourne, University of Technology Sydney and the Australian Partnership for Advanced Computing.
"PRESTA - PREMIS Requirement Statement" is one of the projects of APSR. It has as its aims:
to specify requirements for the collection of metadata needed for preservation management purposes and help these to be applied to selected repository implementations of APSR partners.
This report covers work done at the National Library of Australia during the six-month period funded by the APSR from December 2005 to June 2006. The National Library of Australia has two digital repositories: PANDORA, Australia's web archive, and a digital repository for storing its own digital collections, which is managed using a system developed in-house called the Digital Collections Manager (DCM). The National Library of Australia has been actively involved in national and international digital preservation initiatives, and information about its activities can be found on the Digital Preservation part of the National Library of Australia's website.
The "selected repository implementations" studied were the Australian National University's (ANU) Demetrius repository based on DSpace and the University of Queensland's (UQ) eScholarship repository based on Fez and Fedora.
The original draft workplan implied an expectation that functional specifications for the collection of metadata during submission and ingest would be written, which the repositories would then implement. However, this was felt to be inappropriate for the selected repositories, because their systems were already implemented with established business and submission models. It was decided to place more emphasis on what metadata was collected than on how it was collected. The project would specify the metadata needed for preservation purposes, identify metadata that was not currently being collected and make recommendations for enhancements, but leave decisions on how to implement those enhancements to the repositories themselves.
Use cases were written for preservation events and their metadata, as this was an area lacking in both ANU and UQ repositories which has not been covered elsewhere. The other significant gap, preservation risk monitoring, is being addressed in the AONS (Automatic Obsolescence Notification System) project.
Although the title of the project was "PREMIS Requirement Statement" the project did not confine itself to PREMIS but considered all metadata, including PREMIS, necessary to support long term sustainability. For "implementing" PREMIS, the project did not specify how metadata was to be stored but recommended:
METS was chosen for the profile because it was the best understood of the standards for exchanging metadata about digital objects, it was the standard being used and discussed in the PREMIS implementors' group, and it could be applied to different types of digital objects.
The profile provides a concrete framework in which to implement PREMIS. Repositories could demonstrate they met the preservation metadata requirements by being able to produce documents conforming with the profile.
From the original draft workplan the following tasks were carried forward:
The main products of the project were:
Other products were included as part of the original work plan. The products are in the appendices.

This diagram shows part of an archive service framework (for more, see for example the Fedora service framework (2005-2007)). SIP (Submission Information Package), AIP (Archival Information Package) and DIP (Dissemination Information Package) are concepts from the Reference Model for an Open Archival Information System (OAIS).
The AIP in the archive should contain all the metadata required for long term sustainability and access. The AIP is conceptual: the metadata will not necessarily be stored as a single package in the repository and some metadata may be implicit (e.g. because it applies to all objects in the repository) rather than stored explicitly. The first product, the "List of preservation metadata elements", applies to the AIP.
The SIP is the object and metadata submitted, for instance by a depositor or harvester, through a submission system. It will contain a subset of what is in the AIP. The product "Gap reports for ANU and UQ" looks at what metadata is collected on submission and ingest.
The DIP, containing an object and metadata sent to a delivery system for presentation to a user, will also contain a subset of the AIP's metadata. This project did not look at this area.
The preservation monitoring and management system may identify objects which need some preservation action through a search protocol and report and may then perform those actions. The product "Preservation event use cases" pertains to this area.
The archive may produce a DIP when transferring custody of an object to a partner archive. This becomes a SIP for ingest into the partner archive. The product "Profile for exchanging metadata" pertains to this area.
This report in Appendix 1 details the metadata elements required for preservation purposes, i.e. metadata needed in order to provide meaningful long term access to digital objects. Metadata includes
This report also includes a list of "mandatory" elements, that is, things a repository should know about every object. PREMIS and this report do not specify how metadata is to be stored, or even whether it is stored explicitly: an element such as storageMedium, for instance, may not be stored for each object because it applies to every object in the repository. However, if an element is not stored explicitly for each object, it should be documented explicitly somewhere, e.g. in policy or procedures.
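As an illustration of the distinction between explicit and implicit metadata, the following minimal Python sketch looks up an element first in an object's own record and then in repository-wide documentation. The names REPOSITORY_DEFAULTS and resolve_element, and the sample values, are assumptions made for illustration only; they are not part of PREMIS or of any APSR repository.

```python
# Hypothetical repository-wide values, as might be documented in policy
# or procedures rather than stored with each object.
REPOSITORY_DEFAULTS = {"storageMedium": "hard disk"}


def resolve_element(object_metadata, element, defaults=REPOSITORY_DEFAULTS):
    """Return (value, source) for an element, or (None, "missing")."""
    if element in object_metadata:
        # Stored explicitly with the object.
        return object_metadata[element], "object"
    if element in defaults:
        # Implicit: applies to every object in the repository.
        return defaults[element], "repository policy"
    # Neither stored nor documented: a gap the repository should address.
    return None, "missing"


record = {"objectIdentifier": "obj-001", "size": 2048}
for name in ("size", "storageMedium", "fixity"):
    print(name, resolve_element(record, name))
```

An element reported as "missing" is one that is neither stored for the object nor documented at repository level, which is the situation the paragraph above advises against.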
This report in Appendix 2 is a "recommended list of supported formats". It is divided into material type, then for image, audio and video is further subdivided into recommended archival formats, formats in common usage which may be supported, e.g. formats produced by digital cameras and recording equipment, and unsupported formats. Most repositories will not support all of the recommended formats. File formats under "unsupported formats" and others not on this list should be converted to another format before being accepted by a repository because they are likely to be difficult to support in the long term.
Not included are specialist file formats which would be kept in specialist data repositories e.g. FITS (Flexible Image Transport System) used to manage astronomical data. Formats intended for delivery purposes, such as streaming media, are also not included.
This report in Appendix 3 recommends tools for identifying file formats and automatically extracting metadata. It includes an evaluation of the tools' capabilities and examples of output showing metadata that can be automatically generated. Output does not include empty elements for metadata not present in or not applicable to the files the tools are used on, and therefore the examples may not fully represent the capabilities of the tools. The National Library of Australia intends to do a more detailed audit to align metadata able to be output against recommended preservation metadata elements.
Gap reports for ANU and UQ are in Appendix 4 of this report. The report assesses the extent to which the ANU and UQ repositories already support the collection of preservation metadata elements and includes recommendations for enhancements where gaps were identified. The most significant gaps were:
This document describes the requirements for actions that need to be taken on objects in a digital preservation repository and for recording those actions or events. The following use cases are described:
A draft METS profile is proposed in Appendix 8 in the form of a table of rules and recommendations. The National Library of Australia needs to test the profile with ANU and UQ and, after further consultation with them and the wider digital preservation community, revise the profile. It can then be expressed in XML using the formal METS profile schema and submitted to METS for registration. The scenario this profile addresses is transferring custody of an object from one repository to another, because this is the scenario that requires the full set of preservation metadata. The draft profile is meant to be a common, non-system-specific profile to which APSR partner repositories can map their system-specific requirements.
These are the recommendations arising from the products above.
This part of the report details the metadata elements required for preservation purposes, i.e. metadata needed in order to provide meaningful long term access to digital objects. It also comments on elements which should be "mandatory".
The following diagram shows part of an archive service framework (for more, see for example the Fedora service framework (2005-2007)). SIP (Submission Information Package), AIP (Archival Information Package) and DIP (Dissemination Information Package) are concepts from the Reference Model for an Open Archival Information System (OAIS).

This report is concerned with the AIP which should contain all the metadata required for long term sustainability and access. The SIPs (e.g. the object and metadata submitted by a depositor or harvester) and the DIPs (e.g. object and metadata sent to a system for presentation to a user) will contain subsets of the AIP's metadata. The AIP is conceptual: the metadata will not necessarily be stored as a single package in the repository and some metadata may be implicit (e.g. because it applies to all objects in the repository) rather than stored explicitly.
To determine the requirements for preservation metadata, one needs to consider the uses to which the metadata will be put. Preservation metadata will be needed to support the following general scenarios:
While the general scenarios above seem straightforward, they could represent many different specific scenarios. For instance, a repository receives an access request for an object and retrieves the object, but is then unable to render it correctly. It is hard to imagine all the problems that might occur even 10 years in the future, let alone over a longer period. Therefore it seems wise to collect as much metadata as possible "just in case", as it may not be possible to obtain the metadata "just in time" when problems arise in the future.
The PREMIS Data Dictionary provides the "core preservation metadata element set". It was accepted that the elements were all necessary where applicable. It only remained to examine PREMIS to see how it should be interpreted and implemented.
However PREMIS's scope is deliberately limited to metadata which could apply to all digital objects regardless of format. It does not include file format specific metadata. It includes but does not go into detail about Intellectual Entity because "descriptive metadata is well served by existing standards". It also limits itself to "characteristics of rights and permissions concerned with preservation activities, not those associated with access and/or distribution".
Metadata needed for sustainable long-term access should therefore include not only PREMIS metadata but also:
This report should be used by repositories as a checklist against which to compare their own preservation metadata specification. It does not specify how or even whether metadata elements should be stored. There is no expectation, for instance, that the PREMIS elements will be stored as a group or as a discrete set of metadata, although they could be. On the contrary, the elements are likely to be stored in various places and some will be implicit perhaps because they apply to every object in the repository. However it is important to note that metadata not stored explicitly for each object should be documented explicitly somewhere e.g. in repository policy or procedures.
A repository can demonstrate its ability to meet these requirements by producing a document conforming to the draft METS profile in Appendix 8.
In terms of this Appendix, "mandatory" means a piece of metadata a repository is expected to "know" about each object to which the metadata applies, whether the metadata is stored explicitly or not.
In the draft METS profile for metadata exchange in Appendix 8, "mandatory" means the element must be present in a conforming METS document. More mandatory elements are specified in the METS profile as stricter requirements aid system interoperability. If data is unable to be supplied for a mandatory element, the element may contain values "not_applicable" or "unknown".
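The placeholder rule can be sketched in a few lines of Python. This is an illustration only: the MANDATORY list is a hypothetical subset chosen for the example, not the list defined in the draft profile, and complete_mandatory is an invented helper name.

```python
# Hypothetical subset of mandatory elements, for illustration only.
MANDATORY = ["objectIdentifier", "formatName", "storageMedium"]


def complete_mandatory(metadata, mandatory=MANDATORY, not_applicable=()):
    """Return a copy of metadata in which every mandatory element has a value.

    Elements named in not_applicable receive "not_applicable"; any other
    mandatory element without data receives "unknown".
    """
    completed = dict(metadata)
    for name in mandatory:
        if completed.get(name) in (None, ""):
            if name in not_applicable:
                completed[name] = "not_applicable"
            else:
                completed[name] = "unknown"
    return completed


print(complete_mandatory({"objectIdentifier": "obj-001"}))
print(complete_mandatory({"objectIdentifier": "obj-001"},
                         not_applicable=("storageMedium",)))
```

Keeping the two placeholder values distinct lets a receiving repository tell an element that does not apply apart from one whose value was simply not available.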
The elements in the PREMIS Data Dictionary are "core preservation metadata" elements. ALL elements should be collected by the repository if applicable. The elements are listed below with some brief notes on applying them. Object entity elements apply to the objectCategory "file".
See the PREMIS Data Dictionary for fuller definitions, rationale, obligation (i.e. mandatory or optional), repeatability, usage and other notes.
Comments have been made against some elements on how to apply them in the draft APSR METS profile.
For more information on events, see the event use cases in Appendix 6.
Agents may be persons, organisations, or software, associated with rights management and preservation events in the life of a data object.
This is the list of PREMIS elements mandatory for APSR repositories. It is a checklist of things a repository should know about EVERY object in the repository. If the information is not recorded explicitly about each object, it should be able to be determined from the repository itself or from repository documentation of policies, procedures etc. The draft METS profile in Appendix 8 also specifies mandatory elements for conforming METS documents.
The following elements are mandatory in the PREMIS data dictionary for objectCategory "file". They are not necessarily mandatory in the PREMIS xml schema since they may not apply to all types of objectCategory.
The following additional elements from PREMIS were regarded as important enough to be mandatory for APSR repositories by the project working group.
PREMIS does not mandate the existence of an Event. An Event can be linked to an Object through the Object entity's optional relationship or linkingEventIdentifier elements, or it can be linked to an Object through the Event's optional linkingObjectIdentifier element.
However we recommend the following be mandatory:
Validation events should be recorded e.g. that a file is of the format it says it is. Where validation is done on every file on ingest or a validation tool is run over a whole repository at a particular point in time, this fact should be recorded by the repository. Events are examined in more detail in the Event use cases in Appendix 6.
The mandatory elements (i.e. things the repository must know about every event) are:
The Event should "know" about the Object/s it acted on. However PREMIS does not specify a mandatory link between Events and Objects either in the Object entity or the Event entity. In the APSR draft METS profile, it will be mandatory for the Object entity to contain a linkingEventIdentifier to the (mandatory) Ingest event, and the Event entity will not need to contain the reciprocal linkingObjectIdentifier.
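A repository could check the proposed rule (every Object links to an Ingest event) along the following lines. This is a minimal Python sketch using plain dictionaries rather than PREMIS XML; the eventType value "ingest", the identifier values and the function name are assumptions made for the example.

```python
def objects_missing_ingest_link(objects, events):
    """Return identifiers of objects with no link to an ingest event."""
    # Identifiers of all events recorded as ingest events.
    ingest_ids = {
        event["eventIdentifier"]
        for event in events
        if event["eventType"] == "ingest"
    }
    missing = []
    for obj in objects:
        links = obj.get("linkingEventIdentifier", [])
        # The object conforms if at least one of its links is an ingest event.
        if not ingest_ids.intersection(links):
            missing.append(obj["objectIdentifier"])
    return missing


events = [
    {"eventIdentifier": "evt-1", "eventType": "ingest"},
    {"eventIdentifier": "evt-2", "eventType": "validation"},
]
objects = [
    {"objectIdentifier": "obj-1", "linkingEventIdentifier": ["evt-1", "evt-2"]},
    {"objectIdentifier": "obj-2", "linkingEventIdentifier": ["evt-2"]},
    {"objectIdentifier": "obj-3"},
]
print(objects_missing_ingest_link(objects, events))  # ['obj-2', 'obj-3']
```

Because the link is held on the Object side only, the check never needs to consult a linkingObjectIdentifier on the Event, which matches the one-directional linking proposed for the draft profile.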
PREMIS does not mandate the existence of an Agent entity since linkingAgentIdentifier is optional in the Event entity.
However we recommend that repositories should know about Agent if the Event is one which changes an Object. Agent should, for example, identify the software used. Although the software may already be described in the Object entity (in creatingApplication) or in the Event entity, placing it in Agent in a document conforming with the APSR METS profile will facilitate mapping in the receiving repository's database. Agent should also be used for an organisation if an organisation other than the transferring repository was responsible for an Event.
PREMIS does not mandate the existence of Rights since linkingPermissionStatementIdentifier is optional in the Object entity. PREMIS concentrated on rights concerned with preservation activities.
Rights should be mandatory insofar as repositories should have in place agreements, or conditions to which depositors agree when they deposit material in the repository, although this may not apply when material is out of copyright.
Descriptive metadata is considered mandatory for APSR repositories but this project did not examine this area in detail and does not prescribe a particular metadata scheme. Repositories will need to be able to output descriptive metadata in MODS to conform with the APSR METS profile for metadata exchange, but should store and be able to output descriptive metadata in a form which retains the granularity of all available metadata.
Descriptive metadata includes not only metadata such as creator, title, date, subjects, but also contextual metadata. Contextual metadata provides meaning to or aids interpretation of an object.
For some objects structural metadata is needed for a repository to be able to reconstitute a whole digital object from its parts. A repository also needs to be able to display or present an object in a way that allows a user to understand how an object is related to its parts or to a greater whole. It may be stored in a PREMIS Object entity "relationship" element but may be better stored as a structural map, a manifest of files or a set of relationships.
File format specific metadata is needed to record the characteristics of a digital object so that it can be accurately rendered. In some cases, without file format specific metadata a system may not be able to render a digital object at all.
The following metadata schemes and extensions to them proposed by this project are recommended for use in the APSR METS profile. The schemas may include mandatory elements. The extensions will be published on the National Library of Australia website.
This section is intended to be used as a supplement to the Library of Congress Audiovisual Prototyping Project, http://www.loc.gov/rr/mopic/avprot/ (2004). Indicated in this section are alternative field names which have been used in the National Library of Australia’s Digital Collections Manager (DCM) as well as a set of additional metadata fields which is itself an extension to the Library of Congress METS extension schema.
It is likely that automated harvesting of data for many of these metadata fields is not currently possible; however, recording such data will assist in long-term management of the files.
It should also be noted that the set of suggested additional metadata fields is not necessarily complete and it is intended that other organisations/institutions provide further input.
MIX is the recommended extension schema to be used for image metadata. MIX is a schema endorsed by the METS Editorial Board for use with METS. The following is the introduction from the MIX home page:
The Library of Congress' Network Development and MARC Standards Office, in partnership with the NISO Technical Metadata for Digital Still Images Standards Committee and other interested experts, is developing an XML schema for a set of technical data elements required to manage digital image collections. The schema provides a format for interchange and/or storage of the data specified in the NISO Draft Standard Data Dictionary: Technical Metadata for Digital Still Images (Version 1.2). This schema is currently in draft status and is being referred to as "NISO Metadata for Images in XML (NISO MIX)". MIX is expressed using the XML schema language of the World Wide Web Consortium. MIX is maintained for NISO by the Network Development and MARC Standards Office of the Library of Congress with input from users.
The Library of Congress Audio (Source) Data Dictionary, which was developed as part of the Audio-Visual Prototyping Project, is the recommended base extension schema for audio metadata. It can be found at http://www.loc.gov/rr/mopic/avprot/DD_ASMD.html. The following metadata fields are intended as a further extension to the Library of Congress Audio (Source) Data Dictionary.
The Library of Congress Video (Source) Data Dictionary, which was developed as part of the Audio-Visual Prototyping Project, is the recommended base extension schema for video metadata. It can be found at http://www.loc.gov/rr/mopic/avprot/DD_VSMD.html. The following metadata fields are intended as a further extension to the Library of Congress Video (Source) Data Dictionary.
Schema for Technical Metadata for Text (created by Jerome McDonough, Elmer Bobst Library, New York University) is endorsed by the METS Editorial Board for use with METS.
Schema
Documentation
Further analysis of additional metadata fields required for text documents should be carried out.
Terminology can vary. The following list of alternative names is provided for clarity.
It should be mandatory to record access rights for materials with restricted access conditions. If no access rights are recorded it would be assumed that there are no access restrictions.
This report does not prescribe a particular metadata scheme. Possibilities include METS Rights, PREMIS Rights, Creative Commons licences and XACML.
Appendix 2 comprises a list of formats likely to be supported by repositories. Most repositories will not support all of these formats. File formats not on this list are likely to be more difficult to support in the long-term. It is acknowledged that constant Information Technology development will produce new and improved archival formats, and it is intended that any new additions to this list be included where appropriate. The list not only includes recommended archival formats but also other formats likely to be accepted by repositories e.g. formats produced by digital cameras or recording equipment.
Archival formats should ideally be based on open standards, but widely used and supported, well documented proprietary formats may be acceptable. It should be noted that while appropriate archival and "commonly in use" formats have been listed here, this document does not indicate recommended quality standards for digital media items, and appropriate guidelines for such should be sought. Files containing any form of compression should be carefully considered.
Not included are specialist file formats which would be kept in specialist data repositories e.g. FITS (Flexible Image Transport System) used to manage astronomical data. Formats intended for delivery purposes, such as streaming media, particularly where formats are non-stand-alone and dependent on specific protocols for access (such as RTSP), are also not included.
This list was developed in consultation with Kevin Bradley who co-authored Survey of data collections: a research project undertaken for the Australian Partnership for Sustainable Repositories.
These formats are recommended archival formats and are included here in order of preference of the preferred archival format.
While TIFF is the recommended archival format, both Multi-part TIFF files and Multi-layered TIFF files are not necessarily considered archival file formats and where possible Multi-part TIFF files should be stored as sets of single images and Multi-layered TIFF files should be flattened to single layer images. Each repository may decide to develop their own policies regarding these variations of the TIFF format.
These formats are not recommended as archival formats, however they are in common usage and so are included. As they are not archival formats no order of preference is indicated. Files in the following formats should preferably have a copy created in an archival format where possible.
These formats are not archival formats and are not recommended as supported formats by repositories. Files in these formats should be converted to recommended archival formats before being accepted by a repository as they are likely to be difficult to support in the long-term.
It should be noted that some audio formats are a combination of a container or "wrapper" format and a file content format, and so are essentially a combination of two formats.
These formats are recommended archival formats and are included here in order of preference of the preferred archival format. It should be noted that some AV formats are a combination of a wrapper format and a file content format, and so are essentially a combination of two formats.
These formats are not recommended as archival formats, however they are in common usage and so are included. As they are not archival formats no order of preference is indicated. Files in the following formats should preferably have a copy created in an archival format where possible.
These formats are not archival formats and are not recommended as supported formats for repositories. Files in these formats should be converted to recommended archival formats before being accepted by a repository as they are likely to be difficult to support in the long-term.
These formats are recommended archival formats and are included here in order of preference of the preferred archival format. It should be noted that some AV formats are a combination of a wrapper format and a file content format, and so are essentially a combination of two formats.
Currently there is no archival video standard; however, a number of options are available. Unlike other media types such as audio or image, video requires large amounts of storage space. For this reason, some compressed formats are considered suitable as archival formats for the time being, until storage of large video files and a recommended archival video standard become a reality. These formats are recommended archival formats and are included here in order of preference of the preferred archival format.
These formats are not recommended as archival formats, however they are in common usage and so are included. As they are not archival formats no order of preference is indicated. Files in the following formats should preferably have a copy created in an archival format where possible.
These formats are not archival formats and are not recommended as supported formats for repositories. Files in these formats should be converted to recommended archival formats before being accepted by a repository as they are likely to be difficult to support in the long-term.
These formats are recommended archival formats and are included here in order of preference of the preferred archival format. While formats such as Microsoft Word are commonplace, it should be noted that this is a proprietary format and is likely to be difficult to support in the long-term.
These formats are not recommended as archival formats, however they are in common usage and so are included. As they are not archival formats no order of preference is indicated. Files in the following formats should preferably have a copy created in an archival format where possible.
These formats are not archival formats and are not recommended as supported formats by repositories. Files in these formats should be converted to recommended archival formats before being accepted by a repository as they are likely to be difficult to support in the long-term. However, it should be noted that some companies creating proprietary formats are considering developing future open format versions.
Databases contain a larger degree of complexity than other individual files. While a full analysis of database formats was not carried out, only databases with a simple structure are able to be supported. Databases containing complex relationships cannot be supported at this stage. In general, documentation of databases including rules and relationships should also be archived.
These formats are recommended archival formats and are included here in order of preference of the preferred archival format. Only simple databases whose raw data can be turned into structured text, such as databases where all data can be extracted via a single join query, are considered a recommended archival format.
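The idea of a simple database whose data can be extracted via a single join query and turned into structured text can be sketched as follows. The example uses Python's standard sqlite3 and csv modules with an invented two-table schema; the table names, column names and sample rows are assumptions for illustration, and CSV is only one possible structured text format.

```python
import csv
import io
import sqlite3


def export_join_to_csv(conn, query):
    """Run one query and return its result set as CSV text with a header row."""
    cursor = conn.execute(query)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Column names come from the query itself, so the output is self-describing.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Hypothetical two-table database held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE paper (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'A. Author');
    INSERT INTO paper VALUES (10, 1, 'First paper');
    INSERT INTO paper VALUES (11, 1, 'Second paper');
""")

QUERY = """
    SELECT paper.id AS paper_id, paper.title AS title, author.name AS author
    FROM paper JOIN author ON paper.author_id = author.id
    ORDER BY paper.id
"""
csv_text = export_join_to_csv(conn, QUERY)
print(csv_text)
```

A database that needs several interdependent queries, stored procedures or application logic to reproduce its content would fall outside this simple case, and the query and schema documentation would need to be archived alongside any export.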
These formats are not recommended as archival formats, however they are in common usage and so are included. As they are not archival formats no order of preference is indicated. Files in the following formats should preferably have a copy created in an archival format where possible as proprietary formats are likely to be difficult to support in the long-term.
Complex databases were considered out-of-scope for this project and so are considered to be unsupported formats.
While the Portable Document Format (PDF) is a proprietary format, and proprietary formats are normally considered to be unsupported formats, PDF should currently be the exception. This is largely because it is a format in common usage and a large proportion of academic papers are published and distributed in this format. Further work would need to be done on this format as it contains both text and image, and because there are several types of PDF, including PDF/A, a proposed archival standard for PDF accepted as an ISO standard in 2005.
This project did not address websites specifically. The National Library of Australia is part of the International Internet Preservation Consortium, which among other things is fostering the development of common tools, techniques and standards for website archiving.
Multimedia files (such as Director, Flash and Microsoft Powerpoint) were considered out-of-scope for this project. However, example output files from metadata extraction tools have been provided for a range of multimedia formats.
Other formats that were considered out-of-scope of this project are considered unsupported formats currently.
Recommendations on the range of metadata elements to be collected by repositories are set out in Appendix 1 of this report. The degree to which repositories can meet such recommendations will depend on the metadata that can be re-used from existing records, policies and documentation, supplied by depositors, recorded as part of repository processes, or extracted from the materials themselves.
Given the volume of metadata that may be required or available, automated processes for collection of metadata are preferable, especially for metadata extraction from the materials themselves. A number of tools are available to address these needs in varying degrees and to provide some of the details required in an automated way. A selection of such tools is briefly described and compared in this Appendix. A more detailed alignment of metadata output from these tools against element recommendations will be made available when completed.
There are several aspects of metadata collection and the archiving process that may be addressed by tools:
At present, tools tend to cover one or more aspects of the archiving process and metadata collection, but no one tool yet covers all. Tools may also cover these aspects to varying degrees.
The range of formats covered by tools can also vary, and it may be useful to divide available tools into several classes, based on their format coverage:
For the range of formats intended to be supported in APSR repositories, several tools may be suitable. It is likely that more than one will be needed to obtain a full range of metadata.
Only tools in the first category, those able to extract metadata from a range of materials, are discussed below. Enhancements to the PRONOM service of The National Archives (UK) may, in the future, assist in locating tools capable of extracting metadata from single specific formats.
Available from The National Archives (UK) - http://www.nationalarchives.gov.uk/aboutapps/pronom/tools.htm
DROID is a platform-independent Java-based application which identifies the format and version of files by comparing file data streams against a set of known signature byte sequences. The signature byte sequences are held in a signature file, which may be updated automatically from The National Archives web site by the DROID application. In March 2006, Version 9 of the signature file contained signature byte sequences for 57 named file formats (including 159 versions of those formats), and a further 387 tentative file format indicators based on file extension alone.
The main function of DROID is to identify a wide range of file formats as conclusively as possible, including versions. Where a number of possible matches are identified, for example, where multiple versions of a format contain the same signature byte sequences, all matches are listed, along with an indication of the degree of match (e.g. Tentative, Positive). DROID may also notify of suspected mismatches between the format as identified by internal signatures and the filename extension.
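The matching behaviour described above can be illustrated with a minimal sketch. It is not DROID's implementation: the `SIGNATURES` table and the `identify` function are hypothetical, and only a handful of well-known leading byte sequences are included, whereas the real signature file is far larger and also distinguishes versions.

```python
from pathlib import Path

# Illustrative signatures only: (format name, leading bytes, usual extensions).
# This table is an assumption for the example, not DROID's signature file.
SIGNATURES = [
    ("PDF", b"%PDF-", {".pdf"}),
    ("PNG", b"\x89PNG\r\n\x1a\n", {".png"}),
    ("GIF", b"GIF87a", {".gif"}),
    ("GIF", b"GIF89a", {".gif"}),
    ("JPEG", b"\xff\xd8\xff", {".jpg", ".jpeg"}),
]


def identify(filename: str, data: bytes) -> dict:
    """Return every format whose signature matches, with a degree of match."""
    ext = Path(filename).suffix.lower()
    positive = [(name, exts) for name, sig, exts in SIGNATURES
                if data.startswith(sig)]
    if positive:
        names = sorted({name for name, _ in positive})
        # Flag a suspected mismatch when none of the matched formats
        # normally uses this filename extension.
        mismatch = not any(ext in exts for _, exts in positive)
        return {"matches": names, "degree": "Positive",
                "extension_mismatch": mismatch}
    # No signature matched: fall back to a guess from the extension alone.
    tentative = sorted({name for name, _, exts in SIGNATURES if ext in exts})
    return {"matches": tentative,
            "degree": "Tentative" if tentative else "None",
            "extension_mismatch": False}
```

For instance, `identify("paper.doc", b"%PDF-1.4")` reports a positive PDF match together with an extension mismatch, while a file with no recognised signature is identified only tentatively from its extension.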
DROID identifies a wider range of formats than the other tools noted (JHOVE and the National Library of New Zealand Metadata Extraction Tool), and, where available, indicates the Persistent Unique Identifier (PUID) that has been assigned to the identified format within The National Archives format registry, PRONOM. However, it does not extract any further metadata from files, nor generic metadata about them (e.g. creation date).
DROID could be used by repositories at least to provide file format identity information to fulfil the PREMIS mandatory elements:
Further format specific tools for metadata extraction might then be invoked based on format identifications from the DROID output.
Samples of output:
Available from the National Library of New Zealand - http://www.natlib.govt.nz/en/whatsnew/4initiatives.html#extraction
The National Library of New Zealand Metadata Extraction Tool is also a platform-independent Java-based application, designed to extract preservation metadata from a range of formats. Metadata may be extracted for each format by a specific modular "adapter", and can be output to XML in either an "adapter-native" schema or in a schema complying with the National Library of New Zealand's Preservation Metadata scheme. The tool is designed to be extensible, allowing creation of additional adapter plug-ins by other parties and the structuring of output via XSLT to suit alternative metadata schemes. The tool is capable of recognising and processing a range of formats and versions of formats, but does not currently appear to validate files against their identified format.
The range of formats which can be recognised and for which metadata can be extracted are currently:
Although the range of formats for which there are adapters is currently small, these cover file formats that may be commonly encountered, and the amount of metadata that is extracted can be quite extensive, particularly in "native" mode. If a format is not recognised, generic file metadata can nonetheless be collected, such as filename, size and date created. The tool can be run via either a Windows interface or from a command line.
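As a rough illustration of the generic file metadata mentioned above, the following sketch gathers a filename, size and date using only the Python standard library. It is not part of the Metadata Extraction Tool; the function name and the output keys are assumptions. The last-modified time is used because a true creation date is not available on every platform.

```python
import os
from datetime import datetime, timezone


def generic_file_metadata(path: str) -> dict:
    """Collect format-independent metadata for any file."""
    st = os.stat(path)
    return {
        "filename": os.path.basename(path),
        "size_bytes": st.st_size,
        # Last-modified time in UTC; stands in for "date created" here.
        "modified": datetime.fromtimestamp(
            st.st_mtime, tz=timezone.utc).isoformat(),
    }
```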
Samples of output:
Available from Harvard University Library: http://hul.harvard.edu/jhove/
JHOVE is also a platform-independent Java-based application, primarily designed to identify a range of formats and validate files against their purported formats. It can also recognise format sub-types and versions. In characterising files, JHOVE is also capable of extracting technical metadata from the range of formats and producing XML-encoded or plain text output. JHOVE is also modular and extensible in design, allowing creation of additional modules as needed.
There are currently modules available for characterisation of 12 main format types, comprising around 52 versions or distinct subtypes of those formats. The main formats recognised and for which metadata can currently be extracted are:
If a format is not recognised, the file is classed as a "bytestream", which is always reported as well-formed and valid. The tool can be run via either a Windows interface or from a command line.
The metadata extracted can be quite extensive. For images and audio, XML output can be generated according to the MIX schema for still images and the Audio Engineering Society (AES) schemas for audio objects and time code formats.
Again, not all the formats to be accepted by APSR repositories are recognised by JHOVE, and other tools may also be required.
Samples of output:
| Tool | Identify format (Tentative) | Identify format (Confirm) | Identify versions | Validate format | Collect generic file MD | Collect material type MD | Collect file format MD |
|---|---|---|---|---|---|---|---|
| DROID | Yes [546 formats] | Yes [159 formats] | Yes | No | No | No | No |
| NLNZ-MET | Yes [15 formats] | (Some) | (Some) | No | Yes | Yes | Yes |
| JHOVE | Yes [52 formats] | Yes [52 formats] | Yes | Yes | Yes | Yes | Yes |
This analysis was current at 19 May 2006. The reports look at the level of support for the core preservation metadata elements (i.e. PREMIS semantic units) and include recommendations for enhancements where gaps were identified.
| PREMIS semantic unit | Supported? | Comments on current level of support | Possible enhancements |
|---|---|---|---|
| Object Identifier | Supported | Items are given globally unique Handles. Files (DSpace bitstreams) are given a local database identifier only. | DSpace are planning to use infoURIs for bitstreams which would be globally unique. |
| Preservation Level | Supported | DSpace has 3 support levels (Supported, Known, Unsupported) but ANU doesn't assign them. Can be defaulted from the file format. | Content policy development around these levels. The number of levels could be increased if necessary to conform with a generic set of service levels. |
| Object Category | Supported | ||
| Composition Level | Not supported; not applicable | Default would be 0 for all files in supported formats. Files that have a composition level of higher than 0 (e.g. zip files) would fall into the category of unknown, unsupported format. | |
| Fixity | Supported | DSpace calculates checksum on ingest. | Checksum checker is coming in next version of DSpace (v1.4) |
| Size | Supported | DSpace records size on ingest. | |
| Format | Supported | DSpace determines this from the filename extension. Format version is not determined at present. | Format validation by running a tool such as JHOVE or DROID over the repository or on ingest. Tools could also determine format version. |
| Significant Properties | Not supported; not applicable | If required, submission forms could be modified to ask for this information. | |
| Inhibitors | Not supported; not applicable | Policy would be not to support files to which this applies. | |
| Creating Application | Not supported | ANU policy is to avoid providing preservation level support for formats where creating application is important, and instead promote popular, open formats. The date the file was originally created not supported unless explicitly provided in the metadata being submitted. | Present or future tools may be able to provide this information automatically. If required, submission forms could be modified to ask for this information. Descriptive metadata may indicate date of creation for born digital items. |
| Original Name | Supported | ||
| Storage | Supported | ||
| Environment | Not supported | Not an issue for supported formats. | Global format or environment registries (under development) will meet this need. If required for special cases, submission forms could be modified to ask for this information. |
| Signature Information | Not applicable | ||
| Relationship | Not supported | Only supported currently through DC.Relation, which is at the item, not the file level. | Relationships including structural maps could be stored as a serialised bitstream with the object. |
| Linking Event | Partially supported | Ingest event can be determined from database. History logging module exists though it doesn't work properly, has performance issues, and is not being used. The checksum checker will be separate from the history module and the logging systems for the database (eg editing and viewing) are also separate. Theoretically events could be got from logs but it might not be easy. | Fixity check logging will be possible in next version of DSpace. If JHOVE or DROID are run over the repository, the validation event could be determined from the JHOVE or DROID output stored with the object. Ideally the logging systems should be integrated and work properly to record events and their outcome for a particular object. |
| Linking Intellectual Entity (Descriptive metadata) | Supported | Each item has a qualified Dublin Core record. Other descriptive metadata may be held in serialised bitstreams. | |
| Linking Permission Statement | Supported | Some rights may be stored in DC.Rights. Licences (including Creative Commons) may be stored with the object. | |
| PREMIS semantic unit | Supported? | Comments on current level of support | Possible enhancements |
|---|---|---|---|
| Object Identifier | Supported | Persistent identifier (PID) at item level is UQ prefix followed by a number assigned by Fedora. Datastreams (files) associated with an item are identified by their filenames. infoURIs containing the PID and filenames can be constructed. | |
| Preservation Level | Not supported | Haven't needed it yet. Could be defaulted to a single level. They have just received a request for quotation for repository services. There is nowhere specific to store service levels. | Could add field to descriptive metadata form or store service level agreement as datastream with the object. |
| Object Category | Supported | Preservation metadata derived from JHOVE is stored at file level. | |
| Composition Level | Not applicable | Default would be 0 for all files in supported formats. | |
| Fixity | Not supported | Checksums are not being generated. | Checksums should be generated on ingest and stored. |
| Size | Supported | Is in JHOVE metadata. | |
| Format | Supported | Is in JHOVE metadata. Includes version. | |
| Significant Properties | Not supported | Could be stored in description. | If required, could be added to the submission forms. |
| Inhibitors | Not applicable | ||
| Creating Application | Not supported | JHOVE doesn't do application names and versions (but does get camera names for JPEGs). Original creation date of files not stored either. | If required, could use another tool which detects versions. Could be added to submission forms. Descriptive metadata may indicate date of creation for born digital items. |
| Original Name | Supported | File keeps its original name unless it doesn't conform to NCName. Not sure if original name is kept in this case. | |
| Storage | Supported | ||
| Environment | Not supported | Not an issue for supported formats. | Global format or environment registries (under development) will meet this need. If required for special cases, submission forms could be modified to ask for this information. |
| Signature Information | Not applicable | ||
| Relationship | Not supported | Fedora RELS-EXT is being used for relationships to other items. RELS-INT for relationships between datastreams is in the current version of Fedora but Fez is not using it yet. Internal relationships are only implicit through filenaming conventions at present - there is no metadata about relationships. | Implementation of Fedora's RELS-INT. |
| Linking Event | Not supported yet | Fedora has some audit trail recording. Fez is currently being developed to use this. History logging could be done automatically or manually. | Continued implementation of history logging. |
| Linking Intellectual Entity (Descriptive metadata) | Supported | Dublin Core record. Additional descriptive metadata may be stored. | If required, additional fields can be added to submission forms. |
| Linking Permission Statement | Supported | Fez has sophisticated and flexible rights management. Roles and groups (eg Fez groups, Shibboleth groups, targeted IDs) can be linked to different actions. | |
The aim of this product was to look at workflow models for different types of digital content (e.g. electronic publishing, digitisation of physical objects) and to recommend how metadata should be acquired and what metadata a SIP should contain.
At this stage it was felt not to be appropriate to develop submission models for ANU and UQ as their systems already have underlying data models and submission processes, both established and under development. This project has specified in Appendices 1, 4 and 6 what metadata is required, but leaves decisions on how to enhance systems to collect it to the repository administrators and developers. The National Library of Australia has been reviewing the architecture of its Digital Collections Manager and may in future develop general submission models which may be useful to other repositories.
Regardless of the type of digital content, the main methods of submission involve:
In each case a SIP is compiled which the repository can ingest. In the first two cases, the SIP is compiled after the web form is completed. In the latter two cases, the batch or harvested submission may already be in the form of a SIP compiled by an external workflow system or tool. Other APSR projects are developing examples of these tools e.g. in the Bidwern project and FIDAS (Fieldwork Data Sustainability) project (the tool is called FieldHelper). Among other things, these tools help researchers organise and tag their files, then automatically prepare the data for uploading to institutional repositories, for instance, by compiling SIP packages as METS documents for ingest to DSpace. Work is also being done with electronic journal publishing systems.
Whatever method is used for the actual submission, the aim should be to capture as much metadata as possible (automatically where possible) as a by-product of creating a digital object.
This document describes the requirements for
Lavoie says about actions in the OAIS Functional Model:
" ..the Archival Storage function is responsible for ensuring that archived content resides in appropriate forms of storage ... and that the bit streams comprising the preserved information remain complete and renderable over the long-term. To meet this responsibility, Archival Storage periodically undertakes procedures such as media refreshment or format migration. The Archival Storage function also implements various safeguard mechanisms, such as error-checking procedures, to evaluate the outcome of preservation processes, as well as disaster recovery policies to mitigate the effects of catastrophic events .."
The PREMIS Data Dictionary says about documenting events:
"An Event is an action that involves at least one object or agent known to the preservation repository." "Documentation of actions that modify (that is, create a new version of) a digital object is critical to maintaining digital provenance, a key element of authenticity." "Even actions that alter nothing, such as validity and integrity checks on objects, can be important to record for management purposes."
These requirements are primarily concerned with events in the above context, that is, actions, relevant to preservation, on "master" or archival copies of objects. It is recognised that repositories usually have other purposes in addition to preservation and that display copies, supporting files, metadata etc may exist in repositories as digital objects in addition to the archival "content" object. A repository may log actions and events for various purposes. The requirements listed here may therefore only be a subset of an individual repository's requirements.
These requirements are deliberately generalised in order to be applicable to any repository, regardless of any particular software, implementation or architecture. Repository administrators and developers would need to determine more specifically how the requirements would be implemented in their repositories.
The use cases below apply both to actions that are performed on a single object and actions that are performed on a batch of objects (or all objects) in a repository.
These are the Actors (roles) in the use cases below. The Actors are "systems" but these may be manual systems (i.e. people), automated systems or a mixture.
This use case applies to, for instance, PREMIS eventType
Message digest calculation and format validation, and if applicable, fixity check and virus check, should ideally be done on ingest of an object. In this case they may or may not be recorded as separate events but if not, they should be noted in the event details or in repository policy and procedures.
If not done at ingest, these events may occur some time later, when they may be recorded as separate events.
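A minimal sketch of message digest calculation and a later fixity check is given below, using Python's standard `hashlib` module. The function names and the choice of SHA-256 are assumptions, and the keys of the returned record are loosely modelled on PREMIS Event naming for illustration only; a repository would map them to its own data model.

```python
import hashlib
from datetime import datetime, timezone


def message_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Calculate the digest that is stored with the object on ingest."""
    return hashlib.new(algorithm, data).hexdigest()


def fixity_check(object_id: str, data: bytes, stored_digest: str,
                 algorithm: str = "sha256") -> dict:
    """Recompute the digest and return a record of the check as an event."""
    matches = message_digest(data, algorithm) == stored_digest
    return {
        # Key names are illustrative, not a complete PREMIS Event.
        "eventType": "fixity check",
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventDetail": f"{algorithm} digest recomputed and compared",
        "eventOutcome": "pass" if matches else "fail",
        "linkingObjectIdentifier": object_id,
    }
```

The event is recorded whether or not the check passes, since even an action that alters nothing can be important for management purposes.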
Trigger: Identification of preservation risk (by a person or preservation monitoring system) or part of an auditing process (one off or regularly scheduled)
This use case would apply to, for instance, PREMIS eventType
An event which changes the preservation copy of an object should always be recorded.
Trigger: Identification of preservation risk (by a person or preservation monitoring system) or implementation of a policy decision e.g. to migrate all files of a certain format to a newer, better supported format.
Base course: A new object is created and the old object is kept.
Alternative course: A new object is created and the old object is not kept.
After step 9:
Repositories will have their own policies on the circumstances where deleting an object is allowed. It is expected that some metadata about an object will be kept even though the object itself is removed. This should be at least the object identifier and some descriptive metadata (or a link to it e.g. through a relationship with a current object).
Trigger: Implementation of a policy decision to delete an object. For instance, a decision to delete all objects of a certain type e.g. non-current versions of masters, or a policy to only keep certain objects for 10 years.
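The expectation that some metadata survives deletion can be sketched as follows. This is an illustration under stated assumptions, not a prescribed design: the repository is modelled as a plain dictionary, and `delete_object` and the record keys are hypothetical names.

```python
from datetime import datetime, timezone


def delete_object(repository: dict, object_id: str, reason: str) -> dict:
    """Remove an object's content but keep its identifier and description."""
    record = repository[object_id]
    tombstone = {
        "identifier": object_id,
        "descriptive_metadata": record["descriptive_metadata"],
        "content": None,  # the bytes themselves are no longer held
        "events": record.get("events", []) + [{
            "eventType": "deletion",
            "eventDateTime": datetime.now(timezone.utc).isoformat(),
            "eventDetail": reason,
        }],
    }
    repository[object_id] = tombstone
    return tombstone
```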
An example of this use case is a depositor changing the content of a document.
Particularly for internal documents, reports etc, the Workflow System may well assign a version number to the new document. However although this can be regarded as a new "version" of the old object, it is different from Use case 2 above. The repository should differentiate between different versions of the same content, and "versions" where the content is not the same.
Instead, this new "version" should be regarded as a new "work" (PREMIS Intellectual Entity) with a relationship to the old "work". It should have its own descriptive metadata distinct from the descriptive metadata of the old work, similar to the way different editions of a book have their own records in a library catalogue.
This new work should be able to have its own preservation policy. For example, the latest "version" may need to be kept indefinitely, whereas the earlier "versions" may only need to be kept for a defined period. Or the policy may be to only keep the latest "version" (which has been authorised through the Workflow System before submission to the Repository) and delete previous "versions" immediately.
Trigger: The depositor may use a Workflow System to take a copy of the object in the repository, edit it and re-submit it, or the depositor may edit their own local copy of the original and submit it through the Workflow System as a new "version" of the original object.
For example, the descriptive metadata about a photograph may need to be changed when new information comes to light about the people depicted in it.
This use case may be applied to administrative, structural etc as well as descriptive metadata.
It is up to individual repositories whether only one version or different versions of the metadata are kept, or even if a record of changes is kept. From a preservation point of view the authenticity of the content object is most important and keeping a record of changes to the content object is mandatory, but it is optional for the metadata. Whether or not to keep previous versions of the metadata depends on its significance and what it might be used for. It is however usual to keep at least the date the metadata was originally created and by whom (organisation rather than person) and the date it was last updated and by whom.
What details are recorded about an event and how they are stored will depend on the particular Repository's data model and architecture.
For the purposes of publishing or exchanging metadata about an archival object, the Repository should be able to conform to the proposed APSR METS profile. This profile specifies that a history of events describing an object's provenance be output in digiprovMD using the schema for the PREMIS Event Entity (see the PREMIS Data Dictionary.)
These are the semantic units of the PREMIS Event Entity (NR=not repeatable; R=repeatable; M=mandatory; O=optional):
The profile also says that additional information about agents associated with events may optionally be recorded. Agents may be persons, organisations or software.
The PREMIS Agent Entity has the following semantic units:
Even if there is no additional information for software or a device, placing it in Agent in a document to conform with the draft APSR METS profile will facilitate mapping in the receiving repository's database.
Additional information considered useful but not covered by PREMIS should be recorded. Other more detailed schemas to describe events may emerge and/or PREMIS Event may be enhanced in the future.
This project is using PREMIS in two ways:
The APSR repositories will aim to be PREMIS conformant in as far as being able to produce a METS document with metadata in a container using a PREMIS namespace that is valid according to the PREMIS XML schemas. If the data for a semantic unit were not available, the unit would have to be included with a value of "unknown" or "not applicable". However, in this case, i.e. if the repository could not supply a real value for a mandatory semantic unit, the repository could be regarded as not PREMIS conformant.
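A repository could check for this situation before export. The sketch below flags mandatory semantic units that carry only a placeholder value. The `MANDATORY_UNITS` list is a hypothetical selection for illustration; the PREMIS Data Dictionary remains the authoritative source for which units are mandatory.

```python
PLACEHOLDERS = {"unknown", "not applicable", ""}

# Hypothetical selection for illustration only; consult the
# PREMIS Data Dictionary for the authoritative list.
MANDATORY_UNITS = ["objectIdentifier", "objectCategory",
                   "compositionLevel", "formatName"]


def missing_real_values(record: dict) -> list:
    """List mandatory units whose value is absent or only a placeholder."""
    return [unit for unit in MANDATORY_UNITS
            if str(record.get(unit, "")).strip().lower() in PLACEHOLDERS]
```

A record for which this list is non-empty may still validate against the schema, but would fall into the case described above.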
Some issues encountered while examining the PREMIS Data Dictionary were raised with the PREMIS Implementors' Group and added to the errata for fixing in the next version of the data dictionary e.g.
An interpretation issue was found with "relationship" in the PREMIS Data Dictionary:
STRUCTURAL RELATIONSHIPS: Under relationshipType on page 2-62 it says "structural=a relationship between parts of an object". This accords with what PREMIS says on page 1-8 i.e. structural relationships are about how to put back together a digital object which consists of more than one part or file. However the paragraph under Derivation relationships on page 1-9 says "A structural relationship among objects can be established by an act of derivation before the objects were ingested by the repository ... " and "..They do not have derivation relationships with each other, but do have a structural relationship as siblings (children of a common parent)". It's confusing to describe this as a structural relationship because the 'siblings' are not part of the same digital object - they belong to different representations.
"PARENT" AND "CHILD": On page 2-63 it says "is child of = the object is directly subordinate in a hierarchy to the related object ..." and "is parent of = the object is directly superior in a hierarchy to the related object ...", but it doesn't say what the hierarchy relates to. In the paragraph (on page 1-9) referred to above, "parent" refers to the object from which the "children" are derived, whereas on page 6-5 "children" is used to describe components of a web site. In the former case the parent has a "source of" relationship with the children; in the latter case the children have an "is part of" relationship with the (parent) website. In NLA's Digital Collections Manager system the term "child" is used to denote "part of" at the Intellectual Entity level. Because "parent" and "child" can be used in various contexts, it is recommended to avoid "is parent of", "is child of", "has child" and "has parent" in relationshipSubType and that more precise terms such as "source of", "derived from", "is part of", "has part" be used instead.
Allowing reciprocal relationships to be described in two places can give rise to data integrity problems e.g. one object may have an "is part of" relationship to a second object, which may in turn have an "is part of" relationship to the first object.
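One way a repository might detect such problems is sketched below. It is illustrative only: relationships are modelled as (subject, relationship, object) triples, the `INVERSE` table uses the more precise terms recommended above, and the function name is an assumption. Relationship terms outside the table are not handled.

```python
INVERSE = {"is part of": "has part", "has part": "is part of",
           "source of": "derived from", "derived from": "source of"}


def relationship_problems(relationships: set) -> list:
    """Report missing reciprocals and mutually contradictory relationships."""
    problems = []
    for subject, rel, obj in sorted(relationships):
        # Each relationship should be mirrored by its inverse on the other object.
        if (obj, INVERSE[rel], subject) not in relationships:
            problems.append(
                f"missing reciprocal: {obj} {INVERSE[rel]} {subject}")
        # Two objects cannot each stand in the same directed relationship
        # to the other; report each such pair once.
        if (obj, rel, subject) in relationships and subject < obj:
            problems.append(
                f"contradiction: {subject} and {obj} are each '{rel}' the other")
    return problems
```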
Integrity problems could also arise because two-way linking is allowed between Object and Event. An Object can be linked to an Event through the Object's semantic units "relatedEventIdentification" or "linkingEventIdentifier" and an Event can be linked to an Object through the Event's semantic unit "linkingObjectIdentifier".
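The risk can be made concrete with a small consistency check. In this sketch, which assumes a simplified model rather than any particular repository's schema, `objects` maps each object identifier to the event identifiers it links to, and `events` maps each event identifier to the object identifiers it links to.

```python
def link_mismatches(objects: dict, events: dict) -> list:
    """Find Object-Event links that are recorded on one side only."""
    mismatches = []
    for obj_id, event_ids in objects.items():
        for ev_id in event_ids:
            if obj_id not in events.get(ev_id, set()):
                mismatches.append(
                    (obj_id, ev_id, "event does not link back to object"))
    for ev_id, obj_ids in events.items():
        for obj_id in obj_ids:
            if ev_id not in objects.get(obj_id, set()):
                mismatches.append(
                    (obj_id, ev_id, "object does not link to event"))
    return sorted(mismatches)
```

Storing the link in one place only, and deriving the other direction when needed, would avoid the need for such a check.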
Another issue with Events and linkingObjectIdentifier is that there is no way of saying (if applicable) which was the "source" object and which the "output" object other than by referring back to one of the objects to find its relationship to the other object, and again there is the potential for inconsistency.
At this stage, other than fixing issues in the first version of the data dictionary, no particular enhancements have been identified. However, once repositories begin to use PREMIS, desired enhancements will probably be identified.
Proposals for enhancements will be sent for discussion to the PREMIS Implementors' Group list and will be formally submitted to the Editorial Committee for the PREMIS Maintenance Activity, which the National Library of Australia has been invited to join.
Other schemas and protocols recommended in this report are