Australian Partnership for Sustainable Repositories
PREMIS Requirement Statement Project Report

Bronwyn Lee, Gerard Clifton and Somaya Langley
National Library of Australia
July 2006

Contents

  1. Report and recommendations
  2. Appendix 1: Preservation metadata elements
  3. Appendix 2: List of supported formats
  4. Appendix 3: Tools for automated metadata collection
  5. Appendix 4: Gap reports for ANU DSpace and UQ Fez/Fedora repositories
  6. Appendix 5: Submission models for key digital content categories
  7. Appendix 6: Preservation Event use cases and functional requirements
  8. Appendix 7: Proposals for enhancements to PREMIS and existing schemas and protocols that might be used.
  9. Appendix 8: Proposed profile for exchanging metadata.
  10. Appendix 9: Glossary
  11. Appendix 10: Bibliography

Report and recommendations

This report contains the results of the PRESTA - PREMIS Requirements Statement project undertaken by the National Library of Australia from December 2005 to June 2006 for the Australian Partnership for Sustainable Repositories (APSR).

1. Australian Partnership for Sustainable Repositories (APSR)

APSR aims to establish a centre of excellence for the management of scholarly assets in digital format.

It has an overall focus on the critical issues of access continuity and the sustainability of digital collections. It is building on a base of demonstrators in developmental repositories within partner institutions. It is contributing to national strength in this area by encouraging the development of skills and expertise and providing coordination throughout the sector. It is actively providing international linkages and national services.

APSR is supported by the Systemic Infrastructure Initiative as part of the Australian Government's Backing Australia's Ability - An Innovative Action Plan for the Future. The current partners are Australian National University, National Library of Australia, University of Queensland, University of Sydney, University of Melbourne, University of Technology Sydney and the Australian Partnership for Advanced Computing.

2. PREMIS Requirement Statement project (PRESTA)

"PRESTA - PREMIS Requirement Statement" is one of the projects of APSR. It has as its aims:

to specify requirements for the collection of metadata needed for preservation management purposes and help these to be applied to selected repository implementations of APSR partners.

This report covers work done at the National Library of Australia during the six-month period funded by the APSR from December 2005 to June 2006. The National Library of Australia has two digital repositories: PANDORA, Australia's web archive, and a digital repository for storing its own digital collections, which is managed using a system developed in-house called the Digital Collections Manager (DCM). The National Library of Australia has been actively involved in national and international digital preservation initiatives, and information about its activities can be found on the Digital Preservation part of the National Library of Australia's website.

The "selected repository implementations" studied were the Australian National University's (ANU) Demetrius repository based on DSpace and the University of Queensland's (UQ) eScholarship repository based on Fez and Fedora.

3. Scope of the project

The original draft workplan implied that functional specifications for the collection of metadata during submission and ingest would be written, which the repositories would then implement. However, this was felt to be inappropriate for the selected repositories, because their systems were already implemented with established business and submission models. It was decided to place more emphasis on what metadata was collected than on how it was collected. The project would specify the metadata needed for preservation purposes, identify metadata that was not currently being collected and make recommendations for enhancements, but leave decisions on how to implement those enhancements to the repositories themselves.

Use cases were written for preservation events and their metadata, as this was an area lacking in both ANU and UQ repositories which has not been covered elsewhere. The other significant gap, preservation risk monitoring, is being addressed in the AONS (Automatic Obsolescence Notification System) project.

Although the title of the project was "PREMIS Requirement Statement" the project did not confine itself to PREMIS but considered all metadata, including PREMIS, necessary to support long term sustainability. For "implementing" PREMIS, the project did not specify how metadata was to be stored but recommended:

METS was chosen for the profile because it was the best understood of the standards for exchanging metadata about digital objects, it was the standard being used and discussed in the PREMIS implementors' group, and it could be applied to different types of digital objects.

The profile provides a concrete framework in which to implement PREMIS. Repositories could demonstrate they met the preservation metadata requirements by being able to produce documents conforming with the profile.

From the original draft workplan the following tasks were carried forward:

4. Products of the project

The main products of the project were:

Other products were included as part of the original work plan. The products are in the appendices.

4.1. Service framework

Digital archive service framework

This diagram shows part of an archive service framework (for more, see for example the Fedora service framework (2005-2007)). SIP (Submission Information Package), AIP (Archival Information Package) and DIP (Dissemination Information Package) are concepts from the Reference Model for an Open Archival Information System (OAIS).

The AIP in the archive should contain all the metadata required for long term sustainability and access. The AIP is conceptual: the metadata will not necessarily be stored as a single package in the repository and some metadata may be implicit (e.g. because it applies to all objects in the repository) rather than stored explicitly. The first product, the "List of preservation metadata elements", applies to the AIP.

The SIP is the object and metadata submitted, for instance by a depositor or harvester, through a submission system. It will contain a subset of what is in the AIP. The product "Gap reports for ANU and UQ" looks at what metadata is collected on submission and ingest.

The DIP, containing an object and metadata sent to a delivery system for presentation to a user, will also contain a subset of the AIP's metadata. This project did not look at this area.

The preservation monitoring and management system may identify objects which need some preservation action through a search protocol and report and may then perform those actions. The product "Preservation event use cases" pertains to this area.

The archive may produce a DIP when transferring custody of an object to a partner archive. This becomes a SIP for ingest into the partner archive. The product "Profile for exchanging metadata" pertains to this area.

4.2 List of preservation metadata elements

This report in Appendix 1 details the metadata elements required for preservation purposes, i.e. metadata needed in order to provide meaningful long term access to digital objects. Metadata includes

This report also includes a list of "mandatory" elements, that is, things a repository should know about every object. PREMIS and this report do not specify how metadata is to be stored, or even whether it is stored explicitly: an element such as storageMedium may not be stored for each object because it applies to every object. However, if an element is not stored explicitly for each object, it should be documented explicitly somewhere, e.g. in policy or procedures.

4.3 Recommended list of supported formats

This report in Appendix 2 is a "recommended list of supported formats". It is divided into material type, then for image, audio and video is further subdivided into recommended archival formats, formats in common usage which may be supported, e.g. formats produced by digital cameras and recording equipment, and unsupported formats. Most repositories will not support all of the recommended formats. File formats under "unsupported formats" and others not on this list should be converted to another format before being accepted by a repository because they are likely to be difficult to support in the long term.

Not included are specialist file formats which would be kept in specialist data repositories e.g. FITS (Flexible Image Transport System) used to manage astronomical data. Formats intended for delivery purposes, such as streaming media, are also not included.

4.4 Tools for automated metadata collection

This report in Appendix 3 recommends tools for identifying file formats and automatically extracting metadata. It includes an evaluation of the tools' capabilities and examples of output showing metadata that can be automatically generated. Output does not include empty elements for metadata not present in or not applicable to the files the tools are used on, and therefore the examples may not fully represent the capabilities of the tools. The National Library of Australia intends to do a more detailed audit to align metadata able to be output against recommended preservation metadata elements.

4.5 Gap reports for ANU DSpace and UQ Fez/Fedora repositories

Gap reports for ANU and UQ are in Appendix 4 of this report. The report assesses the extent to which the ANU and UQ repositories already support the collection of preservation metadata elements and includes recommendations for enhancements where gaps were identified. The most significant gaps were:

4.6 Preservation Event use cases

This document describes the requirements for actions that need to be taken on objects in a digital preservation repository and recording those actions or events. The following use cases are described:

4.7 Profile for exchanging metadata

A draft METS profile is proposed in Appendix 8 in the form of a table of rules and recommendations. The National Library of Australia needs to test the profile with ANU and UQ and, after further consultation with them and the wider digital preservation community, revise the profile. It can then be expressed in XML using the formal METS profile schema and submitted to METS for registration. The scenario this profile addresses is transferring custody of an object from one repository to another, because this is the scenario that requires the full set of preservation metadata. The draft profile is meant to be a common, non-system-specific profile to which APSR partner repositories can map their system-specific requirements.

5. Summary of recommendations from the project

These are the recommendations arising from the products above.

  1. Repositories collect the full range of metadata necessary to provide meaningful long-term access to digital objects:
    • core preservation metadata (PREMIS)
    • descriptive metadata (describes content including metadata providing context or meaning to a digital object)
    • structural metadata (how parts relate to the whole and to each other)
    • file format specific metadata (e.g. image, audio formats)
    • access rights metadata (so material can be made available in accordance with rightsholders' conditions)
  2. Repositories ensure they collect the mandatory PREMIS core preservation metadata elements in section A1.6.
  3. Repositories aim to collect non-mandatory PREMIS metadata where applicable.
  4. Repositories have policies and procedures which encourage deposit of digital material in open, standard formats.
  5. Repositories have policies and procedures which articulate the level of support provided for particular formats.
  6. Repositories identify and validate the file formats of objects on ingest or shortly thereafter.
  7. Repositories use tools on ingest of an object or periodically on new objects, to collect extra metadata and/or metadata which can't be easily supplied during submission.
  8. The National Library of Australia embark on a more detailed audit to align metadata able to be output against recommended preservation metadata elements and make results of this work available when ready.
  9. The Australian National University (ANU) and the University of Queensland (UQ) repositories consider implementing the enhancements suggested in the gap reports, particularly
    • recording of preservation events
    • recording of structural relationships
    • file format validation (ANU)
    • checksum generation (UQ)
  10. ANU and UQ repositories particularly take note of the functional requirements for preservation events in Appendix 6 and bring them to the attention of their open source communities as they begin to develop event logging functionality.
  11. Australian repositories, particularly the National Library of Australia, continue to actively participate in development of standards relevant to digital preservation.
  12. Australian repositories continue to actively participate in development of open source software for digital repositories, encouraging support for digital preservation metadata and standards in these developments.
  13. The National Library of Australia develop crosswalks, if not already available, to map elements from schemas output by automated tools to PREMIS where an equivalent element exists.
  14. The National Library of Australia continue to develop and test the proposed METS profile for metadata exchange with input from the Australian National University and the University of Queensland and consultation with the wider digital preservation community with a view to registering the profile formally.
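Recommendation 9 lists checksum generation among the suggested enhancements. As a minimal sketch only, assuming a repository can read the stored file from a local path, a checksum could be computed with Python's standard hashlib module. The dictionary keys borrow the PREMIS fixity names messageDigestAlgorithm and messageDigest for illustration; the function name file_checksum and the default algorithm are assumptions of this sketch, not something the project prescribes.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a checksum for the file at `path`, reading it in chunks.

    The choice of SHA-256 is an assumption for illustration only.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "messageDigestAlgorithm": algorithm,
        "messageDigest": digest.hexdigest(),
    }
```

Recomputing the checksum later and comparing it with the stored value is one way a repository could detect that a file has changed.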

Appendix 1: Preservation metadata elements

This part of the report details the metadata elements required for preservation purposes, i.e. metadata needed in order to provide meaningful long term access to digital objects. It also comments on elements which should be "mandatory".

A1.1 Service framework

The following diagram shows part of an archive service framework (for more, see for example the Fedora service framework (2005-2007)). SIP (Submission Information Package), AIP (Archival Information Package) and DIP (Dissemination Information Package) are concepts from the Reference Model for an Open Archival Information System (OAIS).

digital archive service framework

This report is concerned with the AIP which should contain all the metadata required for long term sustainability and access. The SIPs (e.g. the object and metadata submitted by a depositor or harvester) and the DIPs (e.g. object and metadata sent to a system for presentation to a user) will contain subsets of the AIP's metadata. The AIP is conceptual: the metadata will not necessarily be stored as a single package in the repository and some metadata may be implicit (e.g. because it applies to all objects in the repository) rather than stored explicitly.

A1.2 Scenarios

To determine the requirements for preservation metadata, one needs to consider the uses to which the metadata will be put. Preservation metadata will be needed to support the following general scenarios:

While the general scenarios above seem straightforward, they could represent many different specific scenarios. For instance, a repository receives an access request for an object and retrieves the object, but is then unable to render it correctly. It is hard to imagine all the problems that might occur even 10 years in the future, let alone over a longer period. Therefore it seems wise to collect as much metadata as possible "just in case", as it may not be possible to obtain the metadata "just in time" when problems arise in the future.

A1.3 Metadata elements

The PREMIS Data Dictionary provides the "core preservation metadata element set". It was accepted that the elements were all necessary where applicable. It only remained to examine PREMIS to see how it should be interpreted and implemented.

However PREMIS's scope is deliberately limited to metadata which could apply to all digital objects regardless of format. It does not include file format specific metadata. It includes but does not go into detail about Intellectual Entity because "descriptive metadata is well served by existing standards". It also limits itself to "characteristics of rights and permissions concerned with preservation activities, not those associated with access and/or distribution".

Metadata needed for sustainable long-term access should therefore include not only PREMIS metadata but also:

This report should be used by repositories as a checklist against which to compare their own preservation metadata specification. It does not specify how or even whether metadata elements should be stored. There is no expectation, for instance, that the PREMIS elements will be stored as a group or as a discrete set of metadata, although they could be. On the contrary, the elements are likely to be stored in various places and some will be implicit perhaps because they apply to every object in the repository. However it is important to note that metadata not stored explicitly for each object should be documented explicitly somewhere e.g. in repository policy or procedures.

A repository can demonstrate its ability to meet these requirements by producing a document conforming to the draft METS profile in Appendix 8.

A1.4 What does "mandatory" mean?

In terms of this Appendix, "mandatory" means a piece of metadata a repository is expected to "know" about each object to which the metadata applies, whether the metadata is stored explicitly or not.

In the draft METS profile for metadata exchange in Appendix 8, "mandatory" means the element must be present in a conforming METS document. More mandatory elements are specified in the METS profile as stricter requirements aid system interoperability. If data is unable to be supplied for a mandatory element, the element may contain values "not_applicable" or "unknown".
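The placeholder rule above can be sketched as follows. This is an illustration under stated assumptions, not part of the profile: the three element names in MANDATORY_ELEMENTS are only an example subset (the authoritative list is the one in the draft profile), and the function name fill_mandatory is hypothetical.

```python
# Example subset of element names, for illustration only; the draft METS
# profile in Appendix 8 defines the actual mandatory set.
MANDATORY_ELEMENTS = ("objectIdentifier", "objectCategory", "storageMedium")

# The two placeholder values the profile allows for mandatory elements.
ALLOWED_PLACEHOLDERS = {"not_applicable", "unknown"}


def fill_mandatory(record, placeholder="unknown"):
    """Return a copy of `record` in which every mandatory element is present.

    Elements that are missing or empty receive the given placeholder.
    """
    if placeholder not in ALLOWED_PLACEHOLDERS:
        raise ValueError(
            f"placeholder must be one of {sorted(ALLOWED_PLACEHOLDERS)}"
        )
    completed = dict(record)
    for name in MANDATORY_ELEMENTS:
        value = completed.get(name)
        if value is None or value == "":
            completed[name] = placeholder
    return completed
```

For example, a record that supplies only objectIdentifier and objectCategory would come back with storageMedium set to "unknown".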

A1.5 Core preservation metadata (PREMIS elements)

The elements in the PREMIS Data Dictionary are "core preservation metadata" elements. ALL elements should be collected by the repository if applicable. The elements are listed below with some brief notes on applying them. Object entity elements apply to the objectCategory "file".

See the PREMIS Data Dictionary for fuller definitions, rationale, obligation (i.e. mandatory or optional), repeatability, usage and other notes.

Comments have been made against some elements on how to apply them in the draft APSR METS profile.

A1.5.1 Object entity elements

A1.5.2 Event entity elements

For more information on events, see the event use cases in Appendix 6.

A1.5.3 Agent semantic units

Agents may be persons, organisations, or software, associated with rights management and preservation events in the life of a data object.

A1.5.4 Rights semantic units

A1.6 Mandatory PREMIS elements

This is the list of PREMIS elements mandatory for APSR repositories. It is a checklist of things a repository should know about EVERY object in the repository. If the information is not recorded explicitly about each object, it should be able to be determined from the repository itself or from repository documentation of policies, procedures etc. The draft METS profile in Appendix 8 also specifies mandatory elements for conforming METS documents.

A1.6.1 Object Entity elements:

The following elements are mandatory in the PREMIS data dictionary for objectCategory "file". They are not necessarily mandatory in the PREMIS xml schema since they may not apply to all types of objectCategory.

The following additional elements from PREMIS were regarded as important enough to be mandatory for APSR repositories by the project working group.

A1.6.2 Event Entity elements:

PREMIS does not mandate the existence of an Event. An Event can be linked to an Object through the Object entity's optional relationship or linkingEventIdentifier elements, or it can be linked to an Object through the Event's optional linkingObjectIdentifier element.

However we recommend the following be mandatory:

Validation events should be recorded e.g. that a file is of the format it says it is. Where validation is done on every file on ingest or a validation tool is run over a whole repository at a particular point in time, this fact should be recorded by the repository. Events are examined in more detail in the Event use cases in Appendix 6.
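As a rough sketch of recording such an event, the function below checks a file's leading bytes against a claimed format and returns an event record. It assumes a small, hypothetical signature table and a local file path. A leading-bytes check is far weaker than validation by a dedicated tool and is used here only to keep the example self-contained. The keys follow PREMIS Event naming loosely; the eventType and eventOutcome values are assumptions, not a controlled vocabulary from this report.

```python
import uuid
from datetime import datetime, timezone

# Leading bytes for a few formats; an illustrative table, not a registry.
SIGNATURES = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "application/pdf": b"%PDF-",
    "image/gif": b"GIF8",
}


def validate_format(path, claimed_mime):
    """Check the file's leading bytes and return a validation event record."""
    signature = SIGNATURES.get(claimed_mime)
    with open(path, "rb") as handle:
        head = handle.read(16)
    if signature is None:
        outcome = "unknown"  # no signature known for the claimed format
    elif head.startswith(signature):
        outcome = "pass"
    else:
        outcome = "fail"
    return {
        "eventIdentifier": str(uuid.uuid4()),
        "eventType": "validation",
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventOutcome": outcome,
        # The path stands in for a real object identifier in this sketch.
        "linkingObjectIdentifier": path,
    }
```

When a validation tool is run over a whole repository at once, a single event linked to all affected objects may be a more economical record than one event per file.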

The mandatory elements (i.e. things the repository must know about every event) are:

The Event should "know" about the Object/s it acted on. However PREMIS does not specify a mandatory link between Events and Objects either in the Object entity or the Event entity. In the APSR draft METS profile, it will be mandatory for the Object entity to contain a linkingEventIdentifier to the (mandatory) Ingest event, and the Event entity will not need to contain the reciprocal linkingObjectIdentifier.
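The linking rule described above could be checked along the following lines. This is a sketch under assumptions: objects and events are represented as plain dictionaries, linkingEventIdentifier holds a list of event identifiers, and the eventType value "ingest" is a hypothetical vocabulary term.

```python
def objects_missing_ingest_link(objects, events):
    """Return identifiers of objects that have no link to an ingest event."""
    ingest_ids = {
        event["eventIdentifier"]
        for event in events
        if event.get("eventType") == "ingest"
    }
    missing = []
    for obj in objects:
        linked = set(obj.get("linkingEventIdentifier", []))
        # The object conforms if at least one linked event is an ingest event.
        if not linked & ingest_ids:
            missing.append(obj["objectIdentifier"])
    return missing
```

Because the link is held on the Object side, the check needs only the object records and the list of events; the Event records do not have to carry a reciprocal linkingObjectIdentifier.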

A1.6.3 Agent

PREMIS does not mandate the existence of an Agent entity since linkingAgentIdentifier is optional in the Event entity.

However we recommend that repositories should know about Agent if the Event is one which changes an Object. Agent should, for example, identify the software used. Although the software may already be described in the Object entity (in creatingApplication) or in the Event entity, placing it in Agent in a document conforming with the APSR METS profile will facilitate mapping in the receiving repository's database. Agent should also be used for an organisation if an organisation other than the transferring repository was responsible for an Event.

A1.6.4 Rights entity

PREMIS does not mandate the existence of Rights since linkingPermissionStatementIdentifier is optional in the Object entity. PREMIS concentrated on rights concerned with preservation activities.

Rights should be mandatory in so far as repositories should have agreements in place, or some conditions which depositors agree to when they deposit material in the repository. This may not apply when material is out of copyright.

A1.7 Descriptive metadata

Descriptive metadata is considered mandatory for APSR repositories but this project did not examine this area in detail and does not prescribe a particular metadata scheme. Repositories will need to be able to output descriptive metadata in MODS to conform with the APSR METS profile for metadata exchange, but should store and be able to output descriptive metadata in a form which retains the granularity of all available metadata.

Descriptive metadata includes not only metadata such as creator, title, date, subjects, but also contextual metadata. Contextual metadata provides meaning to or aids interpretation of an object.

A1.8 Structural metadata

For some objects structural metadata is needed for a repository to be able to reconstitute a whole digital object from its parts. A repository also needs to be able to display or present an object in a way that allows a user to understand how an object is related to its parts or to a greater whole. It may be stored in a PREMIS Object entity "relationship" element but may be better stored as a structural map, a manifest of files or a set of relationships.

A1.9 File format specific metadata

File format specific metadata is needed to record the characteristics of a digital object so that it can be accurately rendered. In some cases, without file format specific metadata a system may not be able to render a digital object at all.

The following metadata schemes and extensions to them proposed by this project are recommended for use in the APSR METS profile. The schemas may include mandatory elements. The extensions will be published on the National Library of Australia website.

This section is intended to be used as a supplement to the Library of Congress Audiovisual Prototyping Project, http://www.loc.gov/rr/mopic/avprot/ (2004). Indicated in this section are alternative field names which have been used in the National Library of Australia’s Digital Collections Manager (DCM) as well as a set of additional metadata fields which is itself an extension to the Library of Congress METS extension schema.

It is likely that automated harvesting of data for many of these metadata fields is currently not possible; however, recording such data will assist in long-term management of the files.

It should also be noted that the set of suggested additional metadata fields is not necessarily complete and it is intended that other organisations/institutions provide further input.

A1.9.1 Image

MIX is the recommended extension schema to be used for image metadata. MIX is a schema endorsed by the METS Editorial Board for use with METS. The following is the introduction from the MIX home page:

The Library of Congress' Network Development and MARC Standards Office, in partnership with the NISO Technical Metadata for Digital Still Images Standards Committee and other interested experts, is developing an XML schema for a set of technical data elements required to manage digital image collections. The schema provides a format for interchange and/or storage of the data specified in the NISO Draft Standard Data Dictionary: Technical Metadata for Digital Still Images (Version 1.2). This schema is currently in draft status and is being referred to as "NISO Metadata for Images in XML (NISO MIX)". MIX is expressed using the XML schema language of the World Wide Web Consortium. MIX is maintained for NISO by the Network Development and MARC Standards Office of the Library of Congress with input from users.

MIX schema

A1.9.2 Audio and Video

A1.9.2.1 Audio

The Library of Congress Audio (Source) Data Dictionary, which was developed as part of the Audio-Visual Prototyping Project, is the recommended base extension schema for audio metadata. It can be found at http://www.loc.gov/rr/mopic/avprot/DD_ASMD.html. The following metadata fields are intended as a further extension to the Library of Congress Audio (Source) Data Dictionary.

file_format
The type of audio file, for example a WAV file is a Microsoft WAVE file and an AIF file is in the Audio Interchange File Format.
file_version
The version of the file format used.
coding_history
Indicates the file format history (and devices) that the file has been through.
mime_type
The MIME type helps web browsers associate particular files with suitable player applications or plug-ins.
compression
The type of compression used on the file. For "non-archival" quality files where an archival copy is not available, this may be, for example, MPEG compression as in an MPEG 1 Layer 3 (MP3) file.
codec_version
The version of the codec used (if appropriate).
file_container
The type of file format that is used to hold another file format, for example the Broadcast Wave Format (BWF) is a file container for a Microsoft WAV file.
file_container_version
The version of the file container format used.
frame_rate
The number of frames per second.
byte_order
For example "Big Endian" or "Little Endian".
timecode_type
Type of time code recorded on the audio source item, for example: SMPTE drop frame, SMPTE non drop frame, etc.
channel_num
Indicates the specific channel number, for example channel number 0. This is a repeatable field.
channel_num_map_loc
This is tied to the specific channel number and should indicate the position of channel, for example, channel number 0 in a stereo file for the channel number map location may indicate "left".
channel_map_config
The configuration of the mapping of channels. This information is important for multichannel works. Examples of configuration are the "shoebox" and "double diamond".
delivery_type
For streaming media files (this, for instance, could indicate RTSP). While streaming media files are not recommended for inclusion as they are of a non-archival format, in the instance that an exception is made to include streaming media, it is necessary for a record of intended delivery protocol to be available. For example: QuickTime files that are produced with the settings, “hinted for streaming” indicate that these files can only be accessed using the RTSP protocol.
encoding_software
Software used to encode delivery files (only necessary in the exceptional case of non-archival quality delivery files).
codec_essence
The particular type or “flavour” of the codec used for example RealMedia “Music” or “Voice” codec.
codec_essence_version
The version of the codec essence used.
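Some of these fields can be read from a file header. The following is a minimal sketch, assuming an uncompressed PCM WAV file and using Python's standard wave module. It fills a handful of fields using names from this appendix (num_channel, bits_per_sample and sampling_frequency appear in section A1.9.5). The function name describe_wav, the "channels" key, and the channel position labels ("mono", "left", "right") are assumptions made for illustration.

```python
import wave


def describe_wav(path):
    """Read basic technical metadata from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        record = {
            "file_format": "WAV",
            "num_channel": channels,
            "bits_per_sample": wav.getsampwidth() * 8,
            "sampling_frequency": wav.getframerate(),
            # RIFF/WAVE files store their data in little-endian order.
            "byte_order": "Little Endian",
        }
    # Assumed position labels; only mono and stereo layouts are guessed here.
    positions = {1: ["mono"], 2: ["left", "right"]}.get(channels)
    record["channels"] = [
        {
            "channel_num": number,
            "channel_num_map_loc": positions[number] if positions else "unknown",
        }
        for number in range(channels)
    ]
    return record
```

Fields such as coding_history or channel_map_config cannot be derived this way and would still need to be supplied by the depositor or by repository staff.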

A1.9.2.2 Video

The Library of Congress Video (Source) Data Dictionary, which was developed as part of the Audio-Visual Prototyping Project, is the recommended base extension schema for video metadata. It can be found at http://www.loc.gov/rr/mopic/avprot/DD_VSMD.html. The following metadata fields are intended as a further extension to the Library of Congress Video (Source) Data Dictionary.

file_format
The type of video file, for example a MOV file is a QuickTime file format, however it should be noted that with video there is quite often both a file format and a container format.
file_version
The version of the file format used.
coding_history
Indicates the file format history (and devices) that the file has been through.
mime_type
The MIME type helps web browsers associate particular files with suitable player applications or plug-ins.
compression
The type of compression used on the file. For “non-archival” quality files where an archival copy is not available, this may be, for example, MPEG compression such as MPEG 2, which is the format used for DVD presentation. While archival materials should not be stored in a compressed format, video files are currently very large and, because of other constraints (such as the cost of large-scale data storage infrastructure), storing uncompressed video is not always possible. Video is still a relatively unexplored field in relation to archiving and preservation, and it is assumed that practices and standards for video archiving will change over time.
codec_version
The version of the codec used (if appropriate).
file_container
The type of file format that is used to hold another file format, for example QuickTime (MOV) is a file container for other file formats. There can be similarities between the file format and file container formats.
file_container_version
The version of the file container format used.
byte_order
For example “Big Endian” or “Little Endian”.
counting_mode
NTSC drop-frame or non-drop frame.
track_num
Indicates the specific track number, for example track number 0. This is a repeatable field. (This is similar to the audio metadata field channel_num.)
track_num_map_loc
This is tied to the specific track number and should indicate the position.
track_map_config
The configuration of the mapping of tracks. This information is important for (rare) works where more than one video track is present.
delivery_type
For streaming media files (this, for instance, could indicate RTSP). While streaming media files are not recommended for inclusion as they are of a non-archival format, in the instance that an exception is made to include streaming media, it is necessary for a record of intended delivery protocol to be available. For example: QuickTime files that are produced with the settings, “hinted for streaming” indicate that these files can only be accessed using the RTSP protocol.
encoding_software
Software used to encode delivery files (only necessary in the exceptional case of non-archival quality delivery files).
broadcast_standard
This includes PAL, NTSC, SECAM, DV, HDV etc
anamorphic
A playback presentation setting related to how DVD video has been mastered and whether the video is capable of being played back on screens with different aspect ratios. For example, this would include being able to play the video material on screens with either 4:3 or 16:9 aspect ratios without the video being “squashed” to fit. Values for this metadata field should be either “true” or “false”.
field_dominance
This is set to either lower (even) or upper (odd).
alpha_channel
Whether or not the video has an alpha channel.
codec_essence
The particular type or “flavour” of the codec used, for example the RealMedia “Music” or “Voice” codec.
codec_essence_version
The version of the codec essence used.

A1.9.4 Text, HTML and XML

Schema for Technical Metadata for Text (created by Jerome McDonough, Elmer Bobst Library, New York University) is endorsed by the METS Editorial Board for use with METS.
Schema
Documentation

Further analysis of additional metadata fields required for text documents should be carried out.

A1.9.4.1 Additional metadata fields

markup_nature
Whether the mark-up style is strict or transitional (in the case of HTML and XHTML)

A1.9.5 Alternative naming

Terminology can vary. The following list of alternative names is provided for clarity.

audio_block_size
block align
audio_data_encoding
encoding
bits_per_sample
bit depth
codec_name
codec
num_channel
channels
sound_field
recording mode
file_container
wrapper or container
sampling_frequency
sampling rate
data_rate
bit rate
timecode_type
display format
pixels_horizontal
image width
pixels_vertical
image height
charset
encoding

A1.10 Access rights metadata

It should be mandatory to record access rights for materials with restricted access conditions. If no access rights are recorded it would be assumed that there are no access restrictions.

This report does not prescribe a particular metadata scheme. Possibilities include METS Rights, PREMIS Rights, Creative Commons licences and XACML.

A1.11 Recommendations

  1. Repositories collect the full range of metadata necessary to provide meaningful long-term access to digital objects:
    • core preservation metadata (PREMIS)
    • descriptive metadata (describes content including metadata providing context or meaning to a digital object)
    • structural metadata (how parts relate to the whole and to each other)
    • file format specific metadata (e.g. image, audio formats)
    • access rights metadata (so material can be made available in accordance with rightsholders' conditions)
  2. Repositories ensure they collect the mandatory PREMIS core preservation metadata elements in section A1.6.
  3. Repositories aim to collect non-mandatory PREMIS metadata where applicable.


Appendix 2: Recommended list of supported formats

Appendix 2 comprises a list of formats likely to be supported by repositories. Most repositories will not support all of these formats. File formats not on this list are likely to be more difficult to support in the long-term. It is acknowledged that constant Information Technology development will produce new and improved archival formats, and it is intended that any new additions to this list be included where appropriate. The list not only includes recommended archival formats but also other formats likely to be accepted by repositories e.g. formats produced by digital cameras or recording equipment.

Archival formats should ideally be based on open standards, but widely used and supported, well documented proprietary formats may be acceptable. It should be noted that while appropriate archival and "commonly in use" formats have been listed here - this document does not indicate recommended quality standards for digital media items, and appropriate guidelines for such should be sought. Files containing any form of compression should be carefully considered.

Not included are specialist file formats which would be kept in specialist data repositories, e.g. FITS (Flexible Image Transport System), used to manage astronomical data. Formats intended for delivery purposes, such as streaming media, particularly where formats are non-stand-alone and dependent on specific protocols for access (such as RTSP), are also not included.

This list was developed in consultation with Kevin Bradley who co-authored Survey of data collections: a research project undertaken for the Australian Partnership for Sustainable Repositories.

A2.1 Images

2.1.1 Recommended Archival Formats

These are recommended archival formats, listed in order of preference.

  1. Tagged Image File Format (TIFF)

While TIFF is the recommended archival format, Multi-part TIFF files and Multi-layered TIFF files are not necessarily considered archival file formats. Where possible, Multi-part TIFF files should be stored as sets of single images and Multi-layered TIFF files should be flattened to single-layer images. Each repository may decide to develop its own policies regarding these variations of the TIFF format.

2.1.2 Formats in Common Usage

These formats are not recommended as archival formats; however, they are in common usage and so are included. As they are not archival formats, no order of preference is indicated. Where possible, files in the following formats should have a copy created in an archival format.

2.1.3 Unsupported Formats

These formats are not archival formats and are not recommended as supported formats by repositories. Files in these formats should be converted to recommended archival formats before being accepted by a repository as they are likely to be difficult to support in the long-term.

A2.2 Audio

It should be noted that some audio formats are a combination of a container or "wrapper" format and a file content format, and so are essentially a combination of two formats.

2.2.1 Recommended Archival Formats

These are recommended archival formats, listed in order of preference.

  1. Broadcast Wave Format (BWF) - wrapper that contains the WAV file format. The wrapper holds additional metadata
  2. Waveform Audio (WAV)
  3. Audio Interchange File (AIFF)

2.2.2 Formats in Common Usage

These formats are not recommended as archival formats; however, they are in common usage and so are included. As they are not archival formats, no order of preference is indicated. Where possible, files in the following formats should have a copy created in an archival format.

2.2.3 Unsupported Formats

These formats are not archival formats and are not recommended as supported formats for repositories. Files in these formats should be converted to recommended archival formats before being accepted by a repository as they are likely to be difficult to support in the long-term.

A2.3 Video

It should be noted that some video formats are a combination of a container or "wrapper" format and a file content format, and so are essentially a combination of two formats.

2.3.1 Recommended Archival Formats

Currently there is no archival video standard; however, a number of options are available. Unlike other media types such as audio or image, video requires large amounts of storage space. For this reason, some compressed formats are considered suitable as archival formats for the time being, until storage of large video files becomes practical and a recommended archival video standard is established. These are recommended archival formats, listed in order of preference.

  1. Material Exchange Format (MXF) - wrapper that contains a range of "essence" or "content" file formats. The wrapper holds additional metadata
  2. Advanced Authoring Format (AAF) - wrapper that contains a range of "essence" or "content" file formats. The wrapper holds additional metadata
  3. Motion JPEG 2000 (MJ2) - this is a newly emerging lossless compression format; however, implementation of the standard has been relatively slow, and the majority of software in common usage is currently unable to read this format
  4. MPEG-2 - lossy compression format. This is the standard used for DVD

2.3.2 Formats in Common Usage

These formats are not recommended as archival formats; however, they are in common usage and so are included. As they are not archival formats, no order of preference is indicated. Where possible, files in the following formats should have a copy created in an archival format.

2.3.3 Unsupported Formats

These formats are not archival formats and are not recommended as supported formats for repositories. Files in these formats should be converted to recommended archival formats before being accepted by a repository as they are likely to be difficult to support in the long-term.

A2.4 Text

2.4.1 Recommended Archival Formats

These are recommended archival formats, listed in order of preference. While formats such as Microsoft Word are commonplace, it should be noted that this is a proprietary format and is likely to be difficult to support in the long term.

  1. Extensible Markup Language (XML)
  2. American Standard Code for Information Interchange (ASCII) Text (TXT)
  3. 8-bit Unicode Transformation Format (UTF-8) Text (TXT)
  4. 16-bit Unicode Transformation Format (UTF-16) Text (TXT)

2.4.2 Formats in Common Usage

These formats are not recommended as archival formats; however, they are in common usage and so are included. As they are not archival formats, no order of preference is indicated. Where possible, files in the following formats should have a copy created in an archival format.

2.4.3 Unsupported Formats

These formats are not archival formats and are not recommended as supported formats by repositories. Files in these formats should be converted to recommended archival formats before being accepted by a repository as they are likely to be difficult to support in the long-term. However, it should be noted that some companies creating proprietary formats are considering developing future open format versions.

A2.5 Databases

Databases contain a larger degree of complexity than other individual files. While a full analysis of database formats was not carried out, only databases with a simple structure are able to be supported. Databases containing complex relationships cannot be supported at this stage. In general, documentation of databases including rules and relationships should also be archived.

2.5.1 Recommended Archival Formats

These are recommended archival formats, listed in order of preference. Only simple databases whose raw data can be turned into structured text, such as databases where all data can be extracted via a single join query, are considered suitable for storage in a recommended archival format.

  1. Extensible Markup Language (XML) - simple databases only
  2. Comma-Separated Variables (CSV) - simple databases only
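
As an illustration of what "extracted via a single join query" can mean in practice, the following is a minimal sketch only. It assumes a small SQLite database; the table names, column names and the function `export_join_to_csv` are hypothetical and are not taken from any APSR repository.

```python
import csv
import io
import sqlite3


def export_join_to_csv(conn, query):
    """Run one query and return its full result set as CSV text with a header row."""
    cursor = conn.execute(query)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # The header row is taken from the column names reported by the query.
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return out.getvalue()


# Hypothetical two-table database, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT,
                    author_id INTEGER REFERENCES author(id));
INSERT INTO author VALUES (1, 'A. Researcher');
INSERT INTO paper VALUES (10, 'Field notes, 2005', 1);
""")

# A single join query that pulls all of the data into one flat table.
QUERY = """
SELECT paper.id AS paper_id, paper.title AS title, author.name AS author
FROM paper JOIN author ON paper.author_id = author.id
ORDER BY paper.id
"""

csv_text = export_join_to_csv(conn, QUERY)
print(csv_text)
```

A database that needs several such queries, or whose rules and relationships cannot be flattened in this way, would fall outside the "simple" category described above, and its documentation would need to be archived alongside the data.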

2.5.2 Formats in Common Usage

These formats are not recommended as archival formats; however, they are in common usage and so are included. As they are not archival formats, no order of preference is indicated. Where possible, files in the following formats should have a copy created in an archival format, as proprietary formats are likely to be difficult to support in the long term.

2.5.3 Out of Scope Formats

Complex databases were considered out-of-scope for this project and so are considered to be unsupported formats.

A2.6 Portable Document Format (PDF)

While the Portable Document Format (PDF) is a proprietary format, and proprietary formats are normally considered to be unsupported formats, PDF should currently be the exception. This is largely because it is a format in common usage and a large proportion of academic papers are published and distributed in this format. Further work would need to be done on this format as it contains both text and images, and because there are several types of PDF, including PDF/A, a proposed archival standard for PDF accepted as an ISO standard in 2005.

A2.7 Websites

This project did not address websites specifically. The National Library of Australia is part of the International Internet Preservation Consortium which, among other things, is fostering the development of common tools, techniques and standards for website archiving.

A2.8 Multimedia

Multimedia files (such as Director, Flash and Microsoft Powerpoint) were considered out-of-scope for this project. However, example output files from metadata extraction tools have been provided for a range of multimedia formats.

A2.9 Other Objects and Formats

Other formats that were considered out-of-scope of this project are considered unsupported formats currently.

A2.10 Recommendations

  1. Repositories have policies and procedures which encourage deposit of digital material in open, standard formats.
  2. Repositories have policies and procedures which articulate the level of support provided for particular formats.


Appendix 3: Tools for automated metadata collection

A3.1 Introduction

Recommendations on the range of metadata elements to be collected by repositories are set out in Appendix 1 of this report. The degree to which repositories can meet such recommendations will depend on the metadata that can be re-used from existing records, policies and documentation, supplied by depositors, recorded as part of repository processes, or extracted from the materials themselves. 

Given the volume of metadata that may be required or available, automated processes for collection of metadata are preferable, especially for metadata extraction from the materials themselves. A number of tools are available to address these needs in varying degrees and to provide some of the details required in an automated way. A selection of such tools is briefly described and compared in this Appendix. A more detailed alignment of metadata output from these tools against element recommendations will be made available when completed.

There are several aspects of metadata collection and the archiving process that may be addressed by tools:

At present, tools tend to cover one or more aspects of the archiving process and metadata collection, but no one tool yet covers all. Tools may also cover these aspects to varying degrees. 

The range of formats covered by tools can also vary, and it may be useful to divide available tools into several classes, based on their format coverage:

For the range of formats intended to be supported in APSR repositories, several tools may be suitable. It is likely that more than one will be needed to obtain a full range of metadata. 

Only tools in the first category, those able to extract metadata from a range of materials, are discussed below. Enhancements to the PRONOM service of The National Archives (UK) may, in the future, assist in locating tools capable of extracting metadata from single specific formats.

A3.2 DROID (Digital Record Object Identification)

Available from The National Archives (UK) - http://www.nationalarchives.gov.uk/aboutapps/pronom/tools.htm

DROID is a platform-independent Java-based application which identifies the format and version of files by comparing file data streams against a set of known signature byte sequences. The signature byte sequences are held in a signature file, which may be updated automatically from The National Archives web site by the DROID application. In March 2006, Version 9 of the signature file contained signature byte sequences for 57 named file formats (including 159 versions of those formats), and a further 387 tentative file format indicators based on file extension alone.

The main function of DROID is to identify a wide range of file formats as conclusively as possible, including versions. Where a number of possible matches are identified, for example, where multiple versions of a format contain the same signature byte sequences, all matches are listed, along with an indication of the degree of match (e.g. Tentative, Positive). DROID may also notify of suspected mismatches between the format as identified by internal signatures and the filename extension.
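
The general idea of signature-based identification can be sketched as follows. This is not DROID's algorithm or its signature file: the signature table, the function name `identify` and the shape of the result are assumptions made for illustration, and real signature files are far richer (byte offsets, versions, priorities).

```python
from pathlib import Path

# Hypothetical signature table: (format name, leading bytes, usual extensions).
SIGNATURES = [
    ("TIFF (little endian)", b"II*\x00", {".tif", ".tiff"}),
    ("TIFF (big endian)", b"MM\x00*", {".tif", ".tiff"}),
    ("PDF", b"%PDF-", {".pdf"}),
    ("RIFF container (e.g. WAV)", b"RIFF", {".wav"}),
]


def identify(data: bytes, filename: str) -> dict:
    """List every format whose signature matches, falling back to the extension."""
    ext = Path(filename).suffix.lower()
    # Positive match: the file's leading bytes equal a known signature.
    matches = [name for name, magic, _ in SIGNATURES if data.startswith(magic)]
    if matches:
        expected = set()
        for name, _, exts in SIGNATURES:
            if name in matches:
                expected |= exts
        return {"status": "Positive", "matches": matches,
                "extension_mismatch": ext not in expected}
    # Tentative match: no signature matched, so rely on the extension alone.
    matches = [name for name, _, exts in SIGNATURES if ext in exts]
    status = "Tentative" if matches else "Not identified"
    return {"status": status, "matches": matches, "extension_mismatch": False}


print(identify(b"%PDF-1.4\n", "report.pdf"))
print(identify(b"%PDF-1.4\n", "report.tif"))   # signature and extension disagree
print(identify(b"\x00\x01", "scan.tif"))       # extension only, two candidates
```

Listing all candidates rather than choosing one mirrors the behaviour described above, where several versions of a format may share the same signature.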

DROID identifies a wider range of formats than the other tools noted (JHOVE and the National Library of New Zealand Metadata Extraction Tool), and, where available, indicates the Persistent Unique Identifier (PUID) that has been assigned to the identified format within The National Archives format registry, PRONOM. However, it does not extract any further metadata from files, nor generic metadata about them (e.g. creation date etc.).

DROID could be used by repositories at least to provide file format identity information to fulfil the PREMIS mandatory elements:

Further format specific tools for metadata extraction might then be invoked based on format identifications from the DROID output.

Samples of output:

A3.3 National Library of New Zealand Metadata Extraction Tool

Available from the National Library of New Zealand - http://www.natlib.govt.nz/en/whatsnew/4initiatives.html#extraction

The National Library of New Zealand Metadata Extraction Tool is also a platform-independent Java-based application, designed to extract preservation metadata from a range of formats. Metadata may be extracted for each format by a specific modular "adapter", and can be output to XML in either an "adapter-native" schema or in a schema complying with the National Library of New Zealand's Preservation Metadata scheme. The tool is designed to be extensible, allowing creation of additional adapter plug-ins by other parties and the structuring of output via XSLT to suit alternative metadata schemes. The tool is capable of recognising and processing a range of formats and versions of formats, but does not currently appear to validate files against their identified format.

The range of formats which can be recognised and for which metadata can be extracted are currently:

Although the range of formats for which there are adapters is currently small, these cover file formats that may be commonly encountered, and the amount of metadata that is extracted can be quite extensive, particularly in "native" mode. If a format is not recognised, generic file metadata can nonetheless be collected, such as filename, size and date created. The tool can be run via either a Windows interface or from a command line.

Samples of output:

A3.4 JHOVE (JSTOR/Harvard Object Validation Environment)

Available from Harvard University Library: http://hul.harvard.edu/jhove/

JHOVE is also a platform-independent Java-based application, primarily designed to identify a range of formats and validate files against their purported formats. It can also recognise format sub-types and versions. In characterising files, JHOVE is also capable of extracting technical metadata from the range of formats and producing XML-encoded or plain text output. JHOVE is also modular and extensible in design, allowing creation of additional modules as needed.

There are currently modules available for characterisation of 12 main format types, comprising around 52 versions or distinct subtypes of those formats. The main formats recognised and for which metadata can currently be extracted are:

If a format is not recognised, the file is classed as a "bytestream", which is always reported as well-formed and valid. The tool can be run via either a Windows interface or from a command line.

The metadata extracted can be quite extensive. For images and audio, XML output can be generated according to the MIX schema for still images and the Audio Engineering Society (AES) schemas for audio objects and time code formats.

Again, not all the formats to be accepted by APSR repositories are recognised by JHOVE, and other tools may also be required.

Samples of output:

A3.5 Summary of functions covered by tools

Tool | Identify format (Tentative) | Identify format (Confirm) | Identify versions | Validate format | Collect generic file MD | Collect material type MD | Collect file format MD
DROID | Yes [546 formats] | Yes [159 formats] | Yes | No | No | No | No
NLNZ-MET | Yes [15 formats] | (Some) | (Some) | No | Yes | Yes | Yes
JHOVE | Yes [52 formats] | Yes [52 formats] | Yes | Yes | Yes | Yes | Yes

A3.6 Recommendations

  1. Repositories identify and validate the file formats of objects on ingest or shortly thereafter.
  2. Repositories use tools on ingest of an object or periodically on new objects, to collect extra metadata and/or metadata which can't be easily supplied during submission.
  3. The National Library of Australia embark on a more detailed audit to align the metadata that tools are able to output against the recommended preservation metadata elements, and make the results of this work available when ready.


Appendix 4: Gap reports for ANU DSpace and UQ Fez/Fedora repositories

This analysis was current at 19 May 2006. The reports look at the level of support for the core preservation metadata elements (i.e. PREMIS semantic units) and include recommendations for enhancements where gaps were identified.

A4.1 ANU DSpace repository

PREMIS semantic unit | Supported? | Comments on current level of support | Possible enhancements
Object Identifier | Supported | Items are given globally unique Handles. Files (DSpace bitstreams) are given a local database identifier only. | DSpace are planning to use infoURIs for bitstreams which would be globally unique.
Preservation Level | Supported | DSpace has 3 support levels (Supported, Known, Unsupported) but ANU doesn't assign them. Can be defaulted from the file format. | Content policy development around these levels. The number of levels could be increased if necessary to conform with a generic set of service levels.
Object Category | Supported | |
Composition Level | Not supported; not applicable | Default would be 0 for all files in supported formats. Files that have a composition level of higher than 0 (e.g. zip files) would fall into the category of unknown, unsupported format. |
Fixity | Supported | DSpace calculates checksum on ingest. | Checksum checker is coming in next version of DSpace (v1.4)
Size | Supported | DSpace records size on ingest. |
Format | Supported | DSpace determines this from the filename extension. Format version is not determined at present. | Format validation by running a tool such as JHOVE or DROID over the repository or on ingest. Tools could also determine format version.
Significant Properties | Not supported; not applicable | | If required, submission forms could be modified to ask for this information.
Inhibitors | Not supported; not applicable | Policy would be not to support files to which this applies. |
Creating Application | Not supported | ANU policy is to avoid providing preservation level support for formats where creating application is important, and instead promote popular, open formats. The date the file was originally created not supported unless explicitly provided in the metadata being submitted. | Present or future tools may be able to provide this information automatically. If required, submission forms could be modified to ask for this information. Descriptive metadata may indicate date of creation for born digital items.
Original Name | Supported | |
Storage | Supported | |
Environment | Not supported | Not an issue for supported formats. Global format or environment registries (under development) will meet this need. | If required for special cases, submission forms could be modified to ask for this information.
Signature Information | Not applicable | |
Relationship | Not supported | Only supported currently through DC.Relation, which is at the item, not the file level. | Relationships including structural maps could be stored as a serialised bitstream with the object.
Linking Event | Partially supported | Ingest event can be determined from database. History logging module exists though it doesn't work properly, has performance issues, and is not being used. The checksum checker will be separate from the history module and the logging systems for the database (eg editing and viewing) are also separate. Theoretically events could be got from logs but it might not be easy. | Fixity check logging will be possible in next version of DSpace. If JHOVE or DROID are run over the repository, the validation event could be determined from the JHOVE or DROID output stored with the object. Ideally the logging systems should be integrated and work properly to record events and their outcome for a particular object.
Linking Intellectual Entity (Descriptive metadata) | Supported | Each item has a qualified Dublin Core record. Other descriptive metadata may be held in serialised bitstreams. |
Linking Permission Statement | Supported | Some rights may be stored in DC.Rights. Licences (including Creative Commons) may be stored with the object. |

A4.2 University of Queensland Fez/Fedora repository

PREMIS semantic unit | Supported? | Comments on current level of support | Possible enhancements
Object Identifier | Supported | Persistent identifier (PID) at item level is UQ prefix followed by a number assigned by Fedora. Datastreams (files) associated with an item are identified by their filenames. infoURIs containing the PID and filenames can be constructed. |
Preservation Level | Not supported | Haven't needed it yet. Could be defaulted to a single level. They have just received a request for quotation for repository services. There is nowhere specific to store service levels. | Could add field to descriptive metadata form or store service level agreement as datastream with the object.
Object Category | Supported | Preservation metadata derived from JHOVE is stored at file level. |
Composition Level | Not applicable | Default would be 0 for all files in supported formats. |
Fixity | Not supported | Checksums are not being generated. | Checksums should be generated on ingest and stored.
Size | Supported | Is in JHOVE metadata. |
Format | Supported | Is in JHOVE metadata. Includes version. |
Significant Properties | Not supported | Could be stored in description. | If required, could be added to the submission forms.
Inhibitors | Not applicable | |
Creating Application | Not supported | JHOVE doesn't do application names and versions (but does get camera names for JPEGs). Original creation date of files not stored either. | If required, could use another tool which detects versions. Could be added to submission forms. Descriptive metadata may indicate date of creation for born digital items.
Original Name | Supported | File keeps its original name unless it doesn't conform to NCName. Not sure if original name is kept in this case. |
Storage | Supported | |
Environment | Not supported | Not an issue for supported formats. Global format or environment registries (under development) will meet this need. | If required for special cases, submission forms could be modified to ask for this information.
Signature Information | Not applicable | |
Relationship | Not supported | Fedora RELS-EXT is being used for relationships to other items. RELS-INT for relationships between datastreams is in the current version of Fedora but Fez is not using it yet. Internal relationships are only implicit through filenaming conventions at present - there is no metadata about relationships. | Implementation of Fedora's RELS-INT.
Linking Event | Not supported yet | Fedora has some audit trail recording. Fez is currently being developed to use this. History logging could be done automatically or manually. | Continued implementation of history logging.
Linking Intellectual Entity (Descriptive metadata) | Supported | Dublin Core record. Additional descriptive metadata may be stored. | If required, additional fields can be added to submission forms.
Linking Permission Statement | Supported | Fez has sophisticated and flexible rights management. Roles and groups (eg Fez groups, Shibboleth groups, targeted IDs) can be linked to different actions. |

A4.3 Recommendations

  1. ANU and UQ repositories consider implementing the enhancements suggested in the gap reports, particularly
    • recording of preservation events
    • recording of structural relationships
    • file format validation (ANU)
    • checksum generation (UQ)
  2. ANU and UQ repositories particularly take note of the functional requirements in Appendix 6 when implementing the recording of preservation events.
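
Checksum generation on ingest, one of the enhancements listed above, might look something like the following minimal sketch. The function name `fixity_on_ingest`, the choice of SHA-256 and the dictionary layout are assumptions for illustration; the keys borrow the names of the PREMIS semantic units messageDigestAlgorithm, messageDigest and size, but this is not how DSpace or Fez store the values.

```python
import hashlib
import os
import tempfile


def fixity_on_ingest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a message digest and byte size for a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    size = 0
    with open(path, "rb") as f:
        # Reading in chunks keeps memory use flat even for very large files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {
        "messageDigestAlgorithm": algorithm,
        "messageDigest": digest.hexdigest(),
        "size": size,
    }


# Usage with a throwaway file standing in for a deposited object.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as f:
        f.write(b"abc")
    record = fixity_on_ingest(sample)

print(record)
```

Storing the algorithm name next to the digest means a later fixity check knows how to recompute a comparable value.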


Appendix 5: Submission models for key digital content categories

The aim of this product was to look at workflow models for different types of digital content, e.g. electronic publishing and digitisation of physical objects, and to recommend how metadata should be acquired and what metadata a SIP should contain.

At this stage it was felt not to be appropriate to develop submission models for ANU and UQ as their systems already have underlying data models and submission processes, both established and under development. This project has specified in Appendices 1, 4 and 6 what metadata is required, but leaves decisions on how to enhance systems to collect it to the repository administrators and developers. The National Library of Australia has been reviewing the architecture of its Digital Collections Manager and may in future develop general submission models which may be useful to other repositories.

Regardless of the type of digital content, the main methods of submission involve:

In each case a SIP is compiled which the repository can ingest. In the first two cases, the SIP is compiled after the web form is completed. In the latter two cases, the batch or harvested submission may already be in the form of a SIP compiled by an external workflow system or tool. Other APSR projects are developing examples of these tools e.g. in the Bidwern project and FIDAS (Fieldwork Data Sustainability) project (the tool is called FieldHelper). Among other things, these tools help researchers organise and tag their files, then automatically prepare the data for uploading to institutional repositories, for instance, by compiling SIP packages as METS documents for ingest to DSpace. Work is also being done with electronic journal publishing systems.

Whatever method is used for the actual submission, the aim should be to capture as much metadata as possible (automatically where possible) as a by-product of creating a digital object.
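
To give a sense of what compiling a SIP as a METS document involves, here is a deliberately minimal sketch. It is not the DSpace METS SIP profile and not the output of FieldHelper or any other APSR tool: the function `build_sip`, the file identifiers and the sample file names are hypothetical, and a real SIP would also carry descriptive, rights and preservation metadata sections.

```python
import hashlib
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_sip(files):
    """Build a bare METS document from a dict of relative path -> file content."""
    mets = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", USE="ORIGINAL")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div")
    for n, (path, content) in enumerate(sorted(files.items()), start=1):
        file_id = f"FILE{n}"
        # Size and checksum are captured automatically while packaging.
        file_el = ET.SubElement(
            file_grp, f"{{{METS_NS}}}file",
            ID=file_id,
            SIZE=str(len(content)),
            CHECKSUM=hashlib.sha256(content).hexdigest(),
            CHECKSUMTYPE="SHA-256",
        )
        ET.SubElement(file_el, f"{{{METS_NS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path})
        # The structural map points back at each file by its identifier.
        ET.SubElement(div, f"{{{METS_NS}}}fptr", FILEID=file_id)
    return ET.tostring(mets, encoding="unicode")


sip_xml = build_sip({"data/notes.txt": b"hello", "data/photo.tif": b"II*\x00"})
print(sip_xml)
```

The point of the sketch is that size and checksum are recorded as a by-product of packaging, without any extra effort from the depositor.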



Appendix 6: Preservation Event use cases and functional requirements

A6.1 Introduction

This document describes the requirements for

Lavoie says about actions in the OAIS Functional Model:

" ..the Archival Storage function is responsible for ensuring that archived content resides in appropriate forms of storage ... and that the bit streams comprising the preserved information remain complete and renderable over the long-term. To meet this responsibility, Archival Storage periodically undertakes procedures such as media refreshment or format migration. The Archival Storage function also implements various safeguard mechanisms, such as error-checking procedures, to evaluate the outcome of preservation processes, as well as disaster recovery policies to mitigate the effects of catastrophic events .."

The PREMIS Data Dictionary says about documenting events:

"An Event is an action that involves at least one object or agent known to the preservation repository." "Documentation of actions that modify (that is, create a new version of) a digital object is critical to maintaining digital provenance, a key element of authenticity." "Even actions that alter nothing, such as validity and integrity checks on objects, can be important to record for management purposes."

These requirements are primarily concerned with events in the above context, that is, actions, relevant to preservation, on "master" or archival copies of objects. It is recognised that repositories usually have other purposes in addition to preservation and that display copies, supporting files, metadata etc may exist in repositories as digital objects in addition to the archival "content" object. A repository may log actions and events for various purposes. The requirements listed here may therefore only be a subset of an individual repository's requirements.

These requirements are deliberately generalised in order to be applicable to any repository, regardless of any particular software, implementation or architecture. Repository administrators and developers would need to determine more specifically how the requirements would be implemented in their repositories.

The use cases below apply both to actions that are performed on a single object and actions that are performed on a batch of objects (or all objects) in a repository.

A6.2 Use cases

  1. Performing an action on an object which doesn't change the object e.g. error checking.
  2. Performing an action on an object which transforms an object into a new object (without materially changing its content) e.g. migration to a newer format.
  3. Deleting an object
  4. Updating the content of an object: An action performed in some repositories, not usually for preservation purposes, but included for clarification.
  5. Updating metadata about an object: It is desirable from a preservation point of view to have the most complete, accurate metadata available, therefore there needs to be a way of updating the metadata as new information comes to light.

A6.3 Actors

These are the Actors (roles) in the use cases below. The Actors are "systems" but these may be manual systems (i.e. people), automated systems or a mixture.

A6.4 Use case 1: Performing an action on an object which doesn't change the object

This use case applies to, for instance, PREMIS eventType

Message digest calculation and format validation, and if applicable, fixity check and virus check, should ideally be done on ingest of an object. In this case they may or may not be recorded as separate events but if not, they should be noted in the event details or in repository policy and procedures.

If not done at ingest, these events may occur some time later, when they may be recorded as separate events.

Trigger: Identification of preservation risk (by a person or preservation monitoring system) or part of an auditing process (one off or regularly scheduled)

  1. The Preservation Monitor or Workflow System alerts the Event Manager that an action needs to be performed on an object.
  2. The Event Manager schedules the event.
  3. The Event Manager performs the action.
  4. The Event Manager notifies the Repository that an event has taken place along with details of the event.
  5. The Repository records the event details. (see Event Details below).
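The steps above can be illustrated with a minimal sketch of a fixity check, an action that leaves the object untouched but still produces an event record. The class names (Event, Repository, EventManager), the in-memory storage and the choice of SHA-256 are assumptions made for illustration only; they are not taken from PREMIS or from any particular repository software.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    event_type: str   # e.g. "fixity check"
    object_id: str    # identifier of the object acted upon
    outcome: str      # "pass" or "fail"
    detail: str = ""
    date_time: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class Repository:
    """Hypothetical in-memory repository holding objects and an event log."""

    def __init__(self):
        self.objects = {}  # object_id -> (content bytes, digest stored at ingest)
        self.events = []

    def ingest(self, object_id, content):
        self.objects[object_id] = (content, hashlib.sha256(content).hexdigest())

    def record_event(self, event):
        # Step 5: the Repository records the event details.
        self.events.append(event)


class EventManager:
    def __init__(self, repository):
        self.repository = repository

    def fixity_check(self, object_id):
        # Step 3: perform the action; the object itself is only read.
        content, stored_digest = self.repository.objects[object_id]
        current_digest = hashlib.sha256(content).hexdigest()
        outcome = "pass" if current_digest == stored_digest else "fail"
        # Step 4: notify the Repository, passing the event details.
        event = Event("fixity check", object_id, outcome, detail="SHA-256 recomputed")
        self.repository.record_event(event)
        return event


repo = Repository()
repo.ingest("obj-1", b"archival master")
event = EventManager(repo).fixity_check("obj-1")
```

Scheduling (steps 1 and 2) is left out; in this sketch the caller simply invokes the check directly.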

A6.5 Use case 2: Performing an action on an object which transforms an object into a new object without materially altering the content.

This use case would apply to, for instance, PREMIS eventType

An event which changes the preservation copy of an object should always be recorded.

Trigger: Identification of preservation risk (by a person or preservation monitoring system) or implementation of a policy decision e.g. to migrate all files of a certain format to a newer, better supported format.

Base course: A new object is created and the old object is kept.

  1. The Preservation Monitor or Workflow System alerts the Event Manager that an action to change an object needs to be performed.
  2. The Event Manager schedules the event.
  3. The Event Manager takes a copy of the object, and modifies it to create a new object.
  4. The Event Manager submits the new object to the Repository along with details of the event which created it, including its relationship to the old object.
  5. The Repository ingests the new object, records the relationship between the new and old objects, applies version information (especially if the new object is the new master archival copy) and assigns a unique identifier to the new object.
  6. The Repository stores relevant preservation metadata about the new object.
  7. The Repository ensures descriptive, rights and any other relevant metadata from the old object are associated with the new object.
  8. The Repository records details of the event which created the new object and associates the event with the new and old objects.
  9. The Repository records the event which ingested the new object. If the ingest event is not stored explicitly the details must be able to be output to conform with the draft APSR METS profile (Appendix 8).
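As a rough sketch of the base course, the following shows a new object being created from a copy of the old one, with the relationship, the migration event and the ingest event all recorded. The repository class, the "derived from" relationship label, the identifier scheme and the character-encoding conversion standing in for a real format migration are all illustrative assumptions.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    object_id: str
    content: bytes
    format_name: str
    version: int
    descriptive_metadata: dict = field(default_factory=dict)


class Repository:
    """Hypothetical in-memory repository; all names are illustrative."""

    def __init__(self):
        self.objects = {}
        self.relationships = []  # (subject id, relationship, target id)
        self.events = []         # (event type, [linked object ids])
        self._ids = itertools.count(1)

    def _store(self, content, format_name, metadata, version):
        object_id = f"obj-{next(self._ids)}"  # unique identifier (step 5)
        self.objects[object_id] = StoredObject(
            object_id, content, format_name, version, dict(metadata)
        )
        return object_id

    def ingest(self, content, format_name, metadata=None):
        object_id = self._store(content, format_name, metadata or {}, 1)
        self.events.append(("ingestion", [object_id]))
        return object_id

    def ingest_migrated(self, old_id, new_content, new_format):
        old = self.objects[old_id]
        # Steps 5 to 7: store the new object with a new version number and
        # carry the old object's descriptive metadata across.
        new_id = self._store(
            new_content, new_format, old.descriptive_metadata, old.version + 1
        )
        self.relationships.append((new_id, "derived from", old_id))
        # Step 8: the creating event is linked to both objects.
        self.events.append(("migration", [old_id, new_id]))
        # Step 9: the ingest of the new object is recorded separately.
        self.events.append(("ingestion", [new_id]))
        return new_id


def migrate_latin1_to_utf8(repository, old_id):
    # Step 3: work on a copy; the old object is left unchanged.
    old = repository.objects[old_id]
    new_content = old.content.decode("latin-1").encode("utf-8")
    # Step 4: submit the new object with its relationship to the old one.
    return repository.ingest_migrated(old_id, new_content, "text/plain; charset=utf-8")


repo = Repository()
old_id = repo.ingest(b"caf\xe9", "text/plain; charset=iso-8859-1", {"title": "Menu"})
new_id = migrate_latin1_to_utf8(repo, old_id)
```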

Alternative course: A new object is created and the old object is not kept.

After step 9:

A6.6 Use case 3: Deleting an object from a repository

Repositories will have their own policies on the circumstances where deleting an object is allowed. It is expected that some metadata about an object will be kept even though the object itself is removed. This should be at least the object identifier and some descriptive metadata (or a link to it e.g. through a relationship with a current object).

Trigger: Implementation of a policy decision to delete an object. For instance, a decision to delete all objects of a certain type e.g. non-current versions of masters, or a policy to only keep certain objects for 10 years.

  1. Workflow System (after checking its rules about which objects can be deleted and by whom) or Preservation Monitor instructs the Repository to delete an object.
  2. The Repository checks whether the object to be deleted is part of the provenance history of the current "master" copy of an object. (Depending on the particular Repository, this may have been recorded in a provenance history, or through a relationship (direct or indirect) with the master object, or through an event or chain of events that led to the creation of the current master object.)
  3. If it is, the Repository should have rules about what preservation metadata needs to be kept (this should include at least the object's format). The repository administrator should be able to configure these rules.
  4. The Repository checks any other relationships, links or associations the object has. The Repository will have rules to deal with these before or when an object is deleted in order to maintain the integrity of the data.
  5. The Repository deletes the object and keeps any metadata required from the above checks.
  6. The Repository records the deletion event.
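A minimal sketch of the deletion steps follows, assuming objects are held as plain dictionaries and that provenance is expressed through a "derived from" relationship. The function name, the field names and the default retention rule (keep the format) are hypothetical; a real repository would apply its own configurable rules.

```python
def delete_object(objects, relationships, events, object_id, keep_fields=("format",)):
    """Remove an object's content while retaining a minimal metadata record."""
    record = objects[object_id]

    # Step 2: is the object part of the provenance history of another object?
    in_provenance = any(
        rel == "derived from" and target == object_id
        for _, rel, target in relationships
    )

    # Always keep the identifier and some descriptive metadata.
    retained = {"identifier": object_id, "title": record.get("title")}

    # Step 3: configurable rules decide which preservation metadata is kept.
    if in_provenance:
        retained.update({k: record[k] for k in keep_fields if k in record})

    # Step 4: in this sketch existing relationships are left in place, so
    # links from other objects still resolve to the retained record.

    # Step 5: delete the content and keep only the retained metadata.
    objects[object_id] = {"deleted": True, **retained}

    # Step 6: record the deletion event.
    events.append({"eventType": "deletion", "linkingObjectIdentifier": object_id})
    return retained


objects = {
    "obj-1": {"title": "Annual report", "format": "WordPerfect 5.1", "content": b"..."},
    "obj-2": {"title": "Annual report", "format": "PDF 1.4", "content": b"..."},
}
relationships = [("obj-2", "derived from", "obj-1")]
events = []
retained = delete_object(objects, relationships, events, "obj-1")
```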

A6.7 Use case 4: Updating the content of an object

An example of this use case is a depositor changing the content of a document.

Particularly for internal documents, reports etc, the Workflow System may well assign a version number to the new document. However, although this can be regarded as a new "version" of the old object, it is different from Use case 2 above. The repository should differentiate between different versions of the same content, and "versions" where the content is not the same.

Instead, this new "version" should be regarded as a new "work" (PREMIS Intellectual Entity) with a relationship to the old "work". It should have its own descriptive metadata distinct from the descriptive metadata of the old work, similar to the way different editions of a book have their own records in a library catalogue.

This new work should be able to have its own preservation policy. For example, the latest "version" may need to be kept indefinitely, whereas the earlier "versions" may only need to be kept for a defined period. Or the policy may be to only keep the latest "version" (which has been authorised through the Workflow System before submission to the Repository) and delete previous "versions" immediately.

Trigger: The depositor may use a Workflow System to take a copy of the object in the repository, edit it and re-submit it, or the depositor may edit their own local copy of the original and submit it through the Workflow System as a new "version" of the original object.

  1. Workflow System accepts the new object.
  2. Workflow System submits the new object to the Repository. The Workflow System may provide a complete OAIS SIP, or only information that is different for this object.
  3. The Repository ingests the new object, assigns a unique identifier to the new object and records the relationship between the new and old objects. This relationship may be recorded through the descriptive metadata only or may be more explicit in the system.
  4. The Repository may associate updated descriptive, rights and any other relevant metadata from the old object with the new object, if the Workflow System only provides updated information.
  5. The Repository stores relevant preservation metadata about the new object.
  6. The Repository records the event which ingested the new object. If the ingest event is not stored explicitly the details must be able to be output to conform with the draft APSR METS profile.

A6.8 Use case 5: Updating metadata about an object

For example, the descriptive metadata about a photograph may need to be changed when new information comes to light about the people depicted in it.

This use case may be applied to administrative, structural etc as well as descriptive metadata.

It is up to individual repositories whether only one version or different versions of the metadata are kept, or even if a record of changes is kept. From a preservation point of view the authenticity of the content object is most important and keeping a record of changes to the content object is mandatory, but it is optional for the metadata. Whether or not to keep previous versions of the metadata depends on its significance and what it might be used for. It is however usual to keep at least the date the metadata was originally created and by whom (organisation rather than person) and the date it was last updated and by whom.

  1. Workflow System accepts new version of the metadata.
  2. Workflow System sends new version of metadata with the object identifier to the Repository.
  3. The Repository stores the new version of the metadata as the current version and associates it with the object.
  4. The Repository may or may not keep previous versions of the metadata.
  5. The Repository keeps the date the first version of the metadata was created and updates the date last updated to today's date. The Repository may or may not keep other dates.
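The following sketch shows one way the steps above might look, assuming each object's metadata is held in a dictionary with "current", "created" and "last_updated" entries. These field names, and the keep_history switch that models the optional retention of previous versions, are illustrative assumptions.

```python
from datetime import date


def update_metadata(store, object_id, new_metadata, updated_by,
                    keep_history=False, today=None):
    """Replace the current metadata for an object and update its dates."""
    today = today or date.today().isoformat()
    entry = store[object_id]

    # Step 4: keeping previous versions is optional.
    if keep_history:
        entry.setdefault("previous", []).append(entry["current"])

    # Step 3: the new version becomes the current version.
    entry["current"] = new_metadata

    # Step 5: the creation date is untouched; only the update details change.
    entry["last_updated"] = today
    entry["last_updated_by"] = updated_by
    return entry


store = {
    "obj-1": {
        "current": {"title": "Group portrait", "people": "unidentified"},
        "created": "2006-01-10",
        "created_by": "Example Library",
        "last_updated": "2006-01-10",
        "last_updated_by": "Example Library",
    }
}
entry = update_metadata(
    store,
    "obj-1",
    {"title": "Group portrait", "people": "A. Smith; B. Jones"},
    updated_by="Example Library",
    keep_history=True,
    today="2006-06-30",
)
```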

A6.9 Event Details

What details are recorded about an event and how they are stored will depend on the particular Repository's data model and architecture.

For the purposes of publishing or exchanging metadata about an archival object, the Repository should be able to conform to the proposed APSR METS profile. This profile specifies that a history of events describing an object's provenance be output in digiprovMD using the schema for the PREMIS Event Entity (see the PREMIS Data Dictionary.)
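As an illustration only, the sketch below builds a digiprovMD section wrapping a single PREMIS event using Python's standard xml.etree.ElementTree module. The PREMIS namespace URI, the MDTYPE value, the ID pattern and the "local" identifier type are assumptions that should be checked against the PREMIS schema version and the METS profile actually in use; only a few of the Event semantic units are shown.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
# Assumed namespace for the first version of the PREMIS schemas.
PREMIS_NS = "http://www.loc.gov/standards/premis/v1"

ET.register_namespace("mets", METS_NS)
ET.register_namespace("premis", PREMIS_NS)


def premis_event_in_digiprov(event_id, event_type, date_time, object_id):
    """Return a METS digiprovMD element containing one PREMIS event."""
    digiprov = ET.Element(f"{{{METS_NS}}}digiprovMD", {"ID": f"DP-{event_id}"})
    # MDTYPE="PREMIS" is an assumption; some profiles may require another value.
    wrap = ET.SubElement(digiprov, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "PREMIS"})
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    event = ET.SubElement(xml_data, f"{{{PREMIS_NS}}}event")

    identifier = ET.SubElement(event, f"{{{PREMIS_NS}}}eventIdentifier")
    ET.SubElement(identifier, f"{{{PREMIS_NS}}}eventIdentifierType").text = "local"
    ET.SubElement(identifier, f"{{{PREMIS_NS}}}eventIdentifierValue").text = event_id

    ET.SubElement(event, f"{{{PREMIS_NS}}}eventType").text = event_type
    ET.SubElement(event, f"{{{PREMIS_NS}}}eventDateTime").text = date_time

    linked = ET.SubElement(event, f"{{{PREMIS_NS}}}linkingObjectIdentifier")
    ET.SubElement(linked, f"{{{PREMIS_NS}}}linkingObjectIdentifierType").text = "local"
    ET.SubElement(linked, f"{{{PREMIS_NS}}}linkingObjectIdentifierValue").text = object_id
    return digiprov


element = premis_event_in_digiprov("ev-1", "migration", "2006-06-30T10:00:00", "obj-2")
xml_text = ET.tostring(element, encoding="unicode")
```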

These are the semantic units of the PREMIS Event Entity (NR=not repeatable; R=repeatable; M=mandatory; O=optional):

The profile also says that additional information about agents associated with events may optionally be recorded. Agents may be persons, organisations or software.
The PREMIS Agent Entity has the following semantic units:

Even if there is no additional information for software or a device, placing it in Agent in a document to conform with the draft APSR METS profile will facilitate mapping in the receiving repository's database.

Additional information considered useful but not covered by PREMIS should be recorded. Other more detailed schemas to describe events may emerge and/or PREMIS Event may be enhanced in the future.

A6.10 Recommendations

  1. The ANU and UQ repositories should take particular note of the functional requirements for preservation events in this appendix and bring them to the attention of their open source communities as they begin to develop event logging functionality.


Appendix 7: Issues / enhancements to PREMIS and existing schemas and protocols that might be used.

A7.1 PREMIS conformance

This project is using PREMIS in two ways:

The APSR repositories will aim to be PREMIS conformant insofar as they can produce a METS document with metadata in a container using a PREMIS namespace that is valid according to the PREMIS XML schemas. If the data for a mandatory semantic unit were not available, the unit would have to be included with a value of "unknown" or "not applicable". However, in this case, i.e. if the repository could not supply a real value for a mandatory semantic unit, the repository could be regarded as not PREMIS conformant.

A7.2 Issues encountered in PREMIS

Some issues encountered while examining the PREMIS Data Dictionary were raised with the PREMIS Implementors' Group and added to the errata for fixing in the next version of the data dictionary e.g.

An interpretation issue was found with "relationship" in the PREMIS Data Dictionary:

STRUCTURAL RELATIONSHIPS: Under relationshipType on page 2-62 it says "structural=a relationship between parts of an object". This accords with what PREMIS says on page 1-8 i.e. structural relationships are about how to put back together a digital object which consists of more than one part or file. However the paragraph under Derivation relationships on page 1-9 says "A structural relationship among objects can be established by an act of derivation before the objects were ingested by the repository ... " and "..They do not have derivation relationships with each other, but do have a structural relationship as siblings (children of a common parent)". It's confusing to describe this as a structural relationship because the 'siblings' are not part of the same digital object - they belong to different representations.

"PARENT" AND "CHILD": On page 2-63 it says "is child of = the object is directly subordinate in a hierarchy to the related object ..." and "is parent of = the object is directly superior in a hierarchy to the related object ...", but it doesn't say what the hierarchy relates to. In the paragraph (on page 1-9) referred to above, "parent" refers to the object from which the "children" are derived, whereas on page 6-5 "children" is used to describe components of a web site. In the former case the parent has a "source of" relationship with the children; in the latter case the children have an "is part of" relationship with the (parent) website. In NLA's Digital Collections Manager system the term "child" is used to denote "part of" at the Intellectual Entity level. Because "parent" and "child" can be used in various contexts, it is recommended to avoid "is parent of", "is child of", "has child" and "has parent" in relationshipSubType and that more precise terms such as "source of", "derived from", "is part of", "has part" be used instead.

Allowing reciprocal relationships to be described in two places can give rise to data integrity problems e.g. one object may have an "is part of" relationship to a second object, which may in turn have an "is part of" relationship to the first object.
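A repository could guard against this kind of contradiction with a simple check, sketched below. The representation of relationships as (subject, relationship, target) tuples and the function name are assumptions made for illustration.

```python
def contradictory_part_of(relationships):
    """Return pairs of objects that each claim to be part of the other."""
    part_of = {(s, t) for s, rel, t in relationships if rel == "is part of"}
    # Report each contradictory pair once, in a stable order.
    return sorted((s, t) for s, t in part_of if (t, s) in part_of and s < t)


relationships = [
    ("page-1", "is part of", "book-1"),
    ("book-1", "is part of", "page-1"),   # contradicts the line above
    ("page-2", "is part of", "book-1"),
]
problems = contradictory_part_of(relationships)
```

In this example the check reports the pair ("book-1", "page-1"), while the consistent relationship for "page-2" is not flagged.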

Integrity problems could also arise because two-way linking is allowed between Object and Event. An Object can be linked to an Event through the Object's semantic units "relatedEventIdentification" or "linkingEventIdentifier" and an Event can be linked to an Object through the Event's semantic unit "linkingObjectIdentifier".

Another issue with Events and linkingObjectIdentifier is that there is no way of saying (if applicable) which was the "source" object and which the "output" object other than by referring back to one of the objects to find its relationship to the other object, and again there is the potential for inconsistency.
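Where links are stored on both sides, a consistency check along the following lines could detect links recorded in one direction only. The two dictionaries stand in for the Object-side and Event-side links respectively; their structure and the function name are hypothetical.

```python
def inconsistent_links(object_to_events, event_to_objects):
    """Return (object id, event id) links that are recorded on one side only."""
    from_objects = {
        (obj, ev) for obj, evs in object_to_events.items() for ev in evs
    }
    from_events = {
        (obj, ev) for ev, objs in event_to_objects.items() for obj in objs
    }
    # The symmetric difference holds links missing from either side.
    return sorted(from_objects ^ from_events)


object_to_events = {"obj-1": ["ev-1"], "obj-2": ["ev-1", "ev-2"]}
event_to_objects = {"ev-1": ["obj-1", "obj-2"], "ev-2": []}
mismatches = inconsistent_links(object_to_events, event_to_objects)
```

Here the link between "obj-2" and "ev-2" is recorded on the Object but not on the Event, so it is the only mismatch returned. The check does not address the separate problem of telling source objects from output objects.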

A7.3 Proposals for enhancements to PREMIS

At this stage, other than fixing issues in the first version of the data dictionary, no particular enhancements have been identified. However, once repositories begin to use PREMIS, desired enhancements will probably be identified.

Proposals for enhancements will be sent for discussion to the PREMIS Implementors' Group list and will be formally submitted to the Editorial Committee for the PREMIS Maintenance Activity, which the National Library of Australia has been invited to join.

A7.4 Other existing schemas and protocols

Other schemas and protocols recommended in this report are