SCAPE Training Outline

Session One

10:15 - 10:30 Introduction

Dave to cover the introduction, bearing in mind the outline below. How long it runs is up to you, Dave.

10:30 - 11:00 Format Identification

Concepts

  • What is a format?
    • Once round the room getting each delegate to describe their role and why format identification matters to them
    • Describe other communities and orgs for which format id matters:
      • Operating Systems: In order to work out what program can decode/render a particular file.
      • Web Servers: In order to set the MIME type in the Content-Type header of a response.
      • Memory Institutions: In order to identify the software stack needed to render a particular file.
      • More Generally: Anyone with digital data. Recognising the format is a necessary step to identifying software that can extract meaning from the data.
    • Format is used to associate a sequence of bytes with software that can extract meaning from those bytes.
    • Format is also used to associate a sequence of bytes with documentation / specifications describing how meaning is encoded within a stream of bytes, e.g. text documents, PDFs, MP3s, JPGs.
    • Format data is used for preservation planning. How do you know if your collection is vulnerable to a format-specific risk if you don't know that you have files of that format?
  • What identifiers are used for format?
  • There are multiple schemes in operation
    • Extension: Used by Windows; brittle, since renaming a file changes its apparent format
    • MIME type: Used across the internet, e.g. to negotiate content between client and server
    • Uniform Type Identifier: Apple's second solution
    • Type / Creator codes: Apple's first solution
    • PRONOM Unique Identifier: Used within the DP community but not widespread.
  • How do tools go about identifying format?
  • Let's look at how the tools we're using in the coming session go about identifying a few common formats.
    • PDFs
    • MS Office
    • JPG
    • MP3
    • HTML
  • Methods of identification compared (Extension, Magic / Sigs, regexes)
  • Format is a property (albeit an important one) and may (will) change over time in some cases
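
For the methods comparison, a minimal Python sketch may help show why extension and content-based identification can disagree. The signature table, the labels and the function names are illustrative assumptions, not taken from the signature files of File, Fido or Tika, which are far richer.

```python
import mimetypes
import re

# Illustrative signature table: (leading magic bytes, label).
# The labels are assumptions for this sketch, not any tool's real output.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # also the container for newer Office files
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 container (legacy MS Office)"),
    (b"ID3", "audio/mpeg"),  # only MP3s that start with an ID3v2 tag
]

# HTML has no fixed magic number, so a regex over the first bytes stands in.
HTML_PATTERN = re.compile(rb"<!doctype\s+html|<html", re.IGNORECASE)


def identify_by_extension(filename):
    """Guess from the name alone; returns None when the extension is unknown."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def identify_by_content(data):
    """Guess from the bytes themselves; returns None when nothing matches."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    if HTML_PATTERN.search(data[:1024]):
        return "text/html"
    return None


# A PDF that has been renamed to .jpg: the two methods disagree.
name, content = "report.jpg", b"%PDF-1.4\n..."
by_extension = identify_by_extension(name)
by_magic = identify_by_content(content)
print(by_extension, by_magic)
```

The renamed file is the useful talking point: the extension method trusts the name, while the signature method trusts the bytes.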

Tools

Describe the three format identification tools

  • File - standard Linux format id tool, gnu FF (provide links for Cygwin instructions)
  • Fido - Python-based format identification tool, derivative of DROID
  • Tika - Apache Java format identification

Look at how the tools go about their job, how they recognise formats, briefly cover sig files.

11:00 - 11:15 Format Validation

Concepts

  • Distinguish between identification and validation
  • Show occasions where the line gets blurred between the two
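
One way to illustrate the distinction is a small Python sketch using PDF. The function names are hypothetical, and the "validation" here is deliberately shallow (a versioned header plus an end-of-file marker), nowhere near what a real validator checks.

```python
import re


def looks_like_pdf(data):
    """Identification: does the stream start with the PDF header signature?"""
    return data.startswith(b"%PDF-")


def pdf_version(data):
    """Read the version from the header, e.g. '1.4'; None if it is absent."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None


def passes_basic_pdf_checks(data):
    """Shallow validity check: versioned header and an %%EOF marker near the end."""
    return pdf_version(data) is not None and b"%%EOF" in data[-1024:]


truncated = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\n"
complete = truncated + b"endobj\ntrailer\n<< >>\n%%EOF\n"

for label, data in (("truncated", truncated), ("complete", complete)):
    print(label, looks_like_pdf(data), pdf_version(data), passes_basic_pdf_checks(data))
```

Both samples identify as PDF, but only one passes the extra checks. Reading the version number already means parsing the header, which is one place where the line between identification and validation gets blurred.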

11:30 - 11:45 Characterisation

Concepts

  • Characterisation is nearly always format specific
  • Characterisation data is not standard between formats or tools
  • Characterisation nearly always involves parsing and validation
  • Not as reliable as identification: the process is more complex and parsers fail

Tools

Describe the characterisation tools

  • Tika - Apache Java characterisation
  • ExifTool - cross-platform characterisation tool specialising in EXIF data

11:45 - 12:45 Demonstrations

Scenarios to include some of (all if time):

  • Demonstrate format identification across a homogeneous collection
  • Identification of compound object components (zip, iso, etc.)
  • Identifying versions of a format, e.g. PDF, TIFF
  • Office formats
  • Characterisation data, quantity and variety
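
For the compound object scenario, a minimal Python sketch using only the standard library shows the idea of identifying the container first and then each component. The tiny signature table and the `identify` / `identify_members` names are assumptions for illustration, not how any of the three tools is implemented.

```python
import io
import zipfile

# Tiny illustrative signature table; real tools use much larger signature files.
MAGIC = {b"%PDF-": "PDF", b"\xff\xd8\xff": "JPEG", b"PK\x03\x04": "ZIP"}


def identify(data):
    """Return a label for the first matching signature, or 'unknown'."""
    for magic, label in MAGIC.items():
        if data.startswith(magic):
            return label
    return "unknown"


def identify_members(zip_bytes):
    """Return {member name: format label} for each file inside a zip container."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as member:
                results[info.filename] = identify(member.read(16))
    return results


# Build a small container in memory so the sketch needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docs/report.pdf", b"%PDF-1.7\n")
    archive.writestr("images/photo.jpg", b"\xff\xd8\xff\xe0fake")
    archive.writestr("notes.txt", b"plain text")
container = buffer.getvalue()

container_format = identify(container)
member_formats = identify_members(container)
print(container_format, member_formats)
```

Identifying only the outer file reports a single zip, while looking inside gives one result per component.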

13:45 - 14:00 Introduce Exercise

Aim to create a format profile using the three different format identification tools (a tool for each group). I'll cover some command line tips and give out a cheat sheet. Each group works on the same set of files and finds out how many formats are in the set and how many files there are of each format.

Bonus exercise, try to do the same thing with characterisation data if time. This will demonstrate the increase in both quantity and variety of data, plus the more delicate nature of characterisation.
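
The format profile itself is just a count of files per format. A minimal Python sketch, assuming the tool output has already been captured as `path: format` lines (the sample lines are made up for illustration, and each tool's real output will need its own parsing):

```python
from collections import Counter


def format_profile(lines):
    """Count files per format from 'path: format' lines; blank lines are skipped."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last ': ' so that colons inside the path do no harm.
        _path, _sep, fmt = line.rpartition(": ")
        counts[fmt.strip()] += 1
    return counts


# Made-up sample output, one line per file.
sample_output = """\
scans/page1.tif: image/tiff
scans/page2.tif: image/tiff
docs/report.pdf: application/pdf
docs/minutes.doc: application/msword
docs/annual.pdf: application/pdf
"""

profile = format_profile(sample_output.splitlines())
for fmt, count in profile.most_common():
    print(f"{count:4d}  {fmt}")
```

The number of distinct formats is `len(profile)` and the per-format counts are the values, which covers both questions in the exercise.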

14:00 - 15:00 Practical Exercises

15:00 - 15:15 Evaluate Results and Summarise

The messages to take away are that performing characterisation at scale takes time, and that it will probably have to be repeated as both format identification and characterisation tools improve.

These tools are ideally combined in performant workflows, where components can be upgraded. For large collections these workflows may benefit from parallelisation.
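
As a rough illustration of the parallelisation point, here is a Python sketch that runs an identification step over many files with a thread pool. `identify_file` is a hypothetical stand-in for whichever tool a workflow calls, and the temporary files exist only to make the sketch runnable.

```python
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def identify_file(path):
    """Hypothetical identification step: read the first bytes, match a signature."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(b"%PDF-"):
        return "PDF"
    if head.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    return "unknown"


def profile_in_parallel(paths, identify=identify_file, workers=4):
    """Run the identification step over all paths concurrently and count results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(identify, paths))


# Create a small throwaway collection to run against.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i in range(6):
        (root / f"doc{i}.bin").write_bytes(b"%PDF-1.4\n")
    for i in range(3):
        (root / f"img{i}.bin").write_bytes(b"\xff\xd8\xff\xe0")
    (root / "other.bin").write_bytes(b"hello")
    parallel_profile = profile_in_parallel(sorted(root.iterdir()))

print(parallel_profile)
```

Passing the identification step in as a parameter is one way to keep the component upgradeable. Threads suit work dominated by file reads; heavier parsing might be better served by separate processes.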

Taming characterisation data is not easy, and if you use a variety of tools you're going to have a lot of it.