• Introduction to standards and protocols and their need and importance in digital libraries
• Impart knowledge on interoperability and data exchange in digital libraries and the roles of standards and protocols
• Impart knowledge on important standards and protocols applicable for digital libraries
II. Learning Outcomes
After going through this lesson, learners would understand that standards and protocols have a major role and importance in building a digital library. They would learn about various categories of standards and protocols that are used in building a digital library, including communication protocols, bibliographic standards, record structure standards, encoding standards, information retrieval standards and digital preservation standards.
2. Standards & Protocols: Definition and Importance
3. Communication Protocols
3.1 Transmission Control Protocol / Internet Protocol (TCP/IP)
3.2 Hyper Text Transfer Protocol (http)
3.3 File Transfer Protocol (ftp)
4. Bibliographic Standards
4.1. Machine Readable Catalogue (MARC)
4.2. Dublin Core
4.3. BIB-1
4.4. Text Encoding Initiative (TEI)
4.5. Encoded Archival Description (EAD)
4.6. Metadata Encoding and Transmission Standard (METS)
4.7. Metadata Object Description Schema (MODS)
5. Record structure
5.1. ISO 2709 / Z39.2
6. Encoding Standards
7. Information Retrieval Standards
7.1. Z39.50 or ISO 23950
7.2. Search/Retrieve Web Service (SRW) and Search/Retrieve via URL (SRU)
7.3. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
7.4. Open URL
8. Formats and Media Types
8.1. Formats and Encoding used for Text
8.2. Page Image Format
9. Preservation Standards
9.2 Open Archival Information System (OAIS)
Building a digital library requires a number of software and hardware components that are not available off the shelf as a packaged solution in the marketplace. There are no turn-key, monolithic systems for digital libraries. Instead, a digital library is a collection of disparate systems and resources connected through a network, made interoperable using an open system architecture and open protocols and standards, and integrated within one web-based interface. Open architecture and open standards make it possible to gather the required pieces of infrastructure, be it hardware, software or accessories, from different vendors and integrate them into a working digital library environment. Several components required for setting up a digital library are internal to the institution, but several others are distributed across the Internet, owned and controlled by a large number of independent players. Building a digital library therefore requires seamless integration of many components for efficient functioning. As such, standards and protocols have a major role to play in building a digital library.
Standards and protocols are the backbone of a digital library. They are instrumental in implementing it efficiently, with quality and consistency, and they facilitate interoperability, data transfer and exchange between systems. Uniform standards and protocols are a prerequisite for data transfer, exchange and interoperability amongst digital libraries.
This chapter introduces standards and protocols, their role and importance in building digital libraries. The chapter describes important digital library standards and protocols used for digital communication, bibliographic data rendering, record structure, encoding standards to handle multi-lingual records, information retrieval standards, formats and media types used in the digital library and digital preservation.
2. Standards and Protocols: Definition and Importance
A protocol is a series of prescribed steps to be taken, usually in order to allow for the coordinated action of multiple parties. In the world of computers, protocols allow different computers and/or software applications to work and communicate with one another. Because computer protocols are frequently formalized by national and international standards organizations such as ISO and ITU, they are also considered standards. As such, a protocol that is accepted by most of the parties that implement it can be considered a standard. However, not every protocol is a standard, and not every standard is a protocol. Standards are generally agreed-upon models for comparison. In the world of computers, standards are often used to define syntactic or other rule sets, and occasionally protocols, that are used as a basis for comparison (Ascher Interactive, 2007).
Standards support cooperative relationships amongst multiple vendors and implementors and provide a common base from which individual developments may emerge. Standards make it possible to share and collaborate in the development of products and processes across institutional and political boundaries. However, too many standards for the same product or process undermine the utility of having any standard at all. Bibliographic citation is a good example: there are numerous rival and incompatible standards for citing a document, for example those of the American Psychological Association, the Modern Language Association and the Chicago Manual of Style, Indian standards, ANSI Z39.29 (American National Standard for Bibliographic References) and several other well-known standards that editors or publishers can adopt as their own.
Standards are supported by a range of national and international organizations, including professional associations such as the Institute of Electrical and Electronics Engineers (IEEE), national standards institutions such as the American National Standards Institute (ANSI), the British Standards Institution (BSI) or the Bureau of Indian Standards, and international bodies such as the International Organization for Standardization (ISO). The US National Information Standards Organization (NISO), accredited by ANSI, is specifically assigned the task of preparing standards for library and information science.
A number of important institutions and organizations are actively involved in the development and promotion of standards relevant to digital libraries. For example, the Digital Library Federation (DLF), a consortium of libraries and related agencies, has as one of its objectives the identification of standards for digital collections and network access (http://www.diglib.org). The DLF operates under the administrative umbrella of the Council on Library and Information Resources (http://www.clir.org) located in Washington, DC. The Library of Congress (http://www.loc.gov) plays an important role in maintaining several key standards such as MARC, and in the development of MARC within an XML environment. The International Federation of Library Associations and Institutions (IFLA) maintains a gateway, IFLANET Digital Libraries, to resources about a variety of relevant standards (http://www.ifla.org/II/metadata.htm).
3. Communication Protocols
Communication protocols are predefined sets of prompts and responses which two computers follow while communicating with each other. Since digital libraries are designed around Internet and Web technologies, communication protocols such as Transmission Control Protocol / Internet Protocol (TCP/IP), Hyper Text Transfer Protocol (http) and File Transfer Protocol (ftp) that are used by the Internet are also used for establishing communication between clients and servers in a digital library.
3.1 Transmission Control Protocol / Internet Protocol (TCP/IP)
The Internet is a packet-switched network, wherein the information to be communicated is broken down into small packets. These packets are sent individually, possibly over several different routes, and then reassembled at the receiving end. TCP is the component that collects and reassembles the packets of data, while IP is responsible for ensuring that the packets are sent to the right destination. TCP/IP was developed in the 1970s and adopted as the protocol standard for ARPANET, the predecessor to the Internet, in 1983.
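The idea of splitting a message into packets and reassembling it at the receiving end can be illustrated with a minimal sketch. This is not how TCP is actually implemented; the function names and the fixed packet size are assumptions made for illustration, and the shuffle merely simulates packets arriving out of order.

```python
import random


def split_into_packets(message: bytes, size: int) -> list:
    """Cut a message into (sequence number, payload) pairs of at most `size` bytes."""
    return [
        (seq, message[start:start + size])
        for seq, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets: list) -> bytes:
    """Restore the original order using the sequence numbers and join the payloads."""
    return b"".join(payload for _, payload in sorted(packets))


message = b"Standards and protocols make digital libraries interoperable."
packets = split_into_packets(message, size=8)

# Simulate packets taking different routes and arriving out of order.
random.Random(42).shuffle(packets)

assert reassemble(packets) == message
```

The sequence number carried by each packet is what lets the receiver rebuild the message regardless of arrival order.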
TCP/IP is the protocol that controls the creation of transmission paths between computers on a single network as well as between different networks. The standard defines how electronic devices (like computers) should be connected to the Internet, and how data should be transmitted between them. This protocol is used universally for public networks and many in-house local area networks. Originally designed for the UNIX operating system, TCP/IP software is now available for every major kind of computer operating system and is a de facto standard for transmitting data over networks.
Moreover, TCP/IP includes commands and facilities for transferring files between systems, logging in to remote systems, running commands on remote systems, printing files on remote systems, sending electronic mail to remote users, conversing interactively with remote users, managing a network, etc. Fig. 1 and Fig. 2 given below are pictorial depictions of the TCP/IP model.
Fig.1: TCP/IP Model Used for Connecting different Nodes in a Network
Fig.2: TCP/IP Layers Involved in Transmission of a Mail
3.2 Hyper Text Transfer Protocol (http)
HTTP is the underlying protocol used by the World Wide Web (WWW) to define how messages are formatted and transmitted. It needs an HTTP client program (an Internet browser) on one end and an HTTP server program on the other. The protocol carries requests from clients to the server and returns pages to the client. It is also used for sending requests from one server to another. HTTP is the most important protocol used on the WWW.
HTTP runs on top of the TCP/IP protocol. Web browsers are HTTP clients that send file requests to Web servers, which, in turn, handle the requests via an HTTP service. HTTP was originally proposed in 1989 by Tim Berners-Lee, who was a coauthor of the 1.0 specification. HTTP is “stateless”, i.e. the server treats each request independently and keeps no memory of earlier requests from the same client. In version 1.0, in addition, each new request from a client required setting up a new connection instead of handling all requests from the same client through the same connection. The earliest form of the protocol provided only for raw data transfer across the Internet, and version 1.0 added MIME-like messages that carry information about the data being transferred. Version 1.1 improved the protocol further with persistent connections, decompression of HTML files by client browsers and multiple domain names sharing the same IP address. Fig. 3 is a pictorial depiction of client-server interaction using the HTTP protocol.
Fig.3: Client-server Interaction using http Protocol
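To make the message format concrete, the following sketch builds the text of an HTTP/1.1 GET request and parses a sample response without touching the network. The host name, path and sample response are assumptions made for illustration, and the parser is deliberately simplified (it ignores repeated headers and chunked bodies).

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Format a minimal HTTP/1.1 GET request as it would be sent over TCP."""
    lines = [
        f"GET {path} HTTP/1.1",    # request line: method, resource, version
        f"Host: {host}",           # lets several domain names share one IP address
        "Connection: keep-alive",  # ask to reuse the connection for later requests
        "",                        # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a raw response into status code, reason phrase, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


request = build_get_request("example.org", "/catalogue/index.html")

# A hand-written sample of what a server might send back.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
status, reason, headers, body = parse_response(sample)
```

Each request is self-contained: nothing in it refers to an earlier request, which is what statelessness means in practice.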
3.3 File Transfer Protocol (ftp)
The File Transfer Protocol (FTP), as its name indicates, is a protocol for transferring files from one computer to another over a local area network (LAN) or a wide area network (WAN) such as the Internet. It is a common method of moving files between client and server over a TCP/IP network. The protocol has been in existence since 1971, when the file transfer system was first implemented between MIT machines. FTP provides for reliable and swift exchange of files between machines with different operating systems and architectures. Many Internet sites have established publicly accessible repositories of material that can be obtained using FTP by logging in with the account name “anonymous”; these sites are called anonymous FTP servers. Fig. 4 is a pictorial depiction of the FTP process model.
Fig.4: FTP Process Model
4. Bibliographic Standards
Bibliographic standards are concerned with the description of the contents as well as the physical attributes of documents and non-documents in a library. They are generally very complex (MARC has some 700 field definitions) and cover the most difficult and intellectual part of the object definition (Day, 2001). These definitions are necessary for processing the material and also for searching it. Most digital library software supports the Dublin Core metadata element set for bibliographic records.
4.1 Machine Readable Catalogue (MARC)
MARC (MAchine-Readable Cataloging) standards are a set of formats for the description of documents catalogued by libraries, including books, journals, conference proceedings, CD-ROMs, etc. ‘Machine-readable’ essentially means that a computer can read and interpret the data given in the cataloguing record. MARC was developed in the 1960s by the US Library of Congress to create records that can be used by computers and shared among libraries. MARC contains bibliographic elements for content, physical and process description. MARC formats had become the US national standard by 1971 and an international standard by 1973. There are several versions of MARC in use around the world, the most predominant being MARC 21, created in 1999 as a result of the harmonization of the U.S. and Canadian MARC formats, and UNIMARC, widely used in Europe. The MARC 21 family of standards now includes formats for authority records, holdings records, classification schedules and community information, in addition to the format for bibliographic records (Furrie, 2003).
4.2 Dublin Core
The Dublin Core refers to a set of metadata elements that may be assigned to web pages so as to facilitate discovery of electronic resources. Originally conceived for author-generated description of web resources at the OCLC/NCSA Metadata Workshop held at Dublin, Ohio in 1995, it has attracted the attention of formal resource description communities such as museums, libraries, government agencies and commercial organizations. The Dublin Core Workshop Series has gathered experts from the library world, the networking and digital library research communities, and a variety of content specialists in a series of invitational workshops. The building of an interdisciplinary, international consensus around a core element set is the central feature of the Dublin Core. The 15 core elements of Dublin Core are: Title, Creator, Subject and Keywords, Description, Publisher, Contributor, Date, Resource Type, Format, Resource Identifier, Source, Language, Relation, Coverage and Rights Management (Baker, 1998).
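As an illustration, the sketch below serializes a few Dublin Core elements as XML using the element names and namespace of the Dublin Core Metadata Element Set. The wrapper element `record`, the function name and the sample values are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Machine names of the 15 core elements.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(values: dict) -> str:
    """Return an XML string with one dc:* child per supplied element."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in values.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = dublin_core_record({
    "title": "Standards and Protocols for Digital Libraries",
    "creator": "Example Author",
    "language": "en",
    "format": "text/html",
})
```

All 15 elements are optional and repeatable in Dublin Core; this sketch accepts one value per element only to stay short.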
4.3 BIB-1
BIB-1 is a simplified record structure for online transmission. It is essentially a sub-set of MARC. It is the original format for transmission of records within a Z39.50 dialogue between two systems. It has elements that are mappable to both MARC and the Dublin Core (Library of Congress, 2007).
4.4 Text Encoding Initiative (TEI)
The Text Encoding Initiative provides a scheme for encoding text so that parts of it, such as the start and end of lines, paragraphs, pages, chapters, acts, and so on, can be marked. Such text can then be processed to produce accurate indexes for searching. Other features of the text, both grammatical and linguistic, and content-indicating features such as the actors in a play, can also be identified, allowing for rich analysis. These rules require that the actual text be marked up with SGML or, in later versions of the guidelines, XML encoding (TEI, 2013).
4.5 Encoded Archival Description (EAD)
EAD is an encoding scheme devised within the SGML framework to define the content description of documents and other archival objects. It is defined with a minimum number of descriptive elements, but in an extensible fashion. It is designed to create descriptive records which will assist in searching for the original material in a number of ways (Library of Congress, 2007).
4.6 Metadata Encoding and Transmission Standard (METS)
METS has the task of encoding descriptive, administrative and structural metadata for objects in a digital library to facilitate the management of such documents within a repository and their exchange between repositories. It is maintained by the Network Development and MARC Standards Office of the Library of Congress (http://www.loc.gov/standards/mets) and is an initiative of the Digital Library Federation, mentioned earlier in the Chapter (Library of Congress, 2013). The METS format has seven major sections:
i. The METS Header contains metadata describing the METS document itself, including such information as creator or editor.
ii. The Descriptive Metadata section points to descriptive metadata external to the METS document (such as a MARC record in an OPAC or an EAD finding aid on a web server), or contains internally embedded descriptive metadata, or both.
iii. The Administrative Metadata section provides information about how the files were created and stored, intellectual property rights, the original source from which the digital library object derives, and information regarding the provenance of the files comprising the digital library object (that is, master / derivative file relationships and migration / transformation information). As with Descriptive Metadata, Administrative Metadata may be either external to the METS document or encoded internally.
iv. The File section lists all the files containing content that form part of the digital document.
v. The Structural Map provides a hierarchical structure for the digital library document or object, and links the elements of that structure to content files and metadata that pertain to each element.
vi. The Structural Links section of METS allows METS’ creators to record the existence of hyperlinks between nodes in the hierarchy outlined in the Structural Map. This is of particular value when using METS to archive websites.
vii. The Behaviour section associates executable behaviours with content in the document.
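The seven sections listed above map onto seven child elements of the METS root element. The sketch below builds an empty skeleton to show their names and order. It is not a valid METS document (a real one needs identifiers, at least one structural division and actual content), and the function name and label are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

# Element names for the seven sections, in schema order.
METS_SECTIONS = [
    "metsHdr",      # i.   METS Header
    "dmdSec",       # ii.  Descriptive Metadata
    "amdSec",       # iii. Administrative Metadata
    "fileSec",      # iv.  File section
    "structMap",    # v.   Structural Map
    "structLink",   # vi.  Structural Links
    "behaviorSec",  # vii. Behaviour section
]


def mets_skeleton(label: str) -> ET.Element:
    """Return a <mets:mets> element with one empty child per section."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": label})
    for name in METS_SECTIONS:
        ET.SubElement(root, f"{{{METS_NS}}}{name}")
    return root


skeleton = mets_skeleton("Sample digitized book")
print(ET.tostring(skeleton, encoding="unicode"))
```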
4.7 Metadata Object Description Schema (MODS)
The Metadata Object Description Schema was developed as a descriptive metadata scheme oriented toward digital objects and drawing on the MAchine-Readable Cataloging (MARC 21) format. The scheme is reasonably usable and fairly refined: it provides descriptive metadata for digital objects by regrouping the MARC fields, adding a few new ones, and translating the numeric codes into readable English in XML.
MODS has gone through intense development: version 3.1 was released on 27 July 2005 and version 3.2 on 1 June 2006. In addition, MODS was adopted by the Digital Library Federation (DLF) for its Aquifer Project, which seeks to develop the best possible methods and services for federated access to digital resources. DLF intends to use MODS to replace Dublin Core for descriptive metadata for digital objects in the digital library world, since MODS allows more specification of contents and better clarification of the various elements than Dublin Core does (Library of Congress, 2013).
5. Record Structure
A record structure defines the physical and logical structure of the record that holds the data. A typical bibliographic record may contain multiple fields of variable length, which may occur more than once (repeatable). Except for proprietary structures, there is really only one structure used for bibliographic data of any complexity. These formats facilitate exchange of data between systems and are not intended for human consumption. Most digital library software supports ISO 2709 as the structure for individual records, which is well-suited to handling the MARC format.
5.1 ISO 2709 / Z39.2
ISO 2709 / Z39.2 defines a very flexible structure for individual records (originally for tape storage) and is exceptionally well-suited to handling the MARC format. The two standards were developed together but 2709 can be the structure of almost any type of record. The main strength of 2709 is its ability to handle variable length fields and records where the occurrence of fields is also variable.
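A simplified sketch can show how this structure copes with variable-length and repeatable fields. It follows the layout used by MARC 21 (a 24-character leader, a directory of 12-character entries giving tag, field length and starting position, then the field data), but it leaves out indicators and subfields and assumes plain ASCII text so that characters and bytes coincide. The function names and sample fields are hypothetical.

```python
FIELD_END = "\x1e"   # field terminator
RECORD_END = "\x1d"  # record terminator


def encode_record(fields: list) -> str:
    """Encode (tag, value) pairs as leader + directory + data."""
    directory = ""
    data = ""
    for tag, value in fields:
        body = value + FIELD_END
        # Directory entry: 3-char tag, 4-digit length, 5-digit start offset.
        directory += f"{tag}{len(body):04d}{len(data):05d}"
        data += body
    directory += FIELD_END
    base = 24 + len(directory)       # where the field data begins
    total = base + len(data) + 1     # +1 for the record terminator
    leader = f"{total:05d}nam  22{base:05d}   4500"
    return leader + directory + data + RECORD_END


def decode_record(record: str) -> list:
    """Read the directory and slice each field out of the data area."""
    base = int(record[12:17])
    directory = record[24:base - 1]  # drop the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        value = record[base + start:base + start + length - 1]
        fields.append((tag, value))
    return fields


sample_fields = [
    ("100", "Example, Author"),
    ("245", "Standards and protocols for digital libraries"),
    ("650", "Digital libraries"),
    ("650", "Metadata"),  # the same tag may repeat
]
record = encode_record(sample_fields)
assert decode_record(record) == sample_fields
```

Because every field's length and position are recorded in the directory, neither the number of fields nor their sizes need to be fixed in advance.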
6. Encoding Standards
The encoding deals with the way individual characters are represented in the files and records. It is concerned with text within records almost exclusively. Most digital library software supports ASCII as well as Unicode to support multilingual requirements.
Unicode is a universal character encoding scheme that was originally designed to represent each character in 16 bits. It has the advantages of being simple, complete and widely adopted. The disadvantage of a 16-bit representation is that every character takes twice as much space as in an 8-bit code, even for single-language data. However, disk storage is getting cheaper (text is very small compared to images and video), there are ways of simply and speedily compressing the data in storage, and the variable-length UTF-8 encoding form stores ASCII characters in a single byte. Unicode is controlled by the Unicode Consortium and is the operational equivalent of the ISO/IEC 10646 standard. Note that 10646 also defines a 32-bit form, and that current versions of Unicode include characters beyond the original 16-bit range (Unicode Consortium, 2013).
There is a wide variety of 8-bit character encodings in use around the world, most of them built on the American Standard Code for Information Interchange (ASCII). ASCII itself is a 7-bit code that defines the characters necessary for English along with many special characters. It has been used as the basis for most 8-bit codes: the “lower” 128 positions are left alone (they contain the Latin alphabet, numbers, control codes and some special characters) and the “top” 128 positions are used for a second language. Thus, there is universal compatibility at the “low” 128 level and almost none for the rest. IBM / Microsoft produced a number of “national variants” for the PC DOS operating system, and these have a large measure of acceptance through wide distribution; however, they are only a manufacturer’s standard.
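The storage trade-off between these encodings can be checked directly. The short sketch below compares how many bytes the same string occupies in ASCII, UTF-8 and 16-bit Unicode (UTF-16); the function name and the sample words are chosen only for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Return the size in bytes of `text` in each encoding, or None if it cannot be encoded."""
    sizes = {}
    for codec in ("ascii", "utf-8", "utf-16-le"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None  # character not available in this encoding
    return sizes


english = encoded_sizes("library")
hindi = encoded_sizes("पुस्तकालय")  # "library" in Hindi (Devanagari script)

print(english)
print(hindi)
```

English text doubles in size in the 16-bit form, while the Hindi word cannot be represented in ASCII at all, which is why multilingual collections need Unicode.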
7. Information Retrieval Standards
7.1 Z39.50 or ISO 23950
Z39.50 is an ANSI / NISO standard for information storage and retrieval. It is a protocol which specifies data structures and interchange rules that allow a client machine to search databases on a server machine and retrieve records that are identified as a result of such a search. The Z39.50 protocol is used for searching and retrieving bibliographic records across more than one library system. This protocol is not used by Internet search engines (they use HTTP). It is more complex, more comprehensive and more powerful than searching through HTTP. Z39.50 has been extended to allow system feedback and inter-system dialogue. Like most applications working in a client-server environment, Z39.50 needs a Z39.50 client program on one end and a Z39.50 server program on the other.
The Z39.50 protocol was originally designed to facilitate searching of very large bibliographic databases like OCLC and the Library of Congress. However, the protocol is now used for a wide range of library applications involving multiple database searching, cataloguing, inter-library loan, online item ordering, document delivery and reference services. With the rapid growth of the Internet, the Z39.50 standard has become widely accepted as a solution to the challenge of retrieving multimedia information including text, images and digitized documents (Wikipedia, 2009).
The name Z39 comes from the ANSI committee on libraries, publishing and information services, which was named Z39. NISO standards are numbered sequentially, and Z39.50 is the 50th standard developed by NISO. The current version of Z39.50 was adopted in 1995, superseding earlier versions adopted in 1992 and 1988. Fig. 5 is a pictorial depiction of the Z39.50 model of information retrieval.
Fig.5: Z39.50 Model of Information Retrieval
7.2 Search/Retrieve Web Service (SRW) and Search/Retrieve via URL (SRU)
SRW and SRU are web services-based protocols for querying Internet indexes or databases and retrieving search results. Web services essentially send requests for information from a client to a server. The server reads the input, processes it, and returns the results to the client as an XML stream. Web services come essentially in two flavours: REST (Representational State Transfer) and SOAP (Simple Object Access Protocol). SRW provides a SOAP interface to queries, to augment the URL interface provided by its companion protocol Search / Retrieve via URL (SRU). Queries in SRU and SRW are expressed using the Contextual Query Language (CQL) (Morgan, 2004).
A REST-based web service usually encodes commands from a client to a server in the query string of a URL. Each name / value pair of the query string specifies a set of input parameters for the server. Once received, the server parses these name / value pairs, does some processing using them as input, and returns the results as an XML stream. The shape of the query string as well as the shape of the XML stream are dictated by the protocol. By definition, the communication process between the client and the server is facilitated over an HTTP connection.
SOAP-based web services work in a similar manner, except that the name / value pairs of SOAP requests are encoded in an XML SOAP ‘envelope’. Similarly, SOAP servers return responses using the SOAP XML vocabulary. The biggest difference between REST-based web services and SOAP requests is the transport mechanism. REST-based web services are always transmitted via HTTP, whereas SOAP messages can in principle be carried over other transports as well.
SRW is a SOAP-based web service, while SRU is a REST-based web service. REST-based web services usually encode the input in the shape of URLs, whereas SOAP requests are marked up in a SOAP XML vocabulary.
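Because an SRU request is simply a URL whose query string carries name / value pairs, it can be assembled with standard URL tools. The sketch below builds a searchRetrieve request with a CQL query. The server address is hypothetical, and the choice of version 1.1 and of the `dc.title` index is an assumption; a real server advertises what it supports.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sru_search_url(base_url: str, cql_query: str,
                   max_records: int = 10, version: str = "1.1") -> str:
    """Encode an SRU searchRetrieve request as a URL."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,  # expressed in CQL
        "maximumRecords": str(max_records),
    }
    return f"{base_url}?{urlencode(params)}"


url = sru_search_url(
    "http://sru.example.org/catalogue",  # hypothetical SRU server
    'dc.title = "digital library"',
    max_records=5,
)
print(url)

# The server would parse the same name / value pairs back out of the URL.
received = parse_qs(urlsplit(url).query)
```

An SRW client would send the same parameters wrapped in a SOAP envelope instead of a query string.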
7.3 Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (Open Archives, 2008) defines a protocol that supports the creation of interoperable digital libraries by allowing remote services to access an archive's metadata using an “open” standard. The OAI-PMH supports streaming of metadata from one repository to another and its harvesting by a service provider. A service provider can harvest metadata from various digital repositories distributed across universities and institutions to provide services such as browsing, searching, alerts or annotation. In essence, the OAI-PMH works in a distributed mode with two classes of participants, i.e. Data Providers (domain-specific digital repositories and institutional repositories) and Service Providers or specialized search engines:
• Data Providers: Data providers are OAI-compliant institutional repositories (IRs) or domain-specific digital repositories set up by institutions and universities. IRs use OAI-compliant software that supports the OAI-PMH. The OAI-PMH protocol enables data providers (repositories) to expose structured metadata of publications stored in repositories to the Internet, so that it can be harvested by service providers.
• Service Providers: Service providers, or harvesters, issue OAI-PMH requests to data providers (i.e. OAI-compliant digital repositories) in order to harvest metadata so as to provide value-added services. The metadata stored in the data provider’s database is transferred in bulk to the metadata database of the service provider.
Fig. 6 is a pictorial depiction of OAI-PMH architecture.
Fig.6: The OAI-PMH Architecture
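The exchange between a service provider and a data provider can be sketched as follows: the harvester builds a ListRecords request URL and then reads record identifiers out of the XML response. The repository address, the identifiers and the sample response are invented for illustration, and no network request is made; a real harvester would also handle resumption tokens and error responses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build the URL a harvester would request from a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def harvested_identifiers(response_xml: str) -> list:
    """Extract the record identifiers from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iter(f"{OAI_NS}identifier")]


request_url = list_records_url("http://repository.example.org/oai", from_date="2013-01-01")

# Hand-written sample of a (much shortened) data provider response.
sample_response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:repository.example.org:101</identifier>
      <datestamp>2013-02-10</datestamp>
    </header></record>
    <record><header>
      <identifier>oai:repository.example.org:102</identifier>
      <datestamp>2013-03-05</datestamp>
    </header></record>
  </ListRecords>
</OAI-PMH>"""

identifiers = harvested_identifiers(sample_response)
```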
7.4 Open URL
OpenURL (Wikipedia, 2006) is a versatile linking standard that uses metadata (instead of an object identifier such as a DOI) to generate dynamic links by passing metadata about a resource to a resolver program. It consists of two components, i.e. the URL of an OpenURL resolver followed by a description of the information object consisting of a set of metadata elements (e.g. author, journal issue number, volume, year, etc.).
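The two components can be seen in a small sketch that joins a resolver address to a metadata description written in the key / value style of OpenURL 1.0 (Z39.88-2004) for journal articles. The resolver address and the article details are invented for illustration, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_openurl(resolver_url: str, **metadata) -> str:
    """Append a journal-article description to the address of a link resolver."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),  # journal metadata format
    ]
    # Each metadata element about the referenced item gets the "rft." prefix.
    params += [(f"rft.{key}", str(value)) for key, value in metadata.items()]
    return f"{resolver_url}?{urlencode(params)}"


link = build_openurl(
    "http://resolver.example.edu/openurl",  # hypothetical library resolver
    genre="article",
    jtitle="Journal of Example Studies",
    aulast="Author",
    volume=12,
    issue=3,
    spage=45,
    date=2010,
)
print(link)

# The resolver parses the description back out of the link.
description = parse_qs(urlsplit(link).query)
```

Because the link carries a description rather than a fixed address, each library's resolver can decide which copy or service is appropriate for its own users.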
For OpenURL to work, a library is required to set up a resolution server with information on the full-text journals accessible to the library, with their links, as well as on how to link to local print holdings and other local services. The information provider (or publisher) must also be OpenURL-enabled to redirect the linking request to the local resolution server. A “link resolver” or “link server” parses the elements of an OpenURL and provides links to appropriate services as identified by a library. An OpenURL link allows access to multiple information services from multiple resources, including full-text repositories, abstracting, indexing and citation databases, online library catalogues, document delivery services and other web resources and services.
When a user clicks on an OpenURL link, the user is directed to the OpenURL resolver. The resolver, based on the services subscribed to by the library, returns an HTML page of links to the sites from which the user can access the resource (full text from the publisher’s site, document delivery services, aggregators, etc.). The user selects an appropriate service and clicks on the link, which leads to the site of the service provider. Fig. 7 is a pictorial depiction of the functioning of OpenURL.
Fig.7: How OpenURL Works?
OpenURL was developed by Herbert van de Sompel, a librarian at the University of Ghent. His link-server software, SFX, was purchased by the library automation company Ex Libris, which popularized OpenURL in the information industry. Many other companies now market link server systems, including Openly Informatics (1Cate, acquired by OCLC in 2006), Endeavor Information Systems, Inc. (Discovery: Resolver), SerialsSolutions (ArticleLinker), Innovative Interfaces, Inc. (WebBridge), EBSCO (LinkSource), Ovid (LinkSolver), SirsiDynix (Resolver), Fretwell-Downing (OL2), TDNet (TOUR), Bowker (Ulrichs Resource Linker) and KINS (K-Link).
The National Information Standards Organization (NISO) has standardized OpenURL and its data container (the ContextObject) as ANSI/NISO standard Z39.88.
8. Formats and Media Types in Digital Library
A defined arrangement for discrete sets of data that allows a computer and software to interpret the data is called a file format. Different file formats are used to store different media types like text, images, graphics, pictures, musical works, computer programs, databases, models and designs, video programs and compound works combining many types of information. Although almost every type of information can be represented in digital form, only a few important file formats for text and images, typically applicable to a library-based digital library, are described here. Every object in a digital library needs to have a name or identifier which distinctly identifies its type and format. This is achieved by assigning file extensions to the digital objects. The file extensions in a digital library typically denote formats, protocols and rights management that are appropriate for the type of material.
The information contents of a digital library, depending on the media types it contains, may include a combination of structured / unstructured text, numerical data, scanned images, graphics, and audio and video recordings.
8.1 Formats and Encoding used for Text
Text-based contents of a digital library can be stored and presented as i) simple text or ASCII (American Standard Code for Information Interchange); ii) unstructured text; and iii) structured text (SGML, HTML or XML).
8.1.1. Simple Text or ASCII
Simple text or ASCII (American Standard Code for Information Interchange) is the most commonly used encoding scheme for exchanging data from one software package to another or from one platform to another. The “full text” of articles from many journals has been available electronically in this format through online vendors like DIALOG and STN for over two decades. Typically, what is stored is the text of each article, broken into paragraphs, along with bibliographic information held as simple tagged information.
Simple text or ASCII is compact, economical to capture and store, searchable, interoperable and malleable with other text-based services. On the other hand, simple text or ASCII cannot be used for displaying complex tables, mathematical formulas, photographs, diagrams, graphics or special characters. The ASCII format does not store text formatting information, i.e. italics, bold, font type, font size or paragraph justification. For these reasons, simple text or ASCII is in many ways inadequate to represent journal articles. Although it is extremely useful for searching and selection, its inability to capture the richness of the original makes it an interim step towards structured text formats.
8.1.2. Structured Text Format
Structured text attempts to capture the essence of documents by “marking up” the text so that the original form can be recreated, or other forms such as ASCII produced. Structured text formats have provision to embed images, graphics and other multimedia formats in the text. SGML (Standard Generalized Markup Language) is one of the most important and popular structured text formats. ODA (Office Document Architecture) is a similar and competing standard. SGML is an international standard (ISO, 1986) around which several related standards are built. SGML is a flexible language that gave birth to HTML (Hypertext Markup Language), the de facto markup language of the World Wide Web, which controls the display format of documents and even the appearance of the user interface for interacting with the documents. Extensible Markup Language (XML) is derived from SGML to interchange structured documents on the web. Like SGML, XML deals with the structure of a document and not its formatting. The Cascading Style Sheets (CSS) developed for HTML also function for XML to take care of formatting and appearance. Unlike HTML, XML allows for the invention of new codes. Unlike SGML, XML always requires explicit end tags, which makes it a lot easier to write tools and browsers.
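Both properties of XML mentioned here, freely invented tags and mandatory end tags, can be demonstrated with a standard XML parser. In the sketch below the tag names `article`, `heading` and `para` are made up for the example, and the helper function is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Invented tag names are fine as long as every start tag has an end tag.
complete = "<article><heading>Standards</heading><para>Open protocols.</para></article>"

# SGML-style omission of end tags (the <para> elements are never closed).
missing_end_tags = "<article><para>First paragraph<para>Second paragraph</article>"

print(is_well_formed(complete))          # True
print(is_well_formed(missing_end_tags))  # False
```

Because a parser can reject incomplete markup outright, XML tools do not need the document-specific rules that SGML tag omission requires.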
Like simple text or ASCII, structured text can be searched or manipulated. It is highly flexible and suitable both for electronic and paper production. Well-formatted text increase visual presentation of textual, graphical and pictorial value of information. Structured formats can easily display complex tables and equations. Moreover, the structured text is compact in comparison to the image-based formats, even after including embedded graphics and pictures.
Besides SGML and HTML, there are other formats used in digital library implementation. TeX, used for formatting highly mathematical text is one such format which allows greater control over the resulting display of document, including reviewing the formatting of errors.
8.1.3. Page Description Language (PDL)
Page Description Languages (PDLs), such as Adobe’s PostScript and PDF (Portable Document Format), are similar to images, but the formatted pages displayed to the user are text-based rather than image-based. PostScript and PDF files can easily be captured during the typesetting process. PostScript is especially easy to capture since most systems generate it automatically, and a conversion program called Acrobat Distiller can be used to convert PostScript files into PDF files. Documents stored as PDF require Acrobat Reader at the user’s end to read or print the document. Acrobat Reader can be downloaded free of cost from Adobe’s web site.
Acrobat’s Portable Document Format (PDF) is a by-product of PostScript, Adobe’s page-description language that has become the standard way to describe pages electronically in the graphics world. While PostScript is a programming language, PDF is a page-description format.
8.2 Page Image Format
Digitally scanned images are stored in a file as a bit-mapped page image, irrespective of whether the scanned page contains a photograph, a line drawing or text. The bit-mapped page image can be created in dozens of different formats depending upon the scanner and its software. National and international standards for image-file formats and compression methods exist to ensure that data will be interchangeable amongst systems. An image file stores discrete sets of data and information allowing a computing system to display, interpret and print the image in a pre-defined fashion. An image file format consists of three distinct components: a header, which stores the file identifier and image specifications; the image data, consisting of a look-up table and the image raster; and a footer, which signals file termination. While the bit-mapped portion of a raster image is standardized, it is the file header that differentiates one format from another.
TIFF (Tagged Image File Format) is the most commonly used page image file format and is considered to be the de facto standard for bitonal images. Some image formats are proprietary, developed by a commercial vendor, and require specific software or hardware for display and printing. Images can be coloured, grey-scale or black and white (called bitonal). They can be uncompressed (raw) or compressed using several different compression algorithms.
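The role of the header as a file identifier can be sketched in a few lines of Python. The byte signatures below are the widely documented leading bytes of a handful of common formats; the table and the function name are illustrative assumptions, not a complete or authoritative registry.

```python
# Leading "magic" bytes found at the start of some common image files.
# This short table is illustrative only.
SIGNATURES = {
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
    b"\xff\xd8\xff": "JPEG",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
}


def identify_image_format(data: bytes) -> str:
    """Guess the image format from the first bytes of the file header."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


# Usage with in-memory bytes standing in for the start of real files.
print(identify_image_format(b"II*\x00\x08\x00\x00\x00"))
print(identify_image_format(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(identify_image_format(b"plain text, not an image"))
```

In practice the same check would be applied to the first few bytes read from a file on disk.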
Image files are much larger than text files, so compression is necessary for their economical storage. A compression algorithm reduces a redundant string, such as one or more rows of white bits, to a single code. The standard compression scheme for black and white (bitonal) images is the one developed by the International Telecommunication Union (ITU), formerly the Consultative Committee for International Telephony & Telegraphy (CCITT), for Group 4 fax images, commonly referred to as CCITT Group 4 (G-4) or ITU-G-4. An image created as a TIFF and compressed using CCITT Group 4 is called a TIFF G4, which is the de facto standard for storing bitonal images.
Some of the formats mentioned above are maintained and developed by international organizations such as the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU).
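The idea of reducing a redundant run of bits to a single code can be shown with a simple run-length coder. This Python sketch is a deliberate simplification for illustration: the actual CCITT Group 4 scheme is considerably more elaborate, and the function names and the sample scan line are hypothetical.

```python
from itertools import groupby


def run_length_encode(row: str) -> list:
    """Collapse each run of identical pixels into a (pixel, length) pair."""
    return [(pixel, len(list(run))) for pixel, run in groupby(row)]


def run_length_decode(runs: list) -> str:
    """Expand (pixel, length) pairs back into the original row."""
    return "".join(pixel * count for pixel, count in runs)


# One scan line of a bitonal page: '0' is a white pixel, '1' a black pixel.
scan_line = "0" * 20 + "1" * 3 + "0" * 9

encoded = run_length_encode(scan_line)
print(encoded)  # [('0', 20), ('1', 3), ('0', 9)]

# The encoding is lossless: decoding restores the scan line exactly.
assert run_length_decode(encoded) == scan_line
```

Thirty-two pixels are described by three pairs, which is why pages dominated by white space compress so well.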
9. Preservation Standards
Digital preservation metadata is a subset of metadata that describes the attributes of digital resources essential for their long-term accessibility. Preservation metadata provides structured ways to describe and record the information needed to manage the preservation of digital resources. In contrast to descriptive metadata schemas (e.g. MARC, Dublin Core), which are used in the discovery and identification of digital objects, preservation metadata is sometimes considered a subset of administrative metadata, designed to assist in the management of technical metadata that supports continuing access to the digital content. Preservation metadata is intended to store technical details on the format, structure and use of the digital content; the history of all actions performed on the resource, including changes and decisions; authenticity information such as technical features or custody history; and the responsibilities and rights information applicable to preservation actions. The scope and depth of the preservation metadata required for a given digital preservation activity will vary according to numerous factors, such as the “intensity” of preservation, the length of archival retention, or even the knowledge base of the intended user community.
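The kinds of information listed above (technical details, action history, authenticity and rights) can be pictured as a simple record attached to each digital object. The following Python sketch is only an illustration under stated assumptions: all field and function names are hypothetical and are not taken from PREMIS or any other schema, and the SHA-256 checksum stands in for one possible authenticity feature.

```python
import hashlib
from datetime import datetime, timezone


def build_preservation_record(name: str, content: bytes, format_name: str) -> dict:
    """Assemble an illustrative preservation record for one digital object.

    Field names are hypothetical and do not follow any published schema.
    """
    return {
        "object_name": name,
        "format": format_name,            # technical detail
        "size_bytes": len(content),       # technical detail
        "fixity": {                       # authenticity information
            "algorithm": "SHA-256",
            "digest": hashlib.sha256(content).hexdigest(),
        },
        "events": [],                     # history of actions on the object
        "rights": "unspecified",          # rights information
    }


def record_event(record: dict, event_type: str, outcome: str) -> None:
    """Append a timestamped action to the object's history."""
    record["events"].append({
        "type": event_type,
        "outcome": outcome,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })


def verify_fixity(record: dict, content: bytes) -> bool:
    """Check that the content still matches the stored checksum."""
    return hashlib.sha256(content).hexdigest() == record["fixity"]["digest"]


page = b"scanned page bytes"
record = build_preservation_record("page-001.tif", page, "TIFF")
record_event(record, "ingest", "success")
print(verify_fixity(record, page))         # unchanged content
print(verify_fixity(record, page + b"x"))  # altered content
```

A real repository would encode such information according to an agreed schema so that records can be exchanged between systems.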
9.1 PREMIS (PREservation Metadata: Implementation Strategies)
The OAIS Framework prompted interest in moving it toward a more implementable status. To achieve this objective, OCLC and RLG sponsored a second working group called PREMIS (PREservation Metadata: Implementation Strategies). Composed of more than thirty international experts in preservation metadata, PREMIS sought to: i) define a core set of implementable, broadly applicable preservation metadata elements, supported by a data dictionary; and ii) identify and evaluate alternative strategies for encoding, storing, managing, and exchanging preservation metadata in digital archiving systems. In September 2004, PREMIS released a survey report describing current practice and emerging trends associated with the management and use of preservation metadata to support repository functions and policies. The final report of the PREMIS Working Group was released in May 2005. The PREMIS Data Dictionary is a comprehensive, practical resource for implementing preservation metadata in digital archiving systems. It defines implementable, core preservation metadata, along with guidelines and recommendations for management and use. PREMIS also developed a set of XML schemas to support use of the Data Dictionary by institutions managing and exchanging PREMIS-conformant preservation metadata.
9.2 Open Archival Information System (OAIS)
OAIS describes all the functions of a digital repository: how digital objects can be prepared, submitted to an archive, stored for long periods, maintained, and retrieved as needed (Library of Congress, 2013). It does not address specific technologies, archiving techniques, or types of content. RLG built on the OAIS model in its digital preservation projects such as PREMIS and Digital Repository Certification. The OAIS Reference Model was developed by the Consultative Committee for Space Data Systems (CCSDS) as a conceptual framework describing the environment, functional components and information objects associated with a system responsible for the long-term preservation of digital materials. Metadata in the OAIS model plays an essential role in preserving digital content and supporting its use over the long term. The OAIS information model implicitly establishes the link between metadata and digital preservation, i.e. preservation metadata. The OAIS reference model provides a high-level overview of the types of information needed to support digital preservation, which can broadly be grouped under two major umbrella terms: i) Preservation Description Information (PDI); and ii) Representation and Descriptive Information.
The chapter introduces standards and protocols and their role and importance in building digital libraries. It elaborates upon important digital library standards and protocols used for digital communication (i.e. TCP/IP, HTTP and FTP), bibliographic standards (i.e. MARC, Dublin Core, Bib-1, TEI, EAD, METS and MODS), record structure (i.e. ISO 2709 / Z39.2), encoding standards (Unicode and ASCII), information retrieval standards (i.e. Z39.50, SRW/SRU, SOAP, REST, OAI-PMH and OpenURL), and formats and media types used in digital libraries and digital preservation, including unstructured text formats (ASCII), structured text formats (i.e. SGML, XML, ODA, HTML and TeX), Page Description Languages (PDLs) (i.e. Adobe’s PostScript and PDF), page image formats (i.e. TIFF, Image PDF, JPEG, etc.) and preservation standards such as PREMIS and OAIS.
References (All Internet URLs were checked on 18th Dec., 2013)
1. Ascher Interactive (2007). What is the difference between a protocol and a standard? (http://ascherconsulting.com/what/is/the/difference/between/a/protocol/and/a/standard/).
2. Baker, T. (1998) Languages for Dublin Core. D-Lib Magazine, 4 (12). Available at: http://www.dlib.org/dlib/december98/12baker.html
3. Day, M. (2001). Metadata in a Nutshell. (http://www.ukoln.ac.uk/metadata/publications/nutshell).
4. Furrie, B. (2003) Understanding MARC Bibliographic Machine-Readable Cataloguing. 7th ed. McHenry, ILL: Follett Software
5. Library of Congress (2007). Bib-1 Attribute Set. (http://www.loc.gov/z3950/agency/defns/bib1.html).
6. Library of Congress (2007). Encoded Archival Description (EAD) Version 2002 Official Site. (http://www.loc.gov/ead/).
7. Library of Congress (2013). Metadata Encoding and Transmission Standard (METS). (http://www.loc.gov/standards/mets/).
8. Library of Congress (2013). Metadata Object Description Schema Official Web Site. (http://www.loc.gov/standards/mods/).
9. Library of Congress (2013). Preservation Metadata Maintenance Activity Version 2.0. (http://www.loc.gov/standards/premis/).
10. Moore, Brian (2001). An introduction to the Simple Object Access Protocol (SOAP). (http://www.techrepublic.com/article/an-introduction-to-the-simple-object-access-protocol-soap/).
11. Morgan, Eric Lease (2004). An Introduction to the Search/Retrieve URL Service (SRU). Ariadne, (40). (http://www.ariadne.ac.uk/issue40/morgan/).
12. Open Archives (2008). The Open Archives Initiative Protocol for Metadata Harvesting. (http://www.openarchives.org/OAI/openarchivesprotocol.html).
13. SearchSOA (2005). Representational State Transfer. (http://searchsoa.techtarget.com/definition/REST).
14. TEI: Text Encoding Initiative (2013). (http://www.tei-c.org/index.xml).
15. Unicode Consortium (2013). (http://www.unicode.org/).
16. Wikipedia (2009). Z39.50. (http://en.wikipedia.org/wiki/Z39.50).
17. Wikipedia (2013). OpenURL. (http://en.wikipedia.org/wiki/OpenURL).
18. Wikipedia (2013). Open Archival Information System. (http://en.wikipedia.org/wiki/Open_Archival_Information_System).