Network Working Group                                          W. Turner
Request for Comments: 1691                                           LTD
Category: Informational                                      August 1994


        The Document Architecture for the Cornell Digital Library

Status of this Memo

   This memo provides information for the Internet community.  This memo
   does not specify an Internet standard of any kind.  Distribution of
   this memo is unlimited.

Abstract

   This memo defines an architecture for the storage and retrieval of
   the digital representations for books, journals, photographic images,
   etc., which are collected in a large organized digital library.

   Two unique features of this architecture are the ability to generate
   reference documents and the ability to create multiple views of a
   document.

Introduction

   In 1989, Cornell University and Xerox Corporation, with support from
   the Commission on Preservation and Access and later Sun Microsystems,
   embarked on a collaborative project to study and to prototype the
   application of digital technologies for the preservation of library
   material.  During this project, Xerox developed the College Library
   Access and Storage System (CLASS), and Cornell developed software to
   provide network access to the CLASS Digital Library.

   Xerox and Cornell University Library staff worked closely together to
   define requirements for storing both low- and high-resolution
   versions of images, so that the low-resolution images could be used
   for browsing over the network and the high-resolution images could be
   used for printing.  In addition, substantial work was done to define
   documents with internal structures that could be navigated.  Xerox
   developed the software to create and store documents, while Cornell
   developed complementary software to allow library users to browse the
   documents and request printed copies over the network.

   Cornell has defined a document architecture which builds on the
   lessons learned in the CLASS project, and is maintaining digital
   library materials in that form.

Turner                                                          [Page 1]

RFC 1691               CDL Document Architecture             August 1994

Document Architecture Overview

   Just as a conventional library contains books rather than pages, so
   the electronic library must contain documents rather than images.
   During the scanning process, images are automatically linked into
   documents by creating document structure files which order the image
   files in the same way the binding of a book orders the pages.  Thus,
   the digital book as currently configured consists of two parts: a set
   of individual pages stored as discrete bit map image files, and the
   document structure files which "bind" the image files into a
   document.  In addition, a database entry is made for each digital
   document which permits searching by author and title (i.e.,
   bibliographic information).  Beyond the order of the pages, the
   arrangement of a physical book provides information to readers.  The
   title page and publication information come first; the table of
   contents usually precedes the text; the text is divided into sections
   or chapters; if there is an index, it follows the text.  The reader
   often refers to these components of a book when browsing the library
   shelves, in order to determine whether to read the book.

   The document structure provides direct access to the components of an
   electronic document, storing the information that would otherwise be
   lost when the book is disbound for scanning.

Document Architecture Requirements

   Listed below are the requirements that were initially set down for
   the Cornell Digital Library Architecture.

   1. The architecture must be open (i.e., published and freely
      available).

   2. The architecture should be as simple as possible (to facilitate
      product development).

   3. The architecture should assume data storage in UNIX file systems.

   4. The architecture should allow for standard data usage, such as via
      FTP and Gopher servers (i.e., pages of a document must exist in a
      single directory, and the naming convention used must order them
      in the standard collating sequence, such as the series "0001.TIF,
      0002.TIF,..., 0411.TIF").  NOTE: a series such as "1.TIF,
      2.TIF,..., 10.TIF" would be ordered "1.TIF, 10.TIF, 2.TIF, ...",
      which is not acceptable.

   5. The architecture should provide for storing the same information
      in different formats.  For example, a page of a document may be
      available at several different resolutions.

   6. Low-resolution "thumbnail" images of each page must be stored to
      facilitate browsing and sharing of data.

   7. The architecture must support distribution of files so that
      similar files may be stored together, permitting optimization of
      storage use and performance.

   8. The architecture must support documents that are composed of
      references to all or part of other documents.

   9. The architecture must support document components which are
      stored on separate servers distributed across the network.

   10. The architecture must support not only a hierarchical structure
       for each document, but the ability to define multiple views of
       each document.

   11. The architecture should accept, rather than dictate, directory
       structures in which documents will be stored.  This will permit
       documents created in other ways to be added to the Digital
       Library simply by adding database information rather than by
       copying or moving files.

Document Architecture Description

   A digital library consists of a Digital Library Server, networked
   storage, and a referencing database.  A single digital library will
   contain one or more collections.  Each collection will contain one or
   more documents.

   The referencing database allows searching for documents by author,
   title, and document ID.  In the current implementation, the
   referencing database is a relational SQL database, and each
   collection is represented by a table in the database.  It is planned
   to migrate to Z39.50 database searching as the preferred method, as
   this protocol has been established as the standard for library
   applications.

   Authorization will be primarily collection-based, although the design
   will permit authorization checking at any level down to the
   individual file.  Notification would come only when the patron
   attempted to open the document or access the particular component.

   Each document consists of three components: the logical structure;
   the physical references; and the data files.

   The logical structure is a logical description of the document.
   Conceptually, a document is a tree, with the leaves being the data
   files (pages).  At a minimum, all documents have a logical structure
   which lists the pages in the document and the order in which they
   appear.  Usually, documents will have a more elaborate structure.
   The Logical Structure file relates the logical structure of a
   document to the physical references which make up the document.

   These physical references map the lowest levels of the document's
   logical structure (the leaves of the tree) to the files that contain
   the data.  Where there are multiple representations of a page, such
   as images at various resolutions, these are linked together in the
   physical references file.

   The data files contain the data making up a document.  Any format can
   be accommodated: image files, ASCII text, PostScript, etc.  However,
   one-to-one correspondence between data files for a given physical
   reference is assumed.  That is, if there are multiple file types for
   a single page, these files should represent exactly the same
   information.

Physical References File

   The Physical References file is the component of the document which
   relates logical structures (logical components of documents) to
   physical files.  Document references, by which a document can be
   composed of all or part of other documents possibly residing on
   different servers, are handled in the Physical References file.

   A document may contain multiple document objects, each of which
   contains one or more data objects.  When a document contains actual
   physical data (for example, it is created by scanning or importing
   images), a Master Document Object is created.  When a document
   incorporates components of other documents, a Reference Document
   Object is created for each of the other documents.  The Document
   Objects are numbered with internal reference numbers, which are
   included in the corresponding Data Object lines.

   Data Object lines include the Document Object number, the file
   reference number, and the file type.  The Document Object number
   refers to a Document Object line, from which the library name,
   collection name, and document ID can be retrieved.  The tuple
   <libraryID>+<collectionID>+<documentID>+<filetype>+<file reference>
   is guaranteed to locate a file.  Each Data Object line refers to a
   single file; where multiple file types of a single document page
   exist, there will be multiple Data Object lines for that page.

   In the file, all Document Object lines will precede all Data Object
   lines for a given document.  Document Object lines may be either
   grouped together at the beginning of the file, or may immediately
   precede the first Data Object line for the Document Object.  Document
   Object lines will appear in order by Document Object number.  Data
   Object lines will appear in order by sequence number, NOT by Document
   Object number.

   The fields in the Physical References file are delimited by vertical
   bars.

Document Object Lines

   Field   Description                  Comments
   -----   ----------------------       ----------------------------
     1     Document Object number       0 => Master Document Object
                                        1-9 => Reference Document Object
     2     Library name                 Server name
     3     Collection name
     4     Document ID                  8-digit number
     5     Author name
     6     Volume
     7     Title
     8     Edition

Data Object Lines

   Field   Description                  Comments
   -----   ----------------------       ----------------------------
     1     Document Object number       Corresponds to above
     2     Sequence number
     3     File reference               Reference number used to locate
                                        file in filing system
     4     Physical reference number    Equal to the structure number in
                                        the Logical Structure file
     5     File type                    1 = TIFF 600dpi
                                        2 = TIFF thumbnail
                                        3 = ASCII version of page
                                            (i.e., OCR output)
                                        4 = ASCII notes
                                        5 = Other
                                        6 = TIFF 300dpi
     6     Note
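
   The two field layouts above can be sketched as a small parser.  This
   is only an illustration, not part of the architecture.  It assumes,
   following the example in this memo, that a Document Object line
   begins with "+", that a Data Object line begins with "|", and that
   each line ends with a trailing delimiter.  All class and function
   names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DocumentObject:
    number: int          # 0 = Master, 1-9 = Reference Document Object
    library: str
    collection: str
    document_id: str
    author: str
    volume: str
    title: str
    edition: str


@dataclass
class DataObject:
    document_object: int     # refers to DocumentObject.number
    sequence: int
    file_reference: str      # kept as text to preserve leading zeros
    physical_reference: int
    file_type: int           # 1 = TIFF 600dpi, 2 = TIFF thumbnail, ...
    note: str


def parse_physical_references(text):
    """Return ({number: DocumentObject}, [DataObject, ...])."""
    documents, data = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("+"):          # assumed Document Object marker
            f = line[1:].split("|")
            f += [""] * (8 - len(f))      # pad missing optional fields
            documents[int(f[0])] = DocumentObject(int(f[0]), *f[1:8])
        elif line.startswith("|"):        # assumed Data Object marker
            f = line[1:].split("|")
            f += [""] * (6 - len(f))
            data.append(DataObject(int(f[0]), int(f[1]), f[2],
                                   int(f[3]), int(f[4]), f[5]))
        # any other line (blank, annotation) is ignored
    return documents, data


def versions_by_page(data):
    """Group Data Objects by physical reference, in sequence order."""
    pages = {}
    for obj in sorted(data, key=lambda d: d.sequence):
        pages.setdefault(obj.physical_reference, []).append(obj)
    return pages
```

   Grouping by physical reference number collects every representation
   of one page (for example, a 600dpi image and its thumbnail) under a
   single key.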

Physical References File Example

+0|CORNELL|OLINLIB|00000001|Boole, Mary Everest||Philosophy Of Algebra||
|0|1|00000002|5|1||
   (File ref. #2 = Phys. ref. #5 = 600dpi TIFF image)
|0|2|00000003|5|2||
   (File ref. #3 = Phys. ref. #5 = 100dpi TIFF image)
|0|3|00000004|6|1||
   (File ref. #4 = Phys. ref. #6 = 600dpi TIFF image)
|0|4|00000005|6|2||
   (File ref. #5 = Phys. ref. #6 = 100dpi TIFF image)

   Note that in the above, it is guaranteed that file references 2 and 3
   are two different versions of the same page, as are file references 4
   and 5.

Logical Structure File

   The Logical Structure file is the component of the document structure
   which offers "views" of a document and links images together
   logically to define documents.  The file is actually an unloaded
   tree; when a document is "opened", the file is read and the tree
   reconstructed.  By convention, all Logical Structure files contain
   one logical structure "PAGES" which defines the document by listing
   the pages in the order in which they appeared in the original
   document.

Document Structure lines

   Field   Description                  Comments
   -----   ----------------------       ----------------------------
     1     Parent structure number      Structure is a child of...
     2     Sequence number
     3     Logical Structure name       Label for this structure
     4     Structure number             Equal to Physical Reference file
     5     Logical Children             # of logical children of this
                                          structure
     6     Physical Children            # of physical children of this
                                          structure
     7     References                   # of references to this
                                          structure within this document
                                        (for how many structures is this
                                         a substructure)

Logical Structure File Example

|0|0|ROOT|0|4|0|0|            Structure 0, ROOT, has 4 logical children
|0|1|PAGES|1|100|0|1|         Str. 1, PAGES, has 100 logical children
|0|2|CONTENTS|2|22|0|1|       Str. 2, CONTENTS, has 22 logical children
                              ...has no physical children
 ...
|1|1|Production note|5|0|2|2| Str. 5 is child of structure 1
                              ...has a label "Production note"
                              ...has no logical children
                              ...has 2 physical references
                              ...is referenced twice in this document
|1|2||6|0|2|1|                Str. 6 has no label
|1|3||7|0|2|1|                Str. 7 has 2 physical references
|1|4||8|0|2|1|                Str. 8 is referenced only here
|1|5||9|0|2|1|                Str. 9 is 5th sequential child of PAGES
 ...
|1|99||103|0|2|2|
|1|100||104|0|2|2|
|2|1|Production note|105|1|0|1|          Str. 105 is a child of str. 2
|2|2|Title page|106|1|0|1|               Str. 106 has 1 logical child
|2|3|Table of contents|107|2|0|1|
|2|4|Chapter 1. From Arithmetic to Algebra|108|6|0|1|
|2|5|Chapter 2. The Making of Algebras|109|4|0|1|
|2|6|Chapter 3. Simultaneous Problems|110|4|0|1|
|2|7|Chapter 4. Partial Solutions...|111|3|0|1|
|2|8|Chapter 5. Mathematical Certainty...|112|3|0|1|
|2|9|Chapter 6. The First Hebrew Algebra|113|8|0|1|
|2|10|Chapter 7. How to Choose our Hypotheses|114|9|0|1|
|2|11|Chapter 8. The Limits of the Teachers Function|115|5|0|1|
|2|12|Chapter 9. The Use of Sewing Cards|116|4|0|1|
 ...
|2|20|Chapter 17. From Bondage to Freedom|124|5|0|1|
|2|21|Appendix|125|2|1|1|
|2|22|advertisements|126|4|1|2|
|105|1|Production note|5|0|2|2|          Str. 5 is a child of str. 105
|106|1|Title page|11|0|2|2|              2nd reference to str. 11
|107|1|7|15|0|2|2|
|107|2|8|16|0|2|2|
 ...
|126|4||104|0|2|2|
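
   Reconstructing the tree from the unloaded form shown above can be
   sketched as follows.  This is an illustration under stated
   assumptions, not the memo's implementation: the ROOT line is taken to
   be the one whose parent number equals its own structure number, a
   structure with no logical children is treated as a page whose
   structure number is its physical reference number, and text after the
   seventh field is treated as annotation.  All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StructureLine:
    parent: int
    sequence: int
    name: str
    number: int
    logical_children: int
    physical_children: int
    references: int


def parse_logical_structure(text):
    """Parse Document Structure lines; skip anything not starting with '|'."""
    lines = []
    for raw in text.splitlines():
        raw = raw.strip()
        if not raw.startswith("|"):
            continue
        f = raw[1:].split("|")
        if len(f) < 7:
            continue
        lines.append(StructureLine(int(f[0]), int(f[1]), f[2], int(f[3]),
                                   int(f[4]), int(f[5]), int(f[6])))
    return lines


def build_children(lines):
    """Map each parent structure number to its child lines, in sequence order."""
    children = {}
    for ln in lines:
        if ln.parent == ln.number:      # assumed: the ROOT line points at itself
            continue
        children.setdefault(ln.parent, []).append(ln)
    for kids in children.values():
        kids.sort(key=lambda ln: ln.sequence)
    return children


def leaf_pages(children, start):
    """Depth-first list of page (physical reference) numbers under a structure."""
    pages = []
    for ln in children.get(start, []):
        if ln.logical_children == 0:    # assumed: a leaf is a page
            pages.append(ln.number)
        else:
            pages.extend(leaf_pages(children, ln.number))
    return pages
```

   Because a structure may be referenced from more than one parent, the
   same page number can appear under both PAGES and CONTENTS; each view
   is simply a different walk over the same leaves.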

Implementation Details

   The tuple <library ID>+<collection ID>+<document ID>+<filetype>+
   <file reference> is guaranteed to locate a file.  A file locator
   program will translate between this tuple and the fully-qualified
   path and file name in the underlying file system.  While a library
   will always have a hierarchical nature corresponding to UNIX file
   systems, the order of the hierarchy will be flexible to accommodate
   optimization efforts.  Each level of the hierarchy will have an INFO
   file that describes the order of the lower levels of the hierarchy.
   The file locator program will read these files as it navigates the
   directory structure of the file system when a library, collection, or
   document is opened.  Two examples follow:

   Example 1.  Hierarchy is LIBRARY, COLLECTION, DOCUMENT, FILETYPE.

  /<library name>
          LIBINFO.TXT                      Description of library
          /<collection name>
                 COLINFO.TXT               Description of collection
                 /<document ID>
                       DOCINFO.TXT         Description of document
                       LOGSTR.000          Logical structure file
                       PHYSREF.000         Physical reference file
                       /<filetype1>
                               00001.TIF
                               00002.TIF
                               ...
                       /<filetype2>
                               00001.TIF
                               00002.TIF
                               ...

   Example 2.  Hierarchy is LIBRARY, FILETYPE, COLLECTION, DOCUMENT.

  /<library name>
          LIBINFO.TXT                         Description of library
          /<filetype1>
                  /<collection name>
                         COLINFO.TXT          Description of collection
                         /<document ID>
                               DOCINFO.TXT    Description of document
                               LOGSTR.000     Logical structure file
                               PHYSREF.000    Physical reference file
                               00001.TIF
                               00002.TIF
                               ...
          /<filetype2>
                  /<collection name>
                         COLINFO.TXT          Description of collection
                         /<document ID>
                               DOCINFO.TXT    Description of document
                               LOGSTR.000     Logical structure file
                               PHYSREF.000    Physical reference file
                               00001.TIF
                               00002.TIF
                               ...

   This implementation involves some redundancy, but it permits complete
   copies of a collection to be mounted on different file systems for
   performance considerations.  In particular, the second scheme would
   facilitate storing all low-resolution images on high-speed magnetic
   disk for fast access, and all high-resolution images on slower, less
   expensive storage.  This will also facilitate authorizing access to
   low-resolution images by other software systems (FTP, Gopher) while
   restricting access to high-resolution images.
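
   The translation from the locating tuple to a path, for either
   hierarchy order shown above, can be sketched as follows.  This is a
   hedged illustration only.  The memo does not specify the INFO file
   format, so the hierarchy order is passed in directly instead of being
   read from LIBINFO.TXT and its counterparts.  The five-digit file name
   and the "TIF" extension follow the directory listings in the
   examples; deriving the name from the file reference, and every
   function and parameter name, are assumptions.

```python
from pathlib import PurePosixPath

LEVELS = ("LIBRARY", "COLLECTION", "DOCUMENT", "FILETYPE")


def locate_file(root, hierarchy, library, collection, document_id,
                filetype, file_reference, extension="TIF"):
    """Build a path for the locating tuple under a given hierarchy order.

    hierarchy lists the four level names in directory order, e.g.
    ["LIBRARY", "FILETYPE", "COLLECTION", "DOCUMENT"].
    """
    parts = {
        "LIBRARY": library,
        "COLLECTION": collection,
        "DOCUMENT": document_id,
        "FILETYPE": filetype,     # directory name for the type (assumed)
    }
    if sorted(hierarchy) != sorted(LEVELS):
        raise ValueError("hierarchy must name each level exactly once")

    path = PurePosixPath(root)
    for level in hierarchy:
        path = path / parts[level]

    # Zero-padded names keep files in the standard collating sequence.
    name = f"{int(file_reference):05d}.{extension}"
    return path / name


example_1 = locate_file(
    "/", ["LIBRARY", "COLLECTION", "DOCUMENT", "FILETYPE"],
    "CORNELL", "OLINLIB", "00000001", "1", "00000002")
example_2 = locate_file(
    "/", ["LIBRARY", "FILETYPE", "COLLECTION", "DOCUMENT"],
    "CORNELL", "OLINLIB", "00000001", "1", "00000002")
```

   Only the order of the directory components changes between the two
   calls; the tuple itself is the same, which is what lets the hierarchy
   be rearranged for storage optimization without changing document
   references.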

Security Considerations

   Security issues are not discussed in this memo.

References

   [1] Turner, W., "Cornell Digital Library Document Architecture,
       Version 1.1 - 3/22/94", Library Technology Department, Cornell
       University.

Author's Address

       William Turner
       Library Technology
       502 Olin Library
       Cornell University
       Ithaca, NY  14853

       Phone: 607-255-9098
       Fax:   607-255-9346
       EMail: wrt1@cornell.edu
