Abstract: "Document retrieval based on string searches typically returns either the whole document or just the occurrences of the searched words. What the user often is after, however, is a microdocument: a part of the document that contains the occurrences and is reasonably self-contained."
"These microdocuments might, for instance, consist of several successive text paragraphs sharing a mutual subject. Single paragraphs, or corresponding close-to-leaf SGML elements, do not convey enough of the contextual information. On the other hand, sections or subsections of a text document, such as a book or an article, can discuss many heterogeneous topics, and thus be too large a unit for retrieval or assembly.
"We claim that such microdocuments are both suitable retrievable units and appropriate units for document assembly, and that they can be reasonably well located using automatic techniques.
"Optimal creation of microdocuments would require thorough semantic analysis of the text. However, it is possible to catch parts of the elementary semantic content by statistical term-frequency analysis.
"Term-frequency distributions enable us to determine the locations of possible topic changes in the text. Based on this information, we can measure the similarity of two successive elements, and decide whether we wish to have them in the same microdocument. On the other hand, existing markup, for example classifying attributes, can be used in boundary detection. The microdocument, again, can be attributed with content information.
"The results of our preliminary experiments show that the presented approach works well in user-assisted topic-oriented microdocument detection. We currently study the usefulness of this technique in document assembly, i.e., in generating new documents from a collection of existing text documents."
[...] "We consider a topical microdocument to be semantically motivated by the topic the microdocument discusses. Topical microdocuments might, for instance, consist of several successive text paragraphs. Single paragraphs, or corresponding close-to-leaf SGML elements, do not convey enough of the contextual information. On the other hand, sections or subsections of a text document, such as a book or an article, can discuss many heterogeneous topics. Furthermore, sections are often longer than desired with respect to the intended purpose, such as document retrieval or assembly.
In this article, we presented a method for detecting microdocuments based on term-frequency distributions. The detection process has two phases: similarity calculation and fragmentation. In general, the results of our preliminary experiments show that the presented approach works well in user-assisted topic-oriented microdocument detection. We currently study the usefulness of this technique in document assembly, i.e., in generating new documents from a collection of existing text documents.
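Illustration (not from the paper): the two phases named above, similarity calculation and fragmentation, can be sketched roughly as follows. This entry does not state which similarity measure or boundary criterion the authors used, so the cosine similarity over raw term counts and the fixed threshold of 0.2 are assumptions, and the function names (term_frequencies, cosine_similarity, detect_microdocuments) are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphabetic tokens in one paragraph (no stop-word removal)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (assumed measure)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_microdocuments(paragraphs, threshold=0.2):
    """Group successive paragraphs into candidate microdocuments.

    Phase 1 (similarity calculation): compare each paragraph with its successor.
    Phase 2 (fragmentation): start a new group wherever the similarity falls
    below the threshold. The threshold value is an assumption.
    """
    if not paragraphs:
        return []
    freqs = [term_frequencies(p) for p in paragraphs]
    similarities = [
        cosine_similarity(freqs[i], freqs[i + 1]) for i in range(len(freqs) - 1)
    ]
    groups = [[paragraphs[0]]]
    for paragraph, similarity in zip(paragraphs[1:], similarities):
        if similarity >= threshold:
            groups[-1].append(paragraph)   # same topic: extend current group
        else:
            groups.append([paragraph])     # likely topic change: new group
    return groups
```

Given a list of paragraph strings, detect_microdocuments returns a list of groups, each a run of successive paragraphs whose neighbouring term distributions stay above the threshold. In a user-assisted setting such as the one the abstract describes, the proposed boundaries would presumably be reviewed rather than accepted automatically.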
A version of the document is available online in PostScript format: from Helsinki, or the local mirror copy. A number of related publications from the University of Helsinki are listed in a departmental bibliography.
Note: The electronic conference proceedings in hypertext were produced by Inso Corporation (DynaText) and by High Text (EnLIGHTeN). Information about the SGML Europe '97 Conference may be found in the main database entry.