US20020176628A1

Info

Publication number: US20020176628A1
Application number: US09/862,728
Authority: US
Inventors: Gary Starkweather
Original assignee: Individual
Current assignee: Microsoft Technology Licensing LLC
Priority date: 2001-05-22
Filing date: 2001-05-22
Publication date: 2002-11-28
Also published as: US8380012B2; US20050160115A1

Abstract

A document digitizing method digitizes and automatically indexes documents in printed form. The method includes optically scanning the document, forming and storing a digitized image file from the optically scanned document, optically recognizing characters in the optically scanned document, and forming and storing a text file of the optically recognized characters in the document. A retrieval method for retrieving the digitized image file for a document includes searching the text files to identify any having a selected text string and providing access to the digitized image files that correspond to those text files. The digital image file and the text file together represent a digitized document data structure that combines a digital image of a document with a text file of optically recognized characters in the digital image.

Description

TECHNICAL FIELD

The present invention relates to storage and retrieval of digitized images of printed documents and, in particular, to a simple document imaging system with automatic indexing.[0001]

BACKGROUND AND SUMMARY

The development and availability of personal computers and personal printers have brought with them repeated predictions that the paperless office is at hand. It is not yet so. Instead, the use of office bond paper continues to grow year after year. Even workplaces that minimize the generation of paper (i.e., printed) documents by use of email, online and networked resources, etc., commonly receive large numbers of printed materials and documents. Rather than the paperless office, it appears that the foreseeable achievable accomplishment will be the management of printed documents.[0002]

The age-old solution has been to store printed documents in large repositories called files. In addition to large amounts of space, such repositories require manual indexing systems to keep the documents in order for retrieval, as well as staffing for physically placing documents in the files and retrieving them. As a consequence, conventional file storage systems for printed documents are relatively large and are surprisingly expensive to maintain when all costs are considered.[0003]

In response to the significant costs and requirements of maintaining conventional printed document files, computerized or digitized document storage systems have been developed. One of the simplest digital document storage systems is simply maintaining in electronic form documents that are originally generated in that form. For example, computer storage of word processing documents, e-mail communications, etc., simply maintains such documents in the electronic form in which they were created.[0004]

Of greater complexity are systems that convert printed or other written materials, generically referred to herein as printed documents, into a digitized form for storage on computer-readable media. Such systems characteristically employ optical scanners that form digitized images of the printed documents for storage in a computer storage medium, and software for creating indices or other identifying information for retrieving the digitized images at a later time. In most such systems, indexing information is manually entered by a user into a computer system. For example, the indexing information could include conventional file reference information of the type used for conventional paper files (e.g., file reference names or numbers).[0005]

This type of digitized document storage system may be a suitable substitute for paper document storage in many business contexts. Staff who might otherwise be physically storing and retrieving paper documents can provide the indexing information and potentially process greater numbers of digitized documents for storage. In addition, many such business contexts have existing document indexing formats that may be applied to the digitized storage.[0006]

However, such manual indexing might not be suitable in other business contexts, such as smaller businesses, or for individual users. For these users, the effort of manually indexing digitized documents for storage can pose a barrier to adoption of digitized document storage. For example, there often may not be a suitable formal file format for indexing the digitized documents.[0007]

Accordingly, an aspect of the present invention is a document digitizing method for digitizing and automatically indexing documents in printed form. The method includes optically scanning the document, forming and storing a digitized image file from the optically scanned document, optically recognizing characters in the optically scanned document, and forming and storing a text file of the optically recognized characters in the document. The text file and the digital image file for a document are associated with each other. For example, the associated text and digital image files may have a common name and may be distinguished by appropriate file extensions.[0008]

The digital image file and the text file together represent a digitized document data structure that combines a digital image of a document with a text file of optically recognized characters in the digital image. The text file functions as a searchable index for retrieving the digital image files corresponding to the document. As a result, the text files function as an automatically-generated index of the digital image files and the document pages they represent. In one implementation, the document includes plural pages and a separate digitized image file and text file are formed for each page of the document.[0009]

Another aspect of the invention is a retrieval method for retrieving the digitized image file for a document. The retrieval method includes searching the text files to identify any having a selected text string and providing access to the digitized image files that correspond to those text files. For example, the access to the digitized image files may include allowing a user to selectively display any digitized image file that corresponds to an identified text file.[0010]

Additional objects and advantages of the present invention will be apparent from the detailed description of the preferred embodiment thereof, which proceeds with reference to the accompanying drawings.[0011]

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an operating environment for an embodiment of the present invention.[0012]

FIG. 2 is a flow diagram of a document digitizing method of the present invention that can provide simple automatic imaging and indexing of digitized documents.[0013]

FIG. 3 is a block diagram representing a digitized document data structure according to the present invention.[0014]

FIG. 4 is a flow diagram of a digitized document retrieval method for retrieving digitized documents stored in accordance with the document digitizing method of FIG. 2.[0015]

FIG. 5 is a simplified diagram of an exemplary graphical user interface for the digitized document retrieval method of FIG. 4.[0016]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 illustrates an operating environment for an embodiment of the present invention as a computer system 20 with a computer 22 that comprises at least one high speed processing unit (CPU) 24 in conjunction with a memory system 26, an input device 28, and an output device 30. These elements are interconnected by at least one bus structure 32.[0017]

The illustrated CPU 24 is of familiar design and includes an ALU 34 for performing computations, a collection of registers 36 for temporary storage of data and instructions, and a control unit 38 for controlling operation of the system 20. The CPU 24 may be a processor having any of a variety of architectures, including Alpha from Digital; MIPS from MIPS Technology, NEC, IDT, Siemens, and others; x86 from Intel and others, including Cyrix, AMD, and Nexgen; and the PowerPC from IBM and Motorola.[0018]

The memory system 26 generally includes high-speed main memory 40 in the form of a medium such as random access memory (RAM) and read only memory (ROM) semiconductor devices, and secondary storage 42 in the form of long term storage mediums such as floppy disks, hard disks, tape, CD-ROM, flash memory, etc. and other devices that store data using electrical, magnetic, optical or other recording media. The main memory 40 also can include video display memory for displaying images through a display device. Those skilled in the art will recognize that the memory 26 can comprise a variety of alternative components having a variety of storage capacities.[0019]

The input and output devices 28 and 30 also are familiar. The input device 28 can comprise a keyboard, a mouse, a physical transducer (e.g., a microphone), etc. In addition, input device 28 includes an optical scanner that optically scans printed and other written documents or materials (together referred to as printed documents) to generate digitized images of them. The output device 30 can comprise a display, a printer, a transducer (e.g., a speaker), etc. Some devices, such as a network interface or a modem, can be used as input and/or output devices.[0020]

As is familiar to those skilled in the art, the computer system 20 further includes an operating system and at least one application program. The operating system is the set of software which controls the computer system operation and the allocation of resources. The application program is the set of software that performs a task desired by the user, using computer resources made available through the operating system. Both are resident in the illustrated memory system 26.[0021]

In conjunction with the referenced optical scanner input device 28, computer system 20 includes software for controlling the optical scanner and for generating digitized images of scanned documents. Computer system 20 also includes optical character recognition software, as is known in the art, for discerning under computer control text characters in a scanned document and generating a corresponding text computer file.[0022]

In accordance with the practices of persons skilled in the art of computer programming, the present invention is described below with reference to acts and symbolic representations of operations that are performed by computer system 20, unless indicated otherwise. Such acts and operations are sometimes referred to as being computer-executed and may be associated with the operating system or the application program as appropriate. It will be appreciated that the acts and symbolically represented operations include the manipulation by the CPU 24 of electrical signals representing data bits which causes a resulting transformation or reduction of the electrical signal representation, and the maintenance of data bits at memory locations in memory system 26 to thereby reconfigure or otherwise alter the computer system's operation, as well as other processing of signals. The memory locations where data bits are maintained are physical locations that have particular electrical, magnetic, or optical properties corresponding to the data bits.[0023]

FIG. 2 is a flow diagram of a document digitizing method 50 of the present invention that can provide automatic imaging and indexing of digitized documents. Digitizing method 50 employs conventional optical scanning of written or printed materials or documents to generate digital images of documents and automatically generates an indexing file to aid in retrieval of the documents.[0024]

It will be appreciated that references to “printed documents” or “written or printed materials” are inclusive of virtually any paper or other medium with text characters on it. There may be images or pictures interspersed with the text characters, and some pages of a document may have no text characters at all. For best utilization of the present invention, at least some of the text characters on at least one page of a document will be discernible by optical character recognition software.[0025]

Process block 52 indicates that a document is optically scanned to form a digital image file 54 (FIG. 3) for each page of the document. At least one page of the document includes text characters discernible by optical character recognition software. In an exemplary implementation, each digital image file 54 may be of a Tag Image File (.tif) format or another lossless image format, or may alternatively be of a lossy image format such as JPEG. In addition, each digital image file 54 may be compressed, such as by CCITT Group 4 compression for .tif files. It will be appreciated that other compression formats may be used for .tif file format images, as well as other lossless or lossy file formats.[0026]

Process block 56 indicates that digital image computer file 54 for each page of the document is stored under a file indicator or name that is selected in a predefined manner. In one implementation, digital image files 54 may be stored under numeric, alphabetic, or alphanumeric file indicators or names that increment sequentially for all digital image files 54 generated by method 50. For example, document digitizing method 50 may over time be applied to thousands of documents encompassing thousands of digital image files 54. The file indicators or names for the pages of a most recent document could be the next successive numeric, alphabetic, or alphanumeric indicators in sequence after those of the preceding digital image files 54.[0027]
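The sequential naming described for process block 56 can be illustrated with a minimal Python sketch. The function name `next_file_stem`, the zero-padded numeric scheme, and the eight-digit default width are assumptions chosen for illustration; the description only requires that names increment in a predefined manner.

```python
from pathlib import Path


def next_file_stem(store: Path, width: int = 8) -> str:
    """Return the next sequential numeric file stem for a storage directory.

    Assumption: image files are named with zero-padded decimal numbers and a
    .tif extension (e.g. 00000042.tif). Files with non-numeric names are ignored.
    """
    numbers = [int(p.stem) for p in store.glob("*.tif") if p.stem.isdigit()]
    # Start at 1 when the directory holds no numbered image files yet.
    return f"{max(numbers, default=0) + 1:0{width}d}"
```

Zero padding keeps lexical order and numeric order identical, which is convenient when files are later listed or paged through in sequence.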

Process block 58 indicates that optical character recognition is applied to each digital image file 54, or a copy or alternative form of it, to discern text characters in and form a text file 60 (FIG. 3) for each corresponding page of the document. For example, optical character recognition is applied to all text characters in the optically scanned document. Application of optical character recognition to all text characters means that optical character recognition is attempted throughout the optically scanned document. It will be appreciated, however, that due to various circumstances, not all text characters will necessarily be recognized by the optical character recognition software.[0028]

In an exemplary implementation, each text file 60 may be of a format with minimal or no embedded coding and minimal text formatting, such as ASCII characters in a .txt file format. Other text file formats could alternatively be used, but the .txt file format is desirable because it simplifies the optical character recognition and minimizes the storage requirements of text files 60. The optical character recognition may be performed by any optical character recognition software, such as any of a variety of commercially available optical character recognition software programs including OmniPage Pro™ from Caere Inc., TextBridge™ and Pagis Pro™ from Xerox Corporation, and TypeReader™ 5.0 by Expervision, Inc.[0029]

Process block 62 indicates that text file 60 for each page of the document is stored under a file indicator or name that is selected in a predefined manner. In one implementation, text files 60 may be stored under numeric, alphabetic, or alphanumeric file indicators or names that increment sequentially for all text files 60 generated by method 50. For example, text file 60 for each page may have the same numeric, alphabetic, or alphanumeric file indicator or name as the corresponding digital image file 54, but have a different file extension to distinguish the text and image files (e.g., .txt and .tif). Such common names for corresponding text files 60 and digital image files 54 represent the simplest manner of correlating corresponding files.[0030]
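As a hedged illustration of process blocks 56 through 62, the following Python sketch stores one page as a pair of files that share a name and differ only in extension. The names `store_page` and `recognize` are hypothetical; `recognize` stands in for whatever optical character recognition software is used and is passed in as a callable, so no particular OCR product or API is assumed.

```python
from pathlib import Path
from typing import Callable, Tuple


def store_page(
    store: Path,
    stem: str,
    image_bytes: bytes,
    recognize: Callable[[bytes], str],
) -> Tuple[Path, Path]:
    """Store one scanned page as <stem>.tif plus <stem>.txt in `store`.

    `recognize` is a placeholder for the OCR step: it receives the image
    data and returns the recognized text.
    """
    image_path = store / f"{stem}.tif"
    text_path = store / f"{stem}.txt"
    image_path.write_bytes(image_bytes)
    # Plain ASCII text keeps the index files small; characters outside ASCII
    # are written as '?' rather than aborting the whole page.
    text_path.write_text(recognize(image_bytes), encoding="ascii", errors="replace")
    return image_path, text_path
```

Because the two files share a stem, no separate table is needed to correlate them; the text file name alone identifies the image file.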

In alternative implementations, the corresponding text and digital image files 54 could have different names. However, such implementations would require an algorithm, a table, or another manner of correlating the corresponding text and digital image files 54.[0031]

Digital image files 54 and text files 60 of FIG. 3 together represent a digitized document data structure 64 that combines a digital image of each page of the document with a text file of optically recognized characters in the digital image. As described below in greater detail, the text file functions as a searchable index for retrieving the digital image files corresponding to a document. As a result, text files 60 function as an automatically-generated index of digital image files 54 and the document pages they represent.[0032]

FIG. 4 is a flow diagram of a digitized document retrieval method 80 for retrieving digitized documents stored in accordance with document digitizing method 50 or having digitized document data structure 64. Digitized document retrieval method 80 is described with reference to an exemplary graphical user interface 82 (FIG. 5) that would be rendered on a computer display screen.[0033]

Process block 84 indicates that one or more text strings to be searched for within text files 60 are entered by a user. In user interface 82, for example, a selected text string to be searched for may be entered into a text box 86. Text box 86 illustrates a user interface feature for searching a single text string, but user interface 82 further allows searching of multiple text strings conjunctively, as described below.[0034]

Process block 88 indicates that a search is commenced, such as when a user activates a graphical control like a button 90 in user interface 82.[0035]

Process block 92 indicates that text files 60 are searched to identify any with the one or more text strings.[0036]

Process block 94 indicates that the names of the text files 60 identified as having the one or more text strings are listed, such as in a display for viewing by the user. In user interface 82, for example, text files 60 identified as having an initial text string may be listed in a search result box 96. Alternatively, the names of the text files 60 identified as having the one or more text strings may be listed in a search results computer file stored on the computer system to be accessed later by the user.[0037]
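Process blocks 92 and 94 amount to a substring search over the stored text files. A minimal Python sketch follows; the function name `search_text_files` is hypothetical, and the case-insensitive comparison is an assumption that the description does not specify.

```python
from pathlib import Path
from typing import List


def search_text_files(store: Path, text_string: str) -> List[str]:
    """Return the names of the .txt files in `store` that contain `text_string`.

    Assumption: matching ignores letter case. OCR output is read as ASCII,
    with any unexpected bytes replaced rather than raising an error.
    """
    needle = text_string.lower()
    hits = []
    for text_path in sorted(store.glob("*.txt")):
        contents = text_path.read_text(encoding="ascii", errors="replace")
        if needle in contents.lower():
            hits.append(text_path.name)
    return hits
```

The returned names could be shown in a result list or written to a results file for later use, matching the two alternatives described for process block 94.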

A first additional conjunctive text string may be listed in a text box 98, and a first additional conjunctive text string search may be commenced when a user activates a graphical control like a button 100. Text files 60 identified as having the initial text string of text box 86 and the first additional conjunctive text string of text box 98 may be listed in a search result box 102. Likewise, a second additional conjunctive text string may be listed in a text box 104, and a second additional conjunctive text string search may be commenced when a user activates a graphical control like a button 106. Text files 60 identified as having the initial text string of text box 86 and the first and second additional conjunctive text strings of text boxes 98 and 104 may be listed in a search result box 108. The numbers of text files 60 listed in search results boxes 96, 102, and 108 may be indicated in respective search results file number boxes 110, 112, and 114. A Remove Selected Item control 116 initiates deletion of files that are selected or highlighted in one of search results boxes 96, 102, and 108. A selected items count box 118 indicates the number of items that are selected or highlighted in one of search results boxes 96, 102, and 108.[0038]

It will be appreciated that separate controls and search result boxes for initial and successive conjunctive text string searches are merely one graphical user interface implementation. Alternatively, such conjunctive searches may be entered into a single search text box, as is common with many computer search tools. Moreover, other implementations could include any of the Boolean combinations of text strings commonly employed with computer search tools.[0039]
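The successive conjunctive searches can be sketched in Python as a narrowing filter, where each additional text string is applied only to the files that matched all earlier strings. The function name `conjunctive_search` and the case-insensitive matching are assumptions for illustration. The list returned for each stage corresponds loosely to the contents of one search result box.

```python
from pathlib import Path
from typing import List


def conjunctive_search(store: Path, text_strings: List[str]) -> List[List[str]]:
    """Return one list of matching .txt file names per search stage.

    Stage k lists the files containing all of the first k text strings,
    so each stage is a subset of the one before it.
    """
    candidates = sorted(store.glob("*.txt"))
    stages = []
    for text_string in text_strings:
        needle = text_string.lower()
        # Only files that survived the earlier stages are read again.
        candidates = [
            path
            for path in candidates
            if needle in path.read_text(encoding="ascii", errors="replace").lower()
        ]
        stages.append([path.name for path in candidates])
    return stages
```

The length of each stage list gives the file count that a user interface could display next to the corresponding result list.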

Process block 120 indicates that a file name or indicator corresponding to a digital image file 54 to be viewed is entered by the user. For example, the file name or indicator may be that of a text file 60 listed in a search results box, such as one of boxes 96, 102, and 108.[0040]

The user may manually enter the file name or indicator, or may enter it by selecting (e.g., “single clicking”) or activating (e.g., “double clicking”) text file 60 listed in a search results box. As described above, each text file 60 corresponds to a digital image file 54. In user interface 82, for example, the file name or indicator may be entered into a text box 122.[0041]

Process block 124 indicates that digital image file 54 corresponding to the entered file name or indicator is retrieved and displayed, such as when a user activates a graphical control like a button 126 in user interface 82.[0042]
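When corresponding files share a common name, resolving an entered file name or indicator to its image file reduces to swapping the extension. The Python sketch below assumes that naming scheme and a .tif image extension; `image_for` is a hypothetical helper name, and displaying the image is left to whatever viewer the application uses.

```python
from pathlib import Path


def image_for(store: Path, file_indicator: str) -> Path:
    """Map a text file name or bare indicator to its image file in `store`.

    Accepts either '00000003.txt' or '00000003'. Raises FileNotFoundError
    if no image file with the same stem exists.
    """
    image_path = store / (Path(file_indicator).stem + ".tif")
    if not image_path.is_file():
        raise FileNotFoundError(f"no image file for {file_indicator!r}")
    return image_path
```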

Digitized document retrieval method 80 provides retrieval of digital image files based upon text strings in corresponding text files 60. As a result, text files 60 of digitized document data structure 64 provide an automatic indexing structure for accessing corresponding digital image files 54. Sometimes a searched text string will provide access to one page of a digitized document when another page of the document is actually desired. Accordingly, user interface 82 includes a show next image control 128 for displaying the next successive digital image file 54 and a show previous image control 130 for displaying the immediately preceding digital image file 54. Controls 128 and 130 allow a user to scroll to successive pages of a document. In this regard, it is desirable that digitizing method 50 be applied to documents with their pages in regular order or sequence.[0043]
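Stepping to the next or previous page can be sketched in Python as moving through the sorted list of image files. This assumes zero-padded sequential names, so that sorted order matches scan order; the function name `neighbor_image` and the .tif extension are illustrative assumptions.

```python
from pathlib import Path
from typing import Optional


def neighbor_image(store: Path, current: str, step: int) -> Optional[Path]:
    """Return the image file `step` positions away from `current`.

    Use step=1 for the next image and step=-1 for the previous one.
    Returns None when stepping past either end of the stored sequence.
    Raises ValueError if `current` has no image file in `store`.
    """
    images = sorted(store.glob("*.tif"))
    names = [path.name for path in images]
    index = names.index(Path(current).stem + ".tif") + step
    return images[index] if 0 <= index < len(images) else None
```

Because pages are stored in scan order, a page found by search and its neighbors in the sequence normally belong to the same document, which is why scanning pages in their regular order matters.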

The implementation of user interface 82 shown in FIG. 5 includes features in addition to the features described above with reference to digitized document retrieval method 80. For example, user interface 82 includes a text file directory navigation window 140 listing one or more operating system directories or folders in which text files 60 are stored. Multiple directories or folders may be used to overcome operating system limits on the numbers of files in a directory or folder or to organize digitized documents in a user-defined manner.[0044]

Selection of a directory or folder listed in text file directory navigation window 140 accesses the text files 60 in a selected directory or folder and lists at least a portion of the text files 60 in a text file listing window 142. A second text file listing window 144 allows text files 60 in a second selected directory or folder to be listed. Text file count windows 146 list the numbers of files in one or all of the directories or folders, and a maximum text file count window 148 lists the maximum number of text files that can be accommodated.[0045]

To minimize system resources and time required for repeated searches, user interface 82 includes a Load Previous Searches control 150 operable by a user to retrieve results of previous text string searches, with the text strings themselves being listed in a Previously Searched Text Strings window 152. For example, the results of the n most recent text string searches may be stored in a search results file that correlates each text string searched with the names of the text files identified in the search. Selection of a text string listed in Previously Searched Text Strings window 152 loads the corresponding listing of identified text files 60 into search result box 96.[0046]
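A search results file of this kind can be sketched in Python as a small JSON document that maps each searched text string to the file names it matched, trimmed to the n most recent searches. The JSON format, the function names `load_searches` and `remember_search`, and the default limit of 20 are assumptions for illustration; the description does not specify a file format.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_searches(results_file: Path) -> Dict[str, List[str]]:
    """Load previous searches, oldest first; empty if the file does not exist."""
    if not results_file.exists():
        return {}
    return json.loads(results_file.read_text(encoding="utf-8"))


def remember_search(
    results_file: Path, text_string: str, names: List[str], keep: int = 20
) -> None:
    """Record a search result and keep only the `keep` most recent searches."""
    history = load_searches(results_file)
    # Re-inserting moves a repeated search string to the most recent position.
    history.pop(text_string, None)
    history[text_string] = names
    while len(history) > keep:
        del history[next(iter(history))]  # drop the oldest entry
    results_file.write_text(json.dumps(history, indent=2), encoding="utf-8")
```

Loading a stored entry avoids rereading every text file, at the cost of possibly stale results if documents were added after the search was saved.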

In an alternative implementation, the search results file may include only the previously searched text strings, and not the text file listings generated from the previous searches. In this implementation, selection of a text string in Previously Searched Text Strings window 152 would cause the text string to be copied to search term text box 86, so that activation of control 90 would initiate a new search of the indicated text string.[0047]

A previous search term count window 154 indicates a numeric count of the number of text strings included in the text string searches file, and an “alphabetize” control 156 allows a user to alphabetize the listing of text strings displayed in Previously Searched Text Strings window 152.[0048]

A batch search text string window 160 lists one or more text strings for which searches of text files 60 are to be conducted, such as in a batch of multiple successive searches. Such batched searches may commonly be distinguished from individual searches for which a user would desire search results immediately upon completion of the search. A text string may be added to batch search text string window 160 by first entering the text string into an entry window 162 and activating an add item control 164. A text string may be removed from batch search text string window 160 by selecting the text string and activating a remove item control 166. The batched searching of text strings listed in batch search text string window 160 is commenced upon user activation of a batch search graphical control 168, and the searches may continue to execute in the user's absence. It will be appreciated, however, that such batched searching could be used by a user wanting search results immediately.[0049]
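A batched search can be sketched in Python as a single pass over the text files that tests every queued text string against each file, so each file is read only once regardless of how many strings are queued. The function name `batch_search` and the case-insensitive matching are assumptions for illustration.

```python
from pathlib import Path
from typing import Dict, List


def batch_search(store: Path, text_strings: List[str]) -> Dict[str, List[str]]:
    """Search all .txt files in `store` for each queued text string.

    Returns a mapping from each text string to the names of the files that
    contain it. Strings are searched independently, not conjunctively.
    """
    results: Dict[str, List[str]] = {text_string: [] for text_string in text_strings}
    for text_path in sorted(store.glob("*.txt")):
        contents = text_path.read_text(encoding="ascii", errors="replace").lower()
        for text_string in text_strings:
            if text_string.lower() in contents:
                results[text_string].append(text_path.name)
    return results
```

The resulting mapping has the same shape as a stored search results file, so a batch run could be saved and its results loaded later without repeating the search.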

Having described and illustrated the principles of my invention with reference to an illustrated embodiment, it will be recognized that the illustrated embodiment can be modified in arrangement and detail without departing from such principles. In view of the many possible embodiments to which the principles of my invention may be applied, it should be recognized that the detailed embodiments are illustrative only and should not be taken as limiting the scope of my invention. Rather, I claim as my invention all such embodiments as may come within the scope and spirit of the following claims and equivalents thereto.[0050]

Claims

1. A document digitizing method of digitizing a document in printed form, comprising:

optically scanning the document;

forming and storing a digitized image file from the optically scanned document;

optically recognizing under computer control characters in the optically scanned document; and

forming and storing a text file of the optically recognized characters in document.

2. The method of claim 1 in which the document includes plural pages and a separate digitized image file is formed for each page of the document.

3. The method of claim 2 in which a separate text file is formed for each page of the document.

4. The method of claim 1 in which the document includes plural pages and a separate text file is formed for each page of the document.

5. The method of claim 1 in which each digitized image file is correlated with a corresponding text file.

6. The method of claim 5 in which corresponding digitized image files and text files are correlated by being assigned common names and are distinguished by appropriate file extensions.

7. The method of claim 5 in which corresponding digitized image files and text files are correlated by a mapping table or algorithm.

8. The method of claim 1 further comprising retrieving a digitized image file for a document based upon a text string in the text file corresponding to the digitized image file.

9. The method of claim 1 in which the digitized image file is compressed and of a lossless image file format.

10. The method of claim 1 in which the text file is of a simplified file format based upon ASCII characters.

11. The method of claim 1 in which optical character recognition is applied to all text characters in the optically scanned document.

12. In a document digitizing system for optically scanning and digitizing a document in printed form and having a computer readable medium, a digitized document data structure for digitized documents stored in the computer readable, comprising:

a digitized image file representing a digitized image from the optically scanned document; and

a text file of text characters optically recognized from the optically scanned document.

13. The data structure of claim 12 further comprising correlated file indicators for the digitized image file and the text file representing the optically scanned document.

14. The data structure of claim 13 in which the correlated file indicators include common file names and distinct file extensions for the digitized image file and the text file representing the optically scanned document.

15. The data structure of claim 12 in which the document includes plural pages, the data structure comprising a separate digitized image file for each page of the document and a corresponding separate text file for each page of the document.

16. The data structure of claim 12 in which the digitized image file is compressed and of a lossless image file format.

17. The data structure of claim 12 in which the text file is of a simplified file format based upon ASCII characters.

18. In a document digitizing system for optically scanning and forming a digitized image file of a document having text characters in printed form, a method of retrieving the digitized image file for a document, comprising:

storing digitized image files for plural printed documents in association with text files of the text characters in each document, the text files being generated by computer optical character recognition of the digitized image files or related image files;

searching the text files to identify any having a selected text string; and

providing access to the digitized image files corresponding to the text files identified as having the selected text string.

19. The method of claim 18 in which providing access to the digitized image files includes allowing a user to selectively display any of the digitized image files corresponding to the text files identified as having the selected text string.

20. The method of claim 18 in which digitized image files are associated with text files by having common file names and are distinguished by appropriate file extensions.

21. The method of claim 18 in which searching the text files to identify any having a selected text string includes specifying multiple separate text strings and searching the text files in a batch to identify any text files having any of the separate text strings.

22. The method of claim 18 in which the text files have file names, the method further comprising storing the file names of the text files identified as having the selected text string.

23. In a computer-readable medium, document digitizing software for digitizing a document in printed form, comprising:

software for optically scanning the document;

software for forming and storing a digitized image file from the optically scanned document;

software for optically recognizing under computer control characters in the optically scanned document; and

software for forming and storing a text file of the optically recognized characters in document.

24. The medium of claim 23 further including software for correlating each digitized image file with a corresponding text file.

25. The medium of claim 24 in which corresponding digitized image files and text files are correlated by being assigned common names and are distinguished by appropriate file extensions.

26. The medium of claim 23 further comprising software for retrieving a digitized image file for a document based upon a text string in the text file corresponding to the digitized image file.