US20040163035A1

Info

Publication number: US20040163035A1
Application number: US10/771,315
Authority: US
Inventors: Assaf Ariel; Michael Brand; Itsik Horowitz; Ofer Shochet; Itzik Stauber; Dror Ziv
Original assignee: Verint Systems Inc
Current assignee: Credit Suisse AG
Priority date: 2003-02-05
Filing date: 2004-02-05
Publication date: 2004-08-19
Also published as: EP1590796A1; US20080183468A1; WO2004072955A1; IL170065A; WO2004072780A3; WO2004072780A2; US7792671B2; US20040158469A1; US8195459B1; EP1590798A2

Abstract

Non-deterministic text with average word recognition precision below 50% is processed utilizing non-textual differences between words or sequences of words in the text to provide more useful information to users by resolving more than two decision options. One or more indexes that indicate non-textual differences between n-word sequences, where n is a positive integer, may be generated for use in data mining that considers the non-textual differences. Alternatively, multiple indexes may be generated using different data mining techniques that may or may not utilize non-textual differences and then the results produced by the different data mining techniques may be merged to identify non-textual differences. These techniques may be used in classifying, labeling, categorizing, filtering, clustering, or retrieving documents, or in discovering salient terms in a set of documents.

Description

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application is related to and claims priority to U.S. provisional application entitled METHOD FOR AUTOMATIC AND SEMI-AUTOMATIC CLASSIFICATION AND CLUSTERING OF NON-DETERMINISTIC TEXTS having serial No. 60/444,982, by Assaf ARIEL, Itsik HOROWITZ, Itzik STAUBER, Michael BRAND, Ofer SHOCHET and Dror ZIV, filed Feb. 5, 2003 and incorporated by reference herein. This application is also related to the application entitled AUGMENTATION AND CALIBRATION OF OUTPUT FROM NON-DETERMINISTIC TEXT GENERATORS BY MODELING ITS CHARACTERISTICS IN SPECIFIC ENVIRONMENTS by Michael BRAND, filed concurrently and incorporated by reference herein.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] The present invention is directed to processing of information in non-deterministic texts to increase the usefulness of the texts and, more particularly, to using non-textual information to indicate the importance or recognition accuracy of individual words or sequences of words.

[0004] 2. Description of the Related Art

[0005] In general, spoken document retrieval (SDR) is composed of two stages: transcription of speech and information retrieval (IR). Transcription of the speech is often referred to as speech-to-text (STT) or automatic speech recognition (ASR), and is often performed using a large vocabulary continuous speech recognizer (LVCSR). Information retrieval (IR) is a general term referring to all forms of data mining. One common form of data mining, for example, is query-based retrieval, where, based on a user's query, documents are retrieved and presented to the user, ordered by an estimated measure of their relevance to the query. Traditionally, this stage is performed on the text output of the first stage.

[0006] There are many known techniques for extracting useful information from texts, commonly referred to as text mining or text data mining, which is a sub-discipline of data mining. Many of these techniques have been used on text output by speech-to-text algorithms or automatic character recognition systems. However, in systems that use text that has been converted from digitized speech or is based on character recognition, there has been little success when the original source is of low quality, such as telephone conversations or handwritten text, due to the low accuracy of the resulting texts. As a result, most commentators in the field have discouraged application of techniques developed for easily recognized source material to source material that is difficult to recognize. Examples of such techniques can be found in U.S. Pat. Nos. 5,625,748; 6,397,181 and 6,598,054, all incorporated by reference herein.

[0007] Therefore, there are no known systems that provide easy access to poor quality audio, except when it is in a predictable format, such as conversations between air traffic controllers and persons in the cockpit of an aircraft, which follow set rules.

SUMMARY OF THE INVENTION

[0008] It is an aspect of the present invention to improve access to text by using non-textual information.

[0009] It is another aspect of the present invention to use conventional text mining techniques in previously developed text mining software in a way that utilizes non-textual information in data mining.

[0010] It is a further aspect of the present invention to improve access to documents produced by speech recognizers using recognition confidence measurement.

[0011] The above aspects can be attained by a method for processing documents derived from at least one of spontaneous and conversational expression and containing non-deterministic text with average word recognition precision below 50 percent, the processing utilizing non-textual differences between n-word sequences in the documents to resolve more than two decision options, where n is a positive integer. Such text may be obtained by automatic character recognition or automatic speech recognition of audio signals received via a telephone system. In the preferred embodiment, the non-textual differences between the n-word sequences relate to recognition confidence of the n-word sequences.

[0012] When the processing requires fast access to the information stored in a large corpus of documents, e.g. for the purpose of data mining, the data is preferably pre-processed to index the n-word sequences in a method that utilizes the non-textual differences between them. Such a procedure can speed up many forms of data access, and in particular many forms of data mining, including query based retrieval, as would be apparent to a person skilled in the art.

[0013] The data mining may include extracting parameters from the documents utilizing the non-textual differences between the n-word sequences and establishing relations between the parameters extracted from the documents. The parameters extracted from the documents may be fully known, such as parameters available in document metadata, or may be hidden variables that cannot be fully determined from information existing in the document. Examples of extracted parameters include an assessment of relevance to a query based on the non-textual differences between the n-word sequences and an assessment of the document's relevance to a category.

[0014] As an alternative to creating index(es) indicating non-textual differences between n-word sequences, algorithm(s) can be used to convert text containing non-textual differences between the n-word sequences into different standard text documents. Many different algorithms may be used to transform non-deterministic text into standard text documents usable in text mining. For example, the algorithm to extract standard text documents from text with non-textual differences may apply a thresholding algorithm with varying thresholds. Then, one or more data mining techniques, each of which does not utilize non-textual differences, can be applied to these standard text documents and the outputs of the different data mining techniques can be merged to obtain information that is equivalent to that obtained by data mining that utilizes the non-textual differences.

[0015] Whether or not the index(es) include an indication of non-textual differences, the documents may be categorized, clustered, classified, filtered or labeled, e.g., by using an algorithm to detect salient terms in the documents based on non-linguistic differences between the n-word sequences.

[0016] In response to a query using any type(s) of index(es), information related to at least one of the documents may be displayed, including at least some non-textual differences between n-word sequences. Portions of the document(s) may be selectively displayed based on confidence of the accuracy of the displayed words. For example, salient terms in the document(s) may be displayed based on processing of confidence levels of recognition of the salient terms that resolves more than two decision options. In addition, parameters extracted from the documents and indications of the relations between these parameters may be displayed graphically.

[0017] In response to the display of such information, a user may indicate errors in recognition. In this case, at least one word in the document is preferably replaced with a corrected word supplied by the user and the confidence level(s) of the corrected word(s) are reset to indicate high recognition accuracy.
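The user correction step just described can be illustrated with a minimal Python sketch. It is not taken from the specification: the (word, confidence) tuple layout, the function name apply_correction and the use of 1.0 to stand for "high recognition accuracy" are all assumptions made for illustration.

```python
def apply_correction(words, position, corrected_text, high_confidence=1.0):
    """Return a copy of the transcript with one word replaced by the user's
    correction and its confidence reset to a high value.

    words: list of (text, confidence) pairs (hypothetical layout).
    high_confidence: assumed stand-in for "high recognition accuracy".
    """
    updated = list(words)  # leave the original transcript untouched
    updated[position] = (corrected_text, high_confidence)
    return updated


transcript = [("please", 0.9), ("council", 0.3), ("my", 0.8), ("order", 0.7)]
corrected = apply_correction(transcript, 1, "cancel")
```

Returning a new list rather than editing in place keeps the recognizer's original output available, which may be useful if corrections are later used as training data.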

[0018] These, together with other aspects and advantages which will be subsequently apparent, reside in the details of construction and operation as more fully hereinafter described and claimed, reference being had to the accompanying drawings forming a part hereof, wherein like numerals refer to like parts throughout.

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] FIG. 1 is a flowchart of a conventional spoken document retrieval system.

[0020] FIG. 2 is a flowchart of one method of spoken document retrieval according to the invention.

[0021] FIG. 3 is a flowchart of another method of spoken document retrieval according to the invention.

[0022] FIGS. 4 and 5 are block diagrams of spoken document retrieval systems according to the invention.

[0023] FIG. 6 is a block diagram of one confidence sensitive inverted index and one regular index containing confidence information.

[0024] FIG. 7 is a flowchart of text processing according to the invention.

[0025] FIGS. 8 and 9 are examples of displays generated by telephone call processing applications according to the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0026] Following are several terms used herein that are in common use in automatic speech recognition or data mining.



labeling	a form of data processing where documents are analyzed, and the analysis results (referred to as “labels”) are made available for later processing stages. For example, a topical analysis of documents is a labeling of the documents by subject.
retrieving	a form of data mining where a subset of a document corpus is returned in response to a query. Preferably, the documents are each given a rank pertaining to their relevance to the query, and are sorted by decreasing relevance.
categorizing	a form of data mining where several “categories” are defined, and the documents of a corpus are labeled according to the category to which they fit best. A common variation is multilabel categorizing, where each document may fit zero or more categories. Preferably, information is given regarding the quality of the fit.
clustering	a form of data mining similar to categorization, with the difference that the “categories” are not predefined, and the data mining must reveal them automatically.
classifying	a process performed on a stream of incoming documents, where each is labeled and then forwarded for relevant additional processing (manual or automatic) based on the labels that have been discovered.
filtering	a process performed on a stream of incoming documents, where each is labeled and then forwarded or discarded based on the labels that have been discovered.
salient terms	terms whose appearance in a document provides information relevant to its correct labeling, and consequently to all forms of data mining subsequent to labeling.

[0027] First, processing performed by a typical spoken document retrieval system will be described with reference to FIG. 1. High quality audio 20 is input into an ASR system 22 using LVCSR. ASR system 22 converts spoken words into textually represented words, but often has other outputs as well. These outputs may include timing information, an indication of the confidence of recognition of particular words and phrases, alternative likely transcriptions, and more.

[0028] The LVCSR output cannot be piped directly into a traditional text mining system. It has to be converted into searchable text. For this reason, canonization 24 is performed to produce canonized text, also referred to below as standard text documents, used by conventional text mining software. Most commonly, canonization simply involves taking the textual words out of the LVCSR output and concatenating them. More sophisticated canonization schemes involve usage of both textual and non-textual information to convert the LVCSR output into a format more easily handled by text mining system 26. Usage of textual information may include capitalization and punctuation based on grammatical rules. Usage of non-textual information may include capitalization and punctuation using timing information and omission of word(s) based on low confidence levels.
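The two simplest canonization schemes mentioned above, plain concatenation and omission of low-confidence words, can be sketched in Python as follows. This is only an illustration under stated assumptions: the RecognizedWord record, its field names, the [0, 1] confidence scale and the canonize function are hypothetical and do not describe the output format of any particular recognizer.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str          # word as transcribed by the recognizer
    confidence: float  # assumed recognition confidence in [0, 1]
    start: float       # assumed start time in seconds (timing information)


def canonize(words, min_confidence=0.0):
    """Concatenate recognized words into plain text, omitting any word
    whose confidence is below min_confidence."""
    return " ".join(w.text for w in words if w.confidence >= min_confidence)


lvcsr_output = [
    RecognizedWord("please", 0.82, 0.0),
    RecognizedWord("cancel", 0.35, 0.4),
    RecognizedWord("my", 0.91, 0.9),
    RecognizedWord("account", 0.64, 1.1),
]

plain = canonize(lvcsr_output)                         # simple concatenation
filtered = canonize(lvcsr_output, min_confidence=0.5)  # drop low-confidence words
```

Either output is ordinary text, so all confidence and timing information is lost once canonization has been applied.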

[0029] Text mining system 26 receives input from many different audio segments and stores the information in some format that will be convenient for later processing, a process known as “indexing”. When asked to produce output, typically, though not exclusively, by user query 28, text mining system 26 searches its index and produces output. For example, the output may be the identities of the audio segments that were requested for retrieval, ranked and scored by some relevance metric. The output may also include other information, such as the phrases in the retrieved segments that have proved to be salient terms, because of which the document was given the score that it was given.

[0030] All this information is finally piped into document display system 30 which can use all of it, and add to it the original audio segment(s) 20, to give the user feedback to user query 28 that is as informative and audio-visually appealing as possible.

[0031] A simplified embodiment of a method according to the present invention for spoken document retrieval of low grade audio 32 is illustrated in FIG. 2. A similar system could be used for text generated by a handwritten text recognition system. In the embodiment illustrated in FIG. 2, the traditional text mining system has been replaced with data mining system 34 that is designed especially for speech in low grade audio 32. Data mining system 34 does not require that the output of ASR 36 be canonized into text before it is handled, and can therefore utilize all information available in the output of ASR 36. No information is lost in a canonization process, and words that are indexed can receive different and appropriate handling based on non-textual information, such as their confidence scores. The canonization stage 24 of the process illustrated in FIG. 1 is therefore entirely omitted, and the output of ASR 36 is available with more of its information in subsequent processing stages, including indexing, retrieval and display. Since non-textual information is available, document display system 38 in FIG. 2 displays different information than document display system 30 in FIG. 1.

[0032] Traditionally, speech data mining was confined to high quality audio 20 such as broadcast quality audio. Broadcast audio is of high quality, typically achieving 60-80% word recognition precision, and certainly better than 50%. By contrast, telephony grade data typically has 20-30% word recognition precision, and certainly worse than 50%. Traditional SDR systems do not utilize non-linguistic information during data mining (in all stages where documents are handled together and not separately). Attempts to use traditional text mining methodologies on low-grade audio data have proven inadequate, with the results not good enough for commercial use.

[0033] Data mining system 34 uses data mining targeted specifically at low-grade audio speech data, extracting useful information by ranking and retrieval utilizing non-textual data, which makes data mining of telephony speech quite feasible. In the embodiment illustrated in FIG. 2, data mining 34 includes indexing which differentiates words using non-textual differences between the words. Upon receipt of user query 28, data mining system 34 uses the non-textual differences between the words for retrieval.
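One way retrieval could use confidence scores for ranking is sketched below in Python. The scoring rule is an assumption chosen for illustration, not a formula given in this specification: each confidence is treated as an independent probability that the word was really spoken, and a document is scored by the probability that at least one query term truly occurs. The names relevance and retrieve and the sample corpus are hypothetical.

```python
def relevance(words, query_terms):
    """Score a document as 1 - P(every matching occurrence is a misrecognition),
    treating each confidence as an independent probability (an assumption)."""
    miss = 1.0
    for word, confidence in words:
        if word in query_terms:
            miss *= 1.0 - confidence
    return 1.0 - miss


def retrieve(corpus, query_terms):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    scored = [(doc_id, relevance(words, query_terms))
              for doc_id, words in corpus.items()]
    hits = [item for item in scored if item[1] > 0.0]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


corpus = {
    "a": [("call", 0.5), ("meeting", 0.5)],
    "b": [("call", 0.9)],
    "c": [("hello", 0.9)],
}
results = retrieve(corpus, {"call", "meeting"})
```

With this rule, a single confidently recognized query term can outrank several uncertain ones, a distinction that is unavailable once the transcript has been reduced to plain text.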

[0034] An alternative way to determine non-textual differences between words without using a specially created data mining system is illustrated in FIG. 3. In this scheme, the same segments of audio 32 are canonized several times using different canonization methods 24a, . . . 24n. These methods can differ, for example, by a choice of minimal allowed confidence, any word of confidence lower than that being removed from the transcription in one way or another.

[0035] After applying the N different canonization schemes 24a, . . . 24n, the obtained text is indexed in N different indexes by text mining systems 26a, . . . 26n which may use the same indexing methods or different methods. Although not illustrated in FIG. 3, low grade audio 32 may additionally be processed by ASR 36 to produce output for data mining system(s) 34 that utilize(s) non-textual differences between words.

[0036] When user query 28 is received, text mining systems 26a, . . . 26n search indexes of standard text documents, result(s) are retrieved from each index, and finally all N results are algorithmically merged 42, providing a single output, which is then forwarded to document display system 38. The merged output, unlike the single outputs of the text mining systems, can differentiate results by confidence levels, and display system 38 can use this information.
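The scheme of FIG. 3 can be sketched in Python as follows: the same transcripts are canonized at several confidence thresholds, each canonized version is indexed by an ordinary inverted index that knows nothing about confidence, and the per-index results are merged. The merge rule shown (report the highest threshold at which a document still matches) is one plausible choice and is an assumption, as are the threshold values, the sample data and the function names.

```python
from collections import defaultdict

# doc_id -> list of (word, confidence); hypothetical recognizer output
transcripts = {
    "call-1": [("refund", 0.9), ("please", 0.4)],
    "call-2": [("refund", 0.3), ("today", 0.7)],
    "call-3": [("hello", 0.8)],
}
thresholds = [0.2, 0.5, 0.8]  # one canonization scheme per threshold


def build_index(threshold):
    """Plain inverted index over the text canonized at one threshold."""
    index = defaultdict(set)
    for doc_id, words in transcripts.items():
        for word, confidence in words:
            if confidence >= threshold:
                index[word].add(doc_id)
    return index


indexes = {t: build_index(t) for t in thresholds}


def merged_search(term):
    """Query every index and keep, per document, the highest threshold
    at which it still matches; sort best first."""
    best = {}
    for t in sorted(indexes):
        for doc_id in indexes[t].get(term, ()):
            best[doc_id] = t  # later (higher) thresholds overwrite lower ones
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


results = merged_search("refund")
```

Although no single index stores confidence, the merged result separates the two calls: call-1 matches even under the strictest threshold, while call-2 matches only under the most permissive one.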

[0037] Either of the spoken document retrieval methods illustrated in FIGS. 2 and 3 can be implemented using systems like that illustrated in FIGS. 4 and 5. Such systems can be configured in many different ways, depending on the tasks needed to be performed and the volume of data processed and FIGS. 4 and 5 are just two examples. The invention is not limited to the configurations illustrated in FIGS. 4 and 5 and other configurations are possible and will be apparent to a person of ordinary skill in the art. For example, the functions performed by separate servers in FIGS. 4 and 5 may be performed by separate modules in a single computing system.

[0038] In the configuration illustrated in FIG. 4, a system according to the invention is used interactively offline. Voice data are supplied from voice acquisition module(s) 50 by network 52 and stored in data storage 54. Network 52 may be any known type of network, such as a local area network (LAN), wide area network (WAN), the Internet, etc. The voice data may be in the form of WAV files or any other audio file format.

[0039] In either of the configurations illustrated in FIGS. 4 and 5, the system is accessed by one or more user terminals 56, such as personal computers or other devices that include a user interface which may include a display. In the interactive offline system illustrated in FIG. 4, users log into the system at various times to submit queries to voice oriented information retrieval (VOIR) indexing server 58. Voice data from voice acquisition module(s) 50 are supplied to speech categorization server 60 which, if necessary, converts the data before supplying the voice data to LVCSR(s) 22 and performs load balancing when more than one LVCSR 22 is used.

[0040] LVCSR(s) 22 output words and additional data, such as speaker-change, timing information, confidence scores, etc. In addition, call metadata, such as the time that a call was made and the number dialed, is obtained from voice acquisition module(s) 50 together with the voice data. All these types of data are combined, e.g., by speech categorization server 60 and forwarded, in online mode to speech analysis server 62 and in offline mode to VOIR indexing server 58. Regardless of whether the method illustrated in FIG. 2 or FIG. 3 is implemented, results of a query in offline mode can be displayed on user terminals 56 with at least some of the non-textual differences between n-word sequences indicated. Examples of how the non-textual differences are conveyed to the user will be described below with reference to FIGS. 8 and 9.

[0041] The online configuration illustrated in FIG. 5 may be used when the volume of voice data is too large to allow effective offline processing, or it is desired to send push-technology alerts to people who may want the data. For example, a police inspector may want to be paged when the system detects a phone conversation relevant to her case. In the online configuration illustrated in FIG. 5, the output of LVCSR(s) 22 is supplied via network 52 to speech analysis server 62 which labels the voice data. For example, the voice data may be labeled according to importance, subject matter, person or group that needs to respond, etc. The labeling of the transcribed voice data is combined with the output of LVCSRs 22 and call metadata, and forwarded to categorization queue and workflow manager 64. The users at user terminals 56 are provided this information by categorization queue and workflow manager 64. Using the labeling provided by speech analysis server 62, categorization queue and workflow manager 64 supplies text, voice data and call metadata appropriate for each user, depending on importance, topic, identity of the user, etc.
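The routing role of the categorization queue and workflow manager can be illustrated with a small Python sketch. Everything in it is hypothetical: the label names, the ROUTES table mapping labels to recipients, the label-confidence threshold and the dictionary layout of a call are assumptions introduced only to show the idea of forwarding or discarding labeled calls.

```python
from collections import defaultdict

# Hypothetical mapping from a label to the person or group that should respond
ROUTES = {"fraud": "inspector", "billing": "billing-team"}


def route(calls, min_label_confidence=0.5):
    """Place each labeled call in the queue of every recipient whose label
    was assigned with sufficient confidence; most confident calls first."""
    queues = defaultdict(list)
    for call in calls:
        for label, confidence in call["labels"].items():
            recipient = ROUTES.get(label)
            if recipient is not None and confidence >= min_label_confidence:
                queues[recipient].append((confidence, call["id"]))
    return {
        recipient: [call_id for _, call_id in
                    sorted(items, key=lambda item: (-item[0], item[1]))]
        for recipient, items in queues.items()
    }


calls = [
    {"id": "call-1", "labels": {"fraud": 0.9}},
    {"id": "call-2", "labels": {"fraud": 0.6, "billing": 0.3}},
    {"id": "call-3", "labels": {"billing": 0.8}},
]
queues = route(calls)
```

In this sketch a call whose label confidence falls below the threshold is simply not forwarded for that label, which corresponds to filtering in the terminology defined earlier.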

[0042] Training of speech analysis server 62 may be accomplished by offline processing using VOIR indexing server 58 in an implementation that includes both servers 58 and 62. One or more users label calls by importance, subject matter, relevant person or group, etc. The labels assigned by users can be provided to speech analysis server 62 as training data to recognize similar calls during online processing of calls in a call center, for example. In addition, training may continue during online processing as users correct the labeling provided by speech analysis server 62. When all processing is offline, VOIR indexing server 58 is trained in a similar manner.

[0043] In a typical implementation of the invention, a single LVCSR 22 pass is sufficient for each call. If the method described above with reference to FIG. 2 is implemented, LVCSR 22 supplies metadata, including confidence scores, associated with recognized words to VOIR indexing server 58 which generates an index that indicates at least some of the non-textual differences between n-word sequences. If the method described above with reference to FIG. 3 is implemented, VOIR indexing server 58 maintains an index for each canonization system 24. In either case, the index(es) and the voice data (preferably compressed to minimize space requirements) or other data from which indexed text is obtained (such as handwritten documents) are preferably stored in data storage 54.

[0044] In the preferred embodiment, if the method illustrated in FIG. 2 is implemented, data in data storage 54 is indexed by use of at least one confidence sensitive inverted index. A confidence sensitive inverted index maps from terms to a sorted linked list identifying all documents where each term occurs and from each appearance of a document in this list to a sorted linked list identifying all positions in which the term appears and the confidence level of its recognition. In addition (or alternatively), indexed data may include aggregated information relating to confidence.
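The two-level mapping of a confidence sensitive inverted index can be sketched in Python as follows. This is a minimal in-memory illustration, not the data layout of the preferred embodiment: sorted Python lists stand in for the sorted linked lists, and the function name, the (word, confidence) input layout and the sample documents are assumptions.

```python
from collections import defaultdict


def build_confidence_index(documents):
    """Build term -> [(doc_id, [(position, confidence), ...]), ...].

    documents: doc_id -> list of (word, confidence) in spoken order.
    Both levels are sorted: documents by id, occurrences by position.
    """
    raw = defaultdict(lambda: defaultdict(list))
    for doc_id, words in documents.items():
        for position, (word, confidence) in enumerate(words):
            raw[word][doc_id].append((position, confidence))
    return {
        term: [(doc_id, sorted(postings[doc_id])) for doc_id in sorted(postings)]
        for term, postings in raw.items()
    }


documents = {
    "call-2": [("billing", 0.4), ("problem", 0.7), ("billing", 0.9)],
    "call-1": [("hello", 0.8), ("billing", 0.6)],
}
index = build_confidence_index(documents)
billing_postings = index["billing"]
```

Because every posting keeps its own confidence, a query can weigh the uncertain first occurrence of "billing" in call-2 differently from the confident second one instead of treating both as equal matches.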

[0045] An example is illustrated in FIG. 6 of a confidence sensitive inverted index 65 and a regular (forward) index 66 containing confidence information with the two indexes 65, 66 referencing each other. In the mapping 67 from terms in the documents to a sorted linked list of documents 68, each appearance 69 of a term in the document can carry additional data, such as its position in the document, its timing information, recognition score of that appearance of the term, etc. Also, an expected number of real occurrences of the term (e.g., term i which points to mapping 67i) in the indexed document (e.g., 68a) can be calculated based on the individual recognition scores of the occurrences.
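The specification does not give a formula for the expected number of real occurrences. One simple reading, shown here as an assumption, is that each recognition score p_k is a calibrated probability that occurrence k is correct, in which case the expected count is the sum of the p_k. A minimal Python sketch with a hypothetical function name and sample scores:

```python
import math


def expected_occurrences(scores):
    """Expected number of real occurrences of a term in one document,
    assuming each score is the probability that the occurrence is correct."""
    return math.fsum(scores)


# Recognition scores of three appearances of one term in one document
scores = [0.9, 0.4, 0.2]
expected = expected_occurrences(scores)
```

Under this assumption, three uncertain appearances contribute about 1.5 expected occurrences, so an aggregate of this kind can be stored per term and document alongside the individual postings.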

[0046] Another example of aggregated information relating to confidence information that can be saved is the strength of association between every document and each category. This information can be saved either in a regular (forward) index, like index 66, another inverted index (not shown), or both. Information not relating to confidence, such as call metadata, can also be indexed, either in another inverted index, in a forward index like index 66, or both. In both cases, if an inverted index is used, confidence sensitive inverted index 65 or a separate index can be used. Furthermore, additional mapping technologies, in addition to or instead of a mapping into a sorted linked list, can also be used. Data storage 54 can also store information other than the indexes, such as the data that is being indexed. This data may include, among others, call audio, voice data and call metadata, and may include additional indexes used to refer to the same data.

[0047] A more detailed flow of processing through the system illustrated in FIGS. 4 and 5 is provided in FIG. 7. Online processing flow corresponding to the configuration illustrated in FIG. 5 is illustrated in FIG. 7 by solid lines, while offline processing flow corresponding to the configuration illustrated in FIG. 4 is illustrated by dash-dot lines. Low quality source data 32, such as recorded telephone conversations, supplied by voice acquisition module(s) 50, undergo text extraction 74 in LVCSR(s) 22 controlled by speech categorization server 60. In offline mode, the results 76, which may include text, confidence scores, timing information and text alternative lattice information (potentially, other information, as well), undergo indexing 78 in VOIR indexing server 58 and are stored in data storage 54. In the online mode, results 76 are supplied to speech analysis server 62 which may perform labeling 80 of the calls, as described above. Data from VOIR indexing server 58 are used for category training 82, so that the categorization 84 can later be used in either online or offline mode.

[0048] One embodiment of the system illustrated in FIG. 4 is used to process recorded telephone conversations at a call center by automatically generating transcriptions of the conversations. In this embodiment, offline ad-hoc querying 86 (FIG. 7) utilizes categorization 84 or rule-based keyword spotting 88 to obtain information 90 related to at least one of the documents, including at least some of the non-textual differences between n-word sequences that may be displayed on user terminal(s) 56 in the format illustrated in FIG. 8 or 9. The display illustrated in FIG. 8 provides an example of user input keywords 102 “call OR meeting” that have been found in 170 documents, eight of which are displayed on screen in FIG. 8. Preferably, the documents may be listed in a table 104 in an order based in part on the confidence of accuracy of the keywords displayed in the list. In the example illustrated in FIGS. 8 and 9, table 104 includes call metadata, such as start time.

[0049] In the example illustrated in FIG. 8, a waveform 106 of a portion of the seventh document (indicated as selected by shading in table 104) is displayed in the lower portion of the screen with indications of when the keywords were detected. Below the waveform is the text 108 recognized by LVCSR(s) 22. Preferably, text 108 indicates the recognition confidence of the words and the salient terms listed in the query using one or more of highlighting, underlining, color or shade, size and style of fonts. Also shown in the example illustrated in FIGS. 8 and 9 are labels of the conversation, such as “Technical” and “Incomplete” which follow the “Categories” 116 and appear in the column under “Contact Related To” in table 104, along with similar category information. Confidence of these labels is also indicated.
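Displaying recognition confidence with more than two levels can be sketched in Python as follows. Plain-text markers stand in for the highlighting, color and font styles of the actual display; the three bands, their cut-off values, the marker strings and the function names are all assumptions chosen for illustration.

```python
def band(confidence):
    """Map a confidence score to one of three display bands (assumed cut-offs)."""
    if confidence >= 0.7:
        return "high"
    if confidence >= 0.4:
        return "medium"
    return "low"


# Hypothetical plain-text stand-ins for font style, shade and so on
MARKUP = {"high": "{}", "medium": "({})", "low": "[{}?]"}


def render(words, salient=()):
    """Render (text, confidence) pairs; salient query terms get asterisks."""
    parts = []
    for text, confidence in words:
        shown = MARKUP[band(confidence)].format(text)
        if text in salient:
            shown = "*" + shown + "*"
        parts.append(shown)
    return " ".join(parts)


line = render([("please", 0.9), ("call", 0.5), ("me", 0.2)], salient={"call"})
```

Using three bands rather than a single accept or reject threshold lets the reader see at a glance which words are reliable, which are doubtful and which are probably wrong.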

[0050] In one embodiment of the invention, a user may listen to the entire recording by using a pointing device, such as a computer mouse, to select a row in table 104 corresponding to the recording or can hear just the segments of audio corresponding to transcribed salient terms by selecting the speaker icon under the word “Play” on the row. Once a row has been selected, a user may select one of the words, such as “call” in a user-selectable speech bubble 110 associated with the waveform, or in the adjoining text, to skip directly to the point in a conversation where the word was said. A pointer 112 below the waveform 106 indicates what sound is being played back to the user and a vertical cursor 114 indicates what word was recognized for the associated sound.

[0051] Preferably, user terminal(s) 56 can also be used to graphically display results, e.g., content information 90, indicating parameters extracted from the documents. Examples illustrated in FIG. 9 are bar graphs 118, 120. In FIG. 9, left bar graph 118 shows the number of calls matching a query based on call date, while right graph 120 shows the relations of several categories to the user query.

[0052] The present invention has been described with respect to embodiments using text documents generated from telephone calls. However, as noted above, the invention is not limited to texts generated in this manner and can also be applied to text obtained in other ways, such as from fact extraction systems. Furthermore, the present invention can be used with any system for processing documents that derive from at least one of spontaneous and conversational expression which outputs non-deterministic text with average word recognition precision below 50 percent.

[0053] The many features and advantages of the invention are apparent from the detailed specification and, thus, it is intended by the appended claims to cover all such features and advantages of the invention that fall within the true spirit and scope of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation illustrated and described, and accordingly all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.