SEMIAUTOMATED RELAY METHOD AND APPARATUS

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of US patent application No. /729,069 which was filed on October 10, 2017, and which is titled "SEMIAUTOMATED RELAY METHOD AND APPARATUS", which is a continuation-in-part of US patent application No. 15/171,720, filed on June 2, 2017, and titled "SEMIAUTOMATED RELAY METHOD AND APPARATUS", which is a continuation-in-part of US patent application No. 14/953,631, filed on November 30, 2015, and titled "SEMIAUTOMATED RELAY METHOD AND APPARATUS", which is a continuation-in-part of US patent application No. 14/632,257, filed on February 26, 2015, and titled "SEMIAUTOMATED RELAY METHOD AND APPARATUS", which claims priority to US provisional patent application serial No. 61/946,072, filed on February 28, 2014, and titled "SEMIAUTOMATED RELAY METHOD AND APPARATUS", and claims priority to each of the above applications, each of which is incorporated herein in its entirety by reference.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

Not applicable.
BACKGROUND OF THE DISCLOSURE

The present invention relates to relay systems for providing voice-to-text captioning for hearing impaired users and more specifically to a relay system that uses automated voice-to-text captioning software to transcribe voice to text.
Many people have at least some degree of hearing loss. For instance, in the United States, about 3 out of every 1000 people are functionally deaf and about 17 percent (36 million) of American adults report some degree of hearing loss, which typically gets worse as people age. Many people with hearing loss have developed ways to cope with the ways their loss affects their ability to communicate. For instance, many deaf people have learned to use their sight to compensate for hearing loss by either communicating via sign language or by reading another person's lips as they speak.
When it comes to remotely communicating using a telephone, unfortunately, there is no way for a hearing impaired person (e.g., an assisted user (AU)) to use sight to compensate for hearing loss as conventional telephones do not enable an AU to see a person on the other end of the line (e.g., no lip reading or sign viewing). For persons with only partial hearing impairment, some simply turn up the volume on their telephones to try to compensate for their loss and can make do in most cases. For others with more severe hearing loss, conventional telephones cannot compensate for their loss and telephone communication is a poor option.
An industry has evolved for providing communication services to AUs whereby voice communications from a person linked to an AU's communication device are transcribed into text and displayed on an electronic display screen for the AU to read during a communication session. In many cases the AU's device will also broadcast the linked person's voice substantially simultaneously as the text is displayed so that an AU that has some ability to hear can use their hearing sense to discern most phrases and can refer to the text when some part of a communication is not understandable from what was heard.
US patent No. 6,603,835 (hereinafter "the '835 patent") titled "System For Text Assisted Telephony" teaches several different types of relay systems for providing text captioning services to AUs. One captioning service type is referred to as a single line system where a relay is linked between an AU's device and a telephone used by the person communicating with the AU. Hereinafter, unless indicated otherwise, the other person communicating with the AU will be referred to as a hearing user (HU) even though the AU may in fact be communicating with another AU. In single line systems, one line links an HU device to the relay and one line (e.g., the single line) links the relay to the AU device. Voice from the HU is presented to a relay call assistant (CA) who transcribes the voice to text and then the text is transmitted to the AU device to be displayed. The HU's voice is also, in at least some cases, carried or passed through the relay to the AU device to be broadcast to the AU.
The other captioning service type described in the '835 patent is a two line system. In a two line system a HU's telephone is directly linked to an AU's device via a first line for voice communications between the AU and the HU. When captioning is required, the AU can select a captioning control button on the AU device to link to the relay and provide the HU's voice to the relay on a second line. Again, a relay CA listens to the HU voice message and transcribes the voice message into text which is transmitted back to the AU device on the second line to be displayed to the AU. One of the primary advantages of the two line system over one line systems is that the AU can add captioning to an on-going call. This is important as many AUs are only partially impaired and may only want captioning when absolutely necessary. The option to not have captioning is also important in cases where an AU device can be used as a normal telephone and where non-AUs (e.g., a spouse living with an AU that has good hearing capability) that do not need captioning may also use the AU device.
With any relay system, the primary factors for determining the value of the system are accuracy, speed and cost to provide the service. Regarding accuracy, text should accurately represent spoken messages from HUs so that an AU reading the text has an accurate understanding of the meaning of the message. Erroneous words provide inaccurate messages and also can cause confusion for an AU reading transcribed text.
Regarding speed, ideally text is presented to an AU simultaneously with the voice message corresponding to the text so that an AU sees text associated with a message as the message is heard. In this regard, text that trails a voice message by several seconds can cause confusion. Current systems present captioned text relatively quickly (e.g., 1-3 seconds after the voice message is broadcast) most of the time. However, at times a CA can fall behind when captioning so that longer delays (e.g., 10-15 seconds) occur.
Regarding cost, existing systems require a unique and highly trained CA for each communication session. In known cases CAs need to be able to speak clearly and need to be able to type quickly and accurately. CA jobs are also relatively high pressure jobs and therefore turnover is relatively high when compared to jobs in many other industries, which further increases the costs associated with operating a relay.
One innovation that has increased captioning speed appreciably and that has reduced the costs associated with captioning at least somewhat has been the use of voice-to-text transcription software by relay CAs. In this regard, early relay systems required CAs to type all of the text presented via an AU device. To present text as quickly as possible after broadcast of an associated voice message, highly skilled typists were required. During normal conversations people routinely speak at a rate between 110 and 150 words per minute. During a conversation between an AU and an HU, typically only about half the words voiced have to be transcribed (e.g., the AU typically communicates to the HU during half of a session). Because of various inefficiencies this means that to keep up with transcribing the HU's portion of a typical conversation a CA has to be able to type at around 100 words per minute or more. To this end, most professional typists type at around 50 to 80 words per minute and therefore can keep up with a normal conversation for at least some time. Professional typists are relatively expensive. In addition, despite being able to keep up with a conversation most of the time, at other times (e.g., during long conversations or during particularly high speed conversations) even professional typists fall behind transcribing real time text and more substantial delays can occur.
In relay systems that use voice-to-text transcription software trained to a CA's voice, a CA listens to an HU's voice and revoices the HU's voice message to a computer running the trained software. The software, being trained to the CA's voice, transcribes the re-voiced message much more quickly than a typist can type text and with only minimal errors. In many respects revoicing techniques for generating text are easier and much faster to learn than high speed typing and therefore training costs and the general costs associated with CAs are reduced appreciably. In addition, because revoicing is much faster than typing in most cases, voice-to-text transcription can be expedited appreciably using revoicing techniques.
At least some prior systems have contemplated further reducing costs associated with relay services by replacing CAs with computers running voice-to-text software to automatically convert HU voice messages to text. In the past there have been several problems with this solution which have resulted in no one implementing a workable system. First, most voice messages (e.g., an HU's voice message) delivered over most telephone lines to a relay are not suitable for direct transcription by voice-to-text software. In this regard, automated transcription software on the market has been tuned to work well with a voice signal that includes a much larger spectrum of frequencies than the range used in typical phone communications. The frequency range of voice signals on phone lines is typically between 300 and 3000 Hz. Thus, automated transcription software does not work well with voice signals delivered over a telephone line and large numbers of errors occur. Accuracy further suffers where noise exists on a telephone line, which is a common occurrence.
Second, many automated transcription software programs have to be trained to the voice of a speaker to be accurate. When a new HU calls an AU's device, there is no way for a relay to have previously trained software to the HU voice and therefore the software cannot accurately generate text using the HU voice messages.
Third, many automated transcription software packages use context in order to generate text from a voice message. To this end, the words around each word in a voice message can be used by software as context for determining which word has been uttered. To use words around a first word to identify the first word, the words around the first word have to be obtained. For this reason, many automated transcription systems wait to present transcribed text until after subsequent words in a voice message have been transcribed so that context can be used to correct prior words before presentation. Systems that hold off on presenting text to correct it using subsequent context cause delay in text presentation which is inconsistent with the relay system's need for real time or close to real time text delivery.
BRIEF SUMMARY OF THE DISCLOSURE

It has been recognized that a hybrid semi-automated system can be provided where, when acceptable accuracy can be achieved using automated transcription software, the system can automatically use the transcription software to transcribe HU voice messages to text and, when accuracy is unacceptable, the system can patch in a human CA to transcribe voice messages to text. Here, it is believed that the number of CAs required at a large relay facility may be reduced appreciably (e.g., 30% or more) where software can accomplish a large portion of transcription to text. In this regard, not only is the automated transcription software getting better over time, in at least some cases the software may train to an HU's voice and the vagaries associated with voice messages received over a phone line (e.g., the limited 300 to 3000 Hz range) during a first portion of a call so that during a later portion of the call accuracy is particularly good. Training may occur while and in parallel with a CA manually (e.g., via typing, revoicing, etc.) transcribing voice to text and, once accuracy is at an acceptable threshold level, the system may automatically delink from the CA and use the text generated by the software to drive the AU display device.
It has been recognized that in a relay system there are at least two processors that may be capable of performing automated voice recognition processes and therefore that can handle the automated voice recognition part of a triage process involving a CA. To this end, in most cases either a relay processor or an AU's device processor may be able to perform the automated transcription portion of a hybrid process. For instance, in some cases an AU's device will perform automated transcription in parallel with a relay assistant generating CA generated text where the relay and AU's device cooperate to provide text and assess when the CA should be cut out of a call with the automated text replacing the CA generated text.
In other cases where a HU's communication device is a computer or includes a processor capable of transcribing voice messages to text, a HU's device may generate automated text in parallel with a CA generating text and the HU's device and the relay may cooperate to provide text and determine when the CA should be cut out of the call.
Regardless of which device is performing automated captioning, the CA generated text may be used to assess accuracy of the automated text for the purpose of determining when the CA should be cut out of the call. In addition, regardless of which device is performing automated text captioning, the CA generated text may be used to train the automated voice-to-text software or engine on the fly to expedite the process of increasing accuracy until the CA can be cut out of the call.
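By way of a non-limiting illustration, the accuracy assessment described above may be implemented as a simple word alignment between the CA generated text and the automated text. The following Python sketch is illustrative only; the function name asr_accuracy and the threshold value are examples and not part of the disclosure.

    from difflib import SequenceMatcher

    def asr_accuracy(ca_text, asr_text):
        """Fraction of CA (reference) words matched by the ASR generated text."""
        ref = ca_text.lower().split()
        hyp = asr_text.lower().split()
        if not ref:
            return 1.0
        matched = sum(block.size for block in
                      SequenceMatcher(None, ref, hyp).get_matching_blocks())
        return matched / len(ref)

    # Example: the relay might cut the CA out of the call once the measured
    # accuracy stays above a threshold such as 0.96 for a sustained period.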
It has also been recognized that there are times when a hearing impaired person is listening to a HU's voice without an AU's device providing simultaneous text when the AU is confused and would like transcription of recent voice messages of the HU. For instance, where an AU uses an AU's device to carry on a non-captioned call and the AU has difficulty understanding a voice message so that the AU initiates a captioning service to obtain text for subsequent voice messages. Here, while text is provided for subsequent messages, the AU still cannot obtain an understanding of the voice message that prompted initiation of captioning. As another instance, where CA generated text lags appreciably behind a current HU's voice message, an AU may request that the captioning catch up to the current message.
To provide captioning of recent voice messages in these cases, in at least some embodiments of this disclosure an AU's device stores an HU's voice messages and, when captioning is initiated or a catch up request is received, the recorded voice messages are used to either automatically generate text or to have a CA generate text corresponding to the recorded voice messages.
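One possible arrangement for this stored-voice feature, sketched below in Python for illustration only, keeps a rolling buffer of recent HU audio chunks on the AU's device; the buffer length, chunk size and the send_for_captioning callback are assumptions made for the example rather than requirements of the disclosure.

    import collections

    class RecentVoiceBuffer:
        def __init__(self, max_seconds=20, chunk_seconds=0.25):
            self.chunks = collections.deque(maxlen=int(max_seconds / chunk_seconds))

        def add_chunk(self, pcm_bytes):
            # called for each audio chunk as the HU voice is broadcast to the AU
            self.chunks.append(pcm_bytes)

        def on_caption_request(self, send_for_captioning):
            # forward the buffered audio, oldest first, when captioning or a
            # catch up request is initiated; live audio then streams as usual
            for pcm in list(self.chunks):
                send_for_captioning(pcm)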
In at least some cases when automated software is trained to a HU's voice, a voice model for the HU that can be used subsequently to tune automated software to transcribe the HU's voice may be stored along with a voice profile for the HU that can be used to distinguish the HU's voice from other HUs. Thereafter, when the HU calls an AU's device again, the profile can be used to identify the HU and the voice model can be used to tune the software so that the automated software can immediately start generating highly accurate or at least relatively more accurate text corresponding to the HU's voice messages.
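A minimal sketch of such a voice model store appears below; keying models by a profile identifier and the load_adaptation call are illustrative assumptions of the example, not a description of any particular speaker identification or ASR adaptation product.

    class HUModelStore:
        """Keeps a serialized voice model per HU voice profile."""
        def __init__(self):
            self._models = {}          # profile_id -> serialized voice model

        def save(self, profile_id, voice_model):
            self._models[profile_id] = voice_model

        def lookup(self, profile_id):
            return self._models.get(profile_id)   # None means no prior model exists

    def start_captioning(store, profile_id, asr_engine):
        model = store.lookup(profile_id)
        if model is not None:
            asr_engine.load_adaptation(model)   # hypothetical API: start with the tuned model
        # otherwise the engine trains on the fly while a CA captions the call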
A relay for captioning a hearing user's (HU's) voice signal during a phone call between an HU and a hearing assisted user (AU), the HU using an HU device and the AU using an AU device where the HU voice signal is transmitted from the HU device to the AU device, the relay comprising a display screen, a processor linked to the display and programmed to perform the steps of receiving the HU voice signal from the AU device, transmitting the HU voice signal to a remote automatic speech recognition (ASR) server running ASR software that converts the HU voice signal to ASR generated text, the remote ASR server located at a remote location from the relay, receiving the ASR generated text from the ASR server, presenting the ASR generated text for viewing by a call assistant (CA) via the display and transmitting the ASR generated text to the AU device.
In at least some embodiments the relay further includes an interface that enables a CA to make changes to the ASR generated text presented on the display. In some cases the processor is further programmed to transmit CA corrections made to the ASR generated text to the AU device with instructions to modify the ASR generated text previously sent to the AU device. In some cases the relay separates the HU voice signal into voice signal slices, the step of transmitting the HU voice signal to the ASR server includes independently transmitting the voice signal slices to the remote ASR server for captioning and wherein the step of receiving the ASR generated text from the ASR server includes receiving separate ASR generated text segments for each of the slices and cobbling the separate segments together to form a stream of ASR generated text.
In some cases at least some of the voice signal slices overlap. In some cases at least some of the voice signal slices are relatively short and some of the voice signal slices are relatively long and wherein the short voice signal slices are consecutive and do not overlap and wherein at least some relatively long voice signal slices overlap at least first and second of the relatively short voice signal slices. In some cases at least some of the ASR generated text associated with overlapping voice signal slices is inconsistent, the relay applying a rule set to identify which inconsistent ASR generated text to use in the stream of ASR generated text.
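The rule set for inconsistent text from overlapping slices can take many forms; the Python sketch below assumes, purely for illustration, that each long slice covers exactly two consecutive short slices and that on a disagreement the long-slice text wins because the recognizer had more context.

    def stitch(short_results, long_results):
        """short_results: {(start, end): text} for consecutive short slices.
        long_results: {(start, end): text} for long slices spanning two short slices."""
        out = []
        shorts = sorted(short_results)
        i = 0
        while i < len(shorts):
            a = shorts[i]
            b = shorts[i + 1] if i + 1 < len(shorts) else None
            if b and (a[0], b[1]) in long_results:
                joined = (short_results[a] + " " + short_results[b]).strip()
                long_text = long_results[(a[0], b[1])]
                # rule set: on a conflict, keep the long-slice transcription
                out.append(long_text if joined != long_text else joined)
                i += 2
            else:
                out.append(short_results[a])
                i += 1
        return " ".join(out)

    # stitch({(0, 1): "I red", (1, 2): "the book"}, {(0, 2): "I read the book"})
    # returns "I read the book"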
In some cases the ASR server generates ASR error corrections for the ASR generated text, the relay further programmed to perform the steps of receiving the ASR error corrections, using the error corrections to automatically correct at least some of the errors in the ASR generated text on the display screen and transmitting the ASR error corrections to the AU device. In at least some embodiments the relay further includes an interface that enables a CA to make changes to the ASR generated text presented on the display, the processor further programmed to transmit CA corrections made to the ASR generated text to the AU device with instructions to modify the ASR generated text previously sent to the AU device. In some cases, after a CA makes a change to ASR generated text, the text prior to the change becomes firm so that no subsequent ASR error corrections are applied to that text.
In some cases the relay further includes a speaker and wherein the processor broadcasts the HU voice signal to the CA via the speaker as the ASR generated text is presented on the display screen. In some cases the processor aligns broadcast of the HU voice signal with ASR generated text presented on the display screen. In some cases the processor presents the ASR generated text on the display screen immediately upon reception, transmits the ASR generated text immediately upon reception and broadcasts the HU voice signal under control of the CA using an interface. In some cases, as a word in the HU voice signal is broadcast to the CA, text corresponding to the broadcast word on the display screen is visually distinguished from other text on the display screen.
Other embodiments include a relay for captioning a hearing user's (HU's) voice signal during a phone call between an HU and a hearing assisted user (AU), the HU using an HU device and the AU using an AU device where the HU voice signal is transmitted from the HU device to the AU device, the relay comprising a display screen, an interface device, and a processor linked to the display screen and the interface device, the processor programmed to perform the steps of receiving the HU voice signal from the AU device, separating the HU voice signal into voice signal slices, separately transmitting the HU voice signal slices to a remote automatic speech recognition (ASR) server that is located at a remote location from the relay, receiving separate ASR generated text segments for each of the slices and cobbling the separate segments together to form a stream of ASR generated text, presenting the stream of ASR generated text as it is received from the ASR server for viewing by a call assistant (CA) via the display and transmitting the stream of ASR generated text to the AU device as the stream is received from the ASR server.
In some cases ASR error corrections to the ASR generated text are received from the ASR server and at least some of the ASR error corrections are used to correct the text on the display, the relay receives CA error corrections to the text on the display and uses those corrections to correct text on the display. In some cases, once a CA corrects an error in the text on the display, ASR error corrections for text prior to the CA corrected text on the display are not used to make error corrections on the display. In some cases all ASR generated text presented on the display is transmitted to the AU device and all ASR error corrections and CA text corrections that are presented on the display are transmitted as correction text to the AU device.
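The firming behavior described above can be modeled as a simple buffer that records the position of the most recent CA correction; the word-index addressing in the following sketch is an assumption made for the example.

    class CaptionBuffer:
        def __init__(self):
            self.words = []
            self.firm_through = -1        # index of the last firm word

        def append_asr(self, new_words):
            self.words.extend(new_words)

        def ca_correction(self, index, word):
            self.words[index] = word
            self.firm_through = max(self.firm_through, index)   # text up to here is now firm

        def asr_correction(self, index, word):
            if index > self.firm_through:   # ASR corrections to firm text are ignored
                self.words[index] = word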
Some embodiments include a caption device for use by a hard of hearing assisted user (AU) to assist the AU during voice communications with a hearing user (HU) using an HU device, the caption device comprising a display screen, a memory, at least one communication link element for linking to a communication network, a speaker, and a processor linked to each of the display screen, the memory, the speaker and the communication link, the processor programmed to perform the steps of receiving an HU voice signal from the HU device during a call, broadcasting the HU voice signal to the AU via the speaker, storing at least a most recent portion of the HU voice signal in the memory, receiving a command from the AU to start a captioning session and, upon receiving the command, obtaining a text caption corresponding to the stored HU voice signal and presenting the text caption to the AU via the display.
In some cases the step of obtaining a text caption includes initiating a process whereby an automated speech recognition (ASR) program converts the stored HU voice signal to text. In some cases the processor runs the ASR program. In some cases the step of initiating the process includes establishing a link to a remote relay and transmitting the stored HU voice signal to the relay, the step of obtaining further including receiving the text caption from the relay. In at least some embodiments the processor is further programmed to, subsequent to receiving the command, obtain text captions for additional HU voice signals received during the ongoing call. In some cases the step of obtaining a text caption of the stored HU voice signal includes initiating a process whereby the HU voice signal is converted to text via an automatic speech recognition (ASR) engine and wherein the step of obtaining text captions for additional HU voice signals received during the ongoing call further includes transmitting the additional HU voice signals to a relay and receiving text captions back from the relay.
To the accomplishment of the foregoing and related ends, the disclosure, then, comprises the features hereinafter fully described. The following description and the annexed drawings set forth in detail certain illustrative aspects of the disclosure. However, these aspects are indicative of but a few of the various ways in which the principles of the invention can be employed. Other aspects, advantages and novel features of the disclosure will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Fig. 1 is a schematic showing various components of a communication system including a relay that may be used to perform various processes and methods according to at least some aspects of the present invention;
Fig. 2 is a schematic of the relay server shown in Fig. 1;
Fig. 3 is a flow chart showing a process whereby an automated voice-to-text engine is used to generate automated text in parallel with a CA generating text where the automated text is used instead of CA generated text to provide captioning to an AU's device once an accuracy threshold has been exceeded;
Fig. 4 is a sub-process that may be substituted for a portion of the process shown in Fig. 3 whereby a call assistant can determine whether or not the automated text takes over the process after the accuracy threshold has been achieved;
Fig. 5 is a sub-process that may be added to the process shown in Fig. 3 wherein, upon an AU's requesting help, a call is linked to a second CA for correcting the automated text;
Fig. 6 is a process whereby an automated voice-to-text engine is used to fill in text for a HU's voice messages that are skipped over by a CA when an AU requests instantaneous captioning of a current message;
Fig. 7 is a process whereby automated text is automatically used to fill in captioning when transcription by a CA lags behind a HU's voice messages by a threshold duration;
Fig. 8 is a flow chart illustrating a process whereby text is generated for a HU's voice messages that precede a request for captioning services;
Fig. 9 is a flow chart illustrating a process whereby voice messages prior to a request for captioning service are automatically transcribed to text by an automated voice-to-text engine;
Fig. 10 is a flow chart illustrating a process whereby an AU's device processor performs transcription processes until a request for captioning is received at which point the AU's device presents text related to HU voice messages prior to the request and ongoing voice messages are transcribed via a relay;
Fig. 11 is a flow chart illustrating a process whereby an AU's device processor generates automated text for a hearing user's voice messages which is presented via a display to an AU and also transmits the text to a CA at a relay for correction purposes;
Fig. 12 is a flow chart illustrating a process whereby high definition digital voice messages and analog voice messages are handled differently at a relay;
Fig. 13 is a process similar to Fig. 12, albeit where an AU also has the option to link to a CA for captioning service regardless of the type of voice message received;
Fig. 14 is a flow chart that may be substituted for a portion of the process shown in Fig. 3 whereby voice models and voice profiles are generated for frequent HUs that communicate with an AU where the models and profiles can be subsequently used to increase accuracy of a transcription process;
Fig. 15 is a flow chart illustrating a process similar to the sub-process shown in Fig. 14 where voice profiles and voice models are generated and stored for subsequent use during transcription;
Fig. 16 is a flow chart illustrating a sub-process that may be added to the process shown in Fig. 15 where the resulting process calls for training of a voice model at each of an AU's device and a relay;
Fig. 17 is a schematic illustrating a screen shot that may be presented via an AU's device display screen;
Fig. 18 is similar to Fig. 17, albeit showing a different screen shot;
Fig. 19 is a process that may be performed by the system shown in Fig. 1 where automated text is generated for line check words and is presented to an AU immediately upon identification of the words;
Fig. 20 is similar to Fig. 17, albeit showing a different screen shot;
Fig. 21 is a flow chart illustrating a method whereby an automated voice-to-text engine is used to identify errors in CA generated text which can be highlighted and can be corrected by a CA;
Fig. 22 is an exemplary AU device display screen shot that illustrates visually distinct text to indicate non-textual characteristics of an HU voice signal to an AU;
Fig. 23 is an exemplary CA workstation display screen shot that shows how automated ASR text associated with an instantaneously broadcast word may be visually distinguished for an error correcting CA;
Fig. 23A is a screen shot of a CA interface providing an option to switch from ASR generated text to a full CA system where a CA generates caption text;
Fig. 24 shows an exemplary HU communication device with CA captioned HU text and ASR generated AU text presented as well as other communication information that is consistent with at least some aspects of the present disclosure;
Fig. 25 is an exemplary CA workstation display screen shot similar to Fig. 23, albeit where a CA has corrected an error and an HU voice signal playback has been skipped backward as a function of where the correction occurred;
Fig. 26 is a screen shot of an exemplary AU device display that presents CA captioned HU text as well as ASR engine generated AU text;
Fig. 27 is an illustration of an exemplary HU device that shows text corresponding to the HU's voice signal as well as an indication of which word in the text has been most recently presented to an AU;
Fig. 28 is a schematic diagram showing a relay captioning system that is consistent with at least some aspects of the present disclosure;
Fig. 29 is a schematic diagram of a relay system that includes a text transcription quality assessment function that is consistent with at least some aspects of the present disclosure;
Fig. 30 is similar to Fig. 29, albeit showing a different relay system that includes a different quality assessment function;
Fig. 31 is similar to Fig. 29, albeit showing a third relay system that includes a third quality assessment function;
Fig. 32 is a flow chart illustrating a method whereby time stamps are assigned to HU voice segments which are then used to substantially synchronize text and voice presentation;
Fig. 33 is a schematic illustrating a caption relay system that may implement the method illustrated in Fig. 32 as well as other methods described herein;
Fig. 34 is a sub-process that may be substituted for a portion of the Fig. 32 process where an AU device assigns a sequence of time stamps to a sequence of text segments;
Fig. 35 is another flow chart illustrating another method for assigning and using time stamps to synchronize text and HU voice broadcast;
Fig. 36 is a screen shot illustrating a CA interface where a prior word is selected to be rebroadcast;
Fig. 37 is a screen shot similar to Fig. 36, albeit of an AU device display showing an AU selecting a prior broadcast phrase for rebroadcast;
Fig. 38 is another sub-process that may be substituted for a portion of the Fig. 32 method;
Fig. 39 is a screen shot showing a CA interface where various inventive features are shown;
Fig. 40 is a screen shot illustrating another CA interface where low and high confidence text is presented in different columns to help a CA more easily distinguish between text likely to need correction and text that is less likely to need correction;
Fig. 40A is a screen shot of a CA interface showing low confidence caption text visually distinguished from other text presented to a CA for correction consideration, among other things;
Fig. 41 is a flow chart illustrating a method of introducing errors in ASR generated text to test CA attention;
Fig. 42 is a screen shot illustrating an AU interface including, in addition to text presentation, an HU video field and a CA signing field that is consistent with at least some aspects of the present disclosure;
Fig. 43 is a screen shot illustrating yet another CA interface;
Fig. 44 is another AU interface screen shot including scrolling text and an HU video window;
Fig. 45 is another CA interface screen shot showing a CA correction field, an ASR uncorrected text field and an intervening time field that is consistent with at least some aspects of the present disclosure;
Fig. 46 is a schematic illustrating different phrase slices that may be formed that is consistent with at least some aspects of the present disclosure;
Fig. 47 is a screen shot illustrating an interface presented to a CA that includes various transcription feedback tools that are consistent with various aspects of the present disclosure;
Fig. 48 is a screen shot illustrating an interface presented to an AU that indicates a transition from automated text to CA generated text that is consistent with at least some aspects of the present disclosure;
Fig. 49 is similar to Fig. 48, albeit illustrating an interface that indicates a transition from automated text to CA corrected text that is consistent with at least some aspects of the present disclosure;
Fig. 50 is a screen shot showing a CA interface that, among other things, enables a CA to select specific points in ASR generated text to firm up prior ASR generated text;
Fig. 51 is a screen shot illustrating an administrator's interface that shows results of CA generated text and scoring tools used to assess quality of captions generated by a CA;
Fig. 52 is a screen shot illustrating a CA interface where a CA is restricted to editing text within a small field of recent text to ensure that the CA keeps up with current HU voice utterances within some window of time;
Fig. 53 is similar to Fig. 52, albeit showing the interface at a different point in time;
Fig. 54 is a top plan view of a CA workstation including an eye tracking camera that is consistent with at least some aspects of some embodiments of the present disclosure;
Fig. 55 is a schematic illustrating an exemplary CA screen shot and a camera that tracks a CA's eyes that is consistent with at least some aspects of some embodiments of the present disclosure;
Fig. 56 is a screen shot showing an AU interface where a first error correction is shown distinguished in multiple ways;
Fig. 57 is a screen shot similar to Fig. 56, albeit where the first error correction is shown in a less noticeable way and a second error correction is shown distinguished in multiple ways so that the distinguishing effect related to the first error correction appears to be extinguishing; and
Fig. 58 is similar to Figs. 56 and 57, albeit showing the interface after a third error correction is presented where the first error correction is now shown as normal text, the second is shown distinguished in an extinguishing fashion and the third error correction is fully distinguished.
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the appended claims.
DETAILED DESCRIPTION OF THE DISCLOSURE

The various aspects of the subject disclosure are now described with reference to the annexed drawings, wherein like reference numerals correspond to similar elements throughout the several views. It should be understood, however, that the drawings and detailed description hereafter relating thereto are not intended to limit the claimed subject matter to the particular form disclosed. Rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the claimed subject matter.
As used herein, the terms "component," "system" and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computer and the computer can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers or processors.
The word "exemplary" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs.
Furthermore, the disclosed subject matter may be implemented as a system, method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer or processor based device to implement aspects detailed herein. The term "article of manufacture" (or alternatively, "computer program product") as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, solid state drives and flash memory devices (e.g., card, stick). Additionally it should be appreciated that a carrier wave can be employed to carry computer-readable electronic data such as those used in transmitting and receiving electronic mail or in accessing a network such as the Internet or a local area network (LAN). Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope or spirit of the claimed subject matter.
Unless indicated otherwise, the phrases "assisted user", "hearing user" and "call assistant" will be represented by the acronyms "AU", "HU" and "CA", respectively.
The acronym "ASR" will be used to abbreviate the phrase "automatic speech recognition". Unless indicated otherwise, the phrase "full CA mode" will be used to refer to a call captioning system instantaneously generating captions for at least a portion of a communication session wherein a voice signal is listened to by a live CA (e.g., a person) who transcribes the voice message to text which the CA then corrects, where the CA generated text is presented to at least one of the communicants to the communication session, and the phrase "ASR-CA backed up mode" will be used to refer to a call captioning system instantaneously generating captions for at least a portion of a communication session where a voice signal is fed to an ASR software engine (e.g., a computer running software) that generates at least initial captions for the received voice signal and where a CA corrects the original captions, where the ASR generated captions and in at least some cases the CA generated corrections are presented to at least one of the communicants to the communication session.
System Architecture

Referring now to the drawings wherein like reference numerals correspond to similar elements throughout the several views and, more specifically, referring to Fig. 1, the present disclosure will be described in the context of an exemplary communication system 10 including an AU's communication device 12, an HU's telephone or other type communication device 14, and a relay 16. The AU's device 12 is linked to the HU's device 14 via any network connection capable of facilitating a voice call between the AU and the HU. For instance, the link may be a conventional telephone line, a network connection such as an internet connection or other network connection, a wireless connection, etc. AU device 12 includes a keyboard 20, a display screen 18 and a handset 22. Keyboard 20 can be used to dial any telephone number to initiate a call and, in at least some cases, includes other keys or may be controlled to present virtual buttons via screen 18 for controlling various functions that will be described in greater detail below. Other identifiers such as IP addresses or the like may also be used in at least some cases to initiate a call. Screen 18 includes a flat panel display screen for displaying, among other things, text transcribed from a voice message or signal generated using HU's device 14, control icons or buttons, caption feedback signals, etc.
Handset 22 includes a speaker for broadcasting a HU's voice messages to an AU and a microphone for receiving a voice message from an AU for delivery to the HU's device 14. AU device 12 may also include a second loud speaker so that device 12 can operate as a speaker phone type device. Although not shown, device 12 further includes a processor and a memory for storing software run by the processor to perform various functions that are consistent with at least some aspects of the present disclosure. Device 12 is also linked or is linkable to relay 16 via any communication network including a phone network, a wireless network, the internet or some other similar network, etc. Device 12 may further include a Bluetooth or other type of transmitter for linking to an AU's hearing aid or some other speaker type device.
HU's device 14, in at least some embodiments, includes a communication device (e.g., a telephone) including a keyboard for dialing phone numbers and a handset including a speaker and a microphone for communication with other devices. In other embodiments device 14 may include a computer, a smart phone, a smart tablet, etc., that can facilitate audio communications with other devices. Devices 12 and 14 may use any of several different communication protocols including analog or digital protocols, a VOIP protocol or others.
Referring still to Fig. 1, relay 16 includes, among other things, a relay server and a plurality of CA work stations 32, 34, etc. Each of the CA work stations 32, 34, etc., is similar and operates in a similar fashion and therefore only station 32 is described here in any detail. Station 32 includes a display screen 50, a keyboard 52 and a headphone/microphone headset 54. Screen 50 may be any type of electronic display screen for presenting information including text transcribed from a HU's voice signal or message. In most cases screen 50 will present a graphical user interface with on screen tools for editing text that appears on the screen. One text editing system is described in US patent No. 7,164,753, which issued on January 16, 2007, which is titled "Real Time Transcription Correction System" and which is incorporated herein in its entirety.
Keyboard 52 is a standard text entry QWERTY type keyboard and can be used to type text or to correct text presented on display screen 50. Headset 54 includes a speaker in an ear piece and a microphone in a mouth piece and is worn by a CA. The headset enables a CA to listen to the voice of a HU and the microphone enables the CA to speak voice messages into the relay system such as, for instance, revoiced messages from a HU to be transcribed into text. For instance, typically during a call between a HU on device 14 and an AU on device 12, the HU's voice messages are presented to a CA via headset 54 and the CA revoices the messages into the relay system using headset 54. Software trained to the voice of the CA transcribes the assistant's voice messages into text which is presented on display screen 50. The CA then uses keyboard 52 and/or headset 54 to make corrections to the text on display 50. The corrected text is then transmitted to the AU's device 12 for display on screen 18. In the alternative, the text may be transmitted prior to correction to the AU's device 12 for display and corrections may be subsequently transmitted to correct the displayed text via in-line corrections where errors are replaced by corrected text.
Although not shown, CA work station 32 may also include a foot pedal or other device for controlling the speed with which voice messages are played via headset 54 so that the CA can slow or even stop play of the messages while the assistant either catches up on transcription or correction of text.
Referring still to Fig. 1 and also to Fig. 2, server 30 is a computer system that includes, among other components, at least a first processor 56 linked to a memory or database 58 where software run by processor 56 to facilitate various functions that are consistent with at least some aspects of the present disclosure is stored. The software stored in memory 58 includes pre-trained CA voice-to-text transcription software 60 for each CA where CA specific software is trained to the voice of an associated CA thereby increasing the accuracy of transcription activities. For instance, Naturally Speaking continuous speech recognition software by Dragon, Inc. may be pre-trained to the voice of a specific CA and then used to transcribe voice messages voiced by the CA into text.
In addition to the CA trained software, a voice-to-text software program 62 that is not pre-trained to a CA's voice and instead that trains to any voice on the fly as voice messages are received is stored in memory 58. Again, Naturally Speaking software that can train on the fly may be used for this purpose. Hereinafter, the automatic speech recognition software or system that trains to the HU voices will be referred to generally as an ASR engine at times.
Moreover, software 64 that automatically performs one of several different types of triage processes to generate text from voice messages accurately, quickly and in a relatively cost effective manner is stored in memory 58. The triage programs are described in detail hereafter.
One issue with existing relay systems is that each call is relatively expensive to facilitate. To this end, in order to meet required accuracy standards for text caption calls, each call requires a dedicated CA. While automated voice-to-text systems that would not require a CA have been contemplated, none has been successfully implemented because of accuracy and speed problems.
Basic Semi-Automated System

One aspect of the present disclosure is related to a system that is semi-automated wherein a CA is used when accuracy of an automated system is not at required levels and the assistant is cut out of a call automatically or manually when accuracy of the automated system meets or exceeds accuracy standards or at the preference of an AU. For instance, in at least some cases a CA will be assigned to every new call linked to a relay and the CA will transcribe voice to text as in an existing system. Here, however, the difference will be that, during the call, the voice of a HU will also be processed by server 30 to automatically transcribe the HU's voice messages to text (e.g., into "automated text"). Server 30 compares corrected text generated by the CA to the automated text to identify errors in the automated text. Server 30 uses identified errors to train the automated voice-to-text software to the voice of the HU. During the beginning of the call the software trains to the HU's voice and accuracy increases over time as the software trains. At some point the accuracy increases until required accuracy standards are met. Once accuracy standards are met, server 30 is programmed to automatically cut out the CA and start transmitting the automated text to the AU's device 12.
In at least some cases, when a CA is cut out of a call, the system may provide a "Help" button, an "Assist" button or "Assistance Request" type button (see 68 in Fig. 1) to an AU so that, if the AU recognizes that the automated text has too many errors for some reason, the AU can request a link to a CA to increase transcription accuracy (e.g., generate an assistance request). In some cases the help button may be a persistent mechanical button on the AU's device 12. In the alternative, the help button may be a virtual on screen icon (e.g., see 68 in Fig. 1) and screen 18 may be a touch sensitive screen so that contact with the virtual button can be sensed. Where the help button is virtual, the button may only be presented after the system switches from providing CA generated text to an AU's device to providing automated text to the AU's device to avoid confusion (e.g., avoid a case where an AU is already receiving CA generated text but thinks, because of a help button, that even better accuracy can be achieved in some fashion). Thus, while CA generated text is displayed on an AU's device 12, no "help" button is presented and after automated text is presented, the "help" button is presented. After the help button is selected and a CA is re-linked to the call, the help button is again removed from the AU's device display 18 to avoid confusion.
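The help button visibility rule can be summarized in a few lines; the state dictionary below is a hypothetical stand-in for whatever the AU device firmware actually maintains and is offered only as an illustration.

    def help_button_visible(state):
        # show the help button only while automated text drives the display
        return state["text_source"] == "ASR"

    def on_help_pressed(state):
        if state["text_source"] == "ASR":
            state["text_source"] = "CA"   # a CA is re-linked and CA generated text takes over
            # the button is hidden again on the next screen refresh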
Referring now to Figs. 2 and 3, a method or process 70 is illustrated that may be performed by server 30 to cut out a CA when automated text reaches an accuracy level that meets a standard threshold level. Referring also to Fig. 1, at block 72, help and auto flags are each set to a zero value. The help flag indicates that an AU has selected a help or assist button via the AU's device 12 because of a perception that too many errors are occurring in transcribed text. The auto flag indicates that automated text accuracy has exceeded a standard threshold requirement. Zero values indicate that the help button has not been selected and that the standard requirement has yet to be met and values of one indicate that the button has been selected and that the standard requirement has been met.
Referring still to Figs. 1 and 3, at block 74, during a phone call between a HU using device 14 and an AU using device 12, the HU's voice messages are transmitted to server 30 at relay 16. Upon receiving the HU's voice messages, server 30 checks the auto and help flags at blocks 76 and 84, respectively. At least initially the auto flag will be set to zero at block 76 meaning that automated text has not reached the accuracy standard requirement and therefore control passes down to block 78 where the HU's voice messages are provided to a CA. At block 80, the CA listens to the HU's voice messages and generates text corresponding thereto by either typing the messages, revoicing the messages to voice-to-text transcription software trained to the CA's voice, or a combination of both. Text generated is presented on screen 50 and the CA makes corrections to the text using keyboard 52 and/or headset 54 at block 80. At block 82 the CA generated text is transmitted to AU device 12 to be displayed for the AU on screen 18.
Referring again to Figs. 1 and 3, at block 84, at least initially the help flag will be set to zero indicating that the AU has not requested additional captioning assistance. In fact, at least initially the "help" button 68 may not be presented to an AU as CA generated text is initially presented. Where the help flag is zero at block 84, control passes to block 86 where the HU's voice messages are fed to voice-to-text software run by server 30 that has not been previously trained to any particular voice. At block 88 the software automatically converts the HU's voice to text, generating automated text.
At block 90, server 30 compares the CA generated text to the automated text to identify errors in the automated text. At block 92, server 30 uses the errors to train the voice-to-text software for the HU's voice. In this regard, for instance, where an error is identified, server 30 modifies the software so that the next time the utterance that resulted in the error occurs, the software will generate the word or words that the CA generated for the utterance. Other ways of altering or training the voice-to-text software are well known in the art and any way of training the software may be used at block 92.
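For illustration, the comparison at blocks 90 and 92 can be reduced to a word-level alignment that harvests (recognized, corrected) pairs for training; the difflib alignment below and the notion of feeding the pairs to an adaptation routine are assumptions of the example, not a required training method.

    from difflib import SequenceMatcher

    def training_pairs(asr_words, ca_words):
        """Return (ASR segment, CA segment) pairs where the two texts disagree."""
        pairs = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, asr_words, ca_words).get_opcodes():
            if tag == "replace":          # the engine heard i1:i2, the CA wrote j1:j2
                pairs.append((asr_words[i1:i2], ca_words[j1:j2]))
        return pairs

    # training_pairs("the whether is nice".split(), "the weather is nice".split())
    # -> [(['whether'], ['weather'])], which the relay could feed back to adapt the engine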
After block 92 control passes to block 94 where server 30 monitors for a selection of the "help" button 68 by the AU. If the help button has not been selected, control passes to block 96 where server 30 compares the accuracy of the automated text to a threshold standard accuracy requirement. For instance, the standard requirement may require that accuracy be greater than 96% measured over at least a most recent forty-five second period or a most recent 100 words uttered by a HU, whichever is longer. Where accuracy is below the threshold requirement, control passes back up to block 74 where the process described above continues. At block 96, once the accuracy is greater than the threshold requirement, control passes to block 98 where the auto flag is set to one indicating that the system should start using the automated text and delink the CA from the call to free up the assistant to handle a different call. A virtual "help" button may also be presented via the AU's display 18 at this time. Next, at block 100, the CA is delinked from the call and at block 102 the processor generated automated text is transmitted to the AU device to be presented on display screen 18.
Referring again to block 74, the HU's voice is continually received during a call and at block 76, once the auto flag has been set to one, the lower portion of the left hand loop including blocks 78, 80 and 82 is cut out of the process as control loops back up to block 74.
Referring again to block 94, if, during an automated portion of a call when automated text is being presented to the AU, the AU decides that there are too many errors in the transcription presented via display 18 and the AU selects the "help" button 68 (see again Fig. 1), control passes to block 104 where the help flag is set to one indicating that the AU has requested the assistance of a CA and the auto flag is reset to zero indicating that CA generated text will be used to drive the AU's display 18 instead of the automated text. Thereafter control passes back up to block 74. Again, with the auto flag set to zero the next time through decision block 76, control passes back down to block 78 where the call is again linked to a CA for transcription as described above. In addition, the next time through block 84, because the help flag is set to one, control passes back up to block 74 and the automated text loop including blocks 86 through 104 is effectively cut out of the rest of the call.
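The flag handling of Fig. 3 can be summarized, purely as an illustrative sketch, by the per-utterance routine below; ca_caption, asr_caption and accuracy are hypothetical helpers and the 96% figure is only the example threshold mentioned above.

    def handle_utterance(state, hu_voice):
        if state["auto"] and not state["help"]:
            return asr_caption(hu_voice)          # block 102: automated text drives the display
        text = ca_caption(hu_voice)               # blocks 78-82: CA types or revoices
        if not state["help"]:                     # automated branch still runs in parallel
            auto_text = asr_caption(hu_voice)     # blocks 86-92: engine trains on the HU voice
            if accuracy(text, auto_text) > 0.96:  # block 96: threshold check
                state["auto"] = True              # block 98: delink the CA
        return text

    def on_help_button(state):
        state["help"], state["auto"] = True, False   # block 104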
In at least some embodiments, there will be a short delay (e.g., 5 to 10 seconds in most cases) between setting the flags at block 104 and stopping use of the automated text so that a new CA can be linked up to the call and start generating CA generated text prior to halting the automated text. In these cases, until the CA is linked and generating text for at least a few seconds (e.g., 3 seconds), the automated text will still be used to drive the AU's display 18. The delay may either be a pre-defined delay or may have a case specific duration that is determined by server 30 monitoring CA generated text and switching over to the CA generated text once the CA is up to speed.
In some embodiments, prior to delinking a CA from a call at block 100, server 30 may store a CA identifier along with a call identifier for the call. Thereafter, if an AU requests help at block 94, server 30 may be programmed to identify if the CA previously associated with the call is available (e.g., not handling another call) and, if so, may re-link to that CA at block 78. In this manner, if possible, a CA that has at least some context for the call can be linked up to restart transcription services.
In some embodiments it is contemplated that after an AU has selected a help button to receive call assistance, the call will be completed with a CA on the line. In other cases it is contemplated that server 30 may, when a CA is re-linked to a call, start a second triage process to attempt to delink the CA a second time if a threshold accuracy level is again achieved. For instance, in some cases, midstream during a call, a second HU may start communicating with the AU via the HU's device. For instance, a child may yield the HU's device 14 to a grandchild that has a different voice profile, causing the AU to request help from a CA because of perceived text errors. Here, after the hand back to the CA, server 30 may start training on the grandchild's voice and may eventually achieve the threshold level required. Once the threshold again occurs, the CA may be delinked a second time so that automated text is again fed to the AU's device.
As another example, text errors in automated text may be caused by temporary noise in one or more of the lines carrying the HU's voice messages to relay 16. Here, once the noise clears up, automated text may again be a suitable option.
Thus, here, after an AU requests CA help, the triage process may again commence and if the threshold accuracy level is again exceeded, the CA may be delinked and the automated text may again be used to drive the AU's device 12. While the threshold accuracy level may be the same each time through the triage process, in at least some embodiments the accuracy level may be changed each time through the process. For instance, the first time through the triage process the accuracy threshold may be 96%. The second time through the triage process the accuracy threshold may be raised to 98%.
In at least some embodiments, when the automated text accuracy exceeds the standard accuracy threshold, there may be a short transition time during which a CA on a call observes automated text while listening to a HU's voice message to manually confirm that the handover from CA generated text to automated text is smooth. During this short transition time, for instance, the CA may watch the automated text on her workstation screen 50 and may correct any errors that occur during the transition. In at least some cases, if the CA perceives that the handoff does not work or the quality of the automated text is poor for some reason, the CA may opt to retake control of the transcription process.
One sub-process 120 that may be added to the process shown in Fig. 3 for managing a CA to automated text handoff is illustrated in Fig. 4. Referring also to Figs. 1 and 2, at block 96 in Fig. 3, if the accuracy of the automated text exceeds the accuracy standard threshold level, control may pass to block 122 in Fig. 4. At block 122, a short duration transition timer (e.g., 10-15 seconds) is started. At block 124 automated text (e.g., text generated by feeding the HU's voice messages directly to voice-to-text software) is presented on the CA's display 50. At block 126 an on screen "Retain Control" icon or virtual button is provided to the CA via the assistant's display screen 50 which can be selected by the CA to forego the handoff to the automated voice-to-text software. At block 128, if the "Retain Control" icon is selected, control passes to block 132 where the help flag is set to one and then control passes back up to block 76 in Fig. 3 where the CA process for generating text continues as described above. At block 128, if the CA does not select the "Retain Control" icon, control passes to block 130 where the transition timer is checked. If the transition timer has not timed out, control passes back up to block 124. Once the timer times out at block 130, control passes back to block 98 in Fig. 3 where the auto flag is set to one and the CA is delinked from the call.
In at least some embodiments it is contemplated that after voice-to-text software takes over the transcription task and the CA is delinked from a call, server 30 itself may be programmed to sense when transcription accuracy has degraded substantially and the server 30 may cause a re-link to a CA to increase accuracy of the text transcription. For instance, server 30 may assign a confidence factor to each word in the automated text based on how confident the server is that the word has been accurately transcribed. The confidence factors over a most recent number of words (e.g., 100) or a most recent period (e.g., 45 seconds) may be averaged and the average used to assess an overall confidence factor for transcription accuracy. Where the confidence factor is below a threshold level, server 30 may re-link to a CA to increase transcription accuracy. The automated process for re-linking to a CA may be used instead of or in addition to the process described above whereby an AU selects the "help" button to re-link to a CA.
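One way the rolling confidence check described above might be implemented is sketched here, assuming a 100-word window and a re-link threshold of 90%; the class and method names are illustrative, not part of the disclosure.

    from collections import deque

    class ConfidenceMonitor:
        """Averages per-word ASR confidence over a rolling window and flags when a CA re-link is needed."""

        def __init__(self, window_words=100, relink_threshold=0.90):
            self.window = deque(maxlen=window_words)  # most recent per-word confidence factors
            self.relink_threshold = relink_threshold

        def add_word(self, confidence):
            self.window.append(confidence)

        def needs_ca_relink(self):
            """True when the rolling average confidence has degraded below the threshold."""
            if not self.window:
                return False
            average = sum(self.window) / len(self.window)
            return average < self.relink_threshold

The same monitor could just as easily be keyed to a most recent time period (e.g., 45 seconds) by storing timestamped confidences and evicting old entries.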
In at least some cases when an AU selects a "help" button to re-link to a CA, partial call assistance may be provided instead of full CA service. For instance, instead of adding a CA that transcribes a HU's voice messages and then corrects errors, a CA may be linked only for correction purposes. The idea here is that while software trained to a HU's voice may generate some errors, the number of errors after training will still be relatively small in most cases even if objectionable to an AU. In at least some cases CAs may be trained to have different skill sets where highly skilled and relatively more expensive to retain CAs are trained to re-voice HU voice messages and correct the resulting text and less skilled CAs are trained to simply make corrections to automated text. Here, initially all calls may be routed to highly skilled revoicing or "transcribing" CAs and all re-linked calls may be routed to less skilled "corrector" CAs.
A sub-process 134 that may be added to the process of Fig. 3 for routing re-linked calls to a corrector CA is shown in Fig. 5. Referring also to Figs. 1 and 3, at decision block 94, if an AU selects the help button, control may pass to block 136 in Fig. 5 where the call is linked to a second corrector CA. At block 138 the automated text is presented to the second CA via the CA's display 50. At block 140 the second CA listens to the voice of the HU and observes the automated text and makes corrections to errors perceived in the text. At block 142, server 30 transmits the corrected automated text to the AU's device for display via screen 18. After block 142 control passes back up to block 76 in Fig. 3.
Re-Sync And Fill In Text
In some cases where a CA generates text that drives an AU's display screen 18 (see again Fig. 1), for one reason or another the CA's transcription to text may fall behind the HU's voice message stream by a substantial amount. For instance, where a HU is speaking quickly, is using odd vocabulary, and/or has an unusual accent that is hard to understand, CA transcription may fall behind a voice message stream by 20 seconds, 40 seconds or more.
In many cases when captioning falls behind, an AU can perceive that presented text has fallen far behind broadcast voice messages from a HU based on memory of recently broadcast voice message content and observed text. For instance, an AU may recognize that currently displayed text corresponds to a portion of the broadcast voice message that occurred thirty seconds ago. In other cases some captioning delay indicator may be presented via an AU's device display 18. For instance, see Fig. 17 where captioning delay is indicated in two different ways on a display screen 18. First, text 212 indicates an estimated delay in seconds (e.g., 24 second delay). Second, at the end of already transcribed text 214, blanks 216 for words already voiced but yet to be transcribed may be presented to give an AU a sense of how delayed the captioning process has become.
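Logic along the following lines could drive the two delay indicators described above; the timestamped word list, the one-blank-per-word heuristic and the rendering format are illustrative assumptions.

    def build_delay_indicator(voiced_words, transcribed_count, last_transcribed_time, now):
        """Return an estimated-delay string and placeholder blanks for words voiced but not yet captioned.

        voiced_words          -- list of (timestamp, word) tuples already spoken by the HU
        transcribed_count     -- number of those words already captioned for the AU
        last_transcribed_time -- timestamp of the most recently captioned word
        now                   -- current time in seconds
        """
        pending = voiced_words[transcribed_count:]
        delay_seconds = int(now - last_transcribed_time)
        blanks = " ".join("____" for _ in pending)  # one blank per word yet to be transcribed
        return f"(captioning delayed ~{delay_seconds} seconds)", blanks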
When an AU perceives that captioning is too far behind or when the user cannot understand a recently broadcast voice message, the AU may want the text captioning to skip ahead to the currently broadcast voice message. For instance, if an AU had difficulty hearing the most recent five seconds of a HU's voice message and continues to have difficulty hearing but generally understood the preceding 25 seconds, the AU may want the captioning process to be re-synced with the current HU's voice message so that the AU's understanding of current words is accurate.
Here, however, because the AU could not understand the most recent 5 seconds of broadcast voice message, a re-sync with the current voice message would leave the AU with at least some void in understanding the conversation (e.g., at least the most recent 5 seconds of misunderstood voice message would be lost). To deal with this issue, in at least some embodiments, it is contemplated that server 30 may run automated voice-to-text software on a HU's voice message simultaneously with a CA generating text from the voice message and, when an AU requests a "catch-up" or "re-sync" of the transcription process to the current voice message, server 30 may provide "fill in" automated text corresponding to the portion of the voice message between the most recent CA generated text and the instantaneous voice message which may be provided to the AU's device for display and also, optionally, to the CA's display screen to maintain context for the CA. In this case, while the fill in automated text may have some errors, the fill in text will be better than no text for the associated period and can be referred to by the AU to better understand the voice messages.
In cases where the fill in text is presented on the CA's display screen, the CA may correct any errors in the fill in text. This correction and any error correction by a CA for that matter may be made prior to transmitting text to the AU's device or subsequent thereto. Where corrected text is transmitted to an AU's device subsequent to transmission of the original error prone text, the AU's device corrects the errors by replacing the erroneous text with the corrected text.
Because it is often the case that AUs will request a re-sync only when they have difficulty understanding words, server 30 may only present automated fill in text to an AU corresponding to a pre-defined duration period (e.g., 8 seconds) that precedes the time when the re-sync request occurs. For instance, consistent with the example above where CA captioning falls behind by thirty seconds, an AU may only request re-sync at the end of the most recent five seconds as inability to understand the voice message may only be an issue during those five seconds. By presenting the most recent eight seconds of automated text to the AU, the user will have the chance to read text corresponding to the misunderstood voice message without being inundated with a large segment of automated text to view. Where automated fill in text is provided to an AU for only a pre-defined duration period, the same text may be provided for correction to the CA.
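Selecting only the fill in text that falls within the pre-defined window before a re-sync request could look like the sketch below; the eight second default and the timestamped-word data layout are illustrative assumptions.

    def select_fill_in_text(automated_words, resync_time, window_seconds=8.0):
        """Return the automated words voiced within window_seconds before the re-sync request.

        automated_words -- list of (timestamp, word) tuples produced by the ASR engine
        resync_time     -- time at which the AU pressed the catch-up/re-sync button
        """
        start = resync_time - window_seconds
        return [word for (ts, word) in automated_words if start <= ts <= resync_time]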
Referring now to Fig. 7, a method 190 by which an AU requests a re-sync of the transcription process to current voice messages when CA generated text falls behind current voice messages is illustrated. Referring also to Fig. 1, at block 192 a HU's voice messages are received at relay 16. After block 192, control passes down to each of blocks 194 and 200 where two simultaneous sub-processes occur in parallel.
At block 194, the HU's voice messages are stored in a rolling buffer. The rolling buffer may, for instance, have a two minute duration so that the most recent two minutes of a HU's voice messages are always stored. At block 196, a CA listens to the HU's voice message and transcribes text corresponding to the messages via re-voicing to software trained to the CA's voice, typing, etc. At block 198 the CA generated text is transmitted to AU's device 12 to be presented on display screen 18 after which control passes back up to block 192. Text correction may occur at block 196 or after block 198.
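The two minute rolling buffer at block 194 can be approximated with a time-bounded deque as sketched below; the chunk format and the eviction strategy are illustrative assumptions.

    from collections import deque

    class RollingVoiceBuffer:
        """Keeps only the most recent duration_seconds of received HU voice chunks."""

        def __init__(self, duration_seconds=120.0):
            self.duration = duration_seconds
            self.chunks = deque()  # (timestamp, audio_bytes) pairs, oldest first

        def append(self, timestamp, audio_bytes):
            self.chunks.append((timestamp, audio_bytes))
            # Evict anything older than the rolling window.
            while self.chunks and timestamp - self.chunks[0][0] > self.duration:
                self.chunks.popleft()

        def snapshot(self):
            """Return the buffered audio, oldest chunk first."""
            return list(self.chunks)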
Referring again to Fig. 7, at process block 200, the HU's voice is fed directly to voice-to-text software run by server 30 which generates automated text at block 202.
Although not shown in Fig. 7, after block 202, server 30 may compare the automated text to the CA generated text to identify errors and may use those errors to train the software to the HU's voice so that the automated text continues to get more accurate as a call proceeds.
Referring still to Figs. 1 and 7, at decision block 204, controller 30 monitors for a catch up or re-sync command received via the AU's device 12 (e.g., via selection of an on-screen virtual "catch up" button 220, see again Fig. 17). Where no catch up or re-sync command has been received, control passes back up to block 192 where the process described above continues to cycle. At block 204, once a re-sync command has been received, control passes to block 206 where the buffered voice messages are skipped and a current voice message is presented to the ear of the CA to be transcribed. At block 208 the automated text corresponding to the skipped voice message segment is filled in to the text on the CA's screen for context and at block 210 the fill in text is transmitted to the AU's device for display.
Where automated text is filled in upon the occurrence of a catch up process, the fill in text may be visually distinguished on the AU's screen and/or on the CA's screen. For instance, fill in text may be highlighted, underlined, bolded, shown in a distinct font, etc. For example, see Fig. 18 that shows fill in text 222 that is underlined to visually distinguish it. See also that the captioning delay 212 has been updated. In some cases, fill in text corresponding to voice messages that occur after or within some pre-defined period prior to a re-sync request may be distinguished in yet a third way to point out the text corresponding to the portion of a voice message that the AU most likely found interesting (e.g., the portion that prompted selection of the re-sync button). For instance, where 24 previous seconds of text are filled in when a re-sync request is initiated, all 24 seconds of fill in text may be underlined and the 8 seconds of text prior to the re-sync request may also be highlighted in yellow. See in Fig. 18 that some of the fill in text is shown in a phantom box 226 to indicate highlighting.
In at least some cases it is contemplated that server 30 may be programmed to automatically determine when CA generated text substantially lags a current voice message from a HU and server 30 may automatically skip ahead to re-sync a CA with a current message while providing automated fill in text corresponding to intervening voice messages. For instance, server 30 may recognize when CA generated text is more than thirty seconds behind a current voice message and may skip the voice messages ahead to the current message while filling in automated text to fill the gap. In at least some cases this automated skip ahead process may only occur after at least some (e.g., 2 minutes) training to a HU's voice to ensure that minimal errors are generated in the fill in text.
A method 150 for automatically skipping to a current voice message in a buffer when a CA falls too far behind is shown in Fig. 6. Referring also to Fig. 1, at block 152, a HU's voice messages are received at relay 16. After block 152, control passes down to each of blocks 154 and 162 where two simultaneous sub-processes occur in parallel. At block 154, the HU's voice messages are stored in a rolling buffer. At block 156, a CA listens to the HU's voice message and transcribes text corresponding to the messages via re-voicing to software trained to the CA's voice, typing, etc., after which control passes to block 170.
Referring still to Fig. 6, at process block 162, the HU's voice is fed directly to voice-to-text software run by server 30 which generates automated text at block 164.
Although not shown in Fig. 6, after block 164, server 30 may compare the automated text to the CA generated text to identify errors and may use those errors to train the software to the HU's voice so that the automated text continues to get more accurate as a call proceeds.
Referring still to Figs. 1 and 6, at decision block 166, controller 30 monitors how far CA text transcription is behind the current voice message and compares that value to a threshold value. If the delay is less than the threshold value, control passes down to block 170. If the delay exceeds the threshold value, control passes to block 168 where server 30 uses automated text from block 164 to fill in the CA generated text and skips the CA up to the current voice message. After block 168 control passes to block 170. At block 170, the text including the CA generated text and the fill in text is presented to the CA via display screen 50 and the CA makes any corrections to observed errors. At block 172, the text is transmitted to AU's device 12 and is displayed on screen 18. Again, uncorrected text may be transmitted to and displayed on device 12 and corrected text may be subsequently transmitted and used to correct errors in the prior text in line on device 12. After block 172 control passes back up to block 152 where the process described above continues to cycle. Automatically generated text to fill in when skipping forward may be visually distinguished (e.g., highlighted, underlined, etc.).
In at least some cases when automated fill in text is generated, that text may not be presented to the CA or the AU as a single block and instead may be doled out at a higher speed than the talking speed of the HU until the text catches up with a current time. To this end, where transcription is far behind a current point in a conversation, if automated catch up text were generated as an immediate single block, in at least some cases, the earliest text in the block could shoot off a CA's display screen or an AU's display screen so that the CA or the AU would be unable to view all of the automated catch up text. Instead of presenting the automated text as a complete block upon catchup, the automated catch up text may be presented at a rate that is faster (e.g., two to three times faster) than the HU's rate of speaking so that catch up is rapid without the oldest catch up text running off the CA's or AU's displays.
In addition to avoiding a case where text shoots off an AU's display screen, presenting text in a constant but rapid flow has a better feel to it as the text is not presented in a jerky start and stop fashion which can be distracting to an AU trying to follow along as text is presented.
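One way to dole out catch up text at a multiple of the HU's speaking rate, rather than as a single block, is sketched below; the generator structure, the default 2.5x factor, the display object's append_word method and the words-per-second estimate are illustrative assumptions.

    import time

    def present_catch_up_text(fill_in_words, hu_words_per_second, display, speedup=2.5):
        """Push fill in words to the display faster than the HU speaks, but not all at once.

        fill_in_words       -- automated words covering the skipped voice segment
        hu_words_per_second -- estimated speaking rate of the HU
        display             -- any object exposing an append_word(word) method
        speedup             -- how many times faster than the HU the catch up text scrolls
        """
        interval = 1.0 / (hu_words_per_second * speedup)
        for word in fill_in_words:
            display.append_word(word)
            time.sleep(interval)  # steady flow avoids jerky start-and-stop presentation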
In other cases, when an AU requests fill in, the system may automatically fill in text and only present the most recent 10 seconds or so of the automatic fill in text to the CA for correction so that the AU has corrected text corresponding to a most recent period as quickly as possible. In many cases where the CA generated text is substantially delayed, much of the fill in text would run off a typical AU's device display screen when presented so making corrections to that text would make little sense as the AU that requests catch up text is typically most interested in text associated with the most recent HU voice signal.
Many AU's devices can be used as conventional telephones without captioning service or as AU devices where captioning is presented and voice messages are broadcast to an AU. The idea here is that one device can be used by hearing impaired persons and persons that have no hearing impairment and that the overall costs associated with providing captioning service can be minimized by only using captioning when necessary. In many cases even a hearing impaired person may not need captioning service all of the time. For instance, a hearing impaired person may be able to hear the voice of a person that speaks loudly fairly well but may not be able to hear the voice of another person that speaks more softly. In this case, captioning would be required when speaking to the person with the soft voice but may not be required when speaking to the person with the loud voice. As another instance, an impaired person may hear better when well rested but hear relatively more poorly when tired so captioning is required only when the person is tired. As still another instance, an impaired person may hear well when there is minimal noise on a line but may hear poorly if line noise exceeds some threshold. Again, the impaired person would only need captioning some of the time.
To minimize captioning service costs and still enable an impaired person to obtain captioning service whenever needed and even during an ongoing call, some systems start out all calls with a default setting where an AU's device 12 is used like a normal telephone without captioning. At any time during an ongoing call, an AU can select either a mechanical or virtual "Caption" icon or button (see again 68 in Fig. 1) to link the call to a relay, provide a HU's voice messages to the relay and commence captioning service. One problem with starting captioning only after an AU experiences problems hearing words is that at least some words (e.g., words that prompted the AU to select the caption button in the first place) typically go unrecognized and therefore the AU is left with a void in their understanding of a conversation.
One solution to the problem of lost meaning when words are not understood just prior to selection of a caption button is to store a rolling recordation of a HU's voice messages that can be transcribed subsequently when the caption button is selected to generate "fill in" text. For instance, the most recent 20 seconds of a HU's voice messages may be recorded and then transcribed only if the caption button is selected.
The relay generates text for the recorded message either automatically via software or via revoicing or typing by a CA or via a combination of both. In addition, the CA or the automated voice recognition software starts transcribing current voice messages. The text from the recording and the real time messages is transmitted to and presented via AU's device 12 which should enable the AU to determine the meaning of the previously misunderstood words. In at least some embodiments the rolling recordation of HU's voice messages may be maintained by the AU's device 12 (see again Fig. 1) and that recordation may be sent to the relay for immediate transcription upon selection of the caption button.
Referring now to Fig. 8, a process 230 that may be performed by the system of Fig. 1 to provide captioning for voice messages that occur prior to a request for captioning service is illustrated. Referring also to Fig. 1, at block 232 a HU's voice messages are received during a call with an AU at the AU's device 12. At block 234 the AU's device 12 stores a most recent 20 seconds of the HU's voice messages on a rolling basis. The 20 seconds of voice messages are stored without captioning initially in at least some embodiments. At decision block 236, the AU's device monitors for selection of a captioning button (not shown). If the captioning button has not been selected, control passes back up to block 232 where blocks 232, 234 and 236 continue to cycle.
Once the caption button has been selected, control passes to block 238 where AU's device 12 establishes a communication link to relay 16. At block 240 AU's device 12 transmits the stored 20 seconds of the HU's voice messages along with current ongoing voice messages from the HU to relay 16. At this point a CA and/or software at the relay transcribes the voice-to-text, corrections are made (or not), and the text is transmitted back to device 12 to be displayed. At block 242 AU's device 12 receives the captioned text from the relay 16 and at block 244 the received text is displayed or presented on the AU's device display 18. At block 246, in at least some embodiments, text corresponding to the 20 seconds of HU voice messages prior to selection of the caption button may be visually distinguished (e.g., highlighted, bolded, underlined, etc.) from other text in some fashion. After block 246 control passes back up to block 232 where the process described above continues to cycle and captioning in substantially real time continues.
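A sketch of the AU-device side of blocks 232 through 240, assuming a 20 second window, a relay object with open_link() and send_audio() methods, and the handler names shown, follows; all of those names are illustrative assumptions.

    from collections import deque

    class CaptionOnDemand:
        """Buffers the last 20 seconds of HU audio and ships it to the relay when captioning starts."""

        def __init__(self, relay, window_seconds=20.0):
            self.relay = relay                 # assumed to expose open_link() and send_audio()
            self.window = window_seconds
            self.buffer = deque()              # (timestamp, audio_bytes) pairs
            self.captioning = False

        def on_hu_audio(self, timestamp, audio_bytes):
            self.buffer.append((timestamp, audio_bytes))
            while self.buffer and timestamp - self.buffer[0][0] > self.window:
                self.buffer.popleft()          # keep only the rolling 20 second window
            if self.captioning:
                self.relay.send_audio(audio_bytes)

        def on_caption_button(self):
            """Caption button pressed: link to the relay and send the buffered audio first."""
            self.captioning = True
            self.relay.open_link()
            for _, chunk in self.buffer:
                self.relay.send_audio(chunk)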
Referring to Fig. 9, a relay server process 270 whereby automated software transcribes voice messages that occur prior to selection of a caption button and a CA at least initially captions current voice messages is illustrated. At block 272, after an AU requests captioning service by selecting a caption button, server 30 receives a HU's voice messages including current ongoing messages as well as the most recent 20 seconds of voice messages that had been stored by AU's device 12 (see again Fig. 1).
After block 272, control passes to each of blocks 274 and 278 where two simultaneous processes commence in parallel. At block 274 the stored 20 seconds of voice messages are provided to voice-to-text software run by server 30 to generate automated text and at block 276 the automated text is transmitted to the AU's device 12 for display. At block 278 the current or real time HU's voice messages are provided to a CA and at block 280 the CA transcribes the current voice messages to text. The CA generated text is transmitted to an AU's device at block 282 where the text is displayed along with the text transmitted at block 276. Thus, here, the AU receives text corresponding to misunderstood voice messages that occur just prior to the AU requesting captioning. One other advantage of this system is that when captioning starts, the CA is not starting captioning with an already existing backlog of words to transcribe and instead automated software is used to provide the prior text.
In other embodiments, when an AU cannot understand a voice message during a normal call and selects a caption button to obtain captioning for a most recent segment of a HU's voice signal, the system may simply provide captions for the most recent 10-20 seconds of the voice signal without initiating ongoing automatic captioning or assistance from a CA. Thus, where an AU is only sporadically or periodically unable to hear and understand the broadcast HU's voice, the AU may select the caption button to obtain periodic captioning when needed. For instance, it is envisioned that in one case, an AU may participate in a five minute call and may only require captioning during three short 20 second periods. In this case, the AU would select the caption button three times, once for each time that the user is unable to hear the HU's voice signal, and the system would generate three bursts of text, one for each of three HU voice segments just prior to each of the button activation events.
In some cases instead of just presenting captioning for the 20 seconds prior to a caption button activation event, the system may present the prior 20 seconds and a few seconds (e.g., 10) of captioning just after the button selection to provide the 20 prior seconds in some context to make it easier for the AU to understand the overall text.
Third Party Automated Speech Recognition (ASR) And Other ASR Resources
In addition to using a service provided by relay 16 to transcribe stored rolling text, other resources may be used to transcribe the stored rolling text. For instance, in at least some embodiments an AU's device may link via the Internet or the like to a third party provider running automated speech recognition (ASR) software that can receive voice messages and transcribe those messages, at least somewhat accurately, to text.
In these cases it is contemplated that real time transcription where accuracy needs to meet a high accuracy standard would still be performed by a CA or software trained to a specific voice while less accuracy sensitive text may be generated by the third party provider, at least some of the time for free or for a nominal fee, and transmitted back to the AU's device for display.
In other cases, it is contemplated that the AU's device 12 itself may run voice-to-text or ASR software to at least somewhat accurately transcribe voice messages to text where the text generated by the AU's device would only be provided in cases where accuracy sensitivity is less than normal such as where rolling voice messages prior to selection of a caption icon to initiate captioning are to be transcribed.
Fig. 10 shows another method 300 for providing text for voice messages that occurred prior to a caption request, albeit where an AU's device generates the pre-request text as opposed to a relay. Referring also to Fig. 1, at block 310 a HU's voice messages are received at an AU's device 12. At block 312, the AU's device 12 runs voice-to-text software that, in at least some embodiments, trains on the fly to the voice of a linked HU and generates caption text.
Here, on the fly training may include assigning a confidence factor to each automatically transcribed word and only using text that has a high confidence factor to train a voice model for the HU. For instance, only text having a confidence factor greater than 95% may be used for automatic training purposes. Here, confidence factors may be assigned based on many different factors or algorithms, many of which are well known in the automatic voice recognition art. In this embodiment, at least initially, the caption text generated by the AU's device 12 is not displayed to the AU. At block 314, until the AU requests captioning, control simply routes back up to block 310.
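The confidence-gated training described here could be expressed as in the sketch below; the 95% cutoff comes from the example above, while the voice model interface (an update(word, audio) method) and the function name are illustrative assumptions.

    def train_on_confident_words(recognized_words, voice_model, cutoff=0.95):
        """Feed only high-confidence ASR results into on-the-fly voice model training.

        recognized_words -- iterable of (word, confidence, audio_segment) tuples from the engine
        voice_model      -- assumed to expose update(word, audio_segment)
        cutoff           -- minimum confidence for a word to be used as training data
        """
        used = 0
        for word, confidence, audio_segment in recognized_words:
            if confidence > cutoff:
                voice_model.update(word, audio_segment)
                used += 1
        return used  # number of words actually used for training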
Once captioning is requested by an AU, control passes to block 316 where the text corresponding to the last 20 seconds generated by the AU's device is presented on the AU's device display 18. Here, while there may be some errors in the displayed text, at least some text associated with the most recent voice message can be quickly presented and give the AU the opportunity to attempt to understand the voice messages associated therewith. At block 318 the AU's device links to a relay and at block 320 the HU's ongoing voice messages are transmitted to the relay. At block 322, after CA transcription at the relay, the AU's device receives the transcribed text from the relay and at block 324 the text is displayed. After block 324 control passes back up to block 320 where the sub-loop including blocks 320, 322 and 324 continues to cycle.
Thus, in the above example, instead of the AU's device storing the last 20 seconds of a HU's voice signal and transcribing that voice signal to text after the AU requests transcription, the AU's device constantly runs an ASR engine behind the scenes to generate automated engine text which is stored without initially being presented to the AU. Then, when the AU requests captioning or transcription, the most recently transcribed text can be presented via the AU's device display immediately or via rapid presentation (e.g., sequentially at a speed higher than the HU's speaking speed).
In at least some cases it is contemplated that voice-to-text software run outside control of the relay may be used to generate at least initial text for a HU's voice and that the initial text may be presented via an AU's device. Here, because known software still may generate more text transcription errors than allowed given standard accuracy requirements in the text captioning industry, a relay correction service may be provided. For instance, in addition to presenting text transcribed by the AU's device via a device display 18, the text transcribed by the AU's device may also be transmitted to a relay 16 for correction. In addition to transmitting the text to the relay, the HU's voice messages may also be transmitted to the relay so that a CA can compare the text automatically generated by the AU's device to the HU's voice messages. At the relay, the CA can listen to the voice of the hearing person and can observe associated text.
Any errors in the text can be corrected and corrected text blocks can be transmitted back to the AU's device and used for in line correction on the AU's display screen.
One advantage to this type of system is that relatively less skilled CAs may be retained at a lesser cost to perform the CA tasks. A related advantage is that the stress level on CAs may be reduced appreciably by eliminating the need to both transcribe and correct at high speeds and therefore CA turnover at relays may be appreciably reduced which ultimately reduces costs associated with providing relay services.
A similar system may include an AU's device that links to some other third party provider ASR transcription/caption server (e.g., in the "cloud") to obtain initial captioned text which is immediately displayed to an AU and which is also transmitted to the relay for CA correction. Here, again, the CA corrections may be used by the third party provider to train the software on the fly to the HU's voice. In this case, the AU's device may have three separate links, one to the HU, a second link to a third party provider server, and a third link to the relay. In other cases, the relay may create the link to the third party server for ASR services. Here, the relay would provide the HU's voice signal to the third party server, would receive text back from the server to transmit to the AU device and would receive corrections from the CA to transmit to each of the AU device and the third party server. The third party server would then use the corrections to train the voice model to the HU voice and would use the evolving model to continue ASR transcription. In still other cases the third party ASR may train on an HU's voice signal based on confidence factors and other training algorithms and completely independent of CA corrections.
Referring to Fig. 11, a method 360 whereby an AU's device transcribes a HU's voice to text and where corrections are made to the text at a relay is illustrated. At block 362 a HU's voice messages are received at an AU's device 12 (see also again Fig. 1). At block 364 the AU's device runs voice-to-text software to generate text from the received voice messages and at block 366 the generated text is presented to the AU via display 18. At block 370 the transcribed text is transmitted to the relay 16 and at block 372 the text is presented to a CA via the CA's display 50. At block 374 the CA corrects the text and at block 376 corrected blocks of text are transmitted to the AU's device 12. At block 378 the AU's device 12 uses the corrected blocks to correct the text errors via in line correction. At block 380, the AU's device uses the errors, the corrected text and the voice messages to train the captioning software to the HU's voice.
In some cases instead of having a relay or an AU's device run automated voice-to-text transcription software, a HU's device may include a processor that runs transcription software to generate text corresponding to the HU's voice messages. To this end, device 14 may, instead of including a simple telephone, include a computer that can run various applications including a voice-to-text program or may link to some third party real time transcription software program (e.g., software run on a third party server in the "cloud" (e.g., Watson, Google Voice, etc.)) to obtain an initial text transcription substantially in real time. Here, as in the case where an AU's device runs the transcription software, the text will often have more errors than allowed by the standard accuracy requirements.
Again, to correct the errors, the text and the HU's voice messages are transmitted to relay 16 where a CA listens to the voice messages, observes the text on screen 50 and makes corrections to eliminate transcription errors. The corrected blocks of text are transmitted to the AU's device for display. The corrected blocks may also be transmitted back to the HU's device for training the captioning software to the HU's voice. In these cases the text transcribed by the HU's device and the HU's voice messages may either be transmitted directly from the HU's device to the relay or may be transmitted to the AU's device 12 and then on to the relay. Where the HU's voice messages and text are transmitted directly to the relay 16, the voice messages and text may also be transmitted directly to the AU's device for immediate broadcast and display and the corrected text blocks may be subsequently used for in line correction.
In these cases the caption request option may be supported so that an AU can initiate captioning during an on-going call at any time by simply transmitting a signal to the HU's device instructing the HU's device to start the captioning process. Similarly, in these cases the help request option may be supported. Where the help option is facilitated, the automated text may be presented via the AU's device and, if the AU perceives that too many text errors are being generated, the help button may be selected to cause the HU's device or the AU's device to transmit the automated text to the relay for CA correction.
One advantage to having a HU's device manage or perform voice-to-text transcription is that the voice signal being transcribed can be a relatively high quality voice signal. To this end, a standard phone voice signal has a range of frequencies between 300 and about 3000 Hertz, which is only a fraction of the frequency range used by most voice-to-text transcription programs and therefore, in many cases, automated transcription software does only a poor job of transcribing voice signals that have passed through a telephone connection. Where transcription can occur within a digital signal portion of an overall system, the frequency range of voice messages can be optimized for automated transcription. Thus, where a HU's computer that is all digital receives and transcribes voice messages, the frequency range of the messages is relatively large and accuracy can be increased appreciably. Similarly, where a HU's computer can send digital voice messages to a third party transcription server, accuracy can be increased appreciably.
Calls of Different Sound Quality Handled Differently
In at least some configurations it is contemplated that the link between an AU's device 12 and a HU's device 14 may be either a standard phone type connection or may be a digital or high definition (HD) connection depending on the capabilities of the HU's device that links to the AU's device. Thus, for instance, a first call may be standard quality and a second call may be high definition audio. Because high definition voice messages have a greater frequency range and therefore can be automatically transcribed more accurately than standard definition audio voice messages in many cases, it has been recognized that a system where automated voice-to-text program use is implemented on a case by case basis depending upon the type of voice message received (e.g., digital or analog) would be advantageous. For instance, in at least some embodiments, where a relay receives a standard definition voice message for transcription, the relay may automatically link to a CA for full CA transcription service where the CA transcribes and corrects text via revoicing and keyboard manipulation and where the relay receives a high definition digital voice message for transcription, the relay may run an automated voice-to-text transcription program to generate automated text. The automated text may either be immediately corrected by a CA or may only be corrected by an assistant after a help feature is selected by an AU as described above.
Referring to Fig. 12, one process 400 for treating high definition digital messages differently than standard definition voice messages is illustrated. Referring also to Fig. 1, at block 402 a HU's voice messages are received at a relay 16. At decision block 404, relay server 30 determines if the received voice message is a high definition digital message or is a standard definition message (e.g., sometimes an analog message). Where a high definition message has been received, control passes to block 406 where server 30 runs an automated voice-to-text program on the voice messages to generate automated text. At block 408 the automated text is transmitted to the AU's device 12 for display. Referring again to block 404, where the HU's voice messages are in standard definition audio, control passes to block 412 where a link to a CA is established so that the HU's voice messages are provided to a CA. At block 414 the CA listens to the voice messages and transcribes the messages into text. Error correction may also be performed at block 414. After block 414, control passes to block 408 where the CA generated text is transmitted to the AU's device 12. Again, in some cases, when automated text is presented to an AU, a help button may be presented that, when selected, causes automated text to be presented to a CA for correction. In other cases automated text may be automatically presented to a CA for correction.
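The routing decision at block 404 could look roughly like the following; the AudioQuality enum, the handler callables and the way quality is detected upstream are illustrative assumptions rather than details given in the disclosure.

    from enum import Enum, auto

    class AudioQuality(Enum):
        HIGH_DEFINITION = auto()   # wideband digital voice
        STANDARD = auto()          # narrowband (roughly 300-3000 Hz) or analog voice

    def route_incoming_call(quality, run_asr, link_to_ca):
        """Send HD calls to automated transcription and standard calls to a full-service CA.

        run_asr    -- callable that starts automated voice-to-text for the call
        link_to_ca -- callable that links the call to a call assistant
        """
        if quality is AudioQuality.HIGH_DEFINITION:
            return run_asr()
        return link_to_ca()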
Another system is contemplated where all incoming calls to a relay are initially assigned to a CA for at least initial captioning where the option to switch to automated software generated text is only available when the call includes high definition audio and after accuracy standards have been exceeded. Here, all standard definition HU voice messages would be captioned by a CA from start to finish and any high definition calls would cut out the CA when the standard is exceeded.
In at least some cases where an AU's device is capable of running automated voice-to-text transcription software, the AU's device 12 may be programmed to select either automated transcription when a high definition digital voice message is received or a relay with a CA when a standard definition voice message is received. Again, where device 12 runs an automated text program, CA correction may be automatic or may only start when a help button is selected.
Fig. 13 shows a process 430 whereby an AU's device 12 selects either automated voice-to-text software or a CA to transcribe based on the type (e.g., digital or analog) of voice messages received. At block 432 a HU's voice messages are received by an AU's device 12. At decision block 434, a processor in device 12 determines if the AU has selected a help button. Initially no help button is selected as no text has been presented so at least initially control passes to block 436. At decision block 436, the device processor determines if a HU's voice signal that is received is high definition digital or is standard definition. Where the received signal is high definition digital, control passes to block 438 where the AU's device processor runs automated voice-to-text software to generate automated text which is then displayed on the AU device display 18 at block 440.
Referring still to Fig. 13, if the help button has been selected at block 434 or if the received voice messages are in standard definition, control passes to block 442 where a link to a CA at relay 16 is established and the HU's voice messages are transmitted to the relay. At block 444 the CA listens to the voice messages and generates text and at block 446 the text is transmitted to the AU's device 12 where the text is displayed at block 440.
HU Recognition And Voice Training
It has been recognized that in many cases most calls facilitated using an AU's device will be with a small group of other hearing or non-hearing users. For instance, in many cases as much as 70 to 80 percent of all calls to an AU's device will be with one of five or fewer HU's devices (e.g., family, close friends, a primary care physician, etc.).
For this reason it has been recognized that it would be useful to store voice-to-text models for at least routine callers that link to an AU's device so that the automated voice-to-text training process can either be eliminated or substantially expedited. For instance, when an AU initiates a captioning service, if a previously developed voice model for a HU can be identified quickly, that model can be used without a new training process and the switchover from a full service CA to automated captioning may be expedited (e.g., instead of taking a minute or more the switchover may be accomplished in 15 seconds or less, in the time required to recognize or distinguish the HU's voice from other voices).
Fig. 14 shows a sub-process 460 that may be substituted for a portion of the process shown in Fig. 3 wherein voice-to-text templates or models along with related voice recognition profiles for callers are stored and used to expedite the handoff to automated transcription. Prior to running sub-process 460, referring again to Fig. 1, server 30 is used to create a voice recognition database for storing HU device identifiers along with associated voice recognition profiles and associated voice-to-text models. A voice recognition profile is a data construct that can be used to distinguish one voice from others and provide improved speech to text accuracy.
In the context of the Fig. 1 system, voice recognition profiles are useful because more than one person may use a HU's device to call an AU. For instance in an exemplary case, an AU's son or daughter-in-law or one of any of three grandchildren may routinely use device 14 to call an AU and therefore, to access the correct voice-to-text model, server 30 needs to distinguish which caller's voice is being received. Thus, in many cases, the voice recognition database will include several voice recognition profiles for each HU device identifier (e.g., each HU phone number). A voice-to-text model includes parameters that are used to customize voice-to-text software for transcribing the voice of an associated HU to text.
The voice recognition database will include at least one voice model for each voice profile to be used by server 30 to automate transcription whenever a voice associated with the specific profile is identified. Data in the voice recognition database will be generated on the fly as an AU uses device 12. Thus, initially the voice recognition database will include a simple construct with no device identifiers, profiles or voice models.
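One plausible in-memory shape for the voice recognition database is sketched below; the dataclass names, the voiceprint and parameter placeholders, and the lookup methods are illustrative assumptions about how the described construct might be organized.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class VoiceProfile:
        """Data used to distinguish one caller's voice from others on the same HU device."""
        caller_label: str
        voiceprint: List[float]          # placeholder for whatever features identify the voice

    @dataclass
    class VoiceModel:
        """Parameters that tune the voice-to-text engine to a particular caller."""
        parameters: Dict[str, float] = field(default_factory=dict)

    @dataclass
    class CallerEntry:
        profile: VoiceProfile
        model: VoiceModel

    class VoiceRecognitionDatabase:
        """Maps HU device identifiers (e.g., phone numbers) to the callers seen on that device."""

        def __init__(self):
            self.entries: Dict[str, List[CallerEntry]] = {}   # starts empty; filled on the fly

        def callers_for_device(self, device_id: str) -> List[CallerEntry]:
            return self.entries.get(device_id, [])

        def add_caller(self, device_id: str, entry: CallerEntry) -> None:
            self.entries.setdefault(device_id, []).append(entry)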
Referring still to Figs. 1 and 14 and now also to Fig. 3, at decision block 84 in Fig. 3, if the help flag is still zero (e.g., an AU has not requested CA help to correct automated text errors) control may pass to block 464 in Fig. 14 where the HU's device identifier (e.g., a phone number, an IP address, a serial number of a HU's device, etc.) is received by server 30. At block 468 server 30 determines if the HU's device identifier has already been added to the voice recognition database. If the HU's device identifier does not appear in the database (e.g., the first time the HU's device is used to connect to the AU's device) control passes to block 482 where server 30 uses a general voice-to-text program to convert the HU's voice messages to text after which control passes to block 476. At block 476 the server 30 trains a voice-to-text model using transcription errors. Again, the training will include comparing CA generated text to automated text to identify errors and using the errors to adjust model parameters so that the next time a word associated with an error is uttered by the HU, the software will identify the correct word. At block 478, server 30 trains a voice profile for the HU's voice so that the next time the HU calls, a voice profile will exist for the specific HU that can be used to identify the HU. At block 480 the server 30 stores the voice profile and voice model for the HU along with the HU device identifier for future use after which control passes back up to block 94 in Fig. 3.
Referring still to Figs. 1 and 14, at block 468, if the HU's device is already represented in the voice recognition database, control passes to block 470 where server runs voice recognition software on the HU's voice messages in an attempt to identify a voice profile associated with the specific HU. At decision block 472, if the HU's voice does not match one of the previously stored voice profiles associated with the device identifier, control passes to block 482 where the process described above continues. At block 472, if the HU's voice matches a previously stored profile, control passes to block 474 where the voice model associated with the matching profile is used to tune the voice-to-text software to be used to generate automated text.
Referring still to Fig. 14, at blocks 476 and 478, the voice model and voice profile for the HU are continually trained. Continual training enables the system to constantly adjust the model for changes in a HU's voice that may occur over time or when the HU experiences some physical condition (e.g., a cold, a raspy voice) that affects the sound of their voice. At block 480, the voice profile and voice model are stored with the HU device identifier for future use.
In at least some embodiments, server 30 may adaptively change the order of voice profiles applied to a HU's voice during the voice recognition process. For instance, while server 30 may store five different voice profiles for five different HUs that routinely connect to an AU's device, a first of the profiles may be used 80 percent of the time. In this case, when captioning is commenced, server 30 may start by using the first profile to analyze a HU's voice at block 472 and may cycle through the profiles from the most matched to the least matched.
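Adaptive ordering of profiles could be as simple as sorting candidates by how often each has matched in the past, as in this sketch; the match-count bookkeeping and the matches(voice_sample) test on the profile object are illustrative assumptions.

    def identify_caller(candidates, voice_sample):
        """Try stored profiles in order of past match frequency; return the first that matches.

        candidates   -- list of dicts like {"profile": ..., "model": ..., "match_count": int}
        voice_sample -- audio to test against each profile's matches() routine
        """
        for entry in sorted(candidates, key=lambda e: e["match_count"], reverse=True):
            if entry["profile"].matches(voice_sample):   # assumed profile API
                entry["match_count"] += 1                # most-used profiles get tried first next time
                return entry
        return None  # no stored profile matched; fall back to the general voice-to-text program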
To avoid server 30 having to store a different voice profile and voice model for every hearing person that communicates with an AU via device 12, in at least some embodiments it is contemplated that server 30 may only store models and profiles for a limited number (e.g., 5) of frequent callers. To this end, in at least some cases server will track calls and automatically identify the most frequent HU devices used to link to the AU's device 12 over some rolling period (e.g., 1 month) and may only store models and profiles for the most frequent callers. Here, a separate counter may be maintained for each HU device used to link to the AU's device over the rolling period and different models and profiles may be swapped in and out of the stored set based on frequency of calls.
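The rolling-period counters used to decide which callers keep a stored model might be tracked as sketched below; the one month window, the cap of five stored entries, and the method names are illustrative assumptions.

    import time
    from collections import Counter, deque

    class FrequentCallerTracker:
        """Counts calls per HU device over a rolling period and reports the devices worth caching."""

        def __init__(self, period_seconds=30 * 24 * 3600, max_stored=5):
            self.period = period_seconds
            self.max_stored = max_stored
            self.calls = deque()            # (timestamp, device_id) pairs within the rolling period

        def record_call(self, device_id, now=None):
            now = time.time() if now is None else now
            self.calls.append((now, device_id))
            while self.calls and now - self.calls[0][0] > self.period:
                self.calls.popleft()        # drop calls older than the rolling period

        def devices_to_keep(self):
            """Device identifiers whose profiles and models should stay in the stored set."""
            counts = Counter(device_id for _, device_id in self.calls)
            return [device_id for device_id, _ in counts.most_common(self.max_stored)]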
In other embodiments server 30 may query an AU for some indication that a specific HU is or will be a frequent contact and may add that person to a list for which a model and a profile should be stored for a total of up to five persons.
While the system described above with respect to Fig. 14 assumes that the relay 16 stores and uses voice models and voice profiles that are trained to HU's voices for subsequent use, in at least some embodiments it is contemplated that an AU's device 12 processor may maintain and use or at least have access to and use the voice recognition database to generate automated text without linking to a relay. In this case, because the AU's device runs the software to generate the automated text, the software for generating text can be trained any time the user's device receives a HU's voice messages without linking to a relay. For example, during a call between a HU and an AU on devices 14 and 12, respectively, in Fig. 1, and prior to an AU requesting captioning service, the voice messages of even a new HU can be used by the AU's device to train a voice-to-text model and a voice profile for the user. In addition, prior to a caption request, as the model is trained and gets better and better, the model can be used to generate text that can be used as fill in text (e.g., text corresponding to voice messages that precede initiation of the captioning function) when captioning is selected.
Fig. 15 shows a process 500 that may be performed by an AU's device to train voice models and voice profiles and use those models and profiles to automate text transcription until a help button is selected. Referring also to Fig. 1, at block 502, an AU's device 12 processor receives a HU's voice messages as well as an identifier (e.g., a phone number) of the HU's device 14. At block 504 the processor determines if the AU has selected the help button (e.g., indicating that current captioning includes too many errors). If an AU selects the help button at block 504, control passes to block 522 where the AU's device is linked to a CA at relay 16 and the HU's voice is presented to the CA. At block 524 the AU's device receives text back from the relay and at block 534 the CA generated text is displayed on the AU's device display 18.
Where the help button has not been selected, control passes to block 505 where the processor uses the device identifier to determine if the HU's device is represented in the voice recognition database. Where the HU's device is not represented in the database, control passes to block 528 where the processor uses a general voice-to-text program to convert the HU's voice messages to text after which control passes to block 512.
Referring again to Figs. 1 and 15, at block 512 the processor adaptively trains the voice model using perceived errors in the automated text. To this end, one way to train the voice model is to generate text phonetically and thereafter perform a context analysis of each text word by looking at other words proximate the word to identify errors. Another example of using context to identify errors is to look at several generated text words as a phrase and compare the phrase to similar prior phrases that are consistent with how the specific HU strings words together and identify any discrepancies as possible errors. At block 514 a voice profile for the HU is generated from the HU's voice messages so that the HU's voice can be recognized in the future.
At block 516 the voice model and voice profile for the HU are stored for future use during subsequent calls and then control passes to block 518 where the process described above continues. Thus, blocks 528, 512, 514 and 516 enable the AU's device to train voice models and voice profiles for HUs that call in anew where a new voice model can be used during an ongoing call and during future calls to provide generally accurate transcription.
Referring still to Figs. 1 and 15, if the HU's device is already represented in the voice recognition database at block 505, control passes to block 506 where the processor runs voice recognition software on the HU's voice messages in an attempt to identify one of the voice profiles associated with the device identifier. At block 508, where no voice profile is recognized, control passes to block 528.
At block 508, if the HU's voice matches one of the stored voice profiles, control passes to block 510 where the voice-to-text model associated with the matching profile is used to generate automated text from the HU's voice messages. Next, at block 518, the AU's device processor determines if the caption button on the AU's device has been selected. If captioning has not been selected, control passes to block 502 where the process continues to cycle. Once captioning has been requested, control passes to block 520 where AU's device 12 displays the most recent 10 seconds of automated text and continuing automated text on display 18.
In at least some embodiments it is contemplated that different types of voice model training may be performed by different processors within the overall Fig. 1 system. For instance, while an AU's device is not linked to a relay, the AU's device cannot use any errors identified by a call assistant at the relay to train a voice model as no CA is generating errors. Nevertheless, the AU's device can use context and confidence factors to identify errors and train a model. Once an AU's device is linked to a relay where a CA corrects errors, the relay server can use the CA identified errors and corrections to train a voice model which can, once sufficiently accurate, be transmitted to the AU's device where the new model is substituted for the old content based model or where the two models are combined into a single robust model in some fashion. In other cases when an AU's device links to a relay for CA captioning, a context based voice model generated by the AU's device for the HU may be transmitted to the relay server and used as an initial model to be further trained using CA identified errors and corrections. In still other cases CA errors may be provided to the AU's device and used by that device to further train a context based voice model for the HU.
Referring now to Fig. 16, a sub-process 550 that may be added to the process shown in Fig. 15 whereby an AU's device trains a voice model for a HU using voice message content and a relay server further trains the voice model generated by the AU's device using CA identified errors is illustrated. Referring also to Fig. 15, sub-process 550 is intended to be performed in parallel with blocks 524 and 534 in Fig. 15.
Thus, after block 522, in addition to block 524, control also passes to block 552 in Fig. 16. At block 552 the voice model for a HU that has been generated by an AU's device 12 is transmitted to relay 16 and at block 553 the voice model is used to modify a voice-to-text program at the relay. At block 554 the modified voice-to-text program is used to convert the HU's voice messages to automated text. At block 556 the CA generated text is compared to the automated text to identify errors. At block 558 the errors are used to further train the voice model. At block 560, if the voice model has an accuracy below the required standard, control passes back to block 502 in Fig. 15 where the process described above continues to cycle. At block 560, once the accuracy exceeds the standard requirement, control passes to block 562 wherein server 30 transmits the trained voice model to the AU's device for handling subsequent calls from the HU for which the model was trained. At block 564 the new model is stored in the database maintained by the AU's device.
Referring still to Fig. 16, in addition to transmitting the trained model to the AU's device at block 562, once the model is accurate enough to meet the standard requirements, server 30 may perform an automated process to cut out the CA and instead transmit automated text to the AU's device as described above in Fig. 1. In the alternative, once the model has been transmitted to the AU's device at block 562, the relay may be programmed to hand off control to the AU's device which would then use the newly trained and relatively more accurate model to perform automated transcription so that the relay could be disconnected.
Several different concepts and aspects of the present disclosure have been described above. It should be understood that many of the concepts and aspects may be combined in different ways to configure other triage systems that are more complex.
For instance, one exemplary system may include an AU's device that attempts automated captioning with on the fly training first and, when automated captioning by the AU's device fails (e.g., a help icon is selected by an AU), the AU's device may link to a third party captioning system via the internet or the like where another more sophisticated voice-to-text captioning software is applied to generate automated text.
Here, if the help button is selected a second time or a "CA" button is selected, the AU's device may link to a CA at the relay for CA captioning with simultaneous voice-to-text software transcription where errors in the automated text are used to train the software until a threshold accuracy requirement is met. Here, once the accuracy requirement is exceeded, the system may automatically cut out the CA and switch to the automated text from the relay until the help button is again selected. In each of the transcription hand offs, any learning or model training performed by one of the processors in the system may be provided to the next processor in the system to be used to expedite the training process.
Line Check Words
In at least some embodiments an automated voice-to-text engine may be utilized in other ways to further enhance calls handled by a relay. For instance, in cases where transcription by a CA lags behind a HU's voice messages, automated transcription software may be programmed to transcribe text all the time and identify specific words in a HU's voice messages to be presented via an AU's display immediately when identified to help the AU determine when a HU is confused by a communication delay. For instance, assume that transcription by a CA lags a HU's most current voice message by 20 seconds and that an AU is relying on the CA generated text to communicate with the HU. In this case, because the CA generated text lag is substantial, the HU may be confused when the AU's response also lags a similar period and may generate a voice message questioning the status of the call. For instance, the HU may utter "Are you there?" or "Did you hear me?" or "Hello" or "What did you say?". These phrases and others like them querying call status are referred to herein as "line check words" (LCWs) as the HU is checking the status of the call on the line.
If the line check words are not presented until they occurred sequentially inthe HU's voice messages, they would be delayed for 20 or more seconds in the aboveexample. In at least some embodiments it is contemplated that the automated voiceengine may search for line check words (e.g., 50 common line check phrases) in a HU'svoice messages and present the line check words immediately via the AU's deviceduring a call regardless of which words have been transcribed and presented to an AU.
The AU, seeing line check words or a phrase can verbally respond that the captioningservice is lagging but catching up so that the parties can avoid or at least minimizeconfusion. In the alternative, a system processor may automatically respond to any linecheck words by broadcasting a voice message to the HU indicating that transcription islagging and will catch up shortly. The automated message may also be broadcast tothe AU so that the AU is also aware of the HU's situation.
When line check words are presented to an AU the words may be presentedin-line within text being generated by a CA with intermediate blanks representing wordsyet to be transcribed by the CA. To this end, see again Fig. 17 that shows line checkwords "Are you still there?" in a highlighting box 590 at the end of intermediate blanks216 representing words yet to be transcribed by the CA. Line check words will, in atleast some embodiments, be highlighted on the display or otherwise visuallydistinguished. In other embodiments the line check words may be located at someprominent location on the AU's display screen (e.g., in a line check box or field at thetop or bottom of the display screen).
One advantage of using an automated voice engine to only search for specificwords and phrases is that the engine can be tuned for those words and will be relativelymore accurate than a general purpose engine that transcribes all words uttered by aHU. In at least some embodiments the automated voice engine will be run by an AU'sdevice processor while in other embodiments the automated voice engine may be runby the relay server with the line check words transmitted to the AU's device immediatelyupon generation and identification.
In still other cases where automated text is presented immediately upongeneration to an AU, line check words may be presented in a visually distinguishedfashion (e.g., highlighted, in different color, as a distinct font, as a uniquely sized font,etc.) so that an AU can distinguish those words from others and, where appropriate,provide a clarifying remark to a confused HU.
Referring now to Fig. 19, a process 600 that may be performed by an AU's device 12 and a relay to transcribe HU's voice messages and provide line check words immediately to an AU when transcription by a CA lags is illustrated. At block 602 a HU's voice messages are received by an AU's device 12. After block 602 control continues along parallel sub-processes to blocks 604 and 612. At block 604 the AU's device processor uses an automated voice engine to transcribe the HU's voice messages to text. Here, it is assumed that the voice engine may generate several errors and therefore likely would be insufficient for the purposes of providing captioning to the AU. The engine, however, is optimized and trained to caption a set (e.g., 10 to 100) of line check words and/or phrases which the engine can do extremely accurately. At block 606, the AU's device processor searches for line check words in the automated text. At block 608, if a line check word or phrase is not identified control passes back up to block 602 where the process continues to cycle. At block 608, if a line check word or phrase is identified, control passes to block 610 where the line check word/phrase is immediately presented (see phrase "Are you still there?" in Fig. 18) to the AU via display 18 either in-line or in a special location and, in at least some cases, in a visually distinct manner.
Referring still to Fig. 19, at block 612 the HU's voice messages are sent to arelay for transcription. At block 614, transcribed text is received at the AU's device backfrom the relay. At block 616 the text from the relay is used to fill in the intermediateblanks (see again Fig. 17 and also Fig. 18 where text has been filled in) on the AU'sdisplay.
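A minimal sketch of the line check word scan of Fig. 19 (blocks 604 through 610) follows, assuming a hypothetical phrase list and a show_line_check() display call; the real engine would operate on the HU voice signal rather than on finished text.

from typing import Optional

# Hypothetical short list of common line check phrases (the disclosure suggests ~50).
LINE_CHECK_PHRASES = [
    "are you there", "are you still there", "did you hear me",
    "hello", "what did you say",
]

def scan_for_line_check_words(asr_text: str) -> Optional[str]:
    """Return a detected line check phrase, or None (block 608)."""
    normalized = asr_text.lower()
    for phrase in LINE_CHECK_PHRASES:
        if phrase in normalized:
            return phrase
    return None

def on_asr_text(asr_text, au_display):
    phrase = scan_for_line_check_words(asr_text)
    if phrase is not None:
        # Block 610: present immediately, visually distinguished,
        # ahead of the lagging CA generated text.
        au_display.show_line_check(phrase, highlight=True)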
ASR Suggests Errors In CA Generated Text
In at least some embodiments it is contemplated that an automated voice-to-text engine may operate all the time and may check for and indicate any potential errors in CA generated text so that the CA can determine if the errors should be corrected.
For instance, in at least some cases, the automated voice engine may highlightpotential errors in CA generated text on the CA's display screen inviting the CA tocontemplate correcting the potential errors. In these cases the CA would have the finalsay regarding whether or not a potential error should be altered.
Consistent with the above comments, see Fig. 20 that shows a screen shot ofa CA's display screen where potential errors have been highlighted to distinguish theerrors from other text. Exemplary CA generated text is shown at 650 with errors shownin phantom boxes 652, 654 and 656 that represent highlighting. In the illustratedexample, exemplary words generated by an automated voice-to-text engine are alsopresented to the CA in hovering fields above the potentially erroneous text as shown at658, 660 and 662. Here, a CA can simply touch a suggested correction in a hoveringfield or use a pointing device such as a mouse controlled cursor to select a presentedword to make a correction and replace the erroneous word with the automated textsuggested in the hovering field. If a CA instead touches an error, the CA can manuallychange the word to another word. If a CA does not touch an error or an associatedcorrected word, the word remains as originally transcribed by the CA. An "Accept All"icon is presented at 669 that can be selected to accept all of the suggestions presentedon a CA's display. All corrected words are transmitted to an AU's device to bedisplayed.
Referring to Fig. 21, a method 700 by which a voice engine generates text tobe compared to CA generated text and for providing a correction interface as in Fig. 20for the CA is illustrated. At block 702 the HU's voice messages are provided to a relay.
After block 702 control follows two parallel paths to blocks 704 and 716. At block 704 the HU's voice messages are transcribed into text by an automated voice-to-text engine run by the relay server before control passes to block 706. At block 716 a CA transcribes the HU's voice messages to CA generated text. At block 718 the CA generated text is transmitted to the AU's device to be displayed. At block 720 the CA generated text is displayed on the CA's display screen 50 for correction after which control passes to block 706.
Referring still to Fig. 21, at block 706 the relay server compares the CAgenerated text to the automated text to identify any discrepancies. Where theautomated text matches the CA generated text at block 708, control passes back up toblock 702 where the process continues. Where the automated text does not match theCA generated text at block 708, control passes to block 710 where the server visuallydistinguishes the mismatched text on the CA's display screen 50 and also presentssuggested correct text (e.g., the automated text). Next, at block 712 the server monitorsfor any error corrections by the CA and at block 714 if an error has been corrected, thecorrected text is transmitted to the AU's device for in-line correction.
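The comparison step of Fig. 21 (blocks 706 through 712) might be approximated as below; the word-level alignment via difflib and the printed output stand in for the relay server logic and the CA display behavior of Fig. 20 and are illustrative assumptions only.

from difflib import SequenceMatcher

def find_discrepancies(ca_words, asr_words):
    """Yield (ca_index, ca_segment, suggested_segment) tuples for mismatches."""
    matcher = SequenceMatcher(None, ca_words, asr_words)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield i1, ca_words[i1:i2], asr_words[j1:j2]

ca_text = "I'll pick up food at Pals pizza before the game"
asr_text = "I'll pick up food at Pete's pizza before the game"
for index, ca_seg, suggestion in find_discrepancies(ca_text.split(), asr_text.split()):
    # Block 710: visually distinguish ca_seg on the CA display and offer the
    # ASR words as a one-touch correction (the hovering fields of Fig. 20).
    print(f"word {index}: CA wrote {ca_seg}, ASR suggests {suggestion}")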
In at least some embodiments the relay server may be able to generate sometype of probability or confidence factor related to how likely a discrepancy betweenautomated and CA generated text is related to a CA error and may only indicate errorsand present suggestions for probable errors or discrepancies likely to be related toerrors. For instance, where an automated text segment is different than an associatedCA generated text segment but the automated segment makes no sense contextually ina sentence, the server may not indicate the discrepancy or may not show theautomated text segment as an option for correction. The same discrepancy may beshown as a potential error at a different time if the automated segment makescontextual sense.
In still other embodiments automated voice-to-text software that operates atthe same time as a CA to generate text may be trained to recognize words often missedby a CA such as articles, for instance, and to ignore other words that CAs moreaccurately transcribe.
The particular embodiments disclosed above are illustrative only, as theinvention may be modified and practiced in different but equivalent manners apparent tothose skilled in the art having the benefit of the teachings herein. Furthermore, nolimitations are intended to the details of construction or design herein shown, other thanas described in the claims below. It is therefore evident that the particular embodimentsdisclosed above may be altered or modified and all such variations are consideredwithin the scope and spirit of the invention. Accordingly, the protection sought herein isas set forth in the claims below.
Thus, the invention is to cover all modifications, equivalents, and alternativesfalling within the spirit and scope of the invention as defined by the following appendedclaims. For example, while the methods above are described as being performed byspecific system processors, in at least some cases various method steps may beperformed by other system processors. For instance, where a HU's voice is recognizedand then a voice model for the recognized HU is employed for voice-to-texttranscription, the voice recognition process may be performed by an AU's device andthe identified voice may be indicated to a relay 16 which then identifies a related voicemodel to be used. As another instance, a HU's device may identify a HU's voice andindicate the identity of the HU to the AU's device and/or the relay.
As another example, while the system is described above in the context of atwo line captioning system where one line links an AU's device to a HU's device and asecond line links the AU's device to a relay, the concepts and features described abovemay be used in any transcription system including a system where the HU's voice istransmitted directly to a relay and the relay then transmits transcribed text and the HU'svoice to the AU's device.
As still one other example, while inputs to an AU's device may include mechanical or virtual on screen buttons/icons, in some embodiments other input arrangements may be supported. For instance, in some cases help or a captioning request may be indicated via a voice input (e.g., a verbal request for assistance or for captioning) or via a gesture of some type (e.g., a specific hand movement in front of a camera or other sensor device that is reserved for commencing captioning).
As another example, in at least some cases where a relay includes first andsecond differently trained CAs where first CAs are trained to be capable of transcribingand correcting text and second CAs are only trained to be capable of correcting text, aCA may always be on a call but the automated voice-to-text software may aid in thetranscription process whenever possible to minimize overall costs. For instance, whena call is initially linked to a relay so that a HU's voice is received at the relay, the HU'svoice may be provided to a first CA fully trained to transcribe and correct text. Here,voice-to-text software may train to the HU's voice while the first CA transcribes the textand after the voice-to-text software accuracy exceeds a threshold, instead of completelycutting out the relay or CA, the automated text may be provided to a second CA that isonly trained to correct errors. Here, after training the automated text should haveminimal errors and therefore even a minimally trained CA should be able to makecorrections to the errors in a timely fashion. In other cases, a first CA assigned to a callmay only correct errors in automated voice-to-text transcription and a fully trainedrevoicing and correcting CA may only be assigned after a help or caption request isreceived.
In other systems an AU's device processor may run automated voice-to-textsoftware to transcribe HU's voice messages and may also generate a confidence factorfor each word in the automated text based on how confident the processor is that theword has been accurately transcribed. The confidence factors over a most recentnumber of words (e.g., 100) or a most recent period (e.g., 45 seconds) may beaveraged and the average used to assess an overall confidence factor for transcriptionaccuracy. Where the confidence factor is below a threshold level, the device processormay link to a relay for more accurate transcription either via more sophisticatedautomated voice-to-text software or via a CA. The automated process for linking to arelay may be used instead of or in addition to the process described above whereby anAU selects a "caption" button to link to a relay.
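A rough sketch of the rolling confidence factor check described above follows; the 100 word window, the 0.85 threshold, and the link_to_relay() call are assumed values used only for illustration.

from collections import deque

WINDOW_WORDS = 100          # most recent words considered (assumed)
CONFIDENCE_THRESHOLD = 0.85  # assumed trigger level

class ConfidenceMonitor:
    def __init__(self):
        self.recent = deque(maxlen=WINDOW_WORDS)

    def add_word(self, word: str, confidence: float) -> bool:
        """Record one automated-text word; return True if relay assistance is needed."""
        self.recent.append(confidence)
        average = sum(self.recent) / len(self.recent)
        return len(self.recent) == WINDOW_WORDS and average < CONFIDENCE_THRESHOLD

monitor = ConfidenceMonitor()
# if monitor.add_word("pizza", 0.62): au_device.link_to_relay()  # hypothetical call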
User Customized Complex Words
In addition to storing HU voice models, a system may also store other information that could be used when an AU is communicating with specific HUs to increase accuracy of automated voice-to-text software when used. For instance, a specific HU may routinely use complex words from a specific industry when conversing with an AU. The system software can recognize when a complex word is corrected by a CA or contextually by automated software and can store the word and the pronunciation of the word by the specific HU in a HU word list for subsequent use. Then, when the specific HU subsequently links to the AU's device to communicate with the AU, the stored word list for the HU may be accessed and used to automate transcription. The HU's word list may be stored at a relay, by an AU's device or even by a HU's device where the HU's device has data storing capability.
In other cases a word list specific to an AU's device (i.e., to an AU) thatincludes complex or common words routinely used to communicate with the AU may begenerated, stored and updated by the system. This list may include words used on aregular basis by any HU that communicates with an AU. In at least some cases this listor the HU's word lists may be stored on an internet accessible database (e.g., in the"cloud") so that the AU or some other person has the ability to access the list(s) and editwords on the list via an internet portal or some other network interface.
Where an HU's complex or hard to spell word list and/or an AU's word list isavailable, when a CA is creating CA generated text (e.g., via revoicing, typing, etc.), anASR engine may always operate to search the HU voice signal to recognize when acomplex or difficult to spell word is annunciated and the complex or hard to spell wordsmay be automatically presented to the CA via the CA display screen in line with the CAgenerated text to be considered by the CA. Here, while the CA would still be able tochange the automatically generated complex word, it is expected that CA correction ofthose words would not occur often given the specialized word lists for the specificcommunicating parties.
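The per-HU word list handling described above might look something like the following sketch, where the in-memory storage, the phone number key, and the vocabulary_hints() interface to the ASR engine are assumptions.

hu_word_lists = {}  # hypothetical store keyed by HU identity, e.g., phone number

def record_correction(hu_id: str, corrected_word: str) -> None:
    """Store a CA-corrected or contextually corrected complex word for this HU."""
    hu_word_lists.setdefault(hu_id, set()).add(corrected_word)

def vocabulary_hints(hu_id: str, au_word_list: set) -> set:
    """Words to bias the ASR engine toward on the next call with this HU."""
    return hu_word_lists.get(hu_id, set()) | au_word_list

record_correction("+15551234567", "angioplasty")
hints = vocabulary_hints("+15551234567", {"Medicare", "deductible"})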
Dialect And Other Bases For Specific Transcription Programs
In still other embodiments various aspects of a HU's voice messages may be used to select different voice-to-text software programs that are optimized for voices having different characteristic sets. For instance, there may be different voice-to-text programs optimized for male and female voices or for voices having different dialects.
Here, system software may be able to distinguish one dialect from others and select anoptimized voice engine/software program to increase transcription accuracy. Similarly,a system may be able to distinguish a high pitched voice from a low pitched voice andselect a voice engine accordingly.
In some cases a voice engine may be selected for transcribing a HU's voicebased on the region of a country in which a HU's device resides. For instance, where aHU's device is located in the southern part of the United States, an engine optimized fora southern dialect may be used while a device in New England may cause the systemto select an engine optimized for another dialect. Different word lists may also be usedbased on region of a country in which a HU's device resides.
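One simple way to select among differently tuned engines based on detected voice characteristics and device region is sketched below; the engine names and the two-key lookup are illustrative assumptions.

ENGINES = {
    ("male", "southern_us"): "engine_male_southern",
    ("female", "southern_us"): "engine_female_southern",
    ("male", "new_england"): "engine_male_new_england",
    ("female", "new_england"): "engine_female_new_england",
}
DEFAULT_ENGINE = "engine_general"

def select_engine(detected_gender: str, device_region: str) -> str:
    """Pick the voice-to-text program tuned for this voice characteristic set."""
    return ENGINES.get((detected_gender, device_region), DEFAULT_ENGINE)

print(select_engine("female", "southern_us"))  # engine_female_southern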
Indicating/Selecting Caption Source
In at least some cases it is contemplated that an AU's device will provide a text or other indication to an AU to convey how text that appears on an AU device display 18 is being generated. For instance, when automated voice-to-text software (e.g., an automated voice recognition (ASR) system) is generating text, the phrase "Software Generated Text" may be persistently presented (see 729 in Fig. 22) at the top of a display 18 and when CA generated text is presented, the phrase "CA Generated Text" (not illustrated) may be presented. A phrase "CA Corrected Text" (not illustrated) may be presented when automated text is corrected by a CA.
In some cases a set of virtual buttons (e.g., 68 in Fig. 1) or mechanicalbuttons may be provided via an AU device allowing an AU to select captioningpreferences. For instance, captioning options may include "Automated/SoftwareGenerated Text", "CA Generated Text" (see virtual selection button 719 in Fig. 22) and"CA Corrected Text" (see virtual selection button 721 in Fig. 22). This feature allows anAU to preemptively select a preference in specific cases or to select a preferencedynamically during an ongoing call. For example, where an AU knows from pastexperience that calls with a specific HU result in excessive automated text errors, theAU could select "CA generated text" to cause CA support to persist during the durationof a call with the specific HU.
Caption Confidence Indication
In at least some embodiments, automated voice-to-text accuracy may be tracked by a system and indicated to any one or a subset of a CA, an AU, and an HU either during CA text generation or during automated text presentation, or both. Here, the accuracy value may be over the duration of an ongoing call or over a short most recent rolling period or number of words (e.g., last 30 seconds, last 100 words, etc.), or for a most recent HU turn at talking. In some cases two averages, one over a full call period and the other over a most recent period, may be indicated. The accuracy values would be provided via the AU device display 18 (see 728 in Fig. 22) and/or the CA workstation display 50. Where an HU device has a display (e.g., a smart phone, a tablet, etc.), the accuracy value(s) may be presented via that display in at least some cases. To this end, see the smart phone type HU device 800 in Fig. 24 where an accuracy rate is displayed at 802 for a call with an AU. It is expected that seeing a low accuracy value would encourage an HU to try to annunciate words more accurately or slowly to improve the value.
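The two accuracy averages described above (full call and most recent period) could be tracked as in the following sketch, assuming a 100 word rolling window and a per-word correctness signal derived, for example, from CA corrections.

from collections import deque

class AccuracyTracker:
    def __init__(self, window: int = 100):
        self.correct = 0
        self.total = 0
        self.recent = deque(maxlen=window)

    def record_word(self, was_correct: bool) -> None:
        self.total += 1
        self.correct += int(was_correct)
        self.recent.append(int(was_correct))

    def full_call(self) -> float:
        return self.correct / self.total if self.total else 1.0

    def rolling(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 1.0

# Both values could then be pushed to AU display 18, CA display 50, and an HU device.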
Non-Text Communication Enhancements
Human communication has many different components and the meanings ascribed to text words are only one aspect of that communication. One other aspect of human non-text communication includes how words are annunciated, which often belies a speaker's emotions or other meaning. For instance, a simple change in volume while words are being spoken is often intended to convey a different level of importance.
Similarly, the duration over which a word is expressed, the tone or pitch used when aphrase is annunciated, etc., can convey a different meaning. For instance, annunciatingthe word "Yes" quickly can connote a different meaning than annunciating the word"Yes" very slowly or such that the "s" sound carries on for a period of a few seconds. Asimple text word representation is devoid of a lot of meaning in an originally spokenphrase in many cases.
In at least some embodiments of the present disclosure it is contemplated that volume changes, tone, length of annunciation, pitch, etc., of an HU's voice signal may be sensed by automated software and used to change the appearance of or otherwise visually distinguish transcribed text that is presented to an AU via a device display 18 so that the AU can more fully understand and participate in a richer communication session. To this end, see, for instance, the two textual effects 732 and 734 in AU device text 730 in Fig. 22 where an arrow effect 732 represents a long annunciation period while a bolded/italicized effect 734 represents an appreciable change in HU voice signal volume. Many other non-textual characteristics of an HU voice signal are contemplated and may be sensed and each may have a different appearance. For instance, pitch, speed of speaking, etc., may all be automatically determined and used to provide distinct visual cues along with the transcribed text.
The visual cues may be automatically provided with or used to distinguish textpresented via an AU device display regardless of the source of the text. For example,in some cases automated text may be supplemented with visual cues to indicate othercommunication characteristics and in at least some cases even CA generated text maybe supplemented with automatically generated visual cues indicating how an HUannunciates various words and phrases. Here, as voice characteristics are detected foran HU's utterances, software tracks the voice characteristics in time and associatesthose characteristics with specific text words or phrases generated by the CA. Then,the visual cues for each voice characteristic are used to visually distinguish theassociated words when presented to the AU.
In at least some cases an AU may be able to adjust the degree to which text is enhanced via visual cues or even to select preferred visual cues for different automatically identified voice characteristics. For instance, a specific AU may find fully enabled visual cueing to be distracting and instead may only want bold capital letter visual cueing when an HU's volume level exceeds some threshold value. AU device preferences may be set via a display 18 during some type of device commissioning process.
In some embodiments it is contemplated that the automated software that identifies voice characteristics will adjust or train to an HU's voice during the first few seconds of a call and will continue to train to that voice so that voice characteristic identification is normalized to the HU's specific voice signal to avoid excessive visual cueing. Here, it has been recognized that some people's voices will have persistent voice characteristics that would normally be detected as anomalies if compared to a voice standard (e.g., a typical male or female voice). For instance, a first HU may always speak loudly and therefore, if his voice signal was compared to an average HU volume level, the voice signal would exceed the average level most if not all the time. Here, to avoid always distinguishing the first HU's voice signal with visual cueing indicating a loud voice, the software would use the HU voice signal to determine that the first HU's voice signal is persistently loud and would normalize to the loud signal so that words uttered within a range of volumes near the persistent loud volume would not be distinguished as loud. Here, if the first HU's voice signal exceeds the range about his persistent volume level, the exceptionally loud signal may be recognized as a clear deviation from the persistent volume level for the normalized voice and therefore distinguished with a visual cue for the AU when associated text is presented. The voice characteristic recognizing software would automatically train to the persistent voice characteristics for each HU including, for instance, pitch, tone, speed of annunciation, etc., so that persistent voice characteristics of specific HU voice signals are not visually distinguished as anomalies.
In at least some cases, as in the case of voice models developed and storedfor specific HUs, it is contemplated that HU voice models may also be automaticallydeveloped and stored for specific HU's for specifying voice characteristics. Forinstance, in the above example where a first HU has a particularly loud persistent voice,the volume range about the first HU's persistent volume as well as other persistentcharacteristics may be determined once during an initial call with an AU and then storedalong with a phone number or other HU identifying information in a system database.
Here, the next time the first HU communicates with an AU via the system, the HU voice characteristic model would be automatically accessed and used to detect voice characteristic anomalies and to visually distinguish accordingly.
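A hedged sketch of normalizing to an HU's persistent volume level follows; the exponential baseline, the 6 dB band, and the loud/quiet/normal labels are assumptions chosen only to illustrate how persistent characteristics could be suppressed while clear deviations are flagged for visual cueing.

class VolumeNormalizer:
    def __init__(self, alpha: float = 0.05, band: float = 6.0):
        self.alpha = alpha        # how quickly the baseline tracks the speaker (assumed)
        self.band = band          # dB range around the baseline treated as normal (assumed)
        self.baseline_db = None

    def classify(self, word_level_db: float) -> str:
        """Return 'loud', 'quiet', or 'normal' relative to this HU's own baseline."""
        if self.baseline_db is None:
            self.baseline_db = word_level_db
        label = "normal"
        if word_level_db > self.baseline_db + self.band:
            label = "loud"        # candidate for a bold/highlighted visual cue
        elif word_level_db < self.baseline_db - self.band:
            label = "quiet"
        # Keep training the baseline toward the HU's persistent level.
        self.baseline_db += self.alpha * (word_level_db - self.baseline_db)
        return label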
Referring again to Fig. 22, in addition to changing the appearance of transcribed text to indicate annunciation qualities or characteristics, other visual cues may be presented. For instance, if an HU persistently talks in a volume that is much higher than typical for the HU, a volume indicator 717 may be presented or visually altered in some fashion to indicate the persistent volume. As another example, a volume indicator 715 may be presented above or otherwise spatially proximate any word annunciated with an unusually high volume. In some cases the distinguishing visual cue for a specially annunciated word may only persist for a short duration (e.g., 3 seconds, until the end of a related sentence or phrase, for the next 5 words of an utterance, etc.) and then be eliminated. Here, the idea is that the visual cueing is supposed to mimic the effect of an annunciated word or phrase which does not persist long term (e.g., the loud effect of a high volume word only persists as the word is being annunciated).
The software used to generate the HU voice characteristic models and/or todetect voice anomalies to be visually distinguished may be run via any of an HU deviceprocessor, an AU device processor, a relay processor and a third party operatedprocessor linkable via the internet or some other network. In at least some cases it willbe optimal for an HU device to develop the HU model for an HU that is associated withthe device and to store the model and apply the model to the HU's voice to detectanomalies to be visually distinguished for several reasons. In this regard, a particularlyrich acoustic HU voice signal is available at the HU device so that anomalies can bebetter identified in many cases by the HU device as opposed to some processordownstream in the captioning process.
Sharing Text With HU
Referring again to Fig. 24, in at least some embodiments where an HU device 800 includes a display screen 801, an HU voice text transcription 804 may also be presented via the HU device. Here, an HU viewing the transcribed text could formulate an independent impression of transcription accuracy and whether or not a more robust transcription process (e.g., CA generation of text) is required or would be preferred. In at least some cases a virtual "CA request" button 806 or the like may be provided on the HU screen for selection so that the HU has the ability to initiate CA text transcription and/or CA correction of text. Here, an HU device may also allow an HU to switch back to automated text if an accuracy value 802 exceeds some threshold level. Where HU voice characteristics are detected, those characteristics may be used to visually distinguish text at 804 in at least some embodiments.
Captioning Via HU's Device
Where an HU device is a smart phone, a tablet computing device or some other similar device capable of downloading software applications from an application store, it is contemplated that a captioning application may be obtained from an application store for communication with one or more AU devices 12. For instance, the son or daughter of an AU may download the captioning application to be used any time the device user communicates with the AU. Here, the captioning application may have any of the functionality described in this disclosure and may result in a much better overall system in various ways.
For instance, a captioning application on an HU device may run automated voice-to-text software on a digital HU voice signal as described above where that text is provided to the AU device 12 for display and, at times, to a relay for correction, voice model training, voice characteristic model training, etc. As another instance, an HU device may train a voice model for an HU any time an HU's voice signal is obtained regardless of whether or not the HU is participating in a call with an AU. For example, if a dictation application on an HU device which is completely separate from a captioning application is used to dictate a letter, the HU voice signal during dictation may be used to train a general HU voice model for the HU and, more specifically, a general model that can be used subsequently by the captioning system or application. Similarly, an HU voice signal captured during entry of a search phrase into a browser or an address into mapping software which is independent of the captioning application may be used to further train the general voice model for the HU. Here, the general voice model may be extremely accurate even before being used by an AU captioning application. In addition, an accuracy value for an HU's voice model may be calculated prior to an initial AU communication so that, if the accuracy value exceeds a high or required accuracy standard, automated text transcription may be used for an HU-AU call without requiring CA assistance, at least initially.
For instance, prior to an initial AU call, an HU device processor training to anHU voice signal may assign confidence factors to text words automatically transcribedby an ASR engine from HU voice signals. As the software trains to the HU voice, theconfidence factor values would continue to increase and eventually should exceedsome threshold level at which initial captioning during an AU communication wouldmeet accuracy requirements set by the captioning industry.
As another instance, an HU voice model stored by or accessible by the HUdevice can be used to automatically transcribe text for any AU device without requiringcontinual redevelopment or teaching of the HU voice model. Thus, one HU device maybe used to communicate with two separate hearing impaired persons using two differentAU devices without each sub-system redeveloping the HU voice model.
As yet another instance, an HU's smart phone or tablet device running acaptioning application may link directly to each of a relay and an AU's device to provideone or more of the HU voice signal, automated text and/or an HU voice model or voicecharacteristic model to each. This may be accomplished through two separate phonelines or via two channels on a single cellular line or via any other combination of twocommunication links.
In some cases an HU voice model may be generated by a relay or an AU'sdevice or some other entity (e.g., a third party ASR engine provider) over time and theHU voice model may then be stored on the HU device or rendered accessible via thatdevice for subsequent transcription. In this case, one robust HU voice model may bedeveloped for an HU by any system processor or server independent of the HU deviceand may then be used with any AU device and relay for captioning purposes.
Assessing/Indicating Communication Characteristics
In still other cases, at least one system processor may monitor and assess line and/or audio conditions associated with a call and may present some type of indication to each or a subset of an AU, an HU and a CA to help each or at least one of the parties involved in a call to assess communication quality. For instance, an HU device may be able to indicate to an AU and a CA if the HU device is being used as a speaker phone which could help explain an excessive error rate and help with a decision related to CA captioning involvement. As another instance, an HU's device may independently assess the level of non-HU voice signal noise being picked up by an HU device microphone and, if the determined noise level exceeds some threshold value either by itself or in relation to the signal strength of the HU voice signal, may perform some compensatory or corrective function. For example, one function may be to provide a signal to the HU indicating that the noise level is high. Another function may be to provide a noise level signal to the CA or the AU which could be indicated on one or both of the displays 50 and 18. Yet another function would be to offer one or more captioning options to any of the HU or AU or even to a text correcting CA when the noise level exceeds the threshold level. Here, the idea is that as the noise level increases, the likelihood of accurate ASR captioning will typically decrease and therefore more accurate and robust captioning options should be available.
As another instance, an HU device may transmit a known signal to an AUdevice which returns the known signal to the HU device and the HU device maycompare the received signal to the known signal to determine line or communication linkquality. Here, the HU may present a line quality value as shown at 808 in Fig. 24 for theHU to consider. Similarly, an AU device may generate a line quality value in a similarfashion and may present the line quality signal (not illustrated) to the AU to beconsidered.
In some cases system devices may monitor a plurality of different systemoperating characteristics such as line quality, speaker phone use, non-voice noise level,voice volume level, voice signal pace, etc., and may present one or more "coaching"indications to any one of or a subset of the HU, CA and AU for consideration. Here, thecoaching indications should help the parties to a call understand if there is somethingthey can do to increase the level of captioning accuracy. Here, in at least some casesonly the most impactful coaching indications may be presented and different entitiesmay receive different coaching indications. For instance, where noise at HU locationexceeds a threshold level, a noise indicating signal may only be presented to the HU.
Where the system also recognizes that line quality is only average, that indication may be presented to the AU and not to the HU while the HU's noise level remains high. If the HU moves to a quieter location, the noise level indication on the HU device may be replaced with a line quality indication. Thus, the coaching indications should help individual call entities recognize communication conditions that they can affect or that may be the cause of or may lead to poor captioning results for the AU.
In some cases coaching may include generating a haptic feedback or audible signal or both and a text message for an HU and/or an AU. To this end, while AUs routinely look at their devices to see captions during a caption assisted call, many HUs do not look at their devices during a call and simply rely on audio during communication. In the case of an AU, in some cases even when captioning is presented to an AU the AU may look away from their device display at times when their hearing is sufficient. By providing an additional haptic and/or audible signal, a user's attention can be drawn to their device display where a warning or call state text message may present more information such as, for instance, an instruction to "Speak louder" or "Move to a less noisy space", for consideration.
Text Lag Constraints
In some embodiments an AU may be able to set a maximum text lag time such that automated text generated by an ASR engine is used to drive an AU device screen 18 when a CA generated text lag reaches the maximum value. For instance, an AU may not want text to lag behind a broadcast HU voice signal by more than 7 seconds and may be willing to accept a greater error rate to stay within the maximum lag time period. Here, CA captioning/correction may proceed until the maximum lag time occurs at which point automated text may be used to fill in the lag period up to a current HU voice signal on the AU device and the CA may be skipped ahead to the current HU signal automatically to continue the captioning process. Again, here, any automated fill in text or text not corrected by a CA may be visually distinguished on the AU device display as well as on the CA display for consideration.
It has been recognized that many AUs using text to understand a broadcast HU voice signal prefer that the text lag behind the voice signal at least some short amount of time. For instance, an AU talking to an HU may stare off into space while listening to the HU voice signal and, only when a word or phrase is not understood, may look to text on display 18 for clarification. Here, if text were to appear on a display 18 immediately upon audio broadcast to an AU, the text may be several words beyond the misunderstood word by the time the AU looks at the display so that the AU would be required to hunt for the word. For this reason, in at least some embodiments, a short minimum text delay may be implemented prior to presenting text on display 18. Thus, all text would be delayed at least 2 seconds in some cases and perhaps longer where a text generation lag time exceeds the minimum lag value. As with other operating parameters, in at least some cases an AU may be able to adjust the minimum voice-to-text lag time to meet a personal preference.
It has been recognized that in cases where transcription switchesautomatically from a CA to an ASR engine when text lag exceeds some maximum lagtime, it will be useful to dynamically change the threshold period as a function of how acommunication between an HU and an AU is progressing. For instance, periods ofsilence in an HU voice signal may be used to automatically adjust the maximum lagperiod. For example, in some cases if silence is detected in an HU voice signal formore than three seconds, the threshold period to change from CA text to automatic textgeneration may be shortened to reflect the fact that when the HU starts speaking again,the CA should be closer to a caught up state. Then, as the HU speaks continuously fora period, the threshold period may again be extended. The threshold period prior toautomatic transition to the ASR engine to reduce or eliminate text lag may bedynamically changed based on other operating parameters. For instance, rate of errorcorrection by a CA, confidence factor average in ASR text, line quality, noiseaccompanying the HU voice signal, or any combination of these and other factors maybe used to change the threshold period.
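The dynamic threshold adjustment described above might be approximated as below; the base 7 second value, the specific adjustments for silence, ASR confidence, and CA correction rate, and the 2 second floor are illustrative assumptions.

BASE_THRESHOLD_S = 7.0  # assumed starting maximum lag

def lag_threshold(recent_silence_s: float,
                  ca_correction_rate: float,
                  asr_confidence: float) -> float:
    """Return the current maximum acceptable CA text lag in seconds."""
    threshold = BASE_THRESHOLD_S
    if recent_silence_s > 3.0:
        threshold -= 2.0          # CA should catch up during HU silence
    if asr_confidence > 0.95:
        threshold -= 1.0          # falling back to accurate ASR text is cheap
    if ca_correction_rate > 0.10:
        threshold += 1.0          # ASR is error-prone; tolerate more CA lag
    return max(threshold, 2.0)    # respect the AU's minimum text delay

def should_switch_to_asr(current_lag_s: float, **factors) -> bool:
    return current_lag_s > lag_threshold(**factors)

# should_switch_to_asr(9.0, recent_silence_s=4.0,
#                      ca_correction_rate=0.02, asr_confidence=0.97)  # -> True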
One aspect described above relates to an ASR engine recognizing specific orimportant phrases like questions (e.g., see phrase "Are you still there?") in Fig. 18 priorto CA text generation and presenting those phrases immediately to an AU upondetection. Other important phrases may include phrases, words or sound anomaliesthat typically signify "turn markers" (e.g., words or sounds often associated with achange in speaker from AU to HU or vice versa). For instance, if an HU utters thephrase "What do you think?" followed by silence, the combination including the silentperiod may be recognized as a turn marker and the phrase may be presentedimmediately with space markers (e.g., underlined spaces) between CA text and thephrase to be filled in by the CA text transcription once the CA catches up to the turnmarker phrase.
To this end, see the text at 731 in Fig. 22 where CA generated text is shownat 733 with a lag time indicated by underlined spaces at 735 and an ASR recognizedturn marker phrase presented at 737. In this type of system, in some cases the ASRengine will be programmed with a small set (e.g., 100-300) of common turn markerphrases that are specifically sought in an HU voice signal and that are immediatelypresented to the AU when detected. In some cases, non-text voice characteristics likethe change in sound that occurs at the end of a question which is often the signal for aturn marker may be sought in an HU voice signal and any ASR generated text withinsome prior period (e.g., 5 seconds, the previous 8 words, etc.) may be automaticallypresented to an AU.
Automatic Voice Signal Routing Based On Call Type
It has been recognized that some types of calls can almost always be accurately handled by an ASR engine. For instance, auto-attendant type calls can typically be transcribed accurately via an ASR. For this reason, in at least some embodiments, it is envisioned that a system processor at the AU device or at the relay may be able to determine a call type (e.g., auto-attendant or not, or some other call type routinely accurately handled by an ASR engine) and automatically route calls within the overall system to the best and most efficient/effective option for text generation. Thus, for example, in a case where an AU device manages access to an ASR operated by a third party and accessible via an internet link, when an AU places a call that is received by an auto-attendant system, the AU device may automatically recognize the answering system as an auto-attendant type and instead of transmitting the auto-attendant voice signal to a relay for CA transcription, may transmit the auto-attendant voice signal to the third party ASR engine for text generation.
In this example, if the call type changes mid-stream during its duration, theAU device may also transmit the received voice signal to a CA for captioning ifappropriate. For instance, if an interactive voice recognition auto-attendant systemeventually routes the AU's call to a live person (e.g., a service representative for acompany), once the live person answers the call, the AU device processor mayrecognize the person's voice as a non-auto-attendant signal and route that signal to aCA for captioning as well as to the ASR for voice model training. In these cases, theASR engine may be specially tuned to transcribe auto-attendant voice signals to textand, when a live HU gets on the line, would immediately start training a voice model forthat HU's voice signal.
Synchronizing Voice And Text For Playback
In cases or at times when HU voice signals are transcribed automatically to text via an ASR engine and a CA is only correcting ASR generated text, the relay may include a synchronizing function or capability so that, as a CA listens to an HU's voice signal during an error correction process, the associated text from the ASR is presented generally synchronously to the CA with the HU voice signal. For instance, in some cases an ASR transcribed word may be visually presented via a CA display 50 at substantially the same instant at which the word is broadcast to the CA to hear. As another instance, the ASR transcribed word may be presented one, two, or more seconds prior to broadcast of that word to the CA.
In still other cases, the ASR generated text may be presented for correctionvia a CA display 50 immediately upon generation and, as the CA controls broadcastspeed of the HU voice signal for correction purposes, the word or phraseinstantaneously audibly broadcast may be highlighted or visually distinguished in somefashion. To this end, see Fig. 23 where automated ASR generated text is shown at 748where a word instantaneously audibly broadcast to a CA (see 752) is simultaneouslyhighlighted at 750. Here, as the words are broadcast via CA headset 54, the textrepresentations of the words are highlighted or otherwise visually distinguished to helpthe error correcting CA follow along. Here, highlighting may be linked to the start timeof a word being broadcast, to the end time of the word being broadcast, or in any otherway to the start or end time of the word. For instance, in some cases a word may behighlighted one second prior to broadcast of the word and may remain highlighted forone second subsequent to the end time of the broadcast so that several words aretypically highlighted at a time generally around a currently audibly broadcast word.
As another example, see Fig. 23A where ASR generated text is shown at748A. Here, a word 752A instantaneously broadcast to a CA via headset 54 ishighlighted at 750A. In this case, however, ASR text scrolls up as words are audiblybroadcast to the CA so that a line of text including an instantaneously broadcast word isalways generally located at the same vertical height on the display screen 50 (e.g., justabove a horizontal center line in the exemplary embodiment in Fig. 23A). Here, byscrolling the text up, unless correcting text in a different line, the CA can simply focus onthe one line of text presented in stationary field 753 and specifically the highlighted wordat 750A to focus on the word audibly broadcast. In other cases it is contemplated thatthe highlight at 750A may in fact be a stationary word field and that even the line of textin field 753 may scroll from right to left so that the instantaneously broadcast word willbe located in a stationary word field generally near the center of the screen 50. In thisway the CA may be able to simply concentrate on one screen location to view thebroadcast word.
Referring still to Fig. 23A, a selectable button 751 (hereinafter a "captionsource switch button" unless indicated otherwise) allows a CA to manually switch fromthe ASR text generation to full CA assistance where the CA generates text and correctsthat text instead of starting with ASR generated text. In addition, a "seconds behind"field 755 is presented proximate the highlighted broadcast word 750A so that the CAhas ready access to that field to ascertain how far behind the CA is in terms of listeningto the HU voice message for correction. In addition, an HU silent field 757 is presentedthat indicates a duration of time between HU voice message segments during which theHU remains silent (e.g., does not speak). Here, in some cases the HU may simplypause to allow the AU to respond and that pause would be considered silence.
Referring still to Fig. 23A, field 755 indicates that the audible broadcast is only12.2 seconds behind despite the illustrated 20 seconds of HU silence at 757 and manyASR words that follow the instantaneously broadcast word at 750A. Here, a systemprocessor accounts for the 20 seconds of HU silence when calculating the secondsbehind value as the system can remove that silent period from CA consideration so thatthe CA can catch up more quickly. Thus, in the Fig. 23A example, the duration of timebetween when an HU actually uttered the words "restaurant" at 750A and "not" at 759may be 32.2 seconds but the system can recognize that the HU was silent during 20 ofthose seconds so that the seconds behind calculation may be 12.2 seconds as shown.
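The seconds behind computation that discounts HU silence could be implemented roughly as follows; the silence-span representation is an assumption, and the example reproduces the Fig. 23A arithmetic (a 32.2 second raw gap containing 20 seconds of silence yields 12.2 seconds behind).

def seconds_behind(ca_position_s, newest_asr_time_s, silence_spans):
    """Gap between the CA playback position and the newest ASR word, minus HU silence."""
    raw_gap = newest_asr_time_s - ca_position_s
    silence = sum(
        min(end, newest_asr_time_s) - max(start, ca_position_s)
        for start, end in silence_spans
        if end > ca_position_s and start < newest_asr_time_s
    )
    return max(raw_gap - silence, 0.0)

# 32.2 second raw gap containing a 20 second pause -> 12.2 seconds behind
print(round(seconds_behind(0.0, 32.2, [(5.0, 25.0)]), 1))  # 12.2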
In at least some cases when the seconds behind delay exceeds some threshold value, the system may automatically indicate that condition as a warning or alert to the CA. For instance, assume that the threshold delay is four seconds. Here, when the seconds behind value exceeds four seconds, in at least some cases, the seconds behind field may be highlighted or otherwise visually distinguished as an alert.
In Fig. 23A, field 755 is shown as left down to right cross hatched to indicate the color red as an alert because the four second delay threshold is exceeded.
In at least some cases it is contemplated that more sophisticated algorithmsmay be implemented for determining when to alert the CA to a circumstance where theseconds behind period becomes problematic. For instance, where a seconds behindduration is 12.2 seconds as in Fig. 23A, that magnitude of duration may not warrant analert if confidence factors associated with ASR generated text thereafter are allextremely high as accurate ASR text thereafter should enable the CA to catch uprelatively quickly to reduce the seconds behind period rapidly. For instance, where ASRtext confidence factors are high, the system may automatically double the broadcastrate of the HU voice signal so that the 12.2 second delay can be worked to a zero valuein half that time.
As another instance, because HUs speak at different rates at different times, rate of HU speaking or density of words spoken during a time segment may be used to qualify the delay between a broadcast word and a most recent ASR word generated.
For instance, assume a 15 second delay between when a word is broadcast to a CA and the time associated with the most recent ASR generated text. Here, in some cases an HU may utter 3 words during the 15 second period while in other cases the HU may have uttered 30 words during that same period. Clearly, the time required for a CA to work the 15 second delay downward is a function of the density of words uttered by the HU in the intervening time. Here, whether or not to issue the alert would be a function of word density during the delay period.
As yet one other instance, instead of assessing delay by a duration of time, the delay may be based on a number of words between a most recently generated ASR word and the word that is currently being considered by a CA (e.g., the most current word in an HU voice signal considered by the CA). Here, an alert may be issued to the CA when the CA is a threshold number of words behind the most recent ASR generated word. For example, the threshold may be 12 words.
Many other factors may be used to determine when to issue CA delay alerts.
For instance, a CA's metrics related to specific HU voice characteristics, voice signal quality factors, etc., may each be used separately or in combination with other factors to assess when an alert is prudent.
In addition to affecting when to issue a delay alert to a user, the above factorsmay be used to alter the seconds behind value in field 755 to reflect an anticipatedduration of time required by a specific CA to catch up to the most recently generatedASR text. For instance, in Fig. 23A if, based on one or more of the above factors, thesystem anticipates that it will take the CA 5 seconds to catch up on the 12.2 seconddelay, the seconds behind value may be 5.0 seconds as opposed to 12.2 (e.g., in acase where the system speeds up the rate of HU voice signal broadcast through highconfidence ASR text).
In at least some cases an error correcting CA will be able to skip back and forth within the HU voice signal to control broadcast of the HU voice signal to the CA.
For instance, as described above, a CA may have a foot pedal or other control interfacedevice useable to skip back in a buffered HU voice recording 5, 10, etc., seconds toreplay an HU voice signal recording. Here, when the recording skips back, thehighlighted text in representation 748 would likewise skip back to be synchronized withthe broadcast words. To this end, see Fig. 25 where, in at least some cases, a footpedal activation or other CA input may cause the recording to skip back to the word"pizza" which is then broadcast as at 764 and highlighted in text 748 as shown at 762.
In other cases, the CA may simply single tap or otherwise select any word presented ondisplay 50 to skip the voice signal play back and highlighted text to that word. Forinstance, in Fig. 25 icon 766 represents a single tap which causes the word "pizza" tobe highlighted and substantially simultaneously broadcast. Other word selectinggestures (e.g., a mouse control click, etc.) are contemplated.
In some embodiments when a CA selects a text word to correct, the voice signal replay may automatically skip to some word in the voice buffer relative to the selected word and may halt voice signal replay automatically until the correction has been completed. For instance, a double tap on the word "Pals" in Fig. 23 may cause that word to be highlighted for correction and may automatically cause the point in the HU voice replay to move backward to a location a few words prior to the selected word "Pals." To this end, see in Fig. 25 that the word "Pete's" that is still highlighted as being corrected (e.g., the CA has not confirmed a complete correction) has been typed in to replace the word "Pals" and the word "pizza" that precedes the word "Pete's" has been highlighted to indicate where the HU voice signal broadcast will again commence after the correction at 760 has been completed. While backward replay skipping has been described, forward skipping is also contemplated.
In some cases, when a CA selects a word in presented text for correction or at least to be considered for correction, the system may skip to a location a few words prior to the selected word and may re-present the HU voice signal starting at that point and ending a few words after that point to give a CA context in which to hear the word to be corrected. Thereafter, the system may automatically move back to a subsequent point in the HU voice signal at which the CA was when the word to be corrected was selected. For instance, again, in Fig. 25, assume that the HU voice broadcast to a CA is at the word "catch" 761 when the CA selects the word "Pete's" 760 for correction. In this case, the CA's interface may skip back in the HU voice signal to the word "pizza" at 762 and re-broadcast the phrase parts from the word "pizza" to the word "want" 763 to provide immediate context to the CA. After broadcasting the word "want", the interface would skip back to the word "catch" 761 and continue broadcasting the HU voice signal from that point on.
In at least some embodiments where an ASR engine generates automatictext and a CA is simply correcting that text prior to transmission to an AU, the ASRengine may assign a confidence factor to each word generated that indicates how likelyit is that the word is accurate. Here, in at least some cases, the relay server mayhighlight any text on the correcting CA's display screen that has a confidence factorlower than some threshold level to call that text to the attention of the CA for specialconsideration. To this end, see again Fig. 23 where various words (e.g., 777, 779, 781)are specially highlighted in the automatically generated ASR text to indicate a lowconfidence factor.
While AU voice signals are not presented to a CA in most cases for privacyreasons, it is believed that in at least some cases a CA may prefer to have some type ofindication when an AU is speaking to help the CA understand how a communication isprogressing. To this end, in at least some embodiments an AU device may sense anAU voice signal and at least generate some information about when the AU is speaking.
The speaking information, without word content, may then be transmitted in real time tothe CA at the relay and used to present an indication that the AU is speaking on the CAscreen. For instance, see again Fig. 23 where lines 783 are presented on display 50 toindicate that an AU is speaking. As shown, lines 783 are presented on a right side ofthe display screen to distinguish the AU's speaking activity from the text and other visualrepresentations associated with the HU's voice signal. As another instance, when theAU speaks, a text notice 797 or some graphical indicator (e.g., a talking head) may bepresented on the CA display 50 to indicate current speaking by an AU. While notshown it is contemplated that some type of non-content AU speaking indication like 783may also be presented to an AU via the AU's device to help the AU understand how thecommunication is progressing.
Sequential Short Duration Third Party Caption Requests
It has been recognized that some third party ASR systems available via the internet or the like tend to be extremely accurate for short voice signal durations (e.g., -30 seconds) after which accuracy becomes less reliable. To deal with ASR accuracy degradation during an ongoing call, in at least some cases where a third party ASR system is employed to generate automated text, the system processor (e.g., at the relay, in the AU device or in the HU device) may be programmed to generate a series of automatic text transcription requests where each request only transmits a short sub-set of a complete HU voice signal. For instance, a first ASR request may be limited to a first 15 seconds of HU voice signal, a second ASR request may be limited to a next 15 seconds of HU voice signal, a third ASR request may be limited to a third 15 seconds of HU voice signal, and so on. Here, each request would present the associated HU signal to the ASR system immediately and continuously as the HU voice signal is received and transcribed text would be received back from the ASR system during the second period. As the text is received back from the ASR system, the text would be cobbled together to provide a complete and relatively accurate transcript of the HU voice signal.
While the HU voice signal may be divided into consecutive periods in some cases, in other cases it is contemplated that the HU voice signal slices or sub-periods sent to the ASR system may overlap at least somewhat to ensure all words uttered by an HU are transcribed and to avoid a case where words in the HU voice signal are split among periods. For instance, voice signal periods may be 30 seconds long and each may overlap a preceding period by 10 seconds and a following period by 10 seconds to avoid split words. In addition to avoiding a split word problem, overlapping HU voice signal periods presented to an ASR system allows the system to use context represented by surrounding words to better (e.g., contextually) convert HU voiced words to text. Thus, a word at the end of a first 20 second voice signal period will be near the front end of the overlapping portion of a next voice signal period and therefore, typically, will have contextual words prior to and following the word in the next voice signal period so that a more accurate contextually considered text representation can be generated.
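A sketch of the overlapping slice approach follows, assuming 30 second slices with 10 second overlaps and hypothetical asr_request() and stitch() helpers; a real implementation would stream each slice as it is received rather than slicing a completed recording.

SLICE_S = 30.0     # seconds per ASR request (assumed)
OVERLAP_S = 10.0   # seconds shared with each neighboring slice (assumed)

def slice_bounds(total_s: float):
    """Yield (start, end) times for overlapping HU voice signal slices."""
    start = 0.0
    while start < total_s:
        yield start, min(start + SLICE_S, total_s)
        start += SLICE_S - OVERLAP_S

def transcribe_call(audio, total_s, asr_request, stitch):
    """Send each slice as its own short-duration request and stitch the returned text."""
    partials = [asr_request(audio, start, end) for start, end in slice_bounds(total_s)]
    # stitch() would drop words duplicated in the overlapped regions, e.g., by
    # aligning the tail of one partial transcript with the head of the next.
    return stitch(partials)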
In some cases, a system processor may employ two, three or more independent or differently tuned ASR systems to automatically generate automated text and the processor may then compare the text results and formulate a single best transcript representation in some fashion. For instance, once text is generated by each engine, the processor may poll for the most common words or phrases and then select the most common as the text to provide to an AU, to a CA, to a voice modeling engine, etc.
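A minimal sketch of such polling is shown below, assuming for simplicity that the engine outputs align word-for-word; a practical system would first align the transcripts (e.g., using time stamps) before voting.

# Illustrative sketch only; engines and their outputs are hypothetical.
from collections import Counter
from itertools import zip_longest

def vote_transcript(engine_outputs):
    """engine_outputs: transcripts of the same HU voice segment, one per ASR
    engine. Returns a transcript built from the most common word at each
    position."""
    word_lists = [t.split() for t in engine_outputs]
    voted = []
    for position in zip_longest(*word_lists, fillvalue=None):
        candidates = [w for w in position if w is not None]
        voted.append(Counter(candidates).most_common(1)[0][0])
    return " ".join(voted)

# Example with three differently tuned (hypothetical) engines:
print(vote_transcript(["call me after the game",
                       "call me after the gate",
                       "tall me after the game"]))   # -> "call me after the game"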
Default ASR, User Selects Call Assistance
In most cases automated text (e.g., ASR generated text) will be generated much faster than CA generated text, or at least consistently much faster. It has been recognized that in at least some cases an AU will prefer even uncorrected automated text to CA corrected text where the automated text is generated more rapidly and therefore is more in sync with an audio broadcast HU voice signal. For this reason, in at least some cases, a different and more complex voice-to-text triage process may be implemented. For instance, when an AU-HU call commences and the AU requires text initially, automated ASR generated text may initially be provided to the AU. If a good HU voice model exists for the HU, the automated text may be provided without CA correction, at least initially. If the AU, a system processor, or an HU determines that the automated text includes too many errors or if some other operating characteristic (e.g., line noise) that may affect text transcription accuracy is sensed, a next level of the triage process may link an error correcting CA to the call and the ASR text may be presented in essentially real time to the CA via display 50 simultaneously with presentation to the AU via display 18.
Here, as the CA corrects the automated text, corrections are automatically sent to the AU device and are indicated via display 18. Here, the corrections may be in-line (e.g., erroneous text replaced), shown above errors, shown after errors, visually distinguished via highlighting or the like, etc. Here, if too many errors persist from the AU's perspective, the AU may select an AU device button (e.g., see 68 again in Fig. 1) to request full CA transcription. Similarly, if an error correcting CA perceives that the ASR engine is generating too many errors, the error correcting CA may perform some action to initiate full CA transcription and correction. Similarly, a relay processor or even an AU device processor may detect that an error correcting CA is having to correct too many errors in the ASR generated text and may automatically initiate full CA transcription and correction.
In any case where a CA takes over for an ASR engine to generate text, the ASR engine may still operate on the HU voice signal to generate text and use that text and CA generated text, including corrections, to refine a voice model for the HU. At some point, once the voice model accuracy as tested against the CA generated text reaches some threshold level (e.g., 95% accuracy), the system may, again automatically or at the command of the transcribing CA or the AU, revert back to the CA corrected ASR text and may cut out the transcribing CA to reduce costs. Here, if the ASR engine eventually reaches a second higher accuracy threshold (e.g., 98% accuracy), the system may, again automatically or at the command of an error correcting CA or an AU, revert back to the uncorrected ASR text to further reduce costs.
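One possible way to organize the triage transitions described above is sketched below; the state names and the escalation hook are illustrative, while the 95% and 98% accuracy thresholds mirror the examples given in the text.

# Illustrative sketch of the captioning triage; not a definitive implementation.
ASR_ONLY, ASR_PLUS_CORRECTING_CA, FULL_CA = "asr", "asr+ca", "full_ca"

class CaptionTriage:
    def __init__(self, revert_to_corrected_asr=0.95, revert_to_raw_asr=0.98):
        self.state = ASR_ONLY
        self.revert_to_corrected_asr = revert_to_corrected_asr
        self.revert_to_raw_asr = revert_to_raw_asr

    def escalate(self):
        """Called when the AU, HU, CA or a processor flags too many errors."""
        self.state = ASR_PLUS_CORRECTING_CA if self.state == ASR_ONLY else FULL_CA

    def update_accuracy(self, asr_accuracy):
        """Called as ASR text is compared against CA generated/corrected text."""
        if self.state == FULL_CA and asr_accuracy >= self.revert_to_corrected_asr:
            self.state = ASR_PLUS_CORRECTING_CA   # cut out the transcribing CA
        elif self.state == ASR_PLUS_CORRECTING_CA and asr_accuracy >= self.revert_to_raw_asr:
            self.state = ASR_ONLY                 # cut out the correcting CA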
AU Accuracy-Speed Preference Selection
In at least some cases it is contemplated that an AU device may allow an AU to set a personal preference between text transcription accuracy and text speed. For instance, a first AU may have fairly good hearing and therefore may only rely on a text transcript periodically to identify a word uttered by an HU, while a second AU has extremely bad hearing and effectively reads every word presented on an AU device display. Here, the first AU may prefer text speed at the expense of some accuracy while the second AU may require accuracy even when speed of text presentation or correction is reduced. An exemplary AU device tool is shown as an accuracy/speed scale 770 in Fig. 18 where an accuracy/speed selection arrow 772 indicates a currently selected operating characteristic. Here, when arrow 772 is moved to the left, operating parameters like correction time, ASR operation, etc., are adjusted to increase accuracy at the expense of speed, and moving arrow 772 right on scale 770 increases speed of text generation at the expense of accuracy.
In at least some embodiments, when arrow 772 is moved to the right so speed is preferred over greater accuracy, the system may respond to the setting adjustment by opting for automated text generation as opposed to CA text generation. In other cases where a CA may still perform at least some error corrections despite a high speed setting, the system may limit the window of automated text that a CA is able to correct to a small time window trailing a current time. Thus, for instance, instead of allowing a CA to correct the last 30 seconds of automated text, the system may limit the CA to correcting only the most recent 7 seconds of text so that error corrections cannot lag too far behind current HU utterances.
Where an AU moves arrow 772 to the left so that speed is sacrificed for greater caption accuracy, the system may delay delivery of even automated text to an AU for some time so that at least some automated error corrections are made prior to delivery of initial text captions to an AU. The delay may even be until a CA has made at least some or even all caption corrections. Other ways of speeding up text generation or increasing accuracy at the expense of speed are contemplated.
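By way of example only, the sketch below maps a normalized accuracy/speed setting to a few operating parameters; other than the 30 second and 7 second correction windows mentioned above, the specific values are illustrative assumptions.

# Illustrative sketch; slider runs from 0.0 (max accuracy) to 1.0 (max speed).
def captioning_parameters(slider):
    if slider > 0.5:                              # speed preferred
        return {"source": "asr",                  # prefer automated text
                "ca_correction_window_s": 7,      # corrections trail current time closely
                "initial_delivery_delay_s": 0}    # deliver text immediately
    return {"source": "ca_corrected",             # accuracy preferred
            "ca_correction_window_s": 30,
            "initial_delivery_delay_s": 2}        # hold text briefly so early corrections land first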
Audio-Text Synchronization Adjustment
In at least some embodiments when text is presented to an error correcting CA via a CA display 50, the text may be presented at least slightly (e.g., 1/4 to 2 seconds) prior to broadcast of an associated HU voice signal. In this regard, it has been recognized that many CAs prefer to see text prior to hearing a related audio signal and link the two optimally in their minds when text precedes audio. In other cases specific CAs may prefer simultaneous text and audio and still others may prefer audio before text. In at least some cases it is contemplated that a CA workstation may allow a CA to set text-audio sync preferences. To this end, see exemplary text-audio sync scale 765 in Fig. 25 that includes a sync selection arrow 767 that can be moved along the scale to change text-audio order as well as the delay or lag between the two.
In at least some embodiments an on-screen tool akin to scale 765 and arrow 767 may be provided on an AU device display 18 to adjust HU voice signal broadcast and text presentation timing to meet an AU's preferences.
System Options Based On HU's Voice Characteristics
It has been recognized that some AUs can hear voice signals with a specific characteristic set better than other voice signals. For instance, one AU may be able to hear low pitch traditionally male voices better than high pitch traditionally female voice signals. In some embodiments an AU may perform a commissioning procedure whereby the AU tests capability to accurately hear voice signals having different characteristics and results of those capabilities may be stored in a system database.
The hearing capability results may then be used to adjust or modify the way text captioning is accomplished. For instance, in the above case where an AU hears low pitch voices well but not high pitch voices, if a low pitch HU voice is detected when a call commences, the system may use the ASR function more rapidly than in the case of a high pitched voice signal. Voice characteristics other than pitch may be used to adjust text transcription and ASR transition protocols in similar ways.
In some cases it is contemplated that an AU device or other system device may be able to condition an incoming HU voice signal so that the signal is optimized for a specific AU's hearing deficiency. For instance, assume that an AU only hears high pitch voices well. In this case, if a high pitch HU voice signal is received at an AU's device, the AU's device may simply broadcast that voice signal to the AU to be heard.
However, if a low pitch HU voice signal is received at the AU's device, the AU's device may modify that voice signal to convert it to a high pitch signal prior to broadcast to the AU so that the AU can better hear the broadcast voice. This automatic voice conditioning may be performed regardless of whether or not the system is presenting captioning to an AU.
In at least some cases where an HU device like a smart phone, tablet, computing device, laptop, smart watch, etc., has the ability to store data or to access data via the internet, a WIFI system or otherwise that is stored on a local or remote (e.g., cloud) server, it is contemplated that every HU device, or at least a subset used by specific HUs, may store an HU voice model for an associated HU to be used by a captioning application or by any software application run by the HU device. Here, the HU model may be trained by one or more applications run on the HU device or by some other application like an ASR system associated with one of the captioning systems described herein that is run by an AU device, the relay server, or some third party server or processor. Here, for example, in one instance, an HU's voice model stored on an HU device may be used to drive a voice-to-text search engine input tool to provide text for an internet search independent of the captioning system. The multi-use and perhaps multi-application trained HU voice model may also be used by a captioning ASR system during an AU-HU call. Here, the voice model may be used by an ASR application run on the HU device, run on the AU device, run by the relay server or run by a third party server.
In cases where an HU voice model is accessible to an ASR engine independent of an HU device, when an AU device is used to place a call to an HU device, an HU model associated with the number called may be automatically prepared for generating captions even prior to connection to the HU device. Where a phone or other identifying number associated with an HU device can be identified prior to an AU answering a call from the HU device, again, an HU voice model associated with the HU device may be accessed and readied by the captioning system for use prior to the answering action to expedite ASR text generation. Most people use one or a small number of phrases when answering an incoming phone call. Where an HU voice model is loaded prior to an HU answering a call, the ASR engine can be poised to detect one of the small number of greeting phrases routinely used to answer calls and to compare the HU's voice signal to the model to confirm that the voice model is for the specific HU that answers the call. If the HU's salutation upon answering the call does not match the voice model, the system may automatically link to a CA to start a CA controlled captioning process.
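The following sketch illustrates, under assumed interfaces (a voice_models store keyed by phone number, an engine load_model() call, an injected match-score function, and a hand-off callback), how a stored HU voice model might be readied before a call is answered and how a salutation mismatch could trigger CA controlled captioning.

# Illustrative sketch only; all interfaces shown are assumptions.
voice_models = {}   # phone number -> previously trained HU voice model (assumed store)

def prepare_for_call(hu_number, asr_engine):
    """Ready a stored voice model, if any, before the HU even answers."""
    model = voice_models.get(hu_number)
    if model is not None:
        asr_engine.load_model(model)   # load_model() is an assumed engine interface
    return model

def verify_salutation(model, salutation_audio, score_fn, on_mismatch, threshold=0.8):
    """score_fn and on_mismatch are injected stand-ins for a real voice-match
    score and for linking a CA to start CA controlled captioning."""
    if model is None or score_fn(model, salutation_audio) < threshold:
        on_mismatch()     # salutation does not match the model; hand off to a CA
        return False
    return True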
While at least some systems will include HU voice models, it should be appreciated that other systems may not and instead may rely on robust voice-to-text software algorithms that train to specific voices over relatively short durations so that every new call with an HU causes the system to rapidly train anew to a received HU voice signal. For instance, in many cases a voice model can be at least initially trained to a specific voice within tens of seconds, after which the model continues to train over the duration of a call to become more accurate as the call proceeds. In at least some of these cases there is no need for voice model storage.
Presenting Captions For AU Voice Messages
While a captioning system must provide accurate text corresponding to an HU voice signal for an AU to view when needed, typical relay systems for deaf and hard of hearing persons would not provide a transcription of an AU's voice signal. Here, generally, the thinking has been that an AU knows what she says in a voice signal and an HU hears that signal, and therefore text versions of the AU's voice were not necessary. This, coupled with the fact that AU captioning would have substantially increased the transcription burden on CAs (e.g., would have required CA revoicing or typing and correction of more voice signal (e.g., the AU voice signal)), meant that AU voice signal transcription simply was not supported. Another reason AU voice transcription was not supported was that at least some AUs, for privacy reasons, do not want both sides of conversations with HUs being listened to by CAs.
In at least some embodiments, it is contemplated that the AU side of a conversation with an HU may be transcribed to text automatically via an ASR engine and presented to the AU via a device display 18 while the HU side of the conversation is transcribed to text in the most optimal way given transcription triage rules or algorithms as described above. Here, the AU voice captions and AU voice signal would never be presented to a CA. Here, while AU voice signal text may not be necessary in some cases, in others it is contemplated that many AUs may prefer that text of their voice signals be presented to be referred back to or simply as an indication of how the conversation is progressing. Seeing both sides of a conversation helps a viewer follow the progress more naturally. Here, while the ASR generated AU text may not always be extremely accurate, accuracy in the AU text is less important because, again, the AU knows what she said.
Where an ASR engine automatically generates AU text, the ASR engine may be run by any of the system processors or devices described herein. In particularly advantageous systems the ASR engine will be run by the AU device 12 where the software that transcribes the AU voice to text is trained to the voice of the AU and therefore is extremely accurate because of the personalized training.
Thus, referring again to Fig. 1, for instance, in at least some embodiments, when an AU-HU call commences, the AU voice signal may be transcribed to text by AU device 12 and presented as shown at 822 in Fig. 26 without providing the AU voice signal to relay 16. The HU voice signal, in addition to being audibly broadcast via AU device 12, may be transmitted in some fashion to relay 16 for conversion to text when some type of CA assistance is required. Accurate HU text is presented on display 18 at 820. Thus, the AU gets to see both AU text, albeit with some errors, and highly accurate HU text. Referring again to Fig. 24, in at least some cases, AU and HU text may also be presented to an HU via an HU device (e.g., a smart phone) in a fashion similar to that shown in Fig. 26.
Referring still to Fig. 26, where both HU and AU text are generated and presented to an AU, the HU and AU text may be presented in staggered columns as shown, along with an indication of how each text representation was generated (e.g., see the titles at the top of each column in Fig. 26).
In at least some cases it is contemplated that an AU may, at times, not even want the HU side of a conversation to be heard by a CA for privacy reasons. Here, in at least some cases, it is contemplated that an AU device may provide a button or other type of selectable activator to indicate that total privacy is required and then to re-establish relay or CA captioning and/or correction again once privacy is no longer required. To this end, see the "Complete Privacy" button or virtual icon 826 shown on the AU device display 18 in Fig. 26. Here, it is contemplated that, while an AU-HU conversation is progressing and a CA generates/corrects text 820 for an HU's voice signal and an ASR generates AU text 822, if the AU wants complete privacy but still wants HU text, the AU would select icon 826. Once icon 826 is selected, the HU voice signal would no longer be broadcast to the CA and instead an ASR engine would transcribe the HU voice signal to automated text to be presented via display 18. Icon 826 in Fig. 26 would be changed to "CA Caption" or something to that effect to allow the AU to again start full CA assistance when privacy is less of a concern.
Other Triggers For Automated Catch Up Text
In addition to a voice-to-text lag exceeding a maximum lag time, there may be other triggers for using ASR engine generated text to catch an AU up to an HU voice signal. For instance, in at least some cases an AU device may monitor for an utterance from an AU using the device and may automatically fill in ASR engine generated text corresponding to an HU voice signal when any AU utterance is identified. Here, for example, where CA transcription is 30 seconds behind an HU voice signal, if an AU speaks, it may be assumed that the AU has been listening to the HU voice signal and is responding to the broadcast HU voice signal in real time. Because the AU responds to the up-to-date HU voice signal, there may be no need for an accurate text transcription of prior HU voice phrases and therefore automated text may be used to automatically catch up. In this case, the CA's transcription task would simply be moved up in time to a current real time HU voice signal automatically and the CA would not have to consider the intervening 30 seconds of HU voice for transcription or even correction. When the system skips ahead in the HU voice signal broadcast to the CA, the system may present some clear indication of the skip to the CA to avoid confusion. For instance, when the system skips ahead, a system processor may present a simultaneous warning on the CA display screen indicating that the system is skipping intervening HU voice signal to catch the CA up to real time.
As another example, when an AU device or other system device recognizes a turn marker in an HU voice signal, all ASR generated text that is associated with a lag time may be filled in immediately and automatically.
As still one other instance, an AU device or other device may monitor AU utterances for some specific word or phrase intended to trigger an update of text associated with a lag time. For instance, the AU device may monitor for the word "Update" and, when identified, may fill in the lag time with automated text. Here, in at least some cases, the AU device may be programmed to cancel the catch-up word "Update" from the AU voice signal sent to the HU device. Thus, here, the AU utterance "Update" would have the effect of causing ASR text to fill in a lag time without being transmitted to the HU device. Other commands may be recognized and automatically removed from the AU voice signal.
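A minimal sketch of the catch-up command handling is shown below; the callback names and the set of recognized command words are hypothetical.

# Illustrative sketch; operates on AU words recognized by the AU device's ASR.
CATCH_UP_WORDS = {"update"}

def handle_au_word(word, au_audio_segment, fill_lag_with_asr_text, send_to_hu):
    if word.lower().strip(".,!?") in CATCH_UP_WORDS:
        fill_lag_with_asr_text()      # present automated text for the lag period
        return                        # cancel the command word; nothing sent to the HU
    send_to_hu(au_audio_segment)      # normal AU speech passes through unchanged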
Thus, it should be appreciated that various embodiments of a semi-automated automatic voice recognition or text transcription system to aid hearing impaired persons when communicating with HUs have been described. In each system there are at least three entities and at least three devices and in some cases there may be a fourth entity and an associated fourth device. In each system there is at least one HU and associated device, one AU and associated device and one relay and associated device or sub-system, while in some cases there may also be a third party provider (e.g., a fourth party) of ASR services operating one or more servers that run ASR software. The HU device, at a minimum, enables an HU to enunciate words that are transmitted to an AU device and receives an AU voice signal and broadcasts that signal audibly for the HU to hear.
The AU device, at a minimum, enables an AU to enunciate words that are transmitted to an HU device, receives an HU voice signal and broadcasts that signal (e.g., audibly, or via Bluetooth where an AU uses a hearing aid) for the AU to attempt to hear, and receives or generates transcribed text corresponding to an HU voice signal and displays the transcribed text to an AU on a display to view.
The relay, at a minimum, at times, receives the HU voice signal and generates at least corrected text that may be transmitted to another system device.
In some cases where there is no fourth party ASR system, any of the other functions/processes described above may be performed by any of the HU device, AU device and relay server. For instance, the HU device in some cases may store an HU voice model and/or voice characteristics model, an ASR application and a software program for managing which text, ASR or CA generated, is used to drive an AU device.
Here, the HU device may link directly with each of the AU device and relay, and may operate as an intermediary therebetween.
As another instance, HU models, ASR software and caption control applications may be stored and used by the AU device processor or, alternatively, by the relay server. In still other instances different system components or devices may perform different aspects of a functioning system. For instance, an HU device may store an HU voice model which may be provided to an AU device automatically at the beginning of a call, and the AU device may transmit the HU voice model along with a received HU voice signal to a relay that uses the model to tune an ASR engine to generate automated text as well as provides the HU voice signal to a first CA for revoicing to generate CA text and a second CA for correcting the CA text. Here, the relay may transmit the transcribed text (e.g., automated and CA generated) to the AU device and the AU device may then select one of the received texts to present via the AU device screen. Here, CA captioning and correction and transmission of CA text to the AU device may be halted in total or in part at any time by the relay or, in some cases, by the AU device, based on various parameters or commands received from any parties (e.g., AU, HU, CA) linked to the communication.
In cases where a fourth party to the system operates an ASR engine in the cloud or otherwise, at a minimum, the ASR engine receives an HU voice signal at least some of the time and generates automated text which may or may not be used at times to drive an AU device display.
In some cases it is contemplated that ASR engine text (e.g., automated text) may be presented to an HU while CA generated text is presented to an AU, and a most recent word presented to the AU may be indicated in the text on the HU device so that the HU has a good sense of how far behind the AU is in following the HU's voice signal.
To this end, see Fig. 27 that shows an exemplary HU smart phone device 800 including a display 801 where text corresponding to an HU voice signal is presented for the HU to view at 848. The text 848 includes text already presented to an AU prior to and including the word "after" that is shown highlighted at 850, as well as ASR engine generated text subsequent to the highlight 850 that, in at least the illustrated embodiment, may not have been presented to the AU at the illustrated time. Here, an HU viewing display 801 can see where the AU is in receiving text corresponding to the HU voice signal. The HU may use the information presented as a coaching tool to help the HU regulate the speed at which the HU converses. In addition to indicating the most recent textual word presented to the AU, the most recent word audibly broadcast to the AU may be visually highlighted as well, as shown at 847.
To be clear, where an HU device is a smart phone or some other type of device that can run an application program to participate in a captioning service, many different linking arrangements between the AU, HU and a relay are contemplated. For instance, in some cases the AU and HU may be directly linked and there may be a second link or line from the AU to the relay for voice and data transmission when necessary between those two entities. As another instance, when an HU and AU are linked directly and relay services are required after the initial link, the AU device may cause the HU device to link directly to the relay and the relay may then link to the AU device so that the relay is located between the AU and HU devices and all communications pass through the relay. In still another instance, an HU device may link to the relay, the relay to the AU device, and the AU device to the HU device so that any communications, voice or data, between two of the three entities are direct without having to pass through the other entity (e.g., HU and AU voice signals would be directly between HU and AU devices, the HU voice signal would be direct from the HU device to the relay, and transcribed text associated with the HU voice would be directly passed from the relay to the AU device to be displayed to the AU). Here, any text generated at the relay to be presented via the HU device would be transmitted directly from the relay to the HU device and any text generated by either one of the AU or HU devices (e.g., via an ASR engine) would be directly transmitted to the receiving device. Thus, an HU device or captioning application run thereby may maintain a direct dial number or address for the relay and be able to link up to the relay automatically when CA or other relay services are required.
Referring now to Fig. 28, a schematic is shown of an exemplary semi-automated captioning system that is consistent with at least some aspects of the present disclosure. The system enables an HU using device 14 to communicate with an AU using AU device 12 where the AU receives text and HU voice signals via the AU device 12. Each of the HU and the AU link into a gateway server or other computing device 900 that is linked via a network of some type to a relay. HU voice signals are fed through a noise reducing audio optimizer to a 3 pole or path ASR switch device 904 that is controlled by an adaptive ASR switch controller 932 to select one of first, second and third text generating processes associated with switch output leads 940, 942 and 944, respectively. The first text generating process is an automated ASR text process wherein an ASR engine generates text without any input (e.g., data entry, correction, etc.) from any CA. The second text generating process is a process wherein a CA 908 revoices an HU voice signal or types to generate text corresponding to the HU voice signal and then corrects that text. The third text generating process is one wherein the ASR engine generates automated text and a correcting CA 912 makes corrections to the automated text. In the second process, the ASR engine operates in parallel with the CA, generating automated text alongside the CA generated and corrected text.
Referring still to Fig. 28, with switch 904 connected to output lead 940, the HU voice signal is only presented to ASR engine 906 which generates automated text corresponding to the HU voice which is then provided to a voice to text synchronizer 910. Here, synchronizer 910 simply passes the raw ASR text on through a correctable text window 916 to the AU device 12.
Referring again to Fig. 28, with switch 904 connected to output lead 942, the HU voice signal, in addition to being linked to the ASR engine, is presented to CA 908 for generating and correcting text via traditional CA voice recognition 920 and manual correction tools 924 via correction window 922. Here, corrected text is provided to the AU device 12 and is also provided to a text comparison unit or module 930. Raw text from the ASR engine 906 is also presented to comparison unit 930. Comparison unit 930 compares the two text streams received and calculates an ASR error rate which is output to switch control 932. Here, where the ASR error rate is low (e.g., below some threshold), control 932 may be operated to cut the text generating CA 908 out of the captioning process.
Referring still to Fig. 28, with switch 904 connected to output lead 944, the HU voice signal, in addition to being linked to the ASR engine, is fed through synchronizer 910 which delays the HU voice signal so that the HU voice signal lags the raw ASR text by a short period (e.g., 2 seconds). The delayed HU voice signal is provided to a CA 912 charged with correcting ASR text generated by engine 906. The CA 912 uses a keyboard or the like 914 to correct any perceived errors in the raw ASR text presented in window 916. The corrected text is provided to the AU device 12 and is also provided to the text comparison unit 930 for comparison to the raw ASR text. Again, comparison unit 930 generates an ASR error rate which is used by control 932 to operate switch device 904. The manual corrections by CA 912 are provided to a CA error tracking unit 918 which counts the number of errors corrected by the CA and compares that number to the total number of words generated by the ASR engine 906 to calculate a CA correction rate for the ASR generated raw text. The correction rate is provided to control 932 which uses that rate to control switch device 904.
Thus, in operation, when an HU-AU call first requires captioning, in at least some cases switch device 904 will be linked to output lead 942 so that full CA transcription and correction occurs in parallel with the ASR engine generating raw ASR text for the HU voice signal. Here, as described above, the ASR engine may be programmed to compare the raw ASR text and the CA generated text and to train to the HU's voice signal so that, over a relatively short period, the error rate generated by comparison unit 930 drops. Eventually, once the error rate drops below some rate threshold, control 932 controls switch device 904 to link to output lead 944 so that CA 908 is taken out of the captioning path and CA 912 is added. CA 912 receives the raw ASR text and corrects that text which is sent on to the AU device 12. As the CA corrects text, the ASR engine continues to train to the HU voice using the corrected errors.
Eventually, the ASR accuracy should improve to the point where the correction rate calculated by tracking unit 918 is below some threshold. Once the correction rate is below the threshold, control 932 may control switch 904 to link to output lead 940 to take the CA 912 out of the captioning loop, which causes the relatively accurate raw ASR text to be fed through to the AU device 12. As described above, in at least some cases the AU and perhaps a CA or the HU may be able to manually switch between captioning processes to meet preferences or to address perceived captioning problems.
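The switch control logic of Fig. 28 might be organized as sketched below; the error rate and correction rate thresholds are illustrative assumptions, while the lead designations correspond to output leads 940, 942 and 944.

# Illustrative sketch of adaptive switch control 932; not a definitive implementation.
LEAD_940, LEAD_942, LEAD_944 = "asr_only", "full_ca", "asr_plus_correcting_ca"

class SwitchController:
    def __init__(self, asr_error_threshold=0.05, correction_rate_threshold=0.02):
        self.position = LEAD_942                 # start with full CA transcription and correction
        self.asr_error_threshold = asr_error_threshold
        self.correction_rate_threshold = correction_rate_threshold

    def on_asr_error_rate(self, error_rate):     # reported by comparison unit 930
        if self.position == LEAD_942 and error_rate < self.asr_error_threshold:
            self.position = LEAD_944             # drop the transcribing CA, keep a correcting CA
        elif self.position != LEAD_942 and error_rate >= self.asr_error_threshold:
            self.position = LEAD_942             # bring full CA captioning back

    def on_ca_correction_rate(self, correction_rate):   # reported by tracking unit 918
        if self.position == LEAD_944 and correction_rate < self.correction_rate_threshold:
            self.position = LEAD_940             # feed raw ASR text straight through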
As described above, it has been recognized that at least some ASR engines are more accurate and more resilient during the first 30 +/- seconds of performing voice to text transcription. If an HU takes a speaking turn that is longer than 30 seconds, the engine has a tendency to freeze or lag. To deal with this issue, in at least some embodiments, all of an HU's speech or voice signal may be fed into an audio buffer and a system processor may examine the HU voice signal to identify any silent periods that exceed some threshold duration (e.g., 2 seconds). Here, a silent period would be detected whenever the HU voice signal audio is out of a range associated with a typical human voice. When a silent period is identified, in at least some cases the ASR engine is restarted and a new ASR session is created. Here, because the process uses an audio buffer, no portion of the HU's speech or voice signal is lost and the system can simply restart the ASR engine after the identified silent period and continue the captioning process after removing the silent period.
Because the ASR engine is restarted whenever a silent period of at least a threshold duration occurs, the system can be designed to have several advantageous features. First, the system can implement a dynamic and configurable range of silence or gap threshold. For instance, in some cases, the system processor monitoring for a silent period of a certain threshold duration can initially seek a period that exceeds some optimal relatively long length and can reduce the length of the threshold duration as the ASR captioning process nears a maximum period prior to restarting the engine.
Thus, for instance, where a maximum ASR engine captioning period is 30 seconds, initially the silent period threshold duration may be 3 seconds. However, after an initial period of captioning by an engine, the threshold duration may be reduced to 1.5 seconds.
Similarly, after 25 seconds of engine captioning, the threshold duration may be reduced further to one half of a second.
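The dynamic gap threshold might follow a schedule like the one sketched below; the 3 second, 1.5 second and 0.5 second values mirror the example above, while the 20 second breakpoint for the first reduction is an assumption because the text does not specify it.

# Illustrative sketch of a shrinking silence threshold within an assumed 30 second session.
def silence_threshold_s(seconds_into_session):
    if seconds_into_session < 20:      # early in the session, wait for a long natural gap
        return 3.0
    if seconds_into_session < 25:
        return 1.5
    return 0.5                         # near the session limit, accept any brief gap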
As another instance, because the system uses an audio buffer in this case, the system can "manufacture" a gap or silent period in which to restart an ASR engine, holding an HU's voice signal in the audio buffer until the ASR engine starts captioning anew. While the manufactured silent period is not as desirable as identifying a natural gap or silent period as described above, the manufactured gap is a viable option if necessary so that the ASR engine can be restarted without loss of HU voice signal.
In some cases it is contemplated that a hybrid silent period approach may be implemented. Here, for instance, a system processor may monitor for a silent period that exceeds 3 seconds in which to restart an ASR engine. If the processor does not identify a suitable 3-plus second period for restarting the engine within 25 seconds, the processor may wait until the end of any word and manufacture a 3 second period in which to restart the engine.
Where a silent period longer than the threshold duration occurs and the ASR engine is restarted, if the engine is ready for captioning prior to the end of the threshold duration, the processor can take out the end of the silent period and begin feeding the HU voice signal to the ASR engine prior to the end of the threshold period. In this way, the processor can effectively eliminate most of the silent period so that captioning proceeds quickly.
Restarting an ASR engine at various points within an HU voice signal has the additional benefit of making all hypothesis words (e.g., initially identified words prior to contextual correction based on subsequent words) firm in at least some embodiments. Doing so allows a CA correcting the text to make corrections or any other manipulations deemed appropriate for an AU immediately without having to wait for automated contextual corrections and avoids a case where a CA error correction might be replaced subsequently by an ASR engine correction.
In still other cases other hybrid systems are contemplated where a processor examines an HU voice signal for suitably long silent periods in which to restart an ASR engine and, where no such period occurs by a certain point in a captioning process, the processor commences another ASR engine captioning process which overlaps the first process so that no HU voice signal is lost. Here, the processor would work out which captioned words are ultimately used as final ASR output during the overlapping periods to avoid duplicative or repeated text.
Return On Audio Detector Feature
One other feature that may be implemented in some embodiments of this disclosure is referred to as a Return On Audio detector (ROA-Detector) feature. In this regard, a system processor receiving an HU voice signal ascertains whether or not the signal includes audio in a range that is typical for human speech during an HU turn and generates a duration of speech value equal to the number of seconds of speech received. Thus, for instance, in a ten second period corresponding to an HU voice signal turn, there may be 3 seconds of silence during which audio is not in the range of typical human speech and therefore the duration of speech value would be 7 seconds.
In addition, the processor detects the quantity of captions being generated by an ASR engine. The processor automatically compares the quantity of captions from the ASR with the duration of speech value to ascertain if there is a problem with the ASR engine.
Thus, for instance, if the quantity of ASR generated captions is substantially less than would be expected given the duration of speech value, a potential ASR problem may be identified. The idea here is that if the duration of speech value is low (e.g., 4 out of 10 seconds) while the caption quality value (based on CA error corrections or some other factor(s)) is also low, the low caption quality value is likely not associated with the quantity of speech signal to be captioned and instead is likely associated with an ASR problem. Where an ASR problem is likely, the likely problem may be used by the processor to trigger a restart of the ASR engine to generate a better result. As an alternative, where an ASR problem is likely, the problem may trigger initiation of a whole new ASR session. As still one other alternative, a likely ASR problem may trigger a process to bring a CA on line immediately or more quickly than would otherwise be the case.
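One simple way to express the Return On Audio comparison is sketched below; the expected words-per-second rate and the shortfall ratio are illustrative assumptions.

# Illustrative sketch of the ROA comparison; thresholds are assumed values.
EXPECTED_WORDS_PER_SECOND = 2.0     # assumed typical speaking rate
SHORTFALL_RATIO = 0.5               # assumed trigger point

def likely_asr_problem(duration_of_speech_s, asr_word_count):
    """True when far fewer captioned words came back than the amount of detected
    speech would predict, e.g. to trigger an ASR restart, a new ASR session, or
    an earlier CA hand-off."""
    expected = duration_of_speech_s * EXPECTED_WORDS_PER_SECOND
    return expected > 0 and asr_word_count < SHORTFALL_RATIO * expected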
In still other cases, when a likely ASR error is detected as indicated above, the ROA detector may retrieve the audio (i.e., the HU voice signal) that was originally sent to the ASR from a rolling buffer and replay/resend the audio to the ASR engine.
This replayed audio would be sent through a separate session simultaneously with any new sessions that are sending ongoing audio to the ASR. Here, the captions corresponding to the replayed audio would be sent to the AU device and inserted into a correct sequential slot in the captions presented to the AU. In addition, here, the ROA detector would monitor the text that comes back from the ASR and compare that text to the text retrieved during the prior session, modifying the captions to remove redundancies. Another option would be for the ROA detector to simply deliver a message to the AU device indicating that there was an error and that a segment of audio was likely not properly captioned. Here, the AU device would present the likely erroneous captions in some way that indicates a likely error (e.g., perhaps visually distinguished by a yellow highlight or the like).
In some cases it is contemplated that a phone user may want to have just in time (JIT) captions on their phone or other communication device (e.g., a tablet) during a call with an HU for some reason. For instance, when a smart phone user wants to remove the smart phone from her ear for a short period, the user may want to have text corresponding to an HU's voice presented during that period. Here, it is contemplated that a virtual "Text" or "Caption" button may be presented on the smart phone display screen, or a mechanical button may be provided on the device, which, when selected, causes an ASR to generate text for a preset period of time (e.g., 10 seconds) or until turned off by the device user. Here, the ASR may be on the smart phone device itself, may be at a relay, or may be at some other device (e.g., the HU's device). In other cases where a smart phone includes a motion sensor device or other sensor that can detect when a user moves the device away from her ear or when the user looks at the device (e.g., a face recognition or eye gaze sensor), the system may automatically present text to the AU upon a specific motion (e.g., pulling away from the user's ear) or upon recognizing that the user is likely looking at a display screen on the AU's device.
While HU voice profiles may be developed and stored for any HU calling an AU, in some embodiments, profiles may only be stored for a small set of HUs, such as, for instance, a set of favorites or contacts of an AU. For instance, where an AU has a list of ten favorites, HU voice profiles may be developed, maintained, and morphed over time for each of those favorites. Here, again, the profiles may be stored at different locations and by different devices including the AU device, a relay, via a third party service provider, or even an HU device where the HU earmarks certain AUs as having the HU as a favorite or a contact.
In some cases it may be difficult technologically for a CA to correct ASR captions. Here, instead of a CA correcting captions, another option would simply be for a CA to mark errors in ASR text as wrong and move along. Here, the error could be indicated to an AU via the display on an AU's device. In addition, the error could be used to train an HU voice profile and/or captioning model as described above. As another alternative, where a CA marks a word wrong, a correction engine may generate and present a list of alternative words for the CA to choose from. Here, using an on-screen tool, the CA may select a correct word option causing the correction to be presented to an AU as well as causing the ASR to train to the corrected word.
Metrics - Tracking And Reporting CA And ASR Accuracy
In at least some cases it is contemplated that it may be useful to run periodic tests on CA generated text captions to track CA accuracy or reliability over time. For instance, in some cases CA reliability testing can be used to determine when a particular CA could use additional or specialized training. In other cases, CA reliability testing may be useful for determining when to cut a CA out of a call to be replaced by automatic speech recognition (ASR) generated text. In this regard, for instance, if a CA is less reliable than an ASR application for at least some threshold period of time, a system processor may automatically cut the CA out, even if ASR quality remains below some threshold target quality level, if the ASR quality is persistently above the quality of CA generated text. As another instance, where CA quality is low, text from the CA may be fed to a second CA for either a first or second round of corrections prior to transmission to an AU device for display or, alternatively, a second relatively more skilled CA trained in handling difficult HU voice signals may be swapped into the transcription process in order to increase the quality level of the transcribed text. As still one other instance, CA reliability testing may be useful to a governing agency interested in tracking CA accuracy for some reason.
In at least some cases it has been recognized that, in addition to assessing CA captioning quality, it will be useful to assess how accurately an automated speech recognition system can caption the same HU voice signal, regardless of whether or not the quality values are used to switch the method of captioning. For instance, in at least some cases line noise or other signal parameters may affect the quality of the HU voice signal received at a relay and therefore a low CA captioning quality may be at least in part attributed to line noise and other signal processing issues. In this case, an ASR quality value for ASR generated text corresponding to the HU voice signal may be used as an indication of other parameters that affect CA captioning quality and therefore in part as a reason or justification for a low CA quality value. For instance, where an ASR quality value is 75% out of 100% and a CA quality value is 87% out of 100%, the low ASR quality value may be used to show that, given the relatively higher CA quality value, the CA value is in fact quite good despite being below a minimum target threshold. Line noise and other parameters may be measured in more direct ways via line sensors at a relay or elsewhere in the system and parameter values indicative of line noise and other characteristics may be stored along with CA quality values to consider when assessing CA caption quality.
Several ways to test CA accuracy and generate accuracy statistics are contemplated by the present disclosure. One system for testing and tracking accuracy may include a system where actual or simulated HU-AU calls are recorded for subsequent testing purposes and where HU turns (e.g., voice signal periods) in each call are transcribed and corrected by a CA to generate a true and highly accurate (e.g., approximately 100% accurate) transcription of the HU turns that is referred to hereinafter as the "truth". Here, metrics on the HU voice message speed, dynamic duration of speech value, complexity of voice message words, quality of voice message signal, voice message pitch, tone, etc., can all be predetermined and used to assess CA accuracy as well as to identify specific call types with specific characteristics that a CA does best with and others that the assistant has relatively greater difficulty handling.
During testing, without a CA knowing that a test is being performed, the test recording is presented to the CA as a new AU-HU call for captioning and the CA perceives the recording to be a typical HU-AU call. In many cases, a large number of recorded calls may be generated and stored for use by the testing system so that a CA never listens to the same test recording more than once. In some cases a system processor may track CAs and which test recordings each CA has been exposed to previously and may ensure that a CA only listens to any test recording once.
As a CA listens to a test recording, the CA transcribes the HU voice signal to text and, in at least some cases, makes corrections to the text. Because the CA generated text corresponds to a recorded voice signal and not a real time signal, the text is not forwarded to an AU device for display. The CA is unaware that the text is not forwarded to the AU device as this exercise is a test. The CA generated text is compared to the truth and a quality value is generated for the CA generated text (hereinafter a "CA quality value"). For instance, the CA quality value may be a percent accuracy representing the percent of HU voice signal words accurately transcribed to text. The CA quality value may also be affected by other factors like speed of the voice message, dynamic duration of speech value, complexity of voice message words, quality of voice message signal, voice message pitch, tone, etc.
In at least some cases different CA quality values may be generated for a single CA where each value is associated with a different subset of voice message and captioning characteristics. For instance, in a simple case, a first CA may have a high caption quality value associated with high pitch voices and a relatively lower caption quality value associated with low pitch voices. The same first CA may have a relatively high caption quality value for high pitched voices where a duration of speech value is relatively low (e.g., less than 50%) when compared to the quality value for a high pitched voice where the duration of speech value is relatively high (e.g., greater than 50%). Many other voice message characteristic subsets for qualifying caption quality values are contemplated.
The multiple caption quality values can be used to identify specific call types with specific characteristics that a CA does best with and others that the assistant has relatively greater difficulty handling. Incoming calls can be routed to CAs that are optimized (e.g., available and highly effective for calls with specific characteristics) to handle those calls. CA caption quality values and associated voice message characteristics are stored in a database for subsequent access.
In addition to generating one or more CA quality values that represent how accurately a CA transcribes voice to text, in at least some cases the system will be programmed to track and record transcription latency that can be used as a second type of quality factor referred to hereinafter as the "CA latency value". Here, the system may track instantaneous latency and use the instantaneous values to generate average and other statistical latency values. For instance, an average latency over an entire call may be calculated, an average latency over a most recent one minute period may be calculated, a maximum latency during a call, a minimum latency during a call, a latency average taking out the most latent 20% and least latent 20% of a call may be calculated and stored, etc. In some cases where both a CA quality value and CA latency values are generated, the system may combine the quality and latency values according to some algorithm to generate an overall CA service value that reflects the combination of accuracy and latency.
CA latency may also be calculated in other ways. For instance, in at least some cases a relay server may be programmed to count the number of words during a period that are received from an ASR service provider (see 1006 in Fig. 30) and to assume that the returned number of words over a minute duration represents the actual words per minute (WPM) spoken by an HU. Here, periods of HU silence may be removed from the period so that the word count more accurately reflects the WPM of the speaking HU. Then, the number of words generated by a CA for the same period may be counted and used along with the period duration minus silent periods to determine a CA WPM count. The server may then compare the HU's WPM to the CA WPM count to assess CA delay or latency.
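The WPM comparison might be computed as sketched below, where the ASR word count is taken as a proxy for the HU's actual WPM as described above; all names are illustrative.

# Illustrative sketch; silent periods are removed from the measurement window.
def wpm(word_count, period_s, silent_s):
    speaking_minutes = max(period_s - silent_s, 1e-6) / 60.0
    return word_count / speaking_minutes

def ca_lag_indicator(asr_words, ca_words, period_s, silent_s):
    hu_wpm = wpm(asr_words, period_s, silent_s)   # ASR word count taken as the HU's actual WPM
    ca_wpm = wpm(ca_words, period_s, silent_s)
    return hu_wpm - ca_wpm        # positive values indicate the CA is falling behind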
Where actual calls are used to generate CA metrics, in at least some cases call content is not persistently stored as either voice or text for subsequent access.
Instead, in these cases, only audio, caption and correction timing information (e.g., delay durations) is stored for each call. In other cases, in addition to the timing information, call characteristics (e.g., Hispanic voice, HU WPM rate, line signal quality, HU volume, tone, etc.) and/or error types (e.g., visible, invisible, minor, etc.) for each corrected and missed error may be stored.
Where pre-recorded test calls are used to generate CA metrics, in at least some cases, in addition to storing the timing, call characteristics and error types for each call, the system may store the complete call audio record with time stamps, the captioning record and the corrections record so that a system administrator has the ability to go back and view captioning and correction for an entire call to gain insights related to CA strengths and weaknesses.
In at least some cases the recorded call may also be provided to an ASR to generate automatic text. The ASR generated text may also be compared to the truth and an "ASR quality value" may be generated. The ASR quality value may be stored in a database for subsequent use or may be compared to the CA quality value to assess which quality value is higher or for some other purpose. Here, also, an ASR latency value or ASR latency values (e.g., max, min, average over a call, average over a most recent period, etc.) may be generated, as well as an overall ASR service value. Again, the ASR and CA values may be used by a system processor to determine when the ASR generated text should be swapped in for the CA generated text and vice versa.
Referring now to Fig. 29, an exemplary system 1000 for testing and tracking CA and ASR quality and latency values using pre-recorded HU-AU calls is illustrated.
System 1000 includes relay components represented by the phantom box at 1001 and a cloud based ASR system 1006 (e.g., a server that is linked to via the internet or some other type of computing network). Two sources of pre-generated information are maintained at the relay including a set of recorded calls at 1002 and a set of verified true transcripts at 1010, one truth or true transcript for each recorded call in the set 1002. Again, the recorded calls may include actual HU-AU calls or may include mock calls that occur between two knowing parties that simulate an actual call.
During testing, a connection is linked from a system server that stores the calls 1002 to a captioning platform as shown at 1004 and one of the recorded calls, hereinafter referred to as a test recording, is transmitted to the captioning platform 1004.
The captioning platform 1004 sends the received test recording to two targets including a CA at 1008 and the ASR server 1006 (e.g., Google Voice, IBM's Watson, etc.). The ASR generates an automated text transcript that is forwarded on to a first comparison engine at 1012. Similarly, the CA generates CA generated text which is forwarded on to a second comparison engine 1014. The verified truth text transcript at 1010 is provided to each of the first and second comparison engines 1012 and 1014. The first engine 1012 compares the ASR text to the truth and generates an ASR quality value and the second engine 1014 compares the CA generated text to the truth and generates a CA quality value, each of which is provided to a system database 1016 for storage until subsequently required.
In addition, in some cases, some component within the system 1000 generates latency values for each of the ASR text and the CA generated text by comparing the times at which words are uttered in the HU voice signal to the times at which the text corresponding thereto is generated. The latency values are represented by clock symbols 1003 and 1005 in Fig. 29. The latency values are stored in the database 1016 along with the associated ASR and CA quality values generated by the comparison engines 1012 and 1014.
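For illustration, a comparison engine could compute a quality value as the fraction of truth words correctly captioned, using a standard word-level edit distance as sketched below; this is one reasonable metric rather than the specific computation used by engines 1012 and 1014.

# Illustrative sketch of a comparison engine; not the disclosure's specific algorithm.
def quality_value(truth, hypothesis):
    t, h = truth.lower().split(), hypothesis.lower().split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i in range(len(t) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(t) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                          # deletion
                          d[i][j - 1] + 1,                          # insertion
                          d[i - 1][j - 1] + (t[i - 1] != h[j - 1])) # substitution
    errors = d[len(t)][len(h)]
    return max(0.0, 1.0 - errors / max(len(t), 1))   # e.g. 0.87 for an 87% CA quality value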
Another way to test CA quality contemplated by the present disclosure is to use real time HU-AU calls to generate quality and latency values. In these cases, a first CA may be assigned to an ongoing HU-AU call and may operate in a conventional fashion to generate transcribed text that corresponds to an HU voice signal where the transcribed text is transmitted back to the AU device for display substantially simultaneously as the HU voice is broadcast to the AU. Here, the first CA may perform any process to convert the HU voice to text such as, for instance, revoicing the HU voice signal to a processor that runs voice to text software trained to the voice of the CA to generate text and then correcting the text on a display screen prior to sending the text to the AU device for display. In addition, the CA generated text is also provided to a second CA along with the HU voice signal and the second CA listens to the HU voice signal, views the text generated by the first CA, and makes corrections to the first CA generated text. Having been corrected a second time, the text generated by the second CA is a substantially error free transcription of the HU voice signal, referred to hereinafter as the "truth". The truth and the first CA generated text are provided to a comparison engine which then generates a "CA quality value" similar to the CA quality value described above with respect to Fig. 29 which is stored for subsequent access in a database.
In addition, as is the case in Fig. 29, in the case of transcribing an ongoing HU-AU call, the HU voice signal may also be provided to a cloud based ASR server or service to generate automated speech recognition text during an ongoing call that can be compared to the truth (e.g., the second CA generated text) to generate an ASR quality value. Here, while conventional ASRs are fast, there will again be some latency in text generation and the system will be able to generate an ASR latency value.
Referring now to Fig. 30, an exemplary system 1020 for testing and tracking CA and ASR quality and latency values using ongoing HU-AU calls is illustrated.
Components in the Fig. 30 system 1020 that are similar to the components described above with respect to Fig. 29 are labeled with the same numbers and operate in a similar fashion unless indicated otherwise hereafter. In addition to an HU communication device 1040 and an AU communication device 1042 (e.g., a caption type telephone device), system 1020 includes relay components represented by the phantom box at 1021 and a cloud based ASR system 1006 akin to the cloud based system described above with respect to Fig. 29. Here there is no pre-generated and recorded call or pre-generated truth text as testing is done using an ongoing dynamic call. Instead, a second CA at 1030 corrects text generated by a first CA at 1008 to create a truth (e.g., essentially 100% accurate text). The truth is compared to the ASR generated text and the first CA generated text to create quality values to be stored in database 1016.
Referring still to Fig. 30, during testing, as in a conventional relay assisted captioning system, the AU device 1042 transmits an HU voice signal to the captioning platform at 1004. The captioning platform 1004 sends the received HU voice signal to two targets including a first CA at 1008 and the ASR server 1006 (e.g., Google Voice, IBM's Watson, etc.). The ASR generates an automated text transcript that is forwarded on to a first comparison engine at 1012. Similarly, the first CA generates CA generated text which is transmitted to at least three different targets. First, the first CA generated text, which may include text corrected by the first CA, is transmitted to the AU device 1042 for display to the AU during the call. Second, the first CA generated text is transmitted to the second comparison engine 1014. Third, the first CA generated text is transmitted to a second CA at 1030. The second CA at 1030 views the first CA generated text on a display screen and also listens to the HU voice signal and makes corrections to the first CA generated text, where the second CA generated text operates as a truth text or truth. The truth is transmitted to the second comparison engine at 1014 to be compared to the first CA generated text so that a CA quality value can be generated.
The CA quality value is stored in database 1016 along with one or more CA latency values.
Referring again to Fig. 30, the truth is also transmitted from the second CA at 1030 to the first comparison engine at 1012 to be compared to the ASR generated text so that an ASR quality value is generated which is also stored, along with at least one ASR latency value, in the database 1016.
Referring to Fig. 31, another embodiment of a testing relay system is shown at 1050 which is similar to the system 1020 of Fig. 30, albeit where the ASR service 1006 provides an initial text transcription to the second CA at 1052 instead of the CA receiving the initial text from the first CA. Here, the second CA generates the truth text which is again provided to the two comparison engines at 1012 and 1014 so that ASR and CA quality factors can be generated to be stored in database 1016.
The ASR text generation and quality testing processes are described above as occurring essentially in real time as a first CA generates text for a recorded or ongoing call. Here, real time quality and latency testing may be important where a dynamic triage transcription process is occurring where, for instance, ASR generated text may be swapped in for a cut out CA when the ASR generated text achieves some quality threshold, or a CA may be swapped in for ASR generated text if the ASR quality value drops below some threshold level. In other cases, however, quality testing may not need to occur in real time and instead may be performed off line for some purposes. For instance, where quality testing is only used to provide metrics to a government agency, the testing may be done off line.
In this regard, referring again to Fig. 29, in at least some cases where testing cannot be done on the fly as a CA at 1008 generates text, the CA text and the recorded HU voice signal associated therewith may be stored in database 1016 for subsequent access for generating the ASR text at 1006 as well as for comparing the CA generated text and the ASR generated text to the verified truth text from 1010. Similarly, referring again to Fig. 30, where real time quality and latency values are not required, at least the HU portion of a call may be stored in database 1016 for subsequent off line processing by ASR service 1006 and the second CA at 1030 and then for comparison to the truth at engines 1012 and 1014.
It should be appreciated that currently there are Federal and state regulations that prohibit storage of any part of a voice communication between two or more people without authorization from at least one of those persons. For this reason, in at least some cases it is contemplated that real voice recordings of AU-HU calls may only be used for training purposes after authorization is sought and received. Here, the same recording may be used to train multiple CAs. In other cases, "fake" AU-HU call recordings may be generated and used for training purposes so that regulations and AU and HU privacy concerns are not violated. Here, true transcripts of the fake calls can be generated and stored for use in assessing CA caption quality. One advantage of fake call recordings is that different qualities of HU voice signals can be simulated automatically to see how those qualities affect CA caption accuracy, speed, etc. For instance, a first CA may be much more accurate and faster than a second CA at captioning standard or poor definition or quality voice signals.
One advantage of generating quality and latency values in real time using realHU-AU calls is that there is no need to store calls for subsequent processing. Currentlythere are regulations in at least some jurisdictions that prohibit storing calls for privacyreasons and therefore off line quality testing cannot be done in these cases.
In at least some embodiments it is contemplated that quality and latency testing may only be performed sporadically and generally randomly so that the generated values represent an average of the overall captioning service. In other cases, while quality and latency testing may be periodic in general, it is contemplated that telltale signs of poor quality during transcription may be used to trigger additional quality and latency testing. For instance, in at least some cases where an AU is receiving ASR generated text and the AU selects an option to link to a CA for correction, the AU request may be used as a trigger to start the quality testing process on text received from that point on (e.g., quality testing will commence and continue for HU voice received as time progresses forward). Similarly, when an AU requests full CA captioning (e.g., revoicing and text correction), quality testing may be performed from that point forward on the CA generated text.
In other cases, it is contemplated that an HU-AU call may be stored for the duration of the call and that, at least initially, no quality testing may occur. Then, if an AU requests CA assistance, in addition to patching a CA into the call to generate higher quality transcription, the system may automatically patch in a second CA that generates truth text as in Fig. 30 for the remainder of the call. In addition or instead, when the AU requests CA assistance, the system may, in addition to patching a CA in to generate better quality text, also cause the recorded HU voice received prior to the request to be used by a second CA to generate truth text for comparison to the ASR generated text so that an ASR quality value for the text that caused the AU to request assistance can be generated. Here, the pre-CA assistance ASR quality value may be generated for the entire duration of the call prior to the request or just for a most recent sub-period (e.g., for the prior minute or 30 seconds). Here, in at least some cases, it is contemplated that the system may automatically erase any recorded portion of an HU-AU call immediately after any quality values associated therewith have been calculated.
In cases where quality values are only calculated for a most recent period of HU voice signal, recordings prior thereto may be erased on a rolling basis.
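The rolling erasure described above can be illustrated with a simple retention buffer. The Python sketch below is one possible realization under stated assumptions; the class name, the per-segment time stamps, and the 30 second retention window are hypothetical details for illustration only.

```python
from collections import deque
from typing import Optional
import time

# Minimal sketch of rolling erasure: only the most recent window (e.g., 30
# seconds) of HU voice segments is retained for possible quality testing;
# anything older is erased automatically.

class RollingVoiceBuffer:
    def __init__(self, window_seconds: float = 30.0):
        self.window = window_seconds
        self.segments = deque()  # (time stamp, audio bytes) pairs, oldest first

    def add(self, audio: bytes, timestamp: Optional[float] = None) -> None:
        ts = time.time() if timestamp is None else timestamp
        self.segments.append((ts, audio))
        self._erase_expired(ts)

    def _erase_expired(self, now: float) -> None:
        # Erase recordings older than the retention window on a rolling basis.
        while self.segments and now - self.segments[0][0] > self.window:
            self.segments.popleft()

    def snapshot(self):
        # Audio still available for quality testing if, e.g., the AU requests CA help.
        return [audio for _, audio in self.segments]
```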
As another instance, in at least some cases it is contemplated that sensors ata relay may sense line noise or other signal parameters and, whenever the line noise orother parameters meet some threshold level, the system may automatically start qualitytesting which may persist until the parameters no longer meet the threshold level. Here,there may be hysteresis built into the system so that once a threshold is met, at leastsome duration of HU voice signal below the threshold is required to halt the testingactivities. The parameter value or condition or circumstance that triggered the qualitytesting would, in this case, be stored along with the quality value and latencyinformation to add context to why the system started quality testing in the specificinstance.
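The threshold-with-hysteresis behavior described in the preceding paragraph may be implemented in many ways; the following Python sketch is one hedged example in which testing starts when a monitored parameter (e.g., a line noise estimate) meets a threshold and only halts after the parameter has remained below the threshold for a hold-off duration. The class name and the 10 second hold-off are illustrative assumptions.

```python
from typing import Optional

# Sketch of a line-noise trigger with hysteresis: once triggered, quality
# testing persists until the parameter stays below threshold long enough.

class NoiseTrigger:
    def __init__(self, threshold: float, holdoff_seconds: float = 10.0):
        self.threshold = threshold
        self.holdoff = holdoff_seconds
        self.testing = False
        self._below_since: Optional[float] = None

    def update(self, noise_level: float, now: float) -> bool:
        if noise_level >= self.threshold:
            self.testing = True          # start (or continue) quality testing
            self._below_since = None     # reset the below-threshold timer
        elif self.testing:
            if self._below_since is None:
                self._below_since = now
            elif now - self._below_since >= self.holdoff:
                self.testing = False     # enough clean signal; halt testing
        return self.testing
```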
As one other example, in a case where an AU signals dissatisfaction with acaptioning service at the end of a call, quality testing may be performed on at least aportion of the call. To this end, in at least some cases as an HU-AU call progresses, thecall may be recorded regardless of whether or not ASR or CA generated text ispresented to an AU. Then, at the end of a call, a query may be presented to the AUrequesting that the AU rate the AU's satisfaction with the call and captioning on somescale (e.g., a 1 through 10 quality scale with 10 being high). Here, if a satisfactionrating were low (e.g., less than 7) for some reason, the system may automatically usethe recorded HU voice or at least a portion thereof to generate a CA quality value in oneof the ways described above. For instance, the system may provide the text generatedby a first CA or by the ASR and the recorded HU voice signal to a second CA forgenerating truth and a quality value may be generated using the truth text for storage inthe database.
In still other cases where an AU expresses a low satisfaction rating for acaptioning service, prior to using a recorded HU voice signal to generate a quality value,the system server may request authorization to use the signal to generate a captioningquality value. For instance, after an AU indicates a 7 (out of 10) or lower on asatisfaction scale, the system may query the AU for authorization to check captioningquality by providing a query on the AU's device display and "Yes" and "No" options.
Here, if the yes option is selected, the system would generate the captioning qualityvalue for the call and memorialize that value in the system database 1016. In addition,if the system identifies some likely factor in a low quality assessment, the system maymemorialize that factor and present some type of feedback indicating the factor as alikely reason for the low quality value. For instance, if the system determines that theAU-HU link was extremely noisy, that factor may be memorialized and indicated to theAU as a reason for the poor quality captioning service.
As another instance, because it is the HU's voice signal that is recorded (e.g.,in some cases the AU voice signal may not be recorded) and used to generate thecaptioning quality value, authorization to use the recording to generate the quality valuemay be sought from an HU if the HU is using a device that can receive and issue anauthorization request at the end of a call. For instance, in the case of a call where anHU uses a standard telephone, if an AU indicates a low satisfaction rating at the end ofa call, the system may transmit an audio recording to the HU requesting authorization touse the HU voice signal to generate the quality value along with instructions to select"1" for yes and "2" for no. In other cases where an HU's device is a smart phone orother computing type device, the request may include text transmitted to the HU deviceand selectable "Yes" and "No" buttons for authorizing or not.
While an HU-AU call recording may be at least temporarily stored at a relay, in other cases it is contemplated that call recordings may be stored at an AU device or even at an HU device until needed to generate quality values. In this way, an HU or AU may exercise more control, or at least perceive that they exercise more control, over call content. Here, for instance, while a call may be recorded, the recording device may not release recordings unless authorization to do so is received from a device operator (e.g., an HU or an AU). Thus, for instance, if the HU voice signal for a call is stored on an HU device during the call and, at the end of the call, an AU expresses low satisfaction with the captioning service in response to a satisfaction query, the system may query the HU to authorize use of the HU voice to generate captioning quality values. In this case, if the HU authorizes use of the HU voice signal, the recorded HU voice signal would be transmitted to the relay to be used to generate captioning quality values as described above. Thus, the HU or AU device may serve as a sort of software vault for HU voice signal recordings that are only released to the relay after proper authorization is received from the HU or the AU, depending on system requirements.
As generally known in the industry, voice to text software accuracy is higher for software that is trained to the voice of a speaking person. Also known is that software can train to specific voices over short durations. Nevertheless, in most cases it is advantageous if software starts with a voice model trained to a particular voice so that caption accuracy can be high immediately upon transcription. Thus, for instance, when a specific HU calls an AU to converse, it would be advantageous if the ASR service at 1006 had access to a voice model for the specific HU. One way to do this would be to have the ASR service 1006 store voice models for at least the HUs that routinely call an AU (e.g., a top ten HU list for each AU) and, when an HU voice signal is received at the ASR service, the service would identify the HU voice signal either using recognition software that can distinguish one voice from others or via some type of identifier like the phone number of the HU device used to call the AU. Once the HU voice is identified, the ASR service accesses an HU voice model associated with the HU voice and uses that model to perform automated captioning.
One problem with systems that require an ASR service to store HU voice models is that HUs may prefer not to have their voice models stored by third party ASR service providers or at least not to have the models stored and associated with specific HUs. Another problem may be that regulatory agencies may not allow a third party ASR service provider to maintain HU voice models or at least models that are associated with specific HUs. One solution is that no information useable to associate an HU with a voice model may be stored by an ASR service provider. Here, instead of using an HU identifier like a phone number or other network address associated with an HU's device to identify an HU, an ASR server may be programmed to identify an HU's voice signal from analysis of the voice signal itself in an anonymous way. It is contemplated that voice models may be developed for every HU that calls an AU and may be stored in the cloud by the ASR service provider. Even in cases where there are thousands of stored voice models, an HU's specific model should be quickly identifiable by a processor or server.
Another solution may be for an AU device to store HU voice models for frequent callers where each model is associated with an HU identifier like a phone number or network address associated with a specific HU device. Here, when a call is received at an AU device, the AU device processor may use the number or address associated with the HU device to identify which voice model to associate with the HU device. Then, the AU device may forward the HU voice model to the ASR service provider 1006 to be used temporarily during the call to generate ASR text. Similarly, instead of forwarding an HU voice model to the ASR service provider, the AU device may simply forward an intermediate identification number or other identifier associated with the HU device to the ASR provider, and the provider may associate the number with a specific HU voice model stored by the provider to access an appropriate HU voice model to use for text transcription. Here, for instance, where an AU device supports ten different HU voice models for the 10 most recent HU callers, the models may be associated with numbers 1 through 10 and the AU device may simply forward one of the intermediate identifiers (e.g., "7") to the ASR provider 1006 to indicate which one of the ten voice models maintained by the ASR provider for the AU should be used with the transmitted HU voice.
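To illustrate the intermediate-identifier approach just described, the Python sketch below maps a caller's phone number to an identifier that could be forwarded to the ASR provider, which keeps the actual voice models. The class and method names and the dictionary-based store are assumptions for illustration only; they are not mandated by the system.

```python
# Illustrative sketch of an AU-device-side lookup: phone number -> intermediate
# identifier known to the ASR provider, which holds the actual voice models.

class VoiceModelDirectory:
    def __init__(self):
        self._ids = {}  # phone number -> intermediate identifier (e.g., 1-10)

    def register(self, phone_number: str, intermediate_id: int) -> None:
        self._ids[phone_number] = intermediate_id

    def id_for_call(self, phone_number: str):
        # Identifier to send to the ASR provider, or None when no model
        # exists yet and a generic model must be used or trained.
        return self._ids.get(phone_number)

directory = VoiceModelDirectory()
directory.register("+15551234567", 7)
print(directory.id_for_call("+15551234567"))  # 7 -> forwarded to ASR provider
```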
In other cases an ASR provider may develop and store voice models for each HU that calls a specific AU in a fashion that correlates those models with the AU's identity. Then, when the ASR provider receives a call from an AU caption device, the ASR provider may identify the AU and the associated HU voice models and use those models to identify the HU on the call and the model associated therewith.
In still other cases an HU device may maintain one or more HU voice modelsthat can be forwarded on to an ASR provider either through the relay or directly togenerate text.
Visible And Invisible Voice To Text Errors
In at least some cases other more complex quality analyses and statistics are contemplated that may be useful in determining better ways to train CAs as well as in assessing CA quality values. For instance, it has been recognized that voice to text errors can generally be split into two different categories referred to herein as "visible" and "invisible" errors. Visible errors are errors that result in text that, upon reading, is clearly erroneous while invisible errors are errors that result in text that, despite the error that occurred, makes sense in context. For instance, where an HU voices the phrase "We are meeting at Joe's restaurant for pizza at 9 PM", in a text transcription "We are meeting at Joe's rodent for pizza at 9 PM", the word "rodent" is a "visible" error in the sense that an AU reading the phrase would quickly understand that the word "rodent" makes no sense in context. On the other hand, if the HU's phrase were transcribed as "We are meeting at Joe's room for pizza at 9 PM", the erroneous word "room" is not contextually wrong and therefore cannot be easily discerned as an error. Where the word "restaurant" is erroneously transcribed as "room", an AU could easily get a wrong impression and for that reason invisible errors are generally considered worse than visible errors.
In at least some cases it is contemplated that some mechanism for distinguishing visible and invisible text transcription errors may be included in a relay quality testing system. For instance, where 10 errors are made during some sub-period of an HU-AU call, 3 of the errors may be identified as invisible while 7 are visible.
Here, because invisible errors typically have a worse effect on communicationeffectiveness, statistics that capture relative numbers of invisible to all errors should beuseful in assessing CA or ASR quality.
In at least some systems it is contemplated that a relay server may be programmed to automatically identify at least visible errors so that statistics related thereto can be captured. For instance, the server may be able to contextually examine text and identify words or phrases that simply make no sense and may identify each of those nonsensical errors as a visible error. Here, because invisible errors make contextual sense, there is no easy algorithm by which a processor or server can identify invisible errors. For this reason, in at least some cases a correcting CA (see 1053 in Fig. 31) may be required to identify invisible errors or, in the alternative, the system may be programmed to automatically use CA corrections to identify invisible errors. In this regard, any time a CA changes a word that initially made sense within a text phrase to another word that also contextually makes sense in the phrase, the system may recognize that type of correction as having been associated with an invisible error.
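The correction-based classification just described might be sketched as follows. The makes_sense_in_context scorer is a placeholder assumption (in practice it might be a language-model check), and the example "toy" scorer exists only to make the snippet runnable; none of these names come from the system itself.

```python
# Sketch: classify a CA correction as a "visible" or "invisible" error based
# on whether the erroneous word and its replacement both read naturally.

def classify_correction(sentence_words, index, original_word, corrected_word,
                        makes_sense_in_context) -> str:
    """Return 'invisible' when both the erroneous word and its replacement
    make sense in context; 'visible' when the erroneous word plainly did not
    (e.g., 'rodent' for 'restaurant')."""
    with_error = list(sentence_words)
    with_error[index] = original_word
    with_fix = list(sentence_words)
    with_fix[index] = corrected_word
    if makes_sense_in_context(with_error, index) and \
            makes_sense_in_context(with_fix, index):
        return "invisible"   # the AU could not have spotted the error
    return "visible"         # nonsensical in context; the AU can spot it

# Toy stand-in scorer that only flags one known nonsense word.
toy_scorer = lambda words, i: words[i] != "rodent"
print(classify_correction(["meet", "at", "Joe's", "rodent", "for", "pizza"],
                          3, "rodent", "restaurant", toy_scorer))  # visible
print(classify_correction(["meet", "at", "Joe's", "room", "for", "pizza"],
                          3, "room", "restaurant", toy_scorer))    # invisible
```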
In at least some cases it is contemplated that the decision to switchcaptioning methods may be tied at least in part to the types of errors identified during acall. For instance, assume that a CA is currently generating text corresponding to anHU voice signal and that an ASR is currently training to the HU voice signal but is notcurrently at a high enough quality threshold to cut out the CA transcription process.
Here, there may be one threshold for the CA quality value generally and another for the CA invisible error rate where, if either of the two thresholds is met, the system automatically cuts the CA out. For example, the threshold CA quality value may require 95% accuracy and the CA invisible error rate threshold may be 20% coupled with a 90% overall accuracy requirement. Thus, here, if the invisible error rate amounts to 20% or less of all errors and the overall CA text accuracy is above 90% (e.g., the invisible error rate is less than 2% of all words uttered by the HU), the CA may be cut out of the call and the ASR text relied upon for captioning. Other error types are contemplated, as is a system for distinguishing each of several error types from the others for statistical reporting and for driving the captioning triage process.
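The example cut-out rule above can be expressed compactly in code. The Python sketch below simply mirrors the numeric thresholds used in the example (95%, 20%, 90%); in practice these values would be configurable and the function name is an assumption.

```python
# Sketch of the example cut-out rule: drop the CA and rely on ASR text when
# either (a) overall CA accuracy meets 95%, or (b) invisible errors make up
# no more than 20% of all errors while overall accuracy is still above 90%.

def should_cut_out_ca(overall_accuracy: float, invisible_errors: int,
                      total_errors: int) -> bool:
    if overall_accuracy >= 0.95:
        return True
    invisible_rate = (invisible_errors / total_errors) if total_errors else 0.0
    return invisible_rate <= 0.20 and overall_accuracy > 0.90

print(should_cut_out_ca(0.93, 1, 10))  # True: 10% invisible, 93% accurate
print(should_cut_out_ca(0.93, 4, 10))  # False: 40% of errors are invisible
```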
In at least some cases the decision of when to transition from CA generated text to ASR generated text may be a function not just of a direct comparison of ASR and CA quality values but instead may be related to both the quality and the relative latency associated with different transcription methods. In addition, when to transition in some cases may be related to a combination of quality values, error types and relative latency as well as to user preferences.
Other triage processes for identifying which HU voice to text method shouldbe used are contemplated. For instance, in at least some embodiments when an ASRservice or ASR software at a relay is being used to generate and transmit text to an AUdevice for display, if an ASR quality value drops below some threshold level, a CA maybe patched in to the call in an attempt to increase quality of the transcribed text. Here,the CA may either be a full revoicing and correcting CA, just a correcting CA that startswith the ASR generated text and makes corrections or a first CA that revoices and asecond CA that makes corrections. In a case where a correcting CA is brought into acall, in at least some cases the ASR generated text may be provided to the AU devicefor display at the same time that the ASR generated text is sent to the CA for correction.
In that case, corrected text may be transmitted to the AU device for in line correctiononce generated by the CA. In addition, the system may track quality of the CAcorrected text and store a CA quality value in a system database.
In other cases when a CA is brought into a call, text may not be transmitted tothe AU device until the CA has corrected that text and then the corrected text may betransmitted.
In some cases, when a CA is linked to a call because the ASR generated textwas not of a sufficiently high quality, the CA may simply start correcting text related toHU voice signal received after the CA is linked to the call. In other cases the CA maybe presented with text associated with HU voice signal that was transcribed prior to theCA being linked to the call for the CA to make corrections to that text and then the CAmay continue to make corrections to the text as subsequent HU voice signal is received.
Thus, as described above, in at least some embodiments an HU'scommunication device will include a display screen and a processor that drives thedisplay screen to present a quality indication of the captions being presented to an AU.
Here, the quality characteristic may include some accuracy percentage, the actual textbeing presented to the AU, or some other suitable indication of caption accuracy or anaccuracy estimation. In addition, the HU device may present one or more options forupgrading the captioning quality such as, for instance, requesting CA correction ofautomated text captioning, requesting CA transcription and correction, etc.
Time Stamping Voice And Text
Various HU voice delay concepts have been described above in which an HU's voice signal broadcast is delayed in order to bring the voice signal broadcast more temporally in line with the associated captioned text. Thus, for instance, in a system that requires at least three seconds (and at times more) to transcribe an HU's voice signal to text for presentation, a system processor may be programmed to introduce a three second delay in HU voice broadcast to an AU to bring the HU voice signal broadcast more into simultaneous alignment with associated text generated by the system. As another instance, in a system where an ASR requires at least two seconds to transcribe an HU's voice signal to text for presentation to a correcting CA, the system processor may be programmed to introduce a two second delay in the HU voice that is broadcast to an AU to bring the HU voice signal broadcast more into temporal alignment with the ASR generated text.
In the above examples, the three and two second delays are simply based onthe average minimum voice-to-text delays that occur with a specific voice to text systemand therefore, at most times, will only imprecisely align an HU voice signal withcorresponding text. For instance, in a case where HU voice broadcast is delayed threeseconds, if text transcription is delayed ten seconds, the three second delay would beinsufficient to align the broadcast voice signal and text presentation. As anotherinstance, where the HU voice is delayed three seconds, if a text transcription isgenerated in one second, the three second delay would cause the HU voice to bebroadcast two seconds after presentation of the associated text. In other words, in thisexample, the three second HU voice delay would be too much delay at times and toolittle at other times and misalignment could cause AU confusion.
In at least some embodiments it is contemplated that a transcription systemmay assign time stamps to various utterances in an HU's voice signal and those timestamps may also be assigned to text that is then generated from the utterances so thatthe HU voice and text can be precisely synchronized per user preferences (e.g.,precisely aligned in time or, if preferred by an AU, with an HU's voice preceding ordelayed with respect to text by the same persistent period) when broadcast andpresented to the AU, respectively. While alignment per an AU's preferences may causean HU voice to be broadcast prior to or after presentation of associated text, hereinafter,unless indicated otherwise, it will be assumed that an AU's preference is that the HUvoice and related text be broadcast and presented simultaneously at substantially thesame time (e.g., within 1-2 seconds before or after). It should be recognized that in anyembodiment described hereafter where the description refers to aligned or simultaneousvoice and text, the same teachings will be applicable to cases where voice and text arepurposefully misaligned by a persistent period (e.g., always misaligned by 3 secondsper user preference).
Various systems are contemplated for assigning time stamps to HU voicesignals and associated text words and/or phrases. In a first relatively simple case, anAU device that receives an HU voice signal may assign periodic time stamps tosequentially received voice signal segments and store the HU voice signal segmentsalong with associated time stamps. The AU device may also transmit at least an initialtime stamp (e.g. corresponding to the beginning of the HU voice signal or the beginningof a first HU voice signal segment during a call) along with the HU voice signal to a relaywhen captioning is to commence.
In at least some embodiments the relay stores the initial time stamp inassociation with the beginning instant of the received HU voice signal and continues tostore the HU voice signal as it is received. In addition, the relay operates its own timerto generate time stamps for on-going segments of the HU voice signal as the voicesignal is received and the relay generated time stamps are stored along with associatedHU voice signal segments (e.g., one time stamp for each segment that corresponds tothe beginning of the segment). In a case where a relay operates an ASR engine or tapsinto a fourth party ASR service (e.g., Google Voice, IBM's Watson, etc.) where a CAchecks and corrects ASR generated text, the ASR engine generates automated text forHU voice segments in real time as the HU voice signal is received.
A CA computer at the relay simultaneously broadcasts the HU voicesegments and presents the ASR generated text to a CA at the relay for correction.
Here, the ASR engine speed will fluctuate somewhat based on several factors that are known in the speech recognition art, so it can be assumed that the ASR engine will translate a typical HU voice signal segment to text within anywhere from a fraction of a second (e.g., one tenth of a second) to 10 seconds. Thus, where the CA computer is configured to simultaneously broadcast HU voice and present ASR generated text for CA consideration, in at least some embodiments the relay is programmed to delay the HU voice signal broadcast dynamically for a period within the range of a fraction of a second up to the maximum number of seconds required for the ASR engine to transcribe a voice segment to text. Again, here, a CA may have control over the timing between text presentation and HU voice broadcast and may prefer one or the other of the text and voice to precede the other (e.g., HU voice to precede corresponding text by two seconds or vice versa). In these cases, the preferred delay between voice and text can be persistent and unchanging, which results in less CA confusion. Thus, for instance, regardless of the delay between an HU's initial utterance and ASR text generation, both the utterance and the associated ASR text can be persistently presented simultaneously in at least some embodiments.
After a CA corrects text errors in the ASR engine generated text, in at leastsome cases the relay transmits the time stamped text back to the AU caption device fordisplay to the AU. Upon receiving the time stamped text from the relay, the AU deviceaccesses the time stamped HU voice signal stored thereat and associates the text andHU voice signal segments based on similar (e.g., closest in time) or identical timestamps and stores the associated text and HU voice signal until presented andbroadcasted to the AU. The AU device then simultaneously (or delayed per userpreference) broadcasts the HU voice signal segments and presents the correspondingtext to the AU via the AU caption device in at least some embodiments.
A flow chart that is consistent with this simple first case of time stamping text segments is shown in Fig. 32 and will be described next. Referring also to Fig. 33, a system similar to the system described above with respect to Fig. 1 is illustrated where similar elements are labelled with the same numbers used in Fig. 1 and, unless indicated otherwise, operate in a similar fashion. The primary differences between the Fig. 1 system and the system described in Fig. 33 are that each of the AU caption device 12 and the relay 16 includes a memory device that stores, among other things, time stamped voice message segments corresponding to a received HU voice signal and that time stamps are transmitted between AU device 12 and relay server 30 (see 1034 and 1036).
Referring to Figs. 32 and 33, during a call between an HU using an HU device14 and an AU using AU device 12, at some point, captioning is required by the AU (e.g.,either immediately when the call commences or upon selection of a caption option bythe AU) at which point AU device 12 performs several functions. First, after captioningis to commence, at block 1102, the HU voice signal is received by the AU device 12. Atblock 1104, AU device 12 commences assignment and continues to assign periodictime stamps to the HU voice signal segments received at the AU device. The timestamps include an initial time stamp t0 corresponding to the instant in time whencaptioning is to commence or some specific instant in time thereafter as well asfollowing time stamps. In addition, at block 1104, AU device 12 commences storing thereceived HU voice signal along with the assigned time stamps that divide up the HUvoice signal into segments in AU device memory 1030.
Referring still to Figs. 32 and 33, at block 1106, AU device 12 transmits theHU voice signal segments to relay 16 along with the initial time stamp t0 correspondingto the instant captioning was initiated where the initial time stamp is associated with thestart of the first HU voice segment transmitted to the relay (see 1034 in Fig. 33). Atblock 1108, relay 16 stores the initial time stamp t0 along with the first HU voice signalsegment in memory 1032, runs its own timer to assign subsequent time stamps to theHU voice signal received and stores the HU voice signal segments and relay generatedtime stamps in memory 1032. Here, because both the AU device and the relay assignthe initial time stamp t0 to the same point within the HU voice signal and each assignsother stamps based on the initial time stamp, all of the AU device and relay time stampsshould be aligned assuming that each assigns time stamps at the same periodicintervals (e.g., every second).
In other cases, each of the AU device and the relay may assign second and subsequent time stamps having the form (t0 + t) where t is a period of time relative to the initial time stamp t0. Thus, for instance, a second time stamp may be (t0 + 1 sec), a third time stamp may be (t0 + 4 sec), etc. In this case, the AU device and relay may assign time stamps that have different periods, where the system simply aligns stamped text and voice when required based on the closest stamps in time.
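The closest-stamp alignment just described can be sketched as a simple search over sorted time stamps. The Python snippet below is one possible illustration under stated assumptions; the function name and the example stamp values are hypothetical.

```python
import bisect

# Sketch: match a text segment's time stamp to the stored HU voice segment
# whose time stamp is closest, when the two sides stamp at different periods.

def align_by_closest_stamp(voice_stamps, text_stamp: float) -> int:
    """Return the index of the stored voice segment closest in time to the
    time stamp carried by a text segment (voice_stamps must be sorted)."""
    pos = bisect.bisect_left(voice_stamps, text_stamp)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(voice_stamps)]
    return min(candidates, key=lambda i: abs(voice_stamps[i] - text_stamp))

voice_stamps = [0.0, 1.0, 2.0, 3.0]               # t0, t0+1s, t0+2s, t0+3s
print(align_by_closest_stamp(voice_stamps, 2.4))  # 2 -> segment stamped t0+2s
```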
Continuing, at block 1110, relay 16 runs an ASR engine to generate ASRengine text for each of the stored HU voice signal segments and stores the ASR enginetext with the corresponding time stamped HU voice signal segments. At block 1112,relay 16 presents the ASR engine text to a CA for consideration and correction. Here,the ASR engine text is presented via a CA computer display screen 32 while the HUvoice segments are simultaneously (e.g., as text is scrolled onto display 32) broadcastto the CA via headset 54. The CA uses display 32 and/or other interface devices tomake corrections (see block 1116) to the ASR engine text. Corrections to the text arestored in memory 1032 and the resulting text is transmitted at block 1118 to AU device12 along with a separate time stamp for each of the text segments (see 1036 in Fig. 33).
Referring yet again to Figs. 32 and 33, upon receiving the time stamped text, AU device 12 correlates the time stamped text with the HU voice signal segments and associated time stamps in memory 1030 and stores the text with the associated voice segments and related time stamps at block 1120. At block 1122, in some embodiments, AU device 12 simultaneously broadcasts and presents the correlated HU voice signal segments and text segments to the AU via an AU device speaker and the AU device display screen, respectively.
Referring still to Fig. 32, it should be appreciated that the time stamps appliedto HU voice signal segments and corresponding text segments enable the system toalign voice and text when presented to each of a CA and an AU. In other embodimentsit is contemplated that the system may only use time stamps to align voice and text forone or the other of a CA and an AU. Thus, for instance, in Fig. 32, the simultaneousbroadcast step at 1112 may be replaced by voice broadcast and text presentationimmediately when available and synchronous presentation and broadcast may only beavailable to the AU at step 1122. In a different system synchronous voice and text maybe provided to the CA at step 1112 while HU voice signal and caption text areindependently presented to the AU immediately upon reception at steps 1102 and 1122,respectively.
In the Fig. 32 process, the AU only transmits an initial HU voice signal timestamp to the relay corresponding to the instant when captioning commences. In othercases it is contemplated that AU device 12 may transmit more than one time stampcorresponding to specific points in time to relay 16 that can be used to correct any voiceand text segment misalignment that may occur during system processes. Thus, forinstance, instead of sending just the initial time stamp, AU device 12 may transmit timestamps along with specific HU voice segments every 5 seconds or every 10 seconds orevery 30 seconds, etc., while a call persists, and the relay may simply store each newlyreceived time stamp along with an instant in the stream of HU voice signal received.
In still other cases AU device 12 may transmit enough AU device generatedtime stamps to relay 16 that the relay does not have to run its own timer toindependently generate time stamps for voice and text segments. Here, AU device 12would still store the time stamped HU voice signal segments as they are received andstamped and would correlate time stamped text received back from the relay 16 in thesame fashion so that HU voice segments and associated text can be simultaneouslypresented to the AU.
A sub-process 1138 that may be substituted for a portion of the processdescribed above with respect to Fig. 32 is shown in Fig. 34, albeit where all AU devicetime stamps are transmitted to and used by a relay so that the relay does not have toindependently generate time stamps for HU voice and text segments. In the modifiedprocess, referring also and again to Fig. 32, after AU device 12 assigns periodic timestamps to HU voice signal segments at block 1104, control passes to block 1140 in Fig.34 where AU device 12 transmits the time stamped HU voice signal segments to relay16. At block 1142, relay 16 stores the time stamped HU voice signal segments afterwhich control passes back to block 1110 in Fig. 32 where the relay employs an ASRengine to convert the HU voice signal segments to text segments that are stored withthe corresponding voice segments and time stamps. The process described above withrespect to Fig. 32 continues as described above so that the CA and/or the AU arepresented with simultaneous HU voice and text segments.
In other cases it is contemplated that an AU device 12 may not assign anytime stamps to the HU voice signal and, instead, the relay or a fourth party ASR serviceprovider may assign all time stamps to voice and text signals to generate the correlatedvoice and text segments. In this case, after text segments have been generated foreach HU voice segment, the relay may transmit both the HU voice signal and thecorresponding text back to AU device 12 for presentation.
A process 1146 that is similar to the Fig. 32 process described above isshown in Fig. 35, albeit where the relay generates and assigns all time stamps to theHU voice signals and transmits the correlated time stamps, voice signals and text to theAU device for simultaneous presentation. In the modified process 1146, process steps1150 through 1154 in Fig. 35 replace process steps 1102 through 1108 in Fig. 32 andprocess steps 1158 through 1162 in Fig. 35 replace process steps 1118 through 1122 inFig. 32 while similarly numbered steps 1110 through 1116 are substantially identicalbetween the two processes.
Process 1146 starts at block 1150 in Fig. 35 where AU device 12 receives anHU voice signal from an HU device where the HU voice signal is to be captioned.
Without assigning any time stamps to the HU voice signal, AU device 12 links to a relay16 and transmits the HU voice signal to relay 16 at block 1152. At block 1154, relay 16uses a timer or clock to generate time stamps for HU voice signal segments after whichcontrol passes to block 1110 where relay 16 uses an ASR engine to convert the HUvoice signal to text which is stored along with the corresponding HU voice signalsegments and related time stamps. At block 1112, relay 16 simultaneously presentsASR text and broadcasts HU voice segments to a CA for correction and the CA viewsthe text and makes corrections at block 1116. After block 1116, relay 16 transmits thetime stamped text and HU voice segments to AU device 12 and that information isstored by the AU device as indicated at block 1160. At block 1162, AU device 12simultaneously broadcasts and presents corresponding HU voice and text segments viathe AU device display.
In cases where HU voice signal broadcast is delayed so that the broadcast is aligned with presentation of corresponding transcribed text, delay insertion points will be important in at least some cases or at some times. For instance, an HU may speak for 20 consecutive seconds where the system assigns a time stamp every 2 seconds. In this case, one solution for aligning voice with text would be to wait until the entire 20 second spoken message is transcribed and then broadcast the entire 20 second voice message and present the transcribed text simultaneously. This, however, is a poor solution as it would slow down HU-AU communication appreciably.
Another solution would be to divide up the 20 second voice message into 5 second periods with silent delays therebetween so that the transcription process can routinely catch up. For instance, here, during a first five second period plus a short transcription catch up period (e.g., 2 seconds), the first five seconds of the 20 second HU voice message is transcribed. At the end of the first 7 seconds of HU voice signal, the first five seconds of HU voice signal is broadcast and the corresponding text presented to the AU while the next 5 seconds of HU voice signal is transcribed.
Transcription of the second 5 seconds of HU voice signal may take another 7 seconds, which would mean that a 2 second delay or silent period would be inserted after the first five seconds of HU voice signal is broadcast to the AU. In other cases the ASR text and HU voice would be sent to the AU as soon as they are generated or received. In this case the 7 seconds described above would be the time required to complete the segment as opposed to the time required to get the first words to the AU for broadcast.
This process of inserting periodic delays into the HU voice broadcast and text presentation while transcription catches up continues. Here, while it is possible that the delays at the five second marks would fall at ideal times between consecutive natural phrases, more often than not the 5 second point delays would imperfectly divide natural language phrases, making it more, not less, difficult to understand the overall HU voice message.
A better solution is to insert delays between natural language phrases whenpossible. For instance, in the case of the 20 second HU voice signal example above, afirst delay may be inserted after a first 3 second natural language phrase, a seconddelay may be inserted after a second 4 second natural language phrase, a third delaymay be inserted after a third 5 second natural language phrase, a fourth delay may beinserted after a fourth 2 second natural language phrase and a fifth delay may beinserted after a fifth 2 second natural language phrase, so that none of the naturallanguage phrases during the voice message are broken up by intervening delays.
Software for identifying natural language phrases or natural breaks in an HU'svoice signal may use actual delays between consecutive spoken phrases as one proxyfor where to insert a transcription catch up delay. In some cases software may be ableto perform word, sentence and/or topic segmentation in order to identify naturallanguage phrases. Other software techniques for dividing voice signals into naturallanguage phrases are contemplated and should be used as appropriate.
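As one hedged illustration of the pause-based proxy mentioned above, the Python sketch below only allows a catch-up delay where the gap between consecutive spoken words exceeds a minimum silence. The word timing tuples and the 0.35 second gap are assumptions for illustration; any segmentation technique could be substituted.

```python
# Sketch: find candidate points for inserting transcription catch-up delays
# by looking for natural pauses between consecutive spoken words.

def phrase_break_points(word_timings, min_gap_seconds: float = 0.35):
    """word_timings: list of (word, start_sec, end_sec) tuples in order.
    Returns indices after which a catch-up delay may be inserted."""
    breaks = []
    for i in range(len(word_timings) - 1):
        _, _, end = word_timings[i]
        _, next_start, _ = word_timings[i + 1]
        if next_start - end >= min_gap_seconds:
            breaks.append(i)  # pause long enough to hide an inserted delay
    return breaks

timings = [("we", 0.0, 0.2), ("are", 0.25, 0.4), ("meeting", 0.45, 0.9),
           ("at", 1.6, 1.7), ("nine", 1.75, 2.0)]
print(phrase_break_points(timings))  # [2] -> break after "meeting"
```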
Thus, while some systems may assign perfectly periodic time stamps to HUvoice signals to divide the signals into segments, in other cases time stamps will beassigned at irregular time intervals that make more sense given the phrases that an HUspeaks, how an HU speaks, etc.
Voice Message Replay
Where time stamps are assigned to HU voice and text segments, voice segments can be more accurately selected for replay via selection of associated text.
For instance, see Fig. 36 that shows a CA display screen 50 with transcribed textrepresented at 1200. Here, as text is generated by a relay ASR engine and presentedto a CA, consistent with at least some of the systems described above, the CA mayselect a word or phrase in presented text via touch (represented by hand icon 1202) toreplay the HU voice signal associated therewith.
When a word is selected in the presented text several things will happen in atleast some contemplated embodiments. First, a current voice broadcast to the CA ishalted. Second, the selected word is highlighted (see 1204) or otherwise visuallydistinguished. Third, when the word is highlighted, the CA computer accesses the HUvoice segment associated with the highlighted word and re-broadcasts the voicesegment for the CA to re-listen to the selected word. Where time stamps are assignedwith short intervening periods, the time stamps should enable relatively precise replay ofselected words from the text. In at least some cases, the highlight will remain and theCA may change the highlighted word or phrase via standard text editing tools. Forinstance, the CA may type replacement text to replace the highlighted word withcorrected text. As another instance, the CA may re-voice the broadcast word or phraseso that software trained to the CA's voice can generate replacement text. Here, thesoftware may use the newly uttered word as well as the words that surround the utteredword in a contextual fashion to identify the replacement word.
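The re-broadcast lookup described above might be organized as in the following Python sketch, where each displayed word keeps the time stamp of the HU voice segment it came from so that touching a word retrieves the right audio. The class and attribute names are illustrative assumptions, not system requirements.

```python
# Sketch of a time-stamp-indexed replay store for CA word selection.

class CaptionReplayIndex:
    def __init__(self):
        self._segments = {}     # time stamp -> audio segment (bytes)
        self._word_stamps = []  # (word, time stamp) in display order

    def add_segment(self, stamp: float, audio: bytes, words) -> None:
        self._segments[stamp] = audio
        self._word_stamps.extend((w, stamp) for w in words)

    def audio_for_word(self, display_index: int) -> bytes:
        # Called when the CA touches the word at display_index; the live HU
        # voice broadcast is halted and this segment is re-broadcast.
        _, stamp = self._word_stamps[display_index]
        return self._segments[stamp]
```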
In some cases a "Resume" or other icon 1210 may be presented proximatethe selected word that can be selected via touch to continue the HU voice broadcastand text presentation at the location where the system left off when the CA selected theword for re-broadcast. In other cases, a short time (e.g., 1/4th second to 3 seconds)after rebroadcasting a selected word or phrase, the system may automatically revertback to the voice and text broadcast at the location where the system left off when theCA selected the word for re-broadcast.
While not shown, in some cases when a text word is selected, the system willalso identify other possible words that may correspond to the voice segment associatedwith the selected word (e.g., second and third best options for transcription of the HUvoice segment associated with the selected word) and those options may beautomatically presented for touch selection and replacement via a list of touchselectable icons, one for each option, similar to Resume icon 1210. Here, the optionsmay be presented in a list where the first list entry is the most likely substitute textoption, the second entry is the second most likely substitute text option, and so on.
Referring again to Fig. 36, in other cases when a text word is selected on aCA display screen 50, a relay server or the CA's computer may select an HU voicesegment that includes the selected word and also other words in an HU voice segmentor phrase that includes the selected word for re-broadcast to the CA so that the CA hassome audible context in which to consider the selected word. Here, when the phraselength segment is re-broadcast, the full text phrase associated therewith may behighlighted as shown at 1206 in Fig. 36. In some cases, the selected word may behighlighted or otherwise visually distinguished in one way and the phrase lengthsegment that includes the selected word may be highlighted or otherwise visuallydistinguished in a second way that is discernably different to the CA so that the CA isnot confused as to what was selected (e.g., see different highlighting at 1204 and 1206in Fig. 36).
In some cases a single touch on a word may cause the CA computer to re-broadcast the single selected word while highlighting the selected word and theassociated longer phrase that includes the selected word differently while a double tapon a word may cause the phrase that includes the selected word to be re-broadcast toprovide audio context. Where the system divides up an HU voice signal by naturalphrases, broadcasting a full phrase that includes a selected word should be particularlyuseful as the natural language phrase should be associated with a more meaningfulcontext than an arbitrary group of words surrounding the selected word.
Even if the system rebroadcasts a full phrase including a selected word, in atleast some cases CA edits will be made only to the selected word as opposed to the fullphrase. Thus, for instance, in Fig. 36 where a single word is selected but a phraseincluding the word is rebroadcast, any CA edit (e.g., text entry or text generated bysoftware in response to a revoiced word or phrase) would only replace the selectedword, not the entire phrase.
Upon selection of Resume icon 1210, the highlighting is removed from theselected word and the CA computer restarts simultaneously broadcasting the HU voicesignal and presenting associated transcribed text at the point where the computer leftoff when the re-broadcast word was selected. In some cases, the CA computer mayback up a few seconds from the point where the computer left off to restart thebroadcast to re-contextualize the voice and text presented to the CA as the CA againbegins correcting text errors.
In other cases, instead of requiring a user to select a "Resume" option, thesystem may, after a short period (e.g., one second after the selected word or associatedphrase is re-broadcast), simply revert back to broadcasting the HU voice signal andpresenting associated transcribed text at the point where the computer left off when there-broadcast word was selected. Here, a beep or other audibly distinguishable signalmay be generated upon word selection and at the end of a re-broadcast to audiblydistinguish the re-broadcast from broadcast HU voice. In other cases any re-broadcastvoice signal may be audibly modified in some fashion (e.g., higher pitch or tone, greatervolume, etc.) to audibly distinguish the re-broadcast from other HU voice signalbroadcast.
To enable a CA to select a phrase that includes more than one word forrebroadcast or for correction, in at least some cases it is contemplated that when a usertouches a word presented on the CA display device, that word will immediately be fullyhighlighted. Then, while still touching the initially selected and highlighted word, the CAcan slide her finger left or right to select adjacent words until a complete phrase to beselected is highlighted. Upon removing her finger from the display screen, thehighlighted phrase remains highlighted and revoicing or text entry can be used toreplace the entire highlighted phrase.
Referring now to Fig. 37, a screen shot akin to the screen shot shown in Fig.26 is illustrated at 50 that may be presented to an AU via an AU device display, albeitwhere an AU has selected a word from within transcribed text for re-broadcast. In atleast some embodiments, similar to the CA system described above, when an AUselects a word from presented text, the instantaneous HU voice broadcast and textpresentation is halted, the selected word is highlighted or otherwise visuallydistinguished as shown at 1230 and the phrase including the selected word may also bedifferently visually distinguished as shown at 1231. Beeps or other audible signals maybe generated immediately prior to and after re-broadcast of a voice signal segment.
When a word is selected, the AU device speaker (e.g., the speaker in associatedhandset 22) re-broadcasts the HU voice signal that is associated through the assignedtime stamp to the selected word. In other cases the AU device will re-broadcast theentire phrase or sub-phrase that includes the selected word to give audio context to theselected word.
Referring again to Fig. 37, when an AU selects a word for rebroadcasting, inat least some cases if that word is still on a CA's display screen when the AU selectsthe word, that word may be specially highlighted on the CA display to alert or indicate tothe CA that the AU had trouble understanding the selected word. To this end, see inFig. 36 that the word selected in Fig. 37 is highlighted on the exemplary CA displayscreen at 1201. Here, the CA may read the phrase including the word and eitherdetermine that the text is accurate or that a transcription error occurred. Where the textis wrong, the CA may correct the text or may simply ignore the error and continue onwith transcription of the continuing HU voice signal.
While the time stamping concept is described above with respect to a systemwhere an ASR initially transcribes an HU voice signal to text and a CA corrects the ASRgenerated text, the time stamping concept is also advantageously applicable to caseswhere a CA transcribes an HU voice signal to text and then corrects the transcribed textor where a second CA corrects text transcribed by a first CA. To this end, in at leastsome cases it is contemplated that an ASR may operate in the background of a CAtranscription system to generate and time stamp ASR text (e.g., text generated by anASR engine) in parallel with the CA generated text. A processor may be programmedto compare the ASR text and CA generated text to identify at least some matchingwords or phrases and to assign the time stamps associated with the matching ASRgenerated words or phrases to the matching CA generated text.
It is recognized that the CA text will likely be more accurate than the ASR textmost of the time and therefore that there will be differences between the two textstrings. However, some if not most of the time the ASR and CA generated texts willmatch so that many of the time stamps associated with the ASR text can be directlyapplied to the CA generated text to align the HU voice signal segments with the CAgenerated text. In some cases it is contemplated that confidence factors may begenerated for likely associated ASR and CA generated text and time stamps may onlybe assigned to CA generated text when a confidence factor is greater than somethreshold confidence factor value (e.g., 88/100). In most cases it is expected thatconfidence factors that exceed the threshold value will occur routinely and with shortintervening durations so that a suitable number of reliable time stamps can begenerated.
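One way to visualize the stamp transfer described above is sketched below in Python: words on which the two transcripts agree (used here as a simple proxy for a high-confidence match) carry their ASR time stamps over to the CA generated text. The use of difflib matching blocks is an assumption for illustration; the system only requires some confidence-scored comparison.

```python
from difflib import SequenceMatcher

# Sketch: transfer ASR time stamps to matching words in CA generated text.

def transfer_stamps(asr_words, asr_stamps, ca_words):
    """Return {ca_word_index: time_stamp} for CA words matched to ASR words."""
    stamps = {}
    matcher = SequenceMatcher(a=asr_words, b=ca_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            stamps[block.b + k] = asr_stamps[block.a + k]
    return stamps

asr = ["we", "are", "meeting", "at", "joes", "room"]
ca = ["we", "are", "meeting", "at", "joes", "restaurant"]
print(transfer_stamps(asr, [0.0, 0.3, 0.5, 1.0, 1.2, 1.5], ca))
# {0: 0.0, 1: 0.3, 2: 0.5, 3: 1.0, 4: 1.2} -- "restaurant" gets no stamp
```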
Once time stamps are associated with CA generated text, the stamps may beused to precisely align HU voice signal broadcast and text presentation to an AU or aCA (e.g., in the case of a second "correcting CA") as described above as well as tosupport re-broadcast of HU voice signal segments corresponding to selected text by aCA and/or an AU.
A sub-process 1300 that may be substituted for a portion of the Fig. 32process is shown in Fig. 38, albeit where ASR generated time stamps are applied to CAgenerated text. Referring also to Fig. 32, steps 1302 through 1310 shown in Fig. 38 areswapped into the Fig. 32 process for steps 1112 through 1118. Referring also to Fig.32, after an ASR engine generates and stores time stamped text segments for areceived HU voice signal segment, control passes to block 1302 in Fig. 38 where therelay broadcasts the HU voice signal to a CA and the CA revoices the HU voice signalto transcription software trained to the CA's voice and the software yields CA generatedtext.
At block 1304, a relay server or processor compares the ASR text to the CA generated text to identify high confidence "matching" words and/or phrases. Here, the phrase high confidence means that there is a high likelihood (e.g., 95% likely) that an ASR text word or phrase and a CA generated text word or phrase both correspond to the exact same HU voice signal segment. Characteristics analyzed by the comparing processor include identical or nearly identical multiple word strings in the compared texts, when text temporally appears in each text string relative to other assigned time stamps, easily transcribed words that both an ASR and a CA are highly likely to transcribe accurately, etc. In some cases time stamps associated with the ASR text are only assigned to the CA generated text when the confidence factor related to the comparison is above some threshold level (e.g., 88/100). Time stamps are assigned at block 1306 in Fig. 38.
At block 1308, the relay presents the CA generated text to the CA forcorrection and at block 1310 the relay transmits the time stamped CA generated textsegments to the AU device. After block 1310 control passes back to block 1120 in Fig.32 where the AU device correlates time stamped CA generated text with HU voicesignal segments previously stored in the AU device memory and stores the times, textand associated voice segments. At block 1122, the AU device simultaneouslybroadcasts and presents identically time stamped HU voice and CA generated text toan AU. Again, in some cases, the AU device may have already broadcast the HU voicesignal to the AU prior to block 1122. In this case, upon receiving the text, the text maybe immediately presented via the AU device display to the AU for consideration. Here,the time stamped HU voice signal and associated text would only be used by the AUdevice to support synchronized HU voice and text re-play or representation.
In some cases the time stamps assigned to a series of text and voicesegments may simply represent relative time stamps as opposed to actual time stamps.
For instance, instead of labelling three consecutive HU voice segments with actualtimes 3:55:45AM; 3:55:48AM; 3:55:51AM …, the three segments may be labelled t0, t1,t2, etc., where the labels are repeated after they reach some maximum number (e.g.,t20). In this case, for instance, during a 20 second HU voice signal, the 20 secondsignal may have five consecutive labels t0, t1, t2, t3 and t4 assigned, one every fourseconds, to divide the signal into five consecutive segments. The relative time labelscan be assigned to HU voice signal segments and also associated with specifictranscribed text segments.
In at least some cases it is contemplated that the rate of time stampassignment to an HU voice signal may be dynamic. For instance, if an HU is routinelysilent for long periods between intermittent statements, time stamps may only beassigned during periods while the HU is speaking. As another instance, if an HUspeaks slowly at times and more rapidly at other times, the number of time stampsassigned to the user's voice signal may increase (e.g., when speech is rapid) anddecrease (e.g., when speech is relatively slow) with the rate of user speech. Otherfactors may affect the rate of time stamps applied to an HU voice signal.
While the systems described above are described as ones where time stamps are assigned to an HU voice signal by either or both of an AU's device and a relay, in other cases it is contemplated that other system devices or processors may assign time stamps to the HU voice signal, including a fourth party ASR engine provider (e.g., IBM's Watson, Google Voice, etc.). In still other cases where the HU device is a computer (e.g., a smart phone, a tablet type computing device, a laptop computer), the HU device may assign time stamps to the HU voice signal and transmit them to other system devices that need time stamps. All combinations of system devices assigning new or redundant time stamps to HU voice signals are contemplated.
In any case where time stamps are assigned to voice signals and textsegments, words, phrases, etc., the engine(s) assigning the time stamps may generatestamps indicating any of (1) when a word or phrase is voiced in an HU voice signalaudio stream (e.g., 16:22 to 16:22:5 corresponds to the word "Now") and (2) the time atwhich text is generated by the ASR for a specific word (e.g., "Now" generated at 16:25).
Where a CA generates text or corrects text, a processor related to the relay may also generate time stamps indicating when a CA generated word is generated as well as when a correction is generated.
In at least some embodiments it is contemplated that any time a CA falls behind when transcribing an HU voice signal or when correcting an ASR engine generated text stream, the speed of the HU voice signal broadcast may be automatically increased or sped up as one way to help the CA catch up to a current point in an HU-AU call. For instance, in a simple case, any time a CA caption delay (e.g., the delay between an HU voice utterance and CA generation of text or correction of text associated with the utterance) exceeds some threshold (e.g., 12 seconds), the CA interface may automatically double the rate of HU signal broadcast to the CA until the CA catches up with the call.
In at least some cases the rate of broadcast may be dynamic between a nominal value representing the natural speaking speed of the HU and a maximum rate (e.g., three times the natural HU voice speed), and the instantaneous rate may be a function of the degree of captioning delay. Thus, for instance, where the captioning delay is only 4 seconds or less, the broadcast rate may be "1" representing the natural speaking speed of the HU, if the delay is between 4 and 8 seconds the broadcast rate may be "2" (e.g., twice the natural speaking speed), and if the delay is greater than 8 seconds, the broadcast rate may be "3" (e.g., three times the natural speaking speed).
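The tiered rate selection just described maps directly onto a small lookup. The sketch below assumes the 4 and 8 second thresholds and the 1x/2x/3x rates used in the example; the function name is not from the disclosure.

def broadcast_rate(caption_delay_seconds):
    """Return an HU voice broadcast rate multiplier based on how far
    the CA's captioning lags the live call (sketch of the tiered example)."""
    if caption_delay_seconds <= 4:
        return 1.0   # natural HU speaking speed
    if caption_delay_seconds <= 8:
        return 2.0   # twice the natural speaking speed
    return 3.0       # maximum catch-up rate

assert broadcast_rate(3) == 1.0
assert broadcast_rate(6) == 2.0
assert broadcast_rate(12) == 3.0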
In other cases the dynamic rate may be a function of other factors such as but not limited to the rate at which an HU utters words, perceived clarity in the connection between the HU and AU devices or between the AU device and the relay or between any two components within the system, the number of corrections required by a CA during some sub-call period (e.g., the most recent 30 seconds), statistics related to how accurately a CA can generate text or make text corrections at different speaking rates, some type of set AU preference, some type of HU preference, etc.
In some cases the rate of HU voice broadcast may be based on ASR confidence factors. For instance, where an ASR assigns a high confidence factor to a first 15 second portion of an HU voice signal and a low confidence factor to the next 10 seconds of the HU voice signal, the HU voice broadcast rate may be set to twice the rate of HU speaking speed during the first 15 second period and then be slowed down to the actual HU speaking speed during the next 10 second period or to some other percentage of the actual HU speaking speed (e.g., 75% or 125%, etc.).
In some cases the HU broadcast rate may be at least in part based on characteristics of an HU's utterances. For instance, where an HU's volume on a specific word is substantially increased or decreased, the word (or phrase including the word) may always be presented at the HU speaking speed (e.g., at the rate uttered by the HU). In other cases, where the volume of one word within a phrase is stressed, the entire phrase may be broadcast at speaking speed so that the full effect of the stressed word can be appreciated. As another instance, where an HU draws out pronunciation of a word such as "Well….." for 3 seconds, the word (or phrase including the word) may be presented at the spoken rate.
In some cases the HU voice broadcast rate may be at least in part based on words spoken by an HU or on content expressed in an HU's spoken words. For instance, simple words that are typically easy to understand, including "Yes", "No", etc., may be broadcast at a higher rate than complex words like medical diagnoses, multi-syllable terms, etc.
In cases where the system generates text corresponding to both HU and AU voice signals, in at least some embodiments it is contemplated that during normal operation only text associated with the HU signal may be presented to an AU and that the AU text may only be presented to the AU if the AU goes back in the text record to review the text associated with a prior part of a conversation. For instance, if an AU scrolls back in a conversation 3 minutes to review prior discussion, ASR generated AU voice related text may be presented at that time along with the HU text to provide context for the AU viewing the prior conversation.
In the systems described above, whenever a CA is involved in a caption assisted call, the CA considers an entire HU voice signal and either generates a complete CA generated text transcription of that signal or corrects ASR generated text errors while considering the entire HU voice signal. In other embodiments it is contemplated that where an ASR engine generates confidence factors, the system may only present sub-portions of an HU voice signal to a CA that are associated with relatively low confidence factors for consideration to speed up the error correction process. Here, for instance, where ASR engine confidence factors are high (e.g., above some high factor threshold) for a 20 second portion of an HU voice signal and then are low for the next 10 seconds, a CA may only be presented the ASR generated text and the HU voice signal may not be broadcast to the CA during the first 20 seconds while substantially simultaneous HU voice and text are presented to the CA during the following 10 second period so that the CA is able to correct any errors in the low confidence text. In this example, it is contemplated that the CA would still have the opportunity to select an interface option to hear the HU voice signal corresponding to the first 20 second period or some portion of that period if desired.
In some cases only a portion of an HU voice signal corresponding to low confidence ASR engine text may be presented at all times and in other cases, this technique of skipping broadcast of HU voice associated with high confidence text may only be used by the system during threshold catch up periods of operation. For instance, the technique of skipping broadcast of HU voice associated with high confidence text may only kick in when a CA text correction process is delayed from an HU voice signal by 20 or more seconds (e.g., a threshold period).
In particularly advantageous cases, low confidence text and associated voice may be presented to a CA at normal speaking speed and high confidence text and associated voice may be presented to a CA at an expedited speed (e.g., 3 times normal speaking speed) when a text presentation delay (e.g., the period between the time an HU uttered a word and the time when a text representation of the word is presented to the CA) is less than a maximum latency period, and if the delay exceeds the maximum latency period, high confidence text may be presented in block form (e.g., as opposed to rapid sequential presentation of separate words) without broadcasting the HU voice to expedite the catchup process.
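One way to picture the behavior described in the last few paragraphs is as a small scheduler that decides, segment by segment, whether to broadcast at normal speed, at an expedited speed, or to present text in block form with no broadcast. The sketch below is illustrative only; the 3x rate, the confidence threshold and the maximum latency value are assumptions rather than values taken from the disclosure.

def schedule_segment(confidence, delay_seconds,
                     conf_threshold=0.90, max_latency=20, expedited_rate=3.0):
    """Decide how to present one HU voice/text segment to a CA.

    Returns (broadcast_rate_or_None, presentation) where presentation is
    'sequential' (word-by-word with audio) or 'block' (text only, no audio).
    Sketch of the catch-up behavior, not a definitive implementation.
    """
    if confidence < conf_threshold:
        # Low confidence text always gets audio at normal speaking speed.
        return (1.0, "sequential")
    if delay_seconds < max_latency:
        # High confidence text: expedited audio while latency is tolerable.
        return (expedited_rate, "sequential")
    # Latency too large: drop the audio and show the text in block form.
    return (None, "block")

print(schedule_segment(0.75, 5))    # (1.0, 'sequential')
print(schedule_segment(0.97, 5))    # (3.0, 'sequential')
print(schedule_segment(0.97, 25))   # (None, 'block')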
In cases where a system processor or server determines when to automatically switch or when to suggest a switch from a CA captioning system to an ASR engine captioning system, several factors may be considered, including the following:
1. Percent match between ASR generated words and CA generated words over some prior captioning period (e.g., last 30 seconds);
2. How accurately ASR confidence factors reflect corrections made by a CA;
3. Words per minute spoken by an HU and how that affects accuracy;
4. Average delay between ASR and CA generated text over some prior captioning period;
5. An expressed AU preference stored in an AU preferences database accessible by a system processor;
6. Current AU preferences as set during an ongoing call via an on screen or other interface tool;
7. Clarity of received signal or some other proxy for line quality of the link between any two processors or servers within the system;
8. Identity of an HU conversing with an AU; and
9. Characteristics of an HU's voice signal.
Other factors are contemplated; a sketch combining several of the listed factors into a single switch decision is presented below.
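The following is a minimal sketch showing how factors like those listed above might be combined into one switch decision; the weights, thresholds, field names and class names are hypothetical and meant only for illustration.

from dataclasses import dataclass

@dataclass
class CallStats:
    """Hypothetical rolling statistics for the most recent captioning period."""
    asr_ca_word_match: float      # fraction of ASR words matching CA words
    avg_asr_ca_delay: float       # seconds between ASR and CA text
    hu_words_per_minute: float
    line_quality: float           # 0.0 (poor) to 1.0 (excellent)
    au_prefers_asr: bool          # stored or in-call AU preference

def should_switch_to_asr(stats, match_floor=0.95, delay_floor=3.0,
                         wpm_ceiling=180, quality_floor=0.8):
    """Return True if the call looks safe to hand from a CA to the ASR engine.
    Illustrative combination of several of the factors listed above."""
    if stats.au_prefers_asr:
        return True
    return (stats.asr_ca_word_match >= match_floor
            and stats.avg_asr_ca_delay <= delay_floor
            and stats.hu_words_per_minute <= wpm_ceiling
            and stats.line_quality >= quality_floor)

print(should_switch_to_asr(CallStats(0.97, 1.5, 140, 0.9, False)))  # True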
Handling Automatic And Ongoing ASR Text Corrections
In at least some cases a speech recognition engine will sequentially generate a sequence of captions for a single word or phrase uttered by a speaker. For instance, where an HU speaks a word, an ASR engine may generate a first "estimate" of a text representation of the word based simply on the sound of the individual word and nothing more. Shortly thereafter (e.g., within 1 to 6 seconds), the ASR engine may consider words that surround (e.g., come before and after) the uttered word along with a set of possible text representations of the word to identify a final estimate of a text representation of the uttered word based on context derived from the surrounding words. Similarly, in the case of a CA revoicing an HU voice signal to an ASR engine trained to the CA voice to generate text, multiple iterations of text estimates may occur sequentially until a final text representation is generated.
In at least some cases it is contemplated that every best estimate of a text representation of every word to be transcribed will be transmitted immediately upon generation to an AU device for continually updated presentation to the AU so that the AU has the best HU voice signal transcription that exists at any given time. For instance, in a case where an ASR engine generates at least one intermediate text estimate and a final text representation of a word uttered by an HU and where a CA corrects the final text representation, each of the interim text estimate, the final text representation and the CA corrected text may be presented to the AU where updates to the text are made as in line corrections thereto (e.g., by replacing erroneous text with corrected text directly within the text stream presented) or, in the alternative, corrected text may be presented above or in some spatially associated location with respect to erroneous text.
In cases where an ASR engine generates intermediate and final text representations while a CA is also charged with correcting text errors, if the ASR engine is left to continually make context dependent corrections to text representations, there is the possibility that the ASR engine could change CA generated text and thereby undo an intended and necessary CA correction.
To eliminate the possibility of an ASR modifying CA corrected text, in at least some cases it is contemplated that automatic ASR engine contextual corrections for at least CA corrected text may be disabled immediately after a CA correction is made or even once a CA commences correcting a specific word or phrase. In this case, for instance, when a CA initiates a text correction or completes a correction in text presented on her device display screen, the ASR engine may be programmed to assume that the CA corrected text is accurate from that point forward. In some cases, the ASR engine may be programmed to assume that a CA corrected word is a true transcription of the uttered word which can then be used as true context for ascertaining the text to be associated with other ASR engine generated text words surrounding the true or corrected word. In some cases text words prior to and following the CA corrected word may be corrected by the ASR engine based on the CA corrected word that provides new context or, in other cases, independent of that context. Hereinafter, unless indicated otherwise, when an ASR engine is disabled from modifying a word in a text phrase, the word will be said to be "firm".
In still other embodiments it is contemplated that after a CA listens to a word or phrase broadcast to the CA, or some short duration of time thereafter, the word or phrase may become firm irrespective of whether or not a CA corrects that word or phrase or another word or phrase subsequent thereto. For instance, in some cases once a specific word is broadcast to a CA for consideration, the word may be designated firm. In this case each broadcast word is made firm immediately upon broadcast of the word and therefore after being broadcast, no word is automatically modified by an ASR engine. Here the idea is that once a CA listens to a broadcast word and views a representation of that word as generated by the ASR engine, either the word is correct or, if incorrect, the CA is likely about to correct that word and therefore an ASR correction could be confusing and should be avoided.
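The "firm" designation can be modeled as a per-word flag that an ASR revision must respect. The sketch below, with hypothetical class and method names, firms a word as soon as it has been broadcast to or corrected by the CA and silently rejects later ASR revisions to firm words.

class CaptionBuffer:
    """Minimal sketch of a caption buffer that tracks which words are 'firm'."""

    def __init__(self):
        self.words = []   # transcript words in order
        self.firm = []    # parallel list of firm flags

    def append(self, word):
        self.words.append(word)
        self.firm.append(False)

    def mark_firm(self, index):
        """Called when a word is broadcast to the CA or corrected by the CA."""
        self.firm[index] = True

    def asr_revise(self, index, new_word):
        """Apply an ASR contextual correction only if the word is not firm."""
        if self.firm[index]:
            return False          # firm words are never changed by the ASR
        self.words[index] = new_word
        return True

buf = CaptionBuffer()
for w in ["please", "bing", "the", "cods"]:
    buf.append(w)
buf.mark_firm(3)                      # CA corrected or listened to "cods"
print(buf.asr_revise(1, "bring"))     # True - still changeable
print(buf.asr_revise(3, "kids"))      # False - firm, rejected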
As another instance, in some cases where a word forms part of a larger phrase, the word and other words in the phrase may not be designated firm until after either (1) a CA corrects the word or a word in the phrase that is subsequent thereto or (2) the entire phrase has been broadcast to the CA for consideration. Here, the idea is that in many cases a CA will have to listen to an entire phrase in order to assess accuracy of specific transcribed words so firming up phrase words prior to complete broadcast of the entire phrase may be premature.
As yet one other instance, in some cases automatic firm designations may be assigned to each word in an HU voice signal a few seconds (e.g., 3 seconds) after the word is broadcast, a few words (e.g., 5 words) after the word is broadcast, or in some other time related fashion.
In at least some cases it is contemplated that if a CA corrects a word or words at one location in presented text, and an ASR subsequently contextually corrects a word or phrase that precedes the CA corrected word or words, the subsequent ASR correction may be highlighted or otherwise visually distinguished so that the CA's attention is called thereto to consider the ASR correction. In at least some cases, when an ASR corrects text prior to a CA text correction, the text that was corrected may be presented in a hovering tag proximate the ASR correction and may be touch selectable by the CA to revert back to the pre-correction text if the CA so chooses. To this end, see the CA interface screen shot 1391 shown in Fig. 43 where ASR generated text is shown at 1393 that is similar to the text presented in Fig. 39, albeit with a few corrections. More specifically, in Fig. 43, it is assumed that a CA corrected the word "cods" to "kids" at 1395 (compare again to Fig. 39) after which an ASR engine corrected the prior word "bing" to "bring". The prior ASR corrected word is highlighted or distinguished as shown at 1397 and the word that was changed to make the correction is presented in hovering tag 1399. Tag 1399 is touch selectable by the CA to revert back to the prior word if selected.
In other cases where a CA initiates or completes a word correction, the ASR engine may be programmed to disable generating additional estimates or hypotheses for any words uttered by the HU prior to the CA corrected word or within a text segment or phrase that includes the corrected word. Thus, for instance, in some cases, where 30 text words appear on a CA's display screen, if the CA corrects the fifth most recently presented word, that word and the 25 preceding words would be rendered firm and unchangeable via the ASR engine. Here, in some cases the CA would still be free to change any word presented on her display screen at any time. In other cases, once a CA corrects a word, that word and any preceding text words may be firm as to both the CA and the ASR engine.
In at least some embodiments a CA interface may be equipped with some feature that enables a CA to firm up all current text results prior to some point in a caption representation on the CA's and AU's display screens. For instance, in some cases a specific simultaneous keyboard selection like the "Esc" key and an "F1" key while a cursor is at a specific location in a caption representation may cause all text that precedes that point, whether ASR initial, ASR corrected, CA initial or CA corrected, to become firm. As another instance, in at least some cases where a CA's display screen is touch sensitive, a CA may contact the screen at a location associated with a captioned word and may perform some on screen gesture to indicate that words prior thereto should be made firm. For example, the on screen gesture may include a swipe upward, a double tap, or some other gesture reserved for firming up prior captioned text on the screen.
In still other cases one or more interface output signals may be used by a CA to help the CA track the CA's correction efforts. For instance, whenever a CA corrects a word or phrase in caption text, all text prior to and including the correction may be highlighted or otherwise visually distinguished (e.g., text color changed) to indicate the point of the most recent CA text change. Here, the CA could still make changes prior to the most recent change but the color change to indicate the latest change in the text would persist. In still other cases the CA may be able to select specific keys like an "Esc" key and some other key (e.g., "F2") to change text color prior to the selected point as an indication to the CA that prior text has already been considered. In still other cases it is contemplated that on screen "checked" options may be presented on the CA screen that are selectable to indicate that text prior thereto has been considered and the color should be changed. To this end see Fig. 50 where "Checked" icons (two labelled 1544) are presented after each punctuation mark to separate consecutive sentences in ASR generated text 1540. Here, if one of the checked icons is selected, text prior thereto may be highlighted or otherwise visually distinguished to indicate prior correction consideration.
While not shown, whenever text is firmed up and/or whenever a CA has indicated that text has been considered for correction, in addition to indicating that status on the CA display screen, in at least some cases that status may be indicated in a similar fashion on an AU device display screen.
When a CA firms up specific text, in at least some cases even if the CA is listening to the HU voice signal prior to the point at which the text is firmed up, the system may automatically jump the HU voice broadcast point to the firmed up point so that the CA does not hear the intervening HU voice signal. When a voice signal jumps ahead, a warning may be presented to the CA on the CA's display screen confirming the jump ahead. In other cases the CA may still have to listen to the intervening HU voice signal. In still other cases the system may play the intervening HU voice signal at a double, triple or some other multiple of the original speech rate to expedite the process of working through the intervening voice signal.
In at least some cases an AU device may support automatic triggers that cause CA activity to skip forward to a current time. For instance, in an ASR-CA backed up mode, in at least some cases where an AU has at least some hearing capability, it may be assumed that when an AU speaks, the AU is responding to the most recent HU voice signal broadcast, understood that signal, and therefore has a current understanding of the conversation. Here, assuming the AU has a current understanding, the system may automatically skip CA error correction activities to the current HU voice signal and associated ASR text so that any error correction delay is eliminated. In a similar fashion, in a CA caption mode, if an AU speaks, based on the assumption that the AU has a current understanding of the conversation, the system may automatically skip CA text generation and error correction activities to the current HU voice signal so that any text generation and error correction delay is eliminated. In this case, because there is no ASR text prior to the delay skipping, in parallel with the skipping activity, an ASR may automatically generate fill in text for the HU voice signal not already captioned by the CA. Any skipping ahead based on AU speech may also firm up all text presented to the AU prior to that point as well as any fill in text where appropriate.
In cases where an AU's voice signal operates as a catch up trigger, in at least some cases the trigger may require absence of typical words or phrases that are associated with a confused state. For instance, an exemplary phrase that indicates confusion may be "What did you say?" As another instance, an exemplary phrase may be "Can you repeat?" In this case, several predefined words or phrases may be supported by the system and, any time one of those words or phrases is uttered by an AU, the system may forego skipping the delayed period so that CA error correction or CA captioning with error correction continues unabated. In other cases the relay server may apply artificial intelligence to recognize when a word or phrase likely indicates confusion and similarly may forego skipping the delayed period so that CA error correction or CA captioning with error correction continues unabated. If the AU's uttered word or phrase is not associated with confusion, as described above, the CA activities (e.g., error correction or captioning and error correction) are skipped ahead to the current HU voice signal.
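A minimal sketch of the AU-speech catch-up trigger with a confusion-phrase exception follows; the phrase list and function name are illustrative, and a production system might instead apply a trained classifier as suggested above.

CONFUSION_PHRASES = (
    "what did you say",
    "can you repeat",
    "i didn't catch that",
)

def should_skip_to_current(au_utterance_text):
    """Return True if an AU utterance should cause CA activity to skip ahead
    to the current HU voice signal (sketch of the trigger described above)."""
    normalized = au_utterance_text.lower().strip(" ?!.")
    for phrase in CONFUSION_PHRASES:
        if phrase in normalized:
            return False   # AU sounds confused; keep CA correction unabated
    return True            # AU appears caught up; skip delayed CA work

print(should_skip_to_current("Sounds good, see you Friday"))   # True
print(should_skip_to_current("What did you say?"))             # False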
In some cases there may be restrictions on text corrections that may be made by a CA. For instance, in a simple case where an AU device can only present a maximum of 50 words to an AU at a time, the system may only allow a CA to correct text corresponding to the 50 words most recently uttered by an HU. Here, the idea is that in most cases it will make no sense for a CA to waste time correcting text errors in text prior to the most recently uttered 50 words as an AU will only rarely care to back up in the record to see prior generated and corrected text. Here, the window of text that is correctable may be a function of several factors including font type and size selected by an AU on her device, the type and size of display included in an AU's device, etc. This feature of restricting CA corrections to AU viewable text is effectively a limit on how far behind CA error corrections can lag.
In some cases it is contemplated that a call may start out with full CA error correction so that the CA considers all ASR engine generated text but that, once the error correction latency exceeds some threshold level, the CA may only be able to or may be encouraged to only correct low confidence text. For instance, the latency limit may be 10 seconds, at which point all ASR text is presented but low confidence text is visually distinguished in some fashion designed to encourage correction. To this end see for instance Fig. 40 where low and high confidence text is presented to a CA in different scrolling columns. In some cases error correction may be limited to the left column low confidence text as illustrated. Fig. 40 is described in more detail hereafter. Where only low confidence text can be corrected, in at least some cases the HU voice signal for the high confidence text may not be broadcast.
As another example, see Fig. 40A where a CA display screen shot 1351 includes the same text 1353 as in Fig. 40 presented in a scrolling fashion and where phrases (only one shown) that include one or some threshold number of low confidence factor words are visually distinguished (e.g., via a field border 1355, via highlighting, via different text font characteristics, etc.) to indicate the low confidence factor words and phrases. Here, in some cases the system may only broadcast the low confidence phrases, skipping from one to the next to expedite the error correction process. In other cases the system may increase the HU voice signal broadcast rate (e.g., 2X, 3X, etc.) between low confidence phrases and slow the rate down to a normal rate during low confidence phrases so that the CA continues to be able to consider low confidence phrases in full context.
In some cases, only low confidence factor text and the associated HU voice signal may be presented and broadcast to a CA for consideration with some indication of missing text and voice between the presented text words or phrases. For instance, turn piping representations (see again 216 in Fig. 17) may be presented to a CA between low confidence editable text phrases.
In other cases, while interim and final ASR engine text may be presented to an AU, a CA may only see final ASR engine text and therefore only be able to edit that text. Here, the idea is that most of the time ASR engine corrections will be accurate and therefore, by delaying CA viewing until final ASR engine text is generated, the number of required CA corrections will be reduced appreciably. It is expected that this solution will become more advantageous as ASR engine speed increases so that there is minimal delay between interim and final ASR engine text representations.
In still other cases it is contemplated that only final ASR engine text may be sent on to an AU for consideration. In this case, for instance, ASR generated text may be transmitted to an AU device in blocks where context afforded by surrounding words has already been used to refine text hypotheses. For instance, words may be sent in five word text blocks where the block sent always includes the 6th through 10th most recently transcribed words so that the most recent through fifth most recent words can be used contextually to generate final text hypotheses for the 6th through 10th most recent words. Here, CA text corrections would still be made at a relay and transmitted to the AU device for in line corrections of the ASR engine final text.
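The five word block scheme can be sketched as a simple buffer that withholds the five most recent words as right-hand context and releases the older words in blocks. The class name and the block and context size parameters below are illustrative only.

class BlockSender:
    """Sketch of sending finalized ASR text in blocks, holding back the most
    recent words so they can provide context for final hypotheses."""

    def __init__(self, block_size=5, context_size=5):
        self.block_size = block_size
        self.context_size = context_size
        self.pending = []

    def add_word(self, word):
        """Add a newly transcribed word; return a block to transmit, if any."""
        self.pending.append(word)
        if len(self.pending) >= self.block_size + self.context_size:
            block = self.pending[:self.block_size]   # 6th-10th most recent
            self.pending = self.pending[self.block_size:]
            return block
        return None

sender = BlockSender()
for w in "the kids want to come home next friday after the game".split():
    block = sender.add_word(w)
    if block:
        print("transmit:", block)   # fires once ten words have accumulated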
In this case, if a CA takes over the task of text generation from an ASR engine for some reason (e.g., an AU requests CA help), the system may switch over to transmitting CA generated text word by word as the text is generated. In this case CA corrections would again be transmitted separately to the AU device for in line correction. Here, the idea is that the CA generated text should be relatively more accurate than the ASR engine generated text and therefore immediate transmission of the CA generated text to the AU would result in fewer errors being presented to the AU.
While not shown, in at least some embodiments it is contemplated that turn piping type indications may be presented to a CA on her interface display as a representation of the delay between the CA text generation or correction and the ASR engine generated text. To this end, see the exemplary turn piping 216 in Fig. 17. A similar representation may be presented to a CA.
Where CA corrections or even CA generated text is substantially delayed, in at least some cases the system may automatically force a split to cause an ASR engine to catch up to a current time in a call and to firm up (e.g., disable a CA from changing the text) text before the split time. In addition, the system may identify a preferred split prior to which ASR engine confidence factors are high. For instance, where ASR engine text confidence factors for spoken words prior to the most recent 15 words are high and for the last fifteen words are low, the system may automatically suggest or implement a split at the 15th most recent word so that ASR text prior to that word is firmed up and text thereafter is still presented to the CA to be considered and corrected.
Here, the CA may reject the split either by selecting a rejection option or by ignoring the suggestion, or may accept the suggestion either by selecting an accept option or by ignoring the suggestion (e.g., where the split is automatic if not rejected within some period (e.g., 2 seconds)). To this end, see the exemplary CA screen shot in Fig. 39 where ASR generated text is shown at 1332. In this case, the CA is behind in error correction so that the CA computer is currently broadcasting the word "want" as indicated by the "Broadcast" tag 1334 that moves along the ASR generated text string to indicate to the CA where the current broadcast point is located within the overall string. A "High CF - Catch Up" tag 1338 is provided to indicate a point within the overall ASR text string presented prior to which ASR confidence factors are high and after which ASR confidence factors are relatively lower. Here, it is contemplated that a CA would be able to select tag 1338 to skip to the tagged point within the text. If a CA selects tag 1338, the broadcast may skip to the associated tagged point so that "Broadcast" tag 1334 would be immediately moved to the point tagged by tag 1338 where the HU voice broadcast would recommence. In other cases, selecting high confidence tag 1338 may cause accelerated broadcast of text between tags 1334 and 1338 to expedite catch up.
Referring to Fig. 40, another exemplary CA screen shot 1333 that may be presented to show low and high confidence text segments and to enable a CA to skip to low confidence text and associated voice signal is illustrated. Screen shot 1333 divides text into two columns including a low confidence column 1335 and a high confidence column 1337. Low confidence column 1335 includes text segments that have ASR assigned confidence factors that are less than some threshold value while high confidence column 1337 includes text segments that have ASR assigned confidence factors that are greater than the threshold value. Column 1335 is presented on the left half of screen shot 1333 and column 1337 is presented on the right half of shot 1333. The two columns would scroll upward simultaneously as more text is generated. Again, a current broadcast tag 1339 is provided at a current broadcast point in the presented text. Also, a "High CF, Catch Up" tag 1341 is presented at the beginning of a low confidence text segment. Here, again, it is contemplated that a CA may select the high confidence tag 1341 to skip the broadcast forward to the associated point to expedite the error correction process. As shown, in at least some cases, if the CA does not skip ahead by selecting tag 1341, the HU voice broadcast may be at 2X or more the speaking speed so that catch up can be more rapid.
In at least some cases it is contemplated that when a call is received at an AU device or at a relay, a system processor may use the calling number (e.g., the number associated with the calling party or the calling party's device) to identify the least expensive suitable option for generating text for a specific call. For instance, for a specific first caller, a robust and reliable ASR engine voice model may already exist and therefore be useable to generate automated text without the need for CA involvement most of the time, while no model may exist for a second caller that has not previously used the system. In this case, the system may automatically initiate captioning using the ASR engine and first caller voice model for first caller calls and may automatically initiate CA assisted captioning for second caller calls so that a voice model for the second caller can be developed for subsequent use. Where the received call is from an AU and is outgoing to an HU, a similar analysis of the target HU may cause the system to initiate ASR engine captioning or CA assisted captioning.
In some embodiments the identity of an AU (e.g., an AU's phone number or other communication address) may also be used to select which of two or more text generation options to use to at least initiate captioning. Thus, some AUs may routinely request CA assistance on all calls while others may prefer all calls to be initiated as ASR engine calls (e.g., for privacy purposes) where CA assistance is only needed upon request for relatively small sub-periods of some calls. Here, AU phone or address numbers may be used to assess optimal captioning type.
In still other cases both a called and a calling number may be used to assess optimal captioning type. Here, in some cases, an AU number or address may trump an HU number or address and the HU number or address may only be used to assess the caption type to use initially when the AU has no perceived or expressed preference.
Referring again to Fig. 39, it has been recognized that, in addition to text corresponding to an HU voice signal, an optimal CA interface needs additional information that is related to specific locations within a presented text string. For instance, specific virtual control buttons need to be associated with specific text string locations. For example, see the "High CF - Catch Up" button in Fig. 39. As other examples, a "resume" tag 1233 as in Fig. 36 or a correction word (see Fig. 20) may need to be linked to a specific text location. As another instance, in some cases a "broadcast" tag indicating the word currently being broadcast may have to be linked to a specific text location (see Fig. 39).
In at least some embodiments, a CA interface or even an AU interface will take a form where text lines are separated by at least one blank line that operates as an "additional information" field in which other text location linked information or content can be presented. To this end, see Fig. 39 where additional information fields are collectively labelled 1215. In other embodiments it is contemplated that the additional information fields may also be provided below associated text lines. In still other embodiments, other text fields may be presented as separate in line fields within the text strings (see 1217 in Fig. 40).
Training, Gamification, CA Scoring, CA Profiles
In many industries it has been recognized that if a tedious job can be gamified, employee performance can be increased appreciably as employees work through obstacles to increase personal speed and accuracy scores and, in some cases, to compete with each other. Here, in addition to increased personal performance, an employing entity can develop insights into best work practices that can be rolled out to other employees attempting to better their performance. In addition, where there are clear differences in CA capabilities under different sets of circumstances, CA scoring can be used to develop CA profiles so that when circumstances can be used to distinguish optimal CAs for specific calls, an automated system can distribute incoming calls to optimal CAs for those specific calls or can move calls among CAs mid-call so that the best CA for each call or parts of calls can be employed.
In the present case, various systems are being designed and tested to add gamification, scoring and profile generating aspects to the text captioning and/or correction processes performed by CAs. In this regard, in some cases it has been recognized that if a CA simply operates in parallel with an ASR engine to generate text, a CA may be tempted to simply let the ASR engine generate text without diligent error correction which, obviously, is not optimal for AUs receiving system generated text where caption accuracy is desired and even required to be at high levels.
To avoid CAs shirking their error correction responsibilities and to help CAs increase their skills, in at least some embodiments it is contemplated that a system processor that drives or is associated with a CA interface may introduce periodic and random known errors into ASR generated text that is presented to a CA as test errors. Here, the idea is that a CA should identify the test errors and at least attempt to make corrections thereto. In most cases, while errors are presented to the CA, the errors are not presented to an AU and instead the likely correct ASR engine text is presented to the AU. In some cases the system allows a CA to actually correct the erroneous text without knowing which errors are ASR generated and which are purposefully introduced as part of one of the gamification or scoring processes. Here, by requiring the CA to make the correction, the system can generate metrics on how quickly the CA can identify and correct caption errors.
In other cases, when a CA selects an introduced text error to make a correction, the interface may automatically make the correction upon selection so that the CA does not waste additional time rendering a correction. In some cases, when an introduced error is corrected either by the interface or the CA, a message may be presented to the CA indicating that the error was a purposefully introduced error.
Referring to Fig. 41, a method 1350 that is consistent with at least some aspects of the present disclosure for introducing errors into an ASR text stream for testing CA alertness is illustrated. At block 1352, an ASR engine generates ASR text segments corresponding to an HU voice signal. At block 1354, a relay processor or ASR engine assigns confidence factors to the ASR text and at block 1356, the relay identifies at least one high confidence text segment as a "test" segment. At block 1358, the processor transmits the high confidence test segment to an AU device for display to an AU. At block 1360, the processor identifies an error segment to be swapped into the ASR generated text for the test segment to be presented to the CA. For instance, where a high confidence test segment includes the phrase "John came home on Friday", the processor may generate an exemplary error segment like "John camp home on Friday".
Referring still to Fig. 41, at block 1362, the processor presents text with the error segment to the CA as part of an ongoing text stream to consider for error correction. At decision block 1364, the processor monitors for CA selection of words or phrases in the error segment to be corrected. Where the CA does not select the error segment for correction, control passes to block 1372 where the processor stores an indication that the error segment was not identified and control passes back up to block 1352 where the process continues to cycle. In addition, at block 1372, the processor may also store the test segment, the error segment and a voice clip corresponding to the test segment that may later be accessed by the CA or an administrator to confirm the missed error.
Referring again to block 1364 in Fig. 41, if the CA selects the error segment for correction, control passes to block 1366 where the processor automatically replaces the error segment with the test segment so that the CA does not have to correct the error segment. Here the test segment may be highlighted or otherwise visually distinguished so that the CA can see the correction made. In addition, in at least some cases, at block 1368, the processor provides confirmation that the error segment was purposefully introduced and corrected. To this end, see the "Introduced Error - Now Corrected" tag 1331 in Fig. 39 that may be presented after a CA selects an error segment. At block 1370, the processor stores an indication that the error segment was identified by the CA. Again, in some cases, the test segment, error segment and related voice clip may be stored to memorialize the error correction. After block 1370, control passes back up to block 1352 where the process continues to cycle.
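The flow of Fig. 41 can be summarized in a short sketch. The segment selection, the error generation step and the record keeping below are simplified placeholders; in particular, make_error_segment simply corrupts one word and is purely illustrative of the kind of error that might be introduced at block 1360.

import random

def make_error_segment(test_segment):
    """Purposefully corrupt one word of a high confidence segment
    (illustrative stand-in for the error generation at block 1360)."""
    words = test_segment.split()
    i = random.randrange(len(words))
    words[i] = words[i][:-1] + "p" if len(words[i]) > 2 else words[i] + "x"
    return " ".join(words)

def run_test_cycle(test_segment, ca_selected_error, results_log):
    """One pass through blocks 1356-1372 (sketch only)."""
    error_segment = make_error_segment(test_segment)
    # Block 1358: the AU always receives the clean high confidence text.
    sent_to_au = test_segment
    # Blocks 1362-1364: the CA sees the corrupted text instead.
    if ca_selected_error:
        # Blocks 1366-1370: auto-restore the true text and log a catch.
        results_log.append(("caught", test_segment, error_segment))
        shown_to_ca = test_segment
    else:
        # Block 1372: log the miss for later review.
        results_log.append(("missed", test_segment, error_segment))
        shown_to_ca = error_segment
    return sent_to_au, shown_to_ca

log = []
run_test_cycle("John came home on Friday", ca_selected_error=True, results_log=log)
run_test_cycle("John came home on Friday", ca_selected_error=False, results_log=log)
print(log)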
In some cases errors may only be introduced during periods when the rate of actual ASR engine errors and CA corrections is low. For instance, where a CA is routinely making error corrections during a one minute period, it would make no sense to introduce more text errors as the CA is most likely highly focused during that period and her attention is needed to ensure accurate error correction. In addition, if a CA is substantially delayed in making corrections, the system may again opt to not introduce more errors.
Error introductions may include text additions, text deletions (e.g., removal of text so that the text is actually missing from the transcript) and text substitutions in some embodiments. In at least some cases the error generating processor or CA interface may randomly generate errors of any type and related to any ASR generated text. In other cases, the processor may be programmed to introduce several different types of errors including visible errors (e.g., defined above as errors that are clear errors when placed in context with other words in a text phrase, e.g., the phrase does not make sense when the erroneous text is included), invisible errors (e.g., errors that make sense and are grammatically correct in the context of surrounding words), minor errors which are errors that, while including incorrect text, have no bearing on the meaning of an associated phrase (e.g., "the" swapped for "a") and major errors which are errors that include incorrect text and that change the meaning of an associated phrase (e.g., swapping a 5PM meeting time for a 3PM meeting time). In some cases an error may have two designations such as, for instance, visible and major, visible and minor, invisible and major or invisible and minor.
Because at least some ASR engines can understand context, the engines can also be programmed to ascertain when a simple text error affects phrase meaning and can therefore generate and identify different error types to test a CA's correction skills. For instance, in some cases introduced errors may include visible, invisible, minor and major errors and statistics related to correcting each error type may be maintained as well as statistics related to when a correction results in a different error. For instance, an invisible major error may be presented to a CA and the CA may recognize that error and incorrectly correct it to introduce a visible minor error which, while still wrong, is better than the invisible major error. Here, statistics would reflect that the CA identified and corrected the invisible major error but made an error when correcting which resulted in a visible minor error. As another instance, a visible minor error may be incorrectly corrected to introduce an invisible major error which would generate a much worse captioning result that could have substantial consequences. Here, statistics would reflect that the CA identified and corrected the initial error which is good, but would also reflect that the correction made introduced another error and that the new error resulted in a worse transcription result.
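The two-axis error taxonomy (visible/invisible, major/minor) and the correction-outcome bookkeeping described above might be represented as follows; the enum names, the CorrectionOutcome record and the crude "improved" comparison are assumptions made for illustration only.

from enum import Enum
from dataclasses import dataclass
from typing import Optional

class Visibility(Enum):
    VISIBLE = "visible"      # phrase does not make sense with the error
    INVISIBLE = "invisible"  # error reads as grammatical in context

class Severity(Enum):
    MINOR = "minor"          # meaning of the phrase is unchanged
    MAJOR = "major"          # meaning of the phrase changes (e.g., 5PM vs 3PM)

@dataclass
class CorrectionOutcome:
    """Record of what a CA did with one introduced or detected error."""
    original: tuple               # (Visibility, Severity) of the presented error
    corrected: bool               # did the CA attempt a correction?
    resulting: Optional[tuple]    # error type after correction, None if fully fixed

    def improved(self):
        """Crude comparison: any residual MAJOR error is worse than MINOR."""
        if self.resulting is None:
            return True
        return (self.original[1] == Severity.MAJOR
                and self.resulting[1] == Severity.MINOR)

# An invisible major error corrected into a visible minor error still counts
# as an improvement, matching the example in the text above.
outcome = CorrectionOutcome((Visibility.INVISIBLE, Severity.MAJOR), True,
                            (Visibility.VISIBLE, Severity.MINOR))
print(outcome.improved())  # True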
In some embodiments gamification can be enhanced by generating ongoing, real time dynamic scores for CA performance including, for instance, a score associated with accuracy, a separate score associated with captioning speed and/or separate speed and accuracy scores under different circumstances such as, for instance, for male and female voices, for east coast accents, Midwest accents, southern accents, etc., for high speed talking and slower speed talking, for captioning with correcting versus captioning alone versus correcting ASR engine text, and any combinations of factors that can be discerned. In Fig. 40, exemplary accuracy and speed scores that are updated in real time for an ongoing call are shown at 1343 and 1345, respectively.
Where a call persists for a long time, a rolling most recent sub-period of the call may be used as the duration over which at least current scores are calculated and separate scores associated with an entire call may be generated and stored as well.
CA scores may be stored as part of a CA profile and that profile may be routinely updated to reflect growing CA effectiveness with experience over time. Once CA specific scores are stored in a CA profile, the system may automatically route future calls that have characteristics that match high scores for a specific CA to that CA which should increase overall system accuracy and speed. Thus, for instance, if an HU profile associated with a specific phone number indicates that an associated HU has a strong southern accent and speaks rapidly, when a call is received that is associated with that phone number, the system may automatically route the call to a CA that has a high gamification score for rapid southern accents if such a CA is available to take the call.
In other cases it is contemplated that when a call is received at a relay where the call cannot be associated with an existing HU voice profile, the system may assign the call to a first CA to commence captioning where a relay processor analyzes the HU voice during the beginning of the call and identifies voice characteristics (e.g., rapid, southern, male, etc.) and automatically switches the call to a second CA that is associated with a high gamification score for the specific type of HU voice. In this case, speed and accuracy would be expected to increase after the switch to the second CA.
Similarly, if a call is routed to one CA based on an incoming phone number and it turns out that a different HU voice is present on the call so that a better voice profile fits the HU voice, the call may be switched from an initial CA to a different CA that is more optimal for the HU voice signal. In some cases a CA switch mid-call may only occur if some threshold level of delay or captioning errors is detected. For instance, if a first assigned CA's delay and error rate is greater than threshold values and a system processor recognizes HU voice characteristics that are much better suited to a second available CA's skill set and profile, the system may automatically transition the call from the first CA to the second CA.
In addition, in some cases it is contemplated that in addition to the individual speed and accuracy scores, a combined speed/accuracy score can be generated for each CA over the course of time, for each CA over a work period (e.g., a 6 hour captioning day), for each CA for each call that the CA handles, etc. For example, an exemplary single score algorithm may include a running tally that adds one point for a correct word and adds zero points for an incorrect word, where the correct word point is offset by an amount corresponding to a delay in word generation after some minimal threshold period (e.g., 2 seconds after the word is broadcast to the CA for transcription or one second after the word is broadcast to and presented to a CA for correction). For instance, the offset may be 0.2 points for every second after the minimal threshold period. Other algorithms are contemplated. The single score may be presented to a CA dynamically and in real time so that the CA is motivated to focus more. In other cases the single score per phone call may be presented at the end of each call or an average score over a work period may be presented at the end of the work period. In Fig. 40, an exemplary current combined score is shown at 1347.
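A minimal sketch of the running-tally algorithm just described, assuming the 0.2 point-per-second offset and the 2 second threshold given as examples; the data layout and function name are hypothetical.

def combined_score(words, threshold_seconds=2.0, penalty_per_second=0.2):
    """Running speed/accuracy tally: one point per correct word, zero for an
    incorrect word, with the correct-word point reduced by 0.2 for every
    second of delay beyond the threshold (sketch of the example algorithm).

    words : iterable of (is_correct, delay_seconds_after_broadcast) tuples
    """
    total = 0.0
    for is_correct, delay in words:
        if not is_correct:
            continue   # incorrect words add zero points
        excess = max(0.0, delay - threshold_seconds)
        total += max(0.0, 1.0 - penalty_per_second * excess)
    return total

# Three correct words (one of them 4 seconds late) and one incorrect word:
print(combined_score([(True, 1.0), (True, 6.0), (True, 2.0), (False, 1.0)]))
# 1.0 + 0.2 + 1.0 + 0 = 2.2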
The single score or any of the contemplated metrics may also be related to other factors such as, for instance:
(1) How quickly errors are corrected by a CA;
(2) How many ASR errors need to be corrected in a rolling period of time;
(3) ASR delays;
(4) How many manufactured or purposefully introduced errors are caught and corrected;
(5) Error types (e.g., visible, invisible, minor and major);
(6) Correct and incorrect corrections;
(7) Effect of incorrect corrections and non-corrections (e.g., better caption or worse caption);
(8) Rates of different types of corrections;
(9) Error density;
(10) Once a CA is behind, how the CA responds, rate of catchup;
(11) HU speaking rate (WPM);
(12) HU accent or dialect;
(13) HU volume, pitch, tone, changes in audible signal characteristics;
(14) Voice signal clarity (perhaps as measured by the ASR engine);
(15) Communication link quality;
(16) Noise level (e.g., HU operating in a high wind environment where noise is substantial and persistent);
(17) Quality of captioned sentence structure (e.g., verb, noun, adverb, in acceptable sequence); and
(18) ASR confidence factors associated with text generated during a call (as a proxy for captioning complexity), etc.
In at least some embodiments where gamification and training processes are applied to actual AU-HU calls, there may be restrictions on the ability to store captions of actual conversations. Nevertheless, in these cases, captioning statistics may still be archived without saving caption text and the statistics may be used to drive scoring and gamification routines. For instance, for each call, call characteristics may be stored including, for instance, HU accent, average HU voice signal rate, highest HU voice signal rate, average volume of HU voice signal, other voice signal defining parameters, communication line clarity or other line characteristics, etc. (e.g., any of the other factors listed above). In addition, CA timing information may be stored for each audio segment in the call, for captioned words and for corrective CA activities.
As in the case of the full or pure CA metrics testing and development system described above, in at least some cases real AU-HU calls may be replaced by pre-recorded test call data sets where audio is presented to a CA while mock ASR engine text associated therewith is visually presented to the CA for correction. In at least some cases, the pre-stored test data set may only include a mocked up HU voice signal and known correct or true text associated therewith and the system including an ASR engine may operate in a normal fashion so the ASR engine generates real time text including ASR errors for the mocked up HU voice signal as a CA views that ASR text and makes corrections. Here, as the CA generates corrected final text, a system processor may automatically compare that text to the known correct or true text to generate CA call metrics including various scoring values.
In other cases, the ASR engine functions may be mimicked by a system processor that automatically introduces known errors of specific types into the correct or true text associated with the mocked up HU voice signal to generate mocked up ASR text that is presented to a CA for correction. Here, again, as the CA generates corrected final text, a system processor automatically compares that text to the known true text to generate CA call metrics including various scoring values.
In still other cases, in addition to storing the test HU voice signal and associated true text, the system may also store a test version of text associated with the HU voice signal where the test text version has known errors of known types and, during a test session, the test text with errors may be presented to the CA for correction. Here, again, as the CA generates corrected final text, a system processor automatically compares that text to the known true text to generate CA call metrics including various scoring values.
In each case where a mocked up HU voice signal is used during a test session, the voice signal and CA captioned transcripts can be maintained and correlated with the CA's results so that the CA and/or a system administrator can review those results for additional scoring purposes or to identify other insights into a specific CA's strengths and weaknesses or into CA activities more generally.
In at least some cases CAs may be tested using a testing application that, in addition to generating mock ASR text and ASR corrections for a mocked up AU-HU voice call, also simulates other exemplary and common AU actions during the call such as, for instance, switching from an ASR-CA backed up mode to a full CA captioning and error correction mode. Here, as during a normal call, the CA would listen to the HU voice signal and see ASR generated text on her CA display screen and would edit perceived errors in the ASR text during the ASR-CA backed up mode operation. Here, the CA would have full functionality to skip around within the ASR generated text to rebroadcast HU segments during error correction, to firm up ASR text, etc., just as if the mocked up call were real. At some point, the testing application would then issue a command to the CA station indicating that the AU requires full CA captioning and correction without ASR assistance at which point the CA system would switch over to full CA captioning and correction mode. A switch back to the ASR-CA backed up mode may occur subsequently.
Where pre-recorded mock HU voice signals are fed to a CA, a Truth/Scorer processor may be programmed to automatically use known HU voice signal text to evaluate CA corrections for accuracy as described above. Here, a final draft of the CA corrected text may be stored for subsequent viewing and analysis by a system administrator or by the CA to assess effectiveness, timing, etc.
Where scoring is to be applied to a live AU-HU call that does not use a pre-recorded HU voice signal so there is no initial "true" text transcript, a system akin to one of those described above with respect to one of Figs. 30 or 31 may be employed where a "truth" transcript is generated either via another CA or an ASR or a CA correcting ASR generated text for comparison and scoring purposes. Here, the second CA that generates the truth transcript may operate at a much slower pace than the pace required to support an AU as caption rate is not as important and can be sacrificed for accuracy. Once or as the second CA generates the truth transcript, a system processor may compare the first CA captioning results to the truth transcript to identify errors and generate statistics related thereto. Here, the truth transcript is ultimately deleted so that there is no record of the call and all that persists is statistics related to the CA's performance in handling the call.
In other embodiments where scoring is applied to a live AU-HU call that does not have a predetermined "truth" transcript, the second CA may receive the first CA's corrected text and listen to the HU voice signal while correcting the first CA's corrected text a second time. In this case, a processor tracks corrections by the first CA as well as statistics related to one or any subset of the call factors (e.g., rate of speech, number of ASR text errors per some number of words, etc.) listed above. In addition, the processor tracks corrections by the second CA where the second CA corrections are considered the truth transcript. Thus, any correction made by the second CA is taken as an error.
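Comparing a CA transcript against a truth transcript and retaining only statistics (so the transcripts themselves need not be stored) might look like the sketch below; difflib is used here only as an illustrative stand-in for whatever alignment the relay actually performs, and the statistic names are assumptions.

from difflib import SequenceMatcher

def score_against_truth(ca_words, truth_words):
    """Compare CA corrected text to a truth transcript and return only
    aggregate statistics so the transcripts can be discarded (sketch)."""
    matcher = SequenceMatcher(a=truth_words, b=ca_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(truth_words)
    stats = {
        "truth_word_count": total,
        "matched_words": matched,
        "error_count": total - matched,
        "accuracy": matched / total if total else 1.0,
    }
    return stats   # the transcripts themselves are not retained

truth = "we may go out and catch a movie with jane".split()
ca = "we may go out and catch a movie with jam".split()
print(score_against_truth(ca, truth))
# {'truth_word_count': 10, 'matched_words': 9, 'error_count': 1, 'accuracy': 0.9}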
In at least some cases, instead of just identifying CA caption errors generally, either a system processor or a second CA/scorer may categorize each error as visible (e.g., in the context of the phrase, the error makes no sense), invisible (e.g., in the context of the phrase the error makes sense but the meaning of the phrase changes) or minor (e.g., an error that does not change the meaning of the including phrase). Where a scoring second CA has to identify error type in a case where a mock AU-HU call is used as the source for CA correction, a processor may present a screen shot to the second CA where all errors are identified, as well as tallying tools for adding each error to one of several error type buckets.
To this end, see Fig. 51 where an exemplary CA scoring screen shot 1568 is illustrated. The screen shot 1568 includes a CA text transcript at 1572 that includes corrections by a first CA that is being scored by a CA scorer (e.g., a system manager or administrator). While scoring the text, the scorer listens to the HU voice signal via a headset and, in at least some cases, a word associated with a currently broadcast HU voice signal is highlighted to aid the scorer in following along. In the illustrated embodiment, a system processor compares the CA corrected text to a truth transcript and identifies transcription errors. Each error in Fig. 51 is visually distinguished. For instance, see exemplary field indicators 1574, 1576, 1578 and 1580, each of which represents an error.
Referring still to Fig. 51, as the scorer works her way through the CA text transcript considering each error, the scorer uses judgement to determine if the error is a major error or a minor error and designates each error either major or minor. For instance, a scorer may use a mouse or touch to select each error and then use specific keyboard keys to assign different error types to each error. In the illustrated example, a "V" keyboard selection designates an error as a major error while an "F" selection designates the error as a minor error. In Fig. 51, each time an error type is assigned to an error, a V1 or F1 designator is spatially associated with the error on the screen shot 1568 so that the error type is clear. In addition, when an error type is assigned to an error, the designated error is visually distinguished in a different fashion to help the scorer track which errors have been characterized and which have not. For instance, in Fig. 51, fields 1574 and 1576 are shown as left up to right cross hatched to indicate a red color indicating that associated errors have been categorized while fields 1578 and 1580 are shown left down to right cross hatched to indicate a blue color reserved for errors that have yet to be considered and categorized by the scorer.
In addition, when an error type is assigned to an error, a counter associated with the error type is incremented to indicate a total count for that specific type of error. To this end, a counter field 1570 is presented along the top edge of the screen shot 1568 that includes several counters including a major error counter and a minor error counter at 1598 and 1600, respectively. The final counts are used to generate various metrics related to CA quality and effectiveness.
In at least some cases a scorer may be able to select an error field to access associated text from the truth transcript that is associated with the error. To this end, see Fig. 51 where hand icon 1594 indicates user selection of error field 1578 which opens up truth text field 1596 in which associated truth text is presented. In the example, the name "Jane" is the truth text for the error "Jam". Thus, the scorer can either listen to the broadcast voice or view truth text to compare to error text for assessing error type.
Referring still to Fig. 51, missing text is also an error and is represented by the term "%missing" as shown at 1580. Here, again, the scorer can select the missing text field to view truth text associated therewith in at least some embodiments.
A "non-error" is erroneous text that could not possibly be confusing tosomeone reading a caption. For instance, exemplary non-errors include alternatespellings of a word, punctuation, spelled out numbers instead of numerals, etc. Here,while the system may flag non-errors between a truth text and CA generated text, thescorer may un-flag those errors as they are effectively meaningless. The idea here isthat on balance, it is better to have faster captioning with some non-errors than slowercaptioning where there are no non-errors and therefore, at a minimum, CAs should notbe penalized for purposefully or even unintentionally allowing non-errors. When ascorer un-flags a non-error, the appearance of the non-error is changed so that it is notvisually distinguished from other correct text in at least some embodiments. In addition,when a scorer un-flags a non-error, a value in a non-error count field 1602 isincremented by one.
In at least some cases a scorer can highlight a word or phrase in a text caption causing a processor to indicate durations of silence prior to the selected word or each word in a selected phrase. To this end, see, for instance, the highlighted phrase "may go out and catch a movie" in Fig. 51 where pre-word delays are shown before each word in the highlighted phrase including, for instance, delays 1605 and 1607 corresponding to the words "may" and "go", respectively. Here, a scorer can use the delays to develop a sense of whether or not words repeated in CA corrected text are meaningful. For instance, where a CA corrected transcript includes the phrase "no no", whether the word duplication is meaningful may depend upon the delay between the two words. Where there is no delay between the words, the duplication was not necessary as one "no" would have gotten the meaning across. On the other hand, where there is a several second delay between the first and second "no" utterances in the HU voice signal, that indicates that each word was a separate answer (e.g., the end of one sentence and the beginning of another). A scorer can use this type of information as another metric for scoring CA performance.
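The following Python sketch illustrates the kind of pre-word delay computation described above, using hypothetical time stamped words; the 1.5 second threshold for treating a repeated word as meaningful is an assumed value, not one taken from the system above.

    # Each entry is a (word, start_time_seconds) pair from a hypothetical time stamped transcript.
    words = [("no", 10.0), ("no", 10.2), ("I", 11.0), ("no", 15.0), ("no", 18.2)]

    MEANINGFUL_GAP = 1.5  # assumed threshold (seconds) between repeated words

    def repeated_word_report(stamped_words, gap=MEANINGFUL_GAP):
        """Flag repeated words whose pre-word delay suggests two separate utterances."""
        report = []
        for (prev_word, prev_t), (word, t) in zip(stamped_words, stamped_words[1:]):
            if word == prev_word:
                delay = t - prev_t
                meaningful = delay >= gap
                report.append((word, round(delay, 2), meaningful))
        return report

    # [('no', 0.2, False), ('no', 3.2, True)] -- the second repetition looks like a separate answer.
    print(repeated_word_report(words))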
One other way to monitor CA attention is to insert random or periodic indicators into the ASR engine text that the CA has to recognize within the text in some fashion to confirm the CA's attention. For instance, referring again to Fig. 36, in some cases a separate check box may be presented for each ASR transcript line of text as shown at 1610 where a CA has to select each box to place an "X" therein to indicate that the line has been examined. In other cases check boxes may be interspersed throughout the transcript text presented to the CA and the CA may need to select each of those boxes to confirm her attention.
Other AU Device Features And Processes
In at least some of the embodiments described above an AU has the option to request CA assistance or more CA assistance than is currently afforded on a call and/or to request ASR engine text as opposed to CA generated text (e.g., typically for privacy purposes). While a request to change caption technique may be received from an AU, in at least some cases the alternative may not be suitable for some reason and, in those cases, the system may forego a switch to the requested technique and provide an indication to the requesting AU that the switch request has been rejected. For instance, if an AU receiving CA generated and corrected text requests a switch to an ASR engine but the accuracy of the ASR engine is below some minimal threshold, the system may present a message to the AU that the ASR engine cannot currently support captioning and the CA generation and correction may persist. In this example, once the ASR engine is ready to accurately generate text, the switch thereto may be either automatic or the system may present a query to the AU seeking authorization to switch over to the ASR engine for subsequent captioning.
In a similar fashion, if an AU requests additional CA assistance, a system processor may determine that ASR engine text accuracy is low for some reason that will also affect CA assistance and may notify the AU that the switch will not be made along with a reason (e.g., "Communication line fault").
In cases where privacy is particularly important to an AU on a specific call orgenerally, the caption system may automatically, upon request from an AU or per AUpreferences stored in a database, initiate all captioning using an ASR engine. Here,where corrections are required, the system may present short portions of an HU's voicesignal to a series of CAs so that each CA only considers a portion of the text forcorrection. Then, the system would stitch all of the CA corrected text together into anHU text stream to be transmitted to the AU device for display.
In some cases it is contemplated that an AU device interface may present asplit text screen to an AU so that the AU has the option to view essentially real timeASR generated text or CA corrected text when the corrected text substantially lags theASR text. To this end, see the exemplary split screen interface 1450 in Fig. 45 whereCA corrected text is shown in an upper field 1452 and "real time" ASR engine text ispresented in a lower field 1454. As shown, a "CA location" tag 1456 is presented at theend of the CA corrected text while a "Broadcast" tag 1458 is presented at the end of theASR engine text to indicate the CA and broadcast locations within the text string.
Where CA correction latency reaches a threshold level (e.g., the text between the CAcorrection location and the most recent ASR text no longer fits on the display screen),text in the middle of the string may be replaced by a period indicator to indicate theduration of HU voice signal at the speaking speed that corresponds to the replaced text.
Here, as the CA moves on through the text string, text in the upper field 1452 scrolls up and, as the HU continues to speak, the ASR text in the bottom field 1454 also scrolls up independently of the upper field scrolling rate.
In at least some cases it is contemplated that an HU may use acommunication device that can provide video of the HU to an AU during a call. Forinstance, an HU device may include a portable tablet type computing device or smartphone (see 1219 in Fig. 33) that includes an integrated camera for telepresence typecommunication. In other cases, as shown in Fig. 33, a camera 1123 may be linked tothe HU phone or other communication device 14 for collecting HU video when activated.
Where HU video is obtained by an HU device, in most cases the video and voice signals will already be associated for synchronous playback. Here, as the HU voice and video signals are transmitted to an AU device, the HU video may be broken down into video segments that correspond with time stamped text and voice segments, and the stamped text, voice and video segments may be stored for simultaneous replay to the AU as well as to a CA if desired. Here, where there are delays between broadcast of consecutive HU voice segments as text transcription progresses, in at least some cases the HU video will freeze during each delay. In other cases the video and audio voice signal may always be synchronized even when text is delayed. If the HU voice signal is sped up during a catch up period as described above, the HU video may be shown at a faster speed so that the voice and video broadcasts are temporally aligned.
Fig. 42 shows an exemplary AU device screen shot 1308 including transcribed text 1382 and a video window or field 1384. Here, assuming that all of the shown text at 1382 has already been broadcast to the AU, if the AU selects the phrase "you should bing the cods along" as indicated by hand icon 1386, the AU device would identify the voice segment and video segment associated with the selected text segment and replay both the voice and video segments while the phrase remains highlighted for the user to consider.
Referring yet again to Fig. 33, in some cases the AU device or AU station may also include a video camera 1125 for collecting AU video that can be presented to the HU during a call. Here, it is contemplated that at least some HUs may be reluctant to allow an AU to view HU video without having the reciprocal ability to view the AU during an ongoing call and therefore reciprocal AU viewing would be desirable.
At least four advantages result from systems that present HU video to an AUduring an ongoing call. First, where the video quality is relatively high, the AU will beable to see the HU's facial expressions which can increase the richness of thecommunication experience.
Second, in some cases the HU representation in a video may be useable todiscern words intended by an HU even if a final text representation thereof isinaccurate. For instance, where a text transcription error occurs, an AU may be able toselect the phrase including the error and view the HU video associated with the selectedphrase while listening to the associated voice segment and, based on both the audioand video representations, discern the actual phrase spoken by the HU.
Third, it has been recognized that during most conversations, peopleinstinctively provide visual cues to each other that help participants understand when tospeak and when to remain silent while others are speaking. In effect, the visual cuesoperate to help people take turns during a conversation. By providing videorepresentations to each of an HU and an AU during a call, both participants can have agood sense of when their turn is to talk, when the other participant is struggling withsomething that was said, etc. Thus, for instance, in many cases an HU will be able tolook at the video to determine if an AU is silently waiting to view delayed text andtherefore will not have to ask if there is a delay in AU communication.
Fourth, for deaf AU's that are trained to read lips, the HU video may beuseable by the AU to enhance communication.
In at least some cases an AU device may be programmed to query an HUdevice at the beginning of a communication to determine if the HU device has a videocamera useable to generate an HU video signal. If the HU device has a camera, theAU device may cause the HU device to issue a query to the HU requesting access toand use of the HU device camera during the call. For instance, the query may includebrief instructions and a touch selectable "Turn on camera" icon or the like for turning onthe HU device camera. If the HU rejects the camera query, the system may operatewithout generating and presenting an HU video as described above. If the HU acceptsthe request, the HU device camera is turned on to obtain an HU video signal while theHU voice signal is obtained and the video and voice signal are transmitted to the AUdevice for further processing.
There are video relay systems on the market today where specially trainedCAs provide a sign language service for deaf AUs. In these systems, while an HU andan AU are communicating via a communication link or network, an HU voice signal isprovided to a CA. The CA listens to the HU voice signal and uses her hands togenerate a sequence of signs that correspond at least roughly to the content (e.g.,meaning) of the HU voice messages. A video camera at a CA station captures the CAsign sequence (e.g., "the sign signal") and transmits that signal to an AU device whichpresents the sign signal to the AU via a display screen. If the AU can speak, the AUtalks into a microphone and the AU's voice is transmitted to the HU device where it isbroadcast for the HU to hear.
In at least some cases it is contemplated that a second or even a third communication signal may be generated for the HU voice signal that can be transmitted to the AU device and presented along with the sign signal to provide additional benefit to the AU. For instance, it has been recognized that in many cases, while sign language can come close to the meaning expressed in an HU voice signal, in many cases there is no exact translation of a voice message to a sign sequence and therefore some meaning can get lost in the voice to sign signal translation. In these cases, it would be advantageous to present both a text translation and a sign translation to an AU.
In at least some cases it is contemplated that an ASR engine at a relay or operated by a fourth party server linked to a relay may, in parallel with a CA generating a sign signal, generate a text sequence for an HU voice signal. The ASR text signal may be transmitted to an AU device along with or in parallel with the sign signal and may be presented simultaneously as the text and sign signals are generated. In this way, if an AU questions the meaning of a sign signal, the AU can refer to the ASR generated text to confirm meaning or, in many cases, review an actual transcript of the HU voice signal as opposed to a sometimes less accurate sign language representation.
In many cases an ASR will be able to generate text far faster than a CA will be able to generate a sign signal and therefore, in at least some cases, ASR engine text may be presented to an AU well before a CA generated sign signal. In some cases where an AU views, reads and understands text segments well prior to generation and presentation of a sign signal related thereto, the AU may opt to skip ahead and forego sign language for the intervening HU voice signal. Where an AU skips ahead in this fashion, the CA would be skipped ahead within the HU voice signal as well and would continue signing from the skipped-to point on.
In at least some cases it is contemplated that a relay or other systemprocessor may be programmed to compare text signal and sign signal content (e.g.,actual meaning ascribed to the signals) so that time stamps can be applied to text andsign segment pairings thus enabling an AU to skip back through communications toreview a sign signal simultaneously with a paired text tag or other indicator. Forinstance, in at least some embodiments as HU voice is converted by a CA to signsegments, a processor may be programmed to assess the content (e.g., meaning) ofeach sign segment. Similarly, the processor may also be programmed to analyze theASR generated text for content and to then compare the sign segment content to thetext segment content to identify matching content. Where sign and text segmentcontent match, the processor may assign a time stamp to the content matchingsegments and store the stamp and segment pair for subsequent access. Here, if an AUselects a text segment from her AU device display, instead of (or in addition to in someembodiments) presenting an associated HU voice segment, the AU device mayrepresent the sign segment paired with the selected text.
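By way of illustration only, the sketch below pairs sign segment descriptions with ASR text segments using a simple word overlap score as a stand-in for the content analysis described above; a production system would presumably use a far richer meaning comparison, and all names and data here are hypothetical.

    def overlap_score(a, b):
        """Crude content similarity: fraction of shared lowercase words."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    def pair_segments(sign_segments, text_segments, threshold=0.4):
        """Pair each (sign gloss, timestamp) with the best matching ASR text segment."""
        pairs = []
        for gloss, stamp in sign_segments:
            best = max(text_segments, key=lambda seg: overlap_score(gloss, seg))
            if overlap_score(gloss, best) >= threshold:
                pairs.append({"timestamp": stamp, "sign": gloss, "text": best})
        return pairs

    sign_segments = [("meet you at the restaurant", 12.4), ("seven tonight", 15.1)]
    text_segments = ["I will meet you at the restaurant", "at seven tonight"]
    print(pair_segments(sign_segments, text_segments))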
Referring again to Fig. 33, the exemplary CA station includes, among othercomponents, a video camera 55 for taking video of a signing CA to be delivered alongwith transcribed text to an AU. Referring also and again to Fig. 42, a CA signing videowindow is shown at 1390 alongside a text field that includes text corresponding to anHU voice signal. In Fig. 42, if an AU selects the phrase labelled 1386, that phrasewould be visually highlighted or distinguished in some fashion and the associated orpaired sign signal segment would be represented in window 1390.
In at least some video relay systems, in addition to presenting sign and textrepresentations of an HU voice signal, an HU video signal may also be used torepresent the HU during a call. In this regard, see again Fig. 42 where both an HUvideo window 1384 and a CA signing window 1390 are presented simultaneously.
Here, all communication representations 1382, 1384 and 1390 may always be synchronized via time stamps in some cases while in other cases the representations may not be completely synchronized. For instance, in some cases the HU video window 1384 may always present a real time representation of the HU while the text and sign signals 1382 and 1390 are synchronized and typically delayed at least somewhat to compensate for the time required to generate the sign signal as well as AU replay of prior sign signal segments.
In still other embodiments it is contemplated that a relay or other systemprocessor may be programmed to analyze sign signal segments generated by a signingCA to automatically generate text segments that correspond thereto. Here the text isgenerated from the sign signal as opposed to directly from the voice signal andtherefore would match the sign signal content more closely in at least someembodiments. Because the text is generated directly from the sign signal, time stampsapplied to the sign signal can easily be aligned with the text signal and there would beno need for content analysis to align signals. Instead of using content to align, a signsignal segment would be identified and a time stamp applied thereto, then the signsignal segment would be translated to text and the resulting text would be stored in thesystem database correlated to the corresponding sign signal segment and the timestamp for subsequent access.
Fig. 44 shows yet another exemplary AU screen shot 1400 where textsegments are shown at 1402 and an HU video window is shown at 1412. The text 1402includes a block of text where the block is presented in three visually distinguishedways. First, a currently audibly broadcast word is highlighted or visually distinguished ina first way as indicated at 1406. Second, the line of text that includes the word currentlybeing broadcast is visually distinguished in a second way as shown at 1404. Other textlines are presented above and below the line 1404 to show preceding text and followingtext for context. In addition, the line at 1404 including the currently broadcast word at1406 is presented in a larger format to call an AU's attention to that line of text and theword being broadcast. The larger text makes it easier for an AU to see the presentedtext. Moreover, the text block 1402 is controlled to scroll upward while keeping the textline that includes the currently broadcast word generally centrally vertically located onthe AU device display so that the AU can simply train her eyes at the central portion ofthe display with the transcribed words scrolling through the field 1404. In this case, aproperly trained AU would know that prior broadcast words can be rebroadcast bytapping a word above field 1404 and that the broadcast can be skipped ahead bytapping one of the words below field 1404. Video window 1412 is provided spatiallyclose to field 1404 so that the text presented therein is intuitively associated with the HUvideo in window 1412.
In at least some embodiments it is contemplated that when a CA replaces an ASR engine to generate text for some reason and the CA revoices an HU voice signal to an ASR engine to generate the text, instead of providing the voice signal re-voiced by the CA to an ASR engine at the relay, the CA revoicing signal may be routed to the ASR engine that was previously being used to convert the HU voice signal to text. Thus, for instance, where a system was transmitting an HU voice signal to a fourth party ASR engine provider when a CA takes over text generation via re-voicing, each time the CA voices a word, the CA voice signal may be transmitted to the fourth party provider to generate transcribed text which is then transmitted back to the relay and on to the AU device for presentation.
In at least some cases it is contemplated that a system processor may treat at least some CA inputs into the system differently as a function of how well the ASR is likely performing. For instance, as described above, in at least some cases when a CA selects a word in a text transcript on her display screen for error correction, in normal operation the selected word is highlighted for error correction. Here, however, in some cases what happens when a CA selects a text transcript word may be tied to the level of perceived or likely errors in the phrase that includes the selected word. Where a processor determines that the number of likely errors in the phrase is small, the system may operate in the normal fashion so that only the selected word or sub-phrase (e.g., after word selection and a swiping action) is highlighted and prepared for replacement or correction. Where the processor determines that the number of likely errors in the phrase is large (e.g., the phrase is predictably error prone), the system may operate to highlight the entire error prone phrase for error correction so that the CA does not have to perform other gestures to select the entire phrase. Here, when an entire phrase is visually distinguished to indicate ability to correct, the CA microphone may be automatically unmuted so the CA can revoice the HU voice signal to rapidly generate corrected text.
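A minimal sketch of the selection behavior just described, assuming a per-word error likelihood is already available (e.g., derived from ASR confidence scores); the 30% likely-error threshold and all names are illustrative.

    def selection_scope(phrase_words, error_probs, selected_index, threshold=0.3):
        """
        Return the word indices to highlight when the CA selects one word.
        phrase_words: words in the phrase containing the selection.
        error_probs: estimated probability each word is wrong (same length).
        """
        likely_errors = sum(1 for p in error_probs if p > 0.5)
        if likely_errors / max(len(phrase_words), 1) >= threshold:
            # Error-prone phrase: highlight the whole phrase and unmute the CA mic.
            return list(range(len(phrase_words))), "unmute_microphone"
        # Otherwise highlight only the selected word.
        return [selected_index], "keep_muted"

    words = ["pistol", "pals", "at", "seven"]
    probs = [0.9, 0.8, 0.1, 0.2]
    print(selection_scope(words, probs, selected_index=0))   # whole phrase, mic unmuted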
In other cases, while a simple CA word selection may cause that word to be highlighted, some other more complex gesture after word selection may cause the phrase including the word to be highlighted for editing. For instance, a second tap on a word immediately following the word selection may cause a processor to highlight the entire phrase containing the word for editing. Other gestures for phrase, sentence, paragraph, etc., selection are contemplated.
In at least some embodiments it is contemplated that a system processor may be programmed to adjust various CA station operating parameters as a function of a CA's stored profile as well as real time scoring of CA captioning. For instance, CA scoring may lead to a CA profile that indicates a preferred or optimal rate of HU voice signal broadcast (e.g., in words per minute) for a specific CA. Here, the system may automatically use the optimal broadcast rate for the specific CA. As another instance, a processor may monitor the rate of CA captioning, CA correcting and CA error rates and may adjust the rate of HU voice signal broadcast to a rate that results in optimal time and error rate statistics. Here, the rate may be increased during a beginning portion of a CA's captioning shift until optimal statistics result and, if statistics fall off at any time, the system may slow the HU voice signal broadcast rate to maintain errors within an acceptable range.
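The following sketch shows one simple way a broadcast-rate controller of the kind described above might be written; the step size, bounds and target error rate are assumed values and the real system may tune the rate quite differently.

    def adjust_broadcast_rate(current_wpm, recent_error_rate,
                              target_error_rate=0.03,
                              step=5, min_wpm=120, max_wpm=220):
        """
        Nudge the HU voice broadcast rate (words per minute) toward the fastest
        rate at which the CA's recent error rate stays within an acceptable range.
        """
        if recent_error_rate > target_error_rate:
            current_wpm -= step      # CA is falling behind or erring; slow the broadcast
        else:
            current_wpm += step      # statistics look good; probe a faster rate
        return max(min_wpm, min(max_wpm, current_wpm))

    rate = 160
    for errors in [0.01, 0.02, 0.05, 0.04, 0.02]:   # hypothetical per-interval error rates
        rate = adjust_broadcast_rate(rate, errors)
        print(rate)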
In some cases a CA profile may specify separate optimal system settings foreach of several different HU voice signal types or signal characteristics subsets. Forinstance, for a first CA, a first HU voice signal broadcast rate may be used for aHispanic HU voice signal while a second relatively slower HU voice signal broadcastrate may be used for a Caucasian HU voice signal. Many other HU voice signalcharacteristic subsets and associated optimal station operating characteristics arecontemplated.
ASR-CA Backed Up Mode
While several different types of semi-automated systems have been described above, one particularly advantageous system includes an automatic speech recognition system that at least initially handles incoming HU voice signal captioning where the ASR generated text is corrected by a CA and where the CA has the ability to manually (e.g., via selection of a button or the like) take over captioning whenever deemed necessary. Hereinafter, unless indicated otherwise, this type of ASR text first and CA correction second system will be referred to as an ASR-CA backed up mode.
Advantages of an ASR-CA backed up mode include the following. First, initial captiondelay is minimized and remains relatively consistent so that captions can be presentedto an AU as quickly as possible. To this end, ASR engines generate initial captionsrelatively quickly when compared to CA generated text in most cases in steady state.
Second, caption errors associated with current ASR engines can beessentially eliminated by a CA that only corrects ASR errors in most cases and finalcorrected text can be presented to an AU rapidly.
Third, by combining rapid ASR text with the error correction skills of a CA, it ispossible to mix those capabilities in different ways to provide optimal captioning speedand accuracy regardless of characteristics of different calls that are fielded by thecaptioning system.
Fourth, the combination of rapid ASR text and CA error correction enables asystem where an AU can customize their captioning system in many different ways tosuit their own needs and system expectations to enhance their communicationcapabilities.
While various aspects of an ASR-CA backed up mode have been describedabove, some of those aspects are described in greater detail and additional aspects aredescribed hereafter.
While an ASR engine is typically much faster at generating initial caption textthan a CA, in at least some specific cases a CA may in fact be faster than an ASRengine. Whether or not CA captioning is likely to be faster than ASR captioning is oftena function of several factors including, for instance, a CA's particular captioningstrengths and weaknesses as well as characteristics of an HU voice signal that is to becaptioned. For instance, a specific first CA may typically rapidly caption Hispanic voicesignals but may only caption Midwestern voice signals relatively slowly so that whencaptioning a Hispanic signal the CA speed can exceed the ASR speed while the CAtypically cannot exceed the ASR speed when captioning a Midwestern voice signal. Asanother instance, while an ASR may caption high quality HU voice signal faster than thefirst CA, the first CA may caption low quality HU voice signal faster than the ASR.
As described above, in some cases the system may present an option (see caption source switch button 751 in Fig. 23A) for a CA to change from the ASR generating original text and the CA correcting that text to a system where the CA generates original text and corrects errors, and in other cases a system processor may automatically change the system over to CA original and corrected text when the ASR is too slow, is generating too many meaningful (e.g., "visible", changing the meaning of a phrase) transcription errors, or both. In still other cases a system processor that determines that a specific CA, based on CA strengths and HU voice signal characteristics, would likely be able to generate initial text faster than the ASR may be programmed to offer a suggestion to the CA to switch over.
Thus, in some cases the caption source switch button 751 in Fig. 23A may only be presented to a CA as an option when a system processor determines that the specific CA should be able to generate faster initial captions for an HU voice signal. In an alternative, button 751 may always be presented to a CA but may have two different appearances including the full button for selection and a greyed out appearance to indicate that the button is not selectable. Here, by presenting the greyed out button when it is not selectable, the CA will not be confused by the button's absence.
In some cases it may be required that a specific CA be likely to speed up transcription appreciably before button 751 is presented, so that small possible increases in speed do not cause a suggestion to be presented to the CA which could simply distract the CA from error correction. For instance, in an exemplary case, a processor may have to calculate that it is likely a specific CA can speed up transcription by 15% or more in order to present button 751 to the CA for selection.
In some cases the system processor may take into account more than initialcaptioning speed when determining when to present caption source switch button 751to a CA. For instance, in some cases the processor may account for some combinationof speed and some factor related to the number of transcription errors generated by anASR to determine when to present button 751. Here, how speed and accuracy factorsare weighed to determine when button 751 should be presented to a user may be amatter of designer choice and should be set to create a best possible AU experience.
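Purely as an illustration of how speed and accuracy might be weighed, the sketch below decides whether to show a caption source switch option; the 15% speed-up figure echoes the example above, while the error weighting, the score formula and every name are assumptions.

    def should_offer_full_ca(predicted_ca_wpm, asr_wpm, asr_error_rate,
                             min_speedup=0.15, error_weight=2.0, score_threshold=0.15):
        """
        Return True if the CA should be offered the option to take over initial captioning.
        Combines a predicted speed advantage with a penalty-weighted ASR error rate.
        """
        speedup = (predicted_ca_wpm - asr_wpm) / max(asr_wpm, 1)
        score = speedup + error_weight * asr_error_rate
        return speedup >= min_speedup or score >= score_threshold

    # CA predicted to caption this HU only slightly faster, but the ASR is erring often.
    print(should_offer_full_ca(predicted_ca_wpm=155, asr_wpm=150, asr_error_rate=0.08))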
In at least some cases it is contemplated that when the system automaticallyswitches to full CA captioning and correction or the CA selects button 751 to switch tofull CA captioning and correction, the ASR may still operate in parallel with the CA togenerate a second initial version (e.g., a second to the CA generated captions) of theHU voice signal and the system may transmit whichever captions are generated first(e.g., ASR or CA) to the AU device for presentation. Here, it has been recognized thateven when a CA takes over full captioning and correction, which captioning is fastest,ASR or CA, may switch back and forth and, in that case, the fastest captions shouldalways be provided to the AU.
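A minimal sketch of the "fastest captions win" behavior described above, assuming both caption sources tag each caption with the HU voice segment it covers; it simply forwards whichever source produces a caption for a segment first and ignores the later duplicate. The class and callback names are hypothetical.

    class FastestCaptionForwarder:
        """Forward the first caption received for each HU voice segment, from either source."""

        def __init__(self, transmit):
            self.transmit = transmit      # callable that sends text to the AU device
            self.sent_segments = set()    # segment ids already captioned

        def on_caption(self, segment_id, source, text):
            if segment_id in self.sent_segments:
                return                    # the slower source lost the race; ignore it
            self.sent_segments.add(segment_id)
            self.transmit(f"[{source}] {text}")

    forwarder = FastestCaptionForwarder(transmit=print)
    forwarder.on_caption(1, "ASR", "are you coming to dinner")
    forwarder.on_caption(1, "CA", "are you coming to dinner tonight")   # ignored, ASR was first
    forwarder.on_caption(2, "CA", "we can meet at seven")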
As recognized above, in at least some cases third party (e.g., a server in thecloud) ASR engines have at least a couple of shortcomings. First, third party ASRengine accuracy tends to decrease at the end of relatively long voice signal segments tobe transcribed.
Second, ASR engines use context to generate final transcription results andtherefore are less accurate when input voice segments are short. To this end, initialASR results for a word in a voice signal are typically based on phonetics and then, onceinitial results for several consecutive words in a signal are available, the ASR engineuses the context of the words together as well as additional characteristics of the voiceof the speaker generating the voice signal to identify a best final transcription result foreach word. Where a voice segment in an ASR request is short, the signal includes lesscontext in the segment for accurately identifying a final result and therefore the resultstend to be less accurate.
Third, final results tend to be generated in clumps which means that automated ASR error corrections presented to a CA or an AU tend to be presented in spurts which can be distracting. For instance, if five consecutive words are changed in text presented on an AU's device display at the same time, the changes can be distracting.
As described above, one solution to the third party ASR shortcomings is todivide an HU voice signal into signal slices that overlap to avoid inaccuracies related tolong duration signal segments. In addition, to make sure that all final transcriptionresults are contextually informed, each segment slice should be at least some minimumsegment length to ensure sufficient context. Ideally, segment slices sent to the ASRengine as transcription requests would include a predefined number of words within arange (e.g., 3 to 15 words) where the range is selected to ensure at least some level ofcontext to inform the final result. Unfortunately, an HU voice signal is not transcribedprior to sending it to the ASR engine and therefore there is no way to ascertain thenumber of words in a voice segment prior to receiving transcription results back fromthe ASR.
For this reason segment slices have to be time based as opposed to wordcount based where the time range of each segment is selected so that it is likely thesegment includes an optimal number (e.g., 3 to 15 words) of words spoken by an HU.
In at least some cases the time range will be between 1 and 10 seconds and, inparticularly advantageous cases, the range is between 1 and 3 seconds.
Once initial and/or final transcription results are received back at a relay forone or more HU voice signal segments, a relay processor may count the number ofwords in the transcription and automatically adjust the duration of each HU voice signalsegment up or down to adapt to the HU's rate of speech so that each subsequentsegment slice has the greatest chance of including an optimal number of words. Thus,for instance, where an HU talks extremely quickly, an initial segment slice duration offour seconds may be shortened to a two second duration.
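The sketch below shows a simple version of the slice-duration adaptation just described; the target word count sits inside the 3 to 15 word range discussed above, while the clamping bounds and names are assumptions.

    def next_slice_duration(current_duration_s, words_returned,
                            target_words=9, min_s=1.0, max_s=10.0):
        """
        Scale the next HU voice slice duration so it is likely to contain roughly
        the target number of words, based on the word count of the last transcription.
        """
        if words_returned == 0:
            return current_duration_s            # silence or noise; leave the duration alone
        speaking_rate = words_returned / current_duration_s   # words per second
        duration = target_words / speaking_rate
        return max(min_s, min(max_s, duration))

    # A fast talker: 16 words came back for a 4 second slice, so shorten the next slice.
    print(round(next_slice_duration(4.0, 16), 2))   # -> 2.25 seconds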
In at least some cases a relay may only use central portions of ASRtranscribed HU voice signal slices for final transcription results to ensure that all finaltranscribed words are contextually informed. Thus, for instance, where a typical voicesignal slice includes 12 words, the relay processor may only use the third through ninthwords in an associated transcription to correct the initial transcription so that all of thewords used in the final results are context informed.
As indicated above, consecutive HU voice segment slices sent to ASRengines may be overlapped to ensure no word is missed. Overlapping segments alsohas the advantage that more context can be presented for each final transcription word.
At the extreme the relay may transmit a separate ASR transcription request for eachsub-period that is likely to be associated with a word (e.g., based on HU speaking rateor average HU speaking rate) and only one or a small number of transcribed words in areturned text segment may be used as the final transcription result. For instance, whereoverlapping segments each return an average of seven final transcribed words, therelay may only use the middle three of those words to correct initial text presented to theCA and the AU.
Where ASR transcription requests include overlapping HU voice signalsegments, consecutive requests will return duplicative transcriptions of the same words.
In at least some cases the relay processor receiving overlapping text transcriptions willidentify duplicative word transcriptions and eliminate duplication in initial text presentedto the CA and the AU as well as in final results.
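One way to remove the duplicate words that overlapping requests return is sketched below; it aligns consecutive transcriptions on the longest matching word sequence at the boundary, which is an assumed strategy rather than necessarily the one used by the relay.

    def merge_overlapping(prev_words, next_words):
        """Append next_words to prev_words, dropping the words duplicated by the overlap."""
        max_check = min(len(prev_words), len(next_words))
        for k in range(max_check, 0, -1):
            if prev_words[-k:] == next_words[:k]:
                return prev_words + next_words[k:]
        return prev_words + next_words              # no overlap detected

    slice_a = "I think we should go out".split()
    slice_b = "should go out for dinner tonight".split()
    print(" ".join(merge_overlapping(slice_a, slice_b)))
    # -> "I think we should go out for dinner tonight"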
In at least some cases it is contemplated that overlapping ASR requests maycorrespond to different length HU voice signal segments where some of the segmentlengths are chosen to ensure rapid (e.g., essentially immediate) captions and rapidintermediate correction results while other lengths are chosen to optimize for contextinformed accuracy in final results. To this end, a first set of ASR requests may includeshort HU voice signal slices to expedite captioning and intermediate correction speedalbeit while sacrificing some accuracy, and a second set of ASR requests may berelatively longer so that context informed final text is optimally identified.
Referring to Fig. 46, a schematic is shown that includes a single HU voicesignal line of text where the text is divided into signal segments or slices including firstthrough sixth short slices and first, second and third long slices. The first long sliceincludes voice signal associated with the first through third short slices. The first longslice includes many words usable for immediate initial transcription as well as for finalcontextual transcription correction. Each long slice word is transmitted to an ASRengine essentially immediately as the HU voices the segment (e.g., a link to the ASR isopened at the beginning of the long slice and remains open as the HU voices the slice).
Initial transcription of each word in the first long slice is almost immediate and is fedimmediately to the CA for manual correction and to the AU as an initial text transcriptionirrespective of transcription errors that may exist. As more first slice words are voicedand transmitted to the ASR engine, those words are immediately transcribed andpresented to the CA and AU and are also used to provide context for previouslytranscribed words in the first long slice so that errors in the prior words can becorrected.
Referring still to Fig. 46, the second long slice overlaps the first long slice andincludes a plurality of words that correspond to a second slice duration. To handle thesecond long slice transcription, a second ASR request is transmitted to an ASR engineas the HU voices each word in the second slice and substantially real time or immediatetext is transmitted back from the engine for each received word. In addition, as thesecond slice words are transcribed, those words are also used by the ASR engine tocontextually correct prior transcribed words in the second slice to eliminate anyperceived errors and those corrections are used to correct text presented to the CA andthe AU.
The third long slice overlaps the second long slice and includes a plurality ofwords that correspond to a third slice duration. To handle the third long slicetranscription, a third ASR request is transmitted to an ASR engine as the HU voiceseach word in the third slice and substantially real time or immediate text is transmittedback from the engine for each received word. In addition, as the third slice words aretranscribed, those words are also used by the ASR engine to contextually correct priortranscribed words in the third slice to eliminate any perceived errors and thosecorrections are used to correct text presented to the CA and the AU.
It should be apparent from Fig. 46 that, because long slices overlap, two (and in some cases more) transcriptions for many HU voice signal words will be received by a relay from one or more ASR engines and therefore a relay processor has to be programmed to select which of the two or more initial transcriptions for a word to present to a CA and an AU and which of two or more final transcriptions for the word to use to correct text already presented to the CA and AU. In at least some embodiments the relay processor may be programmed to select the first long slice in an HU voice signal for generating initial transcription text for all first long slice words, the second long slice in the voice signal for generating initial transcription text for all second long slice words that follow the end time of the first long slice, and the third long slice in the voice signal for generating initial transcription text for all third long slice words that follow the end time of the second long slice.
In an alternative system, the relay processor may be programmed to selectthe first long slice in an HU voice signal for generating initial transcription text for all firstlong slice words prior to the start time of the second long slice, the second long slice inthe voice signal for generating initial transcription text for all second long slice wordsprior to the start time of the third long slice and the third long slice in the voice signal forgenerating initial transcription text for all third long slice words.
In yet one other alternative system, for words that are included in overlappingsignal slices, the relay processor may pass on the first transcription of any word that isreceived by any ASR engine to the CA and AU devices to be presented irrespective ofwhich slice included the word. Here, a second or other subsequent initial transcriptionof an already presented word may be completely ignored or may be used to correct thealready presented word in some cases.
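The following sketch implements the first of the alternatives above, taking each word's initial transcription from the earliest long slice and taking from later slices only the words spoken after the prior slice's end time; the slice boundaries, time stamps and words here are hypothetical.

    def initial_text_from_slices(slices):
        """
        slices: list of dicts ordered by start time, each with an 'end' time and a
        'words' list of (word, spoken_at_seconds) pairs from its ASR transcription.
        Returns the initial transcript built from the earliest slice plus the words
        of each later slice that follow the prior slice's end time.
        """
        transcript, cutoff = [], float("-inf")
        for s in slices:
            transcript.extend(w for w, t in s["words"] if t > cutoff)
            cutoff = s["end"]
        return " ".join(transcript)

    long_slices = [
        {"end": 6.0, "words": [("are", 1.0), ("you", 1.4), ("coming", 2.0), ("tonight", 5.5)]},
        {"end": 12.0, "words": [("coming", 2.0), ("tonight", 5.5), ("we", 7.0), ("could", 7.5)]},
    ]
    print(initial_text_from_slices(long_slices))   # -> "are you coming tonight we could"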
Referring again to Fig. 46, regarding final ASR text results for error correction, the first long slice transcription includes more contextual content than the second long slice for about the first two thirds of the first slice voice signal, the second long slice transcription includes more contextual content than the first and third long slices for about the central half of the second slice voice signal, and so on. Thus, to provide the most accurate ASR transcription error correction, the relay processor may be programmed to use final ASR text from sub-portions of each long signal slice for error correction, including final ASR text from about the first two thirds of the first long slice, about the central half of the second long slice and about the last two thirds of the third long slice.
Here, because the slices are time based as opposed to word based, the exact sub-portion of each overlapping slice used for final text results can only be approximate untilthe text results are received back from the ASR engines.
Thus, it should be appreciated that different overlapping voice segments orslices may be used to generate initial and final transcriptions of words in at least someembodiments where the segments are selected to optimize for different purposes (e.g.,speed or contextual accuracy).
Referring still to Fig. 46, while shown as consecutive and distinct, consecutive short slices may overlap at least somewhat as described above. Each short slice has a relatively short duration (e.g., 1-3 seconds) and is transmitted to an ASR engine as the HU voices the segment (e.g., a link to the ASR is opened at the beginning of the slice and remains open as the HU voices the slice). Here, initial transcription of each word in a short segment is almost immediate and could in some cases be used to provide the initial transcription of words to a CA and an AU in at least some embodiments. The advantage of shorter voice signal slices in ASR transcription requests is that the ASR should be able to generate more rapid final text transcriptions for words in the shorter segments so that error corrections in text presented to the CA and the AU are completed more rapidly. Thus, for instance, while an ASR may not finalize correction of text at the beginning of the first long slice in Fig. 46 until just after that slice ends so that all of the contextual information in that slice is considered, a different ASR handling the first short slice would complete its contextual error correction just after the end time of the first short slice. Here, because short slice final text is generated relatively rapidly and only affects a small text segment, it can be used to reduce the amount of sporadic large magnitude error corrections that can be distracting to a CA or an AU. In other words, short slice final text error correction is more regular and generally of smaller magnitude than long slice final text error correction.
As explained above, one problem with short voice signal slices is that there is not enough context (e.g., additional surrounding words) in a short slice to result in highly accurate final text. Nevertheless, even short slice context results in better accuracy than initial transcription in most cases and can operate as an intermediate text correction agent to be followed up by long slice final text error correction. To this end, referring yet again to Fig. 46, in at least some embodiments the long text segments may be used to generate initial transcribed text presented to a CA and an AU. Intermediate error corrections in the initial text may be generated via contextual processing of the short signal segments and used immediately as an intermediate error correction for the initial text presented to the CA and AU. Final error corrections in the intermediately corrected text may be generated via contextual processing of the long signal segments and used to finally error correct the intermediately corrected text for both the CA and the AU.
While initial, intermediate and final ASR text may be presented to each of the CA and an AU in some cases, in other embodiments the intermediate text may only be presented to one or the other of the CA and the AU. For instance, where initial text results may be displayed for each of the CA and the AU, intermediate results related to contextual processing of short voice signal slices may be used to in line correct errors in the CA presented text only, to minimize distractions on the AU's display screen.
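To make the layering of corrections concrete, here is a small sketch of the three-stage flow described above, in which long-slice text supplies the initial captions, short-slice context supplies intermediate in line corrections, and fully context-informed long-slice results supply final corrections; the stage names, data shapes and sample corrections are assumptions for illustration.

    def apply_corrections(words, corrections):
        """Replace words by position; corrections maps word index -> corrected word."""
        return [corrections.get(i, w) for i, w in enumerate(words)]

    # Stage 1: initial text from rapid long-slice recognition (errors included).
    initial = "you should bing the cods along".split()

    # Stage 2: intermediate corrections from short-slice contextual results.
    intermediate = apply_corrections(initial, {2: "bring"})

    # Stage 3: final corrections from fully context-informed long-slice results.
    final = apply_corrections(intermediate, {4: "kids"})

    for stage, text in [("initial", initial), ("intermediate", intermediate), ("final", final)]:
        print(stage, "->", " ".join(text))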
While the signal slicing and initial and final text selection processes havebeen described above as being performed by a relay processor, in other embodimentswhere an AU device or even an HU device links to an ASR engine to provide an HUvoice signal thereto and receive text therefrom, the AU or HU device would beprogrammed to slice the voice signal for transmission in a similar fashion and to selectinitial and final and in some cases intermediate text to be presented to system users ina fashion similar to that described above.
While ASR engines operate well under certain circumstances, they are simplyless effective than pure CA transcription systems under other sets of circumstances.
For instance, it has been observed that during a first short period just after an AU-HU call commences and a second short period at the end of the call, when accurate content is particularly time sensitive as well as often unclear and rushed, full CA modes have a clear advantage over ASR-CA backed up modes. For this reason, in at least some embodiments it is contemplated that one type of system may initially link the HU portion of a call to a full CA mode where a CA transcribes text and corrects that text for at least the beginning portion of the call, after which the call is converted to an ASR-CA backed up call where an ASR engine generates initial text and ASR corrections with a CA further correcting the initial and final ASR text. For instance, in some cases the HU voice signal during the first 10-15 seconds of an AU-HU call may be handled by the full CA mode and thereafter the ASR-CA backed up mode may kick in once the ASR has context for subsequent words and phrases to increase overall ASR accuracy.
In some cases only a small subset of highly trained CAs may handle the fullCA mode duties and when the ASR-CA backed up mode kicks in, the call may betransferred to a second CA that operates as a correction only CA most of the time. Inother cases a single CA may operate in the full CA mode as well as in the ASR-CAbacked up mode to maintain captioning service flow.
It has been recognized that for many AUs that have at least partial hearingcapabilities, in most cases during an AU-HU call by far the most important caption textis the text associated with the most recently generated HU voice signal. To this end, inmany cases an AU that has at least partial hearing relies on her hearing as opposed tocaption text to understand HU communications. Then, when an AU periodicallymisunderstands an HU voiced word or phrase, the AU will turn to displayed captions toclarify the HU communication. Here, most AUs want immediate correct text in real timeas opposed to three or six or more seconds later after a CA corrects the text so that thecorrections are as simultaneous with a real time HU voice signal broadcast as possible.
To be clear, in these cases, correct text corresponding to the most recent seven or fewer seconds of HU voice signal is far more important most of the time than correct text associated with HU voice signal from 20 seconds ago.
In these cases and others where accurate, substantially real time text is particularly important, a captioning system processor may be programmed to enforce a maximum cumulative duration of HU voice signal broadcast pause seconds to ensure that all CA correction efforts are at least somewhat aligned with the HU's real time voice signal. For instance, in some cases the maximum cumulative pause may be limited to seven seconds or five seconds or even three seconds to ensure that essentially real time corrections to AU captions occur. In other cases the maximum cumulative delay may be limited by a maximum number of ASR text words so that, for instance, a CA cannot get more than 3 or 5 or 7 words behind the initially generated ASR text.
Referring now to Fig. 52, an exemplary CA display screen shot 1650 isillustrated that presents ASR text to a CA as the CA listens to a hearing user's voicesignal via headset 54 as indicated at 1654. In this case, the CA is restricted to editingonly text that appears in the most recent two lines 1662 and 1664 of the presented textwhich is visually distinguished by an offsetting box labelled 1656. Box 1660 staysstationary as additional ASR generated text is generated and added to the bottom of thetext block 1652 and the on screen text scrolls upward. Again, as in several other figuresdescribed above, a system processor highlights or otherwise visually distinguishes thetext word that corresponds to the instantaneously broadcast HU voice signal word asshown at 1660. Here, however, when the text 1652 scrolls up one line, if the wordbeing broadcast is in the top line 1662 in box 1660 when scrolling occurs, the broadcastto the CA skips to the first word in the next line 1664 when a new line of text is addedthere below. To this end see Fig. 53 where one line of scrolling occurred while thesystem was still broadcasting a word in line 1662 in Fig. 52 so that the highlighted andbroadcast word is skipped ahead to the word "want" at the beginning of line 1664.
In some cases a limitation on CA corrections may be based on the maximum amount of text that can be presented on the CA display screen. For instance, in a case where only approximately 100 ASR generated words can appear on an AU's display screen, it would make little sense to allow a CA to correct errors in ASR text prior to the most recent 100 words because it is highly likely that earlier corrections would not be visible to the AU. Thus, for instance, in some cases a cumulative maximum seconds delay may be set to 20 seconds where text associated with times prior to the 20 second threshold simply cannot be corrected by the CA. In other cases the cumulative maximum delay may be word count based (e.g., the maximum delay may be no more than 30 ASR generated words). In other cases the maximum delay may vary with other sensed parameters such as line signal quality, the HU's speaking rate (e.g., actual or average words per minute), a CA's current or average captioning statistics, etc.
A CA's ability to correct text errors may be limited in several different ways.
For instance, relatively aged text that a CA can no longer correct may be visuallydistinguished (e.g., highlighted, scrolled up into a "firm" field, etc.) in a fashion differentfrom text that the CA can still correct. As another instance, text that cannot becorrected may simply be scrolled off or otherwise removed from the CA display screen.
Where a CA is limited to a maximum number of cumulative delay seconds, the cumulative delay count may be reduced by any perceived HU silent periods that occur between a current time and a time that precedes the current time by the instantaneous delay count. Thus, for instance, if a current delay count is 18 seconds and the most recent 18 seconds include a 12 second HU silent period (e.g., during an AU talking turn), then the cumulative delay may be adjusted downward to 6 seconds as the system will be able to remove the 12 second silent period from CA consideration so that the CA can catch up more rapidly.
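The sketch below combines the maximum cumulative delay limit discussed earlier with the silent period credit described here; the seven second cap is one of the example values above, and the function name and bookkeeping are assumptions.

    def effective_delay(raw_delay_s, silent_periods, max_delay_s=7.0):
        """
        raw_delay_s: seconds the CA currently lags the real time HU voice signal.
        silent_periods: durations (seconds) of HU silence within that lag window.
        Returns the delay the CA is held to after silent periods are credited,
        capped so corrections stay roughly aligned with the live conversation.
        """
        credited = raw_delay_s - sum(silent_periods)
        return min(max(credited, 0.0), max_delay_s)

    print(effective_delay(18.0, silent_periods=[12.0]))   # -> 6.0 seconds
    print(effective_delay(18.0, silent_periods=[]))       # -> capped at 7.0 seconds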
In at least some cases it has been recognized that signal noise can appear on a communication link where the noise has a volume and perhaps other detected characteristics but cannot be identified by an ASR engine as articulated words.
Most of the time in these cases the noise is just that, simply noise. In some cases where a line signal can clearly be identified as noise, a period associated with the noise may be automatically eliminated from the HU voice signal broadcast to a CA for consideration so that those noisy periods do not slow down CA captioning of actual HU voice signal words. In other cases where an ASR cannot identify words in a received line signal but cannot rule out the line signal as noise, a relay processor may broadcast that signal to a CA at a high rate (e.g., 2 to 4 times the rate of HU speech) so that the possibly noisy period is compressed. In most cases where the line signal is actually noise, the CA can simply listen to the expedited signal, recognize the signal as noise, and ignore the signal. In other cases the CA can transcribe any perceived words or may slow down the signal to a normal HU speech rate to better comprehend any spoken words. Here, once the ASR recognizes a word in the HU voice signal and generates a captioned word again, the pace of HU voice signal broadcast can be slowed to the HU's speech rate.
In cases where a CA switches from an ASR-CA backed up mode to a full CAmode, in at least some embodiments, the non-firm ASR generated text is erased fromthe CA's display screen to avoid CA confusion. Thus, for instance, referring again toFig. 23A, if a CA selects the full CA captioning/correction button 751 to initiate a pureCA text transcription and correction process, the CA display screen shot may beswitched to the shot illustrated in Fig. 47. As shown in Fig. 47, firm ASR text prior to thecurrent word considered by the CA at 781 or corrected by the CA persists at 783 butASR generated text thereafter is wiped from the display screen. The label on thecaption source switch button 751 is changed to now present the CA the option to switchback to the ASR-CA backed up type system if desired. The seconds behind field is stillpresent to give the CA a sense of how well she is keeping up with the HU voice signal.
When a CA changes from the ASR-CA backed up mode to a full CA mode, in some embodiments there will be no change in what the AU sees on her display screen and no way to discern that the change took place so that there is no issue with visually disrupting the AU during the switchover. In other embodiments there may be some type of clean break so that the AU has a clear understanding that the captioning process has changed. For instance, see Fig. 48 where, after a CA has selected the full CA mode option, a carriage return occurs after the most recently generated ASR text 1500 and a line 1502 is presented to delineate initial ASR and CA generated text. After line 1502, CA generated text is presented to the AU as indicated at 1504. Here, all ASR text previously presented to the AU persists regardless of whether or not the text is firm, and any initial CA generated text that is inconsistent with ASR generated text is used to correct the ASR generated text via inline correction so that the ASR generated text that is not firm is not completely wiped from the AU's device display screen.
Thus, for instance, in one exemplary system, when a CA takes over initialcaptioning from an ASR, while ASR generated text that follows the point in an HU voicebroadcast most recently listened to or captioned by a CA is removed from the CA'sdisplay screen to avoid CA confusion, that same ASR generated text remains on theAU's display screen so that the AU does not recognize that the switch over to CAcaptioning occurred from the text presented. Then, as the CA re-voices HU voice signalto generate text or otherwise enters data to generate text for the HU voice signal, anydiscrepancies between the ASR generated text on the AU display screen and the CAgenerated text are used to perform in line corrections to the text on the AU display.
Thus, to the CA, the initial CA generated text is seen as new text, while the AU sees the initial CA generated text, up to the end of the prior ASR generated text, as in line error corrections.
When a CA initiates a switch from a full CA mode to an ASR-CA backed up mode, the CA display screen shot may switch from a shot akin to the Fig. 47 shot back to the Fig. 23A shot where the button 751 caption is again switched back to "Full CA Captioning/Correction", the firm text and seconds behind indicator persist at 748A and 755, and ASR generated non-firm text is immediately presented at 769 subsequent to the word 750A currently broadcast (752A) to the CA for consideration and correction.
When a CA initiates a switch from a full CA mode to an ASR-CA backed up mode, again, in some embodiments there may be no change in what the AU sees on her display screen and no way to discern that the switch to the ASR-CA backed up mode took place so that the AU's visual experience of the captioned text is not visually disrupted. In other embodiments the AU display screen shot may switch from a shot akin to the Fig. 48 shot to a screen shot akin to the shot shown in Fig. 49 where a carriage return occurs after the most recently generated CA text 1520 and a line 1522 is presented to delineate initial CA generated and corrected text from the following ASR generated and CA corrected text. After line 1522, ASR generated and CA corrected text is presented to the AU as indicated at 1524. Here, all CA generated text previously presented to the AU persists.
While the CA and AU display screen shots upon caption source switching aredescribed above in the context of CA initiated caption source switching, it should beappreciated that similar types of switching notifications may be presented when an AUinitiates the switching action. To this end, see, for instance, that in some cases whenthe system is operating as a full CA captioning system as in Fig. 48, an "ASR-CA BackUp" button 771 is presented that can be selected to switch back to an ASR-CA backedup mode operation in which case a screen shot similar to the Fig. 49 shot may bepresented to the AU where line 1522 delineates the breakpoint between the CAgenerated initial text above and the ASR generated initial text that follows.
As another instance, see that in some cases when the system is operating asan ASR-CA backed up mode as in Fig. 49, a "Full CA Captioning/Correction" button 773is presented that can be selected to switch back to full CA captioning and correctionsystem operation in which case a screen shot similar to the Fig. 48 shot may bepresented to the AU where line 1502 delineates the breakpoint between the ASRgenerated initial text above and the CA generated initial text that follows.
In at least some embodiments as the system operates in the ASR-CA backedup mode of operation, as text is presented to a CA to consider the text for correction,the CA may be limited to only correcting errors that occur prior to a current point in theHU voice signal broadcast to the CA. Thus, for instance, referring again to Fig. 23Awhere a currently broadcast HU voice signal word is "restaurant", CA corrections maybe limited to text prior to the word restaurant at 748A so that the CA cannot change anyof the words at 769 until after they are broadcast to the CA.
In at least some embodiments when the system is in the ASR-CA backed up mode, a CA mute feature is engaged whenever the CA has not initiated a correction action and automatically disengages when the CA initiates a correction. For instance, referring again to Fig. 50, assume a CA is reviewing the ASR generated text to identify text errors as she is listening to the HU voice signal broadcast. Here, if the CA selects the words "Pistol Pals" via touch as indicated at 1560, the selected text is visually distinguished, the HU voice signal broadcast to the CA halts at the word "restaurant", the CA keyboard becomes active for entering correction text, and the muted CA microphone is activated so that the CA has the option to enter corrective text either via the keyboard or via the microphone. In addition, the HU's voice segment including at least the annunciation related to the selected words "Pistol Pals" is immediately rebroadcast to the CA for consideration while viewing the words "Pistol Pals". Once the CA corrections are completed, the CA microphone is again disabled and the HU voice signal broadcast returns to the word "restaurant" where the broadcast recommences. In some cases selection of the phrase "Pistol Pals" may also open a drop down window with other probable options for that phrase generated by the ASR engine or some other processor function where the CA can quickly select one of those other options if desired.
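By way of a non-limiting illustration only, the mute and pause behavior described above can be summarized as a short state update. The following Python sketch uses hypothetical names (begin_correction, end_correction, a plain dictionary for station state) and is not part of the disclosed apparatus; a real workstation would drive actual audio, keyboard and microphone hardware.

```python
# A minimal sketch, assuming hypothetical state names, of the mute/pause correction
# workflow described above.
def begin_correction(session, selected_phrase, current_word):
    session.update({
        "broadcast": "paused",          # HU voice broadcast halts at current_word
        "resume_word": current_word,    # e.g., "restaurant"
        "selected": selected_phrase,    # e.g., "Pistol Pals", visually distinguished
        "keyboard": "active",           # typed corrections allowed
        "microphone": "unmuted",        # re-voiced corrections allowed
        "replay": selected_phrase,      # rebroadcast only the selected voice segment
    })
    return session

def end_correction(session, corrected_text):
    session.update({
        "correction_sent": corrected_text,   # in line correction transmitted to the AU
        "microphone": "muted",               # CA mute feature re-engaged
        "keyboard": "idle",
        "broadcast": "resumed",              # playback recommences at resume_word
    })
    return session

session = begin_correction({}, "Pistol Pals", "restaurant")
session = end_correction(session, "Pete's Pals")
```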
In some embodiments when a CA starts to correct a word or phrase in an ASR text transcript, once the CA selects the word or phrase for correction, a signal may be sent immediately to an AU device causing the word or phrase to be highlighted or otherwise visually distinguished so that the AU is aware that it is highly likely that the word or phrase is going to be changed shortly. In this way, an AU can recognize that a word or phrase in an ASR text transcription is likely wrong and, if she was relying on the text representation to understand what the HU said, she can simply continue to view the highlighted word or phrase until it is modified by the CA or otherwise cleared as accurate.
Under at least some circumstances an ASR engine may lag an HU voice signal by a relatively long and unacceptable duration. In at least some embodiments it is contemplated that when a relay operates in an ASR-CA backed up mode (e.g., where the ASR generates initial text for correction by a CA), a system processor may track ASR text transcription lag time and, under at least certain circumstances, may automatically switch from the ASR backed up mode to a full CA captioning and correction mode either for the remainder of a call or for at least some portion of the call.
For instance, when an ASR lag time exceeds some threshold duration (e.g., 1-15 seconds), the processor may automatically switch to the full CA mode for a predetermined duration (e.g., 15 seconds) so that a CA can work to eliminate or at least substantially reduce the lag time, after which the system may again automatically revert back to the ASR-CA backed up mode. As another instance, once the system switches to the full CA mode, the system may remain in the full CA mode while the ASR continues to generate ASR engine text in parallel and a system processor may continue to track the ASR lag time; when the lag time drops below the threshold value either for a short duration or for some longer threshold duration of time (e.g., 5 consecutive seconds), the system may again revert back to the ASR-CA backed up operating mode.
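A minimal sketch of the lag based mode switching logic, assuming hypothetical names (ModeController, update) and illustrative threshold values, might look as follows; the 1-15 second switch threshold and the consecutive-seconds revert test follow the examples above.

```python
# Illustrative sketch only; threshold values and names are hypothetical.
SWITCH_TO_FULL_CA_SEC = 8.0   # e.g., somewhere in the 1-15 second range noted above
REVERT_TO_ASR_SEC = 5.0       # lag must stay below the threshold this long to revert

class ModeController:
    def __init__(self):
        self.mode = "ASR_CA_BACKED_UP"
        self._below_since = None  # time at which lag first dropped below the threshold

    def update(self, now, asr_lag_sec):
        """Called periodically with the current time and measured ASR lag."""
        if self.mode == "ASR_CA_BACKED_UP":
            if asr_lag_sec > SWITCH_TO_FULL_CA_SEC:
                self.mode = "FULL_CA"        # CA works to eliminate the lag
                self._below_since = None
        else:  # FULL_CA; the ASR keeps transcribing in parallel so lag is still measured
            if asr_lag_sec <= SWITCH_TO_FULL_CA_SEC:
                if self._below_since is None:
                    self._below_since = now
                elif now - self._below_since >= REVERT_TO_ASR_SEC:
                    self.mode = "ASR_CA_BACKED_UP"
            else:
                self._below_since = None
        return self.mode

controller = ModeController()
controller.update(now=10.0, asr_lag_sec=9.5)   # -> "FULL_CA"
controller.update(now=12.0, asr_lag_sec=3.0)   # lag recovered; revert timer starts
controller.update(now=18.0, asr_lag_sec=2.5)   # -> "ASR_CA_BACKED_UP"
```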
In still other cases where a system processor determines that some other communication characteristic (e.g., line quality, noise level, etc.) or HU voice signal characteristic (e.g., WPM, slurring of words, etc.) is a likely cause of the poor ASR performance, the system may switch to full CA mode and maintain that mode until the perceived communication or voice signal characteristic is no longer detected.
In at least some cases where a third party provides ASR engine services, ASR delay can be identified whenever an HU voice signal is sent to the engine and no text is received back for at least some inordinate threshold period of time.
In at least some cases the ASR text transcript lag time that triggers a switch to a full CA operating mode may be a function of specific skills or capabilities of a specific CA that would take over full captioning and corrections if a switch over occurs.
Here, for instance, given a persistent ASR delay of a specific magnitude, a first CA may be able to caption substantially faster while a second cannot, so that a switch over to the second CA would only be justifiable if the persistent ASR delay were much longer. Here it is contemplated that CA profiles will include speed and accuracy metrics for associated CAs which can be used by the system to assess when to change over to the full CA system and when not to change over, depending on the CA identity and related metrics.
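One possible way to make the switch-over threshold depend on a CA's profile metrics is sketched below; the formula, the base threshold, and the profile fields (words_per_minute, accuracy) are assumptions introduced only to illustrate that a faster, more accurate CA can justify switching at a shorter persistent ASR delay.

```python
# Illustrative only: derive a per-CA lag threshold from hypothetical profile metrics.
BASE_THRESHOLD_SEC = 6.0

def switch_threshold_sec(ca_profile):
    """ca_profile: dict with 'words_per_minute' and 'accuracy' (0.0 to 1.0)."""
    # a faster, more accurate CA justifies switching at a shorter persistent ASR delay
    speed_factor = 180.0 / max(ca_profile["words_per_minute"], 1.0)
    accuracy_factor = 1.0 / max(ca_profile["accuracy"], 0.5)
    return BASE_THRESHOLD_SEC * speed_factor * accuracy_factor

fast_ca = {"words_per_minute": 200, "accuracy": 0.98}
slow_ca = {"words_per_minute": 120, "accuracy": 0.90}
assert switch_threshold_sec(fast_ca) < switch_threshold_sec(slow_ca)
```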
In at least some embodiments it is contemplated that a relay processor may be programmed to coach a CA on various aspects of her relay workstation and how to handle calls generally, and even specific calls while the calls are progressing. For instance, in at least some cases where a CA determines when to switch from an ASR-CA backed up operating mode to a full CA mode, a system processor may track one or more metrics during the ASR-CA backed up operating mode and compare that metric to metrics for the CA in the CA profile to determine when a full CA mode would be better than the ASR-CA backed up mode by at least some threshold value (e.g., 10% faster, 5% more accurate, etc.). Here, instead of automatically switching over to the full CA mode when that mode would likely be more accurate and/or faster by the threshold value, a processor may present a notice or warning to the CA encouraging the CA to make the switch to full CA mode along with statistics indicating the likely increase in captioning effectiveness (e.g., 10% faster, 5% more accurate). To this end, see the exemplary statistics shown at 1541 in Fig. 50 that are associated with a "Full CA Captioning/Correction" button.
In a similar fashion, when a CA operates a relay workstation in a full CA mode, the system may continually track metrics related to the CA's captions, compare those to ASR-CA backed up mode estimates for the specific CA (e.g., based on the CA's profile performance statistics), and may coach the CA on when to switch to the ASR-CA backed up operating mode. In this regard, see for instance the speed and accuracy statistics shown at 753 in Fig. 47 that are associated with the ASR-CA Back Up button 751.
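The coaching comparison described in the two preceding paragraphs might be sketched as follows; the 10% speed and 5% accuracy margins are taken from the example above, while the function and field names are hypothetical.

```python
# Sketch of the coaching comparison; the 10%/5% margins come from the example above,
# the metric names ('wpm', 'accuracy') and function name are hypothetical.
def coaching_notice(current, full_ca_estimate,
                    speed_margin=0.10, accuracy_margin=0.05):
    faster = (full_ca_estimate["wpm"] - current["wpm"]) / current["wpm"]
    more_accurate = full_ca_estimate["accuracy"] - current["accuracy"]
    if faster >= speed_margin or more_accurate >= accuracy_margin:
        return ("Consider switching to Full CA Captioning/Correction: estimated "
                f"{faster:.0%} faster, {more_accurate:.0%} more accurate.")
    return None  # no notice; stay in the current operating mode

notice = coaching_notice(current={"wpm": 100, "accuracy": 0.90},
                         full_ca_estimate={"wpm": 112, "accuracy": 0.95})
# notice suggests switching: estimated 12% faster, 5% more accurate
```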
In at least some embodiments it is contemplated that a CA will be able to set various station operating parameters to preferred settings that the CA perceives to be optimal for the CA while captioning. For instance, in cases where a workstation operating mode can be switched between ASR-CA backed up and full CA, a CA may be able to turn automatic switching on or turn that switching off so that a switch only occurs when the CA selects an on screen or other interface button to make the switch. As another instance, the CA may be able to specify whether or not metrics (e.g., speed and accuracy as at 753 in Fig. 47) are presented to the CA to encourage a manual mode switch. As another instance, a CA may be able to adjust a maximum cumulative captioning delay period that is enforced during calls. As still one other instance, a CA may be able to turn on and off a 2 times or 3 times broadcast rate feature that kicks in whenever a CA latency value exceeds some threshold duration. Many other station parameters are contemplated that may be set to different operating characteristics by a CA.
In at least some cases it is contemplated that a system processor tracking all or at least a subset of CA statistics for all or at least a subset of CAs may routinely compare CA statistical results to identify high and low performers and may then analyze CA workstation settings to identify any common setting combinations that are persistently associated with either high or low performers. Once persistent high performer settings are identified, in at least some cases a system processor may use those settings to coach other CAs and, more specifically, low performing CAs on best practices. In other cases, persistent high performer settings may be presented to a system administrator to show a correlation between those settings and performance and the administrator may then use those settings to develop best practice materials for training other CAs.
For example, assume that several CAs set workstation parameters such that a system processor only broadcasts HU voice signal corresponding to phrases that have confidence factors of 6/10 or less at the HU's speaking rate and speeds up broadcast of any HU voice signal corresponding to phrases that have 7/10 or greater confidence factors to 2X the HU's speaking rate. Also assume that these settings result in substantially faster CA error correction than other station settings. In this case, a notice may be automatically generated to lower performing CAs encouraging each to experiment with the expedited broadcast settings based on ASR text confidence factors.
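A sketch of the confidence-factor broadcast setting in this example is shown below; the 6/10 cutoff and the 2X expedited rate follow the example, and the phrase list and function name are illustrative only.

```python
# Sketch of the confidence-factor broadcast setting from the example above.
def broadcast_rate(confidence_10, normal_rate=1.0, expedited_rate=2.0):
    """Phrases with ASR confidence of 6/10 or less play at the HU's own rate;
    higher confidence phrases play at twice that rate."""
    return normal_rate if confidence_10 <= 6 else expedited_rate

phrases = [("I'll meet you at", 9), ("Pistol Pals", 4), ("at seven", 8)]
rates = [(text, broadcast_rate(conf)) for text, conf in phrases]
# -> [("I'll meet you at", 2.0), ("Pistol Pals", 1.0), ("at seven", 2.0)]
```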
Various system gaming aspects have been described above where CA statistics are presented to a CA to help her improve skills and captioning services in a fun way. In some cases it is contemplated that a system processor may routinely compare a specific CA with her own average and best statistics and present that information to the CA either routinely during calls or at the end of each call so that the CA can compete against her own prior statistics. In some cases two or more CAs may be pitted against each other, sort of like a race, to see who can caption the fastest, correct more errors in a short period of time, generate the most accurate overall caption text, etc. In some cases CAs may be able to challenge each other and may be presented real time captioning statistics during a challenge session where each gets to compare their statistics to the other CA's real time statistics. To this end, see the exemplary dual CA statistics shown at 771 in Fig. 47 where the statistics shown include average captioning delay, accuracy level and number of errors corrected for a CA using a station that includes the display screen 50 and for another CA, Bill Blue, captioning and correcting at a different station. Leaders in each statistical category are visually distinguished. For instance, statistic values that are best in each category are shown double cross hatched in Fig. 47 to indicate green highlighting.
While CA call and performance metrics may be textually represented in some cases, in other cases particularly advantageous metric indicators may have at least some graphic characteristics so that metrics can be understood at a simple glance. For instance, see the graphical performance representation at 787 in Fig. 47 where arrows 789 representing instantaneous statistics dynamically float along horizontal accuracy and speed scales to indicate performance characteristics. In some cases the graphical characteristics may be calculated relative to personal averages from a specific CA's profile and in other cases the characteristics may be calculated relative to all or a subset of CAs associated with the system.
In some embodiments it is contemplated that CAs may be automatically rewarded for good performance or increases in performance over time. For instance, for each 2 hours a CA performs at or above some threshold performance level, she may be rewarded with a coupon for coffee or some other type of refreshment. As another instance, when a CA's persistent error correction performance level increases by 5% over time, she may be granted one paid hour off at the end of the week. As yet one other instance, where CAs compete head to head in a captioning and correcting contest, the winner of a contest may be granted some reward to incent performance increases over time.
In line error corrections are described above where initial ASR or CA generated text is presented to an AU immediately upon being generated and then, when a CA or an ASR corrects an error in the initial text, the erroneous text is replaced "in line" in the text already presented to the AU. In at least some cases the corrected text is highlighted or otherwise visually distinguished so that an AU can clearly see when text has been corrected. Major and minor errors are also described, where a minor error is one that, while wrong, does not change the meaning of an including phrase while a major error does change the meaning of an including phrase.
It has been recognized that when text on an AU display screen is changed and visually distinguished often, the cumulative highlighted changes can be distracting.
For this reason, in at least some embodiments it is contemplated that a system processor may filter CA error corrections and may only change major errors on an AU display screen so that minor errors that have no effect on the meaning of including phrases are simply not shown to the AU. In many cases limiting AU text error correction to major error corrections can decrease in line on screen corrections by 70% or more, substantially reducing the level of distraction associated with the correction process.
To implement a system where only major errors are corrected on the AU display screen, all CA error corrections may be considered in context by a system processor (e.g., within including phrases) and the processor can determine if the correction changes the meaning of the including phrase. Where the correction affects the meaning of the including phrase, the correction is sent to the AU device along with instructions to implement an in line correction. Where the correction does not affect the meaning of the including phrase, the error may simply be disregarded in some embodiments and therefore never sent to the AU device. In other cases where a correction does not affect the meaning of the including phrase, the correction may still be transmitted to the AU device and used to correct the error in a call text archive maintained by the AU device, as opposed to in the on screen text. In this way, if the AU goes back in a call transcript to review content, all errors, both major and minor, are corrected.
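A minimal sketch of routing corrections based on whether they change meaning is shown below; the changes_meaning() heuristic here (ignoring capitalization and punctuation) is a stand-in assumption, as an actual system would apply a more capable semantic comparison to the including phrase.

```python
import string

# Placeholder heuristic: treat corrections that differ only in capitalization or
# punctuation as "minor"; a real system would apply a semantic comparison to the
# including phrase to decide whether the meaning changed.
def changes_meaning(original_phrase, corrected_phrase):
    strip = lambda s: s.lower().translate(str.maketrans("", "", string.punctuation))
    return strip(original_phrase) != strip(corrected_phrase)

def route_correction(original_phrase, corrected_phrase):
    """Returns where the correction should be applied."""
    if changes_meaning(original_phrase, corrected_phrase):
        return "inline_on_au_screen"   # major error: visible in line correction
    return "call_archive_only"         # minor error: corrected only in the transcript archive

# route_correction("pistol pals", "Pistol Pal's")  -> "call_archive_only"
# route_correction("Pistol Pals", "Pete's")        -> "inline_on_au_screen"
```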
In other embodiments, instead of only correcting major errors on an AU device display screen, all errors may be corrected but the system may only highlight or otherwise visually distinguish major errors to reduce error correction distraction. Here, the thinking is that if an AU cares at all about error corrections, the most important corrections are the ones that change the meaning of an including phrase and therefore those changes should be visually highlighted in some fashion.
CA Sensors
CA station sensor devices can be provided at CA workstations to further enhance a CA's captioning and error correction capabilities. To this end, in at least some embodiments some type of eye trajectory sensor may be provided at a CA workstation for tracking the location on a CA display screen that a CA is looking at so that a word or phrase on the screen at the location instantaneously viewed by the CA can be associated with the CA's line of sight. To this end, see, for instance, the CA workstation 1700 shown in Fig. 54 that includes a display screen 50, keyboard 52 and headphones 54 as described above with respect to Fig. 1. In addition, the station 1700 includes an eye tracking sensor system, represented by numeral 1702, that is directed at a CA's location at the station and specifically positioned to capture images or video of the CA using the station. The camera field of view (FOV) is indicated at 1712 and is specifically trained on the face of a CA 1710 that currently occupies the station 1700.
Referring still to Fig. 54 and also to Fig. 55, images from sensor 1702 can be used to identify the CA's eyes and, more specifically, the trajectory of the CA's line of sight as labelled 1714. As best shown in Fig. 55, the CA's line of sight intersects the display screen 50 at a specific location where the text word "restaurant" is presented. In some embodiments, as illustrated, the word a CA is currently looking at on the screen 50 will be visually highlighted or otherwise distinguished as feedback to the CA indicating where the system senses that the CA is looking. Known eye tracking systems have been developed that generate invisible bursts of infrared light that reflect differently off a station user's eyes depending on where the user is looking. A camera picks up images of the reflected light which are then used to determine the CA's line of sight trajectory. In other cases a CA may wear a headset that tracks headset orientation in the ambient as well as the CA's pupil to determine the CA's line of sight.
Other eye tracking systems are known in the art and any may be used in various embodiments.
Here, instead of having to move a mouse cursor to a word on the display screen or having to touch the word on the screen to select it, a CA may simply tap a selection button on her keyboard 52 once to select the highlighted word (e.g., the word subtended by the CA's line of sight) for error correction. In some cases a double tap of the keyboard selection button may cause the entire phrase, or several words before and after the highlighted word, to be selected for error correction.
Once a word or phrase is selected for error correction, the current HU voice signal broadcast 1720A may be halted and the selected word or phrase may be differently highlighted or visually distinguished and then re-broadcast for CA consideration as the CA uses the keyboard or microphone to edit the highlighted word or phrase. Once the word or phrase is corrected, the CA can tap an enter key or other keyboard button to enter the correction and cause the corrected text to be transmitted to the AU device for in line correction. Once the enter key is selected, HU voice signal broadcast would recommence at the word 1720 where it left off.
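By way of illustration only, mapping a sensed line of sight coordinate to the word displayed at that screen location might be sketched as follows; the word layout rectangles and coordinates are hypothetical.

```python
# Illustrative mapping of a sensed line-of-sight screen coordinate to the displayed
# word at that location; the layout rectangles and coordinates are hypothetical.
def word_at_gaze(gaze_x, gaze_y, layout):
    """layout: list of (word, x0, y0, x1, y1) screen rectangles in pixels."""
    for word, x0, y0, x1, y1 in layout:
        if x0 <= gaze_x <= x1 and y0 <= gaze_y <= y1:
            return word
    return None

layout = [("Pistol", 100, 40, 160, 60), ("Pals", 165, 40, 205, 60),
          ("restaurant", 210, 40, 300, 60)]
selected = word_at_gaze(230, 50, layout)   # -> "restaurant"
# A single keyboard tap would then select 'selected' for correction; a double tap
# could expand the selection to several words before and after it.
```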
In some embodiments the eye tracking feature may be used to monitor CA activity and, specifically, whether or not the CA is considering all text generated by an ASR or by CA re-voicing software. Here, other metrics may include the percentage of text words viewed by a CA for error correction, the durations of time required to make error corrections, etc.
In at least some embodiments it is contemplated that two or more ASR engines of different types (e.g., developed and operated by different entities) may be available for HU voice signal captioning. In these cases, it is contemplated that one of the ASR engines may generate substantially better captioning results than the other engines. In some cases it is contemplated that at the beginning of an AU-HU call, the HU voice may be presented to two or more ASR engines so that two or more HU voice signal text transcripts are generated. Here, a CA may correct one of the ASR text transcripts to generate a "truth" transcript presented to an AU. The truth transcript may be automatically compared by a processor to each of the ASR text transcripts associated with the call to rank the ASR engines best to worst for transcribing the specific call. Then, the system may automatically start using the best ASR engine for transcription during the call and may discontinue use of the other engines for the remainder of the call. In other cases, while the other engines may be disabled, they may be re-enabled if captioning metrics deteriorate below some threshold level and the process above of assigning metrics to each engine as text transcripts are generated may be repeated to identify a current best ASR engine to continue servicing the call.
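A sketch of ranking parallel ASR engines against the CA corrected "truth" transcript is shown below; the simple word mismatch count is an assumption standing in for whatever scoring metric a production system would use (e.g., a full word error rate alignment), and the engine names and sample text are illustrative.

```python
# Sketch of ranking parallel ASR engines against the CA corrected "truth" transcript;
# a simple word mismatch count stands in for a full word error rate alignment.
def error_count(truth, candidate):
    truth_words, cand_words = truth.lower().split(), candidate.lower().split()
    mismatches = sum(1 for t, c in zip(truth_words, cand_words) if t != c)
    return mismatches + abs(len(truth_words) - len(cand_words))

def best_engine(truth, engine_outputs):
    """engine_outputs: dict of engine name -> transcript text for the same HU audio."""
    return min(engine_outputs, key=lambda name: error_count(truth, engine_outputs[name]))

truth = "meet me at Pete's restaurant at seven"
outputs = {"engine_a": "meet me at Pal's rest ant at seven",
           "engine_b": "meet me at Pete's restaurant at seven"}
# best_engine(truth, outputs) -> "engine_b"; the other engines may then be idled.
```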
In at least some cases one or more biometric sensors may be included within an AU's caption device that can be used for various purposes. For instance, see again Fig. 1 where a camera 75 is included in device 12 for obtaining images of an AU using the caption device 12 during a voice communication with an HU. Other biometric sensor devices are contemplated such as, for instance, the microphone in handset 22, a fingerprint reader 23 on device 12 or handset 22, etc., each of which may be used to confirm AU user identity.
One purpose for camera 75 or another biometric sensor device may be to recognize a specific AU and only allow the captioning service to be used by a certified hearing impaired AU. Thus, for instance, a software application run by a processor in device 12 or by the system server 30 may perform a face recognition process each time device 12 is activated, each time any person is located within the field of view of camera 75, each time the camera senses movement within its FOV, etc. In this case it is contemplated that any AU that is hearing impaired would have to pre-register with the system, where the system is initially enabled by scanning the AU's face to generate a face recognition model which would be stored for subsequent device enablement processes. In other cases it is contemplated that hearing specialists or physicians may, upon diagnosing an AU with sufficient hearing deficiency to warrant the captioning service, obtain an image of the AU's face or an entire 3D facial model using a smart phone or the like which is uploaded to a system server 30 and stored with user identification information to facilitate subsequent facial recognition processes as contemplated here. In this way, AUs that are not comfortable with computers or technology may be spared the burden of commissioning their caption devices at home which, for some, may not be intuitive.
After a caption device is set up and commissioned, once an authorized AU is detected in the camera FOV, device 12 may operate in any of the ways described above to facilitate captioned or non-captioned calls for an AU. Where a person not authorized to use the caption service uses device 12 to make a call, device 12 may simply not provide any caption related features via the graphical display screen so that device 12 operates like a normal display based phone device.
In other cases images or video from camera 75 may be provided to an HU or even a CA to give either or both of those people a visual representation of the AU so that each can get a sense, from non-verbal cues, of the effectiveness of communications with the AU. When a visual representation of the AU is presented to either or both of the HU and CA, some clear indicator of the visual representation will be given to the AU such as, for instance, a warning message on display 18 of device 12. In fact, prior to presenting AU images or video to others, device 12 may seek AU authorization in a clear fashion so that the AU is not caught off guard.
In at least some embodiments described above, ASR or other currently best caption text (e.g., CA generated text in a full CA mode of operation) is presented immediately or at least substantially immediately to an AU upon generation and subsequently, when an error in that initial text is corrected, the error is corrected within the text presented to the AU by replacing the initial erroneous text with corrected text.
To notify the AU that the text has been modified, the corrected text is highlighted or otherwise visually distinguished in line. It has been recognized that while highlighting or other tagging to distinguish corrected text is useful in most cases, those highlights or tags can become distracting under certain circumstances. For instance, when substantial or frequent error corrections are made, the new text highlighting can be distracting to an AU participating in a call.
In some cases, as described above, a system processor may be programmed to determine if error corrections result in a change in meaning of an including sentence and may only highlight error corrections that are meaningful (e.g., that change the meaning of the including sentence). Here, all error corrections would be made on the AU device display but only meaningful error corrections would be highlighted.
In other cases it is contemplated that all error corrections may be visually distinguished, where meaningful corrections are distinguished in one fashion and minor (e.g., not changing the meaning of the including sentence) error corrections are distinguished in a relatively less noticeable fashion. For instance, minor error corrections may be indicated by italicizing text swapped into the original text while meaningful corrections are indicated via yellow, green or some other type of highlighting.
In still other cases all error corrections may be distinguished initially upon being made but the highlighting or other distinguishing effect may be modified based on some factor such as time, the number of words captioned since the error was corrected, the number of error corrections since the error was corrected, or some combination of these factors. For example, an error correction may initially be highlighted bright yellow and, over the next 8 seconds, the highlight may be dimmed until it is no longer visually identifiable. As another example, a first error correction may be highlighted bright yellow and that highlighting may persist until each of a second and third error correction that follows the first correction is made, after which the first error correction highlighting may be completely turned off. As yet one other instance, an error correction may be initially highlighted bright yellow and bolded and, after 8 subsequent text words are generated, the highlighting may be turned off while the bold effect continues. Then, after a next two error corrections are made, the bold effect on the first error correction may be eliminated. Many other expiring error correction distinguishing effects are contemplated.
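An expiring correction emphasis of the kind described above might be sketched as follows; the specific decay rule (highlight drops after roughly 8 seconds, all special effects drop after two later corrections) mirrors the examples given and is otherwise illustrative only.

```python
# Illustrative "expiring" correction emphasis: the effect decays with time and with
# the number of corrections made after this one; the decay values are examples only.
def correction_style(seconds_since_correction, later_corrections):
    if later_corrections >= 2:
        return "normal"                  # two newer corrections made: no special effect
    if seconds_since_correction < 8:
        return "highlight+bold+italic"   # freshly corrected text
    return "bold+italic"                 # highlight dims out after roughly 8 seconds

# correction_style(2, 0)  -> "highlight+bold+italic"
# correction_style(12, 1) -> "bold+italic"
# correction_style(20, 3) -> "normal"
```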
Referring now to Fig. 56, a screen shot of an AU interface is shown that may be presented on a caption device display 18 that shows caption text that includes some errors where a first error is shown corrected at 2102 (e.g., the term "Pal's" has been corrected and replaced with "Pete's"). As illustrated, the new term "Pete's" is visually distinguished in two ways including highlighting and changing the font to be bold and italic.
Referring also to Fig. 57, a screen shot similar to the Fig. 56 shot is shown, albeit where a second error (e.g., "John") has been corrected and replaced in line with the term "join" 1204. In this example, the correction distinguishing rules are that a most recent error correction is highlighted, bold and italic, a second most recent error correction is indicated only via bold and italic font (e.g., no highlighting), and that when two error corrections occur after any error correction, the earliest of those corrections is no longer visually distinguished (e.g., is shown as regular text). Thus, in Fig. 57, the error correction at 1202 is now distinguished by bold and italic font but is no longer highlighted and the most recent error correction at 1204 is highlighted and shown via bold and italic font.
Referring to Fig. 58, a screen shot similar to the Fig. 56 and Fig. 57 shots is shown, albeit where a third error (e.g., "rest ant") has been corrected and replaced in line with the term "restaurant" 2106. Consistent with the correction distinguishing rules described above, the most recent correction 1206 is shown highlighted, bolded and italic, the prior error correction at 1204 is shown bolded and italic, and the error correction at 1202 is shown as normal text with no special effect.
In any case where a second CA is taking over primary captioning from either an ASR or a first or initial CA at a specific point in an HU voice signal, the system may automatically broadcast to the second CA at least a portion of the HU voice signal that precedes the point at which the second CA is taking over captioning to provide context for the second CA. For instance, the system may automatically broadcast 7 seconds of HU voice signal that precede the point where the second CA takes over captioning so that when the CA takes over, the CA has context in which to start captioning the first few words of the HU voice signal to be captioned by the CA. In at least some cases the system may audibly distinguish HU voice signal provided for context from HU voice signal to be captioned by the CA so that the CA has a sense of which signal to caption and which is simply presented as context. For instance, the tone or pitch or rate of broadcast or volume of the contextual HU voice signal portion may be modified to distinguish that portion of the voice signal from the signal to be captioned.
Systems have been described above where ongoing calls are automatically transferred from a first CA to a second CA based on CA expertise in handling calls with specific detected characteristics. For instance, a call where an HU has a specific accent may be transferred mid-call to a CA that specializes in the detected accent, a call where a line is particularly noisy may be transferred to a CA that has scored well in terms of captioning accuracy and speed for low audio quality calls, etc. One other call characteristic that may be detected and used to direct calls to specific CAs is call subject matter related to specific technical or business fields where specific CAs having expertise in those fields will typically have better captioning results. In these cases, in at least some embodiments, a system processor may be programmed to detect specific words or phrases that are telltale signs that call subject matter is related to a specific field or discipline handled best by specific CAs and, once that correlation is determined, an associated call may be transferred from an initial CA to a second CA that specializes in captioning that specific subject matter.
In some cases an AU may work in a specific field in which the AU and many HUs that the AU converses with use complex field specific terminology. Here, a system processor may be programmed to learn over time that the AU is associated with the specific field based on conversation content (e.g., content of the HU voice signal and, in some cases, content of an AU voice signal) and, in addition to generating an utterance and text word dictionary for an AU, may automatically associate specific CAs that specialize in the field with any call involving the AU's caption device (as identified by the AU's phone number or caption device address). For instance, if an AU is a neuroscientist and routinely participates in calls with industry colleagues using complex industry terms, a system processor may recognize the terms and associate the terms and the AU with an associated industry. Here, specific CAs may be associated with the neuroscience industry and the system may associate those CAs with the calling number of the AU so that, going forward, all calls involving the AU are assigned to CAs specializing in the associated industry whenever one of those CAs is available. If a specialized CA is not available at the beginning of a call involving the AU, the system may initiate captioning using a first CA and then, once a specialized CA becomes available, may transfer the call to the available CA to increase captioning accuracy, speed or both.
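A keyword based subject matter detector of the kind described above might be sketched as follows; the term lists, the minimum-hit rule and the CA identifiers are hypothetical and used only for illustration.

```python
# Illustrative keyword based subject matter detection used to route a call to a
# specialized CA; term lists, the minimum-hit rule and CA identifiers are hypothetical.
FIELD_TERMS = {
    "neuroscience": {"synapse", "cortex", "neurotransmitter", "axon"},
    "finance": {"amortization", "derivative", "liquidity", "escrow"},
}
SPECIALIST_CAS = {"neuroscience": ["CA-17", "CA-42"], "finance": ["CA-08"]}

def detect_field(transcript_words, min_hits=3):
    hits = {field: len(terms & {w.lower() for w in transcript_words})
            for field, terms in FIELD_TERMS.items()}
    field, count = max(hits.items(), key=lambda kv: kv[1])
    return field if count >= min_hits else None

def candidate_cas(transcript_words):
    field = detect_field(transcript_words)
    return SPECIALIST_CAS.get(field, []) if field else []

words = "the synapse fires when a neurotransmitter reaches the axon".split()
# candidate_cas(words) -> ["CA-17", "CA-42"]
```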
In some cases it is contemplated that an AU may specify a specific field or fields that the AU works in so that the system can associate the AU with specific CAs that specialize in captioning for that field or those fields. For instance, in the above example, a neuroscientist AU may specify neuroscience as her field during a caption device commissioning process and the system may then associate ten different CAs that specialize in calls involving terminology in the field of neuroscience with the AU's caption device. Thereafter, when the AU participates in a call and requires CA captioning, the call may be linked to one of the associated specialized CAs when one is available.
In some embodiments it is contemplated that a system may track AU interaction with her caption device and may generate CA preference data based on that interaction that can be used to select or avoid specific CAs in the future. For instance, where an AU routinely indicates that the captioning procedure handled by a specific CA should be modified, once a trend associated with the specific CA for the specific AU is identified, the system may automatically associate the CA with a list of CAs that should not be assigned to handle calls for the AU.
In some cases it is contemplated that the system may enable an AU to indicate perceived captioning quality at the end of each call, or at the end of specific calls based on caption confidence factors or some other metric(s), so that the AU can directly indicate a non-preference for CAs. Similarly, an AU may be able to indicate a preference for a specific CA or that a particular caption session was exceptionally good, in which case the CA may be added to a list of preferred CAs for the AU. In these cases, calls with the AU would be assigned to preferred CAs and not assigned to CAs on the non-preferred list. Here, at the end of each of a subset of calls, an AU may be presented with touch selectable icons (e.g., "Good Captioning"; "Unsatisfactory Captioning") enabling the AU to indicate a satisfaction level for the captioning service related to the call.
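A sketch of CA assignment that honors the per-AU preferred and non-preferred lists described above follows; the identifiers and the simple first-match policy are assumptions used only for illustration.

```python
# Sketch of CA assignment honoring per-AU preferred and non-preferred lists built
# from the AU's end-of-call feedback; all identifiers are hypothetical.
def assign_ca(available_cas, preferred, non_preferred):
    for ca in available_cas:
        if ca in preferred:
            return ca              # a preferred CA is free: assign her first
    for ca in available_cas:
        if ca not in non_preferred:
            return ca              # otherwise any available CA not on the avoid list
    return None                    # hold the call until a suitable CA frees up

# assign_ca(["CA-03", "CA-17"], preferred={"CA-17"}, non_preferred={"CA-03"}) -> "CA-17"
```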
While embodiments are described above where specific CAs are associated with preferred and non-preferred lists or optimal and non-optimal lists for specific AUs, it should be appreciated that similar preferences or optimality ratings may be ascribed to different captioning processes. For instance, a first AU may routinely rank ASR captioning poorly but full CA captioning highly and, in that case, the system may automatically configure itself so that all calls for the first AU are handled via full CA captioning. For a second AU, the system may automatically generate caption confidence factors and use those factors to determine that the mix of captioning speed and accuracy is almost always best when initial captions are generated via an ASR system and one of 25 CAs that are optimal for the second AU is assigned to perform error corrections on the initial caption text.
To apprise the public of the scope of the present invention the following claims are made.