Controlling an apparatus based on speech
The invention relates to a method of controlling an apparatus based on speech.
The invention further relates to an apparatus being arranged to be controlled on basis of speech.
The invention further relates to a consumer electronics system comprising such an apparatus.
The invention further relates to a speech control unit for controlling an apparatus on basis of speech.
Voice control as an interaction modality for products, e.g. consumer products, is getting more mature. However, people perceive it as strange, uncomfortable or even unacceptable to talk to a product like a television. To avoid that conversations or utterances not intended for controlling the products are recognized and executed, most voice controlled systems require the user to activate the system, resulting in a time span, also called an attention span, during which the system is active. Such an activation may be performed via voice, for instance by the user speaking a keyword, like "TV". By using an anthropomorphic character a barrier for interaction is removed: it is more natural to address the character instead of the product, e.g. by saying "Bello" to a dog-like character. Moreover, a product can make effective use of one object with several appearances, chosen as a result of several state elements. For instance, a basic appearance like a sleeping animal can be used to show that the system is not yet active. A second group of appearances can be used when the system is active, e.g. awake appearances of the animal. The progress of the attention span can then, for instance, be expressed by the angle of the ears: fully raised at the beginning of the attention span, fully down at the end. Similar appearances can also express whether or not an utterance was understood: an "understanding look" versus a "puzzled look". Audible feedback can also be combined, like a "glad" bark if a speech item has been recognized. A user can quickly grasp the feedback on all such system elements by looking at the one appearance which represents all these elements, e.g. raised ears and an "understanding look", or lowered ears and a "puzzled look". Once a user has started an attention span the product is in a state of accepting further speech items. These speech items will be recognized and associated with voice commands.
A number of voice commands together will be combined into one instruction for the product. E.g. a first speech item is associated with "Bello", resulting in a wake-up of the television. A second speech item is associated with the word "channel" and a third speech item is associated with the word "next". The result is that the television will switch, i.e. get tuned, to the next broadcasting channel. However, if another user starts talking during the attention span of the television just initiated by the first user, then the communication between the first user and the television might suffer interference. The probability is high that the television is not able to construct the appropriate instruction matching the intention of the first user.
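The combination of a wake-up keyword and subsequent command words into one product instruction, as described above, can be sketched as follows. This is an illustrative sketch only: the lowercase command names and the lookup table are hypothetical examples, not part of any real product interface.

```python
# Hypothetical mapping from a recognized sequence of voice commands to one
# instruction for the apparatus; only the two sequences from the description
# are listed here.
COMMAND_SEQUENCES = {
    ("bello", "channel", "next"): "Increase_Frequency_Band",
    ("bello", "more", "sound"): "Increase_Sound_Level",
}

def create_instruction(voice_commands):
    """Combine a series of voice commands into one instruction, or None
    when the sequence does not form a known instruction."""
    key = tuple(c.lower() for c in voice_commands)
    return COMMAND_SEQUENCES.get(key)
```

For example, the series "Bello", "Channel", "Next" yields the instruction "Increase_Frequency_Band", whereas an incomplete or interfered series yields no instruction at all.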
An embodiment of the apparatus of the kind described in the opening paragraph is known from US 6,230,137. That patent discloses that the control program of the apparatus is configured in such a way that successive voice signals, i.e. speech items, can only form a control command, i.e. instruction, when the successive voice signals are input within a given time period. This means, however, that the apparatus according to the cited prior art is not capable of dealing with multiple users who provide speech items to the same apparatus within the given time period.
It is an object of the invention to provide a method of the kind described in the opening paragraph with an improved construction of the instruction on the basis of speech from a user.
The object of the invention is achieved in that the method comprises:
- receiving a series of speech items, starting with a first speech item of a first user of the apparatus;
- transforming a selection of the received speech items into voice commands corresponding to respective recognized speech items which are classified as belonging to the first user on basis of a voice profile of the first user;
- creating an instruction for the apparatus by means of combining the voice commands; and
- providing the instruction for execution by the apparatus.
An important aspect of the invention is that received speech items are classified: does this speech item belong to the first user? Preferably speech items which do not belong to the user who started the attention span of the apparatus are ignored. The result is that only speech items of the first user are used for other operations to create the eventual instruction for the apparatus. The operations comprise matching with a vocabulary list, i.e. dictionary, and matching with a language model, i.e. grammatical test. Hence, it can be seen as if the attention span is assigned to the first user.
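The classification step described above can be sketched as follows. This is a minimal sketch under stated assumptions: a real system would compare acoustic features of each speech item with a stored speaker model; here each speech item carries a hypothetical feature vector and the voice profile is simply a reference vector with a distance threshold.

```python
def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def belongs_to_user(speech_item, voice_profile, threshold=1.0):
    """Classify a speech item: does it belong to the profiled user?
    The threshold value is a hypothetical choice."""
    return distance(speech_item["features"], voice_profile) <= threshold

def select_user_items(speech_items, voice_profile):
    """Keep only the speech items classified as belonging to the first
    user; all other items are ignored, as in the preferred embodiment."""
    return [s for s in speech_items if belongs_to_user(s, voice_profile)]
```

Only the selected items are then passed on to the further operations, i.e. matching with the vocabulary list and the language model.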
In an embodiment of the method according to the invention, transforming the selection of received speech items comprises:
- classifying the received speech items by means of comparing the received speech items with the voice profile of the first user; and
- recognizing speech items being classified as belonging to the first user, thereby associating voice commands to respective recognized speech items.
In this embodiment the recognition step is performed after the classification step, so that only those speech items have to be recognized which have been classified as belonging to the first user. This has a positive influence on the resource usage. An inverse order is also possible: first recognizing and then classifying. In general, each of the recognition and classification steps can be made conditional on basis of the result of the other step.
An embodiment of the method according to the invention further comprises:
- classifying a further selection of the received speech items by means of comparing the received speech items with a further voice profile of a further user of the apparatus;
- recognizing speech items being classified as belonging to the further user and associating further voice commands to respective further recognized speech items;
- creating a further instruction for the apparatus by means of combining the further voice commands corresponding to the further recognized speech items; and
- optionally providing the further instruction for execution by the apparatus.
Instead of ignoring the speech items for which it has been concluded that they do not correspond to the first user, these speech items are used for further evaluation in this embodiment of the method according to the invention. Again the speech items are classified. The recognized speech items which have been classified as belonging to one and the same user, i.e. the further user, are combined to the further instruction for the apparatus. Whether the further instruction will be provided to the apparatus for execution is tested first:
- a test might be checking whether the instruction requested by the first user has already been executed. As long as this is not the case the further instruction is halted.
- another test might be checking whether the instruction requested by the first user and the further instruction are mutually conflicting. If that is the case the further instruction will not be provided for execution.
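The two tests above can be sketched as follows, under the assumption that each instruction names the parameter it modifies (e.g. "Increase_Frequency_Band" modifies "Frequency_Band"); two instructions touching the same parameter are then treated as mutually conflicting. This naming convention is a hypothetical illustration, not prescribed by the invention.

```python
def affected_parameter(instruction):
    """Assume instructions follow the pattern '<Action>_<Parameter>',
    e.g. 'Increase_Sound_Level' modifies 'Sound_Level'."""
    return instruction.split("_", 1)[1]

def may_execute(further_instruction, first_instruction, first_executed):
    """Release the further instruction only when the first user's
    instruction has been executed and the two do not conflict."""
    if not first_executed:
        return False  # first test: halt until the first instruction is done
    # second test: block mutually conflicting instructions
    return affected_parameter(further_instruction) != affected_parameter(first_instruction)
```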
In an embodiment of the method according to the invention the step of creating the instruction is performed after a predetermined time interval which has been started at receiving the first speech item. An attention span for the first user might be started by means of providing the first speech item by the first user. After a predetermined time interval the attention span should be closed. The relevant speech items being processed in the meantime are combined in order to create the instruction. Alternatively, the instruction is created on the fly, but providing of the instruction is postponed until the predetermined time interval has elapsed.
In an embodiment of the method according to the invention the step of creating the instruction is performed after a further predetermined time interval during which no speech items have been classified. An alternative approach for ending an attention span, or for a trigger to start with the creation of the instruction, is based on the fact that no new valid speech items are received. The advantage of this approach is its flexibility: the duration of the attention span is determined by the received input and not by a predetermined time interval.
In an embodiment of the method according to the invention the step of creating the instruction is performed after an explicit action of the first user. Another alternative approach for ending an attention span, or for a trigger to start with the creation of the instruction, is based on the fact that the user performs an explicit action, e.g. uttering a stop-word like "good-bye". The advantage of this approach is its flexibility: the duration of the attention span is determined by the explicit action and not by a predetermined time interval.
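The three triggers for closing an attention span described in the embodiments above can be sketched together as follows. The interval durations and the stop-word set are hypothetical values chosen for illustration.

```python
FIXED_SPAN = 10.0    # hypothetical: seconds after the first speech item
SILENCE_SPAN = 4.0   # hypothetical: seconds without a classified speech item
STOP_WORDS = {"good-bye"}

def span_closed(now, span_start, last_item_time, last_utterance):
    """Return True when any of the three end conditions holds:
    a predetermined interval since the first speech item, a further
    interval without classified speech items, or an explicit stop-word."""
    if now - span_start >= FIXED_SPAN:
        return True
    if now - last_item_time >= SILENCE_SPAN:
        return True
    return last_utterance in STOP_WORDS
```

Once the span is closed, the relevant speech items processed in the meantime are combined into the instruction.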
It is a further object of the invention to provide an apparatus of the kind described in the opening paragraph with an improved construction of the instruction on the basis of speech from a user.
This object of the invention is achieved in that the apparatus comprises:
- a speech control unit for controlling the apparatus on basis of speech, comprising:
* receiving means for receiving a series of speech items, starting with a first speech item of a first user of the apparatus;
* transforming means for transforming a selection of the received speech items into voice commands corresponding to respective recognized speech items which are classified as belonging to the first user on basis of a voice profile of the first user;
* instruction creating means for creating an instruction for the apparatus by means of combining the voice commands; and
- processing means for execution of the instruction.
An embodiment of the apparatus according to the invention is arranged to show that the first speech item has been classified as belonging to the first user. Above it is described, i.e. with the example of "Bello", that there are several means to inform the user about system elements such as progress of an attention span and acceptance of speech items. Preferably the apparatus is also arranged to show which user, in the case of multiple users, is providing speech items to the apparatus.
An embodiment of the apparatus according to the invention which is arranged to show that the first speech item has been classified as belonging to the first user comprises audio generating means for generating an audio signal representing the first user. By generating an audio signal comprising a representation of the name of the first user, e.g. "Hello Jack", it is clear for the first user that the apparatus is ready to receive speech items from the first user. This concept is also known as auditory greeting.
An embodiment of the apparatus according to the invention which is arranged to show that the first speech item has been classified as belonging to the first user comprises a display device for displaying a visual representation of the first user. By displaying a personalized icon or an image of the first user it is clear for the first user that the apparatus is ready to receive speech items from the first user. In other words, the apparatus is in an active state of classifying and/or recognizing speech items.
An embodiment of the apparatus according to the invention which is arranged to show that the first speech item has been classified as belonging to the first user is developed to show a set of controllable parameters of the apparatus on basis of a preference profile of the first user. Many apparatuses have numerous controllable parameters. However, not all of these controllable parameters are of interest for each user of the apparatus. Besides that, each of the users has his own preferred default values. Hence, a user has a so-called preference profile. It is advantageous to show the default values of the controllable parameters which are of interest to the first user, i.e. the user who initiated the attention span.
Modifications of the method, and variations thereof, may correspond to modifications and variations of the apparatus described and of the speech control unit described.
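The selection of controllable parameters on the basis of a preference profile can be sketched as follows. The parameter names, the full parameter set and the profile layout are hypothetical examples.

```python
# Hypothetical full set of controllable parameters with factory defaults.
ALL_PARAMETERS = {"volume": 50, "brightness": 70, "contrast": 60, "bass": 40}

def parameters_for_user(preference_profile):
    """Return only the controllable parameters of interest to this user,
    with the user's own preferred default values applied."""
    shown = {}
    for name, preferred_default in preference_profile.items():
        if name in ALL_PARAMETERS:
            shown[name] = preferred_default
    return shown
```

For the user who initiated the attention span, only this subset is displayed, rather than the full list of parameters.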
These and other aspects of the method, of the apparatus and of the speech control unit according to the invention will become apparent from and will be elucidated with respect to the implementations and embodiments described hereinafter and with reference to the accompanying drawings, wherein:
Fig. 1 schematically shows the step of classification of speech items according to the invention;
Fig. 2 schematically shows the behavior of a speech control unit according to the invention;
Fig. 3 schematically shows alternative behavior of a speech control unit according to the invention; and
Fig. 4 schematically shows an embodiment of the apparatus according to the invention.
Corresponding reference numerals have the same or like meaning in all of the Figs.
Fig. 1 schematically shows the step of classification of speech items 104-114 according to the invention. In Fig. 1 it is shown that two users U1 and U2 are speaking. The first user U1 generates a speech signal 100 comprising speech items 104-108. In between these speech items 104-108 the first user U1 is not speaking. The second user U2 generates a speech signal 102 comprising speech items 110-114. In between these speech items 110-114 the second user U2 is not speaking. Because the first user U1 and the second user U2 are speaking during the same attention span, the two speech signals 100 and 102 are merged into a combined speech signal 103 comprising the speech items 104-114 from both users U1 and U2. The combined speech signal 103 is received by speech control unit 200. The speech items 104-114 are extracted from the combined speech signal 103. An aspect of the extraction is dividing the combined speech signal 103 into sub-signals on basis of detected portions of "little signal level", i.e. silence. The first received speech item 104 is classified as belonging to the first user U1. This is done by means of comparing the received speech items with a voice profile of the first user U1 which is available in the speech control unit. For subsequent speech items 106-114 classification tests are also performed. The speech items 106 and 108 will also be classified as belonging to the first user U1. For the further speech items 110-114 it will be concluded that these do not belong to the first user U1. Optionally, for these further speech items 110-114 it is tested whether they belong to the second user U2.
Fig. 2 schematically shows the behavior of a speech control unit 200 according to the invention. The speech control unit 200 comprises:
- receiving means 202 for receiving a series of speech items 104-114, starting with a first speech item 104 of a first user U1 of the apparatus 400;
- classification means 204 for classifying received speech items 104-114 by means of comparing the received speech items with a voice profile of the first user U1;
- recognizing means 206 for recognizing speech items 104-108 being classified as belonging to the first user U1 and associating voice commands 212-216 to respective recognized speech items 104-108;
- instruction creating means 208 for creating an instruction 218 for the apparatus 400 by means of combining the voice commands 212-216 corresponding to the recognized speech items 104-108; and
- providing means 209 for providing the instruction 218 to the control processor 210 of the apparatus 400. The control processor 210 is arranged to perform the instruction 218.
The receiving means 202 comprises a microphone and an A/D-converter. The other components 204-209 of the speech control unit 200 and the control processor 210 may be implemented using one processor. Normally, both functions are performed under control of a software program product. During execution, normally the software program product is loaded into a memory, like a RAM, and executed from there. The program may be loaded from a background memory, like a ROM, hard disk, or magnetic and/or optical storage, or may be loaded via a network like the Internet. Optionally, an application specific integrated circuit provides the disclosed functionality.
The behavior of the speech control unit 200 is as follows. From the combined speech signal 103 the speech items 104-114 are extracted. Those speech items 104-108 which correspond to the first user U1 are classified as such. The classified speech items 104-108 are also recognized and voice commands 212-216 are assigned to these speech items 104-108. The voice commands 212-216 are "Bello", "Channel" and "Next", respectively. An instruction "Increase_Frequency_Band", which is interpretable for the control processor 210, is created based on these voice commands 212-216.
Fig. 3 schematically shows alternative behavior of a speech control unit 200 according to the invention. From the combined speech signal 103 the speech items 104-114 are extracted. Those speech items 104-108 which correspond to the first user U1 are classified as such and the speech items 110-114 which correspond to the second user U2 are classified as such. The speech items 104-114 are also recognized. Voice commands 212-216 are assigned to the speech items 104-108 being classified as belonging to the first user U1 and voice commands 312-316 are assigned to the speech items 110-114 being classified as belonging to the second user U2. Alternatively, the speech items 104-114 are first recognized and then classified. The voice commands 212-216 are "Bello", "Channel" and "Next", respectively. An instruction "Increase_Frequency_Band", which is interpretable for the control processor 210, is created based on these voice commands 212-216. The voice commands 312-316 are "Bello", "More" and "Sound", respectively. A further instruction "Increase_Sound_Level", which is interpretable for the control processor 210, is created based on these voice commands 312-316. The instruction "Increase_Frequency_Band" is directly provided to the control processor 210 by means of providing means 209. The further instruction "Increase_Sound_Level" is halted for a limited time interval.
After the control processor 210 is ready to perform the further instruction 318, it is provided. Optionally, the instruction "Increase_Frequency_Band" and the further instruction "Increase_Sound_Level" are compared in order to check whether the further instruction 318 should be blocked. In this case the instructions 218 and 318 are not mutually conflicting and hence the further instruction "Increase_Sound_Level" is not blocked. The result is that two instructions are performed which relate to speech items 104-114 of two different users U1 and U2 who have spoken in the attention span initiated by the first user U1.
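The extraction of speech items from the combined speech signal by detecting portions of "little signal level", i.e. silence, as described for Fig. 1, can be sketched as follows. The signal is modelled here simply as a list of sample magnitudes, and the silence threshold is a hypothetical value.

```python
def extract_speech_items(signal, threshold=0.1):
    """Divide a combined speech signal into sub-signals (speech items),
    using runs of samples below the silence threshold as separators."""
    items, current = [], []
    for sample in signal:
        if abs(sample) > threshold:
            current.append(sample)   # sample belongs to the current item
        elif current:
            items.append(current)    # silence closes the current item
            current = []
    if current:
        items.append(current)        # close the last item, if any
    return items
```

Each extracted sub-signal is then classified and, where appropriate, recognized as described above.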
Fig. 4 schematically shows an embodiment of the apparatus 400 according to the invention. The apparatus 400 optionally comprises audio generating means 404 for generating an audio signal representing the first user U1. By generating an audio signal comprising a representation of the name of the first user U1, e.g. "Hello Jack", it is clear for the first user U1 that the apparatus is ready to receive speech items 104-108 from the first user U1. In other words, the apparatus is in an active state of classifying and/or recognizing speech items. The generating means 404 comprises a memory device for storage of a sampled audio signal, a sound generator and a loudspeaker. The apparatus also comprises a display device 402 for displaying a visual representation of the first user U1. By displaying a personalized icon or an image of the first user it is clear for the first user that the apparatus is ready to receive speech items 104-108 from the first user U1.
The speech control unit 200 according to the invention is preferably used in a multi-function consumer electronics system, like a TV, set top box, VCR, DVD player, game box, or similar device. But it may also be a consumer electronic product for domestic use such as a washing or kitchen machine, any kind of office equipment like a copying machine, a printer, various forms of computer work stations etc., electronic products for use in the medical sector or any other kind of professional use, as well as a more complex electronic information system. Whereas the term "multi-function electronic system" as used in the context of the invention may comprise a multiplicity of electronic products for domestic or professional use, as well as more complex information systems, the number of individual functions to be controlled by the method would normally be limited to a reasonable level, typically in the range from 2 to 100 different functions.
For a typical consumer electronic product like a TV or audio system, where only a more limited number of functions need to be controlled, e.g. 5 to 20 functions, examples of such functions may include volume control including muting, tone control, channel selection and switching from inactive or stand-by condition to active condition and vice versa, which could be initiated by control commands such as "louder", "softer", "mute", "bass", "treble", "change channel", "on", "off", "stand-by" etcetera.
In the description it is assumed that the speech control unit 200 is located in the apparatus 400 being controlled. It will be appreciated that this is not required and that the control method according to the invention is also possible where several devices or apparatus are connected via a network (local or wide area), and the speech control unit 200 is located in a different device than the device or apparatus being controlled.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word 'comprising' does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means can be embodied by one and the same item of hardware.