Disclosure of Invention
In view of this, the invention provides a key information reminding method, apparatus and system, and an embedded audio playing device, which can effectively draw users' attention to key information.
To achieve the above object, a first aspect of the present invention provides an embedded audio playback apparatus comprising a speaker, a communication unit, a control unit, a storage unit, a voice recognition unit and a reminder unit, wherein:
The communication unit receives an audio stream from a far end;
the voice recognition unit comprises a keyword recognition model unit for storing a scene-based keyword recognition model;
the keywords are associated with an application scene and comprise a group of words that require attention in that scene, one or more of which are pre-designated by the user;
the voice recognition unit extracts a voice signal from the audio stream and detects in real time, using the scene-based keyword recognition model, whether the voice signal contains any of the keywords;
the control unit is used for starting recording of the received audio stream when the voice signal contains a keyword, and for controlling the reminder unit to output a key information reminder;
the storage unit is used for storing the recorded audio stream;
the speaker is used for playing the audio stream or playing back the recorded audio stream in response to a playback instruction.
Preferably, the scene-based keyword recognition model may be obtained by training in advance, using a deep learning algorithm, on a training sample library containing voice samples of the keywords and/or voice samples of the keywords spoken by a specific person;
the control unit may be further configured to download the scene-based keyword recognition model from a remote location via the communication unit.
Preferably, the voice recognition unit may further include a voice preprocessing unit for preprocessing the input audio stream to remove noise, background human voices and music, and to extract the voice signal;
Preferably, the voice recognition unit may further include a neural network processing unit, configured to process the voice signal (or the output of the voice preprocessing unit) with a deep learning algorithm based on the keyword recognition model, so as to infer the words appearing in the voice signal and determine whether any keyword is among them.
Preferably, the reminding unit may be one or more of an indicator light module, a vibrator module, a text message generation module, a voice message generation module and a music message generation module.
Further, the apparatus may also comprise an input unit for receiving recording stop instructions and playback instructions input by a user;
when the voice signal contains a keyword, the control unit may start continuously compression-encoding the received audio stream and storing it locally;
the control unit may stop recording when a recording stop instruction is received or when the recording duration exceeds a first predetermined duration;
the control unit may play the locally stored recorded audio stream when a local audio playback instruction is received;
Further, the control unit may also be configured to send a recording start instruction to the far end when the voice signal contains a keyword, so that the far end starts continuously recording the transmitted audio stream, and to send a recording stop instruction to the far end when a recording stop instruction is received before the continuous recording time exceeds a second predetermined duration;
the control unit may send a playback request to the far end when receiving a remote audio playback command, and receive and play the recorded audio stream stored remotely.
Preferably, the embedded audio playing device is an earphone or a sound box with a call function.
As a second aspect of the present invention, there is provided a key information reminding system, comprising an embedded audio playing device and a remote device, wherein:
the remote device receives keyword vocabulary customized by a user and/or a user-provided voice sample of a specific person containing at least the keywords, which are used to obtain a scene-based keyword recognition model;
the scene-based keyword recognition model is obtained by training in advance on a training sample library containing voice samples of the keywords and/or voice samples of the keywords spoken by a specific person;
the embedded audio playing device communicates with the remote device, receives an audio stream from the remote device and plays it;
The embedded audio playing device also acquires a voice signal from the audio stream, carries out voice recognition on the voice signal by adopting a keyword recognition model based on scenes, and detects whether the voice signal contains keywords in real time;
When the voice signal contains keywords, the embedded audio playing device generates a keyword information prompt and starts recording the received audio stream;
The embedded audio playing device responds to the playback instruction and plays the recorded audio stream.
Further, the system may also include a cloud server;
the remote device communicates with the cloud server and sends the keywords and/or the voice samples of the specific person to the cloud server;
the cloud server expands its standard sample library with the received keywords and/or voice samples of the specific person to form a training sample library, and trains the scene-based keyword recognition model on it using a deep learning algorithm;
The remote device receives the scene-based keyword recognition model from the cloud server and downloads the scene-based keyword recognition model to the embedded audio playing device.
Preferably, the remote device expands a standard sample library with the keyword vocabulary input by the user and/or the user-provided voice samples of a specific person containing at least the keywords, to form a training sample library, and trains the scene-based keyword recognition model on it using a deep learning algorithm;
the remote device downloads the scene-based keyword recognition model to the embedded audio playing device.
As a third aspect of the present invention, there is provided a key information reminding method, wherein,
The method comprises the steps of receiving user-customized keyword vocabulary and/or a user-provided voice sample of a specific person containing at least the keywords, wherein the keywords are associated with an application scene and comprise a group of words requiring attention in that scene;
training the scene-based keyword recognition model on a training sample library containing voice samples of the keywords and/or voice samples of the keywords spoken by a specific person;
When receiving and playing an audio stream, acquiring a voice signal from the audio stream;
performing voice recognition on the voice signal by adopting the scene-based keyword recognition model, and detecting whether keywords are contained in the voice signal in real time;
When the voice signal contains keywords, generating a keyword information reminder and starting recording the received audio stream;
and playing the recorded audio stream in response to the playback instruction.
Preferably, a wide range of voice samples is collected in advance to form a standard sample library;
voice samples containing at least the keywords are acquired according to the keywords;
and the standard sample library is expanded with the voice samples containing the keywords and/or the voice samples of the specific person to form a training sample library, on which the scene-based keyword recognition model is trained using a deep learning algorithm.
Further, the step of obtaining the voice signal from the audio stream may include a preprocessing step of eliminating noise, music and background human voices;
Preferably, performing voice recognition on the voice signal (or the preprocessed voice signal) using the scene-based keyword recognition model and detecting in real time whether it contains keywords specifically includes: constructing a deep learning neural network based on the keyword recognition model, and continuously feeding the voice signal into the network for data processing, so as to infer the words appearing in the voice signal and determine whether any keyword is among them.
Preferably, recording the received audio stream may specifically include: starting continuous compression encoding and local storage of the received audio stream when the voice signal contains a keyword;
and stopping local recording when a recording stop instruction is received or when the recording duration exceeds a first predetermined duration;
playing the recorded audio stream in response to a playback instruction then specifically includes playing a locally stored recorded audio stream in response to a local audio playback instruction.
Preferably, recording the received audio stream may instead specifically include: sending a recording start instruction to the far end when the voice signal contains a keyword, so that the far end starts continuously recording the transmitted audio stream and storing it remotely;
and sending a recording stop instruction to the far end, so that the far end stops recording, when a recording stop instruction is received before the continuous recording time exceeds a second predetermined duration;
playing the recorded audio stream in response to a playback instruction then includes: in response to a remote audio playback instruction, sending a playback request to the far end, and receiving and playing the recorded audio stream stored at the far end.
Preferably, the key information reminder may be one or a combination of a visual reminder, a tactile reminder and an audible reminder;
the visual reminder includes a light-effect reminder and a remote text message reminder;
the tactile reminder includes a vibration reminder;
the audible reminder includes a voice reminder and a music reminder.
Compared with the prior art, the embedded audio playing device provided by the invention can independently complete scene-based real-time detection, reminding, recording and playback of key information in continuous voice; it is convenient to use and offers good privacy. In the key information reminding system and method, a keyword recognition model highly matched to the application scene is trained from the actual requirements of that scene and from keywords or training samples customized by the user. This effectively improves the accuracy of recognizing key information in a continuous voice stream, so that reminders are output and recordings are stored in a timely manner for information that matters in the current application scene or interests the user, giving an excellent user experience.
Detailed Description
Over more than 40 years of Moore's law, semiconductor design and manufacturing technology have advanced rapidly: chip computing power and on-chip storage capacity have grown greatly while power consumption has kept falling, making it possible to apply artificial intelligence widely in small, low-power embedded devices. The invention addresses the defect of the prior art that users of remote audio/video applications easily miss important information from the opposite end. Specifically, on an embedded device, scene-based artificial-intelligence voice recognition is applied to the voice information to recognize information of interest from the opposite end in real time, output a prompt in time, and save the key audio stream. The invention adapts to different application scenes and meets the personalized needs of different users, thereby effectively overcoming the defects of the prior art. As used herein, "real-time" means that the embedded audio playback device has sufficient computing power to identify keywords in the audio stream while it is being played at its original speed.
The technical scheme of the present application is further exemplarily described below by means of the accompanying drawings and examples. It will be apparent that the described embodiments are only some of the embodiments of the present application and are not exhaustive of all embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
Example 1:
As shown in fig. 1, according to the core idea of the present invention, this embodiment provides a key information reminding method, in which,
Step 100, initializing step.
This step is the processing flow before key information reminding begins, mainly checking and updating the software and hardware environment configuration, parameters and programs required for key information reminding. It can include establishing a wired or wireless communication connection between the local device and the remote device, and can further include obtaining the scene-based keyword recognition model.
It should be noted that "local" and "remote" are used herein as relative concepts: "local" refers to the party or end that receives an audio stream and generates the key information reminder, while "remote" refers to the other party or end that communicates with the local end in a wired or wireless manner, directly or through one or more intermediaries, and sends the audio stream to it. In addition, the "far end" and "opposite end" commonly used in the art when describing a voice call are different concepts: the "opposite end" is the other originator of the audio stream, the "far end" is the initial receiver of the call audio stream after it is initiated from the opposite end, and the corresponding "local" end is the final receiver of the call audio stream.
As a specific embodiment, the "local" end may be an audio playing device based on an embedded system (an embedded audio playing device for short). An "embedded system" as referred to herein is a special-purpose computer system embedded in a host system: application-centered, based on computer technology, with tailorable software and hardware, suited to applications with strict requirements on function, reliability, cost, volume and power consumption. An "embedded device" is a device containing such an embedded system, generally based on an ARM or other low-power core and architecture and used to realize specific functions and applications; relative to a general-purpose PC it is a single-purpose device, and may specifically be an earphone, a sound box, a phone watch, conference terminal equipment, and the like. The "remote" end may be an end-user computer system, a network server or server system, a mobile computing device, a consumer electronic device, or another suitable electronic device, or any combination or portion thereof, such as a mobile phone, tablet, computer or smart television.
Remote audio and video applications apply to many scenes, the amount of transmitted voice information is huge, and the kinds of key information differ from person to person and from scene to scene. For example, people often attend video conferences or online lessons through accessory devices such as headsets or speakers of mobile phones or computers. In a video conference, users may care most about the parts of the meeting related to themselves, such as their own department, their superiors, and business concerning them, so the keywords for identifying key information should be department names, superiors' names, their own names, business names, task assignments, delivery deadlines, and the like. In online lessons, students may care most about the knowledge points taught by the teacher, so the keywords should be "key point", "difficulty", "exam point", "summary", "review", and the like. In a customer service center, complaints mentioned by customers matter most, so the keywords need to include "complaint", "suggestion", "quality", "service attitude", and the like. If a full-text speech recognition model were employed in these different scenes and recognition accuracy had to be ensured, the model would have to be trained on a very large number of voice samples. In general, this is hard to realize in an accessory device of a mobile phone or computer: on one hand, a huge number of voice samples is difficult to obtain; on the other hand, training on such samples makes very high demands on computer hardware, so high implementation cost limits applying the technique in such accessory devices.
Therefore, this embodiment includes the step of obtaining a scene-based keyword recognition model, in particular for accessory embedded devices of mobile phones or computers, so that the model can be adjusted and updated for the actual application scene, fit the current scene better and meet user needs. The keywords are associated with the application scene and comprise a group of words requiring attention in that scene. Different application scenes may correspond to different keywords, and the user can set and specify one or more of the keyword words according to actual needs.
The step of obtaining the scene-based keyword recognition model specifically comprises: receiving user-customized keyword vocabulary and/or a user-provided voice sample of a specific person containing at least the keywords; expanding a standard sample library with the keywords and the specific person's voice sample to form a training sample library; and training the scene-based keyword recognition model on that library. The standard sample library may be a training sample set formed from a wide range of pre-collected voice samples.
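For illustration, the library expansion just described might be organized as in the following Python sketch; the `TrainingSampleLibrary` class, its methods and the wav-plus-transcript file layout are assumptions made for the example, not details of the invention.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class TrainingSampleLibrary:
    """(wav_path, transcript) pairs; a hypothetical container, not from the patent."""
    samples: list = field(default_factory=list)

    @classmethod
    def from_standard_library(cls, root: Path) -> "TrainingSampleLibrary":
        # Standard sample library: a wide range of pre-collected, pre-labelled
        # samples, assumed stored as foo.wav plus foo.txt transcript pairs.
        pairs = [(wav, wav.with_suffix(".txt").read_text(encoding="utf-8").strip())
                 for wav in sorted(root.glob("*.wav"))]
        return cls(samples=pairs)

    def extend_with_user_data(self, keywords, user_samples):
        # Expand the standard library with user-provided samples of a specific
        # person; keep only those whose transcript actually contains a keyword,
        # so the library stays matched to the current application scene.
        for wav, text in user_samples:
            if any(k in text for k in keywords):
                self.samples.append((wav, text))
```

A training set for the keyword spotter sketched later can then be derived by labelling each sample with the keyword its transcript contains, or with a "filler" label otherwise.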
The step of receiving user-customized keyword vocabulary and/or user-provided voice samples of a specific person containing at least the keywords is typically performed at the remote end, taking advantage of the richer user interface of a mobile phone or computer.
As an alternative implementation, a user can set a custom keyword set through the remote end in advance according to his own preferences, needs and usage scene, and the provider of the audio stream can also generate a default keyword set according to factors such as the usage scene, the content of the audio stream and the user's habits. The remote end can also display a number of default keyword words in advance for the user to select, add or remove, to form the keyword set associated with the application scene.
To match the hardware environment of the embedded device, an upper limit may be set on the number of words in the keyword set, for example 30 groups of words.
In addition, in speech recognition, factors such as gender, age, physiological characteristics of pronunciation, dialect, non-native pronunciation, emotion while speaking and environmental noise can all affect recognition accuracy; for example, the pronunciation of the same word "key" varies greatly between Sichuan and Guangdong accents. Therefore, in this embodiment, a user-provided voice sample with a specific accent containing at least the keywords may be obtained and used to extend the standard sample library; for example, a student may provide a recording of a teacher in class, and an employee may provide a recording of a boss in a meeting.
After the user-provided voice sample of a specific person containing at least the keywords is received, it is added to the training sample library. Because the scene-based keyword recognition model of this embodiment is trained on a sample library containing voice samples closely related to the application scene, recognition accuracy can be effectively improved.
The keyword recognition model in this embodiment can be trained using hidden Markov models (HMM), dynamic topic models (DTM) and the various classical artificial-intelligence speech recognition algorithms derived from them that are already used successfully for speech and text recognition, or using deep learning and related algorithms, present or future. Deep learning is one of the important areas of machine learning research; its motivation is to build and simulate neural networks that mimic the analytic learning of the human brain, interpreting data such as images, sounds and text by imitating its mechanisms. The core of deep learning is to learn more useful features by building machine learning models with many hidden layers and large amounts of training data, thereby improving the accuracy of classification or prediction. Currently, the mainstream deep learning algorithms in computer vision and natural language processing are convolutional neural networks (CNN) and recurrent neural networks (RNN), together with long short-term memory (LSTM) networks, deep fully convolutional neural networks (DFCNN), and the like. In implementation, this embodiment may employ any applicable deep learning algorithm, including but not limited to these existing or future ones.
As a preferred implementation, this embodiment employs continuous-speech keyword recognition based on a deep learning algorithm. For example, after the training sample library is obtained, a deep learning algorithm such as a convolutional neural network (CNN) or recurrent neural network (RNN) is used to train the scene-based keyword recognition model on that library.
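As a purely illustrative instance of such a model, the sketch below builds a small CNN keyword spotter in PyTorch over log-mel spectrogram patches; the architecture, input shape, hyperparameters and the `train_loader` assumed to yield batches from the training sample library are example assumptions, not the design specified above.

```python
import torch
import torch.nn as nn

class KeywordSpotter(nn.Module):
    """Small CNN over log-mel patches: one output per keyword, plus a 'filler'
    class meaning 'no keyword present' (architecture is illustrative)."""
    def __init__(self, n_mels: int = 40, n_frames: int = 101, n_keywords: int = 20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * (n_mels // 4) * (n_frames // 4),
                                    n_keywords + 1)        # +1 filler class

    def forward(self, x):                  # x: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(x).flatten(1))

# Training on the scene-based library; train_loader is assumed to yield
# (log-mel patch, label) batches derived from the training sample library.
model = KeywordSpotter(n_keywords=20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for mels, labels in train_loader:
    optimizer.zero_grad()
    loss_fn(model(mels), labels).backward()
    optimizer.step()
```

Restricting the output layer to the user's small keyword set (plus a filler class) is what keeps the model small enough for an embedded device, in contrast to full-vocabulary recognition.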
The training of the keyword recognition model with the deep learning algorithm can be completed at the far end or in the cloud. Note that "cloud" as used herein refers to the server side of cloud computing, i.e. a cloud computing background server with strong processing and storage capabilities. As a preferred embodiment, the training is completed in the cloud to make full use of its hardware resources and computing power. Specifically, after the user inputs keyword vocabulary at the far end or uploads a voice sample of a specific person containing the keywords, the far end sends the keywords and/or the voice sample to the cloud; the cloud acquires voice samples containing the keywords through various channels such as the internet, extends its standard sample library with these and with the specific person's voice sample to form a training sample library, and then trains the scene-based keyword recognition model on it.
Further, after training is completed, the remote end receives the keyword recognition model based on the scene from the cloud.
Training at the far end proceeds analogously to the cloud training and is not described again here.
The initialization step may further include updating the local keyword recognition model, which specifically includes downloading the scene-based keyword recognition model from the remote end to the local device. The remote end may actively push the update to the local device, or may start the download in response to an update request from the local device.
After the initialization step is completed, the following key information real-time detection and reminding process can be carried out.
Step 110, when receiving and playing an audio stream, acquiring a voice signal from the audio stream;
In this embodiment, key information reminding identifies the key information contained in the voice and reminds the user while the audio streams of the remote audio/video application are received and played.
As a preferred embodiment, when obtaining the voice signal from the audio stream, this step also performs background sound elimination on the audio stream, removing background noise, background human voices, music and the like, and extracting a foreground voice signal with a high signal-to-noise ratio, which further improves the success rate of voice recognition.
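A crude version of this background elimination can be sketched as a band-pass filter plus an energy gate; a real product would more likely use spectral subtraction or a neural enhancement front end, so the cutoff frequencies and the threshold below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def extract_foreground_speech(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Illustrative preprocessing: band-pass to the speech band, then zero out
    low-energy frames, approximating the background-sound elimination above."""
    # Keep roughly the 300-3400 Hz telephone speech band.
    sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
    speech = sosfilt(sos, audio)
    # Energy gate over 10 ms frames: suppress frames well below median energy.
    frame = sr // 100
    n = len(speech) // frame
    frames = speech[: n * frame].reshape(n, frame)
    energy = (frames ** 2).mean(axis=1)
    frames[energy <= 0.1 * np.median(energy)] = 0.0
    return frames.reshape(-1)
```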
Step 120, performing voice recognition on the voice signal by using the keyword recognition model based on the scene, and detecting whether the voice signal contains a preset keyword in real time;
When detecting whether the voice signal contains keywords, the signal can be considered to contain keywords as soon as one group of keyword words is detected.
As a preferred implementation, scene-based keyword information is recognized with continuous-speech keyword recognition based on a deep learning algorithm: a deep learning neural network is constructed from the keyword recognition model, and the continuous voice signal to be recognized is fed into the network for data processing, so as to infer the words appearing in the voice signal and determine whether they include keyword words.
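The detection loop might then take the following sliding-window form, reusing the illustrative `KeywordSpotter` from the earlier sketch; the window length, decision threshold and frame source are again assumptions, not prescribed by the invention.

```python
import collections
import torch

def detect_keywords_stream(frame_source, model, keywords,
                           window: int = 101, threshold: float = 0.7):
    """Slide a fixed window over incoming log-mel frames; yield each keyword
    the model flags. frame_source yields (n_mels,) float tensors in real time."""
    buf = collections.deque(maxlen=window)
    model.eval()
    for frame in frame_source:
        buf.append(frame)
        if len(buf) < window:
            continue                                   # not enough context yet
        with torch.no_grad():
            x = torch.stack(tuple(buf)).T[None, None]  # (1, 1, n_mels, window)
            probs = torch.softmax(model(x), dim=-1)[0]
        best = int(probs.argmax())
        if best < len(keywords) and probs[best] >= threshold:
            yield keywords[best]                       # trigger reminder/recording
```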
In this embodiment, using continuous-speech keyword recognition based on a deep learning algorithm for scene-based recognition means that, compared with the large-vocabulary continuous speech recognition of the prior art, it is unnecessary to recognize every word: the system only detects whether the one or more groups of keywords set by the user appear in the continuous voice stream. On one hand, the continuous voice stream can be checked in real time; on the other hand, the demands on computing power, storage space and power consumption are low, so the method suits a small, low-power embedded system. Meanwhile, scene-based recognition effectively improves recognition accuracy and the user experience of voice recognition.
When it is detected that the speech signal does not include the keyword, the process returns to step 110, and the subsequently acquired audio stream is continuously detected.
When the voice signal at least contains a group of keywords, steps 130 and 140 are executed;
step 130, generating a key information reminder.
The key information reminder may include a visual reminder, a tactile reminder and an audible reminder:
the visual reminder includes a light-effect reminder and a text message reminder, e.g. an LED indicator flashing or displaying a specific light effect, a flashing pattern on the remote screen, or a remote text message (such as a notification from a mobile phone application, APP);
the tactile reminder includes a vibration reminder, e.g. vibrating in a predetermined pattern;
the audible reminder includes a voice reminder or a music reminder, e.g. a prompt with predetermined voice content or music.
In implementation, one or more of the above reminder modes can be selected according to the actual application scene; for example, only a light-effect or music reminder may be set, or a message may be sent to an associated computer application (APP) while vibrating, achieving a double reminder.
Step 140, beginning recording the received audio stream;
in this embodiment, when it is determined that the voice information of the current audio stream contains a keyword, recording of the received audio stream starts at the same time as the reminder is generated, to help the user miss as little important content as possible.
When recording the audio stream, the recording may start at the keyword itself, at the audio received after the keyword appears, or, by moving a fixed time forward from the keyword, at a segment of the current audio stream that was already being rolling-compression-encoded when the keyword appeared. That is, the recorded audio stream may or may not include the audio at the moment the keyword occurs, and may or may not include audio from before the keyword.
The recorded audio stream is stored locally after compression encoding for local playback. Recording continues until a recording stop instruction is received or the continuous recording time exceeds a first predetermined duration. Considering the limited capacity of the local storage medium, the first predetermined duration may be set relatively short, for example 1 to 2 minutes. In general, the important content appears in the voice shortly after the keyword, so even a short first predetermined duration saves the most important voice content and lets the user learn the key information quickly on playback.
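One way to realize both the rolled-back starting point and the first-predetermined-duration stop condition is a pre-roll ring buffer, as in this sketch; the buffer sizes and the hand-off to a compression encoder are illustrative assumptions.

```python
import collections

class KeywordTriggeredRecorder:
    """Pre-roll ring buffer so the recording can include audio from just before
    the keyword, then capture until a stop instruction arrives or the first
    predetermined duration elapses (all sizes illustrative)."""

    def __init__(self, sample_rate=16000, preroll_s=5, first_duration_s=120):
        self.preroll = collections.deque(maxlen=preroll_s * sample_rate)
        self.max_samples = first_duration_s * sample_rate
        self.recording = None                     # None = not recording

    def on_keyword(self):
        # Start from the rolled-back window so the keyword itself is captured.
        self.recording = list(self.preroll)

    def feed(self, chunk, stop_requested=False):
        """Call once per received PCM chunk; returns a finished clip or None."""
        if self.recording is None:
            self.preroll.extend(chunk)            # keep rolling pre-keyword audio
            return None
        self.recording.extend(chunk)
        if stop_requested or len(self.recording) >= self.max_samples:
            clip, self.recording = self.recording, None
            return clip                           # hand off for compression + storage
        return None
```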
When the recording start instruction is sent, a timer can be started; if a recording stop instruction is received before the far end's continuous recording time exceeds the second predetermined duration, a recording stop instruction is sent to the far end, so that the far end can stop recording at any time within the second predetermined duration, increasing control over the recording length. The far end may stop recording automatically when the duration exceeds the second predetermined duration.
To help the user grasp important information as completely as possible and reduce omissions, the second predetermined duration may be set greater than or equal to the first, for example 2 to 5 minutes, so that an audio stream containing a longer span of key information is saved for the user to play back.
Of course, where the local storage space is large enough, the first predetermined duration may instead be set greater than or equal to the second, so that a sufficiently long recording is stored locally and a shorter one is kept at the far end, allowing the user or others to play back the recording there and learn the key information quickly.
In addition, as an alternative implementation, while recording the audio stream the far end can also perform full-text voice recognition on it to obtain the corresponding text and store the text information.
Step 150, in response to the playback instruction, playing the recorded audio stream.
In this step, the locally recorded and stored audio stream may be played in response to a local audio playback instruction, or a playback request may be sent to the far end in response to a remote audio playback instruction, and the remotely recorded audio stream received and played.
As an alternative implementation, recordings stored locally can be kept in order of their recording start times and, correspondingly, played back in that order.
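The routing between the two playback instructions, together with the start-time ordering just mentioned, might be organized as below; `local_store`, `remote` and the `play` stub are illustrative interfaces, not defined by the invention.

```python
def play(samples):
    """Placeholder for the device's audio output path."""
    print(f"playing clip of {len(samples)} samples")

def handle_playback(instruction, local_store, remote):
    # Route the two playback instructions described in step 150.
    if instruction == "playback_local":
        # Local clips are stored and replayed in order of recording start time.
        for clip in sorted(local_store, key=lambda c: c["start_time"]):
            play(clip["samples"])
    elif instruction == "playback_remote":
        remote.send("playback_request")        # ask the far end for its recording
        play(remote.receive_audio_stream())
```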
It should be noted that step 150 is performed on the basis of a received playback instruction, so it need not follow step 140 immediately; playback instructions may be detected at any time during use so that the recording can be played back.
In a typical application scenario, the key information reminding method of this embodiment may be applied to a call center system. Call center operators often handle hundreds of voice calls a day, which is labor intensive, and because of differences in verbal ability, accents and even emotional state, callers often find it hard to express the main purpose of a call clearly in a short time. Unless the operator concentrates intensely, it is easy to miss a caller's important information or even misunderstand the caller's meaning, with adverse consequences. With the method of this embodiment, the operator answers calls wearing an earphone capable of key information reminding: the earphone automatically detects whether the caller's voice contains keywords such as "alarm", "complaint" or "fraud", reminds the operator in time to pay attention to the key information, and either records it itself or notifies a far end in communication with it (such as a call center management platform or call transfer platform) to record it. The operator can then use the playback function to understand the key information more accurately and completely and better grasp the caller's intent. The method of this embodiment therefore not only reminds the operator promptly and effectively, but also helps review call content, reducing information loss and greatly relieving the operator's work pressure.
Example 2
Referring to fig. 2, according to the core concept of the present invention, this embodiment provides an embedded audio playing device comprising a communication unit, a speaker, a control unit, a storage unit, a voice recognition unit and a reminder unit, wherein:
The storage unit is used for storing data, programs and the like related to the operation of the device.
The communication unit may be a wired or wireless communication unit, or may include both wired and wireless communication modules. Specifically, it may be implemented as a Bluetooth unit, a WiFi unit, an Internet network interface, a dedicated wired audio interface, a USB, micro-USB, mini-USB or Type-C interface, a Lightning interface, or any other known or future communication unit applicable to this embodiment.
The communication unit receives an audio stream from a far end;
The voice recognition unit is used for extracting a voice signal from the audio stream and detecting whether keywords are contained in the voice signal in real time by adopting a keyword recognition model based on scenes;
The control unit is the control center of the device: it connects the other units through various interfaces and lines and monitors and schedules them to realize the functions of the device. In particular, when the voice signal contains a keyword, the control unit starts recording the received audio stream and controls the reminder unit to output a key information reminder.
In this embodiment, the keywords are associated with the application scene and comprise a group of words requiring attention in that scene, one or more of which are pre-designated by the user.
The voice recognition unit comprises a keyword recognition model unit for storing the scene-based keyword recognition model. The model is trained in advance on a training sample library containing voice samples of the keywords and/or voice samples of the keywords spoken by a specific person; as a preferred implementation, it is trained with a deep learning algorithm. The voice recognition unit can use this model for continuous-speech keyword recognition, detecting in real time whether the voice signal contains keywords.
The voice recognition unit may further include a voice preprocessing unit for preprocessing the input audio stream to remove noise, music, background human voices and the like, and to extract a voice signal with a high signal-to-noise ratio.
The voice recognition unit may further include a neural network processing unit for processing the voice signal with a deep learning algorithm based on the keyword recognition model, inferring the words appearing in the voice signal and deciding whether any keyword is among them. The neural network processing unit may be an embedded neural network processor (NPU), a dedicated neural-network array processing unit, a DSP, an embedded processor, or another processing module capable of handling the massive multimedia data of a neural network.
In this embodiment, the keyword recognition model is externally trained and downloaded into the device prior to use. Thus, the control unit is further configured to download the scene-based keyword recognition model from a remote location via the communication unit.
The reminder unit is one or more of an indicator light module, a vibrator module, a text message generation module, a voice message generation module and a music message generation module. The indicator light module may be an LED that flashes or displays a specific pattern to output a prompt; the vibrator module can vibrate at a predetermined frequency; the text message generation module can generate text messages in a preset format, such as messages containing the currently recognized keyword; the voice message generation module can generate voice messages in a preset format, such as messages containing the currently recognized keyword; and the music message generation module can select a segment of pre-stored sound data in a preset manner as the prompt, such as a tone like "beep" or "ding-dong".
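The "one or more of" composition of reminder modules can be pictured as follows; the module classes and their shared `trigger` interface are assumptions made for the example, not part of the invention.

```python
class TextMessageModule:
    def trigger(self, keyword: str):
        # Text message in a preset format containing the recognized keyword.
        print(f"[key information] keyword detected: {keyword!r}")

class VibratorModule:
    def trigger(self, keyword: str):
        # On real hardware this would drive the vibrator at a preset frequency.
        print("vibrating at preset frequency")

class ReminderUnit:
    """Holds whichever subset of reminder modules the product is configured with."""
    def __init__(self, modules):
        self.modules = modules

    def remind(self, keyword: str):
        for module in self.modules:        # fire every configured reminder
            module.trigger(keyword)

# e.g. a double reminder: vibration plus a text message to an associated APP
ReminderUnit([VibratorModule(), TextMessageModule()]).remind("summary")
```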
The speaker is used to play the audio stream, play back the recorded audio stream, and play voice or sound messages. It should be understood that in some embodiments the speaker may cooperate with the control unit and the storage unit to take over the function of the reminder unit, for example by sound-only reminders.
The embedded audio playing device further comprises an input unit for receiving various control instructions input by the user, such as playback instructions, stop-reminder instructions and recording stop instructions.
The input unit may be a touch panel, keys, a voice command input module, or another mechanical or voice input module.
The storage unit is used for storing the recorded audio stream;
In an alternative implementation, when the voice signal contains a keyword, the control unit starts continuously compression-encoding the received audio stream and storing it locally;
the control unit plays the locally stored recorded audio stream when a local audio playback instruction is received;
the control unit is further configured to send a recording start instruction to the far end when the voice signal contains a keyword, so that the far end starts continuously recording the transmitted audio stream, and to send a recording stop instruction to the far end when a recording stop instruction is received before the continuous recording time exceeds a second predetermined duration;
and when receiving a remote audio playback instruction, the control unit sends a playback request to the far end and receives and plays the recorded audio stream stored there.
In addition, the embedded audio playing device may further comprise a power supply unit that provides the power needed for operation; it may be a power circuit module powered by a button cell or rechargeable battery, a power management module fed by an external power input, or a circuit module drawing power from the wired communication interface itself.
Obviously, the embedded audio playing device of this embodiment may be used to implement some or all of the methods, processes or steps of the key information reminding method described in embodiment 1. For parts identical or similar to embodiment 1, the description is omitted here.
The embedded audio playing device may be embodied as a head-worn audio device such as a wired or wireless earphone, as a portable sound box, or as an accessory device of a mobile phone or computer such as a phone watch, portable game machine or portable multimedia player. For example, in a typical application scenario the embedded audio playing device is a sound box with a call function. An LED indicator is arranged on its housing, a scene-based keyword recognition model is downloaded into the sound box in advance, and the voice information currently being played can be detected continuously in real time. When the current voice information contains a keyword, the LED indicator starts flashing to remind the user. The sound box has an intelligent voice control function: the user can issue voice control instructions to make it turn off the LED indicator, stop recording, play back, and so on. The detailed key-information-reminding process of the sound box is as described in embodiment 1 and the earlier parts of this embodiment and is not repeated here.
Example 3
According to the core idea of the invention, this embodiment provides a key information reminding system comprising an embedded audio playing device and a remote device, wherein:
the remote device receives keyword vocabulary customized by a user and/or a user-provided voice sample of a specific person containing at least the keywords, which are used to obtain a scene-based keyword recognition model;
the scene-based keyword recognition model is obtained by training in advance on a training sample library containing voice samples of the keywords and/or voice samples of the keywords spoken by a specific person;
the embedded audio playing device communicates with the remote device and receives and plays audio streams from it; the communication may take any suitable form, wired (e.g. Ethernet, USB, Lightning, optical fiber) or wireless (e.g. WiFi, Bluetooth, IR).
The embedded audio playing device also acquires a voice signal from the audio stream, carries out voice recognition on the voice signal by adopting a keyword recognition model based on scenes, and detects whether the voice signal contains keywords in real time;
When the voice signal contains keywords, the embedded audio playing device generates a keyword information prompt and starts recording the received audio stream;
The embedded audio playing device responds to the playback instruction and plays the recorded audio stream.
As one alternative, the keyword recognition model is trained on the remote device: the remote device expands its standard sample library with the user-customized keyword vocabulary and/or the user-provided voice samples of a specific person containing at least the keywords to form a training sample library, and trains the scene-based keyword recognition model on it;
the remote device downloads the scene-based keyword recognition model to the embedded audio playing device.
As another alternative implementation, the keyword recognition model is trained in the cloud, and the system further comprises a cloud server;
the remote device communicates with the cloud server and sends the keywords and/or the voice samples of the specific person to it;
the cloud server expands its standard sample library with the received keywords and voice samples of the specific person to form a training sample library, and trains the scene-based keyword recognition model on it;
The remote device receives the scene-based keyword recognition model from the cloud server and downloads the scene-based keyword recognition model to the embedded audio playing device.
Obviously, the key information reminding system provided in this embodiment may be used to implement some or all of the method, flow or step in the key information reminding method described in embodiment 1. The embedded audio playing device of embodiment 2 can also be used to implement the key information reminding system of the present embodiment. Similar technical details thereof may be referred to the description of the foregoing embodiments and are not repeated herein.
The following takes a typical application scenario as an example to describe the core idea of the embodiment of the invention in more detail.
Referring to fig. 3, in this application scenario, the key information reminding system includes a video playing device (such as a tablet pc) 300, an earphone 310, and a cloud server 320.
The earphone 310 may be a head-worn, in-ear or ear-hook earphone, wired or wireless; it may have a single earpiece 311 or left and right earpieces 311, in a one-piece or split design.
The earphone 310 communicates with the video playback device 300 by wire or wirelessly and receives an audio stream from it. The video playback device 300 may be the user's personal computer, tablet computer, smart television, mobile phone, etc.; the user watches video programs through it. Fig. 3 shows a student watching an online lesson on a tablet computer.
The video playback device 300 may also access the cloud server 320 based on a network, which may be a local area network, a wide area network, a cellular network, or a combination thereof.
The earphone 310 is provided with an LED indicator 312 and keys 313-316. The LED indicator 312 can emit flashing red light; key 313 is volume up, key 314 is play/pause, key 315 is stop reminder/stop recording/playback, and key 316 is volume down. Key 315 may be configured so that one press performs all three functions (stop reminder, stop recording and start playback), or so that one press stops the reminder and the recording while two consecutive presses start playback, as sketched below; the choice depends on the actual implementation environment, and the invention is not limited in this respect.
The LED indicator 312 may also be disposed on an external microphone (not shown) of the earphone 310; since the user adjusts the external microphone to sit in front of the lips when wearing the earphone, the LED indicator 312 is then more visible when a light reminder is given.
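The second key-315 configuration can be disambiguated with a short timer, as in this sketch; the 0.4-second press gap and the `device` interface are illustrative assumptions.

```python
import threading

class MultiFunctionKey:
    """Key 315, second configuration: one press = stop reminder + stop recording,
    two presses within DOUBLE_PRESS_GAP seconds = start playback."""

    DOUBLE_PRESS_GAP = 0.4          # illustrative debounce window, in seconds

    def __init__(self, device):
        self.device = device
        self.pending = None         # timer for a not-yet-committed single press

    def on_press(self):
        if self.pending is not None:           # second press inside the window
            self.pending.cancel()
            self.pending = None
            self.device.start_playback()
        else:                                  # maybe a single press: wait and see
            self.pending = threading.Timer(self.DOUBLE_PRESS_GAP, self._single)
            self.pending.start()

    def _single(self):
        # No second press arrived in time: commit the single-press actions.
        self.pending = None
        self.device.stop_reminder()
        self.device.stop_recording()
```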
In addition, a vibrator (not shown) is also provided in the earphone 310. The vibrator may be implemented using existing or future applicable technologies, and the present invention is not particularly limited. For example, an eccentric motor with a cam may be used.
The cloud server 320 can train and generate keyword recognition models based on the deep learning algorithms described above. In a specific implementation, the cloud server 320 may collect a wide range of voice samples in advance and process them, including labelling them by vocabulary, to form a standard sample library.
In this application scenario, the key information reminding system carries out the key information reminding process as follows:
Step one, initializing.
Before the key information reminding process is started, an initialization step is carried out, and software and hardware environment configuration and various parameter settings required by the operation and communication of various devices and equipment in the system are checked and updated.
Keywords are set and a new keyword recognition model is obtained, as follows:
the user sets the keyword vocabulary through the video playing device 300; for example, before an online lesson a student can enter words such as "key point", "exam", "summary" and his own name as keywords. Through such autonomous setting, personalized keywords fitting the current application scene are formed.
In order to match the hardware power consumption and computational effort of the headset 310, the upper limit of the keyword vocabulary number is set to 20.
When a new word is entered into the keywords on the video playback device 300, the device accesses the cloud server 320, sends it a request to update the keyword recognition model, and sends it the keywords.
After receiving the keywords, the cloud server 320 compares them with the keyword vocabulary it already holds. When all words in the keywords sent by the video playing device 300 are covered by the existing vocabulary, it directly uses its existing standard sample library as the training sample library and trains a new scene-based keyword recognition model with the deep learning algorithm. When some words are not covered, it obtains voice samples containing those words from the internet, expands the standard sample library into the training sample library, and trains a new keyword recognition model.
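That branch logic, reusing the standard library when every keyword is already covered and fetching samples for new words otherwise, might look like the following sketch; the `cloud` facade and its methods are assumptions for illustration only.

```python
def update_keyword_model(user_keywords, cloud):
    """Return a newly trained scene-based keyword recognition model."""
    missing = [w for w in user_keywords if w not in cloud.known_vocabulary()]
    library = cloud.standard_library()
    if missing:
        # Obtain speech samples containing the uncovered words (e.g. from the
        # internet) and expand the standard library into the training library.
        library = library + cloud.fetch_samples(missing)
    return cloud.train(library, vocabulary=user_keywords)
```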
The user may also upload a voice sample of a specific person containing one or more of the keyword words through the video playback device 300, for example a student uploading audio material of a teacher's voice. The video playing device 300 uploads the specific person's voice sample to the cloud server 320 to extend its standard sample library, so that the cloud server 320 can train a new keyword recognition model on a training sample library containing the specific person's voice samples with at least the keywords.
The cloud server 320 transmits the trained scene-based keyword recognition model to the video playback device 300 in response to the update request of the video playback device 300.
After receiving the keyword recognition model from the cloud server 320, the video playing device 300 downloads the keyword recognition model to the earphone 310, so that the earphone 310 updates the keyword recognition model stored locally.
It should be noted that setting keywords and obtaining a new keyword recognition model may be completed in the initialization step or at any suitable time during system operation, as determined by the actual situation; the invention is not limited in this respect.
Step two, the earphone 310 receives the audio stream.
After system initialization, the user can start receiving and playing the audio stream from the video playback device 300 via key 314 on the headphones 310, for example when the student attends the online course through the earphone 310 and the tablet computer 300.
Step three, the earphone 310 acquires a voice signal in the audio stream, performs voice recognition on the voice signal, and detects whether the voice signal contains a preset keyword in real time by adopting the keyword recognition model based on the scene.
The earphone 310 has a built-in voice recognition unit, which may be an embedded neural network processor, used to construct a neural network from the keyword recognition model and to run a deep learning algorithm for real-time keyword recognition on the continuously input voice signal.
The audio stream of the online lesson may contain various sound signals such as music and voice; the earphone 310 extracts the voice signals and, using the scene-based keyword recognition model and the deep learning algorithm, detects whether they contain the preset keywords. For example, if a student presets the keyword "summary", then when the teacher says "let's summarize the main content of this lesson", the earphone can detect that the current voice signal contains the keyword; and if the student uses his own name or student number as a keyword, the earphone 310 also serves as an auxiliary reminder when the teacher calls on him.
When no keyword is recognized, the headphone 310 simply continues to receive and play the audio stream without executing the following steps. It should be appreciated that receiving and playing the audio stream need not be affected while the system performs key information reminding.
Step four, the earphone 310 generates a key information alert and records the audio stream.
When the earphone 310 detects that the current voice signal contains a preset keyword, the vibrator starts vibrating. The user may stop the vibration with key 315. If the user has not stopped it after a predetermined vibration time, e.g. 10 seconds, the vibration stops automatically and the LED indicator 312 starts flashing red. The red light may flash for a longer preset time, or until the user stops it with key 315. If the LED indicator 312 is already active (flashing red) when the earphone 310 would generate a new vibration, no new vibration is produced and the indicator simply stays in its current state. Thus a student still wearing the earphone is alerted to the key information by vibration, while one who has taken the earphone off is alerted by the light effect.
The headphones 310 begin recording the received audio stream at the same time as the key information reminder is generated, as follows:
the recorded audio stream is stored locally for at most a first predetermined duration, which should be no longer than the maximum audio duration the headphones 310 can store. The first predetermined duration may be a preset fixed value: for example, if the headphones 310 can store at most 2 minutes of audio, the first predetermined duration may be 2 minutes, or it may be 30 seconds, in which case the headphones 310 can keep up to 4 recordings of up to 30 seconds each.
While starting to record the received audio stream, the earphone 310 also sends a recording start instruction and the detected keyword word to the video playback device 300.
After receiving the recording start instruction sent by the earphone 310, the video playing device 300 starts recording the sent audio stream.
Step five, the video playing apparatus 300 converts the recorded voice signal into text information and stores it.
The video playing device 300 can obtain the voice signal in the recorded audio stream and convert it in full to text by any of various prior-art speech-to-text methods, then store the text. The keyword words detected by the earphone 310, the converted text and the recording can be stored in association, so that the user can select and review them later.
And step six, recording is stopped.
When the user inputs a recording stop command through key 315, or when continuous recording exceeds the first predetermined duration without a stop command from the user, the earphone 310 automatically stops recording the audio stream.
Similarly, when the user inputs a recording stop command through key 315, or when continuous recording exceeds the second predetermined duration without a stop command, the video playing device 300 automatically stops recording the audio stream.
And step seven, recording and playing back.
In this embodiment, the user may play back the audio recording on the headphones 310 or the video playback device 300.
For example, when a student starts the local playback function by pressing key 315 several times in succession, the headset 310 plays the locally stored recorded audio stream while continuing to play the audio stream from the video playback device 300. The two audio streams can be mixed for playback, or one earpiece 311 can play one stream and the other earpiece the other (a mixing sketch follows the playback options below).
Alternatively, when the student starts the remote playback function by pressing key 315 three times in succession, the earphone 310 sends a playback request instruction to the video playing device 300, which on receipt sends the recorded audio stream to the earphone 310.
In addition, the student can also directly input a playback instruction on the video playback device 300 to play the recorded audio stream stored in the video playback device 300.
The student may also designate one of the recorded audio streams to be played on the video playback device 300.
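For the local playback case above, mixing the live stream with the recorded clip, or splitting them across the two earpieces, might be done as follows, assuming float PCM in [-1, 1] and equal gains; both choices are illustrative, not mandated.

```python
import numpy as np

def mix_streams(live: np.ndarray, recorded: np.ndarray,
                live_gain: float = 0.5, rec_gain: float = 0.5) -> np.ndarray:
    """Sum the live audio with the recorded clip for single-channel playback."""
    n = min(len(live), len(recorded))
    mixed = live_gain * live[:n] + rec_gain * recorded[:n]
    return np.clip(mixed, -1.0, 1.0)          # guard against clipping

def split_streams(live: np.ndarray, recorded: np.ndarray) -> np.ndarray:
    """Alternative: live stream on one earpiece, recording on the other."""
    n = min(len(live), len(recorded))
    return np.stack([live[:n], recorded[:n]], axis=1)   # (n, 2) stereo frames
```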
And step eight, consulting the text information.
In this step, the student can consult the text information corresponding to the recorded audio stream on the video playing device 300, so as to review and take notes based on it.
As the above embodiments and typical application scenarios show, the key information reminding method, system and embedded audio playing device provided by the embodiments of the invention realize real-time detection, reminding and playback of key information in continuous voice on small, low-power embedded equipment. They are convenient to use, simple to operate and widely applicable; they can effectively remind about, save and allow review of key information, reduce the loss from missed key information, and increase user satisfaction with remote audio and video applications.
Those of ordinary skill will further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two; the foregoing description has described the elements and steps generally by function to illustrate this interchangeability. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality differently for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments illustrates the general principles of the invention and is not intended to limit the invention to the particular embodiments disclosed or otherwise restrict its scope; any modifications, equivalents, improvements and the like made within the spirit and principles of the invention are intended to fall within its scope.