CN111601215B - A scenario-based key information reminder method, system and device - Google Patents

A scenario-based key information reminder method, system and device

Info

Publication number
CN111601215B
Authority
CN
China
Prior art keywords
keyword
voice
audio stream
recognition model
recording
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010313790.4A
Other languages
Chinese (zh)
Other versions
CN111601215A (en)
Inventor
张时嘉
曾娟鹃
张亦农
王海业
由海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xijueshuo Information Technology Co ltd
Original Assignee
Nanjing Xijueshuo Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xijueshuo Information Technology Co ltd
Priority to CN202010313790.4A
Publication of CN111601215A
Application granted
Publication of CN111601215B
Status: Active
Anticipated expiration


Abstract

Translated from Chinese

The present invention provides a key information reminder method, a system, and an embedded audio playback device. Compared with the prior art, the embedded audio playback device can independently perform scene-based real-time detection, reminding, recording, and playback of key information in continuous speech, and is convenient, easy to use, and private. The reminder system and method train, in advance, a keyword recognition model closely matched to the application scene, based on the actual needs of that scene together with keywords or training samples customized by the user. This effectively improves the accuracy of identifying key information in a continuous voice stream; reminders are output and recordings saved promptly for the information that deserves attention in the current scene and the information the user cares about, yielding an excellent user experience.

Description

Scene-based key information reminding method, system and device
Technical Field
The invention relates to the technical field of embedded devices, and in particular to a scene-based key information reminding system and method, and an embedded audio playing device.
Background
The internet and mobile communication networks have now entered millions of households and reach into every corner of daily life. Remote audio and video applications built on these communication platforms — web conferencing, online education, online business negotiation, online sales, and the like — continue to emerge as the underlying technologies and products mature: computer networking, audio/video processing, and embedded devices built around a system-on-chip (SoC). Paired with embedded devices such as mobile phones, headphones, tablets, and speakers, these applications remove geographic limits entirely, allowing people in different places to interact by voice and video in real time at any moment, which greatly benefits both work and daily life. For example, during the current epidemic, students can continue their lessons at home through online teaching platforms. Students often attend online classes through headphones and may walk around freely while wearing them. Unfortunately, because an online class lacks classroom atmosphere, the teacher cannot observe each child's attentiveness in time, so learning depends heavily on the student's self-discipline. Once a student wanders off or starts playing, nobody can prompt and correct them promptly, and the teacher's instruction is missed. The same happens in web video conferences, where critical voice information is missed because a participant is distracted or steps away to answer a call. In general, online-class and video-conferencing software on phones and computers offers no function that flags key information in the remote speaker's content.
Even where such a function exists, the local user is not necessarily next to the phone or computer. It is therefore highly desirable to implement the key information reminding function directly in the accessory device closest to the local user — the headphones or speaker attached to the phone or computer — so that the user's attention can be pulled back to the class or conference at the first moment.
In recent years, speech recognition technology has been increasingly applied to voice monitoring and the recognition of important information. With the strong support of Moore's law and big data, artificial-intelligence speech recognition has moved from shallow recognition into a deep learning stage. Speech recognition built on deep learning theory and neural network models delivers markedly higher accuracy, and is therefore widely used in fields such as intelligent voice wake-up, intelligent voice control, and intelligent voice dialogue.
However, after intensive research, the inventors found that using artificial-intelligence speech recognition to implement a key voice information reminder in current remote audio/video applications faces several technical bottlenecks, for example:
In the first aspect, in artificial-intelligence speech recognition, the recognition model is the key to guaranteeing accuracy. Existing intelligent voice wake-up, voice control, and voice dialogue applications typically adopt a general-purpose model: the device or application provider trains the model in advance, and both the criteria for what counts as important information and the choice of training samples are fixed by that provider. Simply reusing such a general-purpose model in remote audio/video applications makes it hard to adapt to the variety of application scenes, and the unguaranteed recognition accuracy can even lead to a poor user experience.
In a second aspect, artificial-intelligence speech recognition — deep learning in particular — requires a large volume of high-precision computation, which in turn demands strong hardware support in memory, compute, and power. Today the technology therefore runs mostly on costly, power-hungry, high-performance platforms such as GPUs and FPGAs. Independent keyword recognition that does not rely on a phone or the cloud is rare, and where it exists it is limited to isolated words, fixed keyword sets, or constrained sentence patterns — enough for simple, low-level functions such as voice wake-up or smart-home voice control, but not for spotting key voice information in complex continuous speech on the low-power, low-performance embedded devices ordinary consumers use most (headphones, portable speakers, phone watches, conference terminal devices, and so on). The intelligent voice assistants currently on the market upload the voice stream collected by the embedded device to a phone or the cloud for recognition, usually handle only single sentences, and return the result after a round trip; the latency degrades the user experience, and the user's privacy is hard to guarantee. The root cause is the limited compute and power budget of embedded hardware, which cannot adequately support existing large-vocabulary continuous speech recognition.
In the third aspect, consumer speech recognition today performs some interaction after recognizing keywords or full speech in a locally captured voice stream; what is missing is a function that, in a specific scene, spots keywords of interest in speech traveling in the other direction — from the far end — and prompts the user.
It is therefore desirable to provide a scenario-based key information reminding technique that overcomes at least one of the above-mentioned technical drawbacks.
Disclosure of Invention
In view of this, the invention provides a key information reminding method, a key information reminding system, and an embedded audio playing device, which can effectively remind users to pay attention to key information.
In order to achieve the above object, a first aspect of the present invention provides an embedded audio playback apparatus comprising a speaker, a communication unit, a control unit, a storage unit, a voice recognition unit, and a reminder unit, wherein:
The communication unit receives an audio stream from a far end;
the voice recognition unit comprises a keyword recognition model unit, wherein the keyword recognition model unit is used for storing a scene-based keyword recognition model;
the keywords are associated with the application scene, wherein the keywords comprise a group of vocabularies which need to be focused in the application scene, and one or more of the vocabularies are pre-designated by a user;
The voice recognition unit extracts a voice signal from the audio stream and detects whether the voice signal contains the keywords or not in real time by adopting the keyword recognition model based on the scene;
the control unit is used for starting recording the received audio stream when the voice signal contains the keyword, and controlling the reminding unit to output the keyword information reminding;
the storage unit is used for storing the recorded audio stream;
the speaker is used for playing the audio stream or playing back the recorded audio stream in response to a playback instruction.
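The division of labor among these units can be sketched roughly as follows; this is an illustrative Python sketch, and all class, method, and variable names are assumptions, not taken from the patent:

```python
class EmbeddedAudioPlayer:
    """Illustrative sketch of the claimed device; names are hypothetical."""

    def __init__(self, keywords):
        self.keywords = set(keywords)   # scene-based keyword set
        self.recording = []             # storage unit: recorded audio stream
        self.is_recording = False
        self.reminders = []             # reminder unit output

    def on_audio_chunk(self, chunk_words):
        """Voice recognition unit: detect keywords in the extracted speech.

        `chunk_words` stands in for the words recognized in one chunk of
        the received audio stream.
        """
        hits = self.keywords.intersection(chunk_words)
        if hits and not self.is_recording:
            self.is_recording = True                 # control unit starts recording
            self.reminders.append(sorted(hits))      # reminder unit fires
        if self.is_recording:
            self.recording.append(chunk_words)       # storage unit keeps the stream

    def playback(self):
        """Speaker plays back the recorded audio stream on request."""
        return self.recording
```

A chunk that contains no keyword before recording is triggered is simply played and discarded; from the first keyword hit onward, chunks are both played and retained for later playback.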
Preferably, the scene-based keyword recognition model may be obtained by training in advance, using a deep learning algorithm, on a training sample library containing voice samples of the keywords and/or voice samples of the keywords spoken by specific persons;
the control unit may further be configured to download the scene-based keyword recognition model from the remote end via the communication unit.
Preferably, the voice recognition unit may further include a voice preprocessing unit for preprocessing an input audio stream to eliminate noise, background human voice, music voice, and extract a voice signal;
Preferably, the voice recognition unit may further include a neural network processing unit, configured to perform data processing on the voice signal or the voice signal processed by the voice preprocessing unit by using a deep learning algorithm based on the keyword recognition model, so as to infer and determine a vocabulary appearing in the voice signal, so as to determine whether the keyword vocabulary is included in the vocabulary.
Preferably, the reminding unit may be one or more of an indicator light module, a vibrator module, a text message generation module, a voice message generation module and a music message generation module.
Further, the apparatus may also comprise an input unit for receiving recording stop instructions and playback instructions input by the user;
when the voice signal contains a keyword, the control unit may start continuously compressing and encoding the received audio stream and storing it locally;
the control unit may stop recording when a recording stop instruction is received or when the recording duration exceeds a first predetermined length;
the control unit may play the locally stored recorded audio stream when a local audio playback instruction is received;
Further, the control unit may also be configured to send a recording start instruction to the far end when the voice signal contains a keyword, causing the far end to start continuously recording the transmitted audio stream, and to send a recording stop instruction to the far end when a stop instruction is received before the continuous recording time exceeds the second predetermined length;
the control unit may send a playback request to the far end when a remote audio playback instruction is received, and receive and play the recorded audio stream stored remotely.
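The far-end side of this record/playback exchange could be modeled as a small message handler; the message names and the chunk-count timing below are illustrative assumptions, not part of the patent:

```python
class RemoteRecorder:
    """Hypothetical far-end recorder driven by control-unit instructions."""

    def __init__(self, max_seconds):
        self.max_seconds = max_seconds   # second predetermined duration
        self.started_at = None           # None means: not recording
        self.frames = []

    def handle(self, msg, frame=None, now=0.0):
        """Process one instruction from the embedded device's control unit."""
        if msg == "REC_START":
            self.started_at = now
        elif msg == "FRAME" and self.started_at is not None:
            # auto-stop once the predetermined duration is exceeded
            if now - self.started_at > self.max_seconds:
                self.started_at = None
            else:
                self.frames.append(frame)
        elif msg == "REC_STOP":
            self.started_at = None
        elif msg == "PLAYBACK":
            return list(self.frames)
```

The device only has to send `REC_START` on a keyword hit, `REC_STOP` on a user instruction, and `PLAYBACK` on a remote-playback request; the duration cap is enforced at the far end.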
Preferably, the embedded audio playing device is an earphone or a sound box with a band-pass function.
As a second aspect of the present invention, there is provided a key information alert system, comprising an embedded audio playing device and a remote device,
The remote device receives a user-customized keyword vocabulary and/or user-provided voice samples, spoken by a specific person and containing at least the keywords, for obtaining the scene-based keyword recognition model;
the scene-based keyword recognition model is trained in advance on a training sample library containing voice samples of the keywords and/or voice samples of the keywords spoken by specific persons;
The embedded audio playing device is communicated with the remote equipment, receives an audio stream from the remote equipment and plays the audio stream;
The embedded audio playing device also acquires a voice signal from the audio stream, carries out voice recognition on the voice signal by adopting a keyword recognition model based on scenes, and detects whether the voice signal contains keywords in real time;
When the voice signal contains keywords, the embedded audio playing device generates a keyword information prompt and starts recording the received audio stream;
The embedded audio playing device responds to the playback instruction and plays the recorded audio stream.
Further, the system may also include a cloud server,
The remote equipment is communicated with the cloud server, and the keywords and/or voice samples of specific persons are sent to the cloud server;
The cloud server expands the received keywords and/or the voice samples of the specific person to form a training sample library, and based on the training sample library, the scene-based keyword recognition model is obtained by training through a deep learning algorithm;
The remote device receives the scene-based keyword recognition model from the cloud server and downloads the scene-based keyword recognition model to the embedded audio playing device.
Preferably, the remote device uses keyword vocabulary input by a user and/or voice samples provided by the user and at least containing specific people of the keywords to expand a standard sample library to form a training sample library, and based on the training sample library, the scene-based keyword recognition model is obtained by training through a deep learning algorithm;
the remote device downloads the scene-based keyword recognition model to the embedded audio playing device.
As a third aspect of the present invention, there is provided a key information reminding method, wherein,
The method comprises receiving a user-customized keyword vocabulary and/or user-provided voice samples of the keywords spoken by at least a specific person, wherein the keywords are associated with an application scene and comprise a group of vocabulary items that require attention in that scene;
training the scene-based keyword recognition model on a training sample library containing voice samples of the keywords and/or voice samples of the keywords spoken by specific persons;
When receiving and playing an audio stream, acquiring a voice signal from the audio stream;
performing voice recognition on the voice signal by adopting the scene-based keyword recognition model, and detecting whether keywords are contained in the voice signal in real time;
When the voice signal contains keywords, generating a keyword information reminder and starting recording the received audio stream;
and playing the recorded audio stream in response to the playback instruction.
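The method steps above can be strung together as a minimal pipeline sketch; function names and the word-list representation of audio are illustrative assumptions:

```python
def extract_speech(chunk):
    # placeholder for the preprocessing step (noise/music/background removal)
    return chunk

def remind_key_information(audio_chunks, model):
    """Sketch of the claimed method: detect keywords in a played stream,
    fire a reminder on the first hit, and record from that point onward."""
    reminders, recording = [], []
    triggered = False
    for chunk in audio_chunks:
        speech = extract_speech(chunk)     # step: acquire voice signal
        hits = model(speech)               # step: real-time keyword detection
        if hits and not triggered:
            triggered = True
            reminders.append(hits)         # step: output keyword reminder
        if triggered:
            recording.append(chunk)        # step: record the received stream
    return reminders, recording
```

Here `model` stands in for the trained scene-based keyword recognition model; any callable returning the keywords found in a chunk fits the sketch.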
Preferably, a wide range of voice samples is collected in advance to form a standard sample library;
acquiring a voice sample at least containing the keywords according to the keywords;
and expanding the voice sample containing the keywords and/or the voice sample of the specific person to the standard sample library to form a training sample library, and training by adopting a deep learning algorithm based on the training sample library to obtain the scene-based keyword recognition model.
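The sample-library expansion described above amounts to concatenating the standard library with the scene-specific samples; the tuple format and labels below are purely illustrative assumptions:

```python
def build_training_library(standard_library, keyword_samples, person_samples=()):
    """Extend a pre-collected standard sample library with keyword samples
    and, optionally, samples of the keywords spoken by a specific person.

    Each sample is a (label, utterance) pair here — a stand-in for
    labeled audio data in a real training pipeline.
    """
    library = list(standard_library)
    library.extend(("keyword", u) for u in keyword_samples)
    library.extend(("keyword_specific_person", u) for u in person_samples)
    return library
```

The resulting library is what the deep learning training step would consume; because it contains samples closely tied to the application scene, the trained model fits that scene better than a general-purpose one.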
Further, the step of obtaining the voice signal from the audio stream may further include a preprocessing step of eliminating noise, music sound, and background human voice;
Preferably, the voice recognition is performed on the voice signal or the preprocessed voice signal by using the keyword recognition model based on the scene, and whether the voice signal contains keywords or not is detected in real time, which specifically includes constructing a deep learning neural network based on the keyword recognition model, continuously inputting the voice signal into the deep learning neural network for data processing, so as to perform reasoning and judgment on vocabulary appearing in the voice signal, and determining whether the vocabulary contains keywords or not.
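The continuous-input aspect of this detection step can be illustrated without a neural network: the sketch below is a simple rolling-window matcher standing in for the deep-learning inference, so that multi-word keywords spanning chunk boundaries are still caught. All names and the window size are assumptions:

```python
from collections import deque

def stream_keyword_detector(keywords, window=4):
    """Illustrative streaming detector standing in for neural-network
    inference: keeps a small rolling window of recognized words so that
    multi-word keywords split across input chunks are still matched."""
    phrases = [tuple(k.split()) for k in keywords]
    buf = deque(maxlen=window)

    def feed(word):
        """Feed one recognized word; return the keywords completed by it."""
        buf.append(word)
        words = tuple(buf)
        hits = []
        for p in phrases:
            if len(p) <= len(words) and words[-len(p):] == p:
                hits.append(" ".join(p))
        return hits

    return feed
```

A real implementation would feed acoustic features into the trained network frame by frame, but the control flow — continuous input, per-step keyword decision — is the same.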
Preferably, the recording of the received audio stream may specifically include starting continuous compression encoding and local storage of the received audio stream when the speech signal contains keywords;
Stopping local recording when a recording stop instruction is received or the duration of recording exceeds a first preset time length;
the playing of the recorded audio stream in response to the playback instruction specifically includes playing a locally stored recorded audio stream in response to the playback of the local audio instruction.
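The local stop condition — a user stop instruction, or the first predetermined duration being exceeded — can be sketched as follows; chunk count stands in for wall-clock time, and the names are illustrative:

```python
def record_stream(chunks, stop_signals, max_chunks):
    """Record incoming chunks until a stop instruction arrives or the
    recording length reaches the (first) predetermined limit.

    `stop_signals` maps chunk index -> True where the user requested a stop;
    `max_chunks` is a chunk-count stand-in for the predetermined duration.
    """
    recorded = []
    for i, chunk in enumerate(chunks):
        if stop_signals.get(i) or len(recorded) >= max_chunks:
            break
        recorded.append(chunk)
    return recorded
```

Whichever condition fires first ends the recording; everything captured up to that point remains in local storage for playback.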
Preferably, recording the received audio stream may specifically include: when the voice signal contains a keyword, sending a recording start instruction to the far end, which then starts continuously recording the transmitted audio stream and storing it remotely;
when a recording stop instruction is received before the continuous recording time exceeds the second predetermined length, a stop instruction is sent to the far end, and the far end stops recording;
playing the recorded audio stream in response to a playback instruction comprises: in response to a remote audio playback instruction, sending a playback request to the far end, and receiving and playing the recorded audio stream stored there.
Preferably, the key information reminder may be one or a combination of visual, tactile, and auditory reminders;
the visual reminder includes a light-effect reminder and a remote text message reminder;
the tactile reminder includes a vibration reminder;
the auditory reminder includes a voice reminder and a music reminder.
Compared with the prior art, the embedded audio playing device provided by the invention can independently perform scene-based real-time detection, reminding, recording, and playback of key information in continuous speech, is convenient to use, and preserves privacy. The key information reminding system and method train a keyword recognition model highly matched to the application scene, based on the actual needs of that scene and the keywords or training samples customized by the user; this effectively improves the accuracy of recognizing key information in a continuous voice stream, and reminders are output and recordings saved promptly for the information that matters in the current scene and the information the user cares about, giving an excellent user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a key information reminding method according to embodiment 1 of the present invention;
fig. 2 is a schematic circuit block diagram of an embedded audio playing device according to embodiment 2 of the present invention;
Fig. 3 is a system architecture diagram of a key information reminding system according to embodiment 3 of the present invention.
Detailed Description
Over more than 40 years of Moore's law, semiconductor design and manufacturing have advanced rapidly: chip computing power and on-chip storage have grown enormously while power consumption has kept falling, making it feasible to bring artificial-intelligence techniques to small, low-power embedded devices. The invention addresses the prior-art defect that people easily miss important information from the opposite end when using remote audio/video applications. Specifically, on the embedded device, scene-based artificial-intelligence speech recognition identifies information of interest from the opposite end in real time, promptly outputs reminders, and saves the key audio stream. The invention adapts to different application scenes and meets the personalized needs of different users, thereby effectively overcoming the defects of the prior art. As used herein, "real-time" means the embedded audio playback device has sufficient computing power to identify keywords in the audio stream while it is being played at its original speed.
The technical scheme of the present application is further exemplarily described below by means of the accompanying drawings and examples. It will be apparent that the described embodiments are only some of the embodiments of the present application and are not exhaustive of all embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
Example 1:
As shown in fig. 1, according to the core idea of the present invention, this embodiment provides a key information reminding method, in which,
Step 100, initializing step.
The step is a processing flow before entering the key information reminding, and is mainly used for checking and updating software and hardware environment configuration, parameter preparation, program preparation and the like required by the key information reminding. The method can comprise the step of establishing communication connection between the local device and the remote device in a wireless communication mode or a wired communication mode, and can further comprise the step of acquiring a keyword recognition model based on a scene.
It should be noted that "local" and "remote" are relative concepts here: "local" denotes the party that receives the audio stream and generates the key information reminder, while "remote" denotes the other party, independent of "local" but connected to it in a wired or wireless manner, directly or through one or more intermediaries, which sends the audio stream to "local". It should also be noted that "far end" and "opposite end", as commonly used when describing a voice call, are distinct concepts: the "opposite end" is the originator of the audio stream; the "far end" is the initial receiver of the call audio stream after it is sent from the opposite end; and the corresponding "local" is the final receiver of the call audio stream.
As a specific embodiment, the "local" may be an audio playing device based on an embedded system (an embedded audio playing device, for short). An "embedded system" here is a special-purpose computer system embedded in a host system: application-centered, based on computer technology, with tailorable software and hardware, suited to applications with strict requirements on function, reliability, cost, size, and power consumption. An "embedded device" is a device containing such an embedded system — typically built on an ARM or other low-power core and architecture, realizing specific functions and applications rather than serving as a general-purpose PC — and may specifically be a headphone, a speaker, a phone watch, a conference terminal device, and the like. The "remote" may be an end-user computer system, a network server or server system, a mobile computing device, a consumer electronic device, or any combination or portion thereof — specifically, for example, a mobile phone, tablet, computer, or smart television.
Remote audio/video applications span many scenes, the voice information they carry is enormous, and what counts as key information differs from person to person and scene to scene. People typically attend video conferences or online classes through accessory devices — headphones or speakers — of a phone or computer. In a video conference, users likely care most about the content that concerns them: their department, their superiors, the business they are responsible for. Keywords for identifying key information would then be department names, superiors' names, the user's own name, business names, task assignments, delivery deadlines, and so on. In an online class, students care most about the knowledge points being taught, so the keywords would be important points, difficulties, exam points, summaries, reviews, and so on. In a customer service center, the complaints customers raise matter most, so the keywords need to include complaint, suggestion, quality, service attitude, and so on. If full-text speech recognition models were used across these different scenes with acceptable accuracy, they would have to be trained on very large numbers of voice samples. In general this is impractical in a phone or computer accessory device: on the one hand, huge voice sample sets are hard to obtain; on the other, training on them places very high demands on hardware, so the high implementation cost limits adoption of the technique in such devices.
Therefore, in this embodiment, the step of obtaining the keyword recognition model based on the scene, particularly obtaining the keyword recognition model in the accessory embedded device of the mobile phone or the computer, is used for adjusting and updating the keyword recognition model according to the actual application scene, so that the keyword recognition model is more suitable for the current scene, and meets the user requirement. The keywords are associated with the application scene and comprise a group of vocabularies which need to be focused in the application scene. Different application scenarios may correspond to different keywords. The user can set and specify one or more vocabularies in the keywords according to the actual requirements.
The step of obtaining the scene-based keyword recognition model specifically comprises: receiving a user-customized keyword vocabulary and/or receiving user-provided voice samples, spoken by a specific person, that contain at least the keywords; expanding a standard sample library with the keywords and the specific-person voice samples to form a training sample library; and training on the training sample library to obtain the scene-based keyword recognition model. The standard sample library may be a training sample set formed from a wide range of pre-collected voice samples.
The step of receiving the user-customized keyword vocabulary and/or the user-provided voice samples of a specific person containing at least the keywords is typically performed at the remote end, which offers the richer user interface of a phone or computer.
As an alternative implementation mode, a user can set a custom keyword set through a remote end in advance according to own preference, demand and use scene, and a provider of an audio stream can also generate a default keyword set according to various factors such as the use scene, the content of the audio stream, the use habit of the user and the like. The remote end can also display a plurality of default keyword vocabulary in advance for users to select, add and subtract so as to form a keyword vocabulary set associated with the application scene.
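Combining provider defaults with user selections can be sketched as a simple merge; the scene names and default vocabularies below are drawn from the examples earlier in the description, and the structure is an illustrative assumption:

```python
# Provider-shipped default keyword sets per scene (illustrative values,
# taken from the scenarios discussed in the description).
DEFAULT_SCENE_KEYWORDS = {
    "video_conference": ["department name", "task assignment", "delivery deadline"],
    "online_class": ["important point", "difficulty", "exam point", "summary", "review"],
    "customer_service": ["complaint", "suggestion", "quality", "service attitude"],
}

def keywords_for_scene(scene, user_keywords=()):
    """Merge the provider defaults for a scene with user-specified vocabulary,
    preserving order and skipping duplicates."""
    merged = list(DEFAULT_SCENE_KEYWORDS.get(scene, []))
    merged.extend(k for k in user_keywords if k not in merged)
    return merged
```

The merged list is the keyword vocabulary set associated with the application scene, ready to drive model training.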
To match the hardware environment of the embedded device, an upper limit may be set on the number of vocabulary items in the keyword set — for example, 30 groups of vocabulary.
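Enforcing such a cap when users add vocabulary is straightforward; the constant and function below are an illustrative sketch, with the 30-item limit taken from the example in the text:

```python
MAX_KEYWORDS = 30  # example upper bound from the text, to fit the embedded model budget

def add_keyword(keyword_set, word):
    """Add a user-specified keyword, enforcing the vocabulary cap.

    Returns True if the word is in the set afterwards, False if it was
    rejected because the embedded model budget is exhausted.
    """
    if word in keyword_set:
        return True
    if len(keyword_set) >= MAX_KEYWORDS:
        return False
    keyword_set.add(word)
    return True
```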
In addition, many factors affect recognition accuracy in practice — the speaker's sex, age, physiological characteristics of pronunciation, dialect, non-native pronunciation, emotional state, and environmental noise; for example, the same word "key" is pronounced very differently in Sichuan and Guangdong accents. In this embodiment, therefore, a voice sample with a specific accent, provided by the user and containing at least the keywords, may be obtained and used to extend the standard sample library — for example, a student may provide a recording of the teacher in class, and an employee a recording of the boss in a meeting.
After the voice sample of a specific person, which at least contains the keywords and is provided by the user, is received, the voice sample of the specific person is expanded into the training sample library. According to the embodiment, the keyword recognition model based on the scene is obtained based on the sample library training comprising the voice samples closely related to the application scene, so that the recognition accuracy can be effectively improved.
The keyword recognition model in this embodiment can be trained with techniques that have already been used successfully for speech and text recognition, such as the hidden Markov model (Hidden Markov Model, HMM), dynamic time warping (Dynamic Time Warping, DTW) and the various classical artificial-intelligence speech recognition algorithms derived from them, or with deep-learning-based algorithms and related algorithms that may appear in the future. Deep learning is one of the important areas of machine learning research; its motivation is to build and simulate a neural network that analyzes and learns like the human brain, interpreting data such as images, sounds and text by mimicking the brain's mechanisms. The core of deep learning is to learn more useful features by building a machine learning model with multiple hidden layers and a large amount of training data, thus ultimately improving the accuracy of classification or prediction. Currently, in computer vision and natural language processing, the mainstream deep learning algorithms are the convolutional neural network (Convolutional Neural Network, CNN) and the recurrent neural network (Recurrent Neural Network, RNN), together with the Long Short-Term Memory (LSTM) algorithm, the deep fully convolutional neural network (Deep Fully Convolutional Neural Network, DFCNN) algorithm, and the like. When implemented, the present embodiment may employ any applicable deep learning algorithm, including but not limited to these existing or future ones.
As a preferred implementation, this embodiment employs a continuous speech keyword recognition technique based on a deep learning algorithm. For example, after the training sample library is obtained, a deep learning algorithm such as the convolutional neural network (CNN) algorithm or the recurrent neural network (RNN) algorithm is used to train a scene-based keyword recognition model on the training sample library.
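As a rough structural illustration only, not the patent's trained model, the toy sketch below shows the CNN-style "convolve, max-pool, threshold" scoring that such a keyword spotter performs per keyword; the frames, kernels and threshold are all invented for the example.

```python
def conv1d_max(frames, kernel):
    # Slide the learned kernel over the acoustic feature frames and return
    # the best (globally max-pooled) response: the utterance's keyword score.
    k = len(kernel)
    best = float("-inf")
    for i in range(len(frames) - k + 1):
        score = sum(f * w for f, w in zip(frames[i:i + k], kernel))
        best = max(best, score)
    return best

def detect_keywords(frames, filters, threshold=1.0):
    # One filter per keyword; a keyword fires when its pooled score
    # exceeds the decision threshold.
    return [kw for kw, kern in filters.items()
            if conv1d_max(frames, kern) > threshold]

frames = [0.1, 0.9, 0.8, 0.1, 0.0]                     # toy feature sequence
filters = {"summary": [1.0, 1.0], "exam": [1.0, -1.0]}  # toy learned kernels
hits = detect_keywords(frames, filters)
```

In a real implementation the kernels would be multi-dimensional and learned from the training sample library, and the per-frame features would be spectral (e.g. filter-bank) vectors rather than scalars; the detection decision, however, has the same shape.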
The training process of the keyword recognition model using the deep learning algorithm can be completed at the remote end or in the cloud. It should be noted that the "cloud" used herein refers to a cloud-computing server side or a cloud-computing background server with strong processing and storage capabilities. As a preferred embodiment, the training process is completed in the cloud, so as to make full use of the cloud's hardware resources and strong computing power. Specifically, after the user enters keyword vocabulary at the remote end or uploads a voice sample of a specific person containing the keywords, the remote end sends the keywords and/or the voice sample to the cloud; the cloud can then acquire voice samples containing the keywords through channels such as the Internet, expand its standard sample library with these samples and the specific-person voice sample to form a training sample library, and train on that library to obtain the scene-based keyword recognition model.
Further, after training is completed, the remote end receives the keyword recognition model based on the scene from the cloud.
The process of the training at the far end may refer to the process of the cloud training, which is not described herein.
The initializing step may further include updating the local keyword recognition model, which specifically includes downloading the scene-based keyword recognition model from the remote end to the local device. The download may be actively initiated by the remote end, or started by the remote end in response to a local update request.
After the initialization step is completed, the following key information real-time detection and reminding process can be carried out.
Step 110, when receiving and playing an audio stream, acquiring a voice signal from the audio stream;
The key information reminding process in the embodiment is to identify key information contained in voice information and remind the user when receiving and playing audio streams in the remote audio and video application.
As a preferred embodiment, when acquiring the voice signal from the audio stream, this step further performs background sound elimination on the audio stream, removing background noise, background voices, music and the like, and extracts a foreground voice signal with a high signal-to-noise ratio, thereby improving the success rate of speech recognition.
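As a minimal sketch of the idea only, the energy gate below silences frames whose short-time energy sits near the noise floor and passes louder frames as foreground speech; a production system would instead use spectral subtraction or a trained source-separation model, and the `ratio` threshold here is an invented parameter.

```python
def frame_energy(frame):
    # Mean squared amplitude of one audio frame.
    return sum(s * s for s in frame) / len(frame)

def noise_gate(frames, ratio=2.0):
    # Frames at or below ratio * noise-floor energy are treated as
    # background and zeroed; the rest pass through as foreground speech.
    floor = min(frame_energy(f) for f in frames)
    return [f if frame_energy(f) > ratio * floor else [0.0] * len(f)
            for f in frames]

quiet = [0.01, -0.01, 0.02]   # background-level frame
loud = [0.5, -0.6, 0.4]       # speech-level frame
gated = noise_gate([quiet, loud, quiet])
```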
Step 120, performing voice recognition on the voice signal by using the keyword recognition model based on the scene, and detecting whether the voice signal contains a preset keyword in real time;
when detecting whether the voice signal contains keywords, the voice signal can be considered to contain keywords as long as one group of keyword vocabulary is detected.
As a preferred implementation, a continuous speech keyword recognition technology based on a deep learning algorithm is used to recognize scene-based keyword information. Specifically, a deep learning neural network is constructed based on the keyword recognition model, and the continuous voice signal to be recognized is input into the network for data processing, so as to infer the words appearing in the voice signal and determine whether they include keyword vocabulary.
In this embodiment, the continuous speech keyword recognition technology based on the deep learning algorithm is used for scene-based keyword recognition. Compared with large-vocabulary continuous speech recognition in the prior art, it is unnecessary to recognize every word; only the one or more groups of keywords set by the user need to be detected in the continuous speech stream. On the one hand, the continuous speech stream can be detected in real time; on the other hand, the requirements on hardware computing capacity, storage space and power consumption are low, so the method can be applied to a small low-power embedded system. At the same time, scene-based recognition effectively improves recognition accuracy and the user experience of speech recognition.
When it is detected that the speech signal does not include the keyword, the process returns to step 110, and the subsequently acquired audio stream is continuously detected.
When the voice signal at least contains a group of keywords, steps 130 and 140 are executed;
step 130, generating a key information reminder.
The key information alert may include a visual alert, a tactile alert, and an audible alert;
the visual reminder comprises a light-effect reminder and a text message reminder, such as an LED indicator flashing or displaying a specific light effect, a flashing pattern appearing on the remote screen, or a remote text message (such as a notification message from a mobile phone application, APP);
the tactile alert includes a vibratory alert, such as vibrating according to a predetermined pattern;
the audible alert includes a voice alert, a music alert, such as an alert with predetermined voice content or music.
In implementation, one or more of the above reminding modes can be selected according to the actual application scenario; for example, only a light-effect reminder or a music reminder may be set, or a message may be sent to an associated computer application (APP) while vibrating, so as to obtain a double reminding effect.
Step 140, beginning recording the received audio stream;
in this embodiment, when it is determined that the voice information of the current audio stream includes a keyword, recording of the received audio stream is started while the reminder is generated, so as to help the user miss as little important content as possible.
When the audio stream is recorded, the recording starting point may be the keyword itself, the audio stream received after the keyword appears, or a point a fixed time before the keyword, taken from the portion of the current audio stream that has already undergone rolling compression coding when the keyword appears. That is, the recorded audio stream may or may not include the audio at the time the keyword occurs, and may or may not include the audio before the keyword occurs.
The recorded audio stream is stored locally after compression encoding for local playback. Recording continues until a recording stop instruction is received or the continuous recording time exceeds a first predetermined time length, at which point it stops. Considering the limited capacity of the local storage carrier, the first predetermined time length may be set relatively short, for example 1 to 2 minutes. In general, the important content appears in the voice information shortly after the keyword, so even a short first predetermined time length can capture the most important voice content, allowing the user to quickly learn the important information when playing back the recording.
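The "fixed time before the keyword" starting point described above implies a small rolling buffer of recent audio. The sketch below is one minimal way such a pre-roll recorder could be structured; the class name, chunk granularity and buffer size are all illustrative assumptions, not the patent's implementation.

```python
from collections import deque

class PrerollRecorder:
    def __init__(self, preroll_chunks=3):
        # Rolling buffer of the most recent compressed audio chunks.
        self.ring = deque(maxlen=preroll_chunks)
        self.recording = None

    def feed(self, chunk):
        if self.recording is not None:
            self.recording.append(chunk)   # keyword already fired: record
        else:
            self.ring.append(chunk)        # idle: keep only the pre-roll

    def on_keyword(self):
        # Start the recording from the buffered pre-keyword audio.
        self.recording = list(self.ring)

    def stop(self):
        rec, self.recording = self.recording, None
        self.ring.clear()
        return rec

r = PrerollRecorder(preroll_chunks=2)
for c in ["a", "b", "c"]:
    r.feed(c)          # before the keyword: only the ring buffer fills
r.on_keyword()         # keyword detected around chunk "c"
r.feed("d")            # audio after the keyword is appended
saved = r.stop()
```

Because `deque(maxlen=...)` silently discards the oldest chunk, memory stays bounded no matter how long playback runs before a keyword appears.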
Timing may be started when the recording start instruction is sent. If a recording stop instruction is received while the remote end's continuous recording time has not yet exceeded the second predetermined time length, the recording stop instruction is sent to the remote end, so that the remote end can stop recording at any point within the second predetermined time length upon receiving the instruction, which increases the controllability of the recording duration. The remote end may automatically stop recording when the duration exceeds the second predetermined length.
To help the user grasp important information as fully as possible and reduce omissions, the second predetermined time length may be set greater than or equal to the first predetermined time length, that is, the second predetermined time length is longer, for example 2 to 5 minutes, so that an audio stream containing a longer span of key information may be saved for playback by the user.
Of course, in the case where the local storage space is sufficiently large, the first predetermined time period may be set to be longer than or equal to the second predetermined time period, so that the recording is stored locally for a sufficiently long time period, and the recording for a shorter time period is reserved at the far end, so that the user or other people can play back the recording at the far end to quickly learn the key information.
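The dual-duration rule above can be sketched as a single stopping predicate applied with different caps locally and remotely; the concrete durations and names below are illustrative assumptions.

```python
FIRST_PREDETERMINED = 90     # local recording cap in seconds (e.g. 1-2 min)
SECOND_PREDETERMINED = 240   # remote recording cap in seconds (e.g. 2-5 min)

def should_stop(elapsed, cap, stop_requested):
    # Recording halts when a stop instruction arrives or the cap elapses,
    # whichever comes first.
    return stop_requested or elapsed >= cap

# 100 s into a recording session:
local_done = should_stop(100, FIRST_PREDETERMINED, stop_requested=False)
remote_done = should_stop(100, SECOND_PREDETERMINED, stop_requested=False)
remote_early = should_stop(100, SECOND_PREDETERMINED, stop_requested=True)
```

The local copy has already hit its shorter cap, while the remote copy keeps recording unless the user's stop instruction reaches it first.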
In addition, as an alternative implementation manner, the remote end can also perform full text voice recognition on the recorded audio stream when recording the audio stream so as to obtain corresponding text and store the text information.
Step 150, in response to the playback instruction, playing the recorded audio stream.
In this step, the locally recorded and stored audio stream may be played in response to the playback local audio command, or the remotely recorded audio stream may be received and played in response to the playback remote audio command, with the playback request being sent to the remote end.
As an alternative implementation manner, the recorded audio streams can be stored according to the sequence of the recording start time when being stored locally, and correspondingly, can be played in turn according to the sequence of the recording start time when being played back.
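Ordering playback by recording start time reduces to a sort keyed on that timestamp; the tuples below are invented example data.

```python
# Each stored recording keeps its start timestamp (seconds) alongside an
# identifier; playback iterates in recording-start order.
recordings = [(120.0, "rec_b"), (30.0, "rec_a"), (75.5, "rec_c")]
playback_order = [name for _, name in sorted(recordings)]
```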
It should be noted that step 150 is performed based on a received playback instruction, so it need not necessarily be performed after step 140; playback instructions may be detected at any time during use so that the recording can be played back.
In a typical application scenario, the key information reminding method of this embodiment may be applied to a call center system. Operators of call centers often receive hundreds of voice calls each day, which is labor intensive. However, because of differences in speech expression ability, accent problems and even emotional state, the calling party often finds it difficult to express the main purpose of the conversation clearly in a short time. If the operator cannot maintain a high degree of concentration, he or she can easily miss important information from the other party, or even misunderstand the other party's meaning, causing adverse effects. With the method of this embodiment, the operator wears a headset capable of key information reminding when receiving a call; the headset automatically identifies whether the other party's voice information contains keywords such as "alarm", "complaint" or "fraud", reminds the operator in time to pay attention to the key information, and can record the key information itself or notify a remote end in communication with it (such as a call center management platform or a call transfer platform) to record it. The operator can then learn the key information more accurately and comprehensively through the playback function, strengthening his or her understanding of the other party's intent. Therefore, the key information reminding method of this embodiment not only reminds the operator in a timely and effective manner, but also helps the operator review the call content, thereby reducing information loss and greatly relieving the operator's working pressure.
Example 2
Referring to fig. 2, according to the core concept of the present invention, the present embodiment provides an embedded audio playing device, which includes a communication unit, a speaker, a control unit, a storage unit, a voice recognition unit and a reminding unit,
The storage unit is used for storing data, programs and the like related to the operation of the device.
The communication unit can be a wired communication unit or a wireless communication unit, and can also comprise a wired communication module or a wireless communication module. Specifically, the communication unit may be implemented as a Bluetooth communication unit, a WiFi communication unit, an Internet network interface, a dedicated wired audio transmission interface, a USB interface, a micro USB interface, a mini USB interface, a Type-C interface, a Lightning interface, or any of various known or future communication units applicable to the present embodiment.
The communication unit receives an audio stream from a far end;
The voice recognition unit is used for extracting a voice signal from the audio stream and detecting whether keywords are contained in the voice signal in real time by adopting a keyword recognition model based on scenes;
The control unit is the control center of the device; it is connected with the other units in the device through various interfaces and lines, and performs overall monitoring and scheduling of each unit to realize each function of the device. In particular, when keywords are contained in the voice signal, the control unit starts recording the received audio stream and controls the reminding unit to output a key information reminder;
in this embodiment, the keywords and the application scenario are associated, which includes a group of vocabularies that need to be focused in the application scenario, and one or more of the vocabularies are pre-designated by the user;
The voice recognition unit comprises a keyword recognition model unit used for storing the scene-based keyword recognition model. The scene-based keyword recognition model is obtained in advance by training on a training sample library that includes voice samples for the keywords and/or voice samples of a specific person for the keywords. As a preferred implementation, the scene-based keyword recognition model is trained with a deep learning algorithm, and the voice recognition unit uses the keyword recognition model for continuous speech keyword recognition, so as to detect in real time whether the voice signal contains keywords;
the voice recognition unit may further include a voice preprocessing unit, configured to preprocess an input audio stream to eliminate noise, music sound, background voice, and the like, and extract a voice signal with a high signal-to-noise ratio;
The voice recognition unit may further include a neural network processing unit, configured to perform data processing on the voice signal by using a deep learning algorithm based on the keyword recognition model, so as to perform inference and decision on the vocabulary appearing in the voice signal and determine whether it includes keyword vocabulary. The neural network processing unit can be an embedded neural network processor (NPU), a dedicated neural-network array processing unit, a DSP, an embedded processor, or any other processing module capable of handling massive multimedia data in a neural network.
In this embodiment, the keyword recognition model is externally trained and downloaded into the device prior to use. Thus, the control unit is further configured to download the scene-based keyword recognition model from a remote location via the communication unit.
The reminding unit is one or more of an indicator light module, a vibrator module, a text message generation module, a voice message generation module and a music message generation module. The indicator light module can be an LED indicator that flashes or displays a specific pattern to output a prompt; the vibrator module can generate vibration at a predetermined frequency; the text message generation module can generate text messages in a preset format, such as a text message containing the currently recognized keyword; the voice message generation module can generate voice messages in a preset format, such as a voice message containing the currently recognized keyword, or can select a segment from prestored voice data in a preset manner to serve as the voice message, such as tone sounds like "beep" or "ding-dong".
The loudspeaker is used for playing the audio stream, playing back the recorded audio stream, playing back the voice message or the sound message, and the like. It should be understood that in some embodiments, the speaker may cooperate with the control unit and the storage unit to replace the function of the reminding unit, for example, only by means of sound reminding.
The embedded audio playing device further comprises an input unit, wherein the input unit is used for receiving various control instructions input by a user, such as a playback instruction, a stop reminding instruction, a recording stop instruction and the like input by the user.
The input unit can be a touch panel, a key, a voice command input module and other mechanical or voice input modules.
The storage unit is used for storing the recorded audio stream;
in an alternative implementation mode, when the voice signal contains a keyword, the control unit starts to perform continuous compression coding on the received audio stream and stores the received audio stream locally;
the control unit plays the locally stored recorded audio stream when receiving a local audio playback instruction;
The control unit is further configured to send a recording start instruction to the remote end when the voice signal contains a keyword, so that the remote end starts continuously recording the transmitted audio stream, and to send a recording stop instruction to the remote end when a stop instruction is received before the continuous recording time exceeds the second predetermined time length;
and when receiving a remote audio playing instruction, the control unit sends a playback request to the remote end and receives and plays the recorded audio stream stored at the remote end.
In addition, the embedded audio playing device can further comprise a power supply unit, wherein the power supply unit is used for providing a power supply required by the device during operation, and the power supply unit can be a power supply circuit module powered by a button battery or a rechargeable battery, can also be a power supply management module powered by an external input power supply for the device, and can also be a circuit module based on self-power-taking of a wired communication interface.
Obviously, the embedded audio playing device of the present embodiment may be used to implement some or all of the methods, processes or steps of the key information reminding method described in embodiment 1. The same or similar parts as those of embodiment 1 are described, and the description of this embodiment is omitted.
The embedded audio playing device can be embodied as a head-worn audio playing device, such as various wired or wireless earphone devices; as various portable sound boxes; or as an accessory device of a mobile phone or computer, such as a phone watch, a portable game machine or a portable multimedia player. For example, in a typical application scenario, the embedded audio playback device is a sound box with a call function. An LED indicator is arranged on the shell of the sound box, a scene-based keyword recognition model is downloaded into the sound box in advance, and the voice information currently played by the sound box can be continuously detected in real time. When the current voice information contains a keyword, the LED indicator starts to flash to remind the user. The sound box has an intelligent voice control function, and the user can issue voice control instructions to make the sound box turn off the LED indicator, stop recording, start playback, and so on. The detailed process by which the sound box realizes key information reminding may refer to the description of embodiment 1 and parts of this embodiment, and will not be repeated here.
Example 3
According to the core idea of the invention, the embodiment provides a key information reminding system, which comprises an embedded audio playing device and a remote device,
The remote equipment receives a keyword vocabulary customized by a user and/or a voice sample of a specific person, provided by the user and at least containing the keywords, which is used for acquiring a keyword recognition model based on a scene;
The scene-based keyword recognition model is obtained by training a training sample library which is based on voice samples for the keywords and/or voice samples for specific people of the keywords in advance;
The embedded audio playback device communicates with the remote device, receives and plays audio streams from the remote device, and the communication may take any suitable form, such as wired (e.g., Ethernet, USB, Lightning, optical fiber) or wireless (e.g., WiFi, Bluetooth, IR) communication.
The embedded audio playing device also acquires a voice signal from the audio stream, carries out voice recognition on the voice signal by adopting a keyword recognition model based on scenes, and detects whether the voice signal contains keywords in real time;
When the voice signal contains keywords, the embedded audio playing device generates a keyword information prompt and starts recording the received audio stream;
The embedded audio playing device responds to the playback instruction and plays the recorded audio stream.
As an alternative, the keyword recognition model completes training on the remote device; the remote device uses the user-defined keyword vocabulary and/or user-provided voice samples of a specific person containing at least the keywords to expand its standard sample library into a training sample library, and trains on that library to obtain the scene-based keyword recognition model;
the remote device downloads the scene-based keyword recognition model to the embedded audio playing device.
As another optional implementation manner, the keyword recognition model completes training at the cloud, and the system further comprises a cloud server;
the remote equipment is communicated with the cloud server, and the keywords and/or voice samples of specific persons are sent to the cloud server;
the cloud server expands its standard sample library with the received keywords and the voice samples of the specific person to form a training sample library, and trains on that library to obtain the scene-based keyword recognition model;
The remote device receives the scene-based keyword recognition model from the cloud server and downloads the scene-based keyword recognition model to the embedded audio playing device.
Obviously, the key information reminding system provided in this embodiment may be used to implement some or all of the method, flow or step in the key information reminding method described in embodiment 1. The embedded audio playing device of embodiment 2 can also be used to implement the key information reminding system of the present embodiment. Similar technical details thereof may be referred to the description of the foregoing embodiments and are not repeated herein.
The following will take a typical application scenario as an example to describe the core idea of the embodiment of the present invention in more detail.
Referring to fig. 3, in this application scenario, the key information reminding system includes a video playing device (such as a tablet pc) 300, an earphone 310, and a cloud server 320.
The earphone 310 may be a headset, an in-ear earphone or an ear-hook earphone; it may be wired or wireless; it may have a single earpiece 311 or left and right earpieces 311, and the earpieces 311 may be one-piece or split.
The headphones 310 communicate with the video playback device 300 either by wire or wirelessly, thereby receiving an audio stream from the video playback device 300. The video playing device 300 may be a personal computer, a tablet computer, a smart television, a mobile phone, etc. of the user. The user views a video program through the video playback device 300. Fig. 3 shows a student watching a net lesson through a tablet computer.
The video playback device 300 may also access the cloud server 320 based on a network, which may be a local area network, a wide area network, a cellular network, or a combination thereof.
The earphone 310 is provided with an LED indicator 312 and keys 313-316. The LED indicator 312 may emit flashing red light; key 313 is a volume-up key, key 314 is a play/pause key, key 315 is a stop-reminder/stop-recording/playback key, and key 316 is a volume-down key. Key 315 may be set to perform the three functions of stopping the reminder, stopping recording and starting playback at the same time when pressed once, or may be set to stop the reminder and recording when pressed once and to start playback when pressed twice in succession. This may be set according to the actual implementation environment, and the present invention is not limited thereto.
The LED indicator 312 may also be disposed on an external microphone (not shown) of the earphone 310, so that the user can adjust the microphone to a position in front of his or her lips when wearing the earphone; a light reminder from the LED indicator 312 is then more visible.
In addition, a vibrator (not shown) is also provided in the earphone 310. The vibrator may be implemented using existing or future applicable technologies, and the present invention is not particularly limited. For example, an eccentric motor with a cam may be used.
The cloud server 320 may train to generate keyword recognition models based on the deep learning algorithm described above. In a specific implementation, the cloud server 320 may collect a wide range of voice samples in advance, and perform vocabulary labeling and other processing on the voice samples to form a standard sample library.
In the application scene, the key information reminding system realizes the key information reminding process as follows:
Step one, initializing.
Before the key information reminding process is started, an initialization step is carried out, and software and hardware environment configuration and various parameter settings required by the operation and communication of various devices and equipment in the system are checked and updated.
Setting keywords to obtain a new keyword recognition model. The method comprises the following steps:
the user sets keyword vocabulary through the video playing device 300; for example, before attending an online lesson, a student can input words such as "key", "examination" and "summary", together with his or her own name, as keywords. Through the user's autonomous settings, personalized keywords suited to the current application scenario can be formed.
To match the hardware power consumption and computing power of the headset 310, the upper limit on the number of keyword vocabulary entries is set to 20.
When a new vocabulary is input in the keywords of the video playback device 300, the video playback device 300 accesses the cloud server 320, sends a request to update the keyword recognition model to the cloud server 320, and sends the keywords to the cloud server 320.
After the cloud server 320 receives the keywords, it compares them with the keywords already existing on the cloud server 320. When all the words in the keywords sent by the video playing device 300 are included in the existing keywords, it directly uses the existing standard sample library as the training sample library and trains a new scene-based keyword recognition model with the deep learning algorithm. When some words in the keywords are not included in the existing keywords, it obtains voice samples containing those words from the Internet, expands the standard sample library to form the training sample library, and trains a new keyword recognition model.
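The cloud server's branching decision in this step can be sketched as follows; the function and return labels are illustrative names, not part of the patented system.

```python
def plan_training(submitted, existing_vocab):
    # Words not yet covered by the cloud's existing keyword vocabulary need
    # Internet-sourced samples before training; otherwise train directly on
    # the standard sample library.
    new_words = [w for w in submitted if w not in existing_vocab]
    if not new_words:
        return "train_on_standard_library", []
    return "extend_then_train", new_words

action, todo = plan_training(["key", "summary"], {"key", "exam"})
```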
The user may also upload, through the video playback device 300, a voice sample of a specific person containing one or more words of the keywords, such as a student uploading audio material of a teacher's voice to the video playback device 300. The video playing device 300 uploads the voice sample of the specific person to the cloud server 320 to extend the cloud server's standard sample library, so that the cloud server 320 can train a new keyword recognition model on a training sample library containing the specific person's voice samples with the keywords.
The cloud server 320 transmits the trained scene-based keyword recognition model to the video playback device 300 in response to the update request of the video playback device 300.
After receiving the keyword recognition model from the cloud server 320, the video playing device 300 downloads it to the earphone 310, which then updates its locally stored keyword recognition model.
It should be noted that setting the keywords and obtaining a new keyword recognition model may be completed during the initialization step or at any suitable time while the system is running, depending on the actual situation; the present invention places no limitation on this.
Step two, the earpiece 310 receives the audio stream.
After system initialization is complete, the user can press the key 314 on the earphone 310 to begin receiving and playing the audio stream from the video playing device 300. For example, a student watches an online course through the earphone 310 and the tablet computer 300.
Step three, the earphone 310 extracts a voice signal from the audio stream, performs voice recognition on it, and uses the scene-based keyword recognition model to detect in real time whether the signal contains a preset keyword.
The earphone 310 contains a built-in voice recognition unit, which may be an embedded neural network processor. It constructs a neural network from the keyword recognition model and applies a deep learning algorithm so that keywords can be recognized in real time from the continuously input voice signal.
The audio stream of an online lesson may contain various sounds, such as music and speech. The earphone 310 extracts the voice signals and, using the scene-based keyword recognition model and a deep learning algorithm, detects whether they contain a preset keyword. For example, if a student presets the keyword "summary", then when the teacher says "let us summarize the main content of this lesson", the earphone can detect that the current voice signal contains the keyword; and if the student uses his or her own name or student number as a keyword, the earphone 310 serves as an auxiliary reminder whenever the teacher calls on that student.
When no keyword is recognized, the earphone 310 simply continues to receive and play the audio stream without executing the following steps. It should be noted that receiving and playing the audio stream need not be interrupted while the system issues a key information alert.
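As an illustration of the real-time detection loop in step three, the sketch below smooths per-frame keyword scores over a short window before declaring a hit. The frame scorer stands in for the downloaded neural-network model, and every name, threshold and window size is a hypothetical choice, not taken from the patent:

```python
# Streaming keyword spotting sketch: a classifier scores fixed-length audio
# frames; a short moving average suppresses spurious single-frame triggers.
from collections import deque

def spot_keywords(frames, score_frame, threshold=0.5, window=3):
    """Yield frame indices where the smoothed keyword score crosses threshold."""
    recent = deque(maxlen=window)
    for i, frame in enumerate(frames):
        recent.append(score_frame(frame))
        # Only decide once a full window of scores is available.
        if len(recent) == window and sum(recent) / window >= threshold:
            yield i
            recent.clear()  # reset so one utterance fires only once
```

With scores standing in for frames, a run of high scores produces a single detection rather than one per frame.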
Step four, the earphone 310 generates a key information alert and records the audio stream.
When the earphone 310 detects that the current voice signal contains a preset keyword, its vibrator starts vibrating. The user can stop the vibration with the key 315. If the vibration lasts beyond a predetermined time, for example 10 seconds, without being stopped by the user, it stops automatically and the LED indicator 312 begins to flash red. The red light may keep flashing for a longer preset period, or until the user stops it with the key 315. If, before generating a new vibration, the earphone 310 finds the LED indicator 312 already in its working state (flashing red), no new vibration is generated and the indicator simply remains in that state. In this way, a student who is wearing the earphone is alerted to the key information by vibration, while a student who has taken the earphone off is still alerted by the light effect.
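The escalation just described (vibrate, time out into a red-flashing LED, suppress new vibrations while the LED is active) behaves like a small state machine. The class below is an illustrative sketch with assumed timings and state names, not the patented firmware:

```python
# Alert escalation sketch: vibrate first, fall back to a flashing LED if the
# vibration times out unacknowledged, and suppress new vibrations while the
# LED is already flashing.

class AlertUnit:
    IDLE, VIBRATING, LED_FLASHING = "idle", "vibrating", "led_flashing"

    def __init__(self, vibrate_timeout=10):
        self.state = self.IDLE
        self.vibrate_timeout = vibrate_timeout
        self._elapsed = 0

    def keyword_detected(self):
        # A new alert only starts vibration when the LED is not already flashing.
        if self.state != self.LED_FLASHING:
            self.state = self.VIBRATING
            self._elapsed = 0

    def tick(self, seconds=1):
        if self.state == self.VIBRATING:
            self._elapsed += seconds
            if self._elapsed >= self.vibrate_timeout:
                self.state = self.LED_FLASHING  # escalate to red flashing

    def acknowledge(self):  # user presses key 315
        self.state = self.IDLE
```

A second keyword arriving during the LED phase leaves the state unchanged, mirroring the "no new vibration is generated" rule.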
The earphone 310 begins recording the received audio stream at the same time as it generates the key information alert, specifically as follows:
The recorded audio stream is stored locally for a first predetermined duration, which must not exceed the maximum duration of audio the earphone 310 can store. The first predetermined duration may be a preset fixed value: for example, if the earphone 310 can store at most 2 minutes of audio, the first predetermined duration may be 2 minutes; alternatively it may be 30 seconds, in which case the earphone 310 can store up to four 30-second audio clips.
While it starts recording the received audio stream, the earphone 310 also sends a recording start instruction and the detected keyword to the video playing device 300.
After receiving the recording start instruction from the earphone 310, the video playing device 300 starts recording the audio stream it is sending.
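The local storage rule from the sub-steps above (clips no longer than a fixed length, total storage capped at the device limit) might look like the sketch below. The oldest-first eviction policy is an assumption for illustration; the text only fixes the capacity and clip length:

```python
# Local recording policy sketch: keep at most `capacity_seconds` of audio,
# split into clips of at most `clip_seconds` each.

class LocalRecorder:
    def __init__(self, capacity_seconds=120, clip_seconds=30):
        self.capacity = capacity_seconds
        self.clip_len = clip_seconds
        self.clips = []  # durations of stored clips, oldest first

    def record_clip(self, seconds):
        seconds = min(seconds, self.clip_len)  # cap at the clip length
        # Evict the oldest clips until the new one fits within capacity.
        while sum(self.clips) + seconds > self.capacity and self.clips:
            self.clips.pop(0)
        self.clips.append(seconds)
        return seconds
```

With a 120-second capacity and 30-second clips, the recorder holds at most four clips, matching the example in the text.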
Step five, the video playing device 300 converts the recorded voice signal into text and stores it.
The video playing device 300 can extract the voice signal from the recorded audio stream, convert the full text using any of various existing speech-to-text methods, and store the result. When storing, the keyword detected by the earphone 310, the converted text and the recording can be stored in association with one another, so that the user can select and review them later.
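Storing the keyword, transcript and recording in association, as step five describes, can be sketched with a simple record store. The `transcribe` callback stands in for whatever speech-to-text method is used; all names are illustrative:

```python
# Associated storage sketch: each note links the detected keyword, the
# transcript of the clip, and a reference to the recording itself.

def store_note(notes, keyword, audio_id, transcribe):
    """Append an associated (keyword, transcript, recording) entry."""
    notes.append({
        "keyword": keyword,
        "text": transcribe(audio_id),  # full-text conversion of the clip
        "recording": audio_id,
    })
    return notes

def notes_for_keyword(notes, keyword):
    """Later review: filter stored notes by the keyword that triggered them."""
    return [n for n in notes if n["keyword"] == keyword]
```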
Step six, recording is stopped.
The earphone 310 automatically stops recording the audio stream either when the user inputs a recording stop command through the key 315, or when the continuous recording time exceeds the first predetermined duration without a stop command from the user.
Likewise, the video playing device 300 automatically stops recording the audio stream either when the user inputs a recording stop command through the key 315, or when the continuous recording time exceeds the second predetermined duration without a stop command from the user.
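The two stop rules in step six differ only in their timeout: the first predetermined duration applies on the earphone and the second on the video playing device. A minimal sketch, with all names hypothetical:

```python
# Stop-condition sketch: a user stop always wins; otherwise each device
# applies its own predetermined duration limit.

def should_stop(elapsed, limit_seconds, user_stop_pressed):
    """True when recording must stop on this device."""
    return user_stop_pressed or elapsed >= limit_seconds

def stop_states(elapsed, first_limit, second_limit, user_stop=False):
    """Evaluate both devices at once: (earphone_stops, video_device_stops)."""
    return (should_stop(elapsed, first_limit, user_stop),
            should_stop(elapsed, second_limit, user_stop))
```

Because the second duration is typically longer, the earphone can hit its limit while the video playing device keeps recording.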
Step seven, recording playback.
In this embodiment, the user may play back the recording on the earphone 310 or on the video playing device 300.
For example, when a student starts the local playback function by pressing the key 315 several times in succession, the earphone 310 plays the locally stored recorded audio stream while continuing to play the audio stream from the video playing device 300. During playback the two audio streams may be mixed and played together, or one of the two earpieces 311 may play one stream while the other earpiece plays the other.
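The two local playback modes above (mixing the streams, or routing one stream per earpiece) can be sketched on lists of audio samples. A real device would operate on PCM buffers, and the 0.5 mixing gain is an assumption:

```python
# Playback-mode sketch: streams are modelled as equal-length lists of mono
# samples in [-1.0, 1.0].

def mix_streams(live, recorded, gain=0.5):
    """Blend two equal-length mono streams into one."""
    return [gain * a + gain * b for a, b in zip(live, recorded)]

def split_streams(live, recorded):
    """Route the live stream to one earpiece and the recording to the other."""
    return {"left": live, "right": recorded}
```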
Alternatively, when the student starts the remote playback function by pressing the key 315 three times in succession, the earphone 310 sends a playback request instruction to the video playing device 300, which, upon receiving it, transmits its recorded audio stream to the earphone 310.
In addition, the student can directly input a playback instruction on the video playing device 300 to play the recorded audio stream stored there.
The student may also specify which of the recorded audio streams stored on the video playing device 300 is to be played.
Step eight, reviewing the text information.
In this step, the student can review, on the video playing device 300, the text corresponding to the recorded audio stream, making it convenient to go over the material and take notes.
As the above embodiment and typical application scene show, the key information reminder method, system and embedded audio playing device provided by the embodiments of the present invention achieve real-time detection, reminding and playback of key information in continuous speech on a small, low-power embedded device. They are convenient to use, simple to operate and widely applicable; they can effectively remind the user of key information, save it and make it available for review, reducing the loss caused by missed key information and increasing user satisfaction with remote audio and video applications.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, in computer software, or in a combination of the two; to illustrate this interchangeability clearly, the foregoing description has described the composition and steps of each example generally in terms of function. Whether these functions are implemented in hardware or software depends on the particular application and the design constraints of the technical solution. A person of ordinary skill may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in Random Access Memory (RAM), internal memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments illustrates the general principles of the invention and is not intended to limit the invention to the particular embodiments described; any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (15)

1. An embedded audio playback device, comprising a speaker and a communication unit, characterized by further comprising a control unit, a storage unit, a voice recognition unit and a reminder unit; the communication unit receives an audio stream from a remote end; the control unit is used to download a scene-based keyword recognition model from the remote end through the communication unit, the scene-based keyword recognition model being obtained by training in advance on a training sample library containing voice samples for the keywords and/or voice samples of a specific person for the keywords; the keywords are associated with an application scene and comprise a group of words that require particular attention in that application scene, one or more of which are pre-specified by a user; the voice recognition unit comprises a keyword recognition model unit for storing the scene-based keyword recognition model; the voice recognition unit extracts a voice signal from the audio stream and uses the scene-based keyword recognition model to perform artificial-intelligence voice recognition on the voice signal, detecting in real time whether the voice signal contains a keyword; the control unit is used to start recording the received audio stream when the voice signal contains a keyword and to control the reminder unit to output a key information reminder; the storage unit is used to store the recorded audio stream; the speaker is used to play the audio stream, or to play back the recorded audio stream in response to a playback instruction.

2. The embedded audio playback device according to claim 1, characterized in that the scene-based keyword recognition model is obtained by training in advance with a deep learning algorithm on a training sample library containing voice samples for the keywords and/or voice samples of a specific person for the keywords.

3. The embedded audio playback device according to claim 2, characterized in that the voice recognition unit further comprises a voice preprocessing unit for preprocessing the input audio stream to eliminate noise, background voices and music and to extract the voice signal; the voice recognition unit further comprises a neural network processing unit for processing, based on the keyword recognition model and using a deep learning algorithm, the voice signal or the voice signal output by the voice preprocessing unit, so as to infer and judge the words appearing in the voice signal and determine whether they contain keyword vocabulary.

4. The embedded audio playback device according to claim 1, characterized in that the reminder unit is one or more of an indicator light module, a vibrator module, a text message generation module, a voice message generation module and a music message generation module.

5. The embedded audio playback device according to claim 1, characterized by further comprising an input unit for receiving a recording stop instruction and a playback instruction input by a user; when the voice signal contains a keyword, the control unit starts continuously compressing, encoding and locally storing the received audio stream; the control unit stops recording when it receives a recording stop instruction or when the continuous recording time exceeds a first predetermined duration; when it receives a local playback instruction, the control unit plays the locally stored recorded audio stream; the control unit is further used to send a recording start instruction to the remote end when the voice signal contains a keyword, causing the remote end to start continuously recording the transmitted audio stream, and to send a recording stop instruction to the remote end when the continuous recording time has not exceeded a second predetermined duration and a stop instruction is received; when it receives a remote playback instruction, the control unit sends a playback request to the remote end and receives and plays the recorded audio stream stored at the remote end.

6. The embedded audio playback device according to any one of claims 1 to 5, characterized in that the embedded audio playback device is an earphone or a speaker with a call function.

7. A key information reminder system, characterized by comprising an embedded audio playback device and a remote device; the remote device receives a user-defined keyword vocabulary and/or a user-provided voice sample of a specific person containing at least the keywords, for use in obtaining a scene-based keyword recognition model; the keywords are associated with an application scene and comprise a group of words that require particular attention in that application scene; the scene-based keyword recognition model is obtained by training in advance on a training sample library containing voice samples for the keywords and/or voice samples of a specific person for the keywords; the embedded audio playback device communicates with the remote device, downloads the scene-based keyword recognition model from the remote device, and receives and plays the audio stream from the remote device; the embedded audio playback device also obtains a voice signal from the audio stream and uses the scene-based keyword recognition model to perform artificial-intelligence voice recognition on the voice signal, detecting in real time whether the voice signal contains a keyword; when the voice signal contains a keyword, the embedded audio playback device generates a key information reminder and starts recording the received audio stream; the embedded audio playback device plays the recorded audio stream in response to a playback instruction.

8. The key information reminder system according to claim 7, characterized by further comprising a cloud server; the remote device communicates with the cloud server and sends the keywords and/or the voice samples of the specific person to the cloud server; the cloud server uses the received keywords and/or voice samples of the specific person to expand its standard sample library into a training sample library and, based on the training sample library, trains with a deep learning algorithm to obtain the scene-based keyword recognition model; the remote device receives the scene-based keyword recognition model from the cloud server and downloads it to the embedded audio playback device.

9. The key information reminder system according to claim 7, characterized in that the remote device uses the keyword vocabulary input by the user and/or the user-provided voice samples of a specific person containing at least the keywords to expand a standard sample library into a training sample library and, based on the training sample library, trains with a deep learning algorithm to obtain the scene-based keyword recognition model; the remote device downloads the scene-based keyword recognition model to the embedded audio playback device.

10. A key information reminder method, characterized by: receiving a user-defined keyword vocabulary and/or a user-provided voice sample of a specific person containing at least the keywords, the keywords being associated with an application scene and comprising a group of words that require particular attention in that application scene; training a scene-based keyword recognition model on a training sample library containing voice samples for the keywords and/or voice samples of a specific person for the keywords; while an embedded audio playback device receives and plays an audio stream, obtaining a voice signal from the audio stream; the embedded audio playback device using the scene-based keyword recognition model downloaded from a remote end to perform artificial-intelligence voice recognition on the voice signal and detect in real time whether the voice signal contains a keyword; when the voice signal contains a keyword, the embedded audio playback device generating a key information reminder and starting to record the received audio stream; and the embedded audio playback device playing the recorded audio stream in response to a playback instruction.

11. The key information reminder method according to claim 10, characterized by: collecting a wide range of voice samples in advance to form a standard sample library; acquiring, according to the keywords, voice samples containing at least the keywords; and adding the voice samples containing the keywords and/or the voice samples of the specific person to the standard sample library to form a training sample library, and training with a deep learning algorithm on the training sample library to obtain the scene-based keyword recognition model.

12. The key information reminder method according to claim 10, characterized in that the step of obtaining a voice signal from the audio stream further includes a preprocessing step of eliminating noise, music and background voices; and using the scene-based keyword recognition model to perform voice recognition on the voice signal or the preprocessed voice signal and detect in real time whether it contains a keyword specifically comprises: constructing a deep learning neural network based on the keyword recognition model, and continuously feeding the voice signal into the deep learning neural network for data processing, so as to infer and judge the words appearing in the voice signal and determine whether they contain keyword vocabulary.

13. The key information reminder method according to claim 10, characterized in that recording the received audio stream specifically comprises: when the voice signal contains a keyword, starting to continuously compress, encode and locally store the received audio stream; and stopping local recording when a recording stop instruction is received or the continuous recording time exceeds a first predetermined duration; and playing the recorded audio stream in response to a playback instruction specifically comprises: playing the locally stored recorded audio stream in response to a local playback instruction.

14. The key information reminder method according to claim 10, characterized in that recording the received audio stream specifically comprises: when the voice signal contains a keyword, sending a recording start instruction to the remote end, whereupon the remote end starts continuously recording the transmitted audio stream and stores it remotely; and sending a recording stop instruction to the remote end when the continuous recording time has not exceeded a second predetermined duration and a stop instruction is received, whereupon the remote end stops recording; and playing the recorded audio stream in response to a playback instruction specifically comprises: in response to a remote playback instruction, sending a playback request to the remote end and receiving and playing the recorded audio stream stored at the remote end.

15. The key information reminder method according to claim 10, characterized in that the key information reminder is one or a combination of a visual reminder, a tactile reminder and an auditory reminder; the visual reminder includes a light effect reminder and a remote text message reminder; the tactile reminder includes a vibration reminder; and the auditory reminder includes a voice reminder and a music reminder.
CN202010313790.4A | Priority date 2020-04-20 | Filing date 2020-04-20 | A scenario-based key information reminder method, system and device | Active | Granted as CN111601215B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010313790.4A | 2020-04-20 | 2020-04-20 | A scenario-based key information reminder method, system and device


Publications (2)

Publication Number | Publication Date
CN111601215A (en) | 2020-08-28
CN111601215B (en) | 2025-03-25

Family

ID=72183273

Family Applications (1)

Application Number | Priority Date | Filing Date | Title | Status
CN202010313790.4A | 2020-04-20 | 2020-04-20 | A scenario-based key information reminder method, system and device | Active (granted as CN111601215B)

Country Status (1)

Country | Link
CN (1) | CN111601215B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
TWI709052B (en)* | 2018-10-31 | 2020-11-01 | 仁寶電腦工業股份有限公司 | Smart liquor cabinet and searching method for liquor
CN113468317B (en)* | 2021-06-26 | 2024-03-08 | 北京网聘信息技术有限公司 | Resume screening method, system, equipment and storage medium
CN117178320A (en)* | 2021-07-16 | 2023-12-05 | 华为技术有限公司 | Method, apparatus, electronic device and medium for speech listening and speech recognition model generation
CN115188397A (en)* | 2022-09-07 | 2022-10-14 | 云丁网络技术(北京)有限公司 | Media output control method, device, equipment and readable medium
CN115862373B (en)* | 2022-12-02 | 2025-08-08 | 深圳信路通智能技术有限公司 | Parking lot call service method, device and system
CN116437253A (en)* | 2023-03-31 | 2023-07-14 | 武汉星纪魅族科技有限公司 | Earphone control method, device, equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109451158A (en)* | 2018-11-09 | 2019-03-08 | 维沃移动通信有限公司 | A reminding method and device
CN212588503U (en)* | 2020-04-20 | 2021-02-23 | 南京西觉硕信息科技有限公司 | Embedded audio playing device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US7689417B2 (en)* | 2006-09-04 | 2010-03-30 | Fortemedia, Inc. | Method, system and apparatus for improved voice recognition
US9471212B2 (en)* | 2014-03-10 | 2016-10-18 | Htc Corporation | Reminder generating method and a mobile electronic device using the same
CN107464557B (en)* | 2017-09-11 | 2021-05-07 | Oppo广东移动通信有限公司 | Call recording method and device, mobile terminal and storage medium
CN109979440B (en)* | 2019-03-13 | 2021-05-11 | 广州市网星信息技术有限公司 | Keyword sample determination method, voice recognition method, device, equipment and medium
CN110556110A (en)* | 2019-10-24 | 2019-12-10 | 北京九狐时代智能科技有限公司 | Voice processing method and device, intelligent terminal and storage medium




Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
