Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the continuous development of computer and internet technologies, networks provide more convenient modes of communication, and application scenarios such as instant messaging, online office work, and online learning are increasingly common in people's lives. Web conferencing is a common way of implementing online office work and online learning.
During a network conference (which may be an online meeting or online teaching), a plurality of users participate, the users are in different environments, the background sound is unpredictable, and the background sound of the whole conference or classroom may be loud. Solutions that provide a mute function for online conferences have been proposed in the related art. Specifically, a control for turning the mute function on and off is arranged on the running interface of the web conference, and the user can turn the mute function on or off by clicking the control. When the mute function is on, the sound of the environment where the local user is located is not transmitted to the remote users, so the remote users cannot hear the local user. When the mute function is off, the sound of the environment where the local user is located is transmitted to the remote users, so the remote users can hear the local user. The mute function for a certain user or group of users can also be turned on or off by an administrator; for example, the administrator may mute a subset of users during the network conference.
In a network conference (which may be an online meeting or online teaching), the user may choose to mute when not speaking and must unmute before speaking. However, the user may forget to unmute before speaking, and after discovering that the mute function is still on, must first turn it off and then repeat what was said, which is cumbersome.
During the network conference (which may be an online meeting or online teaching), the network conference program serves as an instant messaging tool, and the user can communicate with a plurality of groups, for example, carrying out voice communication with group A and exchanging files with group B.
During the network conference (which may be an online meeting or online teaching), the user can also open a plurality of application programs at the same time; the network conference program can run in the background without affecting the user's use of other application programs in the foreground, such as sending and receiving messages or transferring text.
In the above application scenarios, a user has multiple concurrent tasks and may need to toggle the mute function at any time according to those needs, which can cause the following problems: (1) the user speaks while the mute function is on, so the speech is not heard by the remote users and must be repeated; (2) if the user is communicating with a plurality of groups, the user must first click to switch to the conversation stream used for voice communication before performing the mute operation, which is costly; (3) when the user is using another application program, the user must first switch to the network conference program before performing the mute operation; (4) when the device (mobile phone or computer) is screen-locked, the user must first unlock the device, then switch to the network conference program, and only then perform the mute operation, which is also costly.
Fig. 1A is a schematic diagram of an exemplary system architecture to which a control method and apparatus for a web conference may be applied, according to one embodiment of the present disclosure. It should be noted that fig. 1A is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1A, the system architecture 100 according to this embodiment may include a plurality of terminal devices 101, a network 102, and a server 103. Network 102 is the medium used to provide a communication link between terminal device 101 and server 103. Network 102 may include various connection types, such as wired and/or wireless communication links, and so forth.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various messaging client applications, such as a web browser application, an instant messaging tool, a mailbox client, and/or social platform software, may be installed on terminal device 101.
The terminal device 101 may be any of various electronic devices having a display screen, including, but not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, and the like.
The server 103 may be a server providing various services, such as a background management server (for example only) providing support for web conference requests initiated by users using the terminal devices 101. The background management server may analyze and otherwise process received data, such as the user request, and feed back a processing result (for example, information or data obtained or generated according to the user request) to the terminal device.
For example, any one of the plurality of terminal devices 101 initiates a web conference by transmitting a web conference request to the server 103, together with a request to invite the remaining terminal devices 101 to join. The server 103 creates the conference and forwards the invitation to the remaining terminal devices 101. After the remaining terminal devices 101 join the network conference, each terminal device 101 may transmit a local text, voice, or video message to the remaining terminal devices 101 (remote terminals) through the server 103 and receive text, voice, or video messages from the remote terminals through the server 103.
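The message flow described above can be sketched with a minimal in-memory relay; the class and method names here are hypothetical illustrations, not part of the disclosure.

```python
class ConferenceServer:
    """Toy stand-in for server 103: creates a conference and relays messages."""

    def __init__(self):
        self.participants = {}  # terminal id -> inbox of received messages

    def create_conference(self, initiator_id, invitee_ids):
        """Create the conference and forward join invitations to the invitees."""
        self.participants[initiator_id] = []
        for invitee in invitee_ids:
            self.join(invitee)

    def join(self, terminal_id):
        """Register a terminal that accepted the invitation."""
        self.participants[terminal_id] = []

    def send(self, sender_id, message):
        """Forward a local message to every remote terminal through the server."""
        for terminal_id, inbox in self.participants.items():
            if terminal_id != sender_id:
                inbox.append((sender_id, message))
```

A real implementation would route media streams rather than Python objects, but the forwarding topology is the same.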
It should be noted that the method for controlling the web conference provided by the embodiment of the present disclosure may generally be executed by the terminal device 101. Accordingly, the control device for the network conference provided by the embodiment of the present disclosure may generally be disposed in the terminal device 101.
It should be understood that the number of terminal devices, networks, and servers in FIG. 1A are merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 1B is an exemplary scene diagram to which a control method and apparatus for a web conference may be applied according to one embodiment of the present disclosure.
As shown in fig. 1B, an exemplary scenario according to this embodiment may include terminal device 110. Terminal device 110 may run a web conference program, and the web conference program may display web conference interface 111 when running. In the left half of the web conference interface 111, the user who is speaking or a document being explained, etc., may be displayed. In the right half of the web conference interface 111, the group currently in the web conference may be displayed; the group may include, for example, user A, user B, and user C. The right half of the web conference interface 111 also displays controls for the web conference, such as a control for turning the video function on and off, a control for turning the mute function on and off, a control for uploading a file, a control for sending a message, and the like.
Illustratively, terminal device 110 is held by user A, and user A can decide whether to let the remote users (user B and user C) hear his or her voice by clicking the control for the mute function during the network conference. For example, when user A clicks the mute control while not speaking so as to turn the mute function on, users B and C cannot hear user A. When user A clicks the mute control again before speaking so as to turn the mute function off, users B and C can hear user A.
Fig. 2 is a flowchart of a control method of a web conference according to one embodiment of the present disclosure.
As shown in fig. 2, the method 200 for controlling a web conference may include operations S210 to S230.
According to an embodiment of the present disclosure, operations S210 to S230 may be performed by the local electronic device during the network conference, and the users participating in the network conference may include a user of the local electronic device (simply referred to as a local user) and a user of the remote electronic device (simply referred to as a remote user).
In operation S210, audio data is acquired during the operation of the network conference program.
According to embodiments of the present disclosure, the audio data may be sound data from the environment in which the local electronic device is located. When the user of the local electronic device is not speaking, this sound data may include background sounds such as system noise and natural sounds; when the user is speaking, it may include both background sounds and the user's voice. The audio data may be acquired using an audio sensor, such as a microphone of the local electronic device.
In operation S220, a voice instruction is recognized from audio data.
According to the embodiment of the disclosure, a voice instruction in the audio data can be recognized using speech recognition technology and matched against a preset voice instruction; if the recognized voice instruction is consistent with the preset voice instruction, the corresponding control function is executed.
According to an embodiment of the present disclosure, the preset voice instruction may be a pre-configured voice instruction for turning the mute function on or off. For example, the voice instruction for turning off the mute function may be "I want to speak, turn off mute", and the voice instruction for turning on the mute function may be "I will mute first, you continue". Besides full sentences spoken by the user, the preset voice instruction may be configured in other forms, such as a keyword, a number, or a letter; for example, the voice instruction for turning off the mute function may be configured as "off", "1", or "a", and the voice instruction for turning on the mute function may be configured as "on", "2", or "b". The preset voice instruction is also not limited to sounds produced by the user's own voice; it may be a sound the user produces with another sound-generating tool. The above are merely examples and may be configured according to actual needs.
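The matching step described above can be sketched as a lookup from the normalized recognized text to a mute action; the command phrases and the `match_command` helper are illustrative assumptions, not fixed by the disclosure.

```python
# Illustrative preset voice commands mapped to mute actions. The phrases are
# examples only; a deployment would let the user configure them.
PRESET_COMMANDS = {
    "i want to speak, turn off mute": "unmute",
    "i will mute first, you continue": "mute",
    "off": "unmute",
    "on": "mute",
    "1": "unmute",
    "2": "mute",
}

def match_command(recognized_text):
    """Return the mute action for a recognized phrase, or None if no preset matches."""
    return PRESET_COMMANDS.get(recognized_text.strip().lower())
```

Normalizing case and whitespace before the lookup makes the match tolerant of minor recognition variations.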
In operation S230, the mute function of the network conference program is controlled according to the recognized voice instruction.
According to the embodiment of the disclosure, if the voice instruction in the audio data is consistent with the preset voice instruction for turning off the mute function, and the mute function of the current network conference is on, the mute function is turned off. If the voice instruction in the audio data is consistent with the preset voice instruction for turning on the mute function, and the mute function of the current network conference is off, the mute function is turned on.
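The state-dependent control just described can be sketched as follows; `control_mute` is a hypothetical helper name, and the action strings are illustrative.

```python
def control_mute(muted, action):
    """Apply a recognized action to the current mute state (operation S230).

    The state changes only when the action is consistent with the current
    state, mirroring the two checks described above; otherwise it is left
    untouched."""
    if action == "unmute" and muted:
        return False  # mute was on and the user asked to speak: turn it off
    if action == "mute" and not muted:
        return True   # mute was off and the user asked to mute: turn it on
    return muted      # no state change
```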
Illustratively, if the voice instruction recognized in the audio data is "I want to speak, turn off mute", which matches the preset voice instruction for turning off the mute function, and the mute function of the current network conference is on, the mute function is turned off. After the mute function is turned off, a prompt message may be generated to indicate that the mute function is now off. The prompt may take the form of a sound, such as a beep, or a message, such as a notification reading "mute is off". The prompt lets the user quickly know that the mute function is off, so the local user can start speaking and be heard by the remote users.
If the voice instruction recognized in the audio data is "I will mute first, you continue", which matches the preset voice instruction for turning on the mute function, and the mute function of the current network conference is off, the mute function is turned on. After the mute function is turned on, a prompt message may be generated to indicate that the mute function is now on. The prompt may take the form of a sound, such as a click, or a message, such as a notification reading "mute is on". The prompt lets the user quickly know that the mute function is on and that what the local user says will not be heard by the remote users.
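The prompt generation described in the two examples above might be sketched like this; the sound cues follow the examples in the description, while the function name and message strings are hypothetical.

```python
def make_prompt(muted_now):
    """Build the prompt emitted after the mute state changes.

    A sound cue plus a notification message, as described above; the exact
    wording is illustrative."""
    if muted_now:
        return {"sound": "click", "message": "mute is on"}
    return {"sound": "beep", "message": "mute is off"}
```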
According to an embodiment of the present disclosure, audio data is acquired during operation of a web conference program; a voice instruction is recognized from the audio data; and the mute function of the network conference program is controlled according to the recognized voice instruction. Because a voice instruction can turn the mute function on and off, switching the mute state becomes more convenient and the communication efficiency of the network conference is improved.
Fig. 3 is a flowchart of a control method of a web conference according to another embodiment of the present disclosure.
As shown in fig. 3, the method for controlling the web conference may include operations S310 to S360.
In operation S310, with the mute function of the network conference program turned on, audio data is acquired and a first voice instruction is recognized.
According to the embodiment of the disclosure, when the mute function of the network conference program is on, a voice sensor such as the microphone of the local electronic device continues to work and acquires audio data in real time. The acquired audio data is recognized in real time using speech recognition technology, and the recognized voice instruction is taken as the first voice instruction.
In operation S320, it is determined whether the recognized first voice instruction is a first preset instruction; if so, operation S330 is performed, otherwise the method returns to operation S310.
According to an embodiment of the present disclosure, the first preset instruction may be an instruction for turning off the mute function, for example, "I want to speak, turn off mute". If the first voice instruction is the first preset instruction, operation S330 is performed; otherwise the method returns to operation S310, continuing to acquire audio data in real time with the microphone and to perform speech recognition while the mute function remains on.
In operation S330, the mute function is turned off, and a first prompt message is generated to indicate that the mute function has been turned off.
According to the embodiment of the disclosure, if the first voice instruction is the first preset instruction, the mute function of the network conference is automatically turned off, and a first prompt message is generated to indicate that the mute function is now off, so the user quickly knows that the mute function is off, the local user can start speaking, and the speech will be heard by the remote users.
In operation S340, with the mute function of the network conference program turned off, new audio data is acquired and a second voice instruction is recognized.
According to the embodiment of the disclosure, when the mute function of the network conference program is off, a voice sensor such as the microphone of the local electronic device acquires new audio data in real time; the new audio data is recognized in real time using speech recognition technology, and a second voice instruction is recognized from it.
In operation S350, it is determined whether the second voice instruction is a second preset instruction; if so, operation S360 is performed, otherwise the method returns to operation S340.
According to an embodiment of the present disclosure, the second preset instruction may be an instruction for turning on the mute function, for example, "I will mute first, you continue". If the second voice instruction is the second preset instruction, operation S360 is performed; otherwise the method returns to operation S340, continuing to acquire new audio data in real time with the microphone and to perform speech recognition while the mute function remains off.
In operation S360, the mute function is turned on, and a second prompt message is generated to indicate that the mute function has been turned on.
According to the embodiment of the disclosure, if the second voice instruction is the second preset instruction, the mute function of the network conference is automatically turned on, and a second prompt message is generated to indicate that the mute function is now on, so the user quickly knows that the mute function is on and that what the local user says will not be heard by the remote users. The method then returns to operation S310, continuing to acquire audio data in real time with the microphone and to perform speech recognition while the mute function is on.
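Operations S310 to S360 together form a loop that can be sketched as a small state machine; the command phrases below are the illustrative examples used above, and the function name is hypothetical.

```python
def run_mute_controller(phrases, initially_muted=True):
    """Sketch of the S310-S360 loop: consume recognized phrases in order,
    toggling the mute state when a phrase matches the preset command that is
    valid for the current state. Returns the state after each phrase."""
    unmute_phrase = "i want to speak, turn off mute"   # first preset instruction
    mute_phrase = "i will mute first, you continue"    # second preset instruction
    muted = initially_muted
    history = []
    for phrase in phrases:
        phrase = phrase.strip().lower()
        if muted and phrase == unmute_phrase:
            muted = False   # S330: turn mute off (and prompt the user)
        elif not muted and phrase == mute_phrase:
            muted = True    # S360: turn mute on (and prompt the user)
        history.append(muted)
    return history
```

Note that each preset instruction is only acted on in the state where it is meaningful, so repeating a command is harmless.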
Fig. 4 is a flowchart of a control method of a web conference according to another embodiment of the present disclosure.
As shown in fig. 4, the method for controlling the web conference may include operations S410 to S440.
In operation S410, audio data is acquired during the operation of the network conference program.
According to the embodiment of the disclosure, the audio data is audio data in the environment where the local electronic device is located, and the audio data can be acquired by using an audio sensor such as a microphone of the local electronic device.
In operation S420, it is determined whether the source of the audio data is human voice; if so, operation S430 is performed, otherwise the method returns to operation S410.
According to the embodiment of the disclosure, the acquired audio data is analyzed in real time to determine whether its source is human voice, so that whether the user is speaking can be quickly judged. Specifically, if the source is human voice, the user is speaking, and operation S430 is performed. If the source is not human voice, the user is not speaking, and the method returns to operation S410 to continue acquiring audio data in real time with the microphone.
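As a crude first gate before the spectral-feature check of fig. 5, operation S420 could reject silent frames by energy; the threshold value here is an illustrative assumption, not from the disclosure.

```python
import numpy as np

def is_probable_speech(samples, energy_threshold=0.01):
    """Energy-based pre-filter: only frames whose RMS energy exceeds a
    threshold are passed on for source identification. `samples` is a 1-D
    array of audio samples normalized to [-1, 1]."""
    rms = float(np.sqrt(np.mean(np.square(samples))))
    return rms > energy_threshold
```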
In operation S430, a voice instruction is recognized from audio data.
In operation S440, the mute function of the network conference program is controlled according to the recognized voice instruction.
According to the embodiment of the disclosure, if the voice instruction in the audio data is consistent with the preset voice instruction for turning off the mute function, and the mute function of the current network conference is on, the mute function is turned off. If the voice instruction is consistent with the preset voice instruction for turning on the mute function, and the mute function is off, the mute function is turned on.
Fig. 5 is a flowchart of a method of identifying a source of audio data according to one embodiment of the present disclosure.
As shown in fig. 5, the method includes operations S5421 to S5422.
In operation S5421, spectral features of audio data are extracted from at least a portion of the audio data.
According to the embodiment of the disclosure, if a user is speaking, continuous audio data follows once the voice intensity is detected. Spectral features can be extracted from this continuous audio data, and whether its source is human voice can be identified from those spectral features.
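Operation S5421 could, for example, pool a log-magnitude spectrum into fixed-size bands; the band count and pooling scheme are assumptions for illustration.

```python
import numpy as np

def spectral_features(samples, n_bands=16):
    """One possible realization of operation S5421: a coarse log-magnitude
    spectrum of an audio frame, pooled into a fixed number of bands so the
    feature size is independent of frame length."""
    spectrum = np.abs(np.fft.rfft(samples))
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([band.mean() for band in bands]))
```

Production systems more commonly use mel-frequency cepstral coefficients, but any fixed-size spectral summary fits the role described here.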
In operation S5422, a source of audio data is identified based on spectral features using a speech recognition model.
According to the embodiment of the disclosure, a speech recognition model can identify whether the source of the audio data is human voice based on the spectral features of the audio data. The speech recognition model may be obtained by training a neural network model. The training data may include spectral features of human voices, of animal sounds, of natural sounds, and the like; the spectral features of human voices are used as positive samples, and the spectral features of animal sounds, natural sounds, and the like are used as negative samples. The neural network model is trained with the positive and negative samples, and the trained model serves as the speech recognition model: given the spectral features of input audio data, it identifies whether the source is human voice.
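As a toy stand-in for the neural-network model described above, a logistic-regression classifier over spectral-feature vectors shows the positive/negative training scheme; all names and hyperparameters are illustrative, and a real system would use a neural network and far more data.

```python
import numpy as np

def train_voice_classifier(pos_features, neg_features, lr=0.5, epochs=500):
    """Train a binary classifier on spectral-feature vectors, with
    human-voice features as positive samples and other sounds as negative
    samples. Returns a predicate mapping a feature vector to True/False."""
    X = np.vstack([pos_features, neg_features]).astype(float)
    y = np.array([1.0] * len(pos_features) + [0.0] * len(neg_features))
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        grad = p - y                              # gradient of the log loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()

    def is_human_voice(features):
        return 1.0 / (1.0 + np.exp(-(features @ w + b))) > 0.5

    return is_human_voice
```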
Fig. 6 is a block diagram of a control device of a web conference according to one embodiment of the present disclosure.
As shown in fig. 6, the control device 600 of the web conference may include an acquisition module 601, a first recognition module 602, and a control module 603.
The acquisition module 601 is used for acquiring audio data during the operation of the network conference program.
The first recognition module 602 is configured to recognize a voice instruction from the audio data.
The control module 603 is configured to control the mute function of the network conference program according to the recognized voice instruction.
According to an embodiment of the present disclosure, the control module 603 comprises a first control unit.
The first control unit is used for turning off the mute function of the network conference when the mute function is on and the recognized voice instruction comprises a first instruction.
According to an embodiment of the present disclosure, the control device 600 of the web conference further includes a first generation module.
The first generation module is used for generating a prompt message indicating that the mute function has been turned off after the first control unit turns off the mute function.
According to an embodiment of the present disclosure, the control module 603 comprises a second control unit.
The second control unit is used for turning on the mute function when the mute function of the network conference is off and the recognized voice instruction comprises a second instruction.
According to an embodiment of the present disclosure, the control device 600 of the web conference further includes a second generation module.
The second generation module is used for generating a prompt message indicating that the mute function has been turned on after the second control unit turns on the mute function.
According to an embodiment of the present disclosure, the control device 600 of the web conference further comprises a second recognition module.
The second recognition module is used for identifying the source of the audio data before the first recognition module recognizes a voice instruction from the audio data, wherein the first recognition module operates only when the source of the audio data is human voice.
According to an embodiment of the present disclosure, the second recognition module includes an extraction unit and a recognition unit.
The extraction unit is used for extracting the spectral characteristics of the audio data from at least one part of the audio data;
the recognition unit is configured to recognize a source of the audio data based on the spectral feature using a speech recognition model.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 executes the respective methods and processes described above, such as the control method of the network conference. For example, in some embodiments, the control method of the web conference may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the control method of the network conference described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the control method of the web conference.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.