Disclosure of Invention
In view of this, embodiments of the present application provide an intelligent outbound method, a terminal device, and a readable storage medium, so as to address the problem of high labor costs in the current service industry.
According to a first aspect, an embodiment of the present application provides an intelligent outbound method, including: acquiring an outbound voice; converting the outbound voice into a corresponding text; selecting corresponding reply content according to the text; and generating corresponding reply voice according to the reply content.
With reference to the first aspect, in some embodiments of the present application, the step of converting the outbound voice into a corresponding text includes: determining a start position and an end position of speech according to the audio stream of the outbound voice; extracting target audio data according to the start position and the end position; and decoding the target audio data to generate a corresponding text.
With reference to the first aspect, in some embodiments of the present application, after the step of decoding the target audio data to generate a corresponding text, the step of converting the outbound voice into the corresponding text further includes: performing text smoothing, punctuation prediction, and text segmentation on the text.
With reference to the first aspect, in some embodiments of the present application, before the step of determining a start position and an end position of speech according to the audio stream of the outbound voice, the step of converting the outbound voice into a corresponding text further includes: performing noise reduction and reverberation elimination on the outbound voice.
With reference to the first aspect, in some embodiments of the present application, the step of selecting corresponding reply content according to the text includes: matching a corresponding intent scene for the text according to preset grammar rules; and converting the text into corresponding structured data according to the intent scene, and determining the reply content according to the structured data.
With reference to the first aspect, in some embodiments of the present application, the step of selecting corresponding reply content according to the text includes: extracting text content according to preset fixed slots; and converting the text into corresponding structured data according to the text content of the fixed slots, and determining the reply content according to the structured data.
With reference to the first aspect, in some embodiments of the present application, the step of generating a corresponding reply voice according to the reply content includes: converting the reply content into a corresponding prosody sequence; and generating a corresponding reply voice according to the prosody sequence.
According to a second aspect, an embodiment of the present application provides a terminal device, including: the input unit is used for acquiring the outbound voice; the text conversion unit is used for converting the outbound voice into a corresponding text; the text understanding unit is used for selecting corresponding reply content according to the text; and the voice playing unit is used for generating corresponding reply voice according to the reply content.
According to a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect or any embodiment of the first aspect when executing the computer program.
According to a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and the computer program, when executed by a processor, implements the steps of the method according to the first aspect or any embodiment of the first aspect.
The intelligent outbound method, terminal device, and readable storage medium provided by the embodiments of this application perform text conversion and semantic understanding on the outbound voice uttered by the customer, enabling the computer to intelligently select reply content. To better suit telephone communication and provide telephone service to customers, the selected reply content is further converted into voice, so that the telephone service can be completed in place of a human customer-service agent, which helps reduce labor costs in service industries such as communications and finance.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
The embodiment of the application provides an intelligent outbound method, as shown in fig. 1, the method may include the following steps:
Step S101: acquiring the outbound voice.
When handling communication or financial services such as changing a tariff package or paying a credit card bill in installments, customers often choose to do so by telephone, that is, by dialing the enterprise's customer service number. The service request the customer makes after the call is connected is the outbound voice referred to in the embodiments of this application.
Step S102: the outbound voice is converted to corresponding text.
In a specific embodiment, as shown in fig. 2, step S102 can be implemented through the following sub-steps:
Step S1021: determining the start and end positions of speech based on the audio stream of the outbound voice.
Step S1022: extracting the target audio data based on the start and end positions.
Step S1023: decoding the target audio data to generate a corresponding text.
In practical applications, to improve the recognition of the audio, the following sub-step may be added before step S1021:
Step S1020: performing noise reduction and reverberation elimination on the outbound voice.
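The endpoint-detection sub-steps above (S1021 and S1022) can be sketched as a simple frame-energy rule. This is a minimal illustration, not the method of the application itself: the frame size (10 ms at an assumed 16 kHz sample rate) and the energy threshold are invented values, and a production system would typically use a trained voice activity detection model rather than a fixed threshold.

```python
# Minimal sketch of endpoint detection (steps S1021/S1022) on a PCM sample
# list. Frame size and threshold are illustrative assumptions.

def detect_endpoints(samples, frame_size=160, threshold=500.0):
    """Return (start, end) sample indices of the voiced region, or None."""
    voiced_starts = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(s * s for s in frame) / frame_size  # mean squared amplitude
        if energy > threshold:
            voiced_starts.append(i)
    if not voiced_starts:
        return None
    return voiced_starts[0], voiced_starts[-1] + frame_size

def extract_target_audio(samples, frame_size=160, threshold=500.0):
    """Step S1022: cut out the audio between the detected start and end."""
    endpoints = detect_endpoints(samples, frame_size, threshold)
    if endpoints is None:
        return []
    start, end = endpoints
    return samples[start:end]
```

Given an audio stream as a list of samples, `detect_endpoints` returns the start and end indices of speech, from which `extract_target_audio` cuts the target audio data to be decoded.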
After noise reduction and reverberation elimination are performed on the input voice, endpoint detection can be performed on the input audio stream to determine the start and end positions of speech, with recognition proceeding in parallel. In step S1023, to improve response speed while ensuring recognition quality, two-pass decoding may be adopted. Specifically, the target audio data can first be decoded using fused acoustic models such as DFCNN and BiLSTM together with an n-gram language model; the output of this first pass is then decoded a second time using a domain language model and an RNN language model.
To improve the readability of the text, post-processing such as punctuation prediction, text smoothing, and text segmentation can be applied to the data obtained after the two-pass decoding, finally generating the text. As shown in fig. 2, a step S1024 may be added after step S1023 to perform text smoothing, punctuation prediction, and text segmentation on the text.
Step S103: selecting corresponding reply content according to the text.
After the text corresponding to the outbound voice is obtained, semantic understanding of the text is required. Semantic understanding, a natural language processing (NLP) task, refers to converting natural language into computer-readable structured data; it can be achieved by matching a corresponding intent scene through grammar rules, or by capturing the text content of fixed slots.
Specifically, step S103 may be implemented through the following sub-steps:
Step S1031: matching a corresponding intent scene for the text according to preset grammar rules.
Step S1032: converting the text into corresponding structured data according to the intent scene, and determining the reply content according to the structured data.
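The grammar-rule route of steps S1031/S1032 can be sketched as follows. The rules, intent names, and replies below are hypothetical examples chosen for illustration and are not taken from the application itself:

```python
import re

# Sketch of steps S1031/S1032: preset grammar rules match an intent scene,
# the text becomes structured data, and reply content is looked up from it.
# Rules, intents, and replies are hypothetical.

GRAMMAR_RULES = {
    "change_plan": re.compile(r"\b(change|switch)\b.*\b(plan|package)\b"),
    "installment": re.compile(r"\b(installment|instalment)s?\b"),
}

def match_intent(text):
    """Step S1031: match a corresponding intent scene via grammar rules."""
    lowered = text.lower()
    for intent, pattern in GRAMMAR_RULES.items():
        if pattern.search(lowered):
            return intent
    return "unknown"

def to_structured(text):
    """Step S1032 (first half): convert the text into structured data."""
    return {"intent": match_intent(text), "utterance": text}

REPLIES = {
    "change_plan": "Which plan would you like to switch to?",
    "installment": "Over how many months would you like to pay?",
    "unknown": "Could you describe your request in more detail?",
}

def select_reply(text):
    """Step S1032 (second half): determine reply content from the data."""
    return REPLIES[to_structured(text)["intent"]]
```

For example, `select_reply("I want to change my phone plan")` matches the `change_plan` rule and returns the corresponding reply text.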
In a specific embodiment, the following sub-steps may also be used to implement step S103, in place of steps S1031 and S1032:
Step S1031': extracting text content according to preset fixed slots.
Step S1032': converting the text into corresponding structured data according to the text content of the fixed slots, and determining the reply content according to the structured data.
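The fixed-slot route can be sketched in the same spirit. The slot names and capture patterns below are hypothetical examples, not slot definitions from the application:

```python
import re

# Sketch of steps S1031'/S1032': text content is captured for preset fixed
# slots and assembled into structured data. Slot names and patterns are
# hypothetical.

SLOT_PATTERNS = {
    "amount": re.compile(r"(\d+(?:\.\d+)?)\s*(?:yuan|dollars)"),
    "months": re.compile(r"(\d+)\s*months?\b"),
}

def fill_slots(text):
    """Step S1031': capture the text content of each fixed slot present."""
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text.lower())
        if match:
            slots[name] = match.group(1)
    return slots

def to_structured(text):
    """Step S1032': wrap the captured slots as structured data, from which
    reply content can then be determined."""
    return {"slots": fill_slots(text), "utterance": text}
```

A request like "Split 3000 yuan over 6 months" fills both slots, giving structured data the dialogue logic can act on directly.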
Step S104: generating a corresponding reply voice according to the reply content.
In a specific embodiment, as shown in fig. 2, step S104 can be implemented through the following sub-steps:
Step S1041: converting the reply content into a corresponding prosody sequence.
Step S1042: generating a corresponding reply voice according to the prosody sequence.
In step S104, the text is converted into sound, and the voice corresponding to the selected speaker (voice persona) is synthesized. The underlying principle is as follows:
Speech synthesis can be regarded as an artificial intelligence system. To synthesize high-quality speech, the system must not only rely on various rules, including semantic, lexical, and phonetic rules, but also understand the content of the text well, which involves natural language understanding. The text-to-speech conversion process converts the text sequence into a prosody sequence and then generates a speech waveform from that prosody sequence. Step S1041 involves linguistic processing, such as word segmentation and grapheme-to-phoneme conversion, together with a set of effective prosody control rules; step S1042 requires advanced speech synthesis techniques capable of synthesizing a high-quality speech stream in real time on demand. In general, a text-to-speech system requires a complex conversion process from text sequence to phoneme sequence; that is, it needs not only digital signal processing technology but also the support of extensive linguistic knowledge.
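The two sub-steps can be caricatured under stated assumptions: the prosody (the "rhyme" sequence above) is reduced to one pitch/duration pair per word, and the waveform generator renders sine tones instead of speech. Both functions are hypothetical illustrations of the pipeline shape, not a real synthesizer front end or back end:

```python
import math

# Toy illustration of steps S1041/S1042: the reply text is mapped to a
# prosody sequence, and a waveform is generated from that sequence. A real
# system works on syllables after word segmentation and grapheme-to-phoneme
# conversion; sine tones stand in for real speech here.

def text_to_prosody(text):
    """Hypothetical front end: one (pitch_hz, duration_s) pair per word."""
    return [(220.0 + 20.0 * i, 0.15) for i, _ in enumerate(text.split())]

def prosody_to_waveform(prosody, sample_rate=16000):
    """Hypothetical back end: render each prosody unit as a sine tone."""
    samples = []
    for pitch_hz, duration_s in prosody:
        n = int(sample_rate * duration_s)
        samples.extend(math.sin(2 * math.pi * pitch_hz * t / sample_rate)
                       for t in range(n))
    return samples

reply_wave = prosody_to_waveform(text_to_prosody("your bill is ready"))
```

The resulting float samples could be written out as PCM audio (e.g. via the standard `wave` module) for playback; the point of the sketch is only the text → prosody sequence → waveform data flow.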
According to the intelligent outbound method provided by the embodiments of this application, text conversion and semantic understanding are performed on the outbound voice uttered by the customer, so that the computer can intelligently select reply content. To better suit telephone communication and provide telephone service to customers, the selected reply content is further converted into voice, so that the telephone service can be completed in place of a human customer-service agent, which helps reduce labor costs in service industries such as communications and finance.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
An embodiment of the present application further provides a terminal device, as shown in fig. 3, where the terminal device may include: an input unit 301, a text conversion unit 302, a text understanding unit 303, and a voice playing unit 304.
Specifically, the input unit 301 is configured to acquire the outbound voice; for the corresponding implementation process, refer to the description of step S101 in the above method embodiment.
The text conversion unit 302 is configured to convert the outbound voice into a corresponding text; for the corresponding implementation process, refer to the description of step S102 in the above method embodiment.
The text understanding unit 303 is configured to select corresponding reply content according to the text; for the corresponding implementation process, refer to the description of step S103 in the above method embodiment.
The voice playing unit 304 is configured to generate a corresponding reply voice according to the reply content; for the corresponding implementation process, refer to the description of step S104 in the above method embodiment.
Fig. 4 is a schematic diagram of another terminal device provided in an embodiment of the present application. As shown in fig. 4, the terminal device 400 of this embodiment includes: a processor 401, a memory 402, and a computer program 403, such as an intelligent outbound program, stored in the memory 402 and executable on the processor 401. The processor 401, when executing the computer program 403, implements the steps in the above-described embodiments of the intelligent outbound method, such as steps S101 to S104 shown in fig. 1. Alternatively, the processor 401, when executing the computer program 403, implements the functions of the modules/units in the device embodiments described above, such as the functions of the input unit 301, the text conversion unit 302, the text understanding unit 303, and the voice playing unit 304 shown in fig. 3.
The computer program 403 may be partitioned into one or more modules/units that are stored in the memory 402 and executed by the processor 401 to implement the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program 403 in the terminal device 400. For example, the computer program 403 may be partitioned into a synchronization module, a summarization module, an acquisition module, and a return module (modules in a virtual device).
The terminal device 400 may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, the processor 401 and the memory 402. Those skilled in the art will appreciate that fig. 4 is merely an example of the terminal device 400 and does not constitute a limitation of the terminal device 400, which may include more or fewer components than shown, combine some components, or have different components; for example, the terminal device may also include input-output devices, network access devices, buses, etc.
The processor 401 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 402 may be an internal storage unit of the terminal device 400, such as a hard disk or memory of the terminal device 400. The memory 402 may also be an external storage device of the terminal device 400, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the terminal device 400. Further, the memory 402 may include both an internal storage unit and an external storage device of the terminal device 400. The memory 402 is used to store the computer program and other programs and data required by the terminal device. The memory 402 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.