Intelligent outbound method, device, computer equipment and storage medium

Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an intelligent outbound calling method, an intelligent outbound calling apparatus, a computer device, and a storage medium.
Background
The traditional customer service model requires a large number of human agents to place outbound calls to customers, which increases labor cost. Moreover, because the number of human agents is limited, tasks cannot be completed in a timely and effective manner when call volume is high. For example, with economic development and the rise of internet finance, cash loan and consumer loan services have appeared in large numbers, and collecting the money that has been lent out has become an important part of the collection business; such collection work cannot be carried out in time by a limited number of human agents.
To address this, automatic outbound calling systems have been introduced. In an existing automatic outbound system, a worker sets a fixed outbound time; when that time is reached, a preset voice message is played to the user over the telephone. The played content is rigid and monotonous, the system cannot accurately identify the answering intention of the called party or provide a corresponding script according to that intention, and the purpose of the call is therefore difficult to achieve.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present invention provides an intelligent outbound calling method, apparatus, computer device, and storage medium, so as to solve the problem that the purpose of a call is difficult to achieve because the answering intention of the called party cannot be accurately identified and a corresponding script cannot be provided according to that intention.
In order to achieve the above object, the present invention provides an intelligent outbound method, comprising:
calling a contact phone of a target object;
when the contact phone is answered, cyclically performing the following steps:
acquiring the script of the current round;
converting the script of the current round into voice data to be played;
playing the voice data to be played to the target object;
judging whether the script of the current round is a preset farewell script; if so, ending the process, and if not, receiving response voice data of the target object to the voice data to be played;
converting the response voice data into response text data;
identifying a target reply intention of the target object according to the response text data;
and matching a script corresponding to the target reply intention from a preset script library to serve as the script of the next round.
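The cyclic steps above can be sketched as follows. This is a minimal illustration only; the helper callables (`tts`, `asr`, `recognize_intent`, and so on) are hypothetical placeholders injected as parameters, not part of the disclosed implementation:

```python
def outbound_call(opening_script, farewell_scripts, script_library,
                  tts, play, record_response, asr, recognize_intent):
    """Minimal sketch of the claimed outbound dialog loop."""
    script = opening_script
    rounds = []
    while True:
        play(tts(script))                      # synthesize and play the current script
        rounds.append(script)
        if script in farewell_scripts:         # a farewell script ends the call
            break
        reply_text = asr(record_response())    # record and transcribe the response
        intent = recognize_intent(reply_text)  # identify the target reply intention
        script = script_library[intent]        # match the next round's script
    return rounds
```

Each round plays the current script, checks for the farewell script, and otherwise derives the next script from the recognized reply intention.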
In a preferred embodiment of the present invention, the identifying the target reply intention of the target object according to the reply text data includes:
processing the response text data by using a pre-trained intention recognition model to obtain a plurality of candidate intentions corresponding to the response text data and initial semantic similarity corresponding to each candidate intention;
multiplying the initial semantic similarity corresponding to each candidate intention by a preset weight corresponding to each candidate intention to obtain a final semantic similarity corresponding to each candidate intention;
and determining the candidate intention with the highest final semantic similarity as a target reply intention.
In a preferred embodiment of the present invention, before the identifying the target reply intention of the target object according to the reply text data, the method further includes: and performing error correction processing on the response text data.
In a preferred embodiment of the present invention, the error correction processing on the response text data includes:
determining words to be corrected and replacement words corresponding to the words to be corrected in the response text data according to a preset standard word bank;
and replacing the word to be corrected in the response text data with the corresponding replacement word.
In a preferred embodiment of the present invention, after the receiving the response voice data of the target object to the voice data to be played, the method further includes: extracting an audio feature vector from the response voice data, and acquiring a target emotion classification result corresponding to the response voice data according to the audio feature vector.
In a preferred embodiment of the present invention, the obtaining a target emotion classification result corresponding to the response speech data according to the audio feature vector includes:
matching the audio feature vector with a plurality of emotion feature models obtained by pre-training, wherein the emotion feature models respectively correspond to one of a plurality of emotion classification results;
and taking the emotion classification result corresponding to the matched emotion characteristic model as the target emotion classification result.
In a preferred embodiment of the present invention, the contact phone is stored in a blockchain, and after calling the contact phone of the target object, the method further comprises:
and recording the response state of the contact telephone, and filtering according to the response state.
In order to achieve the above object, the present invention further provides an intelligent outbound device, including:
the calling module is used for calling a contact telephone of the target object;
the script acquisition module is used for acquiring the script of the current round;
the script conversion module is used for converting the script of the current round into voice data to be played;
the voice playing module is used for playing the voice data to be played to the target object;
the script judging module is used for judging whether the script of the current round is a preset farewell script;
the response receiving module is used for receiving response voice data of the target object to the voice data to be played;
the voice conversion module is used for converting the response voice data into response text data;
the intention identification module is used for identifying a target reply intention of the target object according to the response text data;
and the script matching module is used for matching a script corresponding to the target reply intention from a preset script library as the script of the next round.
In a preferred embodiment of the present invention, the intention identifying module includes:
the initial semantic similarity recognition unit is used for processing the response text data by utilizing a pre-trained intention recognition model to obtain a plurality of candidate intentions corresponding to the response text data and initial semantic similarity corresponding to each candidate intention;
the final semantic similarity recognition unit is used for multiplying the initial semantic similarity corresponding to each candidate intention by a preset weight corresponding to each candidate intention to obtain the final semantic similarity corresponding to each candidate intention;
and the target intention determining unit is used for determining the candidate intention with the highest final semantic similarity as the target reply intention.
In a preferred embodiment of the present invention, the apparatus further includes an error correction module, configured to perform error correction processing on the response text data.
In a preferred embodiment of the present invention, the error correction module includes:
the wrong word determining unit is used for determining words to be corrected and replacement words corresponding to the words to be corrected in the response text data according to a preset standard word bank;
and the wrong word replacing unit is used for replacing the words to be corrected in the response text data with the corresponding replacing words.
In a preferred embodiment of the present invention, the apparatus further comprises:
the audio characteristic extraction module is used for extracting audio characteristic vectors in the response voice data;
and the emotion recognition module is used for acquiring a target emotion classification result corresponding to the response voice data according to the audio characteristic vector.
In a preferred embodiment of the present invention, the emotion recognition module includes:
the matching unit is used for matching the audio characteristic vector with a plurality of emotion characteristic models obtained by pre-training, wherein the emotion characteristic models respectively correspond to one of a plurality of emotion classification results;
and the emotion determining unit is used for taking the emotion classification result corresponding to the matched emotion characteristic model as the target emotion classification result.
In a preferred embodiment of the present invention, the contact phone is stored in a blockchain, and the apparatus further comprises:
and the filtering module is used for recording the response state of the contact telephone and filtering according to the response state.
In order to achieve the above object, the present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the aforementioned method when executing the computer program.
In order to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the aforementioned method.
Through the above technical solutions, the present invention has the following beneficial effects:
According to the method and apparatus, after the voice data to be played is played to the target object, the response voice data of the target object is received and converted into response text data; the target reply intention of the target object is identified according to the response text data; and finally a script corresponding to the target reply intention is matched from the preset script library as the script of the next round. In this way, the reply intention of the called party can be accurately identified, a corresponding script can be provided according to that intention, and the effectiveness of the call is improved.
Drawings
FIG. 1 is a flow chart of one embodiment of an intelligent outbound method of the present invention;
FIG. 2 is a block diagram of an embodiment of an intelligent outbound device of the present invention;
FIG. 3 is a hardware architecture diagram of one embodiment of the computer device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Example one
The embodiment provides an intelligent outbound method, as shown in fig. 1, which specifically includes the following steps:
and S1, calling the contact telephone of the target object.
Specifically, the contact phone of the target object may be obtained from a preset call list and then dialed. For example, assume the purpose of the call is debt collection, and a corresponding dialing frequency and number of dial attempts are configured in advance for collection targets of different overdue levels (e.g., the commonly used M0-M4+ buckets). When the corresponding time is reached, the contact phone of the target object is obtained from the preset collection list and dialed. In general, the response state of a contact phone may be: answered, call rejected, powered off, out of service, vacant number, busy, hung up, out of the service area, network busy, incoming-call restriction, and so on. Preferably, if the response state is out of service or a vacant number, the contact phone of the target object can be deleted from the call list, so as to filter out invalid numbers, improve the quality of the call list, increase the success rate of connecting calls, and save call waiting time.
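The filtering by response state might look like the following minimal sketch; the state names are illustrative stand-ins, not terms fixed by the disclosure:

```python
# Response states that mark a number as invalid (illustrative names).
INVALID_STATES = {"out_of_service", "vacant_number"}

def filter_call_list(call_list, response_states):
    """Drop contacts whose last recorded response state marks the number invalid."""
    return [phone for phone in call_list
            if response_states.get(phone) not in INVALID_STATES]
```

Contacts with any other (or no) recorded state are kept in the list for the next dialing round.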
It should be emphasized that, to further ensure the privacy and security of the contact phone, the contact phone may also be stored in a node of a blockchain, and the corresponding contact phone is extracted from the blockchain before the call. The blockchain referred to in this embodiment is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-forgery) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
And S2, acquiring the script of the current round. When this step is executed for the first time, the script of the current round is the preset opening script; in subsequent executions, it is the follow-up script obtained according to the reply intention of the target object in the previous round.
And S3, converting the script of the current round into the corresponding voice data to be played. In this embodiment, a conventional TTS (Text To Speech) technology is used to convert the script of the current round into the corresponding voice data.
And S4, playing the voice data to be played to the target object through the contact telephone.
And S5, judging whether the script of the current round is a preset farewell script. If so, the call is finished and the process ends; otherwise, step S6 is executed.
S6, receiving the response voice data of the target object to the voice data to be played.
And S7, converting the received response voice data into the corresponding response text data. In this embodiment, a conventional ASR (Automatic Speech Recognition) technology may be adopted to convert the response voice data into the corresponding response text data.
And S8, identifying the target reply intention of the target object according to the obtained response text data. This includes the following steps:
and S81, processing the response text data by using a pre-trained intention recognition model to obtain the semantic similarity between each response intention in the dialogistic library and the response text data, and taking the preset number of response intentions with the highest semantic similarity with the response text data as candidate response intentions. For example, suppose that the answer text data is "how much money i owe, so that hands are very tight recently", the number of candidate reply intentions is 3, and the candidate reply intentions are "agreement payment", "fund difficulty" and "inquiry owing information", and the corresponding semantic similarity of the three is 0.3, 0.7 and 0.78 respectively.
And S82, multiplying the semantic similarity corresponding to each candidate intention by the preset weight corresponding to that candidate intention to obtain the weighted semantic similarity of each candidate intention, and taking the candidate intention with the highest weighted semantic similarity as the target reply intention. In this embodiment, the reply intentions are divided in advance into main-flow intentions and auxiliary-flow intentions, and a weight is set for each group. To give priority to the main flow, the weight of the main-flow intentions should be greater than that of the auxiliary-flow intentions; for example, the main-flow weight may be preset to 0.6 and the auxiliary-flow weight to 0.4. If "agree to repay" and "financial difficulty" are assigned to the main flow and "ask about arrears" to the auxiliary flow, then the weighted semantic similarity of "agree to repay" is 0.3 × 0.6 = 0.18, that of "financial difficulty" is 0.7 × 0.6 = 0.42, and that of "ask about arrears" is 0.78 × 0.4 = 0.312. For the response text data "How much do I owe? Money has been tight lately", the weighted semantic similarity of the "financial difficulty" intention is therefore the highest, so "financial difficulty" is taken as the target reply intention of the response text data.
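The weighted re-ranking of step S82 can be sketched as follows, using the intent names and weights from the example above (a minimal illustration rather than the claimed implementation):

```python
def pick_target_intent(similarities, main_flow, w_main=0.6, w_aux=0.4):
    """Re-rank candidate intentions by flow-dependent weights and pick the best."""
    weighted = {
        intent: sim * (w_main if intent in main_flow else w_aux)
        for intent, sim in similarities.items()
    }
    return max(weighted, key=weighted.get), weighted

# Candidate intentions and their initial semantic similarities from step S81.
similarities = {"agree to repay": 0.3, "financial difficulty": 0.7, "ask about arrears": 0.78}
target, scores = pick_target_intent(similarities,
                                    main_flow={"agree to repay", "financial difficulty"})
```

Note that the auxiliary-flow intention "ask about arrears" has the highest raw similarity but loses after weighting.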
In the present embodiment, the intention recognition model is preferably a neural network model. Before training the intention recognition model, training samples need to be collected in advance; these comprise a number of training texts and the intention information corresponding to each training text. During training, each training text is input into the intention recognition model so that the model predicts its intention information; the predicted intention information is then compared with the collected intention information to judge whether they are consistent, and if not, the parameters of the intention recognition model are adjusted to make them consistent. Training proceeds in this manner over all the training texts until the predicted intention information is consistent with the collected intention information; the parameters of the intention recognition model are fixed after training is finished, thereby determining the intention recognition model.
S9, matching the script corresponding to the target reply intention from the preset script library to serve as the script of the next round, and then returning to step S2.
In this embodiment, a script library is configured in advance. The script library contains an opening script and scripts corresponding to the different reply intentions. For example, the opening script is set to ask whether the user is the person concerned; for the reply intention of confirming identity, the corresponding script is set to notify of the overdue information; for the reply intention of agreeing to repay, the corresponding script is set to the farewell script associated with agreeing to repay; for the reply intention of not knowing how to repay, the corresponding script is set to explain the repayment methods; for the reply intention of financial difficulty, the corresponding script is set to a pressure script urging the person to raise the money; for the reply intention of asking about arrears, the corresponding script is set to notify of the arrears information; and so on. It should be noted that the present invention places no limitation on the number, type, or content of the scripts. For example, assuming that the target reply intention is "financial difficulty", the script corresponding to "financial difficulty", i.e., the pressure script urging the person to raise the money, is matched from the script library, and the process then returns to step S2 for the next round of conversation with the target object.
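The script library described above amounts to a mapping from reply intentions to next-round scripts; a minimal sketch (the script texts are illustrative stand-ins for the configured scripts):

```python
# Hypothetical script library; keys and texts are illustrative only.
SCRIPT_LIBRARY = {
    "confirmed identity": "Your loan is overdue; please repay as soon as possible.",
    "agree to repay": "Thank you for your cooperation. Goodbye.",  # farewell script
    "does not know how to repay": "You can repay through the app as follows ...",
    "financial difficulty": "Please try to raise the money from family or friends.",
    "ask about arrears": "Your outstanding balance is ...",
}

def next_script(target_intent):
    """Step S9: match the script for the target reply intention."""
    return SCRIPT_LIBRARY[target_intent]
```

A production system would likely also handle unknown intentions with a fallback script instead of raising `KeyError`.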
Through the above steps, the reply intention of each response of the target object can be accurately identified, and the script of the next round corresponding to the target reply intention can be matched from the script library, thereby improving the effectiveness of the call.
In this embodiment, the intelligent outbound method may further include the following step: before executing step S8, performing error correction processing on the response text data. The specific process is as follows: determining, according to a preset standard word bank, the words to be corrected in the response text data and the replacement word corresponding to each word to be corrected; and replacing each word to be corrected in the response text data with its corresponding replacement word. Specifically, a number of common standard words are preset in the standard word bank, for example loan-related terms such as "oxygen loan" (a loan product name) and "cash loan". After the response text data is obtained through ASR, it is determined whether the response text data contains a homophone of some common standard word in the standard word bank; if so, the homophone is taken as a word to be corrected, and the common standard word corresponding to it is taken as the replacement word. For example, assuming the standard word "oxygen loan" exists in the standard word bank and its Chinese homophone "oxygen bag" appears in the ASR-recognized response text data, the "oxygen bag" in the response text data is replaced with "oxygen loan".
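A minimal sketch of this replacement step follows. In practice the homophone detection would compare pronunciations (e.g. pinyin) against the standard word bank; here a precomputed homophone-to-standard-word mapping, with invented entries, stands in for that step:

```python
# Hedged sketch: precomputed ASR homophone -> standard word mapping
# (e.g. Chinese 氧气袋 "oxygen bag" misrecognized for 氧气贷 "oxygen loan").
HOMOPHONE_TO_STANDARD = {
    "oxygen bag": "oxygen loan",
}

def correct_text(text):
    """Replace known ASR homophone errors with the standard words."""
    for wrong, standard in HOMOPHONE_TO_STANDARD.items():
        text = text.replace(wrong, standard)
    return text
```

Correcting the transcript before intent recognition keeps domain terms consistent with the script library's vocabulary.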
In this embodiment, the intelligent outbound method may further include the following step: after executing step S6, extracting an audio feature vector from the response voice data, and obtaining a target emotion classification result corresponding to the response voice data according to the audio feature vector. The specific process is as follows: first, the audio feature vector is matched against a plurality of pre-trained emotion feature models, where each emotion feature model corresponds to one of a plurality of preset emotion classification results; then, the emotion classification result corresponding to the matched emotion feature model is taken as the target emotion classification result. The obtainable target emotion classification results may include: agreement, happiness, appreciation, praise, sadness, fear, anger, complaint, and so on, where agreement, happiness, appreciation, and praise are positive emotions, and sadness, fear, anger, and complaint are negative emotions. When the obtained target emotion classification result is a negative emotion, the call can be transferred to a human agent, so as to guarantee call quality and call effect to the greatest extent.
In this step, the aforementioned audio feature vector includes one or more of the following audio features: an energy feature, a voiced-frame-count feature, a pitch frequency feature, a formant feature, a harmonic-to-noise-ratio feature, and a mel-frequency cepstral coefficient (MFCC) feature. It should be understood that which audio features are extracted may be determined according to the requirements of the actual scenario; the present invention does not limit the type, number, or dimensionality of the extracted audio features. In one embodiment of the present invention, in order to obtain the best emotion recognition effect, all six of the above audio features may be extracted simultaneously.
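As a hedged sketch of extracting the simplest of these features, short-time energy and a voiced-frame count can be computed directly from the waveform; the frame length, hop size, and energy threshold below are arbitrary illustrative values:

```python
import numpy as np

def frame_energy_features(signal, frame_len=400, hop=160, voiced_threshold=0.01):
    """Short-time energy per frame, plus the count of frames above a threshold."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    energies = np.array([
        np.sum(signal[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    return energies, int(np.sum(energies > voiced_threshold))
```

The richer features named above (pitch, formants, MFCCs) would typically come from a dedicated audio library rather than being hand-rolled like this.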
In this embodiment, the emotion feature models may be established by learning, in advance, the audio feature vectors of a number of preset voice data items carrying emotion classification labels corresponding to the plurality of emotion classifications. This amounts to establishing a correspondence between emotion feature models and emotion classifications, with each emotion feature model corresponding to one emotion classification. The pre-learning process for establishing the emotion feature models may include: first, clustering the audio feature vectors of the preset voice data items carrying the emotion classification labels, so as to obtain a clustering result for each preset emotion classification; then, according to the clustering result, training the audio feature vectors of the preset voice data in each cluster into an emotion feature model. Based on these emotion feature models, the emotion feature model matching the current voice data can be obtained through a matching process on its audio feature vector, and the corresponding emotion classification is thereby obtained.
In the present embodiment, the emotion feature models may be Gaussian mixture models (GMMs). In that case, the audio feature vectors of the voice samples of the same emotion classification can be clustered using the K-means algorithm, and initial values of the parameters of the Gaussian mixture model are computed from the clustering result; a Gaussian mixture model for each emotion classification is then trained using the E-M algorithm. When the Gaussian mixture models are used in the matching process for emotion classification, the likelihood between the audio feature vector of the current voice data and each of the emotion feature models can be computed, and the matched model is then determined by comparing these likelihoods; for example, the emotion feature model whose likelihood is both greater than a preset threshold and the largest is taken as the matched model. It should be understood that although the above description takes the Gaussian mixture model as an example, the emotion feature model may also be implemented in other forms, such as a support vector machine (SVM) model, a K-nearest-neighbor (KNN) model, a hidden Markov model (HMM), or an artificial neural network (ANN) model. The specific implementation form of the emotion feature model is not strictly limited.
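The likelihood-based matching can be sketched as follows. For brevity this uses a single diagonal Gaussian per emotion as a stand-in for a full GMM, and the threshold value is arbitrary; the model means and variances would come from the E-M training described above:

```python
import numpy as np

def log_likelihood(x, mean, var):
    """Log-density of x under a diagonal Gaussian (stand-in for one GMM per emotion)."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def match_emotion(feature_vec, models, threshold=-50.0):
    """Pick the emotion whose model gives the highest log-likelihood above a threshold."""
    scores = {emo: log_likelihood(feature_vec, m, v) for emo, (m, v) in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```

Returning `None` when no model clears the threshold mirrors the "greater than a preset threshold and the largest" rule above.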
It should be noted that, for the sake of simplicity, the present embodiment is described as a series of acts, but those skilled in the art should understand that the present invention is not limited by the described order of acts, because some steps can be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Example two
The present embodiment provides an intelligent outbound device, as shown in fig. 2, the intelligent outbound device 10 includes:
and the calling module 11 is used for calling the contact telephone of the target object. It is emphasized that to further ensure the privacy and security of the contact phone, the contact phone may also be stored in a node of a blockchain, and the corresponding contact phone is extracted from the blockchain in advance before the call.
And a script acquisition module 12 for acquiring the script of the current round.
And a script conversion module 13, configured to convert the script of the current round into voice data to be played.
And the voice playing module 14 is configured to play the voice data to be played to the target object.
And the script judging module 15 is configured to judge whether the script of the current round is a preset farewell script.
And the response receiving module 16 is configured to receive response voice data of the target object to the voice data to be played.
And the voice conversion module 17 is configured to convert the response voice data into response text data.
And the intention identification module 18 is configured to identify the target reply intention of the target object according to the response text data.
And the script matching module 19 is configured to match the script corresponding to the target reply intention from a preset script library as the script of the next round.
In this embodiment, the intention identifying module 18 includes:
the initial semantic similarity recognition unit is used for processing the response text data by utilizing a pre-trained intention recognition model to obtain a plurality of candidate intentions corresponding to the response text data and initial semantic similarity corresponding to each candidate intention;
the final semantic similarity recognition unit is used for multiplying the initial semantic similarity corresponding to each candidate intention by a preset weight corresponding to each candidate intention to obtain the final semantic similarity corresponding to each candidate intention;
and the target intention determining unit is used for determining the candidate intention with the highest final semantic similarity as the target reply intention.
In this embodiment, the apparatus further includes an error correction module, configured to perform error correction processing on the response text data.
In this embodiment, the error correction module includes:
the wrong word determining unit is used for determining words to be corrected and replacement words corresponding to the words to be corrected in the response text data according to a preset standard word bank;
and the wrong word replacing unit is used for replacing the words to be corrected in the response text data with the corresponding replacing words.
In this embodiment, the apparatus further includes:
the audio characteristic extraction module is used for extracting audio characteristic vectors in the response voice data;
and the emotion recognition module is used for acquiring a target emotion classification result corresponding to the response voice data according to the audio characteristic vector.
In this embodiment, the emotion recognition module includes:
the matching unit is used for matching the audio characteristic vector with a plurality of emotion characteristic models obtained by pre-training, wherein the emotion characteristic models respectively correspond to one of a plurality of emotion classification results;
and the emotion determining unit is used for taking the emotion classification result corresponding to the matched emotion characteristic model as the target emotion classification result.
In this embodiment, the apparatus further includes:
and the filtering module is used for recording the response state of the contact telephone and filtering according to the response state.
As for the apparatus embodiment, since it is basically similar to the method embodiment of Example one, the description here is relatively brief; for relevant details, refer to the corresponding description of the method embodiment. Moreover, it should be understood by those skilled in the art that the embodiments described in the specification are preferred embodiments, and the modules involved are not necessarily essential to the invention.
EXAMPLE III
The present embodiment provides a computer device capable of executing programs, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including a stand-alone server or a server cluster composed of multiple servers). As shown in FIG. 3, the computer device 20 of the present embodiment includes at least, but is not limited to, a memory 21 and a processor 22, which may be communicatively coupled to each other via a system bus. It is noted that FIG. 3 shows only the computer device 20 with the components 21-22, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In the present embodiment, the memory 21 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or an internal memory of the computer device 20. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory Card (Flash Card) provided on the computer device 20. Of course, the memory 21 may also include both an internal storage unit and an external storage device of the computer device 20. In this embodiment, the memory 21 is generally used for storing the operating system and the various application software installed in the computer device 20, such as the program codes of the intelligent outbound device 10 of the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may in some embodiments be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is typically used to control the overall operation of the computer device 20. In this embodiment, the processor 22 is configured to execute the program codes stored in the memory 21 or to process data, for example, to run the intelligent outbound device 10, so as to implement the intelligent outbound method of the first embodiment.
Example four
The present embodiment provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, or an application store, on which a computer program is stored that, when executed by a processor, implements the corresponding functions. The computer-readable storage medium of this embodiment is used for storing the intelligent outbound device 10, which, when executed by a processor, implements the intelligent outbound method of the first embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or certainly by hardware alone; in many cases, however, the former is the better implementation.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.