CN119577459B - Intelligent customer service training method and device for multi-mode large model and storage medium - Google Patents

Intelligent customer service training method and device for multi-mode large model and storage medium

Info

Publication number
CN119577459B
Authority
CN
China
Prior art keywords
data
model
target
text
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510095877.1A
Other languages
Chinese (zh)
Other versions
CN119577459A (en)
Inventor
陈振杰
徐雷
罗韵
邓富城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jijian Technology Co ltd
Original Assignee
Shandong Jijian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jijian Technology Co ltd
Priority to CN202510095877.1A
Publication of CN119577459A
Application granted
Publication of CN119577459B
Status: Active
Anticipated expiration


Abstract


The present application discloses an intelligent customer service training method, device and storage medium for a multimodal large model, and relates to the field of artificial intelligence technology. The method of the present application includes: collecting multimodal data; performing duplicate checking on the multimodal data; preprocessing the duplicate-checked data; performing data enhancement on the preprocessed data; constructing an initialized multimodal large model based on the Transformer architecture; training the initialized multimodal large model with the enhanced data to obtain a target multimodal large model; obtaining user query information; extracting weighted text from the user query information; inputting the weighted text into the target multimodal large model to obtain associated information; preprocessing the associated information to generate feature vectors; obtaining similarity values between the feature vectors and the weighted text according to a similarity algorithm; arranging the similarity values in order to obtain a sorting result; extracting the top-ranked target similarity value from the sorting result; determining target associated information based on the target similarity value; and displaying the target associated information.

Description

Intelligent customer service training method and device for multi-mode large model and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an intelligent customer service training method, device and storage medium for a multi-mode large model.
Background
With the rapid development of internet technology and the continued growth of electronic commerce platforms, consumers often find it difficult to quickly locate products that meet their needs when facing massive amounts of product information. Conventional product recommendation systems mostly rely on rule-based, collaborative-filtering or content-based methods, which, although improving recommendation accuracy to some extent, still have many limitations. For example, a rule-based recommendation system requires manually defined rules and adapts poorly to diverse user demands; a collaborative-filtering system relies on users' historical behavior data and suffers from a cold-start problem for new users or new products; and a content-based system depends mainly on the degree of matching between product descriptions and user input, ignoring users' latent demands and preferences.
The deep learning model is capable of automatically learning a characteristic representation of data and capturing complex relationships between the data through a hierarchical structure. Therefore, the deep learning technology is applied to the product recommendation system, and accurate understanding, efficient retrieval and personalized recommendation of user input information can be achieved. However, how to effectively combine deep learning techniques with product recommendation systems and how to optimize the similarity calculation process remains a hotspot and difficulty problem for current research.
Disclosure of Invention
In order to solve the technical problems, the application provides an intelligent customer service training method, device and storage medium for a multi-mode large model.
The following describes the technical scheme provided in the present application:
the first aspect of the application provides an intelligent customer service training method of a multi-mode large model, which comprises the following steps:
collecting multi-modal data, wherein the multi-modal data comprises text, voice and words;
performing data duplicate checking on the multi-mode data to obtain duplicate checking data;
Preprocessing the duplicate checking data to obtain preprocessed data;
Performing data enhancement on the preprocessed data through an enhancement algorithm, and acquiring enhancement data, wherein the data enhancement processing comprises the steps of performing synonym replacement, random insertion and text deletion on the preprocessed data;
constructing an initialized multi-modal large model based on the Transformer architecture;
training the initialized multi-modal large model according to the enhanced data to obtain a target multi-modal large model;
Acquiring user inquiry information;
extracting weight characters in the user inquiry information;
inputting the weight text into the target multi-mode big model so that the target multi-mode big model obtains associated information according to the weight text, wherein the associated information comprises text, video, voice and image information corresponding to the weight text;
Preprocessing the associated information to generate a feature vector;
Obtaining a similarity value between the feature vector and the weight text according to a similarity algorithm;
the similarity values are arranged in sequence to obtain a sorting result;
extracting a first target similarity value in the sorting result;
determining target association information based on the target similarity value;
and displaying the target associated information.
Optionally, training the initialized multi-modal large model according to the enhanced data to obtain a target multi-modal large model includes:
acquiring a plurality of enhancement feature vectors of the enhancement data;
Inputting the enhanced feature vector into the initialized multi-modal large model to obtain a multi-modal feature vector;
obtaining a prediction result according to the multi-mode feature vector;
calculating the loss between the prediction result and the actual label by using a cross entropy loss function, and obtaining a loss value;
acquiring updated parameters of the initialized multi-mode large model through a back propagation algorithm according to the loss value;
And updating the initialized multi-modal large model through the updating parameters to obtain the target multi-modal large model.
Optionally, calculating a loss between the prediction result and the actual tag using a cross entropy loss function, and obtaining a loss value includes:
calculating the loss between the prediction result and the actual label by using a cross entropy loss function, and obtaining a loss value;
The cross entropy loss function is:

L = -\frac{1}{N}\sum_{i=1}^{N} y_i \log(p_i)

where N is the number of categories, y_i is the one-hot encoding of the actual label, and p_i is the model-predicted probability that the sample belongs to the i-th category.
Optionally, performing data duplication checking on the multi-mode data to obtain duplication checking data, including:
Cleaning the multi-mode data to remove invalid, incomplete or wrong data;
Classifying the cleaned multi-modal data, wherein the multi-modal data comprises text data, image data, audio data and video data;
performing duplicate checking on the text data by adopting a first processing mode to acquire first duplicate checking data;
Performing duplicate checking on the image data by adopting a second processing mode to acquire second duplicate checking data;
Performing duplicate checking on the audio data by adopting a third processing mode to obtain third duplicate checking data;
Performing duplicate checking on the video data by adopting a fourth processing mode to obtain fourth duplicate checking data;
and merging the first duplicate-checking data, the second duplicate-checking data, the third duplicate-checking data and the fourth duplicate-checking data to generate a duplicate-checking data set.
Optionally, extracting the weight text in the user query information includes:
extracting vocabulary in the user inquiry information as nodes;
constructing an undirected graph according to the co-occurrence relation of the vocabulary in the user inquiry information;
Calculating a weight value of each vocabulary according to the co-occurrence times of the vocabulary in the undirected graph;
And extracting weight words in the user inquiry information according to the weight values.
Optionally, a softmax classification algorithm is configured in the multimodal big model, the softmax classification algorithm being defined as:

P(y_i = c) = \frac{e^{s_{i,c}}}{\sum_{c'} e^{s_{i,c'}}}

wherein c represents the classification category, P(y_i = c) represents the probability that sample i belongs to category c, s_{i,c} is the score of sample i for category c, and \sum_{c'} e^{s_{i,c'}} represents the exponential summation of the scores over all categories.
Optionally, the similarity algorithm is a Tanimoto similarity algorithm, and the Tanimoto similarity algorithm is defined as:

T(X, Y) = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}

wherein X and Y are two sets, X ∩ Y is the intersection of X and Y, and |X| and |Y| are the numbers of elements in X and Y, respectively.
The second aspect of the present application provides an intelligent customer service training device for a multimodal large model, the device comprising:
The collecting unit is used for collecting multi-modal data, wherein the multi-modal data comprises texts, voices and words;
The duplicate checking unit is used for performing data duplicate checking on the multi-mode data to obtain duplicate checking data;
The preprocessing unit is used for preprocessing the duplicate-checked data to obtain preprocessed data;
The construction unit is used for constructing an initialized multi-modal large model based on the Transformer architecture;
the acquisition unit is used for acquiring the inquiry information of the user;
the first extraction unit is used for extracting weight characters in the user inquiry information;
the input unit is used for inputting the weight characters into the target multi-mode big model so that the target multi-mode big model obtains associated information according to the weight characters, and the associated information comprises characters, videos, voices and image information corresponding to the weight characters;
the preprocessing unit is used for preprocessing the associated information to generate a feature vector;
the similarity calculation unit is used for obtaining a similarity value between the feature vector and the weight text according to a similarity algorithm;
The sorting unit is used for sequentially arranging the similarity values to obtain a sorting result;
The second extraction unit is used for extracting the first target similarity value in the sorting result;
a determining unit configured to determine target association information based on the target similarity value;
And the output unit is used for displaying the target associated information.
The third aspect of the present application provides an intelligent customer service training device for a multimodal large model, the device comprising:
a processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
The memory holds a program that the processor invokes to perform the method of the first aspect or any optional implementation of the first aspect.
A fourth aspect of the application provides a computer readable storage medium having a program stored thereon which, when executed on a computer, performs the method of the first aspect or any optional implementation of the first aspect.
From the above technical scheme, the application has the following advantages:
1. the deep learning model is used for accurately understanding the input information of the user, and the historical behavior data and preference information of the user are combined to realize personalized recommendation, so that a product list which meets the requirements of the user is provided for the user.
2. Similarity calculation is performed using an optimized deep learning algorithm and a cross entropy loss function, and model parameters are continuously optimized through iterative training, so that the accuracy and efficiency of similarity calculation are improved and the accuracy of recommendation is ensured.
3. By integrating multidimensional information such as user historical behavior data and social media data, a user portrait is constructed, and even if the user historical behavior data is less, more accurate recommendation can be provided for the user.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an embodiment of a multi-modal large-model intelligent customer service training method provided by the application;
FIG. 2 is a flow chart of one embodiment of the method for obtaining a target multi-modal large model in the present application;
FIG. 3 is a flow chart of an embodiment of performing data duplication checking on multi-modal data and obtaining duplication checking data in the present application;
FIG. 4 is a flowchart of an embodiment of extracting weight words from user query information according to the present application;
FIG. 5 is a schematic structural diagram of an embodiment of a multi-modal large-model intelligent customer service training apparatus provided by the present application;
fig. 6 is a schematic structural diagram of an embodiment of another multi-modal large-model intelligent customer service training apparatus provided by the present application.
Detailed Description
It should be noted that the intelligent customer service training method, device and storage medium for the multi-modal large model provided by the application can be applied to a terminal, a system or a server. The terminal can be, for example, a mobile terminal such as a smart phone, a tablet computer, a smart watch or a portable computer, or a fixed terminal such as a computer or a smart television. For convenience of explanation, the present application is described below using the terminal as the execution subject.
Referring to fig. 1, the present application first provides an embodiment of an intelligent customer service training method for a multi-modal large model, which includes:
s101, collecting multi-modal data, wherein the multi-modal data comprises texts, voices and words;
the text in this embodiment refers to a written language form composed of a series of characters or symbols for expressing information or ideas. In the field of computer science and information processing, text generally refers to a sequence of characters stored in an electronic device that can be read and processed by a computer program. Speech refers to human-uttered acoustic signals that can be captured by a recording device and converted to digital format for storage, transmission, and processing. Speech recognition techniques are capable of converting speech signals to text, while speech synthesis techniques are capable of converting text to speech. Text refers to a written form of characters or symbols that may represent words, phrases, or sentences in a language, unlike text, which is more focused on the written form of individual characters or symbols, and text is a complete sentence or paragraph made up of such characters or symbols.
When constructing an intelligent customer service system of a multi-mode large model, a large amount of multi-mode data needs to be collected firstly to serve as input and training basis of the system. Such data originates from a variety of sources and forms including, but not limited to, user conversation records, social media content, website text, audio files, handwriting recognition results, and the like. Text, voice and text data input by a user are directly acquired through a user interaction interface, a social media platform, a website and the like. And obtaining relevant text data from third party data sources such as news websites, forums, blogs and the like, converting the audio files into text data through a voice recognition technology, and converting handwritten characters into digital text data through a handwriting recognition technology. In the collecting process, data are required to be screened, invalid, repeated or noise data are removed, and accuracy and diversity of the data are ensured. The collected data is sorted into three modes of text, voice and words for subsequent data processing and model training.
S102, performing data duplication checking on the multi-mode data to obtain duplication checking data;
In this embodiment, the data is reviewed by comparing the contents of the submitted document or dataset with a huge database, through specific software or system, to identify possible plagiarism, duplications or misreferenced portions therein, and to present them in an intuitive way. The core of this process is the high efficiency of the algorithm and the broad coverage of the database.
The collected multi-modal data is input into a duplication checking system, and for non-text data, preprocessing is needed to be firstly carried out, and the data is converted into text form or comparable vector representation so as to carry out duplication checking. And comparing the input data with the existing data in the system by utilizing an algorithm and a database in the duplication checking system. And calculating the similarity between the input data and the existing data according to the comparison result, and identifying the part similar to or repeated with the existing data according to the calculation result of the similarity.
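As an illustration of the duplicate-checking idea for the text modality, the following is a minimal Python sketch of exact-duplicate detection via hashing; the normalization steps, the use of MD5, and the function name are assumptions for illustration, not details prescribed by the patent.

```python
import hashlib

def dedup_texts(texts):
    """Keep the first occurrence of each normalized text and drop the rest."""
    seen = set()
    unique = []
    for t in texts:
        # Normalize case and whitespace so trivially different copies collide.
        key = hashlib.md5(" ".join(t.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

print(dedup_texts(["Fast  delivery", "fast delivery", "return policy"]))
# -> ['Fast  delivery', 'return policy']
```

Near-duplicate detection, as described above, would replace the exact hash with a similarity comparison between vector representations.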
S103, preprocessing the duplicate checking data to obtain preprocessed data;
In this embodiment, the primary purpose of preprocessing the data after the duplicate is to make the data more normalized and consistent for subsequent data processing and model training. And removing special characters, invalid characters, redundant spaces and the like in the data, unifying date and time formats, filling in missing values and the like. Through the steps, the accuracy and the integrity of the data can be ensured, and errors in subsequent processing are reduced. The data is converted into a unified format and unit, such as all dates are converted into corresponding formats, the numerical data is converted into a unified measurement unit, and the like, so that the consistency of the data between different systems and platforms is ensured. Unstructured or semi-structured data, such as speech, images, etc., need to be converted into structured data for subsequent processing and analysis. The speech data is converted into text data by using speech recognition techniques and the image data is converted into a numerical or vector representation by using image recognition techniques. Data from different sources and in different formats are integrated and integrated to form a unified data set. The data integrity and diversity are ensured, and more input information is provided for subsequent data analysis and model training.
S104, carrying out data enhancement on the preprocessed data through an enhancement algorithm, and obtaining enhancement data, wherein the data enhancement processing comprises synonym replacement, random insertion and text deletion on the preprocessed data;
In this embodiment, the enhanced algorithm is a class of technical methods for optimizing the dataset. In the fields of natural language processing, image processing and the like, an enhanced algorithm generates new data samples through a series of transformation and processing operations, so that the diversity and the richness of data are increased.
In the text of the preprocessing data, a certain number of words are randomly selected, and synonyms of the words are found in a synonym dictionary and replaced. The method is used for increasing the diversity of the text, keeping the basic meaning of the original text unchanged, and facilitating the multi-modal large model to learn a richer language representation. In the text of the preprocessing data, some new words are inserted into the random selection position, wherein the new words refer to synonyms of the existing words in the original text, and can be other related or unrelated words. By introducing new vocabulary and sentence structure, the complexity and diversity of the text are increased, and the processing capacity of the model on unknown vocabulary and sentence is improved. In the text of the pre-processed data, some words or phrases are randomly deleted with a certain probability. Noise and redundant information in the text are simulated, the processing capacity of the model on incomplete data and noise data is enhanced, and meanwhile the generalization capacity of the model is improved.
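The three operations above (synonym replacement, random insertion, random deletion) can be sketched as follows; the toy synonym table, the parameter defaults and the function names are illustrative assumptions only.

```python
import random

SYNONYMS = {"fast": ["quick", "rapid"], "buy": ["purchase"]}  # toy synonym table

def synonym_replace(tokens, n=1):
    # Replace up to n words that have an entry in the synonym table.
    out = tokens[:]
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insert(tokens, n=1):
    # Insert n synonym words at random positions.
    out = tokens[:]
    for _ in range(n):
        src = random.choice([t for t in out if t in SYNONYMS] or out)
        out.insert(random.randrange(len(out) + 1),
                   random.choice(SYNONYMS.get(src, [src])))
    return out

def random_delete(tokens, p=0.1):
    # Delete each word with probability p, but never return an empty sentence.
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

sentence = "I want to buy a fast laptop".split()
print(synonym_replace(sentence), random_insert(sentence), random_delete(sentence))
```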
S105, constructing an initialized multi-modal large model based on the Transformer architecture;
In this embodiment, the Transformer architecture is a deep learning architecture whose core is a self-attention mechanism, by which long-distance dependencies in sequence data can be effectively captured. The Transformer architecture consists of an encoder and a decoder, each encoder and decoder layer containing multiple self-attention sublayers and feed-forward neural networks. This architecture has achieved great success in the NLP field and has gradually expanded to the multi-modal learning field.
Before constructing the multi-mode large model, preprocessing multi-mode data such as texts, images and the like. This includes text segmentation, image cropping, etc. steps to extract an efficient feature representation. The text encoder and the image encoder are used to extract feature representations of the text and the image, respectively, which are used as a basis for the subsequent processing of the model. The cross-modal attention mechanism can learn weights among different modalities and dynamically fuse information, so that the cross-modal representation and relationship are captured, and the cross-modal attention mechanism is one of core parts of a multi-modal large model. Training is carried out on multi-mode data, and cross-mode representation and relation are learned, and the multi-mode data is realized through tasks such as classification of image text, image description generation and the like. Training strategies improve model performance by utilizing massive amounts of data. And then, the trained multi-mode large model is finely tuned to specific downstream tasks, such as image question-answering, visual retrieval and the like, so that the model can be optimized for specific tasks, and the performance of the model is improved.
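For concreteness, a minimal sketch of a Transformer-based multimodal model skeleton in PyTorch is given below; all dimensions, the pooled-image-feature input, and the fusion-by-concatenation choice are assumptions for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn

class TinyMultimodalModel(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256, n_classes=10):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(2048, d_model)  # e.g. pooled CNN image features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, token_ids, img_feats):
        # Concatenate text tokens and the projected image feature into one
        # sequence so that self-attention can fuse the two modalities.
        seq = torch.cat([self.text_embed(token_ids),
                         self.img_proj(img_feats).unsqueeze(1)], dim=1)
        h = self.encoder(seq)
        return self.head(h.mean(dim=1))  # per-class scores before softmax

model = TinyMultimodalModel()
logits = model(torch.randint(0, 30000, (2, 16)), torch.randn(2, 2048))
print(logits.shape)  # torch.Size([2, 10])
```

Cross-modal attention arises here simply because text tokens and the image token attend to each other inside the shared encoder; a production model would use dedicated per-modality encoders as described above.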
S106, training the initialized multi-modal large model according to the enhanced data to obtain a target multi-modal large model;
A multi-modal large model based on the Transformer architecture is selected as the initial model, and its parameters are initialized according to task requirements and data characteristics. The multi-modal data subjected to data enhancement processing is then input into the model. During the training process, the multi-modal large model adapts to changes in the input data by continually adjusting its internal parameters. An optimization algorithm is used to minimize the loss function, thereby improving the performance of the model, and gradient information is passed to each layer of the model by a back propagation algorithm for weight updating. The multi-modal large model realizes comprehensive understanding and processing of multi-modal data by learning the associations and complementary information among different modes.
During the training process, the multimodal big model is evaluated periodically to check if its performance is expected. If the evaluation result is not ideal, the multi-mode large model needs to be adjusted and optimized, the model structure is modified, the parameter setting is adjusted, and the like. The training process ends when the multimodal big model reaches the expected performance on the training set. At this time, the model has learned how to process and understand the multi-modal data and has some generalization capability.
S107, acquiring user inquiry information;
The intelligent customer service system enables users to conveniently input their inquiry by designing an intuitive and easy-to-use user interface, wherein the user interface has the functions of a text input box, a voice input button, an image uploading area and the like, provides necessary prompts and guidance in a UI, and helps the users to definitely input contents and formats. As the user inputs information through the UI, the intelligent customer service system captures these inputs in real-time. For text input, the intelligent customer service system monitors keyboard events and captures characters entered by the user, for speech input, the intelligent customer service system converts speech to text using speech recognition techniques, and for image input, the intelligent customer service system captures and stores image data. After the data is captured, the input data is analyzed, so that the input data is ensured to meet the requirement of system processing. For text input, redundant space and punctuation marks are removed and word segmentation processing is carried out, and for image input, scaling, clipping or denoising processing is carried out. If text and image data are input at the same time, the intelligent customer service system integrates the information of different modes. The text and image features are aligned, fused or otherwise associated for subsequent processing. During the process of acquiring user input, the intelligent customer service system detects and corrects possible form errors. For example, for text input, the intelligent customer service system uses a spell checking algorithm to identify and correct spelling errors, and for speech input, the intelligent customer service system uses confidence scores for speech recognition to filter out low quality recognition results. In the process of acquiring user inquiry information, privacy protection principles must be strictly adhered to, secure storage and transmission of user data, and relevant laws and regulations and privacy policies must be adhered to when processing data. The information entered by the user is stored in a suitable database or data structure for subsequent processing and analysis. This information is passed on to the next part of the system for further processing and answer generation.
S108, extracting weight characters in the user inquiry information;
The intelligent customer service system receives inquiry information input by a user, wherein the inquiry information is usually a character string, and can contain various characters such as letters, numbers, symbols and the like. The user input is preprocessed, including removal of unnecessary blank characters, as well as HTML tags or special characters. Preprocessing is used to ensure that the subsequently processed data is clean, standardized text data. The intelligent customer service system maintains a predefined weight word library that contains a series of words that are considered to be of higher importance in the user query. Traversing the preprocessed user query information, and checking whether words or phrases in the user query information are matched with words in the weight word stock one by one. If the matching is successful, the word or phrase is marked as weight text, and the position and the occurrence number of the word or phrase are recorded. After extracting all the matching weighted words, the intelligent customer service system may generate a list or data structure containing these words and their associated information. The intelligent customer service system may be used for further analysis based on the results generated or for directly answering the user's questions. And finally, the system outputs the extracted weight text and related information thereof to a user interface.
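A toy sketch of this predefined weight-word matching follows; the lexicon contents and the function name are invented for illustration.

```python
WEIGHT_LEXICON = {"refund", "delivery", "warranty"}  # invented entries

def match_weight_words(tokens):
    """Return each matched weight word with the positions where it occurs."""
    hits = {}
    for pos, tok in enumerate(tokens):
        if tok in WEIGHT_LEXICON:
            hits.setdefault(tok, []).append(pos)
    return hits

print(match_weight_words("when will my refund arrive i need the refund".split()))
# -> {'refund': [3, 8]}
```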
S109, inputting the weight text into the target multi-mode big model so that the target multi-mode big model obtains associated information according to the weight text, wherein the associated information comprises text, video, voice and image information corresponding to the weight text;
The intelligent customer service system firstly inputs the weight characters into the target multi-mode large model. The weight text herein refers to one or more keywords, phrases or sentences, each of which has a specific weight. The multi-mode large model firstly analyzes the input weight words and understands the meaning and the context. Through natural language processing technology, the auxiliary model accurately grasps the intention of input information. Based on the parsed weight words, the multimodal big model searches its internal database or externally connected database for multimodal information associated with these words. Including various forms of information such as text, video, voice, and images. If the weight information is provided explicitly at the time of input or the multi-modal large model can identify the weight difference, the multi-modal large model takes the weights into consideration when retrieving and selecting the associated information, and preferentially returns multi-modal information more relevant to words with higher weights. The multimodal big model collates and generates a multimodal information set associated with the weighted words. The multimodal information collection includes directly related text segments, related video links or summaries, links or transcribed text of speech segments, and images or charts related to the input text.
S110, preprocessing the associated information to generate a feature vector;
The associated information is preprocessed by removing noise and extraneous content, such as advertisements, duplicate information and invalid links. The associated information is converted into a standardized format, such as unified text encoding and adjusted picture sizes, and is classified into categories such as text, video, voice and image. The information is indexed, and keywords, topics or abstracts are extracted to facilitate subsequent processing. If the associated information involves multiple modalities, their consistency in time or space must be ensured for the subsequent fusion process. For textual information, features can be extracted using methods such as the bag-of-words model, TF-IDF and word embeddings; these methods convert text into vector form, capturing the key information in the text. For image information, features can be extracted using convolutional neural networks, which automatically learn local and global features in the image and represent them as vectors. For voice information, features can be extracted using methods such as mel-frequency cepstral coefficients and linear predictive coding, which capture pitch and cadence information in speech. For video information, features such as key frames, motion trajectories and color histograms can be extracted; these features reflect the content and dynamic information of the video. The extracted features are converted into unified feature representations such as one-hot codes, word vectors and image feature vectors. The feature vectors are normalized to ensure that they have the same scale in subsequent processing. The feature vectors of different modes are combined to form a comprehensive feature vector, which contains the multiple modality information in the associated information and provides a rich feature representation for subsequent processing.
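As one example of the textual branch, the sketch below derives TF-IDF feature vectors with scikit-learn (one of the methods named above); the sample documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["lightweight laptop with long battery life",
        "gaming laptop with fast graphics card"]
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs)  # one sparse row vector per document
print(vectors.shape)                       # (2, number of distinct terms)
print(vectorizer.get_feature_names_out())
```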
S111, obtaining a similarity value between the feature vector and the weight text according to a similarity algorithm;
The similarity value between the feature vector and the weight text is obtained according to a similarity algorithm, and the working principle of the step is mainly based on a vector space model and a similarity calculation method. Feature vectors are extracted from the associated information by a preprocessing step, the feature vectors being used to represent core content or key features of the information. A feature vector is a point in a high-dimensional space, each dimension corresponding to a feature, the value of the dimension representing the importance or weight of the feature in the information. Weighted words refer to words having a particular weight or importance that are generally closely related to the target of the query or analysis. The weighted words may be converted into feature vector form for comparison with feature vectors of the associated information.
The similarity algorithm may be a Tanimoto similarity algorithm, where the Tanimoto similarity algorithm is defined as:

T(X, Y) = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}

wherein X and Y are two sets, X ∩ Y is the intersection of X and Y, and |X| and |Y| are the numbers of elements in X and Y, respectively.
First, the two sets X and Y to be compared are constructed, and their intersection X ∩ Y is computed; the intersection contains the elements that appear in both X and Y. The element counts |X| and |Y| of the two sets are then calculated. According to the definition of Tanimoto similarity, the size of the intersection is divided by the sum of the two element counts minus the size of the intersection, giving the similarity between the two sets. Finally, the calculated similarity value is output; it is a real number between 0 and 1 indicating the degree of similarity between the two sets. The closer the value is to 1, the more similar the two sets are; the closer the value is to 0, the more dissimilar they are. The Tanimoto similarity algorithm has a simple calculation process, is suitable for processing large-scale data sets, and markedly improves the efficiency of data processing. By relating the size of the set intersection to the total number of elements, it accurately reflects the degree of similarity between two sets.
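The formula above transcribes directly into Python; the example sets are illustrative only.

```python
def tanimoto(x, y):
    """Tanimoto similarity between two sets, per the formula above."""
    x, y = set(x), set(y)
    inter = len(x & y)
    return inter / (len(x) + len(y) - inter) if (x or y) else 1.0

print(tanimoto({"price", "laptop", "fast"}, {"laptop", "fast", "light"}))  # 0.5
```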
S112, arranging the similarity values in sequence to obtain a sequencing result;
A set of similarity values is input to the multi-modal large model for ranking. Each similarity value is calculated by comparing a feature vector with the weighted text, and a suitable sorting algorithm is selected according to specific requirements and performance considerations. When the sorting is completed, a list or array containing the sorted similarity values is generated, with elements arranged by the size of the similarity values; the sorted similarity value set is output for subsequent data analysis, visualization or decision support.
S113, extracting a first target similarity value in the sorting result;
The intelligent customer service system accesses the ordered similarity value set, and elements are arranged according to the order of the similarity values because the set is already ordered. In the sorted set, the first element of the sort is typically located in the first position of the set. In the sorting process, the algorithm has ensured that the element with the largest or smallest similarity value is at the beginning of the set. Then, the intelligent customer service system extracts and outputs the element from the first position of the set, wherein the output element is the first element in the sorted similarity value set, namely the target similarity value.
S114, determining target association information based on the target similarity value;
First, a first target similarity value in a similarity value set is input, wherein the first target similarity value represents similarity corresponding to a feature vector which is most similar to a weighted text. The similarity threshold is a preset value for judging whether the target similarity value is high enough to confirm the relevance between the corresponding feature vector and the weight text. By comparing the target similarity value with a preset similarity threshold. The comparison process is to determine whether the target similarity value reaches a sufficiently high level to confirm a strong correlation between the corresponding feature vector and the weighted text. If the target similarity value is greater than or equal to the similarity threshold value, the strong correlation exists between the corresponding feature vector and the weight text, and the high similarity between the information represented by the feature vector and the information represented by the weight text in content is indicated. Once it is confirmed that there is a strong correlation between the feature vector and the weight text, target correlation information can be determined from the feature vector and output as target correlation information.
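Steps S112 to S114 can be sketched together as follows; the similarity values and the threshold are invented for illustration.

```python
# Invented (information, similarity) pairs and an assumed preset threshold.
results = [("video A", 0.42), ("text B", 0.87), ("image C", 0.55)]
THRESHOLD = 0.6

ranked = sorted(results, key=lambda r: r[1], reverse=True)  # S112: sort descending
target_info, target_sim = ranked[0]                          # S113: top-ranked value
if target_sim >= THRESHOLD:                                  # S114: threshold check
    print("display:", target_info)   # -> display: text B
else:
    print("no sufficiently similar information found")
```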
S115, displaying the target associated information.
In this embodiment, this step first selects an appropriate presentation format according to the type of the target association information and the preference of the user by inputting the target association information determined in the previous step. And finally, presenting the processed target association information to a user.
According to the embodiment, by collecting and processing multi-modal data, a more comprehensive and accurate intelligent customer service model can be trained. The fusion of multi-modal data enables the model to more accurately understand user intent, thereby improving the accuracy and relevance of responses. Data enhancement increases the diversity of training data, helps the multi-modal large model learn more language patterns and features, and improves the generalization ability of the model when facing new user queries. Constructing the initialized multi-modal large model on the Transformer architecture exploits that architecture's parallel processing capability and self-attention mechanism, improving training efficiency and helping the model better understand and process complex user queries. By extracting the weighted words in the user query information and obtaining associated information from them, the method can provide a more personalized service experience. By calculating the similarity values between the feature vectors and the weighted text and sorting by similarity, the method preferentially displays the information most relevant to the user query, improving the efficiency of information acquisition and optimizing the overall user experience. The method supports text, voice, image and other interaction modes, so that users can choose the interaction mode best suited to their preferences and habits.
Referring to fig. 2, the present application provides an embodiment of a method for obtaining a target multi-modal large model, the embodiment comprising:
s201, acquiring a plurality of enhancement feature vectors of the enhancement data;
S202, inputting the enhanced feature vector into the initialized multi-mode large model to obtain a multi-mode feature vector;
s203, obtaining a prediction result according to the multi-mode feature vector;
s204, calculating the loss between the prediction result and the actual label by using a cross entropy loss function, and obtaining a loss value;
s205, acquiring updated parameters of the initialized multi-mode large model through a back propagation algorithm according to the loss value;
S206, updating the initialized multi-modal large model through the updating parameters to obtain the target multi-modal large model.
This embodiment describes a process for optimizing a multi-modal large model for prediction by training. The process involves the steps of enhanced data processing, feature extraction, model prediction, loss calculation, model parameter updating, etc. The intelligent customer service system firstly processes enhanced data, wherein the enhanced data is data obtained after preprocessing, the intelligent customer service system extracts a plurality of feature vectors from the data, and then the feature vectors are input into an initialized multi-mode large model. And after the multi-mode large model processes the input feature vector, outputting the multi-mode feature vector. The intelligent customer service system generates a prediction result by utilizing the multi-mode feature vector. The multi-mode feature vector is transmitted to an output layer of the model, and a prediction result is obtained through calculation. To evaluate the predictive performance of the model, the system uses a cross entropy loss function to calculate the difference between the predicted result and the actual label. The cross entropy loss function is a method for measuring the difference between two probability distributions, and by calculating a loss value, the intelligent customer service system can quantify the error degree of model prediction.
Calculating the loss between the prediction result and the actual label by using a cross entropy loss function, and obtaining a loss value;
The cross entropy loss function is:

L = -\frac{1}{N}\sum_{i=1}^{N} y_i \log(p_i)

where N is the number of categories, y_i is the one-hot encoding of the actual label, and p_i is the model-predicted probability that the sample belongs to the i-th category.
The inputs to this calculation are the model's prediction result and the actual label. The loss value is calculated using the cross entropy loss function and represents the difference between the prediction result and the actual label. The cross entropy loss function measures the difference between two probability distributions; in the present embodiment, these two distributions are the model's prediction p_i and the actual label y_i. First, the prediction result and the actual label of the target multi-modal large model are obtained, and the actual label is converted into a one-hot encoding. According to the formula of the cross entropy loss function, the loss for each category is calculated, the losses of all categories are summed, and the sum is divided by the category number N to obtain the final loss value L, which is used to evaluate the performance of the model. During training, the target multi-modal large model continuously adjusts its parameters to minimize the loss value L. The gradient of the loss value L is used by the back propagation algorithm to update the weights and biases of the model. To minimize L, an optimization algorithm is typically used to update the model parameters: in each iteration, the current loss value L is computed and the parameters are updated according to their gradients.
After obtaining the loss value, the intelligent customer service system updates the parameters of the model by using a back propagation algorithm. The error signal is counter-propagated according to the loss values, the gradient of each parameter is calculated by the chain law, and the parameters of the model are updated according to the gradients. And finally, updating the initialized multi-modal large model by the intelligent customer service system by using the updated parameters, thereby obtaining the optimized target multi-modal large model. After the multi-mode large model is trained, input data can be processed more accurately and a prediction result can be generated.
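A minimal sketch of one such training step in PyTorch is shown below, with a small linear model standing in for the multimodal large model; the optimizer, learning rate and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(32, 5)  # stand-in for the multimodal large model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(features, labels):
    optimizer.zero_grad()
    logits = model(features)                # prediction scores
    loss = F.cross_entropy(logits, labels)  # softmax + cross-entropy in one call
    loss.backward()                         # backpropagate gradients
    optimizer.step()                        # update parameters
    return loss.item()

x, y = torch.randn(8, 32), torch.randint(0, 5, (8,))
print(train_step(x, y))
```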
According to the embodiment, the method has the beneficial effects that the diversity and complexity of training data can be increased by using the enhanced data, so that the multi-mode large model is helped to learn more features and modes, and the accuracy of the model in the actual task processing process is helped. The multi-mode large model in the embodiment can fuse data features from different modes to generate multi-mode feature vectors, so that the model can more comprehensively understand input data, and accuracy and relevance of a prediction result are improved. The cross entropy loss function can accurately measure the difference between the model prediction result and the actual label, and a clear direction is provided for model optimization. By minimizing the loss value, the model can gradually adjust the parameters, and the prediction accuracy is improved. The loss value can directly reflect the performance of the model, so that the training progress can be estimated by monitoring the change of the loss value, and the training strategy can be timely adjusted. The cross entropy loss function is particularly suitable for multi-classification tasks, can process data sets with multiple categories, and outputs the prediction probability of each category, so that the model is more flexible and accurate in processing complex multi-classification problems. In the training process, even if partial error labels or noise data exist, the cross entropy loss function can be utilized to guide the model to learn correct data distribution, so that the generalization capability of the model is improved.
Referring to fig. 3, the present application provides an embodiment of a method for performing data duplication checking on multi-mode data and obtaining duplication checking data, which includes:
S301, cleaning the multi-mode data to remove invalid, incomplete or wrong data;
S302, classifying the cleaned multi-modal data, wherein the multi-modal data comprises text data, image data, audio data and video data;
S303, performing duplicate checking on the text data by adopting a first processing mode to acquire first duplicate-checking data;
s304, performing duplicate checking on the image data by adopting a second processing mode to acquire second duplicate checking data;
s305, performing duplicate checking on the audio data by adopting a third processing mode to acquire third duplicate checking data;
S306, performing duplicate checking on the video data by adopting a fourth processing mode to obtain fourth duplicate-checking data;
S307, merging the first duplicate-checking data, the second duplicate-checking data, the third duplicate-checking data and the fourth duplicate-checking data to generate a duplicate-checking data set.
This embodiment obtains a duplicate-free, high-quality data set by performing a series of preprocessing and per-category duplicate-checking operations on the multi-modal data, ensuring the integrity and accuracy of the data; a duplicate-checking mode specific to each modality effectively removes repeated content, and finally the duplicate-checking results of all modalities are merged into a unified duplicate-checking data set. The cleaned data is classified according to its modality; in this embodiment, the multi-modal data is divided into four major categories: text data, image data, audio data and video data. Cleaning the multi-modal data mainly removes invalid, incomplete or wrongly formatted data: invalid data can include null values, missing values or data that is clearly not as expected; incomplete data refers to data whose information is insufficient for subsequent analysis; and format errors can be caused by encoding, storage or transmission problems. The first processing mode is text duplicate checking, which uses techniques such as character string matching, semantic similarity calculation and hashing. String matching can be based on direct comparison of keywords, phrases or whole documents; semantic similarity calculation uses natural language processing to evaluate the similarity of meaning between texts; and hashing converts texts into fixed-length hash values and judges whether texts are repeated by comparing these values. The second processing mode is image duplicate checking, which mainly uses image feature extraction and matching. Feature extraction methods include color histograms, edge detection, texture analysis and feature vectors extracted by deep learning models; duplicate checking is completed by calculating the similarity between the feature vectors. The third processing mode is audio duplicate checking, which includes audio feature extraction and audio fingerprint generation and matching. By analyzing the waveform, spectrum and other characteristics of the audio, a unique audio fingerprint is generated; by comparing the fingerprints of different audio, the system can identify duplicate audio content. The fourth processing mode is video duplicate checking, which often combines image and audio duplicate-checking techniques: key frames and audio features are extracted from the video and checked separately.
As can be seen from the above embodiments, the beneficial effects brought by the scheme are that invalid, incomplete or wrong data are effectively removed through the data cleaning stage, and the accuracy and reliability of the data used in subsequent analysis are ensured. Repeated items in the data set are reduced through the duplicate checking process, the influence of redundant information on analysis results is avoided, and the overall quality of the data is improved. The data is classified, so that each type of data can adopt the most suitable duplicate checking method, and the problem of low efficiency possibly caused by processing multiple types of data by a single method is avoided. And the targeted duplicate checking processing reduces unnecessary calculation overhead and improves the overall efficiency of data processing.
Referring to fig. 4, the present application provides an embodiment of a method for extracting weight text in user query information, which includes:
S401, extracting words in the user inquiry information as nodes;
S402, constructing an undirected graph according to the co-occurrence relation of the vocabulary in the user inquiry information;
S403, calculating a weight value of each vocabulary according to the co-occurrence times of the vocabulary in the undirected graph;
S404, extracting weight characters in the user inquiry information according to the weight values.
First, user inquiry information is preprocessed, such as word segmentation, stop word removal, etc., to acquire words constituting the inquiry information. And taking the words as nodes to provide a foundation for the subsequent construction of an undirected graph, and constructing an undirected graph according to the co-occurrence relation of the extracted words in the user query information, wherein in the undirected graph, if two words simultaneously appear in the user query information, one edge is connected between the two words, the relevance between the words can be reflected, and the basis is provided for the subsequent calculation of the weight value. For each node in the undirected graph, its weight value is calculated according to its number of co-occurrences with other nodes. The more co-occurrence times, the higher the importance of the vocabulary in the user query information, and therefore the greater the weight value. And sorting the vocabulary in the user inquiry information according to the calculated weight value, and selecting the vocabulary with higher weight value as weight text for reflecting the focus or key information of the user inquiry. The extracted weight text can be used for subsequent tasks such as analysis, processing or answer generation, and the like, so that the accuracy and efficiency of the system are improved.
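A simplified sketch of this co-occurrence weighting follows: words become nodes, co-occurrence within a sliding window adds edge weight, and each word's weight is the sum of its edge weights. The window size, the whitespace tokenization and the absence of stop-word filtering are simplifying assumptions.

```python
from collections import defaultdict
from itertools import combinations

def weight_words(query, window=3, top_k=3):
    """Score words by co-occurrence within a sliding window; return the top k."""
    tokens = query.lower().split()
    weight = defaultdict(int)
    for i in range(len(tokens)):
        for a, b in combinations(tokens[i:i + window], 2):
            if a != b:                 # an edge between two distinct word nodes
                weight[a] += 1
                weight[b] += 1
    return sorted(weight, key=weight.get, reverse=True)[:top_k]

print(weight_words("cheap light laptop for travel cheap laptop"))
```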
According to the embodiment, the method has the beneficial effects that by considering the co-occurrence relation of the vocabularies in the query information of the user, the relevance among the vocabularies can be more accurately captured, and the key information reflecting the intention of the user can be more accurately extracted. The calculation of the weight value can improve the identification of the key information, so that the extracted weight text accords with the focus and the key point of the user inquiry. The method realizes the rapid processing of a large number of user inquiry information by constructing the undirected graph and calculating the weight value, and improves the information processing efficiency. The extracted weight text can be directly used for subsequent tasks such as analysis, processing or answer generation, unnecessary information screening and filtering steps are reduced, and the processing efficiency is further improved.
In the above embodiment, a softmax classification algorithm is configured in the multimodal large model, the softmax classification algorithm being defined as:

P(y_i = c) = \frac{e^{s_{i,c}}}{\sum_{c'} e^{s_{i,c'}}}

wherein c represents the classification category, P(y_i = c) represents the probability that sample i belongs to category c, s_{i,c} is the score of sample i for category c, and \sum_{c'} e^{s_{i,c'}} represents the exponential summation of the scores over all categories.
In the multi-modal large model, a softmax classification algorithm is generally used in an output layer of the model to convert the processing result of the multi-modal large model on input data of multiple modes such as text, image, audio and the like into probability distribution. The multi-modal large model receives input data from different modalities, which are converted to classification scores after processing inside the multi-modal large model. The classification score is processed by a softmax function and converted into a probability distribution. And selecting the category with the highest probability as a final classification result according to the probability distribution.
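A numerically stable transcription of the softmax definition above; the sample scores are made up for illustration.

```python
import math

def softmax(scores):
    """Convert per-class scores into probabilities, per the definition above."""
    m = max(scores)                            # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))  # three probabilities summing to 1
```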
The embodiment shows that the Softmax classification algorithm has the beneficial effects that the output of the multi-mode large model can be converted into probability distribution by the Softmax classification algorithm, so that the classification result is more visual and accurate. By comparing the probabilities of different categories, the multi-mode large model can select the category with the highest probability as a prediction result, so that the classification accuracy is improved. The Softmax classification algorithm is suitable for processing multi-mode data, and can combine the characteristics and information of different modes to carry out joint classification. In the model training process, the Softmax classification algorithm can be combined with the cross entropy loss function algorithm to realize efficient model training. Through continuous iterative optimization, the classification performance of the multi-mode large model can be improved.
Referring to fig. 5, the application further provides a training device for intelligent customer service of the multi-mode large model, which comprises:
a collecting unit 501, configured to collect multi-modal data including text, speech, and written words;
a duplication checking unit 502, configured to perform data duplication checking on the multi-modal data to obtain duplication-checked data;
a first preprocessing unit 503, configured to preprocess the duplication-checked data to obtain preprocessed data;
a construction unit 504, configured to construct an initialized multi-modal large model based on a Transformer architecture;
an acquiring unit 505, configured to acquire user query information;
a first extracting unit 506, configured to extract the weight text in the user query information;
an input unit 507, configured to input the weight text into the target multi-modal large model, so that the target multi-modal large model obtains association information according to the weight text, where the association information includes text, video, voice, and image information corresponding to the weight text;
a second preprocessing unit 508, configured to preprocess the association information to generate feature vectors;
a similarity calculating unit 509, configured to obtain a similarity value between each feature vector and the weight text according to a similarity algorithm (see the sketch following this unit list);
a sorting unit 510, configured to sort the similarity values in order to obtain a sorting result;
a second extracting unit 511, configured to extract the first-ranked target similarity value from the sorting result;
a determining unit 512, configured to determine target association information based on the target similarity value;
an output unit 513, configured to present the target association information.
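As one possible reading of units 508 to 511, the following minimal sketch embeds the association information as feature vectors, scores each vector against the weight text with cosine similarity (a common choice; the embodiment does not fix the similarity algorithm), sorts the similarity values in descending order, and returns the first-ranked item. The vocabulary and bag-of-words embedding are hypothetical stand-ins for the real preprocessing unit.

```python
# A hypothetical sketch of units 508-511: bag-of-words embedding, cosine
# similarity, descending sort, and first-ranked selection; the vocabulary
# and embedding are illustrative stand-ins, not the patented method.
import numpy as np

VOCAB = ["refund", "status", "billing", "video", "login", "help", "faq", "text"]

def embed(item: str) -> np.ndarray:
    # Second preprocessing unit 508 (stand-in): item -> feature vector.
    tokens = item.lower().split()
    return np.array([float(tokens.count(w)) for w in VOCAB])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Similarity calculating unit 509: cosine similarity of two vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def select_target(weight_text: str, associated: list[str]) -> str:
    query_vec = embed(weight_text)
    scored = [(cosine(embed(item), query_vec), item) for item in associated]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sorting unit 510
    return scored[0][1]  # second extracting unit 511: first-ranked item

print(select_target("refund status", ["refund faq text", "billing video", "login help"]))
# -> "refund faq text": highest cosine similarity with the weight text
```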
Referring to fig. 6, the application further provides a training device for intelligent customer service of the multi-mode large model, which comprises:
a processor 601, a memory 602, an input/output unit 603, and a bus 604;
the processor 601 is connected to the memory 602, the input-output unit 603, and the bus 604;
the memory 602 holds a program, which the processor 601 invokes to perform any of the methods described above.
The application further provides a computer-readable storage medium having a program stored thereon which, when run on a computer, causes the computer to perform any of the methods described above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The storage medium includes a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or various other media capable of storing program code.

Claims (10)

CN202510095877.1A — 2025-01-22 — Intelligent customer service training method and device for multi-mode large model and storage medium — Active — granted as CN119577459B (en)

Priority Applications (1)

Application Number: CN202510095877.1A
Priority Date: 2025-01-22
Filing Date: 2025-01-22
Title: Intelligent customer service training method and device for multi-mode large model and storage medium

Publications (2)

CN119577459A, published 2025-03-07
CN119577459B, published 2025-04-22

Family ID: 94805833

Country Status (1)

CN: CN119577459B (en)




Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
