Disclosure of Invention
In view of the above, embodiments of the present invention provide a multi-modal psychological disease diagnosis method, a computer device, and a storage medium.
An embodiment of the invention provides a multi-modal psychological disease diagnosis method, which comprises the following steps:
acquiring audio-visual data of a user answering a predetermined diagnostic question;
acquiring text data of the user answering the predetermined diagnostic question;
obtaining a first probability of a psychological disease state from the audio-visual data based on a first diagnosis model;
obtaining a second probability of the psychological disease state from the text data based on a second diagnosis model;
and obtaining a final probability of the user's psychological disease state from the first probability and the second probability.
In this way, in the psychological disease diagnosis method according to the embodiment of the present application, the audio-visual data and the text data of the user answering the predetermined diagnostic question are acquired, the preset diagnosis models are used to obtain probabilities of the user's psychological disease state from the respective data, and the two probabilities are combined, so that the final disease probability of the user can be determined. Because the diagnosis is model-based, the diagnosis result is scientific and objective, and the method is broadly applicable, so that psychological diseases can be diagnosed across a large population, thereby meeting the diagnosis needs of psychological disease patients under limited resource conditions.
In some embodiments, the obtaining audio-visual data of the user answering the predetermined diagnostic question comprises:
acquiring video data of a user for answering a predetermined diagnosis question;
extracting image data of a user from the video data;
and extracting audio data of the user from the video data.
Therefore, by acquiring the video data of the user answering the predetermined diagnostic questions, the image data and the audio data can be extracted, providing a data basis for the subsequent psychological disease diagnosis.
In some embodiments, the deriving a first probability of a psychological disease state from the audiovisual data based on a first diagnostic model comprises:
extracting user facial feature information from the image data to form a facial feature vector matrix.
In this way, extracting the user's facial feature information from the image data forms a plurality of facial feature vectors of the same dimension, so that the user's psychological disease state can be analysed from the features of each part of the face. These vectors form a facial feature vector matrix, which is supplied to the first diagnosis model and provides a facial feature basis for obtaining the first probability of the user's psychological disease state.
In some embodiments, the obtaining a first probability of a psychological disease state from the audiovisual data based on the first diagnostic model includes:
performing emotion classification on the audio data based on a preset emotion classification model to form an emotion feature vector matrix.
Therefore, extracting the user's emotional feature information from the audio data forms a plurality of emotion feature vectors of the same dimension, which in turn form an emotion feature vector matrix. This matrix is supplied to the first diagnosis model and provides an emotional feature basis for obtaining the first probability of the user's psychological disease state.
In some embodiments, the deriving a first probability of a psychological disease state from the audiovisual data based on a first diagnostic model comprises:
mapping the dimensions of the emotion feature vector matrix to those of the facial feature vector matrix, so that the mapped emotion feature vector matrix has the same dimension as the facial feature vector matrix.
Therefore, the emotion feature vector matrix is mapped to a feature vector matrix of the same dimension as the facial feature vector matrix, providing a basis for vector splicing and fusion of the two matrices.
In some embodiments, the deriving a first probability of a psychological disease state from the audiovisual data based on a first diagnostic model comprises:
performing vector splicing and fusion on the mapped emotion feature vector matrix and the facial feature vector matrix;
and inputting the spliced feature vector matrix into the first diagnosis model to obtain the first probability of the psychological disease state.
Therefore, the emotion feature vector matrix and the facial feature vector matrix are spliced into a single matrix, so that the first diagnosis model can jointly consider changes in the user's emotional features and facial features when analysing the psychological disease state, yielding a combined image-and-audio model result and hence the first probability of the psychological disease state.
In some embodiments, said obtaining text data of said user answering said predetermined diagnostic question comprises:
performing text extraction processing on the audio-visual data of the user answering the predetermined diagnostic question to obtain the text data.
Therefore, after the user's audio-visual data are obtained, the audio data therein are converted into text data. This firstly avoids confounding factors that differ between individuals, such as accent, speech rate, and intonation; secondly, the meaning the user intends to express can be judged from the semantics.
In some embodiments, the deriving a second probability of a psychological disease state from the textual data based on a second diagnostic model comprises:
extracting the content of the text data;
and performing disease matching according to the extracted content to obtain a second probability of the psychological disease state.
In this way, content extraction retains only the text of the user's answers, so that after disease matching is performed on the text data, a second probability characterizing the user's psychological disease state can be obtained.
In some embodiments, said deriving a final probability of the user's mental disease state from the first and second probabilities comprises:
performing a fusion calculation on the first probability and the second probability according to predetermined weights to obtain the final probability of the user's psychological disease state.
Therefore, the first probability and the second probability are fused according to the predetermined weights to obtain the final diagnosis result. The predetermined weights can be adjusted according to actual conditions so that the final diagnosis result matches reality, thereby achieving the purpose of diagnosing the user's psychological disease state.
The invention further provides a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the method of any of the embodiments above.
Therefore, the computer device of the present application acquires the audio-visual data and text data of the user answering the predetermined diagnostic question, uses the preset diagnosis models to obtain the respective probabilities of the user's psychological disease state, and combines the two probabilities, thereby determining the final disease probability of the user. Because the diagnosis is model-based, the diagnosis result is scientific and objective, and the method is broadly applicable, so that psychological diseases can be diagnosed across a large population, thereby meeting the diagnosis needs of psychological disease patients under limited resource conditions.
The present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the processors to perform the method.
Therefore, in the present application, the audio-visual data and the text data of the user answering the predetermined diagnostic questions are obtained, the preset diagnosis models are used to obtain probabilities of the user's psychological disease state from the respective data, and the two probabilities are combined, so that the final disease probability of the user can be determined. Because the diagnosis is model-based, the diagnosis result is scientific and objective, and the method is broadly applicable, so that psychological diseases can be diagnosed across a large population, thereby meeting the diagnosis needs of psychological disease patients under limited resource conditions.
Additional aspects and advantages of embodiments of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Referring to fig. 1, the present application provides a method for multi-modal diagnosis of psychological diseases, comprising:
S10: acquiring audio-visual data of a user answering a predetermined diagnostic question;
S20: acquiring text data of the user answering the predetermined diagnostic question;
S30: obtaining a first probability of a psychological disease state from the audio-visual data based on the first diagnosis model;
S40: obtaining a second probability of the psychological disease state from the text data based on the second diagnosis model;
S50: obtaining a final probability of the user's psychological disease state from the first probability and the second probability.
The application also provides a computer device, by which the psychological disease diagnosis method can be implemented. The computer device comprises a memory and a processor, the memory storing a computer program. The processor is configured to acquire audio-visual data of a user answering a predetermined diagnostic question and text data of the user answering the predetermined diagnostic question, to obtain a first probability of a psychological disease state from the audio-visual data based on a first diagnosis model, to obtain a second probability of the psychological disease state from the text data based on a second diagnosis model, and to obtain a final probability of the user's psychological disease state from the first probability and the second probability. The computer device in the present application may be a medical device with an audio-visual diagnosis function.
Specifically, the multi-modal psychological disease diagnosis method is implemented at least on the basis of audio-visual data and text data collected while the user answers predetermined diagnostic questions. The audio-visual data can be obtained from a recording of the user answering diagnostic questions predetermined by a professional psychotherapist. The length of the video data is variable and depends on the recording, and is not limited herein. The text data may be obtained by converting the audio data in the audio-visual data into text using Automatic Speech Recognition (ASR). The predetermined diagnostic questions may concern the user's age, occupation, and character, or the issues that trouble the user most and that the user most wishes to resolve, for example, "How has your sleep quality been lately?" or "What was the happiest thing for you recently?", and may be configured as needed, without limitation herein. The first diagnosis model can be generated by collecting facial, behavioural, and emotional features of a large number of people and performing label learning, and then yields the first probability of the psychological disease state from the user's audio-visual data. For example, the model can be trained on the facial, behavioural, and emotional features of ten thousand people, each labelled with a degree of depression; after training, the model can judge the degree of depression. When a piece of video data with an unknown degree of depression is fed to the model, the model returns the most probable degree of depression for that video.
The second diagnosis model may be generated by collecting text features of a large number of people and performing label learning, and then yields the second probability of the psychological disease state from the user's text data. For example, the model can be trained on the text features of ten thousand people, each labelled with a degree of depression; after training, the model can judge the degree of depression. When a piece of text data with an unknown degree of depression is fed to the model, the model returns the most probable degree of depression for that text. The accuracy of model identification depends on the degree of training on the collected feature material and on whether the model's structure is reasonable. The first and second probabilities can each represent the respective probabilities of different degrees of depression; for example, each can be expressed as (no depression a1, mild depression a2, moderate depression a3, major depression a4), where a1 + a2 + a3 + a4 = 1.
It can be understood that the first probability and the second probability are calculated by different models with different emphases, so a fusion calculation of the two is needed to obtain the final probability of the user's psychological disease state. The fusion calculation may be, for example, an arithmetic mean or a weighted average, and is not limited herein.
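As a minimal sketch of the arithmetic-mean option (the function name and the example numbers are hypothetical, not taken from the embodiment), the two four-way depression-severity distributions described above can be fused element-wise:

```python
# Hypothetical sketch: fuse the two 4-way depression-severity
# distributions (no / mild / moderate / major) by arithmetic mean,
# one of the fusion options mentioned above.

def fuse_mean(p_av, p_text):
    """Element-wise arithmetic mean of two probability distributions."""
    assert len(p_av) == len(p_text)
    return [(a + b) / 2.0 for a, b in zip(p_av, p_text)]

p1 = [0.6, 0.2, 0.1, 0.1]  # first probability, from the audio-visual model
p2 = [0.4, 0.4, 0.1, 0.1]  # second probability, from the text model
p = fuse_mean(p1, p2)      # fused distribution, still sums to 1
```

The fused vector remains a probability distribution because the mean of two vectors that each sum to 1 also sums to 1.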
In summary, in the multi-modal psychological disease diagnosis method and the computer device according to the embodiments of the present application, the audio-visual data and the text data of the user answering the predetermined diagnostic question are obtained, the preset diagnosis models are used to obtain the respective probabilities of the user's psychological disease state, and the two probabilities are combined, so that the final disease probability of the user can be determined. Because the diagnosis is model-based, the diagnosis result is scientific and objective, and the method is broadly applicable, so that psychological diseases can be diagnosed across a large population, thereby meeting the diagnosis needs of psychological disease patients under limited resource conditions.
Referring to fig. 2, in some embodiments, S10 includes:
S11: acquiring video data of a user answering a predetermined diagnostic question;
S12: extracting image data of the user from the video data;
S13: extracting audio data of the user from the video data.
In some embodiments, the processor is configured to obtain video data of the user answering a predetermined diagnostic question, and to extract image data of the user from the video data, and to extract audio data of the user from the video data.
Specifically, in this step, the video data may be an audio-visual recording of the user answering diagnostic questions predetermined by a professional psychotherapist. Image data and audio data of the user can be extracted from the video data; the image data may be extracted frame by frame. The audio data provides emotional features for the first diagnosis model and text data for the second diagnosis model.
Therefore, by acquiring the video data of the user answering the predetermined diagnostic questions, the image data and the audio data can be extracted, providing a data basis for the subsequent psychological disease diagnosis.
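The embodiment does not prescribe a tool for steps S12 and S13; one common choice is the ffmpeg command-line tool. The sketch below (file names and parameters are illustrative assumptions) only builds the argument lists; running them requires ffmpeg to be installed:

```python
# Hypothetical sketch of S12/S13 using the ffmpeg CLI; only the
# command argument lists are constructed here.

def frame_extract_cmd(video_path, out_pattern, fps=1):
    # extract one image per second (frame-by-frame image data)
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", out_pattern]

def audio_extract_cmd(video_path, wav_path):
    # strip video (-vn), produce mono 16 kHz WAV, a common speech-model input
    return ["ffmpeg", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path]

# e.g. subprocess.run(frame_extract_cmd("answer.mp4", "frame_%04d.png"))
#      subprocess.run(audio_extract_cmd("answer.mp4", "answer.wav"))
```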
Referring to fig. 3, in some embodiments, S30 includes:
S31: extracting user facial feature information from the image data to form a facial feature vector matrix.
In some embodiments, the processor is configured to extract user facial feature information from the image data to form a facial feature vector matrix.
Specifically, the key information related to the human face in the image data may be extracted as facial feature vectors. The facial features may be the eyes, nose, mouth, eyebrows, cheeks, chin, and so on, so that a feature vector is formed for each feature, and a plurality of facial feature vectors of the same dimension form the facial feature vector matrix.
In this way, extracting the user's facial feature information from the image data forms a plurality of facial feature vectors of the same dimension, so that the user's psychological disease state can be analysed from the features of each part of the face. These vectors form a facial feature vector matrix, which is supplied to the first diagnosis model and provides a facial feature basis for obtaining the first probability of the user's psychological disease state.
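The stacking described above can be sketched as follows; the per-part feature values are invented placeholders, since the embodiment does not specify how each part's vector is computed:

```python
# Minimal sketch of S31 (feature values are hypothetical): each facial
# part yields one feature vector of the same dimension; stacking them
# row by row forms the facial feature vector matrix.
import numpy as np

part_features = {
    "eyes":     [0.12, 0.80, 0.05],
    "mouth":    [0.40, 0.10, 0.33],
    "eyebrows": [0.25, 0.60, 0.90],
}

# one row per facial part, all rows of equal dimension
face_matrix = np.array(list(part_features.values()))
```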
Referring to fig. 3, in some embodiments, S30 includes:
S32: performing emotion classification on the audio data based on a predetermined emotion classification model to form an emotion feature vector matrix.
In some embodiments, the processor is configured to perform emotion classification on the audio data based on a predetermined emotion classification model to form an emotion feature vector matrix.
Specifically, the predetermined emotion classification model may classify the user's speech into several categories such as calm, anger, surprise, depression, happiness, fear, and sadness, each emotion being represented by a vector. For example, with these seven categories, calm can be expressed as (1, 0, 0, 0, 0, 0, 0), anger as (0, 1, 0, 0, 0, 0, 0), and so on; the vector for each emotion is a 1 × n matrix, and together they form the emotion feature vector matrix.
Therefore, extracting the user's emotional feature information from the audio data forms a plurality of emotion feature vectors of the same dimension, which in turn form an emotion feature vector matrix. This matrix is supplied to the first diagnosis model and provides an emotional feature basis for obtaining the first probability of the user's psychological disease state.
Referring to fig. 3, in some embodiments, S30 includes:
S33: mapping the dimensions of the emotion feature vector matrix to those of the facial feature vector matrix, so that the mapped emotion feature vector matrix has the same dimension as the facial feature vector matrix.
In some embodiments, the processor is configured to map the dimensions of the emotion feature vector matrix to those of the facial feature vector matrix, such that the mapped emotion feature vector matrix has the same dimension as the facial feature vector matrix.
Specifically, the emotion feature vector matrix and the facial feature vector matrix have different dimensions, so they cannot be merged and therefore cannot be subjected to matching analysis together. For example, the emotion feature vector matrix may be a 1 × 2 matrix, written as:
A = [0 1]
The facial feature vector matrix may be a 3 × 3 matrix (its entries b11 … b33 depend on the extracted facial features), written as:
B = [b11 b12 b13
     b21 b22 b23
     b31 b32 b33]
Further, mapping the emotion feature vector matrix to a feature vector matrix with the same width as the facial feature vector matrix can be expressed as:
A' = [0 1 0]
It should be noted that the emotion feature vector matrix A may be a 1 × p matrix and the facial feature vector matrix may be an m × n matrix; the mapped emotion feature vector matrix A' is then a 1 × n matrix, without limitation herein.
Therefore, the emotion feature vector matrix is mapped to a feature vector matrix of the same dimension as the facial feature vector matrix, providing a basis for vector splicing and fusion of the two matrices.
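One simple way to realize the mapping in S33 is zero-padding, which reproduces the example above (A = [0 1] becomes A' = [0 1 0]); a learned linear projection would be an alternative, and the function name here is an assumption:

```python
# Sketch of S33 via zero-padding: a 1 x p emotion vector is extended
# to 1 x n so its width matches the facial feature matrix's columns.

def map_to_width(vec, n):
    assert len(vec) <= n
    return vec + [0] * (n - len(vec))

A = [0, 1]                     # 1 x 2 emotion feature vector
A_mapped = map_to_width(A, 3)  # 1 x 3, matching a 3 x 3 face matrix
```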
Referring to fig. 3, in some embodiments, S30 includes:
S34: performing vector splicing and fusion on the mapped emotion feature vector matrix and the facial feature vector matrix;
S35: inputting the spliced feature vector matrix into the first diagnosis model to obtain the first probability of the psychological disease state.
In some embodiments, the processor is configured to perform vector splicing and fusion on the mapped emotion feature vector matrix and the facial feature vector matrix, and to input the spliced feature vector matrix into the first diagnosis model to obtain the first probability of the psychological disease state.
Specifically, after mapping, the emotion feature vector matrix and the facial feature vector matrix have the same width and can be spliced and fused into one feature vector matrix. The spliced feature vector matrix is input into the first diagnosis model, and the first probability of the psychological disease state is obtained through the matching analysis of the first diagnosis model. For example, the two matrices of matching width may be written as:
Emotion feature vector matrix:
A' = [0 1 0]
Facial feature vector matrix:
B = [b11 b12 b13
     b21 b22 b23
     b31 b32 b33]
Further, splicing the emotion feature vector matrix and the facial feature vector matrix into one fused matrix can be written as:
C = [0   1   0
     b11 b12 b13
     b21 b22 b23
     b31 b32 b33]
It should be noted that the matrix C is formed by splicing and fusing the mapped emotion feature vector matrix (a 1 × n matrix) and the facial feature vector matrix (an m × n matrix), so C may be an (m + 1) × n matrix, without limitation herein.
Therefore, the emotion feature vector matrix and the facial feature vector matrix are spliced into a single matrix, so that the first diagnosis model can jointly consider changes in the user's emotional features and facial features when analysing the psychological disease state, yielding a combined image-and-audio model result and hence the first probability of the psychological disease state.
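The splicing step S34 can be sketched with a row-wise stack; the facial matrix entries are numeric placeholders standing in for the b_ij values above:

```python
# Sketch of S34: stack the mapped 1 x n emotion row onto the m x n
# facial matrix, giving the (m+1) x n input for the first model.
import numpy as np

A_mapped = np.array([[0.0, 1.0, 0.0]])       # 1 x 3 mapped emotion vector
B = np.arange(9, dtype=float).reshape(3, 3)  # 3 x 3 stand-in facial matrix

C = np.vstack([A_mapped, B])                 # 4 x 3 spliced feature matrix
```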
Referring to fig. 4, in some embodiments, S20 includes:
S21: performing text extraction processing on the audio-visual data of the user answering the predetermined diagnostic question to obtain the text data.
In some embodiments, the processor is configured to perform text extraction processing on the audio/video data of the user answering the predetermined diagnostic question to obtain text data.
Specifically, in this step, the audio data are extracted from the audio-visual data of the user answering the predetermined diagnostic question. Converting the speech into text data can be done with ASR, which saves time and labour and is convenient and fast; manual transcription can also be used, which offers higher conversion accuracy.
Therefore, after the user's audio-visual data are obtained, the audio data therein are converted into text data. This firstly avoids confounding factors that differ between individuals, such as accent, speech rate, and intonation; secondly, the meaning the user intends to express can be judged from the semantics.
Referring to fig. 5, in some embodiments, S40 includes:
S41: extracting the content of the text data;
S42: performing disease matching on the extracted content to obtain the second probability of the psychological disease state.
In some embodiments, the processor is configured to perform content extraction on the text data and perform disease matching based on the extracted content to obtain a second probability of a psychological disease state.
Specifically, the acquired text may be content-extracted, deleting speech irrelevant to the user's answers, chiefly the predetermined diagnostic questions and the silent gaps between turns, so that the text data input into the second diagnosis model contains only the user's own words. The extracted text data may consist of several sentences arranged in chronological order. For example, if the raw text is "How has your sleep quality been lately? I have been unable to fall asleep for a long time. What was the happiest thing for you recently? Nothing has made me happy recently.", the extracted text data are "I have been unable to fall asleep for a long time." and "Nothing has made me happy recently.": the predetermined diagnostic questions are deleted, the user's answers are retained, and the extracted text consists of two sentences in chronological order. Based on the second diagnosis model, the chronologically ordered sentences can be disease-matched to obtain the second probability of the psychological disease state.
In this way, content extraction retains only the text of the user's answers, so that after disease matching is performed on the text data, a second probability characterizing the user's psychological disease state can be obtained.
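The filtering in S41 can be sketched as follows; since the embodiment does not specify a transcript format, a list of (speaker, sentence) pairs is assumed:

```python
# Hedged sketch of S41: keep only the user's sentences, in time order,
# discarding the predetermined diagnostic questions.

def extract_user_answers(transcript):
    return [text for speaker, text in transcript if speaker == "user"]

transcript = [
    ("system", "How has your sleep quality been lately?"),
    ("user",   "I have been unable to fall asleep for a long time."),
    ("system", "What was the happiest thing for you recently?"),
    ("user",   "Nothing has made me happy recently."),
]
answers = extract_user_answers(transcript)  # two sentences remain
```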
Referring to fig. 6, in some embodiments, S50 includes:
S51: performing a fusion calculation on the first probability and the second probability according to predetermined weights to obtain the final probability of the user's psychological disease state.
In some embodiments, the processor is configured to perform a fusion calculation of the first probability and the second probability according to a predetermined weight to obtain a final probability of the user's mental disease state.
In particular, the predetermined weights of the first and second probabilities may be configured according to how different users manifest the psychological disease state. For example, if a user shows no significant change in facial expression or emotion when answering the predetermined diagnostic questions, while the semantics of the answers are negative, sad, and full of negative energy, the weight of the first probability may be lowered and that of the second probability raised. The adjustment can be made in real time according to actual conditions, without limitation herein.
The final probability of the user's psychological disease state may be calculated from the first probability and the second probability according to the predetermined weights. For example, let the first probability be P1, the second probability P2, the final probability P, the weight of the first probability F, and the weight of the second probability 1 − F, where P1 ∈ [0, 1], P2 ∈ [0, 1], and F ∈ [0, 1]; then P = P1 × F + P2 × (1 − F).
Therefore, the first probability and the second probability are fused according to the predetermined weights to obtain the final diagnosis result. The predetermined weights can be adjusted according to actual conditions so that the final diagnosis result matches reality, thereby achieving the purpose of diagnosing the user's psychological disease state.
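The weighted fusion P = P1 × F + P2 × (1 − F) described above can be written directly (the example values are hypothetical):

```python
# Sketch of S51: weighted fusion of the two probabilities,
# P = P1*F + P2*(1 - F), with P1, P2, F all in [0, 1].

def fuse_weighted(p1, p2, f):
    assert 0.0 <= p1 <= 1.0 and 0.0 <= p2 <= 1.0 and 0.0 <= f <= 1.0
    return p1 * f + p2 * (1.0 - f)

# flat facial/emotional signal but strongly negative text: lower F
final_p = fuse_weighted(0.3, 0.8, 0.25)  # ~0.675
```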
The embodiment of the application also provides a computer readable storage medium. One or more non-transitory computer-readable storage media embodying computer-executable instructions which, when executed by one or more processors, cause the processors to perform the method of any of the embodiments described above.
As such, the present invention provides a non-transitory computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the processors to perform the psychological disease diagnosis method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, and the program can be stored in a non-volatile computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), or the like.
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.