Disclosure of Invention
In view of the above, embodiments of the present invention provide a multi-modal psychological disease diagnosis method, a computer device, and a storage medium.
An embodiment of the invention provides a multi-modal psychological disease diagnosis method, which comprises the following steps:
acquiring audio-visual data of a user answering a predetermined diagnostic question;
acquiring text data of the user answering the predetermined diagnostic question;
obtaining a first probability of a psychological disease state from the audio-visual data based on a first diagnosis model;
obtaining a second probability of the psychological disease state from the text data based on a second diagnosis model;
and obtaining a final probability of the user's psychological disease state from the first probability and the second probability.
In this way, in the psychological disease diagnosis method according to the embodiment of the present application, the audio-visual data and the text data of the user answering the predetermined diagnostic question are acquired, the preset diagnosis models are used to obtain probabilities of the user's psychological disease state from the respective data, and the two probabilities are combined, so that the final disease probability of the user can be determined. Because the diagnosis is model-based, the diagnosis result is scientific and objective, and the method is broadly applicable, so that psychological diseases can be diagnosed across a large population, thereby meeting the diagnosis needs of psychological disease patients under limited resource conditions.
In some embodiments, the obtaining audio-visual data of the user answering the predetermined diagnostic question comprises:
acquiring video data of a user for answering a predetermined diagnosis question;
extracting image data of a user from the video data;
and extracting audio data of the user from the video data.
Therefore, by acquiring the video data of the user answering the predetermined diagnostic questions, the image data and the audio data can be extracted, providing a data basis for the subsequent psychological disease diagnosis.
In some embodiments, the deriving a first probability of a psychological disease state from the audiovisual data based on a first diagnostic model comprises:
extracting user facial feature information from the image data to form a facial feature vector matrix.
In this way, extracting the user's facial feature information from the image data forms a plurality of facial feature vectors of the same dimension, so that the user's psychological disease state can be analysed from the features of each part of the face. These vectors form a facial feature vector matrix, which is supplied to the first diagnosis model and provides a facial feature basis for obtaining the first probability of the user's psychological disease state.
In some embodiments, the obtaining a first probability of a psychological disease state from the audiovisual data based on the first diagnostic model includes:
performing emotion classification on the audio data based on a preset emotion classification model to form an emotion feature vector matrix.
Therefore, extracting the user's emotional feature information from the audio data forms a plurality of emotion feature vectors of the same dimension, which in turn form an emotion feature vector matrix. This matrix is supplied to the first diagnosis model and provides an emotional feature basis for obtaining the first probability of the user's psychological disease state.
In some embodiments, the deriving a first probability of a psychological disease state from the audiovisual data based on a first diagnostic model comprises:
mapping the dimensions of the emotion feature vector matrix to those of the facial feature vector matrix, so that the mapped emotion feature vector matrix has the same dimension as the facial feature vector matrix.
Therefore, the emotion feature vector matrix is mapped to a feature vector matrix of the same dimension as the facial feature vector matrix, providing a basis for vector splicing and fusion of the two matrices.
In some embodiments, the deriving a first probability of a psychological disease state from the audiovisual data based on a first diagnostic model comprises:
performing vector splicing and fusion on the mapped emotion feature vector matrix and the facial feature vector matrix;
and inputting the spliced feature vector matrix into the first diagnosis model to obtain the first probability of the psychological disease state.
Therefore, the emotion feature vector matrix and the facial feature vector matrix are spliced into a single matrix, so that the first diagnosis model can jointly consider changes in the user's emotional features and facial features when analysing the psychological disease state, yielding a combined image-and-audio model result and hence the first probability of the psychological disease state.
In some embodiments, said obtaining text data of said user answering said predetermined diagnostic question comprises:
performing text extraction processing on the audio-visual data of the user answering the predetermined diagnostic question to obtain the text data.
Therefore, after the user's audio-visual data are obtained, the audio data therein are converted into text data. This firstly avoids confounding factors that differ between individuals, such as accent, speech rate, and intonation; secondly, the meaning the user intends to express can be judged from the semantics.
In some embodiments, the deriving a second probability of a psychological disease state from the textual data based on a second diagnostic model comprises:
extracting the content of the text data;
and performing disease matching according to the extracted content to obtain a second probability of the psychological disease state.
In this way, content extraction retains only the text of the user's answers, so that after disease matching is performed on the text data, a second probability characterizing the user's psychological disease state can be obtained.
In some embodiments, said deriving a final probability of the user's mental disease state from the first and second probabilities comprises:
performing a fusion calculation on the first probability and the second probability according to predetermined weights to obtain the final probability of the user's psychological disease state.
Therefore, the first probability and the second probability are fused according to the predetermined weights to obtain the final diagnosis result. The predetermined weights can be adjusted according to actual conditions so that the final diagnosis result matches reality, thereby achieving the purpose of diagnosing the user's psychological disease state.
The invention further provides a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the method of any of the embodiments above.
Therefore, the computer device of the present application acquires the audio-visual data and text data of the user answering the predetermined diagnostic question, uses the preset diagnosis models to obtain the respective probabilities of the user's psychological disease state, and combines the two probabilities, thereby determining the final disease probability of the user. Because the diagnosis is model-based, the diagnosis result is scientific and objective, and the method is broadly applicable, so that psychological diseases can be diagnosed across a large population, thereby meeting the diagnosis needs of psychological disease patients under limited resource conditions.
The present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the processors to perform the method.
Therefore, in the present application, the audio-visual data and the text data of the user answering the predetermined diagnostic questions are obtained, the preset diagnosis models are used to obtain probabilities of the user's psychological disease state from the respective data, and the two probabilities are combined, so that the final disease probability of the user can be determined. Because the diagnosis is model-based, the diagnosis result is scientific and objective, and the method is broadly applicable, so that psychological diseases can be diagnosed across a large population, thereby meeting the diagnosis needs of psychological disease patients under limited resource conditions.
Additional aspects and advantages of embodiments of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Referring to fig. 1, the present application provides a method for multi-modal diagnosis of psychological diseases, comprising:
S10: acquiring audio-visual data of a user answering a predetermined diagnostic question;
S20: acquiring text data of the user answering the predetermined diagnostic question;
S30: obtaining a first probability of a psychological disease state from the audio-visual data based on the first diagnosis model;
S40: obtaining a second probability of the psychological disease state from the text data based on the second diagnosis model;
S50: obtaining a final probability of the user's psychological disease state from the first probability and the second probability.
The application also provides a computer device, by which the psychological disease diagnosis method can be implemented. The computer device comprises a memory and a processor, the memory storing a computer program. The processor is configured to acquire audio-visual data of a user answering a predetermined diagnostic question and text data of the user answering the predetermined diagnostic question, to obtain a first probability of a psychological disease state from the audio-visual data based on a first diagnosis model, to obtain a second probability of the psychological disease state from the text data based on a second diagnosis model, and to obtain a final probability of the user's psychological disease state from the first probability and the second probability. The computer device in the present application may be a medical device with an audio-visual diagnosis function.
Specifically, the multi-modal psychological disease diagnosis method is implemented at least on the basis of audio-visual data and text data collected while the user answers predetermined diagnostic questions. The audio-visual data can be obtained from a recording of the user answering diagnostic questions predetermined by a professional psychotherapist. The length of the video data is variable and depends on the recording, and is not limited herein. The text data may be obtained by converting the audio data in the audio-visual data into text using Automatic Speech Recognition (ASR). The predetermined diagnostic questions may concern the user's age, occupation, and character, or the issues that trouble the user most and that the user most wishes to resolve, for example, "How has your sleep quality been lately?" or "What was the happiest thing for you recently?", and may be configured as needed, without limitation herein. The first diagnosis model can be generated by collecting facial, behavioural, and emotional features of a large number of people and performing label learning, and then yields the first probability of the psychological disease state from the user's audio-visual data. For example, the model can be trained on the facial, behavioural, and emotional features of ten thousand people, each labelled with a degree of depression; after training, the model can judge the degree of depression. When a piece of video data with an unknown degree of depression is fed to the model, the model returns the most probable degree of depression for that video.
The second diagnosis model may be generated by collecting text features of a large number of people and performing label learning, and then yields the second probability of the psychological disease state from the user's text data. For example, the model can be trained on the text features of ten thousand people, each labelled with a degree of depression; after training, the model can judge the degree of depression. When a piece of text data with an unknown degree of depression is fed to the model, the model returns the most probable degree of depression for that text. The accuracy of model identification depends on the degree of training on the collected feature material and on whether the model's structure is reasonable. The first and second probabilities can each represent the respective probabilities of different degrees of depression; for example, each can be expressed as (no depression a1, mild depression a2, moderate depression a3, major depression a4), where a1 + a2 + a3 + a4 = 1.
It can be understood that the first probability and the second probability are calculated by different models with different emphases, so a fusion calculation of the two is needed to obtain the final probability of the user's psychological disease state. The fusion calculation may be, for example, an arithmetic mean or a weighted average, and is not limited herein.
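As a minimal sketch of the arithmetic-mean option (the function name and the example numbers are hypothetical, not taken from the embodiment), the two four-way depression-severity distributions described above can be fused element-wise:

```python
# Hypothetical sketch: fuse the two 4-way depression-severity
# distributions (no / mild / moderate / major) by arithmetic mean,
# one of the fusion options mentioned above.

def fuse_mean(p_av, p_text):
    """Element-wise arithmetic mean of two probability distributions."""
    assert len(p_av) == len(p_text)
    return [(a + b) / 2.0 for a, b in zip(p_av, p_text)]

p1 = [0.6, 0.2, 0.1, 0.1]  # first probability, from the audio-visual model
p2 = [0.4, 0.4, 0.1, 0.1]  # second probability, from the text model
p = fuse_mean(p1, p2)      # fused distribution, still sums to 1
```

The fused vector remains a probability distribution because the mean of two vectors that each sum to 1 also sums to 1.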
In summary, in the multi-modal psychological disease diagnosis method and the computer device according to the embodiments of the present application, the audio-visual data and the text data of the user answering the predetermined diagnostic question are obtained, the preset diagnosis models are used to obtain the respective probabilities of the user's psychological disease state, and the two probabilities are combined, so that the final disease probability of the user can be determined. Because the diagnosis is model-based, the diagnosis result is scientific and objective, and the method is broadly applicable, so that psychological diseases can be diagnosed across a large population, thereby meeting the diagnosis needs of psychological disease patients under limited resource conditions.
Referring to fig. 2, in some embodiments, S10 includes:
S11: acquiring video data of a user answering a predetermined diagnostic question;
S12: extracting image data of the user from the video data;
S13: extracting audio data of the user from the video data.
In some embodiments, the processor is configured to obtain video data of the user answering a predetermined diagnostic question, and to extract image data of the user from the video data, and to extract audio data of the user from the video data.
Specifically, in this step, the video data may be an audio-visual recording of the user answering diagnostic questions predetermined by a professional psychotherapist. Image data and audio data of the user can be extracted from the video data; the image data may be extracted frame by frame. The audio data provides emotional features for the first diagnosis model and text data for the second diagnosis model.
Therefore, by acquiring the video data of the user answering the predetermined diagnostic questions, the image data and the audio data can be extracted, providing a data basis for the subsequent psychological disease diagnosis.
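The embodiment does not prescribe a tool for steps S12 and S13; one common choice is the ffmpeg command-line tool. The sketch below (file names and parameters are illustrative assumptions) only builds the argument lists; running them requires ffmpeg to be installed:

```python
# Hypothetical sketch of S12/S13 using the ffmpeg CLI; only the
# command argument lists are constructed here.

def frame_extract_cmd(video_path, out_pattern, fps=1):
    # extract one image per second (frame-by-frame image data)
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", out_pattern]

def audio_extract_cmd(video_path, wav_path):
    # strip video (-vn), produce mono 16 kHz WAV, a common speech-model input
    return ["ffmpeg", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path]

# e.g. subprocess.run(frame_extract_cmd("answer.mp4", "frame_%04d.png"))
#      subprocess.run(audio_extract_cmd("answer.mp4", "answer.wav"))
```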
Referring to fig. 3, in some embodiments, S30 includes:
S31: extracting user facial feature information from the image data to form a facial feature vector matrix.
In some embodiments, the processor is configured to extract user facial feature information from the image data to form a facial feature vector matrix.
Specifically, the key information related to the human face in the image data may be extracted as facial feature vectors. The facial features may be the eyes, nose, mouth, eyebrows, cheeks, chin, and so on, so that a feature vector is formed for each feature, and a plurality of facial feature vectors of the same dimension form the facial feature vector matrix.
In this way, extracting the user's facial feature information from the image data forms a plurality of facial feature vectors of the same dimension, so that the user's psychological disease state can be analysed from the features of each part of the face. These vectors form a facial feature vector matrix, which is supplied to the first diagnosis model and provides a facial feature basis for obtaining the first probability of the user's psychological disease state.
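The stacking described above can be sketched as follows; the per-part feature values are invented placeholders, since the embodiment does not specify how each part's vector is computed:

```python
# Minimal sketch of S31 (feature values are hypothetical): each facial
# part yields one feature vector of the same dimension; stacking them
# row by row forms the facial feature vector matrix.
import numpy as np

part_features = {
    "eyes":     [0.12, 0.80, 0.05],
    "mouth":    [0.40, 0.10, 0.33],
    "eyebrows": [0.25, 0.60, 0.90],
}

# one row per facial part, all rows of equal dimension
face_matrix = np.array(list(part_features.values()))
```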
Referring to fig. 3, in some embodiments, S30 includes:
S32: performing emotion classification on the audio data based on a predetermined emotion classification model to form an emotion feature vector matrix.
In some embodiments, the processor is configured to perform emotion classification on the audio data based on a predetermined emotion classification model to form an emotion feature vector matrix.
Specifically, the predetermined emotion classification model may classify the user's speech into several categories such as calm, anger, surprise, depression, happiness, fear, and sadness, each emotion being represented by a vector. For example, with these seven categories, calm can be expressed as (1, 0, 0, 0, 0, 0, 0), anger as (0, 1, 0, 0, 0, 0, 0), and so on; the vector for each emotion is a 1 × n matrix, and together they form the emotion feature vector matrix.
Therefore, extracting the user's emotional feature information from the audio data forms a plurality of emotion feature vectors of the same dimension, which in turn form an emotion feature vector matrix. This matrix is supplied to the first diagnosis model and provides an emotional feature basis for obtaining the first probability of the user's psychological disease state.
Referring to fig. 3, in some embodiments, S30 includes:
S33: mapping the dimensions of the emotion feature vector matrix to those of the facial feature vector matrix, so that the mapped emotion feature vector matrix has the same dimension as the facial feature vector matrix.
In some embodiments, the processor is configured to map the dimensions of the emotion feature vector matrix to those of the facial feature vector matrix, such that the mapped emotion feature vector matrix has the same dimension as the facial feature vector matrix.
Specifically, the emotion feature vector matrix and the facial feature vector matrix have different dimensions, so they cannot be merged and therefore cannot be subjected to matching analysis together. For example, the emotion feature vector matrix may be a 1 × 2 matrix, written as:
A = [0 1]
The facial feature vector matrix may be a 3 × 3 matrix (its entries b11 … b33 depend on the extracted facial features), written as:
B = [b11 b12 b13
     b21 b22 b23
     b31 b32 b33]
Further, mapping the emotion feature vector matrix to a feature vector matrix with the same width as the facial feature vector matrix can be expressed as:
A' = [0 1 0]
It should be noted that the emotion feature vector matrix A may be a 1 × p matrix and the facial feature vector matrix may be an m × n matrix; the mapped emotion feature vector matrix A' is then a 1 × n matrix, without limitation herein.
Therefore, the emotion feature vector matrix is mapped to a feature vector matrix of the same dimension as the facial feature vector matrix, providing a basis for vector splicing and fusion of the two matrices.
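One simple way to realize the mapping in S33 is zero-padding, which reproduces the example above (A = [0 1] becomes A' = [0 1 0]); a learned linear projection would be an alternative, and the function name here is an assumption:

```python
# Sketch of S33 via zero-padding: a 1 x p emotion vector is extended
# to 1 x n so its width matches the facial feature matrix's columns.

def map_to_width(vec, n):
    assert len(vec) <= n
    return vec + [0] * (n - len(vec))

A = [0, 1]                     # 1 x 2 emotion feature vector
A_mapped = map_to_width(A, 3)  # 1 x 3, matching a 3 x 3 face matrix
```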
Referring to fig. 3, in some embodiments, S30 includes:
S34: performing vector splicing and fusion on the mapped emotion feature vector matrix and the facial feature vector matrix;
S35: inputting the spliced feature vector matrix into the first diagnosis model to obtain the first probability of the psychological disease state.
In some embodiments, the processor is configured to perform vector splicing and fusion on the mapped emotion feature vector matrix and the facial feature vector matrix, and to input the spliced feature vector matrix into the first diagnosis model to obtain the first probability of the psychological disease state.
Specifically, after mapping, the emotion feature vector matrix and the facial feature vector matrix have the same width and can be spliced and fused into one feature vector matrix. The spliced feature vector matrix is input into the first diagnosis model, and the first probability of the psychological disease state is obtained through the matching analysis of the first diagnosis model. For example, the two matrices of matching width may be written as:
Emotion feature vector matrix:
A' = [0 1 0]
Facial feature vector matrix:
B = [b11 b12 b13
     b21 b22 b23
     b31 b32 b33]
Further, splicing the emotion feature vector matrix and the facial feature vector matrix into one fused matrix can be written as:
C = [0   1   0
     b11 b12 b13
     b21 b22 b23
     b31 b32 b33]
It should be noted that the matrix C is formed by splicing and fusing the mapped emotion feature vector matrix (a 1 × n matrix) and the facial feature vector matrix (an m × n matrix), so C may be an (m + 1) × n matrix, without limitation herein.
Therefore, the emotion feature vector matrix and the facial feature vector matrix are spliced into a single matrix, so that the first diagnosis model can jointly consider changes in the user's emotional features and facial features when analysing the psychological disease state, yielding a combined image-and-audio model result and hence the first probability of the psychological disease state.
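The splicing step S34 can be sketched with a row-wise stack; the facial matrix entries are numeric placeholders standing in for the b_ij values above:

```python
# Sketch of S34: stack the mapped 1 x n emotion row onto the m x n
# facial matrix, giving the (m+1) x n input for the first model.
import numpy as np

A_mapped = np.array([[0.0, 1.0, 0.0]])       # 1 x 3 mapped emotion vector
B = np.arange(9, dtype=float).reshape(3, 3)  # 3 x 3 stand-in facial matrix

C = np.vstack([A_mapped, B])                 # 4 x 3 spliced feature matrix
```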
Referring to fig. 4, in some embodiments, S20 includes:
S21: performing text extraction processing on the audio-visual data of the user answering the predetermined diagnostic question to obtain the text data.
In some embodiments, the processor is configured to perform text extraction processing on the audio/video data of the user answering the predetermined diagnostic question to obtain text data.
Specifically, in this step, the audio data are extracted from the audio-visual data of the user answering the predetermined diagnostic question. Converting the speech into text data can be done with ASR, which saves time and labour and is convenient and fast; manual transcription can also be used, which offers higher conversion accuracy.
Therefore, after the user's audio-visual data are obtained, the audio data therein are converted into text data. This firstly avoids confounding factors that differ between individuals, such as accent, speech rate, and intonation; secondly, the meaning the user intends to express can be judged from the semantics.
Referring to fig. 5, in some embodiments, S40 includes:
S41: extracting the content of the text data;
S42: performing disease matching on the extracted content to obtain the second probability of the psychological disease state.
In some embodiments, the processor is configured to perform content extraction on the text data and perform disease matching based on the extracted content to obtain a second probability of a psychological disease state.
Specifically, the acquired text may be content-extracted, deleting speech irrelevant to the user's answers, chiefly the predetermined diagnostic questions and the silent gaps between turns, so that the text data input into the second diagnosis model contains only the user's own words. The extracted text data may consist of several sentences arranged in chronological order. For example, if the raw text is "How has your sleep quality been lately? I have been unable to fall asleep for a long time. What was the happiest thing for you recently? Nothing has made me happy recently.", the extracted text data are "I have been unable to fall asleep for a long time." and "Nothing has made me happy recently.": the predetermined diagnostic questions are deleted, the user's answers are retained, and the extracted text consists of two sentences in chronological order. Based on the second diagnosis model, the chronologically ordered sentences can be disease-matched to obtain the second probability of the psychological disease state.
In this way, content extraction retains only the text of the user's answers, so that after disease matching is performed on the text data, a second probability characterizing the user's psychological disease state can be obtained.
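The filtering in S41 can be sketched as follows; since the embodiment does not specify a transcript format, a list of (speaker, sentence) pairs is assumed:

```python
# Hedged sketch of S41: keep only the user's sentences, in time order,
# discarding the predetermined diagnostic questions.

def extract_user_answers(transcript):
    return [text for speaker, text in transcript if speaker == "user"]

transcript = [
    ("system", "How has your sleep quality been lately?"),
    ("user",   "I have been unable to fall asleep for a long time."),
    ("system", "What was the happiest thing for you recently?"),
    ("user",   "Nothing has made me happy recently."),
]
answers = extract_user_answers(transcript)  # two sentences remain
```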
Referring to fig. 6, in some embodiments, S50 includes:
S51: performing a fusion calculation on the first probability and the second probability according to predetermined weights to obtain the final probability of the user's psychological disease state.
In some embodiments, the processor is configured to perform a fusion calculation of the first probability and the second probability according to a predetermined weight to obtain a final probability of the user's mental disease state.
In particular, the predetermined weights of the first and second probabilities may be configured according to how different users manifest the psychological disease state. For example, if a user shows no significant change in facial expression or emotion when answering the predetermined diagnostic questions, while the semantics of the answers are negative, sad, and full of negative energy, the weight of the first probability may be lowered and that of the second probability raised. The adjustment can be made in real time according to actual conditions, without limitation herein.
The final probability of the user's psychological disease state may be calculated from the first probability and the second probability according to the predetermined weights. For example, let the first probability be P1, the second probability P2, the final probability P, the weight of the first probability F, and the weight of the second probability 1 − F, where P1 ∈ [0, 1], P2 ∈ [0, 1], and F ∈ [0, 1]; then P = P1 × F + P2 × (1 − F).
Therefore, the first probability and the second probability are fused according to the predetermined weights to obtain the final diagnosis result. The predetermined weights can be adjusted according to actual conditions so that the final diagnosis result matches reality, thereby achieving the purpose of diagnosing the user's psychological disease state.
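The weighted fusion P = P1 × F + P2 × (1 − F) described above can be written directly (the example values are hypothetical):

```python
# Sketch of S51: weighted fusion of the two probabilities,
# P = P1*F + P2*(1 - F), with P1, P2, F all in [0, 1].

def fuse_weighted(p1, p2, f):
    assert 0.0 <= p1 <= 1.0 and 0.0 <= p2 <= 1.0 and 0.0 <= f <= 1.0
    return p1 * f + p2 * (1.0 - f)

# flat facial/emotional signal but strongly negative text: lower F
final_p = fuse_weighted(0.3, 0.8, 0.25)  # ~0.675
```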
The embodiment of the application also provides a computer readable storage medium. One or more non-transitory computer-readable storage media embodying computer-executable instructions which, when executed by one or more processors, cause the processors to perform the method of any of the embodiments described above.
As such, the present invention provides a non-transitory computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the processors to perform the psychological disease diagnosis method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, and the program can be stored in a non-volatile computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), or the like.
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.