
A method, device, computer equipment and readable storage medium for implementing cloud-rendered voice interactive digital human

Info

Publication number
CN119579737B
Authority
CN
China
Prior art keywords
voice
expression
audio
data
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411501244.8A
Other languages
Chinese (zh)
Other versions
CN119579737A (en)
Inventor
朱勇
孙曼青
陈攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lixin Tongzhi Technology Beijing Co ltd
Original Assignee
Lixin Tongzhi Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lixin Tongzhi Technology Beijing Co ltd
Priority to CN202411501244.8A
Publication of CN119579737A
Application granted
Publication of CN119579737B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a method, a device, computer equipment and a readable storage medium for implementing a cloud-rendered voice interaction digital person. The method comprises the following steps: first, a voice interaction digital person is constructed and configured for audio lip-sync and personification, and is then placed on a front-end page through real-time cloud rendering. After the user's voice interaction input is obtained from the front end, the digital person's interactive feedback is generated using the earlier configuration and is cloud-rendered to the front-end page in real time. The method meets the demand of various industries for high-quality voice interaction digital persons, reduces the requirements on user devices by means of cloud rendering, and improves the user experience.

Description

Implementation method and device of cloud rendering voice interaction digital person, computer equipment and readable storage medium
Technical Field
The invention relates to the technical field of digital persons, in particular to a method and a device for realizing cloud rendering voice interaction digital persons, computer equipment and a readable storage medium.
Background
With the rapid development of the metaverse, digital humans and AI technology, the demand for conversational interactive digital persons is increasing across industries. However, high-quality voice interaction digital persons are currently scarce.
Disclosure of Invention
The invention aims to provide a method and a device for realizing cloud rendering voice interaction digital people, computer equipment and a readable storage medium.
In a first aspect, an embodiment of the present invention provides a method for implementing a cloud-rendered voice interaction digital person, including:
Constructing a voice interaction digital person, and carrying out audio lip movement synchronous configuration and personification configuration aiming at the voice interaction digital person;
Rendering the configured voice interaction digital person to a preset front-end page through a real-time cloud;
acquiring voice interaction input initiated by a user through the preset front-end page;
based on voice interaction input, generating interaction feedback of the voice interaction digital person through the audio lip movement synchronous configuration and the personification configuration;
and rendering the interactive feedback of the voice interaction digital person to the preset front-end page through real-time cloud rendering.
In one possible implementation, the constructing a voice interactive digital person includes:
Obtaining a basic character model, and carrying out grid carving on the basic character model to obtain a model surface piece of the basic character model;
Performing mapping treatment on the molded surface piece by using a UV editor to finish coloring mapping of the basic character model;
Performing skeleton creation and skin treatment on the basic character model to obtain a plurality of skeleton nodes and skeleton weights of the skeleton nodes, and determining a gesture matrix for controlling the movement of the skeleton nodes based on the skeleton weights;
Acquiring a plurality of preset expression groups, wherein each expression group corresponds to one expression vertex group, each expression vertex group comprises a plurality of expression vertices, each expression vertex comprises a default posture and an expression maximization posture, and the default posture and the expression maximization posture are used for calculating the posture position of the corresponding expression vertex according to the expression weight;
And constructing the voice interaction digital person based on the model surface patch, the coloring map, the skeleton creation and the skin treatment and the expression groups.
In one possible implementation, the audio lip movement synchronization configuration for the voice interactive digital person includes:
Acquiring sample voice video data;
detecting a face region from the sample voice video data, and identifying a plurality of face feature points from the face region;
Randomly generating a plurality of expression groups, converting the plurality of expression groups into a 2D image, determining face feature points in one-to-one correspondence, and training to obtain a first mapping model for representing the mapping relation between the plurality of expression groups and the face feature points;
determining a face feature point sequence based on the sample voice video data, and determining a predicted expression base corresponding to the face feature point sequence by utilizing the first mapping model;
and acquiring an audio sequence of the sample voice video data, training according to the audio sequence and the predicted expression groups to obtain a second mapping model for representing the mapping relation between the audio data and the expression groups, and completing the audio lip movement synchronous configuration.
In one possible implementation, the personifying configuration for the voice interaction digital person includes:
Acquiring a joint topological graph of the voice interaction digital person, wherein the joint topological graph comprises a plurality of root nodes and a plurality of child nodes respectively included by each root node;
Converting an initial rotation quaternion under a world coordinate system into a target rotation quaternion relative to a father node, and performing deep traversal on the root node and the child node to perform assignment;
Controlling a preset animation based on a state machine, acquiring target rotation quaternions of the root node and the child nodes at animation switching time, and realizing smooth transition between the state machine animation and the human body characteristic animation based on quaternion smooth interpolation;
setting a random time to trigger a preset blink animation, and setting an eyeball jitter range, jitter frequency and jitter direction;
And configuring a preset expression base for a preset question-answering text, and triggering the preset expression base when triggering the preset question-answering text.
In a possible implementation manner, the rendering the configured voice interaction digital person to the preset front-end page through the real-time cloud includes:
carrying out the matting processing on the voice interaction digital person to obtain a voice interaction digital person after matting;
establishing a Websocket connection with the preset front-end page, and encapsulating a character image switching interface, an audio lip movement synchronous playing interface and an action driving interface;
and displaying the voice interaction digital person after the image matting to a display area provided by the preset front-end page.
In a possible implementation manner, the generating the interactive feedback of the voice interactive digital person through the audio lip movement synchronization configuration and the personification configuration based on the voice interactive input includes:
receiving the voice interaction input, and performing text conversion on the voice interaction input to obtain input text data;
inputting the input text data into a pre-trained conversational large language model to obtain feedback text data;
performing audio conversion on the feedback text data to obtain feedback audio data;
And processing the feedback audio data through the audio lip movement synchronous configuration and the personification configuration to obtain the interactive feedback of the voice interactive digital person.
In a possible implementation manner, the cloud rendering the interactive feedback of the voice interactive digital person to the preset front-end page includes:
Rendering the interactive feedback of the voice interactive digital person to obtain a rendering result;
and displaying the rendering result of the voice interaction digital person on the preset front-end page through the Websocket connection.
In a second aspect, an embodiment of the present invention provides an implementation apparatus for cloud-rendering voice interaction digital people, including:
The system comprises a construction module, a configuration module, a real-time cloud rendering module and a display module, wherein the construction module is used for constructing a voice interaction digital person, and carrying out audio lip movement synchronous configuration and anthropomorphic configuration aiming at the voice interaction digital person;
The interactive module is used for acquiring voice interaction input initiated by a user through the preset front-end page, generating interaction feedback of the voice interaction digital person through the audio lip movement synchronous configuration and the personification configuration based on the voice interaction input, and rendering the interaction feedback of the voice interaction digital person to the preset front-end page in real time.
In a third aspect, an embodiment of the present invention provides a computer device, including a processor and a nonvolatile memory storing computer instructions that, when executed by the processor, perform the method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a readable storage medium, where the readable storage medium includes a computer program, where the computer program controls a computer device where the readable storage medium is located to execute the method of the first aspect when running.
Compared with the prior art, the method, device, computer equipment and readable storage medium for implementing a cloud-rendered voice interaction digital person provided by the invention have the following beneficial effects: a voice interaction digital person is constructed, audio lip-sync and personification configuration are carried out, and the digital person is then placed on a front-end page through real-time cloud rendering. After the user's voice interaction input is obtained from the front end, the digital person's interactive feedback is generated using the earlier configuration and is cloud-rendered to the front-end page in real time. The method meets the demand of various industries for high-quality voice interaction digital persons, reduces the requirements on user devices by means of cloud rendering, and improves the user experience.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described. It is appreciated that the following drawings depict only certain embodiments of the invention and are therefore not to be considered limiting of its scope. Other relevant drawings may be made by those of ordinary skill in the art without undue burden from these drawings.
Fig. 1 is a schematic flow chart of steps of a method for implementing a cloud-rendering voice interaction digital person according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a lip synchronization process framework according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of a device for implementing cloud-rendering voice interaction digital people according to an embodiment of the present invention;
Fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
The following describes specific embodiments of the present invention in detail with reference to the drawings.
In order to solve the technical problems in the foregoing background technology, fig. 1 is a schematic flow chart of a method for implementing a cloud-rendering voice interaction digital person according to an embodiment of the present disclosure, and the method for implementing the cloud-rendering voice interaction digital person is described in detail below.
Step S201, constructing a voice interaction digital person, and carrying out audio lip movement synchronous configuration and personification configuration aiming at the voice interaction digital person;
Step S202, rendering the configured voice interaction digital person to a preset front-end page through a real-time cloud;
step S203, obtaining voice interaction input initiated by a user through the preset front-end page;
step S204, based on voice interaction input, generating interaction feedback of the voice interaction digital person through the audio lip movement synchronous configuration and the personification configuration;
Step S205, rendering the interactive feedback of the voice interaction digital person to the preset front-end page through real-time cloud rendering.
In an embodiment of the present invention, the server obtains the base character model from a local model library or external storage, for example. For example, the base character model may be a simple three-dimensional model of the human body, similar to a human body model without much detail. This model may have only basic body contours, such as a simple sphere for the head, a simple cylinder for the limbs, etc.
The server then begins the mesh sculpting operation. The initial face of this basic character model is assumed to be relatively smooth, without the details of the five sense organs. The server adds detail step by step in the face area by means of a specific mesh sculpting algorithm. For example, in order to shape the nose, the server performs stretching, recessing and similar operations on the mesh of the middle part of the face, and after a series of operations obtains model patches with a finer facial structure. These model patches, like pieces of a puzzle that make up the appearance of a digital person, combine to form the exact appearance of the digital person.
After obtaining the model patches, the server uses a UV editor. Taking the digital human's clothing as an example, the server first unfolds the model patches of the clothing section in the UV editor, just as if the three-dimensional clothing shape were tiled on a two-dimensional plane.
The server then selects the appropriate texture picture as the map of the garment. Assuming a blue striped shirt texture is selected, this texture picture is correspondingly applied to the unfolded UV. For a skin portion, a suitable skin tone texture is selected for mapping. Through this process, the digital person changes from a shape-only model mask to a colored and textured image, which appears more realistic.
When the server creates a skeletal structure, the digital person's arm is taken as an example. Skeletal nodes are created inside the arm, from the shoulder to the elbow to the wrist, forming a chain-like skeletal structure. These skeletal joints, like the joints of a human arm, can control the bending and stretching of the arm.
During the skinning process, the server determines the influence weight of each bone node on the surrounding model patches. For example, for the model patches around a shoulder bone node, the server calculates that the bone node has a higher influence weight on these patches, because the motion of the shoulder has a greater effect on the stretching of nearby skin and muscles. When the arm is bent, the model patches deform according to the gesture matrix calculated from the bone weights, so that the arm motion of the digital person looks natural and smooth.
The expression base acquired by the server is basic data of different expressions which are defined in advance. For example, there is a smiling expression base, and the corresponding expression vertex group includes expression vertices related to the mouth, eyes, and the like. The default pose of the mouth expression vertex may be closed and the expression maximization pose is a smile shape that breaks open to a large extent.
The server integrates the expression bases with the previously constructed model patches, the coloring map, the bone structure and the skinning. When the digital person is required to show a smiling expression, the mouth, eyes and other parts of the digital person are deformed accordingly by calculating the pose positions of the expression vertices from the set expression weights, so that a complete voice interaction digital person capable of expressing various expressions is constructed.
The server obtains sample audio video data from a large multimedia database. For example, the database may contain lecture videos, dialogue videos, etc. of various persons. The server selects a video of a person for lecturing, wherein the video has clear voice content and corresponding facial expression and lip movement of the person.
The server uses an image recognition algorithm to detect the face region. In this lecture video, the server recognizes a face portion in the picture by analyzing pixel information in a video frame. Then, in the face region, key face feature points such as corners of mouth, corners of eyes, tips of nose and the like are further identified. For example, the position of the corners of the mouth can be used as an important feature point for judging lip movements, and the position of the corners of the eyes is also critical for expression recognition.
The server randomly generates expression groups, such as an expression group with a surprise expression. After converting the expression base into a 2D image, the server determines points in the 2D image corresponding to facial feature points previously identified in the sample video. For example, in a 2D image of a surprise expression, the feature points corresponding to the shape of the mouth opening correspond to the feature points of the mouth opening when the person in the sample video is surprised. Through a large amount of matching data of the expression base and the corresponding face feature points, the server trains a first mapping model, and the model can predict possible expression bases according to the face feature points.
In the speech video, along with the playing of the video, the server determines a face feature point sequence according to the face feature point information in each frame. For example, at the beginning of a video, the facial expression of a person is calm, the feature points show the mouth closed, and the corners of the eyes are normal. With the progress of the lecture content, the character starts to be excited, the mouth opens, the canthus rises, and the like. The server inputs the feature point sequence into a first mapping model, and the model can predict the change condition of the corresponding expression base in the process.
The server extracts an audio sequence from the lecture video, wherein the audio sequence contains the voice content, the tone of the voice, the speed of the voice and other information of the person lecture. The server combines the audio sequence with the previously predicted expression base for training. For example, when a high-pitched, rapid speech portion appears in the audio, the corresponding expression base may be a relatively excited expression. Through a large amount of data training, a second mapping model is obtained, and the model can accurately determine the expression groups which should correspond according to the input audio data, so that the audio lip movement synchronous configuration is realized.
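To make this second mapping step concrete, the following is a minimal sketch, assuming the audio has already been converted into per-frame feature vectors (e.g. MFCCs) and each expression base is encoded as a weight vector produced by the first mapping model; the layer sizes and the names AudioToExpression and train_second_mapping are illustrative assumptions, not the patent's actual model.

```python
# Minimal sketch (not the patent's exact model): learn a mapping from per-frame
# audio features to the expression-base weights predicted in the previous step.
# Feature extraction and data loading are assumed to exist elsewhere.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Maps one frame of audio features to a vector of expression-base weights."""
    def __init__(self, n_audio_features=39, n_expression_bases=52):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_audio_features, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_expression_bases),
            nn.Sigmoid(),          # expression weights constrained to [0, 1]
        )

    def forward(self, x):
        return self.net(x)

def train_second_mapping(model, audio_frames, predicted_expressions, epochs=100):
    # audio_frames: (N, n_audio_features) tensor of audio features;
    # predicted_expressions: (N, n_expression_bases) tensor produced by running
    # the first mapping model over the corresponding video frames.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(audio_frames), predicted_expressions)
        loss.backward()
        optimizer.step()
    return model
```

At inference time the same model is run frame by frame over the synthesized reply audio, and the predicted weights drive the mouth-related expression vertices.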
The server obtains a digital person's joint topology map, taking the digital person's leg joints as an example. The hip joint of the leg can be used as a root node, and the subnodes such as knee joint, ankle joint and the like are arranged below the root node. This joint topology, like a family tree, clearly shows the hierarchical relationship between the individual joints.
In the motion simulation of a digital person, it is assumed that the digital person stands at an initial position in the world coordinate system. For leg joints, the server converts the initial rotational quaternion of the hip joint (root node) in world coordinate system to a target rotational quaternion with respect to the body (parent node). Then, through depth traversal, the server assigns this target rotation quaternion to the child nodes such as knee joint and ankle joint in turn. Thus, when the hip joint rotates, the knee joint and the ankle joint rotate according to the assignment in a correct relation, so that the leg part acts naturally.
The server uses a state machine to control the digital person's walking animation. When the digital person switches from a standing state to a walking state, at the animation switching time point, the server acquires target rotation quaternions of the hip joint (root node) and the knee joint and ankle joint (child node). Through the quaternion smooth interpolation algorithm, the articulation of the digital person does not suddenly jump in the process of converting the standing still animation into the walking animation, but smoothly transits, and is as natural as a real person does from still to walking.
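The smooth transition at the switching moment can be sketched with standard quaternion spherical linear interpolation (slerp); the helper below is a generic NumPy implementation, and the blend_joint_poses wrapper with named joints is a hypothetical convenience, not the patent's specific code.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0, q1 = np.asarray(q0, dtype=float), np.asarray(q1, dtype=float)
    dot = np.dot(q0, q1)
    if dot < 0.0:               # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:            # nearly parallel: linear interpolation is safe
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def blend_joint_poses(standing_pose, walking_pose, t):
    """Blend every joint's rotation from the standing clip to the walking clip
    as t runs from 0 (still standing) to 1 (fully walking)."""
    return {joint: slerp(standing_pose[joint], walking_pose[joint], t)
            for joint in standing_pose}
```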
The server sets a blink animation for the digital person. For example, a blink animation is randomly triggered every 5 to 10 seconds. Meanwhile, the jitter range of the eyeball is set: for example, the jitter range in the left-right direction is between 1 and 2 millimeters, the jitter frequency is 1 to 2 times per second, and the jitter direction can be random or follow a certain pattern, such as left first and then right. Thus, the eyes of the digital person look more vivid and have an anthropomorphic effect.
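One simple way to realize the random blink trigger and eyeball jitter described above is a small idle loop; trigger_blink and set_eye_offset below are placeholder hooks into whatever animation engine is used, and the numeric ranges simply follow the example values in the preceding paragraph.

```python
import random
import time

def run_eye_idle_loop(trigger_blink, set_eye_offset, stop_flag):
    """Illustrative blink/jitter scheduler: blink every 5-10 s, horizontal
    eyeball jitter of 1-2 mm at 1-2 times per second, random direction."""
    next_blink = time.time() + random.uniform(5.0, 10.0)
    while not stop_flag():
        now = time.time()
        if now >= next_blink:
            trigger_blink()                              # play the preset blink clip
            next_blink = now + random.uniform(5.0, 10.0)
        amplitude_mm = random.uniform(1.0, 2.0) * random.choice((-1.0, 1.0))
        set_eye_offset(x_mm=amplitude_mm, y_mm=0.0)      # small horizontal saccade
        time.sleep(1.0 / random.uniform(1.0, 2.0))       # 1-2 jitters per second
```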
The server configures expression bases for some common question-and-answer texts. For example, when the user asks "How do you do?", the server configures a friendly smiling expression for the digital person. When the digital person receives this question, the smiling expression base is triggered, so that the expression of the digital person matches the interactive content, improving the personification effect.
The server processes the voice interaction digital person with an image matting algorithm. The background of the voice interaction digital person is assumed to be a solid-colour virtual scene; the server separates the digital person from the background by identifying the colour difference between the two, obtaining an image of only the digital person's body, namely the matted voice interaction digital person.
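Because the background is assumed to be a solid colour, the matting step can be approximated by a simple chroma-key threshold; the sketch below (NumPy, with an illustrative key colour and tolerance) is one possible realization rather than the specific matting algorithm used by the server.

```python
import numpy as np

def chroma_key_alpha(frame_rgb, key_color=(0, 255, 0), tolerance=60):
    """Return an alpha mask: 0 where a pixel matches the solid background
    colour (within tolerance), 255 where it belongs to the digital person."""
    diff = frame_rgb.astype(np.int32) - np.array(key_color, dtype=np.int32)
    distance = np.sqrt((diff ** 2).sum(axis=-1))
    return np.where(distance > tolerance, 255, 0).astype(np.uint8)

def matte_digital_person(frame_rgb, key_color=(0, 255, 0)):
    alpha = chroma_key_alpha(frame_rgb, key_color)
    return np.dstack([frame_rgb, alpha])   # RGBA frame, background transparent
```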
The server establishes a Websocket connection with the front-end page, as if a bi-directional communication pipe were set up between the server and the front end. The server then encapsulates a character image switching interface, through which the front-end page can request a switch of the digital person's appearance (e.g., changing clothes or hair). For the audio lip-sync playing interface, when audio data is to be played with synchronized lip movement, the front-end page sends a request to the server through this interface and receives the synchronization data. The action driving interface is used to control various actions of the digital person, such as walking and waving.
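A minimal sketch of the three encapsulated interfaces over one Websocket connection is shown below, using the third-party Python websockets package (v11+ single-argument handler); the JSON message names and handler behaviour are illustrative assumptions, not the patent's actual protocol.

```python
import asyncio
import json
import websockets

async def handle_client(ws):
    # One connection per front-end page; each request is a small JSON message.
    async for raw in ws:
        msg = json.loads(raw)
        if msg["type"] == "switch_character":        # character image switching
            await ws.send(json.dumps({"type": "character_switched",
                                      "appearance": msg["appearance"]}))
        elif msg["type"] == "play_audio_lipsync":    # audio lip-sync playback
            await ws.send(json.dumps({"type": "lipsync_frames",
                                      "text": msg.get("text", "")}))
        elif msg["type"] == "drive_action":          # action driving (walk, wave...)
            await ws.send(json.dumps({"type": "action_started",
                                      "action": msg["action"]}))

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 8765):
        await asyncio.Future()                       # run forever

if __name__ == "__main__":
    asyncio.run(main())
```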
The server sends the matted digital person to the display area of the front-end page. For example, the front-end page has a rectangular area dedicated to displaying the digital person; the server transmits the image data of the digital person to the front end, the front-end page displays the digital person in that area after receiving the data, and the user can then see the voice interaction digital person on the front end.
When a user inputs voice through a microphone on the front-end page, the front-end page sends the voice data to the server through the Websocket connection. For example, the user says "How is the weather today?", and this voice data is transmitted to the server side for processing.
After receiving the user's voice data, the server uses speech recognition technology to convert the voice into text. For example, "How is the weather today?" is accurately converted into the corresponding text content, obtaining the input text data.
The server inputs this text data, "How is the weather today?", into a pre-trained conversational large language model. This large model, trained on large amounts of text data, is able to generate an appropriate reply from the input. For example, the large model may reply "The weather is clear and sunny today." This reply is the feedback text data.
The server uses speech synthesis technology to convert the feedback text data "The weather is clear and sunny today." into audio data. The intonation, speech rate and other properties of the audio may be adjusted according to preset rules, for example synthesizing the audio in a relatively light intonation, to obtain the feedback audio data.
The server processes the feedback audio data through the previously trained audio lip-sync configuration, determining the expression bases of the digital person while the audio is played. At the same time, according to the personification configuration, some actions are triggered, such as a slight head shake while the digital person is speaking. In this way, the interactive feedback of the voice interaction digital person is obtained, covering audio, lip movement, expression, action and so on.
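Taken together, the feedback-generation steps above form a simple pipeline; in the sketch below, recognize_speech, chat_model, synthesize_speech and audio_to_expressions are placeholders for the ASR engine, the conversational large language model, the TTS engine and the trained second mapping model, and none of these names come from the patent.

```python
def generate_interactive_feedback(voice_input_pcm,
                                  recognize_speech,
                                  chat_model,
                                  synthesize_speech,
                                  audio_to_expressions):
    """Voice input in, audio plus expression/lip-movement track out."""
    input_text = recognize_speech(voice_input_pcm)        # speech -> text
    reply_text = chat_model(input_text)                   # text -> reply text
    reply_audio = synthesize_speech(reply_text)           # reply text -> audio
    expression_track = audio_to_expressions(reply_audio)  # audio -> lip/expression
    return {
        "audio": reply_audio,             # played on the front-end page
        "expressions": expression_track,  # drives mouth shapes and expressions
        "text": reply_text,
    }
```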
The server renders the interactive feedback of the digital person. For example, the expression, the action and the audio of the digital person are combined for rendering, so that the expression of the digital person can be synchronous with the audio, and the action is natural and smooth. The rendering process may involve operations such as optimizing images of digital persons, adjusting audio, and the like, resulting in rendering results.
And the server sends the rendering result to the front-end page through Websocket connection. And after the front-end page receives the data, updating the display content of the digital person in the display area. For example, the digital person starts to play audio and display expressions and actions according to the rendering result, and the user can see the reply of the digital person to the question of the user, so that a voice interaction process is completed.
In the embodiment of the invention, the construction of the voice interaction digital person can be implemented through the following examples.
Obtaining a basic character model, and carrying out grid carving on the basic character model to obtain a model surface piece of the basic character model;
Performing mapping treatment on the molded surface piece by using a UV editor to finish coloring mapping of the basic character model;
Performing skeleton creation and skin treatment on the basic character model to obtain a plurality of skeleton nodes and skeleton weights of the skeleton nodes, and determining a gesture matrix for controlling the movement of the skeleton nodes based on the skeleton weights;
Acquiring a plurality of preset expression groups, wherein each expression group corresponds to one expression vertex group, each expression vertex group comprises a plurality of expression vertices, each expression vertex comprises a default posture and an expression maximization posture, and the default posture and the expression maximization posture are used for calculating the posture position of the corresponding expression vertex according to the expression weight;
And constructing the voice interaction digital person based on the model surface patch, the coloring map, the skeleton creation and the skin treatment and the expression groups.
In an embodiment of the present invention, the server obtains the base character model from a pre-built model repository. The model resource library contains basic character models of various types and styles, and is obtained through long-time accumulation and arrangement. For example, in a server providing digital personal services for video special effects production, the model repository may contain basic character models of different gender, age, race. The server selects a young male basic character model with basic human scale and simple outline from the library based on the needs of the current project, such as the need to create a young Asian male digital character. The basic model is only a general human body shape in an initial state, has very limited details, is similar to a simple geometric combination, has a head similar to a sphere, has a body part of a simple cuboid, has limbs of a cylinder and the like, and provides a basic framework for subsequent refinement treatment.
The server processes the underlying character model using a mesh sculpting tool in specialized three-dimensional modeling software (e.g., blender or Maya). Taking the face as an example, the server first subdivides the mesh of the face region, increasing the density of the mesh so that the facial features can be molded more carefully. For the eye part, the server performs a concave operation on the mesh around the eyes to form the shape of the eye socket, then gradually shapes the convex part of the eyeball in the eye socket, and determines the size and shape of the eyeball by continuously adjusting the positions of the vertices of the mesh. For the nose, stretching operation is performed from the grids in the center of the face to form the shape of the nose bridge, and then the grids at the two sides of the nose bridge are adjusted to form the shape of the nose wings. The mouth part is formed by operating the grid around the lips, determining the outline of the lips, and then displaying the thickness and curve of the lips by adjusting the shape of the grid.
In body parts, for example the chest and abdomen. The server adjusts the grid of the chest, and according to the structure of the muscle of the human body, the front and the side of the chest are subjected to different degrees of bulge operation so as to shape the chest muscle. For the abdomen, the grids of the abdomen are smoothed according to the physiological curve of the human body, so that the grids are slightly concave to show the real form of the human body. In the four limbs, the outline of the muscles is reflected by stretching and protruding the corresponding grids on the parts with developed muscles, such as the biceps brachii muscle of the arm, the quadriceps femoris muscle of the leg and the like. After these mesh engraving operations, the basic character model is transformed from a simple geometric combination to model patches with more detail that accurately describe the figure's shape, each patch consisting of a series of vertices, edges, and faces that together make up the three-dimensional shape of the figure.
The server opens a UV editor in the three-dimensional modeling software and operates on the model patches previously engraved. Taking the digital human head as an example, the server expands the model patches of the head onto the two-dimensional plane of the UV editor. This process is analogous to cutting and tiling a three-dimensional head model on a plane such that each patch has a corresponding two-dimensional coordinate area. For the face patches of the face, because of the relatively complex face structure, the patches of the eyes, nose, mouth and other parts need to be carefully unfolded respectively, so that the reasonable layout of the face patches in the UV editor is ensured, and the accurate mapping operation is carried out later. The same is true of body parts, for example, model patches of arms, breasts, legs and the like are respectively unfolded in a UV editor, and the connection relationship among the patches is noted, so that the situation of discontinuous textures in the mapping process is avoided.
The server selects the appropriate skin texture from a library of textures prepared in advance. The texture library contains skin textures of various skin colors and skin types, and is manufactured according to the characteristics of the real human skin. For example, for a young male digital person, the server selects a skin texture with a healthy skin tone, a fine texture. The server maps this skin texture onto the model patches of head, neck, hand and other exposed parts of the body that are unfolded in the UV editor. In the mapping process, the proper direction and proportion of the texture are ensured, for example, the texture direction of the skin is consistent with the physiological structure of a human body, and the skin texture of the arm and the leg is required to show natural stretching and shrinking effects.
For the hair segment, the server selects the corresponding hair texture based on the digital person's hairstyle design. In the case of short hair styling, a short and clean hair texture is selected and then applied precisely to the hair styling surface of the head. In the clothing aspect, assuming that the digital person wears a T-shirt and jeans, the server selects the appropriate T-shirt texture and jeans texture, respectively, from the clothing texture library. The texture of the T-shirt is attached to the model surface piece of the chest and the arm part, attention is paid to texture mapping of detail parts such as necklines, cuffs and the like of the T-shirt, and the jeans texture is attached to the model surface piece of the leg part, so that the jeans texture can accurately show details such as folds, stitches and the like of trousers. After these mapping processes, the base character model is finished with a colored mapping from a shape-only model surface to a digital character with a realistic appearance.
The server creates a skeletal structure for the underlying character model in three-dimensional modeling software. Taking main bones of a human body as an example, first, vertebral bones are created, bone nodes such as thoracic vertebrae, lumbar vertebrae, sacral vertebrae and the like are sequentially created from cervical vertebrae, and these bone nodes constitute a digital human spine skeleton which is a supporting structure of the whole body. Bones of the limbs are then created, and for the arm, starting from the clavicle and scapula of the shoulder, the humerus of the upper arm, the ulna of the forearm and the radius are connected, and then to the carpal bones of the wrist, the metacarpal bones of the palm and the phalanges of the fingers, forming a complete arm bone structure. The legs then start from the pelvis and connect the femur of the thigh, the tibia and fibula of the calf, to the tarsal bones of the ankle, the metatarsal bones of the sole and the phalanges of the toes. In addition, skeletal nodes such as the skull of the head are created that together form the complete skeletal system of the digital person, just like the skeletal frame of a real human body, providing the underlying support and motion joints for the digital person's movements.
After creating the skeletal nodes, the server needs to precisely locate and adjust the position of each skeletal node. For example, the skeletal nodes of the shoulders are precisely aligned with the shoulder regions of the body model, ensuring that the motion of the shoulders will naturally drive the motion of the arms during subsequent movements. For each node of the vertebra, the nodes are arranged according to the physiological curve of the human body, so that the body posture of the digital human body is natural. In the hands and feet, the skeletal nodes of the fingers and toes are precisely distributed within the corresponding model patches so that the movements of the fingers and toes can be precisely controlled.
The server assigns skeletal weights to the model patches using a skinning tool. Taking the arm as an example, for a model facet of the upper arm, the server assigns higher bone weights to the humeral bone nodes, as the motion of the upper arm is primarily controlled by the humerus. As the humerus moves, the upper arm model patch with the higher weight deforms closely following the motion of the humerus. For model patches of the forearm, in addition to assigning higher weights to the ulna and radius skeletal nodes, a proportion of the weights are assigned to adjacent skeletal nodes based on the characteristics of the forearm's muscles and skin to ensure that the forearm's skin and muscles deform naturally as the arm flexes and stretches. At other parts of the body, such as the model patches of the chest, certain weights are assigned to the bone nodes of the vertebrae and ribs to ensure natural deformation of the chest as the body twists and bends.
The server calculates a gesture matrix for controlling the movement of a plurality of bone nodes according to the bone weights. Taking a simple arm bending action as an example, when the humeral bone node is rotated a certain angle about an axis, the server determines the new position of each patch vertex by computing the pose matrix based on the bone weights assigned to the upper arm model patches. The gesture matrix contains rotation, translation and scaling information of skeleton nodes, and deformation conditions of the model surface plates during skeleton movement can be accurately calculated through matrix multiplication operation. For example, for a certain vertex on an arm, according to the bone weight of the face piece where the vertex is positioned and the rotation angle of the humerus, a new coordinate of the vertex in a three-dimensional space is calculated through a posture matrix, so that the natural bending action of the arm is realized.
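The way bone weights and gesture (pose) matrices combine can be illustrated with a standard linear-blend-skinning computation; the sketch below is generic, and the 70%/30% weighting in the example is made up for illustration rather than taken from the patent.

```python
import numpy as np

def skin_vertex(rest_position, bone_matrices, bone_weights):
    """Linear blend skinning for one vertex.

    rest_position : (3,) vertex position in the bind pose
    bone_matrices : list of 4x4 pose matrices (rotation/translation/scale of
                    each influencing bone relative to its bind pose)
    bone_weights  : list of weights for those bones, summing to 1
    """
    p = np.append(np.asarray(rest_position, dtype=float), 1.0)  # homogeneous coords
    blended = np.zeros(4)
    for matrix, weight in zip(bone_matrices, bone_weights):
        blended += weight * (matrix @ p)
    return blended[:3]

# Example: a forearm vertex influenced 70% by the ulna/radius and 30% by the
# humerus would pass two pose matrices with weights [0.7, 0.3].
```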
The server acquires a plurality of expression groups from a pre-constructed expression group library. The expression base is constructed through a large amount of expression data collection and analysis. For example, the expression base includes common expressions such as happy, sad, anger, surprise, fear, and the like. Each expression base has its unique characteristics, for example, a happy expression base, which includes expression vertex groups related to happy expression, and these vertex groups cover the expression vertices of the mouth, eyes, eyebrows, etc.
For the expression vertices of the mouth, in the default pose the lips are closed and the corners of the mouth are in a natural position. For the expression vertices of the eyes, in the default pose the eyes are normally open and the eyeballs look straight ahead. The expression vertices of the eyebrows have a natural arc in the default pose.
In the happy expression base, the expression-maximized pose of the mouth expression vertices is that the corners of the mouth are raised sharply upward into a broad smile, with the teeth possibly partially exposed. For the eye expression vertices, in the expression-maximized pose the eyes are slightly squinted and the eyeballs move slightly upward, giving a look of smiling eyes. The eyebrow expression vertices become more stretched in this pose, with a flatter arc. The default pose and the expression-maximized pose of the expression vertices provide the basis for calculating expressions of different degrees according to the expression weight. For example, when the expression weight is 0.5, the pose of the expression vertices of the mouth, eyes and eyebrows lies midway between the default pose and the expression-maximized pose, thereby showing a moderately happy expression.
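Concretely, the pose of each expression vertex is a linear interpolation between its default pose and its expression-maximized pose, driven by the expression weight; the coordinates in the example call below are invented purely for illustration.

```python
import numpy as np

def expression_vertex_position(default_pose, max_pose, weight):
    """Interpolate one expression vertex between its default pose and its
    expression-maximized pose; weight lies in [0, 1]."""
    default_pose = np.asarray(default_pose, dtype=float)
    max_pose = np.asarray(max_pose, dtype=float)
    return default_pose + weight * (max_pose - default_pose)

# weight = 0.5 gives the halfway, moderately happy pose described above.
corner_of_mouth = expression_vertex_position([0.0, 0.0, 0.0],   # closed mouth
                                             [0.6, 1.0, 0.0],   # full smile
                                             0.5)
```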
And the server integrates the model surface patch, the coloring map, the skeleton creation and the skin treatment results obtained in the previous steps with a plurality of expression bases. The modeling face mask and the coloring map determine the appearance image of the digital person, skeleton creation and skin treatment enable the digital person to perform various actions, and expression bases endow the digital person with the capability of expressing various emotions. For example, when a digital person needs to express a happy expression, the server combines the current expression weight according to the expression vertex group information in the happy expression base, and the position of the expression vertex is calculated, so that the mouth, eyes, eyebrows and other parts of the digital person express the happy expression. Meanwhile, when a digital person performs actions, such as walking or waving hands, a skeleton system controls the body to move according to a gesture matrix, a model surface piece of skin and clothes can be naturally deformed according to skeleton weights, and finally, a complete voice interaction digital person which can perform voice interaction and has rich expression and action capacity is constructed.
In the embodiment of the invention, the audio lip movement synchronization configuration is performed on the voice interaction digital person, and the audio lip movement synchronization configuration can be implemented through the following example.
Acquiring sample voice video data;
detecting a face region from the sample voice video data, and identifying a plurality of face feature points from the face region;
Randomly generating a plurality of expression groups, converting the plurality of expression groups into a 2D image, determining face feature points in one-to-one correspondence, and training to obtain a first mapping model for representing the mapping relation between the plurality of expression groups and the face feature points;
determining a face feature point sequence based on the sample voice video data, and determining a predicted expression base corresponding to the face feature point sequence by utilizing the first mapping model;
and acquiring an audio sequence of the sample voice video data, training according to the audio sequence and the predicted expression groups to obtain a second mapping model for representing the mapping relation between the audio data and the expression groups, and completing the audio lip movement synchronous configuration.
In the embodiment of the invention, the server obtains the sample audio-video data from a large database specially used for storing the multimedia data. This database contains a wide variety of video material, covering different people, scenes, languages, and emotional expressions. For example, the database includes news broadcast videos, movie clips, lecture recordings, daily conversation videos, and the like. In order to construct an audio lip movement synchronization configuration suitable for various scenes, a server can select from different types of videos in a targeted manner.
News broadcast video is good training material: the broadcaster's pronunciation is clear, the speech rate is steady, and the facial expressions are moderately rich. The server selects a number of news broadcast clips from different television stations and different broadcasters, covering different news topics: in serious political news the broadcaster's expression is relatively restrained, while in light entertainment news the broadcaster's expression is livelier.
At the same time, movie fragments are also an important sample source. Actors in movies can express a variety of complex emotions and have diverse lip movements and voice coordination in different episodes and contexts. The server selects clips from various types of movies, including drama, comedy, action, etc. In the drama, the actors have deep dialogs, the lips move slowly and are rich in emotion, and in the comedy, the dialogue rhythm of the actors is fast, the expression is exaggerated, and the method can provide rich samples for audio lip movement synchronous configuration.
The recorded video of the lecture has value as well, the lecture content of the lecturer on the table is rich, the voice intonation has fluctuation, and the facial expression corresponds to the lecture content. The server selects lecture videos in different fields, such as professional lectures, inspirations and the like in the technical field, and the matching of lip movements and audios in the lecture process of people in the videos has certain regularity and representativeness.
Daily dialogue video is closer to life reality, and can reflect voice and lip movement of people in a natural state. The server collects daily dialogue videos in various scenes, such as chatting in family parties, street dialogues among friends, and the like, and voices and lips in the videos may not be as standard as news broadcasting or speech, but can more embody diversity in real scenes.
The server processes each frame of image in the sample audio-video data using a face detection algorithm based on deep learning. Taking a news broadcast video as an example, the server breaks the video into images frame by frame and then scans each frame. The algorithm can distinguish which regions are likely to be faces by analyzing the pixel characteristics of the image. It first looks for rough facial cues such as skin-colour areas and the approximate shape of facial contours. In news broadcast video the broadcaster's face is usually in the centre of the picture and the background is relatively simple, so the algorithm can easily locate the approximate area of the face. For videos with complex backgrounds, such as crowd scenes in movie clips or outdoor daily dialogue scenes, the algorithm accurately locates face regions through multi-scale feature extraction and classification, filtering out interference from the background, for example accurately identifying the main character's face in a crowd scene.
After the face region is detected, the server further recognizes a plurality of face feature points using a specific face feature point recognition model. For the face region in each frame of image, the model can accurately locate key face feature points such as the corners of eyes, corners of mouth, nose tips, eyebrows and the like. Taking the mouth angle as an example, the model can accurately find the accurate position of the edge of the lip, and determine the coordinate point of the mouth angle. For the eye part, not only the position of the canthus can be positioned, but also the characteristic points such as the center position of the eyeball can be determined. In the film segment, the actors have various exaggerated expressions, such as large mouth when laughing, the model can accurately identify the characteristic points such as the maximum stretching position of the mouth angle at the moment, and the model can accurately position the characteristic points of eyebrow bending when the presenter slightly wrinkles the eyebrow to express thinking expression.
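The patent does not name a specific detector or landmark model, so the sketch below uses OpenCV's Haar cascade purely as a stand-in for the face-region step; the landmark step (e.g. a 68-point predictor) is assumed to be run separately on each detected region.

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_regions(video_path):
    """Return, for every frame, the (x, y, w, h) boxes of detected faces."""
    cap = cv2.VideoCapture(video_path)
    regions = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        regions.append(faces)   # a landmark model is then run on each face box
    cap.release()
    return regions
```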
And the server randomly generates a plurality of expression groups according to the predefined expression types and parameters. For example, the server sets six basic expression types, happy, sad, angry, surprised, fear and disgust. For the happy expression base, the server randomly determines initial parameters of facial vertices related to the happy expression according to a theoretical model of facial muscle movements. These parameters include the degree of opening of the mouth, the angle of upward movement of the mouth angle, the degree of squinting of the eyes, etc. Similarly, for other expression types, corresponding expression base parameters are randomly generated.
The server converts the randomly generated expression bases into 2D images using graphics rendering techniques. Taking the happy expression base as an example, according to the previously determined parameters the server draws a 2D image in which the mouth presents an upward curve, the corners of the mouth are raised, the eyes are slightly squinted, and the eyebrows are stretched. For 2D images of the anger expression base, the mouth may be tight and curved downward, the eyebrows wrinkled, the eyes glaring, and so on.
And the server matches the generated 2D expression base image with the face feature points which are identified from the sample voice video data. For the 2D image of the happy expression base, the server determines characteristic points such as mouth corners, lips and the like corresponding to the mouth shape and coordinate positions of characteristic points such as eye corners, eye beads and the like corresponding to the eye shape in the image, and corresponds to the facial characteristic points under the normal happy expression extracted from the sample video. With a large number of such expression-based 2D images and corresponding facial feature point data, the server is trained using a machine learning algorithm (e.g., a neural network algorithm). For example, the expression base image data and the corresponding facial feature point coordinate data are used as input, and a first mapping model for representing the mapping relation between a plurality of expression bases and the facial feature points is obtained through the learning and the optimization of a multi-layer neural network. The model can predict the expression groups which are possibly corresponding according to the input face feature points.
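As a minimal sketch of the first mapping model, the regression below maps flattened 2D landmark coordinates to an expression-base weight vector; scikit-learn's MLPRegressor, the 68-point landmark count and the helper names are stand-ins for whatever network and data layout the patent actually uses.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_first_mapping(landmarks, expression_weights):
    """landmarks: (N, 68, 2) landmark coordinates extracted from the rendered
    2D images of the random expression bases; expression_weights: (N, n_bases)
    target weight vectors describing those expression bases."""
    X = np.asarray(landmarks).reshape(len(landmarks), -1)
    y = np.asarray(expression_weights)
    model = MLPRegressor(hidden_layer_sizes=(256, 128), max_iter=500)
    model.fit(X, y)
    return model

def predict_expression_base(model, landmark_frame):
    """Predict the expression-base weights for one frame of face feature points."""
    return model.predict(np.asarray(landmark_frame).reshape(1, -1))[0]
```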
The server processes each frame of image in the sample audio-video data in time sequence. Taking a lecture video as an example, starting from a first frame of the video, the server extracts face feature points in a face area, and records coordinate positions of the feature points, such as position coordinates of corners of a lecturer's mouth, coordinates of corners of eyes, and the like in the first frame. Along with the playing of the video, the coordinate changes of the feature points are recorded frame by frame, so that a face feature point sequence is obtained. During the speech, the presenter may start to calm, the mouth angle is slightly raised, the mouth angle is greatly raised due to agitation along with the advancement of the speech content, the eyes are opened greatly, and the like, and the changes are recorded in the sequence of facial feature points.
And the server inputs the obtained human face characteristic point sequence into a first mapping model. And the model analyzes the input human face characteristic point sequence according to the mapping relation between the expression base and the human face characteristic points learned before. For example, when the face feature point sequence shows features such as gradually rising corners of the mouth and slightly squinting eyes, the first mapping model predicts that the expression group corresponding to the feature point sequence may be a happy expression group according to the learned mapping relation. When the features such as eyebrow tightening and wrinkling, mouth corner bending and the like are displayed in the facial feature point sequence, the model predicts that the corresponding expression group is an anger expression group.
The server extracts an audio sequence from the sample audio-video data using an audio processing tool. For news broadcast video, the server separates the audio part from the video to obtain an audio sequence containing the announcer voice. The audio sequence contains information such as voice content, intonation, speech speed of the announcer, etc. In movie clips, the server also extracts audio sequences of actors' dialects, which may contain speech expressions of different languages, dialects, and various emotional colors.
The server associates the extracted audio sequence with the expression bases predicted from the face feature point sequence. For example, in a news broadcast video, when the broadcaster in the audio sequence reads entertainment news in a cheerful intonation, the corresponding predicted expression base is a happy expression base; when the broadcaster reads political news in a serious intonation, the corresponding expression base may be a solemn expression base (which may be regarded as a variant of a neutral expression base). With a large amount of such audio sequences and the associated predicted expression bases, the server trains a model using machine learning algorithms (e.g., support vector machines or deep neural networks). Taking the audio sequence data and the corresponding predicted expression bases as input, a second mapping model representing the mapping relation between audio data and expression bases is obtained through model learning and optimization. This model completes the audio lip-sync configuration: it can accurately predict which expression base should correspond to the input audio data, so that audio and lip movement are synchronized in the expression of the voice interaction digital person.
In the embodiment of the invention, the personification configuration is performed on the voice interaction digital person, and the voice interaction digital person can be implemented through the following example.
Acquiring a joint topological graph of the voice interaction digital person, wherein the joint topological graph comprises a plurality of root nodes and a plurality of child nodes respectively included by each root node;
Converting an initial rotation quaternion under a world coordinate system into a target rotation quaternion relative to a father node, and performing deep traversal on the root node and the child node to perform assignment;
Controlling a preset animation based on a state machine, acquiring target rotation quaternions of the root node and the child nodes at animation switching time, and realizing smooth transition between the state machine animation and the human body characteristic animation based on quaternion smooth interpolation;
setting a random time to trigger a preset blink animation, and setting an eyeball jitter range, jitter frequency and jitter direction;
And configuring a preset expression base for a preset question-answering text, and triggering the preset expression base when triggering the preset question-answering text.
In the embodiment of the invention, the server acquires the joint topological graph from the skeleton structure data of the digital person in the process of constructing the voice interaction digital person. Taking the human body structure as an example, the skeleton joint topological graph of the digital human has clear hierarchical relationship.
Root nodes include important parts of the head, chest, pelvis, etc. The head is used as a root node and comprises child nodes such as eyes, ears, mouth and the like. The eye sub-nodes are responsible for eye movements of the digital person, such as eye opening, eye closing, eyeball rotation and the like, the ear sub-nodes are relatively less in movements but are also part of a head structure, and the mouth sub-nodes control mouth movements of the digital person, including mouth morphological changes related to expressions such as speaking, smiling, frowning and the like.
The chest serves as the root node, and the child nodes below it have shoulder joints. The shoulder joints in turn connect to the bones of the upper arm, which is the sub-node relationship below the sub-nodes. The upper arm bones are further connected to the forearm bones, ultimately to the wrist and hand. The hand includes a plurality of sub-nodes, such as finger joints, etc., which can perform various gesture actions, such as fist making, pointing, hand engaging, etc.
The pelvis serves as the root node, and its child nodes include the thigh bones. The thigh bone connects to the calf bone and then to the ankle and foot. The foot also includes a plurality of sub-nodes for controlling the bending, stretching, rotation, etc. of the foot to achieve the standing, walking, running, etc. postures of the digital person.
The joint topological graph clearly shows membership and hierarchical structure among joints just like a family tree, and provides a basic framework for subsequent digital human motion control and personification processing.
In the motion simulation of digital people, the world coordinate system is a global reference system. For example, when a digital person is initially created, it has an initial pose in the world coordinate system. Taking a digital human arm as an example, an initial rotation quaternion of the arm in the world coordinate system indicates an initial direction and rotation state of the arm relative to the world coordinate system. This initial rotation quaternion describes this state if the arm is straight down.
The server needs to convert the initial rotation quaternion in this world coordinate system to the target rotation quaternion relative to the parent node. For an arm, the shoulder is the parent node of the arm. The rotation quaternion of the arm relative to the world coordinate system is converted into the rotation quaternion relative to the shoulder, so that motion control in the local coordinate system of the digital person is facilitated. For example, if the whole body of the digital person has an inclination angle, the rotation condition of the arm in the world coordinate system is complex, but after the rotation quaternion relative to the shoulder is converted, the action relation of the arm relative to the body can be more intuitively represented.
The server traverses the root nodes and child nodes in the joint topology map using a depth-first traversal algorithm. Starting from the root node, for example, from the root node of the pelvis, the depth traverses to the child nodes such as the thigh, the calf, the ankle, etc. And for each node, assigning the converted target rotation quaternion. In the traversing process, taking a leg joint as an example, when the thigh skeleton node is assigned and then the traversing is continued to the lower leg skeleton node, the rotation quaternion of the lower leg skeleton node is assigned according to the state of the thigh skeleton node and the relation of the lower leg skeleton node and the lower leg. Thus, through the deep traversal of the entire joint topology graph, each joint node is endowed with the correct target rotation quaternion relative to the father node, thereby laying a foundation for the accurate action simulation of the digital person.
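A minimal sketch of this conversion and traversal, assuming SciPy for quaternion arithmetic; the joint names, topology and rotations here are illustrative placeholders, not the actual skeleton of the digital person.

```python
# Minimal sketch: convert each joint's world-space rotation into a rotation relative
# to its parent, then assign it by depth-first traversal of the joint topology graph.
from scipy.spatial.transform import Rotation as R

# Hypothetical joint topology: child -> parent (None for root nodes).
parents = {"pelvis": None, "thigh": "pelvis", "calf": "thigh", "foot": "calf"}
children = {"pelvis": ["thigh"], "thigh": ["calf"], "calf": ["foot"], "foot": []}

# Hypothetical world-space rotations: the whole body is tilted 10 degrees, and the
# thigh is additionally bent 30 degrees relative to the pelvis.
pelvis_w = R.from_euler("z", 10, degrees=True)
thigh_w = pelvis_w * R.from_euler("x", 30, degrees=True)
world_q = {"pelvis": pelvis_w, "thigh": thigh_w, "calf": thigh_w, "foot": thigh_w}

def to_local(joint: str) -> R:
    """qLocal = inverse(qWorld_parent) * qWorld_joint; root joints keep qWorld."""
    parent = parents[joint]
    if parent is None:
        return world_q[joint]
    return world_q[parent].inv() * world_q[joint]

def assign_depth_first(joint: str, local_out: dict) -> None:
    # Assign this joint's local rotation, then recurse into its children.
    local_out[joint] = to_local(joint)
    for child in children[joint]:
        assign_depth_first(child, local_out)

local_q = {}
for root in (j for j, p in parents.items() if p is None):
    assign_depth_first(root, local_q)
print({j: q.as_quat() for j, q in local_q.items()})  # thigh comes out as the pure 30-degree bend
```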
The server sets a plurality of preset animation states for the digital person, such as standing, walking, running, sitting, waving hands, and the like, and the states are controlled by a state machine. Taking the example of switching from standing to walking, the state machine manages this switching process. When the digital person receives the instruction to walk, the state machine starts triggering the animated transition from standing to walking.
In this process, the server needs to acquire the target rotation quaternion of each joint at the animation switching time point. At the moment of switching from standing to walking, the pelvis serves as the root node, the rotation quaternion of which changes from a relatively stable state in standing to a rhythmic swing state in walking. The child nodes of the thigh, calf, etc. will also change their rotational quaternions accordingly to effect a transition from stationary to moving.
The server uses a quaternion smooth interpolation algorithm to ensure a smooth transition between the state machine animation and the human feature animation. For example, during the standing-to-walking transition, assume that the digital person is in a standing state at the first frame, with each joint having the rotation quaternion corresponding to the stationary pose. When the switch to the walking state begins, in the intermediate frames of the animation switch the server calculates the rotation quaternions of the intermediate states through quaternion smooth interpolation.
Taking the knee joint as an example, if the knee jumped directly from the straightened state in standing to the bent state in walking, the motion would look abrupt. Through quaternion smooth interpolation, the server calculates, from the rotation quaternion of the knee in the standing state and the target rotation quaternion of the knee in the walking state, the rotation quaternions the knee should have in the intermediate frames, so that the knee motion transitions smoothly from straight to bent, just like the natural transition of a real person from standing to walking. This smooth transition is applied not only to the leg joints but also to other joints of the body, such as arm swings and slight head movements, so that the animation of the whole digital person looks more natural and fluid.
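A minimal sketch of quaternion smooth interpolation between two joint poses; the standing and walking quaternions and the frame count are illustrative assumptions.

```python
# Minimal sketch: spherical linear interpolation (slerp) between a standing-pose
# rotation q1 and a walking-pose rotation q2 for one joint, quaternions as (x, y, z, w).
import numpy as np

def slerp(q1: np.ndarray, q2: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between unit quaternions q1 and q2, t in [0, 1]."""
    q1, q2 = q1 / np.linalg.norm(q1), q2 / np.linalg.norm(q2)
    dot = float(np.dot(q1, q2))
    if dot < 0.0:          # take the short path on the quaternion sphere
        q2, dot = -q2, -dot
    if dot > 0.9995:       # nearly parallel: fall back to normalized lerp
        out = q1 + t * (q2 - q1)
        return out / np.linalg.norm(out)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q1 + np.sin(t * theta) * q2) / np.sin(theta)

q_stand = np.array([0.0, 0.0, 0.0, 1.0])                    # knee straight (identity)
q_walk = np.array([np.sin(0.35), 0.0, 0.0, np.cos(0.35)])   # knee bent roughly 40 degrees

# Insert 5 intermediate frames so the transition looks continuous.
frames = [slerp(q_stand, q_walk, t) for t in np.linspace(0.0, 1.0, 7)]
```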
The server sets a blink animation trigger mechanism for the eyes of the digital person. The server has a random time generator inside, and the random time generator randomly determines the time point for triggering the blink animation within a certain time range. For example, it is set to trigger a blink animation randomly every 10 to 30 seconds. This random time mechanism may continue to operate while the digital person is in conversation or stationary. Once the set random time is reached, a blink animation is triggered.
For the movement of the eyeballs, the server sets detailed parameters. Taking the shake range of the eyeball as an example, it is set to between -0.1 cm and 0.1 cm in the horizontal direction and between -0.05 cm and 0.05 cm in the vertical direction. This means that the eyeball can make small random shakes within this range.
The dithering frequency is set to 1 to 3 times per second. When the blink animation is triggered or in the normal 'gazing' state of the digital person, eyeballs shake according to the frequency.
The shake direction is determined randomly. It can be a simple horizontal shake, a vertical shake, or an oblique shake at some angle. For example, after a blink animation, the eyeball may shake horizontally by a small amount at a frequency of 2 times per second for 1 to 2 seconds and then shake vertically at a frequency of 1 time per second. This random shake direction and frequency makes the eyes of the digital person look more vivid and lively, just as real human eyes have subtle, irregular movements.
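A minimal sketch of the random blink trigger and eyeball shake parameters described above; the concrete ranges, units and update loop are illustrative assumptions.

```python
# Minimal sketch: random blink timer plus per-frame eyeball jitter within the
# configured ranges.
import random
import time

BLINK_MIN_S, BLINK_MAX_S = 10.0, 30.0          # random blink every 10-30 s
H_RANGE = (-0.1, 0.1)                          # horizontal jitter range (cm)
V_RANGE = (-0.05, 0.05)                        # vertical jitter range (cm)
JITTER_HZ = random.uniform(1.0, 3.0)           # 1-3 jitters per second

next_blink = time.time() + random.uniform(BLINK_MIN_S, BLINK_MAX_S)

def update_eyes(now: float) -> dict:
    """Return the eye commands for this frame: maybe a blink, plus a jitter offset."""
    global next_blink
    blink = False
    if now >= next_blink:
        blink = True                            # trigger the preset blink animation
        next_blink = now + random.uniform(BLINK_MIN_S, BLINK_MAX_S)
    # Random jitter direction within the configured ranges.
    offset = (random.uniform(*H_RANGE), random.uniform(*V_RANGE))
    return {"blink": blink, "eye_offset_cm": offset, "jitter_hz": JITTER_HZ}

print(update_eyes(time.time()))
```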
The server has a preset question-and-answer text library, whose texts are determined according to common interaction scenarios when the digital person is designed. For example, the question-and-answer text library contains texts such as "Hello", "How are you doing today?" and "Can you introduce yourself?".
For each preset question-and-answer text, the server configures a corresponding preset expression base. For the question "How are you doing today?", the server configures a "friendly" expression base. This expression base may contain slightly raised mouth corners, relaxed eye muscles, etc., to convey a friendly, positive attitude. For the question "Can you introduce yourself?", a "confident" expression base is configured, which may be manifested as a slightly lifted head, firm eyes, a proper smile at the corners of the mouth, etc.
When the digital person receives question-and-answer text input by the user, if the text matches one of the entries in the preset question-and-answer text library, the corresponding preset expression base is triggered. For example, when the user asks "How are you doing today?", the digital person presents the expression corresponding to the previously configured "friendly" expression base, so that the interaction of the digital person is more personified and the interaction experience with the user is enhanced.
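A minimal sketch of matching user text against the preset question-and-answer library and looking up the configured expression base; the library contents are illustrative assumptions.

```python
# Minimal sketch: preset question-and-answer texts mapped to expression bases.
PRESET_EXPRESSIONS = {
    "hello": "friendly",
    "how are you doing today": "friendly",      # slightly raised mouth corners
    "can you introduce yourself": "confident",  # lifted head, firm eyes
}

def expression_for(text: str):
    key = text.lower().strip(" ?!.")
    return PRESET_EXPRESSIONS.get(key)          # None -> no preset expression triggered

print(expression_for("How are you doing today?"))  # -> "friendly"
```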
In the embodiment of the invention, rendering the configured voice interaction digital person to the preset front-end page via real-time cloud rendering can be implemented as in the following example.
Carrying out the matting processing on the voice interaction digital person to obtain a voice interaction digital person after matting;
establishing a WebSocket connection with the preset front-end page, and encapsulating a character image switching interface, an audio lip movement synchronous playing interface and an action driving interface;
and displaying the voice interaction digital person after the image matting to a display area provided by the preset front-end page.
In the embodiment of the invention, when the server performs the matting processing on the voice interaction digital person, an appropriate matting algorithm is selected firstly. According to image characteristics of digital people, such as color distribution, edge definition and other factors, a semantic segmentation algorithm based on deep learning is selected. The algorithm can accurately distinguish the digital person from the background through training of a large amount of image data.
Before matting, the server needs to preprocess the image data of the voice interaction digital person. This process includes adjusting the resolution, color pattern, etc. of the image to ensure that the format of the image data meets the requirements of the matting algorithm. For example, if the matting algorithm requires that the input image be in RGB color mode and the resolution be of a particular size, the server will convert and adjust the digital person's image accordingly.
The server inputs the preprocessed digital human image into a semantic segmentation algorithm. With a voice interactive digital person standing in a virtual indoor scene, the algorithm classifies each pixel in the image and determines whether it belongs to the digital person body or the background part. For the skin part of the digital person, the algorithm can identify the skin part as a part of the digital person according to the characteristics of the color, the texture and the like of the skin, and for the clothes of the digital person, the clothes are distinguished according to the difference between the color and the style of the clothes and the background. In the context of an indoor scene, e.g. the color of a wall, the shape of furniture, etc. are identified as background elements.
During semantic segmentation, the algorithm generates a segmentation mask, which is an image of the same size as the original image, wherein the digital human body part is marked in one color (e.g., white) and the background part is marked in another color (e.g., black). With this segmentation mask, the server can clearly determine the exact extent of the digital person in the image.
The server removes a background portion from the original image according to the segmentation mask. For pixels marked as background, the server sets them as transparent pixels (in image format supporting transparent channels). Thus, the voice interaction digital person after the image is scratched is obtained, and the digital person looks like being independently extracted from the original background, can be conveniently placed on any other background or is displayed in an independent mode in a front page, so that the universality and the display effect of the digital person are enhanced.
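A minimal sketch of applying a segmentation mask to obtain a transparent-background image of the digital person, assuming NumPy and Pillow; the file names are placeholders.

```python
# Minimal sketch: turn a segmentation mask into a transparent-background RGBA image.
import numpy as np
from PIL import Image

frame = np.array(Image.open("digital_human_frame.png").convert("RGB"))
mask = np.array(Image.open("segmentation_mask.png").convert("L"))   # white = person

# Alpha channel: opaque where the mask marks the digital person, transparent elsewhere.
alpha = np.where(mask > 127, 255, 0).astype(np.uint8)
rgba = np.dstack([frame, alpha])
Image.fromarray(rgba, mode="RGBA").save("digital_human_matted.png")
```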
The server first determines the address and port information of the preset front-end page. Assuming that the preset front-end page runs on a specific port (e.g., port 8080) in the local development environment, the server initiates a WebSocket connection request according to this information. The server creates a WebSocket connection object that is responsible for managing the bi-directional communication with the front-end page.
During connection establishment, the server and the front-end page exchange some handshake messages. For example, the server transmits a connection request message containing its own identification information and contents such as the supported communication protocol version. After the front-end page receives the request, it replies with a confirmation message indicating that it agrees to establish the connection, and informs the server of relevant information about the front-end page, such as the page layout and the supported digital person display formats. After these interactions, the WebSocket connection is successfully established, as if a bridge for real-time communication had been set up between the server and the front-end page.
Once a connection is established, the server will continuously monitor the status of this connection. If a connection interruption occurs, for example, due to a network failure or an unexpected closing of a front page, the server may attempt to reestablish the connection. Meanwhile, the server can adjust the strategy of data transmission according to the network condition, such as reducing the frequency of data transmission or adopting a data compression technology when the network is congested, so as to ensure that data can be stably and efficiently transmitted between the server and the front-end page.
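A minimal sketch of such a WebSocket endpoint on the server side, assuming the Python websockets package (handler signature as in recent versions of the library); the port, handshake fields and message handling are illustrative assumptions.

```python
# Minimal sketch: a WebSocket endpoint the front-end page connects to, with a simple
# handshake and a loop that receives interface calls from the page.
import asyncio
import json
import websockets

async def handle_page(ws):
    # Simple handshake: the page announces its layout, the server acknowledges.
    hello = json.loads(await ws.recv())
    print("front-end page info:", hello)
    await ws.send(json.dumps({"type": "ack", "server": "digital-human-render"}))
    try:
        async for message in ws:
            # Route interface calls (image switch, audio playback, action drive) here.
            print("front-end request:", message)
    except websockets.ConnectionClosed:
        print("front-end disconnected; the page may reconnect later")

async def main():
    async with websockets.serve(handle_page, "0.0.0.0", 8080):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```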
The server defines the function and data format of the character image switching interface internally. This interface allows the front-end page to request a change in the appearance of the voice interactive digital person. For example, the server sets switchable graphic elements for a number of hairstyles, clothing styles, etc. for the digital person.
When this interface is encapsulated, the server determines the way the interface is invoked, such as by sending a message in a specific JSON format to trigger the image switch. The message may contain identification information of the character element to be switched, such as "hairstyle 1", "clothing style 2", etc. After receiving the character image switching request sent by the front-end page, the server searches corresponding digital character image resources in the internal database according to the identification information in the request, and then sends the updated digital character image data to the front-end page.
For the audio lip-sync playing interface, the server is required to ensure that the audio playing of the digital person and the lip can be accurately synchronized. When the server encapsulates this interface, the transmission format of the audio data and the corresponding lip-sync data is defined.
When a digital person speaks, the server will divide the audio data into individual small audio pieces and tag each audio piece with corresponding lip sync information. For example, for an audio segment containing a particular phoneme, the server may tag the digital person's lip-deformation data corresponding to that phoneme. When the front-end page requests to play the audio through the interface, the server can send the audio data and the lip movement synchronization data to the front-end page together according to the pre-marked synchronization information, so that the front-end page can accurately display the audio lip movement synchronization effect of the digital person.
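A minimal sketch of pairing small audio segments with their lip movement (bs) frames; the frame length, sample format and packet fields are illustrative assumptions.

```python
# Minimal sketch: split PCM audio into small slices and tag each slice with the
# blendshape weight frame the front end needs for lip synchronization.
FRAME_MS = 40                      # one lip-sync frame per 40 ms of audio (25 fps)
BYTES_PER_MS = 32                  # 16 kHz, 16-bit mono PCM = 32 bytes per ms

def package_audio(pcm: bytes, bs_frames: list) -> list:
    """Pair each 40 ms PCM slice with its 52-value blendshape weight frame."""
    chunk = FRAME_MS * BYTES_PER_MS
    packets = []
    for i, bs in enumerate(bs_frames):
        audio_slice = pcm[i * chunk:(i + 1) * chunk]
        if not audio_slice:
            break
        packets.append({"seq": i, "audio": audio_slice, "bs_weights": bs})
    return packets

packets = package_audio(b"\x00" * 2560, [[0.0] * 52, [0.2] * 52])
print(len(packets), "tagged audio segments")
```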
The server encapsulates the action driven interface to control various actions of the digital person. The interface receives action instructions sent by a front-end page, such as action instructions of walking, waving hands, nodding, and the like.
The server internally maps these action instructions into the digital person's skeletal animation system. For example, when a "walking" instruction is received, the server calculates the target position and rotation angle of each skeletal joint in the walking process according to the walking animation parameters, such as stride, stride frequency, body posture, etc., previously set for the digital person. And then the articulation data are sent to a front-end page through an action driving interface, and the front-end page displays the walking action of the digital person according to the received data.
The server obtains information about the presentation area from the communication established with the front-end page. Assuming that the preset front-end page is an HTML5 page, the presentation area is a <div> element with a specific ID, for example <div id="digitalhumandisplay">. The size, position and style of this presentation area are defined in the CSS style sheet of the front-end page, e.g. 500px wide and 600px high, centered in the page, white background color, etc.
The server compresses the image data of the voice interaction digital person after the image is scratched, so that the data transmission quantity is reduced. For example, an image compression algorithm in a JPEG or PNG format is adopted to compress the size of image data to a suitable range on the premise of ensuring the image quality.
The server then sends the compressed image data to the front-end page via the WebSocket connection. Upon receipt of the image data, the front-end page creates an <img> tag within the presentation area (for static presentation) or uses a JavaScript drawing API (for dynamic presentation, such as animation effects) to display the image of the digital person.
During the interaction of the digital person with the user, for example, when the digital person speaks or otherwise acts, the server will continually update and send the relevant data to the front page. If the digital person makes a new action, the server sends the articulation data corresponding to the action to the front-end page, and the front-end page updates the display effect of the digital person according to the data. Similarly, when the expression of the digital person changes, the server also sends the expression related data to the front-end page, so that the digital person can display various expressions and actions in real time in the display area of the front-end page, and a vivid and lifelike voice interaction digital person experience is provided for the user.
In the embodiment of the invention, generating the interactive feedback of the voice interactive digital person through the audio lip movement synchronization configuration and the personification configuration based on the voice interaction input can be implemented as in the following example.
Receiving the voice interaction input, and performing text conversion on the voice interaction input to obtain input text data;
inputting the input text data into a pre-trained conversational language big model to obtain feedback text data;
performing audio conversion on the feedback text data to obtain feedback audio data;
And processing the feedback audio data through the audio lip movement synchronous configuration and the personification configuration to obtain the interactive feedback of the voice interactive digital person.
In an embodiment of the invention, the server receives voice interaction input through the WebSocket connection established with the front-end page. When a user speaks into the microphone device on the front-end page, the front-end page collects the user's voice and sends the collected voice data to the server in real time through the WebSocket connection. For example, the user speaks into the microphone: "Can you tell me an interesting story?" The front-end page converts the captured voice into a specific audio format (e.g., PCM format) and transmits it to the server side.
After receiving the voice data, the server invokes a pre-configured speech recognition engine. The speech recognition engine is trained on a large amount of speech data and is capable of recognizing multiple languages and accents. The server first initializes the speech recognition engine and loads the relevant language model and acoustic model. Taking English recognition as an example, the server loads a language model containing information such as English words and grammar, and an acoustic model based on English speech features.
The server inputs the received voice data into the speech recognition engine. The speech recognition engine analyzes the voice data, matches acoustic features in the speech signal to the acoustic model, and combines the language model to determine the most likely text content. For the user's utterance "Can you tell me an interesting story?", the speech recognition engine recognizes the characters corresponding to each syllable based on features such as phonemes, intonation and speaking rate, and finally converts the whole utterance into accurate text, yielding the input text data "Can you tell me an interesting story?".
The server preprocesses the resulting input text data "Can you tell me an interesting story?". This preprocessing may include cleaning the text to remove unnecessary punctuation or special characters and then formatting the text as required by the input of the conversational large language model. For example, the text is converted to a uniform encoding format (e.g., UTF-8) and some necessary identification information is added, such as a flag indicating that this is a question request from the user.
The server sends the preprocessed input text data to the pre-trained conversational large language model. This large language model may be based on a Transformer architecture and pre-trained with a large amount of text data (e.g., news articles, novels, encyclopedias, etc.). For example, the server sends the data to a large language model similar to GPT-3.
After receiving the input text, the large language model processes it according to its internal neural network structure and pre-trained knowledge. For the question "Can you tell me an interesting story?", the large language model searches its vast knowledge base for suitable story content. Suppose the model finds an interesting story about a small fox and a small rabbit; it then generates corresponding feedback text data such as "Once upon a time there was a small fox and a small rabbit, and they lived in the forest."
The server will select the appropriate audio synthesis technique when converting the feedback text data to audio. For example, the server may use deep learning based speech synthesis techniques that are capable of generating natural, smooth, emotional speech. The server can select corresponding voice tone according to the character characteristics of the digital person, such as age, gender and other factors. If the digital person is a young female character, the server will select a young female voice timbre model.
The server first analyzes the feedback text data. For the text "Once upon a time there was a small fox and a small rabbit, and they lived in the forest.", the server analyzes the grammatical structure, semantic information, punctuation, etc. in the text. For example, the pause positions of the sentence are determined by analyzing punctuation marks, and the rise and fall of intonation is determined according to the semantics.
The server generates audio using the selected speech timbre model and audio synthesis technique based on the analysis results. During generation, each word and syllable is assigned a corresponding pitch, duration and intonation according to its position and importance in the sentence. For example, the key characters "small fox" and "small rabbit" in the story are highlighted with a slightly higher pitch and slightly longer duration, while descriptive phrases such as "in the forest" are expressed in a more even intonation. Finally, the server converts the entire feedback text data into feedback audio data that contains the speech content of the story with appropriate intonation, speaking rate, etc.
The server inputs the feedback audio data into the second mapping model of the audio lip movement synchronization configuration (trained during the earlier audio lip movement synchronization configuration). The model determines the expression base corresponding to the audio data according to features such as phonemes and intonation in the audio. For example, for audio telling an interesting story, the model may determine a "smiling" expression base, because the atmosphere of the story is relaxed and pleasant.
The server generates lip movement data according to the determined expression base and the facial structure data of the digital person. This lip movement data details how the lips of the digital person should move when the feedback audio data is played. For example, when the word "small fox" appears in the audio, the lip movement data specifies the changes in lip shape that exactly match the phonemes used to pronounce the word.
According to the personification configuration, the server triggers some actions of the digital person based on factors such as the story content and the intonation of the audio. For example, when telling the scene of the small fox and the small rabbit running in the forest, the server triggers the "running" action of the digital person according to the previously set animation rules. The triggering of this action is based on the semantic content of the story and the rhythm of the audio, so that the digital person's actions match the story being told.
In addition to action triggering, the server processes the expression and gaze of the digital person according to rules such as the blink animation setting and eyeball shake in the personification configuration. During story telling, the preset blink animation is triggered at random times, and the shake range, frequency and direction of the eyeballs are adjusted according to the emotional atmosphere of the story. For example, in a relaxed and pleasant part of the story, the shake frequency of the eyeballs may be slightly faster to show liveliness, while in a tense plot the shake range may be slightly larger and the gaze more focused.
The server integrates the generated lip movement data, action data, expression data, gaze data, etc. with the feedback audio data to obtain the interactive feedback of the voice interactive digital person. The interactive feedback comprises the complete performance of the digital person when telling the story, including audio playback, lip movement synchronization, body actions, expression changes, gaze dynamics and other aspects, providing the user with a vivid and personified voice interaction experience.
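A minimal sketch of the assembled interactive feedback object; the field names and values are illustrative, not the patented data format.

```python
# Minimal sketch: bundle audio, lip movement, action, expression and gaze data into
# one interactive-feedback structure.
from dataclasses import dataclass, field

@dataclass
class InteractionFeedback:
    audio_pcm: bytes                                    # feedback audio data (TTS output)
    lip_frames: list = field(default_factory=list)      # per-frame bs weights
    action_frames: list = field(default_factory=list)   # per-frame joint quaternions
    expression: str = "neutral"                         # expression base chosen for the reply
    eye_params: dict = field(default_factory=dict)      # jitter range/frequency/direction

feedback = InteractionFeedback(
    audio_pcm=b"\x00" * 640,
    lip_frames=[[0.0] * 52],                            # from the second mapping model
    action_frames=[],                                   # e.g. a "running" clip for the story scene
    expression="smiling",
    eye_params={"hz": 2.0, "range_cm": (0.1, 0.05)},
)
print(feedback.expression)
```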
In the embodiment of the invention, rendering the interactive feedback of the voice interactive digital person to the preset front-end page in real time via the cloud can be implemented as in the following example.
Rendering the interactive feedback of the voice interactive digital person to obtain a rendering result;
and the voice interaction digital person displays the rendering result to the preset front-end page through a WebSocket connection.
In the embodiment of the invention, the server firstly sorts and analyzes the data after receiving the interactive feedback of the voice interactive digital person. The interactive feedback includes a plurality of portions including audio data, lip movement data, motion data, expression and eye data, and the like. The server parses the structure and content of these data, e.g. for lip movement data it specifies the shape and position the lips of the digital person should take at the moment corresponding to each audio frame, for motion data it details the information of the target rotation quaternion, displacement etc. of the various skeletal joints at different points in time, and expression and eye movement data specifies details of facial expression and eye movement.
The server may also perform some preprocessing based on the current status of the digital person and the previous interaction history. For example, if the digital person is in a static state before, the story is to be told and the corresponding action is to be performed, the server adjusts the initial action data according to the transition rule from static to action start, so as to ensure that the initial stage of the action looks natural and smooth.
The server prepares resources required for rendering, including loading three-dimensional model data of the digital person, texture maps, skeletal animation data, and the like. The data is stored in a storage system of the server, which may be a local hard disk or a distributed file system. Taking three-dimensional model data of a digital person as an example, the server reads data such as vertex coordinates, patch information, bone structures and the like of the model from storage into a memory. For texture mapping, such as skin texture of a digital person, clothes texture and the like, the texture mapping is also loaded into a memory and necessary decompression and preprocessing operations are performed, so that texture information can be obtained quickly in a rendering process.
At the same time, the server initializes the rendering engine. If a rendering engine based on OpenGL or DirectX is used, parameters such as rendering context, viewport size, projection matrix and the like are set. For example, the view port size is set according to the size of the front page display region, and if the front display region is 800×600 pixels, the view port of the rendering engine is set to this size accordingly, so as to ensure that the rendering result can be correctly adapted to the display region of the front page.
The server uses the audio rendering module to render the audio data in the interactive feedback. This module will perform decoding operations according to the format of the audio data (e.g., MP3 or WAV) and will convert the digital audio signal into an analog audio signal that can be played (if sound is ultimately to be played on the user device). During the rendering process, some effect processing such as volume adjustment, noise reduction processing, etc. is performed on the audio. For example, if a digital person needs to adjust the volume according to the plot when telling a story, the server will dynamically adjust the volume when rendering audio according to preset rules. If there is a sudden loud plot in the story, the server will increase the volume of that portion as the audio is rendered to enhance the expressive power of the story.
The server renders the face of the digital person according to the lip movement data and the expression data. For the lip portion, the shape of the lips is precisely adjusted according to lip shape parameters at each time point in the lip movement data using a morphing technique in three-dimensional modeling software, such as BlendShape technique. For example, when syllables that make an "o" sound appear in audio, the lips are rendered into circles according to the lip movement data. Meanwhile, the movements of other muscles of the face, such as slight bulge of cheeks, upward or sagging of eyebrows, and the like, are adjusted according to the expression data so as to accurately represent the expression of the digital person.
Based on the motion data, the server renders the physical motion of the digital person. For movement of the bone joints, the server moves and rotates the bone in three dimensions according to the target rotational quaternion and displacement information for each joint. Taking the arm swing of the digital person as an example, the server gradually adjusts the posture of the arm skeleton according to the rotation angle of the arm skeleton in each time frame in the motion data, and meanwhile, the skinning algorithm calculates the deformation of skin and muscle according to the motion of the skeleton, so that the arm of the digital person can look to swing naturally. When the walking action is rendered, the server can enable the digital person to walk with correct stride and rhythm according to the displacement data of the footsteps and the rotation data of the leg joints, and the gravity center of the body can be reasonably transferred along with the movement of the footsteps.
Based on the eye data, the server renders eye movements of the digital person. The position of the eyeball is adjusted in each frame according to the jitter range, frequency and direction data of the eyeball. For example, if the eye data specifies that the eye is slightly jittered to the upper left at a certain moment, the server will adjust the coordinates of the eye according to the corresponding direction and amplitude at the time of rendering, and at the same time, adjust the luster and focus of the eye according to the expression and storyline of the digital person, so that the eye looks more vivid and affective.
The server integrates the audio rendering result and the visual element rendering result to obtain a complete rendering result. The rendering result is a comprehensive piece of data containing the audio and visual representation of the digital person at a specific moment. For example, in one frame, the rendering result contains all the information of the sound of the story, the shape of the lips, the expression of the face, the motion of the body, the position of the eyes, and so on, combined into a data format that can be understood and presented by the front-end page, such as a JSON format or a binary stream format.
The server packages the rendering results so that they can be efficiently transferred to the front-end page over the WebSocket connection. If the rendering result is a large amount of data, the server may employ data compression techniques, such as a lossless compression algorithm (e.g., ZIP compression) or a lossy compression algorithm (e.g., JPEG compression for image data), to reduce the size of the data transmission. For example, if the image data in the rendering result occupies a large space, the server compresses the image data, and the front-end page can decompress and restore the image after receiving the data.
During the packaging process, the server may also add some necessary metadata, such as version information of the data, a timestamp, etc. These metadata help the front-end page to properly parse and process the received data. For example, the version information may enable the front-end page to know whether it can support the received data format, and if not, the user may be prompted to update or take other compatible measures, and the timestamp may be used to determine the timeliness of the data by the front-end page, such as whether it is the latest rendering result, etc.
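A minimal sketch of packaging a rendering result with metadata and lossless compression before sending it over the WebSocket connection; the envelope fields and compression choice are illustrative assumptions.

```python
# Minimal sketch: wrap a rendering result in a metadata envelope and compress it.
import json
import time
import zlib

def package_render_result(result: dict) -> bytes:
    envelope = {
        "version": "1.0",                 # lets the page check format compatibility
        "timestamp": time.time(),         # lets the page judge timeliness
        "payload": result,                # audio and visual data for this frame
    }
    raw = json.dumps(envelope).encode("utf-8")
    return zlib.compress(raw, level=6)    # the page inflates and parses it

packet = package_render_result({"frame": 1024, "lip": [0.2] * 52, "audio_seq": 1024})
print(len(packet), "bytes to transmit")
```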
The server sets the parameters of data transmission according to the current state of the WebSocket connection and the receiving capability of the front-end page. For example, the rate of data transmission, the block size, etc. are determined. If the network bandwidth of the front-end page is limited, the server can reduce the transmission rate and divide the rendering result into smaller blocks for transmission, so as to avoid network congestion and data loss. Meanwhile, the server may set transmission priorities; for example, audio data may have a higher priority and its transmission is preferentially ensured, because if the audio playback is not smooth, the interactive experience of the user is seriously affected.
The server sends the packaged and optimized rendering result to the preset front-end page through the WebSocket connection. During transmission, the server continuously monitors the progress and status of the transmission. If a network failure or transmission error occurs, the server may attempt to retransmit portions of the data or adjust the transmission policy. For example, if a data block is lost during transmission, the server detects this error via a checksum or similar mechanism and then resends the lost data block.
After receiving the rendering result sent by the server, the front-end page first unpacks and parses it. The front-end page separates the audio data, visual element data, etc. in the rendering result according to the metadata and format information in the data. The audio data can be sent to an audio playback device (such as a loudspeaker or earphones) for playback, while visual element data such as digital human images and animation data are rendered and displayed in the display area using front-end technologies such as JavaScript. For example, the front-end page uses an HTML5 <canvas> element or WebGL technology to draw the image of the digital person in the display area according to the three-dimensional model data, action data, expression data, etc. of the digital person, and synchronously displays the changes in the digital person's lip movements, actions, expressions, gaze and so on according to the playback progress of the audio, thereby presenting the complete interactive feedback of the voice interactive digital person to the user.
In order to more clearly describe the solution provided by the embodiments of the present invention, a more complete implementation is provided below.
1. Manufacturing high-definition digital people:
(1) The making of the digital person model in the modeling software Blender includes:
a. In the 3D virtual world, a model is composed of many triangular (or rectangular) patches. The coordinate information and indices of all vertices are stored in the model in sequence; the shader program reads the vertex coordinates in order and renders each triangular patch in turn, and the triangular patches together form the model, so the more triangles there are, the finer the model. For example, vertices indexed 1, 2, 3, 4 are rendered in index order into a small rectangle m1, vertices indexed 5, 6, 7, 8 into a small rectangle m2, and all the patches of the model are rendered in the same way.
b. UV mapping, i.e., coloring the model, is completed using the UV editor in Blender. The UV mapping process effectively tells the triangular patches forming the model at which UV coordinates to sample color, and the shader program can obtain pixels on the texture map through the UV coordinates and render them;
c. Creating the human skeleton and skinning: create the human skeleton nodes and paint skeleton weights onto the vertices. According to the principle of the linear blend skinning algorithm, the position of a deformed vertex p is the weighted linear combination of the affine transformations Tj of the control joints, i.e., the transformation matrices (rotation and displacement) Tj of all joints affecting p are each multiplied by the weight wj(p) and summed to obtain the new pose (rotation and displacement) of p: p' = Σj wj(p) · Tj · p (a numeric sketch of this formula and the blendshape blending appears after item d);
d. BlendShape expression base preparation (the ARKit BlendShapes scheme is used in this embodiment, abbreviated as bs below);
ARKit BlendShapes consist of 52 basic expressions. An expression base can be understood as a vertex group: one expression base is a vertex group composed of a group of vertices. Each vertex p in a vertex group (p ∈ P) has a default pose T1 and an expression-maximized pose T2; given a weight value w, the pose of p at that moment can be calculated (by interpolating between T1 and T2 according to w), and the poses of all vertices together form the pose of the expression base (vertex group).
When the poses of a group of expression bases (s ∈ S) are calculated according to a group of weights (w ∈ W), the lip shape and expression at a given moment are obtained; by refreshing this every frame, lip animation is realized.
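A minimal numeric sketch of the two formulas above, linear blend skinning from item c and blendshape pose blending from item d, with made-up vertex, joint and weight values for illustration.

```python
# Minimal sketch: linear blend skinning (p' = sum_j w_j(p) * T_j * p) and blendshape
# pose blending from the default and expression-maximized poses.
import numpy as np

def linear_blend_skinning(p: np.ndarray, transforms: list, weights: list) -> np.ndarray:
    """Deform vertex p (3,) by the weighted sum of joint affine transforms (4x4)."""
    p_h = np.append(p, 1.0)                       # homogeneous coordinates
    out = sum(w * (T @ p_h) for w, T in zip(weights, transforms))
    return out[:3]

def blend_expression(default_pose: np.ndarray, max_pose: np.ndarray, w: float) -> np.ndarray:
    """Pose of one expression-base vertex for weight w in [0, 1]."""
    return default_pose + w * (max_pose - default_pose)

# Two illustrative joint transforms affecting one vertex.
T_hip = np.eye(4)
T_knee = np.eye(4)
T_knee[:3, 3] = [0.0, -0.05, 0.0]                 # small downward displacement
print(linear_blend_skinning(np.array([0.0, 1.0, 0.0]), [T_hip, T_knee], [0.7, 0.3]))

# One blendshape vertex at half weight.
print(blend_expression(np.zeros(3), np.array([0.0, 0.01, 0.0]), 0.5))
```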
(2) Create a scene in the rendering engine Unity 3D, import the model, and create the related materials.
Character skin material: create an SSS (subsurface scattering) skin material Shader Graph. Clothing material: create a clothing material Shader Graph. Image post-processing: set up a Global Volume in the scene.
2. Implementing sound lip synchronization:
(1) Writing the lip movement code based on expression base (bs) data driving;
ASR, language model, TTS, and lip movement algorithm principle:
From the audio-video data, the image recognition algorithm extracts facial feature point data; audio phonemes are obtained using ASR; from these, the mapping relationship between phonemes and facial feature points is established.
Voice video data:
The lip movement algorithm needs a large amount of video as training data. The videos must have clear facial images and clear articulation; videos of anchors or presenters are collected as the data source, and with more data the generated lip movement data becomes more accurate;
The image recognition algorithm obtains facial feature point data, and a Dlib face recognition open source library is used for detecting a face area, and feature point models are used for detecting 68 face feature points in the face area.
Randomly generate bs data and record video; the image recognition algorithm extracts facial feature point data; the mapping relationship between the feature point data and the bs data is then established;
Randomly generating bs data: random numbers in the range 0 to 1 are used as the weights of the expression bases in Unity, and a weight is generated for each of the 52 expression bases; the mapping relationship between phonemes and bs data is then constructed. Referring to FIG. 2, the algorithm can generate digital human lip movement bs data in real time according to the input audio;
3. Digital human personification:
(1) Writing the digital person action driving code and realizing bvh data driving capability, so that the digital person supports action data transmitted by an action recognition algorithm and actions can be generated for the digital person by a low-cost algorithm;
bvh motion capture data drives the skeleton animation on the principle that all human body actions can be realized by the rotation of the limb joints plus the displacement and rotation of the body; imagining a child posing a doll toy helps to understand this principle;
bvh data conversion: since bvh uses rotation quaternions in the world coordinate system, in order to use them conveniently in Unity the rotation quaternion qW in the world coordinate system must be converted into the rotation quaternion qL relative to the parent node; since qW = qWparent · qL, it follows that qL = qWparent⁻¹ · qW. After conversion, the joint nodes are depth-traversed and assigned according to the joint topology graph in the bvh;
The Unity state machine controls the preset animations:
an animation state machine:
Smooth transition between state machine animation and bvh animation: record the rotation quaternions of all joint nodes at the animation switching time in an array, read the rotation quaternions of all joint nodes in the initial bvh frame and store them in array0, and insert multi-frame data between the two arrays using a quaternion interpolation algorithm (for example, with an animation frame rate of 30, 5 frames are inserted).
(2) Set a shake range A ∈ [x0, x1] × [y0, y1], a shake frequency f and a random direction vector v, so that the eyeball p moves along v within the area A at the fixed frequency f; when p would move outside the area, the movement is withdrawn and the direction v is re-randomized;
(3) Lip movements with expressions: expression configuration is inserted into the FAQ question-and-answer texts, the TTS data is returned to the rendering engine, and when the text with inserted expressions is played, the engine drives the expression-related bs using the prefabricated expression data;
4. Cloud rendering:
(1) WebRTC access;
(2) A front page;
(3) Character matting embedded in the front-end page: a left-right layout of black-and-white (mask) images and original-color images is generated using shader code in Unity, WebGL code is written at the front end, and the left black-and-white image is used to matte the right color image to generate a transparent-background image;
(4) Establishing WebSocket communication between the front end and the cloud rendering server;
(5) Encapsulating the digital human driving interfaces for front-end invocation, including a character image switching interface, an audio lip movement synchronous playing interface and an action driving interface.
With this design, the bs data generated in real time by the lip movement algorithm is parsed frame by frame, and the Unity BlendShape driving interface is then called to drive the change of the expression bases.
Unity is used to randomly generate bs data and output a 2D image; a face recognition algorithm extracts facial feature points from the 2D image to establish the mapping between bs data and facial feature points; ASR extracts phonemes from videos of anchors or hosts, and the face recognition algorithm extracts facial feature points from the same videos to establish the mapping between phonemes and facial feature points; finally, the mapping between phonemes and bs is established.
The motion-capture bvh data is parsed and recombined frame by frame into rotation quaternions, a depth traversal is performed according to the topology graph of the joint nodes, and a rotation is set for each joint node;
User voice is obtained in real time; ASR acts as the ears and converts the voice into text; ChatGPT-4 generates the dialogue text; TTS acts as the mouth and converts the text into voice; the voice is input into the lip movement algorithm, which returns bs data; the bs data and the voice are transmitted together to the Unity rendering program; Unity plays the voice and drives the lip shape simultaneously, the Unity picture and voice are transmitted to the front end in real time using WebRTC technology, and a real-time reply picture is presented to the user.
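A minimal sketch of this interaction loop; every function in it (recognize_speech, chat_reply, synthesize_speech, lips_from_audio, send_to_unity) is a hypothetical stub standing in for the real ASR, language model, TTS, lip movement and rendering components, so the data flow can be read end to end.

```python
# Minimal sketch of the interaction loop; all functions below are placeholder stubs.
def recognize_speech(pcm: bytes) -> str:          # ASR: speech -> text ("ears")
    return "Can you tell me an interesting story?"

def chat_reply(text: str) -> str:                 # large language model reply
    return "Once upon a time there was a small fox and a small rabbit..."

def synthesize_speech(text: str) -> bytes:        # TTS: text -> speech ("mouth")
    return b"\x00" * 640

def lips_from_audio(pcm: bytes) -> list:          # lip movement algorithm -> bs frames
    return [[0.0] * 52]

def send_to_unity(payload: dict) -> None:         # Unity plays audio, drives lips, streams via WebRTC
    print("to Unity:", {k: type(v).__name__ for k, v in payload.items()})

def handle_user_audio(user_pcm: bytes) -> None:
    text = recognize_speech(user_pcm)
    reply = chat_reply(text)
    reply_pcm = synthesize_speech(reply)
    bs_frames = lips_from_audio(reply_pcm)
    send_to_unity({"audio": reply_pcm, "bs": bs_frames})

handle_user_audio(b"")
```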
When the animation switches between the motion-capture bvh animation and the preset Animator animation, the quaternion smooth interpolation method Quaternion.Lerp(q1, q2, t) is used to insert complementary frames at the beginning and end frames of the bvh animation.
Referring to fig. 3 in combination, fig. 3 is a device 110 for implementing a cloud-rendering voice interaction digital person according to an embodiment of the present invention, including:
The construction module 1101 is configured to construct a voice interaction digital person, and perform audio lip movement synchronous configuration and personification configuration for the voice interaction digital person;
The interaction module 1102 is configured to obtain a voice interaction input initiated by a user through the preset front-end page, generate interaction feedback of the voice interaction digital person through the audio lip movement synchronization configuration and the personification configuration based on the voice interaction input, and render the interaction feedback of the voice interaction digital person to the preset front-end page in real time.
It should be noted that the implementation principle of the implementation device 110 of the foregoing cloud-rendering voice interaction digital person may refer to the implementation principle of the implementation method of the foregoing cloud-rendering voice interaction digital person, which is not described herein again. It should be understood that the division of the modules of the above apparatus is merely a division by logical function, and the modules may be fully or partially integrated into a physical entity or physically separated when actually implemented. The modules can be implemented in the form of software invoked by a processing element or in the form of hardware. For example, the implementation device 110 of the cloud-rendering voice interaction digital person may be a separately arranged processing element, may be integrated in a chip of the device, or may be stored in a memory of the device in the form of program code, and the functions of the implementation device 110 of the cloud-rendering voice interaction digital person may be invoked and executed by a processing element of the device. The implementation of the other modules is similar. In addition, all or part of the modules can be integrated together or implemented independently. The processing element described herein may be an integrated circuit having signal processing capability. In implementation, each step of the above method or each of the above modules may be implemented by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the modules above may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs), etc. For another example, when one of the modules above is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
The embodiment of the invention provides a computer device 100, where the computer device 100 includes a processor and a nonvolatile memory storing computer instructions, and when the computer instructions are executed by the processor, the computer device 100 executes the foregoing implementation apparatus 110 of cloud-rendering voice interaction digital people. As shown in fig. 4, fig. 4 is a block diagram of a computer device 100 according to an embodiment of the present invention. The computer device 100 comprises a cloud-rendered voice-interactive digital person implementation means 110, a memory 111, a processor 112 and a communication unit 113.
For data transmission or interaction, the memory 111, the processor 112 and the communication unit 113 are electrically connected to each other directly or indirectly. For example, the elements may be electrically connected to each other via one or more communication buses or signal lines. The implementation means 110 of the cloud-rendered voice interactive digital person comprises at least one software functional module which may be stored in the memory 111 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the computer device 100. The processor 112 is configured to execute the cloud-rendering voice-interactive digital person implementation device 110 stored in the memory 111, for example, a software function module, a computer program, and the like included in the cloud-rendering voice-interactive digital person implementation device 110.
An embodiment of the present invention provides a readable storage medium, where the readable storage medium includes a computer program, and when the computer program runs, the computer program controls a computer device where the readable storage medium is located to execute the foregoing implementation apparatus 110 of a cloud-rendering voice interaction digital person.
The foregoing description, for purpose of explanation, has been presented with reference to particular embodiments. The illustrative discussions above are not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, to thereby enable others skilled in the art to best utilize the disclosure and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (9)

1. A method for implementing a cloud-rendered voice-interactive digital human, comprising:
constructing a voice-interactive digital human, and performing an audio lip-sync configuration and an anthropomorphic configuration for the voice-interactive digital human;
rendering the configured voice-interactive digital human to a preset front-end page through real-time cloud rendering;
obtaining a voice interaction input initiated by a user through the preset front-end page;
generating, based on the voice interaction input, interaction feedback of the voice-interactive digital human through the audio lip-sync configuration and the anthropomorphic configuration;
rendering the interaction feedback of the voice-interactive digital human to the preset front-end page through real-time cloud rendering;
wherein performing the audio lip-sync configuration for the voice-interactive digital human comprises:
obtaining sample audio-video data;
detecting a face region from the sample audio-video data, and identifying a plurality of facial feature points from the face region;
randomly generating a plurality of expression bases, converting the plurality of expression bases into 2D images to determine the one-to-one corresponding facial feature points, and training a first mapping model for characterizing a mapping relationship between the plurality of expression bases and the plurality of facial feature points;
determining a facial feature point sequence based on the sample audio-video data, and determining, using the first mapping model, a predicted expression base corresponding to the facial feature point sequence; and
obtaining an audio sequence of the sample audio-video data, and training, according to the audio sequence and the predicted expression base, a second mapping model for characterizing a mapping relationship between audio data and the plurality of expression bases, thereby completing the audio lip-sync configuration.
2. The method according to claim 1, wherein constructing the voice-interactive digital human comprises:
obtaining a basic character model, and performing mesh sculpting on the basic character model to obtain model mesh faces of the basic character model;
performing texture mapping on the model mesh faces with a UV editor to complete color texturing of the basic character model;
performing skeleton creation and skinning on the basic character model to obtain a plurality of bone nodes and a bone weight of each bone node, and determining, based on the bone weights, a pose matrix for controlling motion of the plurality of bone nodes;
obtaining a plurality of preset expression bases, each expression base corresponding to an expression vertex group, each expression vertex group comprising a plurality of expression vertices, and each expression vertex comprising a default pose and an expression-maximized pose, the default pose and the expression-maximized pose being used to calculate a pose position of the corresponding expression vertex according to an expression weight; and
constructing the voice-interactive digital human based on the model mesh faces, the color texturing, the skeleton creation and skinning, and the plurality of expression bases.
3. The method according to claim 1, wherein performing the anthropomorphic configuration for the voice-interactive digital human comprises:
obtaining a joint topology graph of the voice-interactive digital human, the joint topology graph comprising a plurality of root nodes and a plurality of child nodes under each root node;
converting an initial rotation quaternion in the world coordinate system into a target rotation quaternion relative to a parent node, and assigning the values through a depth-first traversal of the root nodes and the child nodes;
controlling preset animations with a state machine, obtaining the target rotation quaternions of the root nodes and the child nodes at an animation switching time, and achieving a smooth transition between the state-machine animation and a human body feature animation based on smooth quaternion interpolation;
triggering a preset blink animation at random times, and setting an eyeball jitter range, jitter frequency and jitter direction; and
configuring a preset expression base for a preset question-and-answer text, so that the preset expression base is triggered when the preset question-and-answer text is triggered.
4. The method according to claim 1, wherein rendering the configured voice-interactive digital human to the preset front-end page through real-time cloud rendering comprises:
performing matting (cutout) processing on the voice-interactive digital human to obtain a matted voice-interactive digital human;
establishing a WebSocket connection with the preset front-end page, and encapsulating a character-switching interface, an audio lip-sync playback interface and an action-driving interface; and
displaying the matted voice-interactive digital human in a display area provided by the preset front-end page.
5. The method according to claim 1, wherein generating, based on the voice interaction input, the interaction feedback of the voice-interactive digital human through the audio lip-sync configuration and the anthropomorphic configuration comprises:
receiving the voice interaction input, and performing speech-to-text conversion on the voice interaction input to obtain input text data;
inputting the input text data into a pre-trained conversational large language model to obtain feedback text data;
converting the feedback text data into audio to obtain feedback audio data; and
processing the feedback audio data through the audio lip-sync configuration and the anthropomorphic configuration to obtain the interaction feedback of the voice-interactive digital human.
6. The method according to claim 5, wherein rendering the interaction feedback of the voice-interactive digital human to the preset front-end page in real time through cloud rendering comprises:
rendering the interaction feedback of the voice-interactive digital human to obtain a rendering result; and
displaying, by the voice-interactive digital human, the rendering result on the preset front-end page through the WebSocket connection.
7. A device for implementing a cloud-rendered voice-interactive digital human, comprising:
a construction module, configured to construct a voice-interactive digital human, perform an audio lip-sync configuration and an anthropomorphic configuration for the voice-interactive digital human, and render the configured voice-interactive digital human to a preset front-end page through real-time cloud rendering; and
an interaction module, configured to obtain a voice interaction input initiated by a user through the preset front-end page, generate, based on the voice interaction input, interaction feedback of the voice-interactive digital human through the audio lip-sync configuration and the anthropomorphic configuration, and render the interaction feedback of the voice-interactive digital human to the preset front-end page in real time through cloud rendering;
wherein the construction module is specifically configured to: obtain sample audio-video data; detect a face region from the sample audio-video data, and identify a plurality of facial feature points from the face region; randomly generate a plurality of expression bases, convert the plurality of expression bases into 2D images to determine the one-to-one corresponding facial feature points, and train a first mapping model for characterizing a mapping relationship between the plurality of expression bases and the plurality of facial feature points; determine a facial feature point sequence based on the sample audio-video data, and determine, using the first mapping model, a predicted expression base corresponding to the facial feature point sequence; and obtain an audio sequence of the sample audio-video data, and train, according to the audio sequence and the predicted expression base, a second mapping model for characterizing a mapping relationship between audio data and the plurality of expression bases, thereby completing the audio lip-sync configuration.
8. A computer device, comprising a processor and a non-volatile memory storing computer instructions, wherein when the computer instructions are executed by the processor, the computer device performs the method according to any one of claims 1-6.
9. A readable storage medium, comprising a computer program, wherein when the computer program runs, the computer device on which the readable storage medium resides is controlled to perform the method according to any one of claims 1-6.
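The two-stage mapping in claim 1 can be sketched in Python with PyTorch. This is only a minimal illustration under assumptions: the landmark count, expression-base count, audio feature size and both network shapes are hypothetical placeholders rather than values taken from the patent.

```python
import torch
import torch.nn as nn

NUM_LANDMARKS = 68      # hypothetical number of facial feature points
NUM_EXPR_BASES = 52     # hypothetical number of expression bases
NUM_AUDIO_FEATS = 80    # hypothetical per-frame audio feature size (e.g. mel bins)

# First mapping model: 2D facial feature points -> expression-base weights.
landmark_to_expr = nn.Sequential(
    nn.Linear(NUM_LANDMARKS * 2, 256), nn.ReLU(),
    nn.Linear(256, NUM_EXPR_BASES), nn.Sigmoid(),
)

# Second mapping model: per-frame audio features -> expression-base weights.
audio_to_expr = nn.Sequential(
    nn.Linear(NUM_AUDIO_FEATS, 256), nn.ReLU(),
    nn.Linear(256, NUM_EXPR_BASES), nn.Sigmoid(),
)

def train_first_model(random_expr_weights, rendered_landmarks, epochs=100):
    """Stage 1: randomly generated expression bases are rendered to 2D images,
    their feature points extracted, and the landmark->expression mapping is fit."""
    opt = torch.optim.Adam(landmark_to_expr.parameters(), lr=1e-3)
    for _ in range(epochs):
        pred = landmark_to_expr(rendered_landmarks.flatten(1))
        loss = nn.functional.mse_loss(pred, random_expr_weights)
        opt.zero_grad(); loss.backward(); opt.step()

def train_second_model(audio_frames, video_landmarks, epochs=100):
    """Stage 2: the first model labels each video frame with a predicted
    expression base; the audio->expression mapping is then fit to those labels."""
    with torch.no_grad():
        target_expr = landmark_to_expr(video_landmarks.flatten(1))
    opt = torch.optim.Adam(audio_to_expr.parameters(), lr=1e-3)
    for _ in range(epochs):
        pred = audio_to_expr(audio_frames)
        loss = nn.functional.mse_loss(pred, target_expr)
        opt.zero_grad(); loss.backward(); opt.step()
```

The point of the two stages is that the second model is fit against pseudo-labels produced by the first, so at run time audio frames alone can drive the same expression bases.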
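The expression vertices in claim 2 carry a default pose and an expression-maximized pose, and the expression weight selects a position between them. A minimal NumPy sketch of that interpolation follows; the vertex arrays and the 0.4 weight in the usage lines are hypothetical.

```python
import numpy as np

def blend_expression(default_pose, max_pose, weight):
    """Position of one expression vertex group for a single expression base.

    default_pose, max_pose: (V, 3) float arrays of vertex positions;
    weight: scalar in [0, 1], 0 = default pose, 1 = expression-maximized pose."""
    return default_pose + weight * (max_pose - default_pose)

def apply_expression_bases(default_pose, max_poses, weights):
    """Accumulate the offsets of several expression bases on the same vertices."""
    result = default_pose.astype(float)
    for max_pose, w in zip(max_poses, weights):
        result += w * (max_pose - default_pose)
    return result

# Hypothetical usage: two vertices, one "smile" base at 40 % intensity.
neutral = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
smile_max = np.array([[0.0, 0.2, 0.0], [1.0, 0.3, 0.1]])
print(blend_expression(neutral, smile_max, 0.4))
```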
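For the anthropomorphic configuration in claim 3, the conversion of world-space rotation quaternions into parent-relative ones, the depth-first assignment over the joint topology, and the smooth quaternion interpolation applied at the animation switching time can be sketched as follows. The dictionary-based node layout is a hypothetical stand-in for whatever scene-graph structure the rendering engine actually exposes.

```python
import numpy as np

def quat_conjugate(q):
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_multiply(a, b):
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def world_to_local(q_child_world, q_parent_world):
    """Target rotation relative to the parent: q_local = conj(q_parent) * q_child
    (valid for unit quaternions, where the conjugate equals the inverse)."""
    return quat_multiply(quat_conjugate(q_parent_world), q_child_world)

def assign_local_rotations(node, parent_world_quat):
    """Depth-first traversal of the joint topology, assigning parent-relative
    rotations to every node; pass the identity quaternion for a root node."""
    node["local"] = world_to_local(node["world"], parent_world_quat)
    for child in node.get("children", []):
        assign_local_rotations(child, node["world"])

def slerp(q0, q1, t):
    """Smooth quaternion interpolation used at the animation switching time."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:            # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: linear blend is numerically safer
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```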
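Claims 4 and 6 rely on a WebSocket connection between the cloud renderer and the preset front-end page. The sketch below uses the third-party Python `websockets` package (version 10 or later is assumed for the single-argument handler) and a hypothetical JSON message schema standing in for the encapsulated character-switching, audio lip-sync playback and action-driving interfaces; the patent does not specify the wire format.

```python
import asyncio
import json
import websockets  # third-party package: pip install websockets

async def handle_front_end(websocket):
    """One connection per front-end page; each message selects an encapsulated interface."""
    async for raw in websocket:
        msg = json.loads(raw)
        if msg["type"] == "switch_character":        # character-switching interface
            await websocket.send(json.dumps(
                {"type": "character_switched", "character": msg["character"]}))
        elif msg["type"] == "play_audio_lipsync":    # audio lip-sync playback interface
            # A real system would stream audio chunks together with the
            # expression-base weights predicted by the second mapping model.
            await websocket.send(json.dumps({"type": "lipsync_started"}))
        elif msg["type"] == "drive_action":          # action-driving interface
            await websocket.send(json.dumps(
                {"type": "action_started", "action": msg["action"]}))

async def main():
    async with websockets.serve(handle_front_end, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until the process is stopped

if __name__ == "__main__":
    asyncio.run(main())
```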
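The feedback pipeline of claim 5 chains speech-to-text, a conversational large language model, text-to-speech and the audio lip-sync mapping. In the sketch below all four services are hypothetical stubs, since the patent names no concrete engines; only the ordering is taken from the claim.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InteractionFeedback:
    text: str
    audio: bytes
    expression_weights: List[list] = field(default_factory=list)  # per-frame lip-sync weights

# Hypothetical service stubs; concrete ASR, LLM and TTS engines are not specified.
def speech_to_text(audio_in: bytes) -> str: raise NotImplementedError
def query_dialogue_llm(prompt: str) -> str: raise NotImplementedError
def text_to_speech(text: str) -> bytes: raise NotImplementedError
def audio_to_expression_weights(audio: bytes) -> List[list]: raise NotImplementedError

def generate_feedback(voice_input: bytes) -> InteractionFeedback:
    input_text = speech_to_text(voice_input)              # voice input -> input text data
    reply_text = query_dialogue_llm(input_text)           # input text -> feedback text data
    reply_audio = text_to_speech(reply_text)              # feedback text -> feedback audio data
    weights = audio_to_expression_weights(reply_audio)    # audio -> expression-base weights
    return InteractionFeedback(reply_text, reply_audio, weights)
```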
CN202411501244.8A | Priority date: 2024-10-25 | Filing date: 2024-10-25 | A method, device, computer equipment and readable storage medium for implementing cloud-rendered voice interactive digital human | Status: Active | Granted publication: CN119579737B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411501244.8A | 2024-10-25 | 2024-10-25 | A method, device, computer equipment and readable storage medium for implementing cloud-rendered voice interactive digital human

Publications (2)

Publication Number | Publication Date
CN119579737A (en) | 2025-03-07
CN119579737B (en) | 2025-09-26

Family

ID=94809779

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202411501244.8A | A method, device, computer equipment and readable storage medium for implementing cloud-rendered voice interactive digital human | 2024-10-25 | 2024-10-25 (Active; granted as CN119579737B (en))

Country Status (1)

Country | Link
CN (1) | CN119579737B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111372113A (en)* | 2020-03-05 | 2020-07-03 | 成都威爱新经济技术研究院有限公司 | User cross-platform communication method based on digital human expression, mouth shape and voice synchronization
CN113538641A (en)* | 2021-07-14 | 2021-10-22 | 北京沃东天骏信息技术有限公司 | Animation generation method and device, storage medium, electronic device
CN118691763A (en)* | 2024-08-23 | 2024-09-24 | 江西财经大学 | A method and system for generating interactive responses based on AI digital human

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US10019415B1 (en)* | 2015-08-28 | 2018-07-10 | Animoto Inc. | System and method for consistent cross-platform text layout
CN117422798A (en)* | 2022-07-11 | 2024-01-19 | 武汉联影医疗科技有限公司 | Virtual human interaction method, system and storage medium
CN116527839A (en)* | 2023-05-12 | 2023-08-01 | 中国电信股份有限公司北京研究院 | Digital conference interaction method, equipment, system and related equipment
CN117610538A (en)* | 2023-08-15 | 2024-02-27 | 北京数势云创科技有限公司 | Method, device and storage medium for automatically generating marketing strategy canvas
CN120560515A (en)* | 2024-03-29 | 2025-08-29 | 深圳幻影未来信息科技有限公司 | Data processing method for digital human interactive environment

Also Published As

Publication number | Publication date
CN119579737A (en) | 2025-03-07

Similar Documents

Publication | Title
US12094045B2 (en) | Generating a background that allows a first avatar to take part in an activity with a second avatar
CN107154069B (en) | Data processing method and system based on virtual roles
CN113228163B (en) | Real-time text and audio based face rendering
US7663628B2 (en) | Apparatus and method for efficient animation of believable speaking 3D characters in real time
KR101306221B1 (en) | Method and apparatus for providing moving picture using 3d user avatar
CN113436602B (en) | Virtual image voice interaction method, device, projection equipment and computer medium
CN111724457A (en) | Realistic virtual human multi-modal interaction implementation method based on UE4
CN102470273A (en) | Visual representation expression based on player expression
CN116206607B (en) | Method and device for generating realistic virtual person based on voice driving
CN117370605A (en) | Virtual digital person driving method, device, equipment and medium
CN109343695A (en) | Exchange method and system based on visual human's behavioral standard
Zhang et al. | Simuman: A simultaneous real-time method for representing motions and emotions of virtual human in metaverse
CN117523088A (en) | Personalized three-dimensional digital human holographic interaction forming system and method
CN117933318A (en) | Method for constructing teaching digital person
CN116245986A (en) | Virtual sign language digital person driving method and device
CN119274534B (en) | Text-driven lip-sync digital human generation method, device, equipment and medium
Čereković et al. | Multimodal behavior realization for embodied conversational agents
CN119579737B (en) | A method, device, computer equipment and readable storage medium for implementing cloud-rendered voice interactive digital human
CN118250529A (en) | A voice-driven 2D digital human video generation method and readable storage medium
CN118250523A (en) | Digital human video generation method and device, storage medium and electronic equipment
CN118691717B (en) | Video generation method, device, equipment and storage medium
Petajan | MPEG-4 face and body animation coding applied to HCI
CN119781606A (en) | Digital human real-time dialogue method, system, terminal and medium based on image cloning
Huang et al. | An agent based multicultural user interface in a customer service application
CN116884066A (en) | Lip synthesis technology-based 2D real person digital avatar generation method

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
