Disclosure of Invention
To address the above technical problems, the invention provides a multi-modal interactive digital exhibition hall system that offers a comprehensive, immersive visiting experience and realizes real-time content updating together with an effective user feedback mechanism.
An embodiment of the invention provides a multi-modal interactive digital exhibition hall system which comprises a data acquisition module, a data processing module, an instruction recognition module and an execution module, wherein the data acquisition module is used for acquiring multi-modal data of a user operating on an exhibit in a virtual space through a multi-modal sensing interface of an MR device; the data processing module is used for preprocessing the acquired multi-modal data to standardize the data format; the instruction recognition module is used for performing instruction recognition on the preprocessed multi-modal data to obtain corresponding operation instructions; and the execution module is used for executing corresponding operations according to the obtained operation instructions to update the state and behavior of objects in the virtual scene.
Optionally, the data acquisition module comprises a gesture acquisition module, a voice acquisition module and a handle interaction module, wherein the gesture acquisition module comprises a sensor or a camera arranged in the MR device and is used for capturing hand images and joint point position data of a user in real time, the voice acquisition module comprises a microphone and is used for acquiring voice signals of the user, and the handle interaction module is used for acquiring operation actions and pose data of the user in real time.
Optionally, the data processing module comprises a first processing module, a second processing module and a third processing module, wherein the first processing module is used for preprocessing the hand images and joint point position data acquired by the gesture acquisition module and converting them into gesture data, the second processing module is used for converting the acquired voice signal into voice text data, and the third processing module is used for preprocessing the acquired operation actions and pose data and converting them into operation data.
Optionally, the instruction recognition module comprises a gesture recognition module used for matching the gesture data against preset gesture templates through a deep learning algorithm to determine a specific gesture instruction, a voice recognition module used for recognizing the voice text data through a voice recognition model or algorithm to determine a specific voice instruction, and a handle recognition module used for converting the operation data processed by the third processing module into predefined operation instructions through mapping.
Optionally, the system further comprises a model disassembly and interaction module for disassembling, reorganizing and observing preset exhibits in the virtual exhibition hall.
Optionally, the system further comprises an exhibit interaction module for interacting with a virtual exhibit at close range and zooming in, zooming out or rotating the exhibit.
Optionally, the first processing module is further used for extracting key point data of the hand from the collected hand image, converting the key point data into a skeleton model of the hand, extracting main features of the hand according to key features of the skeleton model, comparing the extracted main features with samples in a predefined gesture library, classifying the gesture and identifying the specific gesture made.
Optionally, the key features include joint angles, finger lengths and relative position features, and gesture classification is performed with a pattern matching algorithm or a deep learning model.
Optionally, the gesture recognition module is further configured to extract key point data of a hand from the hand image, convert the extracted key point data into feature vectors that represent the spatial structure and dynamic information of gestures, train a learning model using labeled gesture data, and recognize specific gestures using the trained model.
Optionally, the second processing module is further configured to label the voice text data, annotating the intent and entities of each sentence, and to train the language processing model using the labeled data set, obtaining a trained voice recognition model for recognizing the voice text data.
In the technical solution provided by the embodiment of the invention, the data acquisition module acquires the multi-modal data, the data processing module preprocesses the multi-modal data, instruction recognition is performed on the preprocessed data to obtain the corresponding operation instruction, and the state and behavior of objects in the virtual scene are updated according to the operation instruction.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
In the description of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the term "connected" should be interpreted broadly; for example, two elements may be directly connected, indirectly connected through an intermediate medium, or in communication with each other. The specific meaning of the above term in the present invention can be understood by those of ordinary skill in the art on a case-by-case basis. In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not conflict.
Referring to fig. 1, the present invention provides a multi-modal interactive digital exhibition hall system, which includes a data acquisition module 100, a data processing module 200, an instruction recognition module 300 and an execution module 400, wherein the data acquisition module 100 is configured to acquire multi-modal data of a user operating on an exhibit in a virtual space through a multi-modal sensing interface of an MR device, the data processing module 200 is configured to preprocess the acquired multi-modal data to standardize the data format, the instruction recognition module 300 is configured to perform instruction recognition on the preprocessed multi-modal data to obtain a corresponding operation instruction, and the execution module 400 is configured to execute the corresponding operation according to the obtained operation instruction to update object state and behavior in the virtual scene.
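As a concrete illustration of this four-module pipeline, the following minimal sketch wires the modules together for a single input sample; all class names, method signatures and the placeholder recognition rule are illustrative assumptions rather than the actual system implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class MultiModalSample:
    """One frame of raw multi-modal input captured by the MR device."""
    gesture: Any = None   # hand image / joint point data
    voice: Any = None     # raw audio buffer
    handle: Any = None    # controller actions and pose


class DataProcessingModule:
    def preprocess(self, sample: MultiModalSample) -> Dict[str, Any]:
        # Standardize each modality into a common dictionary format.
        return {"gesture": sample.gesture, "voice": sample.voice, "handle": sample.handle}


class InstructionRecognitionModule:
    def recognize(self, standardized: Dict[str, Any]) -> str:
        # Placeholder rule: pick an instruction based on which modality is present.
        if standardized.get("voice") is not None:
            return "voice_command"
        if standardized.get("gesture") is not None:
            return "gesture_command"
        return "handle_command"


class ExecutionModule:
    def execute(self, instruction: str, scene: Dict[str, Any]) -> None:
        # Update object state/behaviour in the virtual scene according to the instruction.
        scene.setdefault("executed", []).append(instruction)


# Wiring the four modules together for one input sample.
scene_state: Dict[str, Any] = {}
sample = MultiModalSample(gesture="pinch_frame")
instruction = InstructionRecognitionModule().recognize(DataProcessingModule().preprocess(sample))
ExecutionModule().execute(instruction, scene_state)
print(scene_state)  # {'executed': ['gesture_command']}
```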
The multi-modal interactive digital exhibition hall system is a highly interactive virtual exhibition hall in which a user can visit, operate and handle various exhibits. The virtual exhibition hall environment simulates the elements of a real exhibition hall, making the visiting process more immersive. Through multi-modal interaction and a highly immersive virtual exhibition environment, the invention significantly improves the user's viewing interest and focus, and the strongly interactive viewing experience stimulates users' enthusiasm for active visiting. The virtual exhibition hall and its various interaction modes make the viewed content more vivid and concrete, and users can understand and appreciate the exhibits more intuitively through hands-on operation and interaction.
In one embodiment of the present invention, the system further includes a model disassembly and interaction module for disassembling, reorganizing, and observing preset exhibits in the virtual exhibition hall. The module enables a user to understand complex concepts and structures more deeply through practical operations.
In one embodiment of the present invention, the system further comprises an exhibit interaction module for interacting with a virtual exhibit at close range to zoom in, zoom out, or rotate the exhibit. The module makes interaction with exhibits more flexible and helps the user focus on details and understand the exhibit in depth. The present invention provides physical collision and interaction simulation functions that enable users to observe the realistic behavior of an exhibit, such as collision reactions and force transmission, in the virtual exhibition hall. This immersive experience can strengthen the user's memory of the exhibit.
Through exhibit disassembly, close-range exhibit interaction and physical collision simulation, the invention makes the visiting process more vivid and interesting, and users can deeply understand and remember the exhibits through interaction and practice.
The invention remedies the defects of the prior art through modules such as multi-modal interaction, the virtual exhibition hall environment, model disassembly and physical collision, providing an efficient, interactive and personalized digital exhibition hall system. Users can enjoy an immersive visiting experience, and the overall exhibition effect is significantly improved.
In the virtual exhibition hall creation phase, a user logs into the system directly through a Mixed Reality (MR) head-mounted device to create a virtual exhibition hall and enters the exhibition hall code into the virtual space through the head-mounted device. After the exhibition hall is created, the system generates an exhibition hall code, and the user can invite others to join the exhibition hall through this code. In the exhibition phase, users communicate and interact in the virtual space through their MR devices. The initial setup comprises basic elements such as the exhibition area, the exhibit display area and the interactive screen; visitors can act and speak through the MR device, while exhibit models, images, text and videos are uploaded to the cloud from a computer terminal. In the virtual space, visitors can manipulate the exhibits, realizing real-time interaction.
In the virtual exhibition hall, all interactions with exhibits can be close-range interactions. When a user shares an exhibit in the virtual space, other visitors can synchronously view and operate it, realizing real-time collaboration. For example, when a user manipulates an exhibit model in the MR device, other visitors can also see and interact with the exhibit model synchronously.
In one embodiment of the present invention, the data processing module is configured to perform noise reduction processing, smoothing processing, and normalization processing on the acquired multi-modal data.
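By way of illustration only, the following sketch shows one possible form of the noise-reduction, smoothing and normalization steps using a simple clipping filter, moving average and min-max scaling; the actual filters and parameters used by the data processing module are not specified in this disclosure, so these choices are assumptions.

```python
import numpy as np


def preprocess_signal(raw: np.ndarray, window: int = 5) -> np.ndarray:
    # Noise reduction: clip values outside the 1st-99th percentile range.
    clipped = np.clip(raw, np.percentile(raw, 1), np.percentile(raw, 99))
    # Smoothing: simple moving average.
    kernel = np.ones(window) / window
    smoothed = np.convolve(clipped, kernel, mode="same")
    # Normalization: min-max scaling to [0, 1].
    span = smoothed.max() - smoothed.min()
    return (smoothed - smoothed.min()) / span if span > 0 else np.zeros_like(smoothed)


# Example: a noisy stream of joint-angle readings from the gesture acquisition module.
noisy_joint_angles = np.random.normal(loc=45.0, scale=5.0, size=100)
print(preprocess_signal(noisy_joint_angles)[:5])
```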
In one embodiment of the invention, the data acquisition module comprises a gesture acquisition module, which comprises a sensor or a camera built into the MR device and is used for capturing hand images and joint point position data of a user in real time; the data processing module comprises a first processing module, which is used for preprocessing the hand images and joint point position data and converting them into gesture data; and the instruction recognition module comprises a gesture recognition module, which is used for matching the gesture data against preset gesture templates through a deep learning algorithm to determine a specific gesture instruction. The execution module is used for transmitting the recognized gesture instruction to the MR device through the API, and the MR device executes the corresponding operation, such as selecting or moving a virtual object, according to the gesture instruction.
In one embodiment of the present invention, the data acquisition module includes a voice acquisition module comprising a microphone for acquiring the user's voice signal; the microphone may be the microphone of the MR device. The data processing module includes a second processing module for converting the acquired voice signal into voice text data, and the instruction recognition module includes a voice recognition module for recognizing the voice text data through a voice recognition model or algorithm to determine a specific voice instruction. The execution module is used for transmitting the recognized voice instruction to the MR device through the API, and the MR device performs the corresponding operation, such as starting a function or controlling a virtual object, according to the voice instruction.
In one embodiment of the invention, the data acquisition module comprises a handle interaction module for acquiring the user's operation actions and pose data in real time, the data processing module comprises a third processing module for preprocessing the acquired operation actions and pose data and converting them into operation data, and the instruction recognition module comprises a handle recognition module for converting the operation data processed by the third processing module into predefined operation instructions, such as instructions for moving or rotating a virtual object, through mapping. The execution module is used for transmitting the recognized operation instruction to the MR device through the API, and the MR device executes the corresponding operation, such as selecting or moving a virtual object, according to the operation instruction.
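The following sketch illustrates how recognized instructions from any of the three modalities might be mapped onto predefined virtual-object operations; the instruction names, the operation table and the apply_to_object helper are hypothetical, not part of an actual MR SDK.

```python
from typing import Callable, Dict


def select_object(obj: Dict) -> None:
    obj["selected"] = True


def move_object(obj: Dict) -> None:
    obj["position"] = [p + 0.1 for p in obj["position"]]


def rotate_object(obj: Dict) -> None:
    obj["rotation_deg"] = (obj["rotation_deg"] + 15) % 360


# Instructions produced by the gesture, voice and handle recognition modules
# all funnel into one instruction-to-operation table.
OPERATIONS: Dict[str, Callable[[Dict], None]] = {
    "select": select_object,
    "move": move_object,
    "rotate": rotate_object,
}


def apply_to_object(instruction: str, virtual_object: Dict) -> Dict:
    # Hypothetical stand-in for passing the instruction to the MR device via its API.
    OPERATIONS[instruction](virtual_object)
    return virtual_object


exhibit = {"position": [0.0, 1.0, 2.0], "rotation_deg": 0, "selected": False}
print(apply_to_object("rotate", exhibit))
```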
In one embodiment of the present invention, the gesture acquisition module includes a depth camera for acquiring three-dimensional depth data of the user's hand and hand images in real time, including each key point of the hand (such as finger joint positions) and its coordinates in three-dimensional space. The first processing module is used for extracting key point data of the hand (such as fingertips and joints) from the three-dimensional depth data and the hand image. The three-dimensional depth data provided by the depth camera can be used to construct a skeleton model of the hand that marks the specific positions and postures of the fingers; the key point data is converted into this skeleton model for further gesture analysis. The coordinates of each key point and the angle information between the fingers are extracted and combined into the overall structure of the hand. The gesture recognition module is further used for extracting the main features of the gesture according to characteristics of the skeleton model such as joint angles, finger lengths and relative positions. The main features may include the degree to which the fingers are spread, their curvature, the rotation angle of the palm, and so on. The extracted main features are compared with samples in a predefined gesture library, the gesture is classified using a pattern matching algorithm (such as dynamic time warping) or a deep learning model (such as a convolutional neural network), and the specific gesture made by the user is identified. The recognized gesture instruction is mapped to a specific operation command (e.g., "grab", "zoom", "rotate", etc.). These instructions are passed to the virtual exhibition system or application through the device's API. The object state in the virtual scene is updated according to the operation command, realizing interaction between the user and the virtual exhibit.
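A minimal sketch of the feature-extraction and template-matching step follows: joint angles and fingertip-to-palm distances are computed from three-dimensional key points and compared against a predefined gesture library with a nearest-neighbour match. The key-point layout, the feature set and the template values are illustrative assumptions.

```python
import numpy as np


def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle in degrees at joint b formed by the points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def hand_features(keypoints: np.ndarray) -> np.ndarray:
    # keypoints: (5, 3) array -- palm centre followed by four finger key points
    # (a simplified skeleton model used only for this illustration).
    palm = keypoints[0]
    angles = [joint_angle(keypoints[1], keypoints[2], keypoints[3]),
              joint_angle(keypoints[2], keypoints[3], keypoints[4])]
    distances = [float(np.linalg.norm(kp - palm)) for kp in keypoints[1:]]
    return np.array(angles + distances)  # 6-dimensional feature vector


# Predefined gesture library: one template feature vector per gesture (toy values).
GESTURE_LIBRARY = {
    "grab": np.array([40.0, 35.0, 0.03, 0.04, 0.04, 0.03]),
    "open_palm": np.array([170.0, 165.0, 0.09, 0.10, 0.10, 0.09]),
}


def classify_gesture(keypoints: np.ndarray) -> str:
    feats = hand_features(keypoints)
    # Nearest-neighbour match against the gesture templates.
    return min(GESTURE_LIBRARY, key=lambda g: float(np.linalg.norm(feats - GESTURE_LIBRARY[g])))


sample_keypoints = np.random.rand(5, 3) * 0.1  # stand-in for depth-camera output
print(classify_gesture(sample_keypoints))
```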
In one embodiment of the present invention, the gesture acquisition module is further configured to acquire multiple gesture data sets, including images or video clips of various gestures. The data should cover a variety of gesture samples, such as "grab", "zoom in", "rotate", and so on, and is labeled with the corresponding gesture category for training. The first processing module preprocesses the acquired gesture images or video data, including image cropping, scaling and normalization to standardize the data format, and may apply data augmentation such as rotation, translation and flipping to increase the robustness of the model. The key point data of the hand is extracted from the image using a deep learning model (such as a convolutional neural network, CNN) or conventional computer vision techniques. These key points include the joint positions of the fingers, the shape of the palm, and so on. The extracted key point data is converted into feature vectors representing the spatial structure and dynamic information of the gestures; the features include the length, angle and relative position of the fingers. The learning model is trained using the labeled gesture data; during training, the model optimizes its parameters by minimizing a loss function (such as cross-entropy loss) to improve recognition accuracy for the various gestures. The trained model is then used to recognize specific gestures from gesture data.
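The training step can be illustrated with the following PyTorch sketch, which trains a small convolutional network with a cross-entropy loss on randomly generated stand-in images; the network architecture, image resolution, gesture classes and training data are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 3  # e.g. "grab", "zoom in", "rotate"


class GestureCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 16 * 16, NUM_CLASSES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))


model = GestureCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # cross-entropy loss as mentioned above

# Stand-in for a labelled batch of 64x64 grayscale gesture images.
images = torch.randn(32, 1, 64, 64)
labels = torch.randint(0, NUM_CLASSES, (32,))

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # minimise the loss to improve accuracy
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.3f}")
```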
During training, the performance of the model is evaluated using a validation data set, monitoring indicators such as accuracy, recall and F1 score. A test data set is used to finally evaluate the generalization ability of the model on unseen data. Hyperparameters of the model (e.g., learning rate, batch size, regularization terms) are adjusted to optimize performance and training stability. Depending on the evaluation results, fine-tuning or retraining of the model may be required to ensure its reliability in practical applications. The trained model is integrated into the practical application to perform real-time gesture recognition: the model receives the user's real-time gesture input and outputs a recognition result through an inference process.
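The evaluation step might look like the following sketch, which computes accuracy, macro-averaged recall and F1 score with scikit-learn on placeholder predictions; the labels and predictions shown are illustrative only.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Placeholder validation labels and model predictions for three gesture classes.
y_true = ["grab", "zoom", "rotate", "grab", "rotate", "zoom"]
y_pred = ["grab", "zoom", "grab", "grab", "rotate", "zoom"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("recall  :", recall_score(y_true, y_pred, average="macro"))
print("F1 score:", f1_score(y_true, y_pred, average="macro"))
```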
In one embodiment of the present invention, the second processing module is further configured to label the voice text data, annotating the intent and entities of each sentence, and to train the language processing model using the labeled data set, obtaining a trained voice recognition model for recognizing the voice text data. The voice recognition module includes BERT, GPT, or a specialized speech recognition model based on the Transformer architecture. The training process covers both a speech recognition model (converting speech to text) and a natural language understanding model (parsing the intent of text instructions). The voice recognition module of the present invention can also recognize the intent in text (e.g., "play music", "adjust volume"). An intent classifier is trained on the labeled data to identify the user's needs. The recognized voice command is then mapped to an actual system operation; for example, a "play music" command may correspond to starting a music player and playing a specified track. The execution result is fed back to the user, for example by confirming the volume adjustment or the music being played, to ensure the user is satisfied with the system response.
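A minimal sketch of the intent-classification and command-mapping step is given below; it substitutes a TF-IDF plus logistic-regression classifier for the Transformer-based model described above, and the training sentences, intent labels and command handlers are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative labelled sentences: each voice-text sample is annotated with an intent.
texts = ["play some music", "please play a song", "turn the volume up",
         "make it louder", "rotate the exhibit", "spin the model around"]
intents = ["play_music", "play_music", "adjust_volume",
           "adjust_volume", "rotate_exhibit", "rotate_exhibit"]

intent_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
intent_clf.fit(texts, intents)

# Map each recognized intent onto a concrete system operation (hypothetical handlers).
COMMANDS = {
    "play_music": lambda: print("starting music player and playing the specified track"),
    "adjust_volume": lambda: print("adjusting volume and confirming the change to the user"),
    "rotate_exhibit": lambda: print("rotating the virtual exhibit"),
}

intent = intent_clf.predict(["could you play music"])[0]
COMMANDS[intent]()  # execute the operation and feed the result back to the user
```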
In one embodiment of the invention, the hand interaction module includes a glove with built-in sensors worn by the user, which captures the three-dimensional position and motion of the hand, including the bending and stretching of the fingers and the movement of the palm. The sensors transmit the hand motion data to the third processing module in a wireless or wired manner. The third processing module processes the motion data captured by the sensors, and specific hand motions and gestures are recognized through the handle recognition module. For example, from the glove's motion sensing, the system can recognize gestures such as "grab", "rotate" and "drag". Based on the identified hand movements, the system applies the corresponding operation to the virtual exhibit; for example, the user selects a virtual object with a "grab" gesture, adjusts the angle of the object with a "rotate" gesture, or moves the object's position with a "drag" gesture.
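The following sketch illustrates one way glove-sensor readings could be turned into "grab", "rotate" and "drag" actions and applied to a virtual exhibit; the sensor fields, thresholds and exhibit representation are assumptions rather than the actual glove protocol.

```python
from typing import Dict


def recognize_hand_action(reading: Dict[str, float]) -> str:
    # reading: per-frame sensor values, e.g. average finger bend (0..1) plus the
    # palm rotation and displacement since the previous frame (assumed fields).
    if reading["finger_bend"] > 0.8:
        return "grab"
    if abs(reading["palm_rotation_deg"]) > 10:
        return "rotate"
    if abs(reading["palm_displacement"]) > 0.05:
        return "drag"
    return "idle"


def apply_action(action: str, exhibit: Dict) -> Dict:
    if action == "grab":
        exhibit["selected"] = True          # select the virtual object
    elif action == "rotate":
        exhibit["rotation_deg"] = (exhibit["rotation_deg"] + 15) % 360
    elif action == "drag":
        exhibit["position"][0] += 0.05      # move the object along one axis
    return exhibit


exhibit = {"selected": False, "rotation_deg": 0, "position": [0.0, 0.0, 0.0]}
frame = {"finger_bend": 0.9, "palm_rotation_deg": 2.0, "palm_displacement": 0.0}
print(apply_action(recognize_hand_action(frame), exhibit))
```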
The invention also realizes real-time interaction and collaboration: in the virtual space, the system achieves real-time collaboration among users through low-latency network communication technology and a distributed computing architecture. The MR device is connected to the cloud server through a Wi-Fi network, ensuring high-bandwidth, low-latency transmission. Operations performed by a user on exhibits in the virtual space (such as rotation, scaling and movement) are captured in real time and transmitted to the cloud server over the network. The cloud server merges and synchronizes the operations of all users and feeds the updated exhibit state back to the other visitors in real time, so that users can view and operate the exhibits synchronously. Through a cooperative locking mechanism, the system ensures that when one user is operating an exhibit, the operations of other users do not conflict, realizing real-time collaboration.
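The cooperative locking mechanism can be illustrated with the following sketch, in which a server-side lock manager grants one user at a time the right to manipulate a given exhibit; the class interface and identifiers are illustrative assumptions.

```python
import threading
from typing import Dict, Optional


class ExhibitLockManager:
    """Grants one user at a time the right to manipulate a given exhibit."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}   # exhibit_id -> user_id
        self._guard = threading.Lock()

    def acquire(self, exhibit_id: str, user_id: str) -> bool:
        with self._guard:
            owner: Optional[str] = self._owners.get(exhibit_id)
            if owner is None or owner == user_id:
                self._owners[exhibit_id] = user_id
                return True
            return False  # another visitor is already manipulating this exhibit

    def release(self, exhibit_id: str, user_id: str) -> None:
        with self._guard:
            if self._owners.get(exhibit_id) == user_id:
                del self._owners[exhibit_id]


locks = ExhibitLockManager()
print(locks.acquire("vase_01", "alice"))  # True: alice may rotate the vase
print(locks.acquire("vase_01", "bob"))    # False: bob must wait for the release
locks.release("vase_01", "alice")
```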
The invention also supports synchronous interaction and communication among multiple users in the virtual exhibition hall. Specifically, the invention uses a multi-user synchronization framework based on WebRTC (real-time communication) that allows multiple users to connect to the virtual exhibition hall simultaneously for real-time audio and video communication and data sharing. To ensure real-time performance, the system transmits only the changed data through a differential data transmission and local update mechanism, reducing network bandwidth usage.
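The differential data transmission idea can be sketched as follows: instead of broadcasting the full exhibit state, only the fields that changed since the last synchronization are sent to other clients. The state layout shown is an assumption.

```python
from typing import Any, Dict


def state_diff(previous: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the key/value pairs that changed between two state snapshots."""
    return {k: v for k, v in current.items() if previous.get(k) != v}


last_synced = {"position": [0, 0, 0], "rotation_deg": 0, "scale": 1.0}
current = {"position": [0, 0, 0], "rotation_deg": 30, "scale": 1.0}

delta = state_diff(last_synced, current)
print(delta)  # {'rotation_deg': 30} -- the only payload sent to the other clients
```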
The invention also realizes dynamic scene management: all interactive scenes in the virtual exhibition hall are managed by the server in real time, and user operations and position changes are synchronized in real time to the MR devices of all participants, keeping the state of the exhibition hall consistent.
The invention can also realize real-time uploading and updating of exhibition content. Specifically, the system adopts a distributed cloud storage service, and a user can upload exhibit resources (such as models, images, text and videos) to the cloud for storage through a computer terminal or the MR device. The system provides a standardized API interface that allows applications to call the exhibit resources stored in the cloud in real time and automatically update the exhibit content; a user can upload files to the cloud and ensure that they can be retrieved and shared at any time during the exhibition through the following technical means. Uploaded files are encrypted and access is controlled through fine-grained permission management; only authorized users can access and operate the files, ensuring the security and confidentiality of the data. The system integrates a globally distributed CDN service, ensuring that users everywhere can access and download exhibition content quickly and reliably. After a file is uploaded, the system immediately synchronizes the update to the MR devices of all users, ensuring the real-time performance and consistency of the content during the exhibition.
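The upload, permission-check and synchronization flow described above might be sketched as follows; the in-memory storage dictionary, permission table and notification hook are hypothetical stand-ins for the distributed cloud storage, fine-grained permission management and CDN/device synchronization services.

```python
import hashlib
from typing import Dict, List

CLOUD_STORE: Dict[str, bytes] = {}                                 # stand-in for cloud object storage
PERMISSIONS: Dict[str, List[str]] = {"hall_42": ["alice", "bob"]}  # authorized users per hall


def notify_mr_devices(hall_id: str, key: str) -> None:
    # Stand-in for pushing the update to the MR devices of all connected users.
    print(f"[sync] hall {hall_id}: new exhibit resource available at {key}")


def upload_exhibit(hall_id: str, user: str, name: str, data: bytes) -> str:
    # Fine-grained permission check: only authorized users may upload.
    if user not in PERMISSIONS.get(hall_id, []):
        raise PermissionError("user is not authorized for this exhibition hall")
    key = f"{hall_id}/{name}/{hashlib.sha256(data).hexdigest()[:8]}"
    CLOUD_STORE[key] = data       # encryption at rest is assumed to happen here
    notify_mr_devices(hall_id, key)
    return key


print(upload_exhibit("hall_42", "alice", "vase_model.glb", b"...binary model data..."))
```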
By combining multi-modal interaction technology with an immersive exhibition experience, users can quickly acquire and understand exhibit content with less external interference, improving the exhibition effect. Users can choose suitable interaction and exhibition modes according to their own habits and needs, providing a personalized exhibition experience. Synchronous interaction and communication through the MR devices promotes collaboration and communication among users and enhances team exhibition and cooperation capability.
In the content preparation stage, a user uploads exhibit resources including models, images, text and video through a computer terminal, and the content can be updated in real time, ensuring the flexibility and immediacy of the exhibition.
In the stage of creating and joining the virtual exhibition hall, a user logs into the system through the MR head-mounted device to create the virtual exhibition hall and enters the exhibition hall code in the virtual space; the generated exhibition hall code can be used to invite others to join. Please refer to fig. 2 for details.
In the exhibition stage, please refer to fig. 3: the virtual exhibition hall is initially set up, showing its basic elements (exhibition area, exhibit display area, interactive screen, etc.); users interact through the MR device, with their actions and voice expression presented in the virtual space; and visitors manipulate the exhibits in the virtual space, demonstrating close-range exhibit interaction such as dragging, rotating and enlarging the exhibits.
In close-range exhibit interaction, an exhibit shared by a user in the virtual space can be viewed and operated synchronously by other visitors, realizing real-time collaboration.
The multi-modal interaction technology integrates gesture recognition, voice control and handle interaction, realizing a natural and convenient interaction experience.
For the immersive exhibition experience, please refer to fig. 4, which shows a user entering the virtual three-dimensional space through an MR device to carry out the exhibition process and perform interactive operations.
Through innovative multi-modal interaction and mixed reality technology, the invention provides a novel digital exhibition hall system that can effectively improve exhibition efficiency, enhance exhibition interest, provide a personalized exhibition experience, support real-time collaboration, and promote immersion and interactivity, bringing a brand-new transformation and development to the exhibition field.
The foregoing embodiments are merely for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that modifications may be made to the technical solution described in the foregoing embodiments or equivalents may be substituted for parts of the technical features thereof, and that such modifications or substitutions do not depart from the spirit and scope of the technical solution of the embodiments of the present invention in essence.