Disclosure of Invention
To address the above technical problems, the invention provides a multi-modal interactive digital exhibition hall system that offers a comprehensive, immersive visiting experience and realizes real-time content updating together with an effective user feedback mechanism.
An embodiment of the invention provides a multi-modal interactive digital exhibition hall system which comprises a data acquisition module, a data processing module, an instruction recognition module and an execution module, wherein the data acquisition module is used for acquiring multi-modal data of a user operating on an exhibit in a virtual space through a multi-modal sensing interface of an MR device; the data processing module is used for preprocessing the acquired multi-modal data to standardize the data format; the instruction recognition module is used for performing instruction recognition on the preprocessed multi-modal data to obtain corresponding operation instructions; and the execution module is used for executing corresponding operations according to the obtained operation instructions to update the state and behavior of objects in the virtual scene.
Optionally, the data acquisition module comprises a gesture acquisition module, a voice acquisition module and a handle interaction module, wherein the gesture acquisition module comprises a sensor or a camera arranged in the MR device and is used for capturing hand images and joint point position data of a user in real time, the voice acquisition module comprises a microphone and is used for acquiring voice signals of the user, and the handle interaction module is used for acquiring operation actions and pose data of the user in real time.
Optionally, the data processing module comprises a first processing module, a second processing module and a third processing module, wherein the first processing module is used for preprocessing the hand images and joint point position data acquired by the gesture acquisition module and converting them into gesture data, the second processing module is used for converting the acquired voice signal into voice text data, and the third processing module is used for preprocessing the acquired operation actions and pose data and converting them into operation data.
Optionally, the instruction recognition module comprises a gesture recognition module used for matching the gesture data against preset gesture templates through a deep learning algorithm to determine a specific gesture instruction, a voice recognition module used for recognizing the voice text data through a voice recognition model or algorithm to determine a specific voice instruction, and a handle recognition module used for converting the operation data processed by the third processing module into predefined operation instructions through mapping.
Optionally, the system further comprises a model disassembly and interaction module for disassembling, reorganizing and observing preset exhibits in the virtual exhibition hall.
Optionally, the system further comprises an exhibit interaction module for interacting with a virtual exhibit at close range and zooming in, zooming out or rotating the exhibit.
Optionally, the first processing module is further used for extracting key point data of the hand from the collected hand image, converting the key point data into a skeleton model of the hand, extracting main features of the hand according to key features of the skeleton model, comparing the extracted main features with samples in a predefined gesture library, classifying the gesture and identifying the specific gesture made.
Optionally, the key features include joint angles, finger lengths and relative position features, and gesture classification is performed with a pattern matching algorithm or a deep learning model.
Optionally, the gesture recognition module is further configured to extract key point data of a hand from the hand image, convert the extracted key point data into feature vectors that represent the spatial structure and dynamic information of gestures, train a learning model using labeled gesture data, and recognize specific gestures using the trained model.
Optionally, the second processing module is further configured to label the voice text data, annotating the intent and entities of each sentence, and to train the language processing model using the labeled data set, obtaining a trained voice recognition model for recognizing the voice text data.
In the technical solution provided by the embodiment of the invention, the data acquisition module acquires the multi-modal data, the data processing module preprocesses the multi-modal data, instruction recognition is performed on the preprocessed data to obtain the corresponding operation instruction, and the state and behavior of objects in the virtual scene are updated according to the operation instruction.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
In the description of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the term "connected" should be interpreted broadly; for example, two elements may be directly connected, indirectly connected through an intermediate medium, or in communication with each other. The specific meaning of the above term in the present invention can be understood by those of ordinary skill in the art on a case-by-case basis. In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not conflict.
Referring to fig. 1, the present invention provides a multi-modal interactive digital exhibition hall system, which includes a data acquisition module 100, a data processing module 200, an instruction recognition module 300 and an execution module 400, wherein the data acquisition module 100 is configured to acquire multi-modal data of a user operating on an exhibit in a virtual space through a multi-modal sensing interface of an MR device, the data processing module 200 is configured to preprocess the acquired multi-modal data to standardize the data format, the instruction recognition module 300 is configured to perform instruction recognition on the preprocessed multi-modal data to obtain a corresponding operation instruction, and the execution module 400 is configured to execute the corresponding operation according to the obtained operation instruction to update object state and behavior in the virtual scene.
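As a concrete illustration of this four-module pipeline, the following minimal sketch wires the modules together for a single input sample; all class names, method signatures and the placeholder recognition rule are illustrative assumptions rather than the actual system implementation.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class MultiModalSample:
    """One frame of raw multi-modal input captured by the MR device."""
    gesture: Any = None   # hand image / joint point data
    voice: Any = None     # raw audio buffer
    handle: Any = None    # controller actions and pose


class DataProcessingModule:
    def preprocess(self, sample: MultiModalSample) -> Dict[str, Any]:
        # Standardize each modality into a common dictionary format.
        return {"gesture": sample.gesture, "voice": sample.voice, "handle": sample.handle}


class InstructionRecognitionModule:
    def recognize(self, standardized: Dict[str, Any]) -> str:
        # Placeholder rule: pick an instruction based on which modality is present.
        if standardized.get("voice") is not None:
            return "voice_command"
        if standardized.get("gesture") is not None:
            return "gesture_command"
        return "handle_command"


class ExecutionModule:
    def execute(self, instruction: str, scene: Dict[str, Any]) -> None:
        # Update object state/behaviour in the virtual scene according to the instruction.
        scene.setdefault("executed", []).append(instruction)


# Wiring the four modules together for one input sample.
scene_state: Dict[str, Any] = {}
sample = MultiModalSample(gesture="pinch_frame")
instruction = InstructionRecognitionModule().recognize(DataProcessingModule().preprocess(sample))
ExecutionModule().execute(instruction, scene_state)
print(scene_state)  # {'executed': ['gesture_command']}
```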
The multi-modal interactive digital exhibition hall system is a highly interactive virtual exhibition hall in which a user can visit, operate and handle various exhibits. The virtual exhibition hall environment simulates the elements of a real exhibition hall, making the visiting process more immersive. Through multi-modal interaction and a highly immersive virtual exhibition environment, the invention significantly improves the user's viewing interest and focus, and the strongly interactive viewing experience stimulates users' enthusiasm for active visiting. The virtual exhibition hall and its various interaction modes make the viewed content more vivid and concrete, and users can understand and appreciate the exhibits more intuitively through hands-on operation and interaction.
In one embodiment of the present invention, the system further includes a model disassembly and interaction module for disassembling, reorganizing, and observing preset exhibits in the virtual exhibition hall. The module enables a user to understand complex concepts and structures more deeply through practical operations.
In one embodiment of the present invention, the system further comprises an exhibit interaction module for interacting with a virtual exhibit at close range to zoom in, zoom out, or rotate the exhibit. The module makes interaction with exhibits more flexible and helps the user focus on details and understand the exhibit in depth. The present invention provides physical collision and interaction simulation functions that enable users to observe the realistic behavior of an exhibit, such as collision reactions and force transmission, in the virtual exhibition hall. This immersive experience can strengthen the user's memory of the exhibit.
Through exhibit disassembly, close-range exhibit interaction and physical collision simulation, the invention makes the visiting process more vivid and interesting, and users can deeply understand and remember the exhibits through interaction and practice.
The invention remedies the defects of the prior art through modules such as multi-modal interaction, the virtual exhibition hall environment, model disassembly and physical collision, providing an efficient, interactive and personalized digital exhibition hall system. Users can enjoy an immersive visiting experience, and the overall exhibition effect is significantly improved.
In the virtual exhibition hall creation phase, a user logs into the system directly through a Mixed Reality (MR) head-mounted device to create a virtual exhibition hall and enters the exhibition hall code into the virtual space through the head-mounted device. After the exhibition hall is created, the system generates an exhibition hall code, and the user can invite others to join the exhibition hall through this code. In the exhibition phase, users communicate and interact in the virtual space through their MR devices. The initial setup comprises basic elements such as the exhibition area, the exhibit display area and the interactive screen; visitors can act and speak through the MR device, while exhibit models, images, text and videos are uploaded to the cloud from a computer terminal. In the virtual space, visitors can manipulate the exhibits, realizing real-time interaction.
In the virtual exhibition hall, all interactions with exhibits can be close-range interactions. When a user shares an exhibit in the virtual space, other visitors can synchronously view and operate it, realizing real-time collaboration. For example, when a user manipulates an exhibit model in the MR device, other visitors can also see and interact with the exhibit model synchronously.
In one embodiment of the present invention, the data processing module is configured to perform noise reduction processing, smoothing processing, and normalization processing on the acquired multi-modal data.
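By way of illustration only, the following sketch shows one possible form of the noise-reduction, smoothing and normalization steps using a simple clipping filter, moving average and min-max scaling; the actual filters and parameters used by the data processing module are not specified in this disclosure, so these choices are assumptions.

```python
import numpy as np


def preprocess_signal(raw: np.ndarray, window: int = 5) -> np.ndarray:
    # Noise reduction: clip values outside the 1st-99th percentile range.
    clipped = np.clip(raw, np.percentile(raw, 1), np.percentile(raw, 99))
    # Smoothing: simple moving average.
    kernel = np.ones(window) / window
    smoothed = np.convolve(clipped, kernel, mode="same")
    # Normalization: min-max scaling to [0, 1].
    span = smoothed.max() - smoothed.min()
    return (smoothed - smoothed.min()) / span if span > 0 else np.zeros_like(smoothed)


# Example: a noisy stream of joint-angle readings from the gesture acquisition module.
noisy_joint_angles = np.random.normal(loc=45.0, scale=5.0, size=100)
print(preprocess_signal(noisy_joint_angles)[:5])
```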
In one embodiment of the invention, the data acquisition module comprises a gesture acquisition module, which comprises a sensor or a camera built into the MR device and is used for capturing hand images and joint point position data of a user in real time; the data processing module comprises a first processing module, which is used for preprocessing the hand images and joint point position data and converting them into gesture data; and the instruction recognition module comprises a gesture recognition module, which is used for matching the gesture data against preset gesture templates through a deep learning algorithm to determine a specific gesture instruction. The execution module is used for transmitting the recognized gesture instruction to the MR device through the API, and the MR device executes the corresponding operation, such as selecting or moving a virtual object, according to the gesture instruction.
In one embodiment of the present invention, the data acquisition module includes a voice acquisition module comprising a microphone for acquiring the user's voice signal; the microphone may be the microphone of the MR device. The data processing module includes a second processing module for converting the acquired voice signal into voice text data, and the instruction recognition module includes a voice recognition module for recognizing the voice text data through a voice recognition model or algorithm to determine a specific voice instruction. The execution module is used for transmitting the recognized voice instruction to the MR device through the API, and the MR device performs the corresponding operation, such as starting a function or controlling a virtual object, according to the voice instruction.
In one embodiment of the invention, the data acquisition module comprises a handle interaction module for acquiring the user's operation actions and pose data in real time, the data processing module comprises a third processing module for preprocessing the acquired operation actions and pose data and converting them into operation data, and the instruction recognition module comprises a handle recognition module for converting the operation data processed by the third processing module into predefined operation instructions, such as instructions for moving or rotating a virtual object, through mapping. The execution module is used for transmitting the recognized operation instruction to the MR device through the API, and the MR device executes the corresponding operation, such as selecting or moving a virtual object, according to the operation instruction.
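The following sketch illustrates how recognized instructions from any of the three modalities might be mapped onto predefined virtual-object operations; the instruction names, the operation table and the apply_to_object helper are hypothetical, not part of an actual MR SDK.

```python
from typing import Callable, Dict


def select_object(obj: Dict) -> None:
    obj["selected"] = True


def move_object(obj: Dict) -> None:
    obj["position"] = [p + 0.1 for p in obj["position"]]


def rotate_object(obj: Dict) -> None:
    obj["rotation_deg"] = (obj["rotation_deg"] + 15) % 360


# Instructions produced by the gesture, voice and handle recognition modules
# all funnel into one instruction-to-operation table.
OPERATIONS: Dict[str, Callable[[Dict], None]] = {
    "select": select_object,
    "move": move_object,
    "rotate": rotate_object,
}


def apply_to_object(instruction: str, virtual_object: Dict) -> Dict:
    # Hypothetical stand-in for passing the instruction to the MR device via its API.
    OPERATIONS[instruction](virtual_object)
    return virtual_object


exhibit = {"position": [0.0, 1.0, 2.0], "rotation_deg": 0, "selected": False}
print(apply_to_object("rotate", exhibit))
```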
In one embodiment of the present invention, the gesture acquisition module includes a depth camera for acquiring three-dimensional depth data of the user's hand and hand images in real time, including each key point of the hand (such as finger joint positions) and its coordinates in three-dimensional space. The first processing module is used for extracting key point data of the hand (such as fingertips and joints) from the three-dimensional depth data and the hand image. The three-dimensional depth data provided by the depth camera can be used to construct a skeleton model of the hand that marks the specific positions and postures of the fingers; the key point data is converted into this skeleton model for further gesture analysis. The coordinates of each key point and the angle information between the fingers are extracted and combined into the overall structure of the hand. The gesture recognition module is further used for extracting the main features of the gesture according to characteristics of the skeleton model such as joint angles, finger lengths and relative positions. The main features may include the degree to which the fingers are spread, their curvature, the rotation angle of the palm, and so on. The extracted main features are compared with samples in a predefined gesture library, the gesture is classified using a pattern matching algorithm (such as dynamic time warping) or a deep learning model (such as a convolutional neural network), and the specific gesture made by the user is identified. The recognized gesture instruction is mapped to a specific operation command (e.g., "grab", "zoom", "rotate", etc.). These instructions are passed to the virtual exhibition system or application through the device's API. The object state in the virtual scene is updated according to the operation command, realizing interaction between the user and the virtual exhibit.
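A minimal sketch of the feature-extraction and template-matching step follows: joint angles and fingertip-to-palm distances are computed from three-dimensional key points and compared against a predefined gesture library with a nearest-neighbour match. The key-point layout, the feature set and the template values are illustrative assumptions.

```python
import numpy as np


def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle in degrees at joint b formed by the points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def hand_features(keypoints: np.ndarray) -> np.ndarray:
    # keypoints: (5, 3) array -- palm centre followed by four finger key points
    # (a simplified skeleton model used only for this illustration).
    palm = keypoints[0]
    angles = [joint_angle(keypoints[1], keypoints[2], keypoints[3]),
              joint_angle(keypoints[2], keypoints[3], keypoints[4])]
    distances = [float(np.linalg.norm(kp - palm)) for kp in keypoints[1:]]
    return np.array(angles + distances)  # 6-dimensional feature vector


# Predefined gesture library: one template feature vector per gesture (toy values).
GESTURE_LIBRARY = {
    "grab": np.array([40.0, 35.0, 0.03, 0.04, 0.04, 0.03]),
    "open_palm": np.array([170.0, 165.0, 0.09, 0.10, 0.10, 0.09]),
}


def classify_gesture(keypoints: np.ndarray) -> str:
    feats = hand_features(keypoints)
    # Nearest-neighbour match against the gesture templates.
    return min(GESTURE_LIBRARY, key=lambda g: float(np.linalg.norm(feats - GESTURE_LIBRARY[g])))


sample_keypoints = np.random.rand(5, 3) * 0.1  # stand-in for depth-camera output
print(classify_gesture(sample_keypoints))
```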
In one embodiment of the present invention, the gesture acquisition module is further configured to acquire multiple gesture data sets, including images or video clips of various gestures. The data should cover a variety of gesture samples, such as "grab", "zoom in", "rotate", and so on, and is labeled with the corresponding gesture category for training. The first processing module preprocesses the acquired gesture images or video data, including image cropping, scaling and normalization to standardize the data format, and may apply data augmentation such as rotation, translation and flipping to increase the robustness of the model. The key point data of the hand is extracted from the image using a deep learning model (such as a convolutional neural network, CNN) or conventional computer vision techniques. These key points include the joint positions of the fingers, the shape of the palm, and so on. The extracted key point data is converted into feature vectors representing the spatial structure and dynamic information of the gestures; the features include the length, angle and relative position of the fingers. The learning model is trained using the labeled gesture data; during training, the model optimizes its parameters by minimizing a loss function (such as cross-entropy loss) to improve recognition accuracy for the various gestures. The trained model is then used to recognize specific gestures from gesture data.
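The training step can be illustrated with the following PyTorch sketch, which trains a small convolutional network with a cross-entropy loss on randomly generated stand-in images; the network architecture, image resolution, gesture classes and training data are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 3  # e.g. "grab", "zoom in", "rotate"


class GestureCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 16 * 16, NUM_CLASSES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))


model = GestureCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # cross-entropy loss as mentioned above

# Stand-in for a labelled batch of 64x64 grayscale gesture images.
images = torch.randn(32, 1, 64, 64)
labels = torch.randint(0, NUM_CLASSES, (32,))

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # minimise the loss to improve accuracy
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.3f}")
```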
During training, the performance of the model is evaluated using a validation data set, monitoring indicators such as accuracy, recall and F1 score. A test data set is used to finally evaluate the generalization ability of the model on unseen data. Hyperparameters of the model (e.g., learning rate, batch size, regularization terms) are adjusted to optimize performance and training stability. Depending on the evaluation results, fine-tuning or retraining of the model may be required to ensure its reliability in practical applications. The trained model is integrated into the practical application to perform real-time gesture recognition: the model receives the user's real-time gesture input and outputs a recognition result through an inference process.
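The evaluation step might look like the following sketch, which computes accuracy, macro-averaged recall and F1 score with scikit-learn on placeholder predictions; the labels and predictions shown are illustrative only.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Placeholder validation labels and model predictions for three gesture classes.
y_true = ["grab", "zoom", "rotate", "grab", "rotate", "zoom"]
y_pred = ["grab", "zoom", "grab", "grab", "rotate", "zoom"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("recall  :", recall_score(y_true, y_pred, average="macro"))
print("F1 score:", f1_score(y_true, y_pred, average="macro"))
```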
In one embodiment of the present invention, the second processing module is further configured to label the voice text data, annotating the intent and entities of each sentence, and to train the language processing model using the labeled data set, obtaining a trained voice recognition model for recognizing the voice text data. The voice recognition module includes BERT, GPT, or a specialized speech recognition model based on the Transformer architecture. The training process covers both a speech recognition model (converting speech to text) and a natural language understanding model (parsing the intent of text instructions). The voice recognition module of the present invention can also recognize the intent in text (e.g., "play music", "adjust volume"). An intent classifier is trained on the labeled data to identify the user's needs. The recognized voice command is then mapped to an actual system operation; for example, a "play music" command may correspond to starting a music player and playing a specified track. The execution result is fed back to the user, for example by confirming the volume adjustment or the music being played, to ensure the user is satisfied with the system response.
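A minimal sketch of the intent-classification and command-mapping step is given below; it substitutes a TF-IDF plus logistic-regression classifier for the Transformer-based model described above, and the training sentences, intent labels and command handlers are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative labelled sentences: each voice-text sample is annotated with an intent.
texts = ["play some music", "please play a song", "turn the volume up",
         "make it louder", "rotate the exhibit", "spin the model around"]
intents = ["play_music", "play_music", "adjust_volume",
           "adjust_volume", "rotate_exhibit", "rotate_exhibit"]

intent_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
intent_clf.fit(texts, intents)

# Map each recognized intent onto a concrete system operation (hypothetical handlers).
COMMANDS = {
    "play_music": lambda: print("starting music player and playing the specified track"),
    "adjust_volume": lambda: print("adjusting volume and confirming the change to the user"),
    "rotate_exhibit": lambda: print("rotating the virtual exhibit"),
}

intent = intent_clf.predict(["could you play music"])[0]
COMMANDS[intent]()  # execute the operation and feed the result back to the user
```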
In one embodiment of the invention, the hand interaction module includes a glove with built-in sensors worn by the user, which captures the three-dimensional position and motion of the hand, including the bending and stretching of the fingers and the movement of the palm. The sensors transmit the hand motion data to the third processing module in a wireless or wired manner. The third processing module processes the motion data captured by the sensors, and specific hand motions and gestures are recognized through the handle recognition module. For example, from the glove's motion sensing, the system can recognize gestures such as "grab", "rotate" and "drag". Based on the identified hand movements, the system applies the corresponding operation to the virtual exhibit; for example, the user selects a virtual object with a "grab" gesture, adjusts the angle of the object with a "rotate" gesture, or moves the object's position with a "drag" gesture.
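The following sketch illustrates one way glove-sensor readings could be turned into "grab", "rotate" and "drag" actions and applied to a virtual exhibit; the sensor fields, thresholds and exhibit representation are assumptions rather than the actual glove protocol.

```python
from typing import Dict


def recognize_hand_action(reading: Dict[str, float]) -> str:
    # reading: per-frame sensor values, e.g. average finger bend (0..1) plus the
    # palm rotation and displacement since the previous frame (assumed fields).
    if reading["finger_bend"] > 0.8:
        return "grab"
    if abs(reading["palm_rotation_deg"]) > 10:
        return "rotate"
    if abs(reading["palm_displacement"]) > 0.05:
        return "drag"
    return "idle"


def apply_action(action: str, exhibit: Dict) -> Dict:
    if action == "grab":
        exhibit["selected"] = True          # select the virtual object
    elif action == "rotate":
        exhibit["rotation_deg"] = (exhibit["rotation_deg"] + 15) % 360
    elif action == "drag":
        exhibit["position"][0] += 0.05      # move the object along one axis
    return exhibit


exhibit = {"selected": False, "rotation_deg": 0, "position": [0.0, 0.0, 0.0]}
frame = {"finger_bend": 0.9, "palm_rotation_deg": 2.0, "palm_displacement": 0.0}
print(apply_action(recognize_hand_action(frame), exhibit))
```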
The invention also realizes real-time interaction and collaboration: in the virtual space, the system achieves real-time collaboration among users through low-latency network communication technology and a distributed computing architecture. The MR device is connected to the cloud server through a Wi-Fi network, ensuring high-bandwidth, low-latency transmission. Operations performed by a user on exhibits in the virtual space (such as rotation, scaling and movement) are captured in real time and transmitted to the cloud server over the network. The cloud server merges and synchronizes the operations of all users and feeds the updated exhibit state back to the other visitors in real time, so that users can view and operate the exhibits synchronously. Through a cooperative locking mechanism, the system ensures that when one user is operating an exhibit, the operations of other users do not conflict, realizing real-time collaboration.
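The cooperative locking mechanism can be illustrated with the following sketch, in which a server-side lock manager grants one user at a time the right to manipulate a given exhibit; the class interface and identifiers are illustrative assumptions.

```python
import threading
from typing import Dict, Optional


class ExhibitLockManager:
    """Grants one user at a time the right to manipulate a given exhibit."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}   # exhibit_id -> user_id
        self._guard = threading.Lock()

    def acquire(self, exhibit_id: str, user_id: str) -> bool:
        with self._guard:
            owner: Optional[str] = self._owners.get(exhibit_id)
            if owner is None or owner == user_id:
                self._owners[exhibit_id] = user_id
                return True
            return False  # another visitor is already manipulating this exhibit

    def release(self, exhibit_id: str, user_id: str) -> None:
        with self._guard:
            if self._owners.get(exhibit_id) == user_id:
                del self._owners[exhibit_id]


locks = ExhibitLockManager()
print(locks.acquire("vase_01", "alice"))  # True: alice may rotate the vase
print(locks.acquire("vase_01", "bob"))    # False: bob must wait for the release
locks.release("vase_01", "alice")
```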
The invention also supports synchronous interaction and communication among multiple users in the virtual exhibition hall. Specifically, the invention uses a multi-user synchronization framework based on WebRTC (real-time communication) that allows multiple users to connect to the virtual exhibition hall simultaneously for real-time audio and video communication and data sharing. To ensure real-time performance, the system transmits only the changed data through a differential data transmission and local update mechanism, reducing network bandwidth usage.
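The differential data transmission idea can be sketched as follows: instead of broadcasting the full exhibit state, only the fields that changed since the last synchronization are sent to other clients. The state layout shown is an assumption.

```python
from typing import Any, Dict


def state_diff(previous: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the key/value pairs that changed between two state snapshots."""
    return {k: v for k, v in current.items() if previous.get(k) != v}


last_synced = {"position": [0, 0, 0], "rotation_deg": 0, "scale": 1.0}
current = {"position": [0, 0, 0], "rotation_deg": 30, "scale": 1.0}

delta = state_diff(last_synced, current)
print(delta)  # {'rotation_deg': 30} -- the only payload sent to the other clients
```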
The invention also realizes dynamic scene management: all interactive scenes in the virtual exhibition hall are managed by the server in real time, and user operations and position changes are synchronized in real time to the MR devices of all participants, keeping the state of the exhibition hall consistent.
The invention can also realize real-time uploading and updating of exhibition content. Specifically, the system adopts a distributed cloud storage service, and a user can upload exhibit resources (such as models, images, text and videos) to the cloud for storage through a computer terminal or the MR device. The system provides a standardized API interface that allows applications to call the exhibit resources stored in the cloud in real time and automatically update the exhibit content; a user can upload files to the cloud and ensure that they can be retrieved and shared at any time during the exhibition through the following technical means. Uploaded files are encrypted and access is controlled through fine-grained permission management; only authorized users can access and operate the files, ensuring the security and confidentiality of the data. The system integrates a globally distributed CDN service, ensuring that users everywhere can access and download exhibition content quickly and reliably. After a file is uploaded, the system immediately synchronizes the update to the MR devices of all users, ensuring the real-time performance and consistency of the content during the exhibition.
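The upload, permission-check and synchronization flow described above might be sketched as follows; the in-memory storage dictionary, permission table and notification hook are hypothetical stand-ins for the distributed cloud storage, fine-grained permission management and CDN/device synchronization services.

```python
import hashlib
from typing import Dict, List

CLOUD_STORE: Dict[str, bytes] = {}                                 # stand-in for cloud object storage
PERMISSIONS: Dict[str, List[str]] = {"hall_42": ["alice", "bob"]}  # authorized users per hall


def notify_mr_devices(hall_id: str, key: str) -> None:
    # Stand-in for pushing the update to the MR devices of all connected users.
    print(f"[sync] hall {hall_id}: new exhibit resource available at {key}")


def upload_exhibit(hall_id: str, user: str, name: str, data: bytes) -> str:
    # Fine-grained permission check: only authorized users may upload.
    if user not in PERMISSIONS.get(hall_id, []):
        raise PermissionError("user is not authorized for this exhibition hall")
    key = f"{hall_id}/{name}/{hashlib.sha256(data).hexdigest()[:8]}"
    CLOUD_STORE[key] = data       # encryption at rest is assumed to happen here
    notify_mr_devices(hall_id, key)
    return key


print(upload_exhibit("hall_42", "alice", "vase_model.glb", b"...binary model data..."))
```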
By combining multi-modal interaction technology with an immersive exhibition experience, users can quickly acquire and understand exhibit content with less external interference, improving the exhibition effect. Users can choose suitable interaction and exhibition modes according to their own habits and needs, providing a personalized exhibition experience. Synchronous interaction and communication through the MR devices promotes collaboration and communication among users and enhances team exhibition and cooperation capability.
In the content preparation stage, a user uploads exhibit resources including models, images, text and video through a computer terminal, and the content can be updated in real time, ensuring the flexibility and immediacy of the exhibition.
In the stage of creating and joining the virtual exhibition hall, a user logs into the system through the MR head-mounted device to create the virtual exhibition hall and enters the exhibition hall code in the virtual space; the generated exhibition hall code can be used to invite others to join. Please refer to fig. 2 for details.
In the exhibition stage, please refer to fig. 3: the virtual exhibition hall is initially set up, showing its basic elements (exhibition area, exhibit display area, interactive screen, etc.); users interact through the MR device, with their actions and voice expression presented in the virtual space; and visitors manipulate the exhibits in the virtual space, demonstrating close-range exhibit interaction such as dragging, rotating and enlarging the exhibits.
In close-range exhibit interaction, an exhibit shared by a user in the virtual space can be viewed and operated synchronously by other visitors, realizing real-time collaboration.
The multi-modal interaction technology integrates gesture recognition, voice control and handle interaction, realizing a natural and convenient interaction experience.
For the immersive exhibition experience, please refer to fig. 4, which shows a user entering the virtual three-dimensional space through an MR device to carry out the exhibition process and perform interactive operations.
Through innovative multi-modal interaction and mixed reality technology, the invention provides a novel digital exhibition hall system that can effectively improve exhibition efficiency, enhance exhibition interest, provide a personalized exhibition experience, support real-time collaboration, and promote immersion and interactivity, bringing a brand-new transformation and development to the exhibition field.
The foregoing embodiments are merely for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that modifications may be made to the technical solution described in the foregoing embodiments or equivalents may be substituted for parts of the technical features thereof, and that such modifications or substitutions do not depart from the spirit and scope of the technical solution of the embodiments of the present invention in essence.