Disclosure of Invention
The method, device, storage medium and equipment for indoor scene reconstruction provided herein fuse point cloud data with a panoramic RGB image to construct image information that carries both depth and color. Based on the fused image information, objects in the scene are semantically annotated with an object segmentation model and a vision-language model, so that an indoor scene containing semantic information is constructed, providing accurate and rich data support for three-dimensional modeling and path planning scenarios.
In a first aspect, the present invention provides a method for indoor scene reconstruction, including:
acquiring laser radar point cloud information and a panoramic RGB image;
preprocessing the panoramic RGB image to obtain a hexahedral cube image;
projecting the laser radar point cloud information onto the hexahedral cube image by combining the internal parameters and the external parameters of the panoramic camera, extracting the color information of each point in the point cloud information, and generating color point cloud data;
generating an RGB-D data sequence according to the distance between each point in the color point cloud data and the panoramic camera;
inputting the hexahedral cube images into an object segmentation model, and dividing the indoor scene into a plurality of independent object areas;
inputting each independent object area into a vision-language model to obtain semantic tags of each independent object;
performing projection alignment on each independent object containing the semantic tag and the RGB-D sequence to obtain aligned point cloud data;
and inputting the aligned point cloud data into a neural kernel surface reconstruction model to obtain a reconstructed indoor scene.
Further, the method for reconstructing the indoor scene further comprises the following steps:
searching corresponding independent object results in the reconstructed indoor scene according to the received user instruction;
the searched independent object result is sent to a user side;
if the independent object result fed back by the user side is inconsistent with the user instruction, adding a supplementary semantic tag to the independent object area indicated by the user instruction.
Further, the object segmentation model is an instance segmentation model.
Further, the vision-language model is a contrastive language-image pre-training (CLIP) model or a Grounding DINO model.
Further, the preprocessing is performed on the panoramic RGB image to obtain a hexahedral cube image, which specifically includes:
the panoramic RGB image is converted into a hexahedral cube image by adopting an equidistant columnar projection mode.
Further, the step of projecting the laser radar point cloud information to a hexahedral cube image by combining the internal parameters and the external parameters of the panoramic camera, extracting color information of each point in the point cloud information, and generating color point cloud data includes the following steps:
The following steps are performed for any one target point in the laser point cloud information:
step S201, correcting the coordinates of the target point according to the external parameters of the panoramic camera to obtain corrected target point coordinates;
Step S202, rotating the corrected target point coordinates according to the internal parameters of the panoramic camera to obtain target point mapping coordinates;
Step S203, combining the width and the height of the panoramic RGB image to obtain the projection point coordinates of the target point on the hexahedral cube;
step S204, the color data of the projection point is recorded as the color data of the target point;
step S205, repeating steps S201-S204 until each target point in the laser point cloud information determines color data, and generating color point cloud data.
Further, the generating the RGB-D data sequence according to the distance between each point in the color point cloud data and the panoramic camera includes the following steps:
Generating a depth map according to the distance between each point in the color point cloud data and the panoramic camera;
and combining the depth map with the color point cloud data to obtain an RGB-D data sequence.
Further, the combining the depth map with color point cloud data to obtain an RGB-D data sequence further includes:
And for each target point of the color point cloud data, if the depth value of the target point is smaller than the historical depth value, updating the color characteristics of the target point.
In a second aspect, the present invention further provides a device for indoor scene reconstruction, which comprises:
the image acquisition module is used for acquiring laser radar point cloud information and panoramic RGB images;
The panoramic hexahedral module is used for preprocessing the panoramic RGB image to obtain a hexahedral cube image;
the point cloud data color extraction module is used for projecting the laser radar point cloud information to a hexahedral cube image in combination with the internal parameters and the external parameters of the panoramic camera, extracting the color information of each point in the point cloud information and generating color point cloud data;
The depth parameter combination module is used for generating an RGB-D data sequence according to the distance between each point in the color point cloud data and the panoramic camera;
The object segmentation module is used for inputting the hexahedral cube images into an object segmentation model and dividing an indoor scene into a plurality of independent object areas;
The semantic annotation module is used for inputting each independent object region into the vision-language model to obtain semantic tags of each independent object;
the data fusion module is used for carrying out projection alignment on each independent object containing the semantic tag and the RGB-D sequence to obtain aligned point cloud data;
and the scene reconstruction module is used for inputting the aligned point cloud data into the neural kernel surface reconstruction model to obtain a reconstructed indoor scene.
In a third aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for indoor scene reconstruction according to any one of the implementations of the first aspect.
In a fourth aspect, the present invention also provides a computer device comprising a memory storing a computer program and a processor which, when executing the computer program, performs the method for indoor scene reconstruction according to any one of the implementations of the first aspect.
The technical scheme has the following advantages. The laser radar point cloud information and the panoramic RGB image are fused, achieving accurate alignment of the handheld laser radar data with the panoramic camera data. An open-vocabulary three-dimensional scene graph is built by the algorithm, so that previously unseen objects can be identified and semantically rich labels can be generated, breaking through the limitation of predefined categories and laying a technical foundation for open-set semantic segmentation. For scene reconstruction, a neural kernel surface reconstruction model reconstructs the point cloud of each object instance with high geometric and textural precision, effectively capturing the complex structure and details of the scene while faithfully restoring color and texture information. Through object-level instance segmentation and real-time scene graph updating, the method and the device can adapt to changes in a dynamic scene, such as the movement, addition or removal of objects, significantly improving the capability of understanding and modeling dynamic environments.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. In order to more specifically describe the present invention, the method, apparatus, storage medium and device for indoor scene reconstruction provided by the present invention are specifically described below with reference to the accompanying drawings.
Unless defined otherwise, technical or scientific terms used in the present disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which the present application belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or article preceding the word is meant to encompass the element or article listed thereafter and equivalents thereof without excluding other elements or articles. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which can be changed accordingly when the absolute position of the object to be described is changed.
In recent years, the rise of multi-modal sensor data fusion technology has provided a new solution for the perception and reconstruction of indoor scenes. By combining the high-precision point cloud data of a laser radar with the rich visual information of a camera, researchers can introduce semantic-level understanding while retaining geometric information. However, how to efficiently fuse multi-modal data and achieve end-to-end high precision and efficiency in scene segmentation, semantic annotation and three-dimensional reconstruction remains a challenging research topic.
Deep-learning semantic segmentation techniques based on laser radar (LiDAR) point clouds, such as the RandLA-Net model (a random-sampling point cloud semantic segmentation model), the KPConv model (Kernel Point Convolution model) and the SparseConvNet model (Submanifold Sparse Convolutional Network), realize object recognition and semantic segmentation in a scene by modeling the geometric characteristics of point cloud data. These methods perform well on sparse point clouds and can capture local geometric features of the point cloud to generate three-dimensional semantic information. However, they have significant limitations: they rely on geometric characteristics alone, lack semantic-level understanding, and cannot segment or identify new objects of unknown categories or in complex scenes; they therefore have difficulty handling object changes in dynamic scenes and cannot update semantic information in real time. In addition, because the volume of laser radar point cloud data is huge, directly processing large-scale point cloud data consumes substantial computing resources and memory, making it difficult to meet the requirements of real-time applications. Segmentation methods based purely on point cloud data are therefore limited in tasks involving open scenes and multi-modal requirements.
Semantic segmentation techniques based on RGB images (such as Mask R-CNN, DeepLab and Vision Transformer) rely on convolutional neural networks or vision transformers and achieve excellent results in image feature extraction and instance segmentation, so that various object types can be identified and high-resolution segmentation results can be generated. However, such methods depend only on two-dimensional images, have limited capability for modeling three-dimensional geometric information, and have difficulty handling the spatial structure and depth information in point cloud data. In addition, this technique depends heavily on annotated data, cannot identify new categories outside a closed vocabulary, has weak generalization capability, lacks the ability to dynamically update semantic information in a dynamic scene, cannot adapt to scene changes in real time, and is difficult to apply to complex scene planning tasks, such as robot navigation and grasping tasks that require spatial semantic relationships.
Furthermore, the prior art also includes multi-modal semantic segmentation techniques (such as BEVFusion, FusionMLP and MaskFusion) that realize semantic segmentation and understanding of three-dimensional scenes by fusing the complementary advantages of laser radar point cloud information and RGB image data. By utilizing the semantic information of the image data and the geometric information of the laser radar point cloud data, scenes can be described more comprehensively and the precision of semantic segmentation is remarkably improved. However, multi-modal data fusion increases the computational complexity and places higher demands on hardware resources, which limits the application of multi-modal semantic segmentation techniques in real-time tasks.
In view of these considerations on the prior art, the present invention provides a method for indoor scene reconstruction, which obtains an RGB-D image by fusing point cloud data with a panoramic RGB image, and performs object segmentation and semantic annotation by combining an instance segmentation model with a vision-language model to obtain an indoor scene with semantic tags.
The embodiment of the application provides an application scene of a method for reconstructing an indoor scene, which comprises terminal equipment provided by the embodiment, wherein the terminal equipment comprises, but is not limited to, a smart phone and computer equipment, and the computer equipment can be at least one of a desktop computer, a portable computer, a laptop computer, a mainframe computer, a tablet computer and the like. The terminal device receives indoor point cloud data sent by the laser radar and panoramic RGB images sent by the panoramic camera, and constructs a three-dimensional scene containing semantic information, and referring to a schematic diagram of an indoor scene reconstruction method shown in fig. 1, for specific processes, please refer to an embodiment of the indoor scene reconstruction method.
Step S101, laser radar point cloud information and a panoramic RGB image are acquired.
The laser radar point cloud data refers to a vector set of the indoor scene in a three-dimensional coordinate system acquired with a laser radar, which provides the three-dimensional geometric structure information of the indoor scene; see the data acquisition schematic of the indoor scene reconstruction method shown in fig. 2. The panoramic RGB image refers to a 360-degree panoramic RGB image of the indoor scene acquired by a panoramic camera. An RGB image generates various colors from different intensity combinations of the three basic colors red (R), green (G) and blue (B); each pixel usually consists of three components representing the brightness values of red, green and blue respectively, and each component ranges from 0 to 255.
In this embodiment, as shown in fig. 2, laser radar point cloud information is acquired with a handheld laser radar to provide the three-dimensional geometric structure information of the indoor scene, and a 360-degree panoramic RGB image of the indoor scene is acquired with a panoramic camera to supplement semantic information and texture details. In this embodiment, the handheld laser radar may be a smart L1 laser radar, and the panoramic camera may be an Insta360 panoramic camera. In this embodiment, the handheld laser radar and the panoramic camera are designed as an integrated unit.
Step S102, preprocessing the panoramic RGB image to obtain a hexahedral cube image.
Specifically, considering that the panoramic RGB image is severely stretched and distorted in the polar regions (such as the top and bottom of the panoramic RGB image) because of its projection mode, which blurs or deforms image details and is unfavorable for the subsequent indoor scene reconstruction process, the panoramic RGB image needs to be converted in this application into a hexahedral cube image, which is closer to a conventional planar image and has little distortion. It should be noted that, in this embodiment, the panoramic RGB image may be converted into a hexahedral cube image by means of equidistant cylindrical (equirectangular) projection. Specifically, each pixel (u, v) of the panoramic RGB image may be assigned spherical coordinates (θ, φ), from which the corresponding pixel of the hexahedral cube image is determined:
θ = (u / W − 1/2) · 2π,  φ = (1/2 − v / H) · π,
where the spherical coordinate θ represents the longitude of the panoramic RGB image pixel on the sphere, the spherical coordinate φ represents the latitude of the panoramic RGB image pixel on the sphere, θ ∈ [−π, π], φ ∈ [−π/2, π/2], and W and H are the width and height of the panoramic RGB image. The direction on the unit sphere given by (θ, φ) then determines the cube face and the pixel position on that face.
Equidistant cylindrical (equirectangular) projection projects each part of the spherical image onto a plane at equal angular intervals, so that positions on the projection plane correspond one-to-one with positions on the spherical image. Its purpose is to represent a large spherical image on a smaller projection plane while keeping the shape and scale relationships of the image. Each face of the hexahedral cube image obtained after this re-projection is closer to the perspective of a conventional planar image, and the middle area of each face has almost no distortion, which is more favorable for the subsequent perception of indoor scene objects.
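The following is a minimal sketch of converting an equirectangular panorama into one face of the hexahedral cube image, assuming the panorama is an H x W x 3 numpy array; the face orientation conventions, function names and face size are illustrative assumptions rather than the exact formulation of this embodiment.

```python
import numpy as np

def equirect_to_cube_face(pano: np.ndarray, face: str, face_size: int) -> np.ndarray:
    """Render one face of the hexahedral cube image by sampling the panorama."""
    H, W = pano.shape[:2]
    # Pixel grid of the cube face in [-1, 1]
    a = (np.arange(face_size) + 0.5) / face_size * 2.0 - 1.0
    x, y = np.meshgrid(a, -a)  # flip so that +y points up at the top of the face
    ones = np.ones_like(x)
    # Viewing direction of each face pixel (one case per cube face, convention is illustrative)
    dx, dy, dz = {
        "front": (x, y, ones), "back": (-x, y, -ones),
        "right": (ones, y, -x), "left": (-ones, y, x),
        "top": (x, ones, -y), "bottom": (x, -ones, y),
    }[face]
    norm = np.sqrt(dx * dx + dy * dy + dz * dz)
    # Longitude theta in [-pi, pi], latitude phi in [-pi/2, pi/2]
    theta = np.arctan2(dx, dz)
    phi = np.arcsin(dy / norm)
    # Standard equirectangular lookup: u = (theta/2pi + 1/2)*W, v = (1/2 - phi/pi)*H
    u = np.clip(((theta / (2 * np.pi) + 0.5) * W).astype(int), 0, W - 1)
    v = np.clip(((0.5 - phi / np.pi) * H).astype(int), 0, H - 1)
    return pano[v, u]

# faces = {name: equirect_to_cube_face(pano, name, 1024)
#          for name in ("front", "back", "left", "right", "top", "bottom")}
```

Sampling all six faces this way yields the hexahedral cube image used in the subsequent steps.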
Step S103, combining the internal parameters and the external parameters of the panoramic camera, projecting the laser radar point cloud information to a hexahedral cube image, extracting the color information of each point in the point cloud information, and generating color point cloud data.
Specifically, with reference to the schematic of combining the panoramic RGB image and the laser radar point cloud information shown in fig. 3, in this embodiment the panoramic camera and the laser radar are integrated as one unit; the internal parameters of the panoramic camera include its focal length and principal point coordinates, and the external parameters of the panoramic camera are the relative pose between the laser radar and the panoramic camera. The point cloud data are accurately projected onto the hexahedral cube image, the color information of each point in the point cloud information is extracted, and color point cloud data are generated. The color point cloud data include the color information of the point cloud in addition to its three-dimensional structure information.
The embodiment projects laser radar point cloud information to a hexahedral cube image, extracts color information of each point in the point cloud information, and generates color point cloud data by the following method:
the following operations are performed for any one target point of the laser point cloud information:
Step S201, correcting the coordinates of the target point according to the external parameters of the panoramic camera to obtain corrected target point coordinates.
Specifically, the expression for correcting the coordinates of the target point according to the external parameters of the panoramic camera is:
P′ = T · P,
where P′ is the corrected target point coordinate, P is the target point coordinate (in homogeneous form), and T is the external parameter of the panoramic camera, representing the relative pose between the laser radar and the panoramic camera.
Step S202, rotating the corrected target point coordinates according to the internal parameters of the panoramic camera to obtain target point mapping coordinates.
Specifically, the corrected target point coordinates are rotated according to the internal parameters of the panoramic camera, and the expression for obtaining the target point mapping coordinates is as follows:
p = K · P′,
where p is the target point mapping coordinate and K is the internal parameter matrix of the panoramic camera,
K = [ f_x 0 c_x ; 0 f_y c_y ; 0 0 1 ],
in which f_x is the focal length of the panoramic camera in the horizontal direction, f_y is the focal length of the panoramic camera in the vertical direction, c_x is the abscissa of the image principal point, and c_y is the ordinate of the image principal point.
Step S203, combining the width and the height of the panoramic RGB image to obtain the projection point coordinates of the target point on the hexahedral cube.
Specifically, the expression for obtaining the coordinates of the projection point of the target point on the hexahedral cube by combining the width and the height of the panoramic RGB image is as follows:
u = (θ / 2π + 1/2) · W,
v = (1/2 − φ / π) · H,
where (u, v) are the coordinates of the projection point on the hexahedral cube image, W is the width of the panoramic RGB image, H is the height of the panoramic RGB image, and (θ, φ) are the spherical coordinates of the panoramic RGB image corresponding to the target point mapping coordinates, with θ ∈ [−π, π] and φ ∈ [−π/2, π/2].
Step S204, the color data of the projection point is recorded as the color data of the target point.
Specifically, the expression for recording the color data of the projection point as the color data of the target point is:
C(P) = I(u, v),
where C(P) is the color data of the target point and I(u, v) is the color data of the projection point.
Step S205, repeating steps S201-S204 until each target point in the laser point cloud information determines color data, and generating color point cloud data.
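The following is a vectorised sketch of steps S201 to S205, assuming an N x 3 point array in the LiDAR frame, a 4 x 4 extrinsic matrix T_ext and an H x W x 3 panorama; for brevity it maps the corrected coordinates directly through the spherical longitude/latitude relation rather than a separate intrinsic rotation step, and all names are illustrative assumptions.

```python
import numpy as np

def colorize_point_cloud(points: np.ndarray, pano: np.ndarray, T_ext: np.ndarray) -> np.ndarray:
    H, W = pano.shape[:2]
    # S201: correct the target point coordinates with the camera extrinsics
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    pts_cam = (T_ext @ pts_h.T).T[:, :3]
    # S202/S203: map the corrected coordinates to spherical coordinates and then to pixels
    x, y, z = pts_cam[:, 0], pts_cam[:, 1], pts_cam[:, 2]
    r = np.linalg.norm(pts_cam, axis=1)
    theta = np.arctan2(x, z)                                        # longitude in [-pi, pi]
    phi = np.arcsin(np.clip(y / np.maximum(r, 1e-9), -1.0, 1.0))    # latitude in [-pi/2, pi/2]
    u = np.clip(((theta / (2 * np.pi) + 0.5) * W).astype(int), 0, W - 1)
    v = np.clip(((0.5 - phi / np.pi) * H).astype(int), 0, H - 1)
    # S204/S205: record the colour of the projection point for every target point
    colors = pano[v, u].astype(np.float32) / 255.0
    return np.hstack([points, colors])   # N x 6 coloured point cloud (x, y, z, r, g, b)
```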
Step S104, according to the distance between each point in the color point cloud data and the panoramic camera, an RGB-D data sequence is generated.
Specifically, according to the distance between each point in the color point cloud data and the panoramic camera, the generation of the RGB-D data sequence comprises the following steps:
step S301, generating a depth map according to the distance between each point in the color point cloud data and the panoramic camera;
Step S302, combining the depth map with color point cloud data to obtain an RGB-D data sequence.
The step S302 of combining the depth map with the color point cloud data to obtain the RGB-D data sequence includes the following specific steps:
For any depth image pixel coordinate (u, v) in the depth map, the pixel is back-projected into a spatial coordinate (x, y, z) using the panoramic camera parameters. According to this spatial coordinate, the nearest neighbor point is found in the color point cloud data, and the RGB value of the nearest neighbor point is matched to the depth image pixel coordinate (u, v). A quadruple (R, G, B, D) can thereby be generated by combining the depth image pixel coordinate with the RGB value, where (R, G, B) is the RGB value of the nearest neighbor point and D is the distance (depth) value of the depth image pixel. When every depth image pixel coordinate in the depth map has been assigned the RGB value of its nearest neighbor point in the color point cloud data and the corresponding quadruple has been generated, the quadruples of all depth image pixels form a four-channel image or sequence array, which is sorted by frame number to obtain the RGB-D data sequence.
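A sketch of this quadruple generation is given below, assuming the depth map has already been rendered and that back_project is a hypothetical helper mapping a depth pixel back to a 3D point; the nearest neighbour search uses SciPy's cKDTree, and the per-pixel loop is kept explicit for clarity.

```python
import numpy as np
from scipy.spatial import cKDTree

def fuse_rgbd(depth_map: np.ndarray, colored_points: np.ndarray, back_project) -> np.ndarray:
    """colored_points: N x 6 array (x, y, z, r, g, b); returns an H x W x 4 RGB-D frame."""
    H, W = depth_map.shape
    tree = cKDTree(colored_points[:, :3])
    rgbd = np.zeros((H, W, 4), dtype=np.float32)
    for v in range(H):
        for u in range(W):
            d = depth_map[v, u]
            if d <= 0:
                continue                              # no point cloud support for this pixel
            xyz = back_project(u, v, d)               # panoramic back-projection to 3D
            _, idx = tree.query(xyz)                  # nearest neighbour in the colour cloud
            rgbd[v, u, :3] = colored_points[idx, 3:6] # R, G, B of the nearest neighbour point
            rgbd[v, u, 3] = d                         # depth value of the pixel
    return rgbd

# Sorting such four-channel frames by frame number yields the RGB-D data sequence.
```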
It should be noted that, when generating the depth map from the distances between the color point cloud data and the panoramic camera, several problems arise because the sparse point cloud lacks sufficient density and resolution: "missing projection" (leakage) caused by the sparsity of the point cloud map, false projection of occluded points, and the absence of point cloud support for the surfaces of completely occluded objects, so that occlusion relationships cannot be determined accurately. To address this, a depth buffer (Depth Buffer) is used: the depth value of the current point is compared with the value in the depth buffer, and the color or feature of the pixel is updated only if the depth of the current point is smaller than the recorded depth value. To address the occlusion problem, a 5×5 minimum filter is applied to screen out errors or noise introduced when the sparse point cloud is projected onto the image frame, thereby removing outliers and erroneous projection points.
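A sketch of the depth-buffer comparison and the 5×5 minimum filtering is given below, assuming the projected points are already available as integer pixel coordinates with depths and colours; scipy.ndimage.minimum_filter provides the filtering, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def render_depth(us, vs, depths, colors, H, W):
    depth_buf = np.full((H, W), np.inf, dtype=np.float32)
    color_buf = np.zeros((H, W, 3), dtype=np.float32)
    for u, v, d, c in zip(us, vs, depths, colors):
        # keep a point only if it is closer than the value already recorded in the buffer
        if d < depth_buf[v, u]:
            depth_buf[v, u] = d
            color_buf[v, u] = c
    # 5x5 minimum filter to suppress farther points leaking through foreground objects
    filtered = minimum_filter(np.where(np.isfinite(depth_buf), depth_buf, np.inf), size=5)
    filtered[np.isinf(filtered)] = 0.0   # pixels with no point cloud support
    return filtered, color_buf
```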
The data processing process not only realizes the accurate alignment of the point cloud information and the image data, but also lays a foundation for the subsequent semantic segmentation and three-dimensional reconstruction.
Step S105, inputting the hexahedral cube image into an object segmentation model, and dividing the indoor scene into a plurality of independent object regions.
Specifically, the object segmentation model of this embodiment may be an instance segmentation model, namely the Segment Anything Model (SAM). The indoor scene is divided into a plurality of independent object regions by generating binarized semantic masks for the hexahedral cube image. It should be noted that the SAM model can segment objects in any image through various interactive cues (such as points, boxes, text or masks) without fine-tuning for a specific task, thereby realizing object segmentation.
Further, the binarized semantic masks are projected into the three-dimensional point cloud space through the internal and external parameters of the camera, where the internal parameters are used to calculate the projection coordinates and the external parameters describe the pose relationship between the laser radar and the panoramic camera, thereby realizing the generation of multi-modal, object-level instance point clouds.
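The following sketch segments one cube-face image with SAM and lifts each binary mask into an object-level instance point cloud, assuming the segment-anything package and its released ViT-H checkpoint; the checkpoint filename and the point_uv/points variables (the per-point projection pixels and 3D coordinates from step S103) are assumptions.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def segment_and_lift(face_image: np.ndarray, point_uv: np.ndarray, points: np.ndarray):
    """face_image: H x W x 3 uint8; point_uv: N x 2 integer pixel coordinates of projected points."""
    masks = mask_generator.generate(face_image)       # list of binary semantic masks
    instances = []
    for m in masks:
        seg = m["segmentation"]                        # H x W boolean mask
        inside = seg[point_uv[:, 1], point_uv[:, 0]]   # points whose projection falls in the mask
        instances.append(points[inside])               # object-level instance point cloud
    return instances
```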
And S106, inputting each independent object area into a vision-language model to obtain semantic tags of each independent object.
Specifically, in this embodiment, with reference to the object segmentation schematic of the indoor scene reconstruction method shown in fig. 6, the vision-language model is a Contrastive Language-Image Pre-training (CLIP) model, which extracts features from the RGB image region of each independent object area and thereby generates open-vocabulary semantic tags.
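A sketch of labelling one cropped object region with CLIP is given below, assuming the openai clip package; the candidate vocabulary and prompt template are illustrative assumptions, not a fixed set prescribed by this embodiment.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def label_region(region: Image.Image, vocabulary):
    """Return the best-matching open-vocabulary tag for an object crop and its score."""
    image = preprocess(region).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {w}" for w in vocabulary]).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).squeeze(0)
    best = int(probs.argmax())
    return vocabulary[best], float(probs[best])   # semantic tag and its confidence

# tag, score = label_region(cropped_chair, ["chair", "table", "sofa", "red round object"])
```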
Further, before step S106, a nearest-point priority policy may be applied to each independent object area to eliminate projection errors and ensure the geometric accuracy of the object point cloud.
Specifically, for each semantic instance, the corresponding three-dimensional point cloud subset is extracted based on the projection relationship or the preliminary segmentation result and used as the input region for subsequent processing. Considering the various depth deviations that image back-projection or multi-view synthesis may introduce, an error estimate between the observed distance of each point and its ideal position is established. For candidate points in the same pixel region or the same spatial neighborhood, a nearest-point priority strategy is adopted: only the point with the minimum distance to the panoramic camera is retained as the effective observation. By excluding redundant projection points that are farther away or occluded, geometric artifacts and ghosting caused by multi-view re-projection can be effectively reduced. The screened point cloud geometry serves as a high-precision geometric representation and is input into the vision-language model for semantic tag prediction, which improves recognition accuracy and robustness.
Furthermore, considering situations where semantic objects are complex, objects are strongly combined, or the description given by an individual semantic tag is not clear enough and may be ambiguous, in this embodiment the open-vocabulary semantic tags can both label the category of an object (such as a chair or a table) and supplement it with attribute descriptions (such as a red round object), providing multi-tag candidates for complex objects. For the supplementary attribute description, the specific procedure is as follows:
Step S1061, searching for a corresponding independent object result in the reconstructed indoor scene according to the received user instruction.
Step S1062, the searched independent object result is sent to the user side.
Step S1063, if the independent object result fed back by the user side is inconsistent with the user instruction, a supplementary semantic tag is added to the independent object area indicated by the user instruction. The user instruction describes, in natural language, the object the user wants to find; a specific object description can be extracted from the user instruction by a natural language recognition model, and the corresponding independent object results are found by traversing the semantic tags of every independent object in the reconstructed indoor scene according to that object description.
The supplementary semantic tags include at least one of a color attribute tag, a shape attribute tag and a position attribute tag. The color attribute tag indicates the color of the independent object, such as red, green or yellow; the shape attribute tag indicates its shape, such as round, rectangular or heart-shaped; and the position attribute tag indicates its specific position in the reconstructed scene, such as upper left, the northeast corner, or between object A and object B. A supplementary semantic tag formed from at least one of these attribute tags can resolve cases where the original tag in the reconstructed indoor scene is ambiguous or insufficiently specific. For example, suppose the original tag is "chair", but chairs of various colors and shapes exist at different positions in the reconstructed indoor scene. In this case, the independent object results of all chairs in the reconstructed indoor scene are fed back to the user side; if the independent object results found are inconsistent with the user instruction (for example, the user instruction asks for a chair with a red heart-shaped backrest), a supplementary semantic tag (such as "red heart-shaped backrest") is added to the independent object area indicated by the user instruction.
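A minimal sketch of this search-and-supplement loop (steps S1061 to S1063) is given below, assuming each reconstructed object is represented as a dictionary with a "labels" list; the data structure and the user_rejects feedback call are hypothetical placeholders.

```python
def search_objects(scene_objects, query_terms):
    """Return objects whose semantic tags contain every term of the user query."""
    return [obj for obj in scene_objects
            if all(any(term in tag for tag in obj["labels"]) for term in query_terms)]

def supplement_labels(obj, extra_tags):
    """Attach supplementary colour / shape / position attribute tags to an object."""
    obj["labels"].extend(t for t in extra_tags if t not in obj["labels"])

# results = search_objects(scene, ["chair"])
# if user_rejects(results):                     # feedback from the user side (hypothetical)
#     supplement_labels(target_obj, ["red", "heart-shaped backrest"])
```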
In addition, for the open-vocabulary semantic tags, a confidence evaluation can be generated in combination with the semantic tags when the user searches for a specific object at a later stage, thereby improving the accuracy of the semantic annotation. The confidence evaluation refers to the accuracy of the result returned when the user searches for a specific object at that later stage.
Step S107, projection alignment is performed between each independent object containing the semantic tag and the RGB-D sequence to obtain aligned point cloud data.
Step S108, the aligned point cloud data is input into a neural kernel surface reconstruction model to obtain the reconstructed indoor scene.
Specifically, as shown in the scene reconstruction flow chart of fig. 4, in the scene reconstruction stage a Neural Kernel Surface Reconstruction (NKSR) model is used to perform high-precision three-dimensional geometric reconstruction of the segmented object point clouds. The NKSR model uses a neural network to learn local geometric features in the point cloud data and predicts the three-dimensional surface of the object from these features. It should be noted that the NKSR model combines convolutional neural networks (CNN) with kernel methods (Kernel Methods), and can reconstruct object surfaces efficiently and accurately by learning the local geometric information of the point cloud, providing high-quality basic data for subsequent applications such as virtual reality, indoor navigation and scene simulation.
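A sketch of the reconstruction call is given below, based on the usage shown in the published examples of the NKSR project; the exact API (nksr.Reconstructor, reconstruct, PCNNField, extract_dual_mesh) is assumed from that repository and may differ between versions, so this is an assumption rather than a definitive implementation of this embodiment.

```python
import torch
import nksr

device = torch.device("cuda:0")
reconstructor = nksr.Reconstructor(device)

# xyz, normals, colors: float tensors of shape (N, 3) on `device`; normals can be
# estimated beforehand with a standard point cloud library.
field = reconstructor.reconstruct(xyz, normals)
field.set_texture_field(nksr.fields.PCNNField(xyz, colors))
mesh = field.extract_dual_mesh(mise_iter=2)
# mesh.v, mesh.f, mesh.c: vertices, faces and per-vertex colours of the reconstructed surface
```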
Based on the above scheme, the handheld laser radar is combined with the panoramic camera, and the three-dimensional geometric information of the laser radar point cloud is accurately aligned with the texture and color information of the panoramic camera, breaking through the limitation of a single modality and generating standardized RGB-D data. This multi-modal fusion effectively improves the comprehensiveness and detail of scene perception and provides rich data support for subsequent semantic segmentation, three-dimensional modeling and path planning. In addition, this embodiment combines the SAM model and the CLIP model to realize open-vocabulary semantic segmentation: semantic tags and instance segmentation results can be generated automatically without additional data annotation or model training. Compared with traditional segmentation techniques restricted to predefined categories, this breaks the category limitation, has strong universality and adaptability, can identify unknown objects in complex scenes and assign them semantic tags, and greatly reduces model development cost and annotation difficulty.
Further, in this embodiment the SAM model and the CLIP model are combined to detect objects and generate specific labels without being restricted to predefined categories; however, ambiguity in the generated labels may sometimes cause unstable or inconsistent detection results. In this regard, referring to the semantic tag labeling flow schematic shown in fig. 5, this embodiment may also use a Grounding DINO model (an open-set, text-prompted object detection model) to label semantic tags for each independent object region, and introduce a class file as the predefined class tag set of the object detection model. This ensures the consistency and repeatability of the segmentation results; the class file can be defined according to actual requirements, which improves the stability of the segmentation results.
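The following minimal sketch illustrates, under stated assumptions, how a predefined class file can constrain the labels produced by a text-prompted detector such as Grounding DINO; detect_with_prompt is a hypothetical placeholder for the detector call and is not that library's real API.

```python
def load_class_file(path: str):
    """Read one predefined class name per line from the class file."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def detect_fixed_classes(image, class_file: str, detect_with_prompt):
    classes = load_class_file(class_file)
    prompt = " . ".join(classes)                  # one text prompt listing the class set
    detections = detect_with_prompt(image, prompt)
    # keep only detections whose label is in the predefined set, for repeatable results
    return [d for d in detections if d["label"] in classes]
```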
In order to better illustrate the advantages of the indoor scene reconstruction method in practical applications, two examples, indoor gas pipeline wiring and indoor robot navigation, are described below using the method.
(1) Indoor gas pipeline wiring example:
Point cloud data and a 360-degree panorama of a kitchen space are collected with the handheld laser radar and the panoramic camera. The panorama is converted into a six-faced cube view, a depth map is generated from the distance between the point cloud and the panoramic camera, and RGB-D data are generated. SAM and CLIP are combined to realize open-vocabulary semantic segmentation, automatically identifying and marking the positions and shapes of walls, ceilings, floors, furniture, equipment and other objects that may obstruct the wiring. Semantic tags are generated by CLIP, detailed attribute descriptions are added for the segmented objects, and the NKSR algorithm is used to perform high-precision three-dimensional reconstruction of the point cloud, producing a three-dimensional kitchen scene model with rich geometric detail and realistic textures.
On the basis, the optimal path of the gas pipeline is automatically planned by combining with the pipeline wiring rule, the bracket fixing points are designed, dangerous areas are marked, and the safety and the rationality of wiring are ensured. The indoor scene reconstruction method remarkably reduces the data acquisition and modeling cost while improving the design efficiency of the gas wiring, has the advantages of high convenience and intelligence, and provides powerful technical support for construction, maintenance and safety evaluation of the gas wiring.
(2) Indoor robot navigation scene:
Point cloud data of the indoor scene collected by the laser radar and a 360-degree panoramic image from the panoramic camera are used; the panoramic image is converted into a six-faced cube image, a depth map is generated from the distance between the point cloud and the panoramic camera, and RGB-D data are generated. Semantic tags are produced by the segmentation foundation model SAM together with CLIP, so that the system can accurately perceive and analyze the environment and construct a semantic three-dimensional map. Based on the semantic three-dimensional map, the robot can avoid obstacles and adjust its navigation path, planning an optimal route with algorithms such as A* or Dijkstra. This is effectively applied in fields such as home service, commercial logistics and intelligent inspection, with the technical advantages and social value of intelligence, flexibility and efficiency.
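A compact Dijkstra sketch on a 2D occupancy grid derived from the semantic three-dimensional map is given below to illustrate the path-planning step; the grid encoding (0 free, 1 obstacle) is an assumption, and an A* variant would simply add a heuristic to the priority.

```python
import heapq

def dijkstra(grid, start, goal):
    """grid[r][c] == 0 means free, 1 means obstacle; returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            path = [cell]
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return path[::-1]
        if d > dist.get(cell, float("inf")):
            continue   # stale queue entry
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1.0
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None
```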
It should be understood that, although the steps in the flowchart of fig. 1 are shown in order as indicated by the arrow, the steps are not necessarily performed in order as indicated by the arrow. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include a plurality of sub-steps or sub-stages, which are not necessarily performed at the same time, but may be performed at different times, and the order in which the sub-steps or stages are performed is not necessarily sequential, but may be performed in turn or alternately with at least some of the other steps or sub-steps of other steps.
The embodiment of the present invention describes the method for reconstructing an indoor scene in detail, and the method disclosed in the present invention can be implemented by using various types of devices, so that the present invention also discloses an apparatus for reconstructing an indoor scene, and a specific embodiment is given below with reference to fig. 7.
The image acquisition module 501 is used for acquiring laser radar point cloud information and panoramic RGB images;
A panorama-to-hexahedral module 502, configured to pre-process the panoramic RGB image to obtain a hexahedral cube image;
the point cloud data color extraction module 503 is configured to combine the internal parameters and the external parameters of the panoramic camera, project the laser radar point cloud information to a hexahedral cube image, extract color information of each point in the point cloud information, and generate color point cloud data;
A depth parameter combining module 504, configured to generate an RGB-D data sequence according to the distance between each point in the color point cloud data and the panoramic camera;
the object segmentation module 505 is configured to input the hexahedral cube image to an object segmentation model, and divide an indoor scene into a plurality of independent object regions;
The semantic annotation module 506 is configured to input each independent object region into a vision-language model to obtain a semantic label of each independent object;
The data fusion module 507 is configured to perform projection alignment on each independent object containing the semantic tag and the RGB-D sequence, so as to obtain aligned point cloud data;
The scene reconstruction module 508 is configured to input the aligned point cloud data into the neural kernel surface reconstruction model to obtain the reconstructed indoor scene.
The device for reconstructing the indoor scene can be fully referred to the above limitation of the method, and will not be repeated here. Each of the modules in the above-described apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of the processor of the terminal device, or may be stored in software in the memory of the terminal device, so that the processor invokes and executes the operations corresponding to the above modules.
In one embodiment, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of indoor scene reconstruction described above.
The computer readable storage medium may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM (erasable programmable read-only memory), a hard disk, or a ROM. Optionally, the computer readable storage medium comprises a non-transitory computer readable medium (non-transitory computer-readable storage medium). The computer readable storage medium has storage space for program code to perform any of the method steps described above. These program code can be read from or written to one or more computer program products, which can be compressed in a suitable form.
In one embodiment, the present invention provides a computer device comprising a memory storing a computer program and a processor executing the method of indoor scene reconstruction described above.
The computer device includes a memory, a processor, and one or more computer programs, wherein the one or more computer programs may be stored in the memory and configured to be executed by the one or more processors, and one or more application programs configured to perform the method of indoor scene reconstruction described above.
The processor may include one or more processing cores. The processor uses various interfaces and lines to connect the various parts of the overall computer device, and performs the various functions of the computer device and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory and invoking data stored in the memory. Optionally, the processor may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA) and programmable logic array (Programmable Logic Array, PLA) form. The processor may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU) and a modem. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor and may instead be implemented by a separate communication chip.
The Memory may include random access Memory (Random Access Memory, RAM) or Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets. The memory may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The storage data area may also store data created by the terminal device in use, etc.
The foregoing embodiments are merely for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that modifications may be made to the technical solution described in the foregoing embodiments or equivalents may be substituted for parts of the technical features thereof, and that such modifications or substitutions do not depart from the spirit and scope of the technical solution of the embodiments of the present invention in essence.