[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
Tai Wang* Xiaohan Mao* Chenming Zhu* Runsen Xu Ruiyuan Lyu Peisen Li Xiao Chen
Wenwei Zhang Kai Chen Tianfan Xue Xihui Liu Cewu Lu Dahua Lin Jiangmiao Pang
Shanghai AI Laboratory Shanghai Jiao Tong University The University of Hong Kong
The Chinese University of Hong Kong Tsinghua University
🤖 Demo
- [2024-03] We first release the data and baselines for the challenge. Please fill in the form to apply for downloading the data and try our baselines. Any feedback is welcome!
- [2024-02] We will co-organize the Autonomous Grand Challenge in CVPR 2024. You are welcome to try the Multi-View 3D Visual Grounding track! We will release more details about the challenge with the baseline after the Chinese New Year.
- [2023-12] We release the paper of EmbodiedScan. Please check the webpage and view our demos!
We test our code under the following environment:
- Ubuntu 20.04
- NVIDIA Driver: 525.147.05
- CUDA 12.0
- Python 3.8.18
- PyTorch 1.11.0+cu113
- PyTorch3D 0.7.2
- Clone this repository.
```
git clone https://github.com/OpenRobotLab/EmbodiedScan.git
cd EmbodiedScan
```
- Create an environment and install PyTorch.
```
conda create -n embodiedscan python=3.8 -y  # pytorch3d needs python>3.7
conda activate embodiedscan
# Install PyTorch, for example, install PyTorch 1.11.0 for CUDA 11.3
# For more information, please refer to https://pytorch.org/get-started/locally/
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
```
- Install EmbodiedScan.
```
# We plan to make EmbodiedScan easier to install by "pip install EmbodiedScan".
# Please stay tuned for the future official release.
# Make sure you are under ./EmbodiedScan/
# This script will install the dependencies and the EmbodiedScan package automatically.
# Use [python install.py run] to install only the execution dependencies.
# Use [python install.py visual] to install only the visualization dependencies.
python install.py all  # install all the dependencies
```
Note: The automatic installation script runs each step as a subprocess, and the related messages are only printed when the subprocess finishes or is killed. Therefore, it is normal for the installation to appear to hang while heavier packages, such as MinkowskiEngine and PyTorch3D, are being built.
In our experience, these two packages are the most likely to cause installation problems. Feel free to post your questions or suggestions if you run into issues during the installation procedure.
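Since these two packages are the usual failure points, a quick import check is an easy way to confirm the environment before moving on. The snippet below is a minimal sketch that only assumes the dependencies listed above are installed:

```python
# Minimal sanity check for the most fragile dependencies.
# Assumes the environment above (e.g., PyTorch 1.11.0 + CUDA 11.3).
import torch
import MinkowskiEngine as ME
import pytorch3d

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("MinkowskiEngine:", ME.__version__)
print("PyTorch3D:", pytorch3d.__version__)
```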
Please refer to the guide for downloading and organizing the data.

We will update the authorization approach and release the remaining data afterward. Please stay tuned.
We provide a simple tutorial here as a guideline for basic analysis and visualization of our dataset. You are welcome to try it and post your suggestions!
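For a first look at the annotations without the dataset API, the sketch below loads a released annotation file with plain pickle and prints its top-level structure. The file path is an assumption for illustration; point it at the annotation file you actually downloaded.

```python
# A minimal sketch for inspecting an annotation file directly.
# The path below is a hypothetical example; replace it with your own.
import pickle

ann_file = "data/embodiedscan_infos_train.pkl"  # hypothetical path
with open(ann_file, "rb") as f:
    annotations = pickle.load(f)

# Print the top-level structure before diving into per-scene fields.
if isinstance(annotations, dict):
    print("top-level keys:", list(annotations.keys()))
else:
    print("number of entries:", len(annotations))
    print("first entry keys:", list(annotations[0].keys()))
```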
We provide a demo for running EmbodiedScan's model on a sample scan. Please download the raw data from Google Drive or BaiduYun and refer to the notebook for more details.
Embodied Perceptron accepts an RGB-D sequence with any number of views, along with texts, as multi-modal input. It uses classical encoders to extract features for each modality and adopts dense and isomorphic sparse fusion with corresponding decoders for different predictions. The 3D features integrated with the text feature can further be used for language-grounded understanding.

We provide configs for different tasks here, and you can run the train and test scripts in the tools folder for training and inference. For example, to train a multi-view 3D detection model with the pytorch launcher, just run:
```
python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet --launcher="pytorch"
```
Or, on a cluster with multiple machines, run the script with the slurm launcher following the sample script provided here.
NOTE: To run the multi-view 3D grounding experiments, please first download the 3D detection pretrained model to accelerate the training procedure. After downloading the detection checkpoint, please check that the path used in the config, for example, the `load_from` here, is correct.
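The pretrained detection checkpoint is hooked in through the standard `load_from` field at the top level of the config. The snippet below is only an illustrative sketch; the checkpoint path is an assumption and should match wherever you saved the downloaded model.

```python
# Illustrative sketch of the relevant line in the grounding config.
# The path is a hypothetical example; use the location of the checkpoint you downloaded.
load_from = "work_dirs/mv-3ddet/epoch_12.pth"
```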
To run inference and evaluate the model (e.g., the checkpoint work_dirs/mv-3ddet/epoch_12.pth), just run the test script:
```
python tools/test.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py work_dirs/mv-3ddet/epoch_12.pth --launcher="pytorch"
```
We preliminarily support format-only inference for multi-view 3D visual grounding. To achieve format-only inference during the test, just set `format_only=True` in `test_evaluator` in the corresponding config, like here. Then run the test script:
```
python tools/test.py configs/grounding/mv-grounding_8xb12_embodiedscan-vg-9dof.py work_dirs/mv-grounding/epoch_12.pth --launcher="pytorch"
```
The prediction file will be saved to `./test_results.json` in the current directory. You can also set `result_dir` in `test_evaluator` to specify the directory where the result file is saved.
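Put together, the evaluator settings for format-only inference look roughly like the sketch below. The evaluator type name is an assumption for illustration; keep whatever type the actual grounding config uses and only adjust the two fields discussed above.

```python
# Illustrative sketch of the format-only evaluator settings in the config.
# Only format_only and result_dir are the knobs discussed above; the other
# fields should stay as they are in the released config.
test_evaluator = dict(
    type="GroundingMetric",       # assumed name; keep the type from the actual config
    format_only=True,             # skip metric computation and only dump predictions
    result_dir="./test_results",  # optional: directory for the saved prediction file
)
```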
Finally, to pack the prediction file into the submission format, please modify the script tools/submit_results.py according to your team information and saving paths, and run:
```
python tools/submit_results.py
```
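For reference, the packing step essentially attaches your team metadata to the predictions and dumps them into a single file. The field names and output format in the sketch below are purely illustrative assumptions; the authoritative definitions are in tools/submit_results.py itself.

```python
# Purely illustrative sketch of what the packing step does; the real field
# names and output format are defined in tools/submit_results.py.
import json
import pickle

with open("./test_results.json", "r") as f:
    predictions = json.load(f)

submission = {
    "meta": {  # hypothetical metadata fields -- fill in your own team information
        "method": "my-mv-grounding-baseline",
        "team": "my-team",
        "email": "contact@example.com",
    },
    "results": predictions,
}

with open("./submission.pkl", "wb") as f:
    pickle.dump(submission, f)
```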
Then you can submit the resulting pkl file to the test server (to go live by the end of March) and wait for the lottery :)
We also provide a sample script tools/eval_script.py for evaluating the submission file. You can check it yourself to ensure your submitted file has the correct format.
We preliminarily provide several baseline results here with their logs and pretrained models.
Note that the performance differs a little from the results reported in the paper because we re-split the original training set into the released training and validation sets, while keeping the original validation set as the test set for the public benchmark.
Multi-view 3D detection:

| Method | Input | AP@0.25 | AR@0.25 | AP@0.5 | AR@0.5 | Download |
|---|---|---|---|---|---|---|
| Baseline | RGB-D | 15.22 | 52.23 | 8.13 | 26.66 | Model, Log |
Multi-view 3D visual grounding:

| Method | AP@0.25 | AP@0.5 | Download |
|---|---|---|---|
| Baseline-Mini | 33.59 | 14.40 | Model, Log |
| Baseline-Mini (w/ FCAF box coder) | - | - | - |
| Baseline-Full | - | - | - |
Please see the paper for more details on our two benchmarks: the fundamental 3D perception benchmark and the language-grounded benchmark. The dataset is still scaling up, and the benchmarks are being polished and extended. Please stay tuned for updates.
- Release the paper and partial codes for datasets.
- Release EmbodiedScan annotation files.
- Release partial codes for models and evaluation.
- Polish dataset APIs and related codes.
- Release Embodied Perceptron pretrained models.
- Release multi-modal datasets and codes.
- Release codes for baselines and benchmarks.
- Full release and further updates.
If you find our work helpful, please cite:
```
@inproceedings{wang2023embodiedscan,
  title={EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI},
  author={Wang, Tai and Mao, Xiaohan and Zhu, Chenming and Xu, Runsen and Lyu, Ruiyuan and Li, Peisen and Chen, Xiao and Zhang, Wenwei and Chen, Kai and Xue, Tianfan and Liu, Xihui and Lu, Cewu and Lin, Dahua and Pang, Jiangmiao},
  year={2024},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}
}
```
If you use our dataset and benchmark, please kindly cite the original datasets involved in our work. BibTex entries are provided below.
Dataset BibTex
```
@inproceedings{dai2017scannet,
  title={ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes},
  author={Dai, Angela and Chang, Angel X. and Savva, Manolis and Halber, Maciej and Funkhouser, Thomas and Nie{\ss}ner, Matthias},
  booktitle={Proceedings IEEE Computer Vision and Pattern Recognition (CVPR)},
  year={2017}
}

@inproceedings{Wald2019RIO,
  title={RIO: 3D Object Instance Re-Localization in Changing Indoor Environments},
  author={Wald, Johanna and Avetisyan, Armen and Navab, Nassir and Tombari, Federico and Niessner, Matthias},
  booktitle={Proceedings IEEE International Conference on Computer Vision (ICCV)},
  year={2019}
}

@article{Matterport3D,
  title={{Matterport3D}: Learning from {RGB-D} Data in Indoor Environments},
  author={Chang, Angel and Dai, Angela and Funkhouser, Thomas and Halber, Maciej and Niessner, Matthias and Savva, Manolis and Song, Shuran and Zeng, Andy and Zhang, Yinda},
  journal={International Conference on 3D Vision (3DV)},
  year={2017}
}
```
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- OpenMMLab: Our dataset code uses MMEngine and our model is built upon MMDetection3D.
- PyTorch3D: We use some functions supported in PyTorch3D for efficient computations on fundamental 3D data structures.
- ScanNet, 3RScan, Matterport3D: Our dataset uses the raw data from these datasets.
- ReferIt3D: We refer to SR3D's approach to obtaining the language prompt annotations.
- SUSTechPOINTS: Our annotation tool is developed based on the open-source framework used by SUSTechPOINTS.