Open3DA/LL3DA
[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.
💻 Project Page • 📄 arXiv Paper • 🎞 YouTube • 🤗 HuggingFace Demo (WIP) • Citation
LL3DA is a Large Language 3D Assistant that responds to both visual and textual interactions within complex 3D environments.
Recent advances in Large Multimodal Models (LMMs) have enabled various applications in human-machine interaction. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains challenging, especially given the demand for understanding permutation-invariant point cloud representations of the 3D scene. Existing works seek help from multi-view images and project 2D features into 3D space as 3D scene representations, which leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts. This helps LMMs better comprehend human interactions and further helps remove ambiguities in cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.
- 2024-03-04. 💥 The code is fully released! Now you can train your customized models!
- 2024-02-27. 🎉 LL3DA is accepted by CVPR 2024! See you in Seattle!
- 2023-11-30. 📣 Uploaded the paper and initialized the project.
TODO:
- Upload our paper to arXiv and build project pages.
- Pray for acceptance.
- Upload all the code and training scripts.
- Release pre-trained weights. (see checkpoint)
- Add local demo interface.
- Train on larger 3D VL benchmarks and scale up models.
Environment Setup
Step 1. Build Dependencies. Our code is tested with CUDA 11.6 and Python 3.8.16. To run the code, first install the following packages:
h5py scipy cython plyfile 'trimesh>=2.35.39,<2.35.40' 'networkx>=2.2,<2.3' 'torch==1.13.1+cu116' 'transformers>=4.37.0'
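Before building the extensions below, it can help to confirm that the Python-level dependencies actually resolved. This small check is a sketch (not part of the repo) that reports any packages that cannot be imported:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Importable module names for the pinned dependencies above.
required = ["h5py", "scipy", "Cython", "plyfile", "trimesh",
            "networkx", "torch", "transformers"]
print(missing_packages(required))  # an empty list means everything is importable
```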
After that, build the pointnet2 and accelerated giou from source:
cd third_party/pointnet2
python setup.py install
cd utils
python cython_compile.py build_ext --inplace
Step 2. Download pre-trained embeddings. Download the pre-processed BERT embedding weights from huggingface and store them under the ./bert-base-embedding folder. The weights are the same as those of the official BERT model; we only modified the names of certain parameters.
Data Preparation
Our repo requires the 3D data from ScanNet, the natural language annotations, and the pre-trained LLM weights.
Step 1. Download and Prepare the ScanNet 3D Data.
Updates 2024-07-01: You can download the pre-processed data from here.
- Follow the instructions here and download the ScanNetV2 dataset.
- Change SCANNET_DIR to the scans folder in data/scannet/batch_load_scannet_data.py, and run the following commands.
cd data/scannet/
python batch_load_scannet_data.py
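After preprocessing, each scene is stored as a numpy array of points. Purely as an illustration (the actual file names and array layout come from batch_load_scannet_data.py and should be checked there), loading a hypothetical (N, 6) XYZRGB array and normalizing the colors might look like:

```python
import numpy as np

def load_scene_points(path):
    # Hypothetical layout: (N, 6) array of XYZ coordinates plus RGB in [0, 255].
    pts = np.load(path)
    xyz, rgb = pts[:, :3], pts[:, 3:6] / 255.0  # scale colors into [0, 1]
    return xyz, rgb

# Demo with synthetic data in place of a real preprocessed scene file:
fake = np.concatenate(
    [np.random.randn(100, 3), np.random.randint(0, 256, (100, 3))], axis=1)
np.save("fake_scene.npy", fake)
xyz, rgb = load_scene_points("fake_scene.npy")
print(xyz.shape, rgb.shape)  # (100, 3) (100, 3)
```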
Step 2. Prepare Language Annotations
To train the model, you are required to prepare language annotations from ScanRefer, Nr3D, ScanQA, and the ScanNet part of 3D-LLM.

- ScanRefer. Follow the commands here to download the ScanRefer dataset.
- Nr3D. Follow the commands here to download the Nr3D dataset, and pre-process it.
- ScanQA. Follow the commands here to download the ScanQA dataset.
- 3D-LLM. The data are located here. We have also shared our pre-processing scripts here.
We will update the latest released data (V3) from 3D-LLM.
Finally, organize the files into the following folders:
./data/
  ScanRefer/
    ScanRefer_filtered_train.json
    ScanRefer_filtered_train.txt
    ScanRefer_filtered_val.json
    ScanRefer_filtered_val.txt
  Nr3D/
    nr3d_train.json
    nr3d_train.txt
    nr3d_val.json
    nr3d_val.txt
  ScanQA/
    ScanQA_v1.0_test_w_obj.json
    ScanQA_v1.0_test_wo_obj.json
    ScanQA_v1.0_train.json
    ScanQA_v1.0_val.json
  3D_LLM/
    3d_llm_embodied_dialogue_filtered_train.json
    3d_llm_embodied_dialogue_filtered_val.json
    3d_llm_embodied_planning_filtered_train.json
    3d_llm_embodied_planning_filtered_val.json
    3d_llm_scene_description_train.json
    3d_llm_scene_description_val.json
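To catch missing downloads early, the layout above can be verified with a small sanity-check sketch (not part of the repo) that lists any absent annotation files:

```python
import os

# Expected annotation layout, transcribed from the folder listing above.
EXPECTED = {
    "ScanRefer": ["ScanRefer_filtered_train.json", "ScanRefer_filtered_train.txt",
                  "ScanRefer_filtered_val.json", "ScanRefer_filtered_val.txt"],
    "Nr3D": ["nr3d_train.json", "nr3d_train.txt",
             "nr3d_val.json", "nr3d_val.txt"],
    "ScanQA": ["ScanQA_v1.0_test_w_obj.json", "ScanQA_v1.0_test_wo_obj.json",
               "ScanQA_v1.0_train.json", "ScanQA_v1.0_val.json"],
    "3D_LLM": ["3d_llm_embodied_dialogue_filtered_train.json",
               "3d_llm_embodied_dialogue_filtered_val.json",
               "3d_llm_embodied_planning_filtered_train.json",
               "3d_llm_embodied_planning_filtered_val.json",
               "3d_llm_scene_description_train.json",
               "3d_llm_scene_description_val.json"],
}

def missing_annotations(root="./data"):
    """Return the expected annotation paths that do not exist under `root`."""
    return [os.path.join(root, d, f)
            for d, files in EXPECTED.items()
            for f in files
            if not os.path.isfile(os.path.join(root, d, f))]
```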
Step 3. [Optional] Download Pre-trained LLM weights. If your server has no trouble auto-downloading weights from huggingface🤗, feel free to skip this step.
Download files from the opt-1.3b checkpoint (or any other decoder-only LLM) at huggingface, and store them under the ./facebook/opt-1.3b directory. Make sure the required files are downloaded:
./facebook/opt-1.3b/
  config.json
  merges.txt
  pytorch_model.bin
  special_tokens_map.json
  tokenizer_config.json
  vocab.json
Updates 2024-07-01: The released version is slightly different from our paper implementation. In our released version, we standardized the data format and dropped duplicated text annotations. To reproduce our reported results, please use the scripts provided in scripts-v0 to produce the generalist weights.
bash scripts-v0/opt-1.3b/train.generalist.sh
Our code should support any decoder-only LLM (facebook/opt-1.3b, gpt2-xl, meta-llama/Llama-2-7b, or even the LATEST Qwen/Qwen1.5-1.8B and Qwen/Qwen1.5-4B). Check out the following table for recommended LLMs at different scales! By default, the models are trained with eight GPUs.
<1B | 1B-4B | ~7B |
---|---|---|
gpt2 (124m) | TinyLlama-1.1B (1.1b) | facebook/opt-6.7b (6.7b) |
facebook/opt-125m (125m) | facebook/opt-1.3b (1.3b) | meta-llama/Llama-2-7b-hf (6.7b) |
gpt2-medium (355m) | gpt2-xl (1.6b) | Qwen/Qwen1.5-7B (7.7b) |
Qwen/Qwen1.5-0.5B (620m) | Qwen/Qwen1.5-1.8B (1.8b) | - |
gpt2-large (774m) | facebook/opt-2.7b (2.7b) | - |
- | microsoft/phi-2 (2.8b) | - |
- | Qwen/Qwen1.5-4B (3.9b) | - |
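The sizes in the table can be roughly sanity-checked with the standard back-of-the-envelope estimate for decoder-only transformers: about 12·L·d² non-embedding parameters plus a V·d token embedding (a sketch only; exact counts vary with tied embeddings, biases, and head dimensions). For facebook/opt-1.3b, assuming its published configuration of 24 layers, hidden size 2048, and a vocabulary of 50272:

```python
def approx_decoder_params(n_layers, d_model, vocab_size):
    """Rough decoder-only transformer size: 12*L*d^2 core weights + V*d embeddings."""
    return 12 * n_layers * d_model**2 + vocab_size * d_model

# facebook/opt-1.3b configuration (24 layers, d_model=2048, vocab 50272).
estimate = approx_decoder_params(24, 2048, 50272)
print(f"{estimate / 1e9:.2f}B")  # close to the advertised 1.3b
```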
We provide training scripts in the scripts folder with different LLM backends. Feel free to modify the hyperparameters in those commands.
For other LLM backends, please modify the commands manually by changing --vocab to other LLMs.
Training
To train the model as a 3D generalist: (We have also uploaded the pre-trained weights to huggingface.)
bash scripts/opt-1.3b/train.generalist.sh
After the model is trained, you can tune the model on ScanQA for 3D Question Answering:
bash scripts/opt-1.3b/tuning.scanqa.sh
And, on ScanRefer / Nr3D for 3D Dense Captioning:
bash scripts/opt-1.3b/tuning.scanrefer.sh
bash scripts/opt-1.3b/tuning.nr3d.sh
You can also tune the model to predict bounding boxes for open vocabulary object detection!
bash scripts/opt-1.3b/tuning.ovdet.sh
Evaluation
To evaluate the model as a 3D generalist:
bash scripts/opt-1.3b/eval.generalist.sh
On ScanQA for 3D Question Answering:
bash scripts/opt-1.3b/eval.scanqa.sh
And, on ScanRefer / Nr3D for 3D Dense Captioning:
bash scripts/opt-1.3b/eval.scanrefer.sh
bash scripts/opt-1.3b/eval.nr3d.sh
If you find our code or paper helpful, please consider starring ⭐ us and citing:
@misc{chen2023ll3da,
  title={LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning},
  author={Sijin Chen and Xin Chen and Chi Zhang and Mingsheng Li and Gang Yu and Hao Fei and Hongyuan Zhu and Jiayuan Fan and Tao Chen},
  year={2023},
  eprint={2311.18651},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Thanks to Vote2Cap-DETR, 3D-LLM, Scan2Cap, and 3DETR. We borrow some of their codes and data.
This code is distributed under an MIT LICENSE. If there are any problems regarding our paper and code, feel free to open an issue!