SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

We present SoFar, the first 6-DoF system for spatial reasoning and robotic manipulation.

We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language.

Zekun Qi *, Wenyao Zhang *, Yufei Ding *, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang and Li Yi.

Quick-Start

Setup environment:

conda create -n sofar python=3.12 -y
conda activate sofar
git clone https://github.com/qizekun/SoFar.git
cd SoFar
pip install -e .
pip install -e segmentation/SAM

Download checkpoints:

mkdir checkpoints && cd checkpoints
# Florence-2
huggingface-cli download microsoft/Florence-2-base
# Segment Anything
wget -c https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
# PointSO
wget -c https://huggingface.co/qizekun/PointSO/resolve/main/small.pth
wget -c https://huggingface.co/qizekun/PointSO/resolve/main/base_finetune.pth

More detailed installation instructions can be found in INSTALL.md. Note that SoFar also supports inference on CPU-only devices, for example machines running macOS or Windows.
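As a quick sanity check after downloading, the following minimal sketch lists which of the checkpoint files named in the download commands are still missing. The helper `missing_checkpoints` and the constant `EXPECTED_CHECKPOINTS` are hypothetical names used for illustration and are not part of the repository; the sketch also assumes the files were saved directly into the `checkpoints` folder.

```python
from pathlib import Path

# File names taken from the wget commands in the Quick-Start section.
# Florence-2 is omitted: huggingface-cli stores it in the Hugging Face
# cache by default rather than in the checkpoints folder.
EXPECTED_CHECKPOINTS = ("sam_vit_h_4b8939.pth", "small.pth", "base_finetune.pth")


def missing_checkpoints(checkpoint_dir="checkpoints"):
    """Return the expected checkpoint files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All expected checkpoints found.")
```

Run it from the repository root so that the relative `checkpoints` path resolves correctly.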

SoFar

Our method builds on mature VLMs such as Qwen, ChatGPT, and Gemini. If you have an OpenAI API key, you can use the OpenAI service by exporting the key. Note that gemini-2.0-flash-exp is comparable to, or even better than, gpt-4o, especially on the Open6DOR task.

export OPENAI_API_KEY=your_openai_key

Qwen-VL-2.5 can already handle embodied brain tasks. If you do not have an OpenAI API key, you can achieve comparable performance by loading Qwen:

pip install qwen-vl-utils[decord]==0.0.8 triton
pip install flash-attn --no-build-isolation
python scripts/qwen_demo.py
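The choice between the two backends described above can be sketched as a small check on the environment. This is only an illustration: `choose_backend` and the labels `"openai"` and `"qwen"` are hypothetical and do not correspond to any function or option in the repository.

```python
import os


def choose_backend(env=None):
    """Pick 'openai' when OPENAI_API_KEY is set and non-empty, otherwise 'qwen'."""
    # Accept an explicit mapping so the logic can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    return "openai" if key else "qwen"


if __name__ == "__main__":
    print("Selected VLM backend:", choose_backend())
```

An empty or whitespace-only key is treated as missing, so the sketch falls back to the local Qwen path in that case.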

Demo

6-DoF Object Rearrangement Demo

python scripts/open6dor_demo.py

Object Manipulation Demo

python scripts/manipulation_demo.py

Spatial Visual Question Answering Demo

python scripts/vqa_demo.py

Evaluation

Object Manipulation on SimplerEnv

Google Robot Visual Matching

Method	Training Data	Pick Coke Can	Move Near	Open / Close Drawer	Average
Octo-Base	OXE	0.170	0.042	0.227	0.168
OpenVLA	OXE	0.163	0.462	0.356	0.277
RoboVLM	OXE	0.727	0.663	0.268	0.563
SpatialVLA	OXE	0.810	0.696	0.593	0.719
SoFar	-	0.923	0.917	0.403	0.749

Widow-X Visual Matching

Method	Training Data	Put Spoon on Towel	Put Carrot on Plate	Stack Green Block on Yellow Block	Put Eggplant in Yellow Basket	Average
Octo-Base	OXE	0.170	0.042	0.227	0.168	0.160
OpenVLA	OXE	0.000	0.000	0.000	0.041	0.010
RoboVLM	OXE	0.208	0.250	0.083	0.000	0.135
SpatialVLA	OXE	0.208	0.208	0.250	0.708	0.344
SoFar	-	0.583	0.667	0.708	0.375	0.583

We evaluate SoFar on two tracks in SimplerEnv, where it achieves the best average success rate on both. Because the environment is configured independently, we provide detailed evaluation code in SimplerEnv-SOFAR.

6-DoF Object Rearrangement on Open6DOR V2

Method	Position Track (Level 0)	Position Track (Level 1)	Rotation Track (Level 0)	Rotation Track (Level 1)	Rotation Track (Level 2)	6-DoF Track (Overall)
Dream2Real	17.2	11.0	37.3	27.6	26.2	13.5
VoxPoser	35.6	21.7	-	-	-	-
Open6DOR-GPT	78.6	60.3	45.7	32.5	49.8	35.6
SoFar-LLaVA	86.3	57.9	62.5	30.2	67.1	40.3
SoFar	96.0	81.5	68.6	42.2	70.1	48.7

Download the refined dataset following DATASET.md.

# Predict on Open6DOR dataset
python open6dor/open6dor_perception.py
# Evaluate the metrics
python open6dor/eval_open6dor.py

Note that Open6DOR uses the observer's perspective, which means it is oriented relative to the robotic arm. This implies that the X-axis and Y-axis of the observer coordinate system are opposite to those of the robotic arm's base coordinate system. This is reflected in the system prompt: in the observer coordinate system, the Y-axis extends from left to right, and the X-axis extends from far to near.
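The axis flip described above can be sketched as a simple sign change on the X and Y components. This is a minimal illustration, assuming that the two frames share the same origin and Z-axis (the text only states that X and Y are opposite); the function names are hypothetical and not taken from the repository.

```python
def observer_to_base(vec):
    """Convert a 3D vector from the observer frame to the robot base frame.

    Assumption: both frames share the same origin and Z-axis, so only the
    X and Y components change sign.
    """
    x, y, z = vec
    return (-x, -y, z)


def base_to_observer(vec):
    """Inverse conversion; negating X and Y twice restores the input."""
    return observer_to_base(vec)


if __name__ == "__main__":
    # A direction pointing "near" and "right" for the observer
    # (+X is far-to-near, +Y is left-to-right in the observer frame).
    print(observer_to_base((1.0, 1.0, 0.0)))
```

The same sign change applies to direction vectors such as a predicted semantic orientation, since a pure axis flip involves no translation.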

Additionally, for the Open6DOR task, we recommend using small_finetune.pth as the orientation model in pointso.py to achieve better performance.

The Open6DOR V2 execution environment and evaluation are available at Open6DOR-Libero; see its README for further instructions.

6-DoF Spatial VQA on 6-DoF SpatialBench

Method	Position (rel.)	Position (abs.)	Orientation (rel.)	Orientation (abs.)	Total
GPT-4o	49.4	28.4	44.2	25.8	36.2
SpaceLLaVA	32.4	30.5	30.9	24.9	28.2
SpatialBot	50.9	21.6	39.6	22.9	33.7
RoboPoint	43.8	30.8	33.8	25.8	33.5
SoFar	59.6	33.8	54.6	31.3	43.9

Download the refined dataset following DATASET.md.

python spatialbench/eval_spatialbench.py

PointSO

The pipeline of PointSO is as follows:

We build OrienText300K, a high-quality, standardized, upright 3D asset dataset, through filtering and automatic annotation, and produce the corresponding semantic orientations.
We train PointSO on 3D assets augmented with random rotation, single-view interference, and Gaussian noise (set in config.yaml).
We run inference with PointSO on real-world data, where most point clouds are partial and in free orientations.
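The three augmentations named in the training step can be sketched in plain Python as follows. This is an illustrative approximation and not the repository's implementation: the function names, the default `noise_std`, and the crude half-space visibility test standing in for single-view interference are all assumptions. The point the sketch makes is that the orientation label must be rotated together with the points.

```python
import math
import random


def random_rotation(rng):
    """Return a 3x3 rotation matrix built from a uniformly sampled unit quaternion."""
    # Normalising four Gaussian samples yields a uniformly distributed unit quaternion.
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate(rot, vec):
    """Apply a 3x3 rotation matrix to a 3D vector."""
    return [sum(rot[i][j] * vec[j] for j in range(3)) for i in range(3)]


def augment(points, direction, noise_std=0.01, seed=None):
    """Rotate, crop to a pseudo single view, and jitter a point cloud.

    Returns the augmented points and the rotated orientation label.
    """
    rng = random.Random(seed)

    # 1) Random rotation, applied to both the points and the label.
    rot = random_rotation(rng)
    rotated = [rotate(rot, p) for p in points]
    new_direction = rotate(rot, direction)

    # 2) Pseudo single view (assumption): keep only points on the side of
    #    the centroid that faces a randomly chosen viewing direction.
    view = rotate(random_rotation(rng), [0.0, 0.0, 1.0])
    count = len(rotated)
    centroid = [sum(p[i] for p in rotated) / count for i in range(3)]
    visible = [
        p for p in rotated
        if sum((p[i] - centroid[i]) * view[i] for i in range(3)) >= 0.0
    ]

    # 3) Independent Gaussian noise on every coordinate.
    noisy = [[c + rng.gauss(0.0, noise_std) for c in p] for p in visible]
    return noisy, new_direction
```

A real single-view simulation would typically use hidden-point removal or depth rendering; the half-space cut here only mimics the resulting partial point cloud.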

The released weights are on Hugging Face (PointSO), and the code is in the orientation folder.

Pretrain

Download the Point-MAE pretrained weights as initialization.

wget https://github.com/Pang-Yatian/Point-MAE/releases/download/main/pretrain.pth -P orientation/

Prepare the OrienText300K dataset following DATASET.md.

cd orientation
sh train_ddp.sh

Finetune

Prepare the Open6DOR finetuning dataset following DATASET.md. The dataset is generated with Isaac Sim using different assets from Open6DOR. Finetuning PointSO significantly improves performance on the Open6DOR rotation track & 6-DoF track. We recommend using the finetuned version of PointSO for the Open6DOR V2 evaluation.

cd orientation
sh train_ddp_ft.sh

Datasets & Benchmarks

OrienText300K

We built the OrienText300K dataset by rendering multi-view images of Objaverse objects and annotating them with ChatGPT. It includes the filtered Objaverse 1.0 assets, 350K orientation-text pairs, and 8M multi-view images. The complete multi-view data will be uploaded.

In addition, if your work requires filtering 3D data, the attributes.zip we use to filter OrienText300K may be helpful. We provide multi-view annotations for each object in Objaverse across multiple dimensions, which can be used to remove low-quality, meaningless, or noisy 3D assets and assets containing useless data.

OrienText300K samples, containing various objects and natural text for interaction.

The data is open-sourced on Hugging Face: OrienText300K.

We also provide the code for rendering multi-views with Blender (version: 4.2.0) in render_views.py, so that you can reproduce our renders or apply it to your own dataset. This rendering code has undergone extensive debugging and testing. If it is useful to you, we would appreciate it if you cite our paper.

Open6DOR V2

A challenging and comprehensive benchmark for open-instruction 6-DoF object rearrangement tasks.

We removed the erroneous data from Open6DOR V1 and eliminated parts that required manual judgment to facilitate replication. Open6DOR V2 contains ~4500 tasks for 6-DoF object rearrangement & spatial relationship evaluation.

The data is open-sourced on Hugging Face: Open6DOR V2.

6-DoF SpatialBench

Previous spatial perception LLMs mainly focused on positional relationships, such as left-right, near-far, size, and counting. In actual object manipulation, the orientation of the object is also a very important factor. Therefore, we propose a new 6-DoF spatial perception benchmark for evaluating a model's reasoning capabilities in position, orientation, and position-orientation relationships. We evaluated existing spatial perception models on this benchmark.

The data is open-sourced on Hugging Face: 6-DoF SpatialBench.

TODO

Release the evaluation code for Simpler-Env for Google Robot & Widow-X.
Release the inference code with Qwen-VL-2.5.
Add inference support for CPU devices, such as macOS.
Release the evaluation code for Open6DOR-Libero.
Release the improved version of OrienText300K.
Release gradio demo for SoFar & PointSO.
Release the Objaverse-XL version dataset & PointSO.

Contact

If you have any questions related to the code or the paper, feel free to email Zekun (qizekun@gmail.com).

Acknowledgements

Citation

If you find SoFar, PointSO, OrienText300K, Open6DOR V2 or 6-DoF SpatialBench helpful for your research, please consider citing the following BibTeX entry.

@article{qi2025sofar,
  author     = {Qi, Zekun and Zhang, Wenyao and Ding, Yufei and Dong, Runpei and Yu, Xinqiang and Li, Jingwen and Xu, Lingyun and Li, Baoyu and He, Xialin and Fan, Guofan and Zhang, Jiazhao and He, Jiawei and Gu, Jiayuan and Jin, Xin and Ma, Kaisheng and Zhang, Zhizheng and Wang, He and Yi, Li},
  title      = {SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation},
  journal    = {CoRR},
  volume     = {abs/2502.13143},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2502.13143},
  doi        = {10.48550/ARXIV.2502.13143},
  eprinttype = {arXiv},
  eprint     = {2502.13143}
}

Project page: qizekun.github.io/sofar/
