
[IROS 2025] Human Demo Videos to Robot Action Plans



SeeDo: Human Demo Video to Robot Action Plan via Vision Language Model

VLM See, Robot Do (SeeDo) is a method that uses large vision models, tracking models, and vision-language models to extract robot action plans from human demonstration videos, focusing on long-horizon pick-and-place tasks. The action plan is then executed in real-world and PyBullet simulation environments.

News

[2025/06] SeeDo has been accepted to IROS 2025! We will update the camera-ready version soon.

Setup Instructions

Note that SeeDo relies on GroundingDINO, SAM, and SAM2. The code has only been tested on Ubuntu 20.04 with CUDA 11.8 and PyTorch 2.3.1+cu118.

Install SeeDo and create a new environment

    git clone https://github.com/ai4ce/SeeDo
    conda create --name seedo python=3.10.14
    conda activate seedo
    cd SeeDo
    pip install -r requirements.txt

Install PyTorch (only for CUDA 11.8 users)

    pip install torch==2.3.1+cu118 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Install GroundingDINO, SAM and SAM2 in the same environment

    git clone https://github.com/IDEA-Research/GroundingDINO
    git clone https://github.com/facebookresearch/segment-anything.git
    git clone https://github.com/facebookresearch/segment-anything-2.git

Make sure these models are installed as editable packages

    cd GroundingDINO
    pip install -e .

Do the same for segment-anything and segment-anything-2.

We have slightly modified GroundingDINO

In GroundingDINO/groundingdino/util/inference.py, we add a function to help run inference on an array of images. Please paste the following function into inference.py.

    def load_image_from_array(image_array: np.array) -> Tuple[np.array, torch.Tensor]:
        transform = T.Compose(
            [
                T.RandomResize([800], max_size=1333),
                T.ToTensor(),
                T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
            ]
        )
        image_source = Image.fromarray(image_array)
        image_transformed, _ = transform(image_source, None)
        return image_array, image_transformed

The code still uses one checkpoint from segment-anything.

Make sure you download it into the SeeDo folder. The checkpoint is the one listed as "default or vit_h: ViT-H SAM model" in the segment-anything repository.

Obtain an OpenAI API key and create a key.py file under VLM_CaP/src

    cd VLM_CaP/src
    touch key.py
    echo 'projectkey = "YOUR_OPENAI_API_KEY"' > key.py
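Before running the pipeline, it can help to confirm that key.py defines projectkey and no longer holds the placeholder. The following is a minimal sketch, not part of SeeDo: load_project_key is a hypothetical helper, and how the repository's own scripts read the key is not shown here.

```python
import runpy
from pathlib import Path

PLACEHOLDER = "YOUR_OPENAI_API_KEY"


def load_project_key(key_file):
    """Read `projectkey` from a key.py-style file (hypothetical helper)."""
    path = Path(key_file)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; create it first")
    # run_path executes the file and returns its global names as a dict.
    namespace = runpy.run_path(str(path))
    key = namespace.get("projectkey")
    if not isinstance(key, str) or not key.strip():
        raise ValueError("key.py must define a non-empty string named projectkey")
    if key == PLACEHOLDER:
        raise ValueError("projectkey still holds the placeholder value")
    return key
```

For example, load_project_key("VLM_CaP/src/key.py") returns the key string or raises an error that names the problem.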

Pipeline

SeeDo has four main parts. To ensure the video is processed successfully in subsequent steps, use convert_video.py to convert the video to the appropriate encoding before inputting it. The convert_video.py script accepts two parameters: --input and --output, which specify the path of your original video and the path of the converted video, respectively.
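The conversion step can also be driven from Python. This is a minimal sketch that only assembles the command line from the two documented flags, --input and --output; the helper names and the assumption that convert_video.py sits in the current working directory are illustrative.

```python
import subprocess
import sys


def build_convert_command(input_path, output_path, script="convert_video.py"):
    """Return the argument list for convert_video.py (flags taken from the README)."""
    return [
        sys.executable, script,
        "--input", str(input_path),
        "--output", str(output_path),
    ]


def convert_video(input_path, output_path):
    """Run the conversion and raise CalledProcessError if the script fails."""
    subprocess.run(build_convert_command(input_path, output_path), check=True)
```

Calling convert_video("raw.mp4", "converted.mp4") from the SeeDo folder would then run the script with the same interpreter as the calling process.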

Keyframe Selection Module
get_frame_by_hands.py: The get_frame_by_hands.py script selects key frames by tracking hand movements. It accepts two parameters.
--video_path, which specifies the path of the input video.
--output_dir, which designates the directory where the key frames will be saved. If output_dir is not specified, the key frames will be saved to ./output by default. For debugging purposes, the hand image and hand speed plot will also be saved in this directory.
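To illustrate how hand speed can point to key frames, here is a minimal sketch. It assumes one (x, y) hand center per frame and treats slow local minima of speed as candidate key frames. That selection rule, the function names, and the max_speed threshold are assumptions for illustration; the rule actually used by get_frame_by_hands.py is not described here.

```python
import math


def hand_speeds(centers):
    """Per-frame hand speed in pixels per frame; the first frame gets 0.0."""
    speeds = [0.0]
    for (x0, y0), (x1, y1) in zip(centers, centers[1:]):
        speeds.append(math.hypot(x1 - x0, y1 - y0))
    return speeds


def slow_local_minima(speeds, max_speed):
    """Indices of interior frames where speed is a local minimum at or below max_speed."""
    keyframes = []
    for i in range(1, len(speeds) - 1):
        is_minimum = speeds[i] <= speeds[i - 1] and speeds[i] < speeds[i + 1]
        if is_minimum and speeds[i] <= max_speed:
            keyframes.append(i)
    return keyframes


# Toy trajectory: the hand nearly stops at frame 3.
centers = [(0, 0), (10, 0), (20, 0), (21, 0), (30, 0), (40, 0)]
candidate_frames = slow_local_minima(hand_speeds(centers), max_speed=2.0)
```

In this toy trajectory the only candidate is the frame where the hand pauses, which is the kind of moment a grasp or release would produce.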
Visual Perception Module
track_objects.py: The track_objects.py script tracks each object and adds a visual prompt for the objects. It also returns a string containing the center coordinates of each object in the key frames. The script accepts three parameters.
--input is the video converted to the appropriate format.
--output specifies the output path for the video with the visual prompts.
--key_frames is the list of key frame indices obtained from get_frame_by_hands.py.
This module returns a box_list string, which is stored for use in the VLM Reasoning Module.
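The center coordinates mentioned above can be derived from bounding boxes. A minimal sketch, assuming boxes are given as (x_min, y_min, x_max, y_max) in pixels; the box format, the function names, and the dictionary layout are assumptions, and the exact format of the box_list string is not specified here.

```python
def box_center(box):
    """Center (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def centers_by_object(boxes):
    """Map each object name to the center of its bounding box."""
    return {name: box_center(box) for name, box in boxes.items()}


# Hypothetical boxes for one key frame.
frame_boxes = {"chili": (10, 20, 30, 60), "bowl": (100, 100, 200, 180)}
frame_centers = centers_by_object(frame_boxes)
```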
VLM Reasoning Module
vlm.py: The vlm.py script performs reasoning on the key frames and generates an action list for the video. It accepts three parameters.
--input is the video with visual prompts added by the Visual Perception Module.
--list is the key frame index list obtained from the Keyframe Selection Module.
--bbx_list is the box_list string obtained from the Visual Perception Module.
This module returns two strings: obj_list, representing the objects in the environment, and action_list, representing the actions performed on these objects.
Robot Manipulation Module
simulation.py: The simulation.py script accepts three parameters: obj_list, action_list, and output. It first initializes a random simulation scene based on the obj_list, then executes pick-and-place tasks according to the action_list, and finally writes the video to output.
Example usage: python simulation.py --action_list "put chili on bowl and then put eggplant on glass" --obj_list chili carrot eggplant bowl glass --output demo2.mp4
Note that this part uses a modified version of the Code as Policies framework, and its successful execution depends heavily on whether the objects are already modeled and whether the corresponding execution functions for actions are present in the prompt. We provide a series of new object models and prompts that are compatible with our defined action list. If you want to operate on unseen objects, you will need to provide the corresponding object models and modify the LMP and prompt file accordingly.
We provide some simple object models of vegetables on Hugging Face. Download them from https://huggingface.co/datasets/ai4ce/SeeDo/tree/main/SeeDo
There will be an assets.zip file. Extract that file into assets and make sure this folder is under the path of VLM_CaP. VLM_CaP/assets will then be used by simulation.py for simulation.
The script writes out a video of the robot performing a series of pick-and-place tasks in simulation.
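The example usage above can be assembled programmatically once obj_list and action_list are available. This is a minimal sketch: build_simulation_command is a hypothetical helper that mirrors the flags in the example command and does not call simulation.py itself.

```python
import shlex


def build_simulation_command(action_list, obj_list, output, script="simulation.py"):
    """Return the argument list matching the README's example usage."""
    return [
        "python", script,
        "--action_list", action_list,  # one string describing the whole plan
        "--obj_list", *obj_list,       # one argument per object name
        "--output", output,
    ]


command = build_simulation_command(
    action_list="put chili on bowl and then put eggplant on glass",
    obj_list=["chili", "carrot", "eggplant", "bowl", "glass"],
    output="demo2.mp4",
)
# shlex.join quotes the multi-word action list so the line can be pasted into a shell.
printable = shlex.join(command)
```

Passing the list to subprocess.run avoids shell quoting issues with the multi-word action list, while the joined string is convenient for logging.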

Project page: ai4ce.github.io/SeeDo
