TaskMatrix connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting.

See our paper: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Now TaskMatrix supports GroundingDINO and segment-anything! Thanks @jordddan for his efforts. For the image editing case, GroundingDINO is first used to locate bounding boxes guided by the given text, then segment-anything is used to generate the related mask, and finally stable diffusion inpainting is used to edit the image based on the mask (a standalone sketch of this flow follows the list below).

- Firstly, run `python visual_chatgpt.py --load "Text2Box_cuda:0,Segmenting_cuda:0,Inpainting_cuda:0,ImageCaptioning_cuda:0"`
- Then, say `find xxx in the image` or `segment xxx in the image`, where `xxx` is an object. TaskMatrix will return the detection or segmentation result!
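For readers curious how the three models chain together, here is a minimal standalone sketch of the same detect-then-segment-then-inpaint flow. This is not TaskMatrix's internal code: the checkpoint paths, image path, and prompts are placeholders, and it assumes the usual GroundingDINO, segment-anything, and diffusers inference APIs.

```python
# Sketch of the detect -> segment -> inpaint flow; paths/prompts are placeholders.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import SamPredictor, sam_model_registry

# 1) GroundingDINO: locate bounding boxes guided by the query text.
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image_tensor = load_image("input.jpg")  # np RGB array, model tensor
boxes, logits, phrases = predict(
    model=dino, image=image_tensor, caption="dog",
    box_threshold=0.35, text_threshold=0.25,
)

# GroundingDINO returns normalized cxcywh boxes; convert the first match
# (assumes at least one) to absolute xyxy pixel coordinates for SAM.
h, w = image_source.shape[:2]
cx, cy, bw, bh = (boxes[0] * torch.tensor([w, h, w, h])).tolist()
box_xyxy = np.array([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])

# 2) segment-anything: turn the box into a pixel mask.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)
masks, _, _ = predictor.predict(box=box_xyxy, multimask_output=False)

# 3) Stable diffusion inpainting: repaint the masked region from a prompt.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
)
edited = pipe(
    prompt="a cat",
    image=Image.fromarray(image_source).resize((512, 512)),
    mask_image=Image.fromarray(masks[0].astype(np.uint8) * 255).resize((512, 512)),
).images[0]
edited.save("edited.png")
```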
Now TaskMatrix supports Chinese! Thanks to @Wang-Xiaodong1899 for his efforts.
We propose the template idea in TaskMatrix!

- A template is a pre-defined execution flow that assists ChatGPT in assembling complex tasks involving multiple foundation models.
- A template contains the experiential solution to complex tasks as determined by humans.
- A template can invoke multiple foundation models or even establish a new ChatGPT session.
- To define a template, simply add a class with the attribute `template_model = True` (see the sketch after this section).
Thanks to @ShengmingYin and @thebestannie for providing a template example in the `InfinityOutPainting` class (see the following gif).

- Firstly, run `python visual_chatgpt.py --load "Inpainting_cuda:0,ImageCaptioning_cuda:0,VisualQuestionAnswering_cuda:0"`
- Secondly, say `extend the image to 2048x1024` to TaskMatrix!
- By simply creating an `InfinityOutPainting` template, TaskMatrix can seamlessly extend images to any size through collaboration with the existing `ImageCaptioning`, `Inpainting`, and `VisualQuestionAnswering` foundation models, without the need for additional training.
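As an illustration of the template pattern, here is a hedged sketch of what an `InfinityOutPainting`-style template class can look like. The constructor signature, attribute names, and the `inference` flow are assumptions based on the description above, not a copy of the repository's code; the key points are the `template_model = True` attribute and that the template receives already-loaded foundation models by name.

```python
class InfinityOutPainting:
    # template_model = True marks this class as a template: the loader reuses
    # foundation models that are already instantiated instead of loading new
    # weights for this class. (Sketch only; the real class lives in
    # visual_chatgpt.py and its details may differ.)
    template_model = True

    def __init__(self, ImageCaptioning, Inpainting, VisualQuestionAnswering):
        # The loader injects the already-loaded models whose names appear in
        # the constructor signature, so the template composes them without
        # any additional training.
        self.caption = ImageCaptioning
        self.inpaint = Inpainting
        self.vqa = VisualQuestionAnswering

    def inference(self, inputs):
        # Hypothetical flow: parse "image_path, WIDTHxHEIGHT", caption the
        # image, query VQA for style details, then inpaint outward step by
        # step until the requested resolution is reached.
        image_path, resolution = inputs.split(",")
        width, height = map(int, resolution.strip().split("x"))
        ...
```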
TaskMatrix needs the effort of the community! We welcome your contributions to add new and interesting features!
On the one hand, ChatGPT (or LLMs) serves as a general interface that provides a broad and diverse understanding of a wide range of topics. On the other hand, Foundation Models serve as domain experts by providing deep knowledge in specific domains. By leveraging both general and deep knowledge, we aim to build an AI that is capable of handling various tasks.
```bash
# clone the repo
git clone https://github.com/microsoft/TaskMatrix.git

# go to the directory
cd TaskMatrix

# create a new environment
conda create -n visgpt python=3.8

# activate the new environment
conda activate visgpt

# prepare the basic environments
pip install -r requirements.txt
pip install git+https://github.com/IDEA-Research/GroundingDINO.git
pip install git+https://github.com/facebookresearch/segment-anything.git

# prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}

# prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}

# Start TaskMatrix!
# You can specify the GPU/CPU assignment with "--load"; the parameter indicates which
# Visual Foundation Models to use and where each will be loaded.
# The model and device are separated by an underscore '_', and different models are
# separated by a comma ','.
# The available Visual Foundation Models are listed in the table below.
# For example, to load ImageCaptioning on the CPU and Text2Image on cuda:0, use:
# "ImageCaptioning_cpu,Text2Image_cuda:0"

# Advice for CPU users
python visual_chatgpt.py --load ImageCaptioning_cpu,Text2Image_cpu

# Advice for one Tesla T4 15GB (Google Colab)
python visual_chatgpt.py --load "ImageCaptioning_cuda:0,Text2Image_cuda:0"

# Advice for four Tesla V100 32GB
python visual_chatgpt.py --load "Text2Box_cuda:0,Segmenting_cuda:0,
    Inpainting_cuda:0,ImageCaptioning_cuda:0,
    Text2Image_cuda:1,Image2Canny_cpu,CannyText2Image_cuda:1,
    Image2Depth_cpu,DepthText2Image_cuda:1,VisualQuestionAnswering_cuda:2,
    InstructPix2Pix_cuda:2,Image2Scribble_cpu,ScribbleText2Image_cuda:2,
    SegText2Image_cuda:2,Image2Pose_cpu,PoseText2Image_cuda:2,
    Image2Hed_cpu,HedText2Image_cuda:3,Image2Normal_cpu,
    NormalText2Image_cuda:3,Image2Line_cpu,LineText2Image_cuda:3"
```
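To make the `--load` format concrete, the sketch below shows how a string like `"ImageCaptioning_cpu,Text2Image_cuda:0"` decomposes into (model, device) pairs. This illustrates the format only; it is not the repository's actual parsing code, and `parse_load` is a hypothetical helper.

```python
# Illustration of the --load string format; not TaskMatrix's own parser.
def parse_load(load: str) -> dict:
    """Split "Model_device,Model_device,..." into {model: device}."""
    pairs = {}
    for item in load.replace("\n", "").split(","):
        # Split from the right so device specs like "cuda:0" stay intact.
        model, device = item.strip().rsplit("_", 1)
        pairs[model] = device
    return pairs

print(parse_load("ImageCaptioning_cpu,Text2Image_cuda:0"))
# -> {'ImageCaptioning': 'cpu', 'Text2Image': 'cuda:0'}
```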
Here we list the GPU memory usage of each visual foundation model; you can choose which ones to load:
Foundation Model | GPU Memory (MB) |
---|---|
ImageEditing | 3981 |
InstructPix2Pix | 2827 |
Text2Image | 3385 |
ImageCaptioning | 1209 |
Image2Canny | 0 |
CannyText2Image | 3531 |
Image2Line | 0 |
LineText2Image | 3529 |
Image2Hed | 0 |
HedText2Image | 3529 |
Image2Scribble | 0 |
ScribbleText2Image | 3531 |
Image2Pose | 0 |
PoseText2Image | 3529 |
Image2Seg | 919 |
SegText2Image | 3529 |
Image2Depth | 0 |
DepthText2Image | 3531 |
Image2Normal | 0 |
NormalText2Image | 3529 |
VisualQuestionAnswering | 1495 |
We appreciate the open source of the following projects:
Hugging Face, LangChain, Stable Diffusion, ControlNet, InstructPix2Pix, CLIPSeg, and BLIP.
For help or issues using TaskMatrix, please submit a GitHub issue.
For other communications, please contact Chenfei WU (chewu@microsoft.com) or Nan DUAN (nanduan@microsoft.com).
Trademarks: This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.