# AI Desktop (FareedKhan-dev/ai-desktop)

AI agent that controls a computer.
A simple AI Desktop that uses OmniParser and a vision-language model to interact with the system. It can perform various tasks like opening applications, searching the web, and answering questions.

**User Query:** Open Google Chrome and search for google stock price

*(Demo video: `sample_result.mp4`)*
```mermaid
graph TD;
    A[User Prompt: Open Chrome and buy me a milk] -->|User Input| B[VLMAgent];
    B -->|Parse Screen Content| C[Omniparser];
    C -->|Extracted Info| D[Computer];
    B -->|Analyze Screen, Determine Action| E[LLM OpenAI];
    E -->|Generate Action e.g., Mouse Move, Type| F[Action Execution];
    F -->|Execute Action on Computer| D;
    D -->|Get Result/Feedback| B;
    F -->|Repeat until Task Complete| G[Task Complete];
```
It takes a user prompt and processes it through a vision-language model (VLMAgent). The agent analyzes the screen, extracts information, and determines the required actions using an AI model. These actions are then executed on the computer, repeating until the task is complete.
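The perceive-act loop described above can be sketched as follows. The function and action names here (`parse_screen`, `decide_action`, `execute`) are illustrative stand-ins, not the project's actual interfaces:

```python
# Minimal sketch of the screen-parse -> decide -> act loop. All names below
# are hypothetical; the real project defines its own VLMAgent interfaces.

def run_agent(user_query, parse_screen, decide_action, execute, max_steps=10):
    """Repeat: parse screen, ask the model for the next action, execute it."""
    for step in range(max_steps):
        screen_info = parse_screen()                      # OmniParser: extract UI elements
        action = decide_action(user_query, screen_info)   # LLM chooses the next action
        if action["type"] == "done":                      # model signals task completion
            return step
        execute(action)                                   # e.g. mouse move, click, type
    return max_steps

# Tiny stub demo: the "task" completes after two simulated clicks.
state = {"clicks": 0}
result = run_agent(
    "open browser",
    parse_screen=lambda: {"elements": ["icon"]},
    decide_action=lambda q, s: {"type": "done"} if state["clicks"] >= 2 else {"type": "click"},
    execute=lambda a: state.update(clicks=state["clicks"] + 1),
)
```

The key design point is that the loop is feedback-driven: each iteration re-parses the screen, so the model reacts to what actually happened rather than following a fixed plan.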
Clone the repository along with the OmniParser submodule:

```shell
git clone --recursive https://github.com/FareedKhan-dev/ai-desktop
```

Or, if already cloned, update the OmniParser submodule:

```shell
git submodule update --init --recursive
```
To install the dependencies, run the following commands:

```shell
cd ai-desktop/OmniParser
pip install -r requirements.txt
```

AI-Desktop itself does not require any additional dependencies.
Navigate to the `OmniParser` directory:

```shell
cd OmniParser
```
Download the model checkpoints:
```shell
# Download the model checkpoints to the local directory OmniParser/weights/
mkdir -p weights/icon_detect weights/icon_caption_florence
for file in icon_detect/{train_args.yaml,model.pt,model.yaml} \
    icon_caption/{config.json,generation_config.json,model.safetensors}; do
    huggingface-cli download microsoft/OmniParser-v2.0 "$file" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
```
Make sure the weights are downloaded into the `weights` directory, in subdirectories named `icon_detect` and `icon_caption_florence` respectively.
To start the Gradio API of OmniParser, run the following command:

```shell
python gradio_demo.py
```
The Gradio API will start at `localhost:<port>`, and a live sharing link will be generated.
Modify the `config.py` file to set up the API URLs, model names, and authentication keys:

```python
OMNIPARSER_API_URL = "OMNIPARSER_Gradio_link"  # Set the OmniParser Gradio API link (see the Usage section to get the link)
VLM_MODEL_NAME = "OPENAI/LOCAL_MODEL_NAME"     # Define the vision-language model
BASE_URL = "BASE_URL"                          # Set the base URL for the API
API_KEY = "API_KEY"                            # Provide the API key
```
The `SYSTEM_PROMPT` in `config.py` defines the AI agent's behavior, guiding it to interact with the system using actions such as mouse movements, clicks, typing, and screenshots. Modify it as needed for custom AI interactions.
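As a rough illustration, a prompt of this kind might look like the following. This is only a sketch of the idea; the actual `SYSTEM_PROMPT` shipped in `config.py` is project-specific and more detailed:

```python
# Hypothetical shape of a system prompt for a desktop-control agent.
# The action names below mirror the kinds of actions the README mentions
# (mouse movements, clicks, typing, screenshots); the real prompt differs.
SYSTEM_PROMPT = (
    "You are an agent controlling a desktop computer. "
    "On each turn you receive the parsed screen content and must reply with "
    "exactly one action: mouse_move(x, y), click, type(text), or screenshot. "
    "Reply with 'done' when the user's task is complete."
)
```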
To start the AI Desktop, run the following command:

```shell
python main.py
```
You can modify the `user_query` in `main.py` to test different queries.
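For example, you might set it to another task (the query text below is just an illustration; the variable name comes from `main.py`):

```python
# In main.py -- replace the existing query string with your own task:
user_query = "Open Notepad and type hello world"
```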