FareedKhan-dev/ai-desktop


A simple AI Desktop that uses OmniParser and a vision-language model to interact with the system. It can perform various tasks such as opening applications, searching the web, and answering questions.

User Query: Open Google Chrome and search for google stock price

sample_result.mp4

Table of Contents

- How it works
- Installation
- OmniParser Setup
- Configuration
- Running the AI Desktop

How it works

graph TD;
    A[User Prompt: Open Chrome and buy me a milk] -->|User Input| B[VLMAgent];
    B -->|Parse Screen Content| C[Omniparser];
    C -->|Extracted Info| D[Computer];
    B -->|Analyze Screen, Determine Action| E[LLM OpenAI];
    E -->|Generate Action e.g., Mouse Move, Type| F[Action Execution];
    F -->|Execute Action on Computer| D;
    D -->|Get Result/Feedback| B;
    F -->|Repeat until Task Complete| G[Task Complete];

It takes a user prompt and processes it through a vision-language model (VLMAgent). The agent analyzes the screen, extracts information, and determines the required actions using an AI model. These actions are then executed on the computer, repeating until the task is complete.
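
The control flow can be sketched roughly as follows. This is an illustrative outline, not the repository's actual implementation; the three callables are hypothetical stand-ins for the OmniParser call, the VLM call, and the action executor.

def run_agent(user_query, parse_screen, decide_action, execute_action, max_steps=20):
    """Illustrative perceive -> decide -> act loop (sketch only).

    parse_screen():   returns the parsed screen content (OmniParser).
    decide_action():  asks the VLM for the next action given query, screen, history.
    execute_action(): performs the action (mouse move, click, type, screenshot, ...).
    """
    history = []
    for _ in range(max_steps):
        screen = parse_screen()
        action = decide_action(user_query, screen, history)
        if action.get("type") == "done":    # the VLM signals the task is complete
            break
        result = execute_action(action)     # act on the computer
        history.append((action, result))    # feed the result back into the next step
    return history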

Installation

Clone the repository along with the OmniParser submodule

git clone --recursive https://github.com/FareedKhan-dev/ai-desktop

Or, if already cloned, update the OmniParser submodule:

git submodule update --init --recursive

To install the dependencies, run the following command:

cd ai-desktop/OmniParser
pip install -r requirements.txt

AI Desktop itself does not require any additional dependencies.

OmniParser Setup

Navigate to the OmniParser directory:

cd OmniParser

Download the model checkpoints:

# Download the model checkpoints to the local directory OmniParser/weights/
mkdir -p weights/icon_detect weights/icon_caption_florence
for file in icon_detect/{train_args.yaml,model.pt,model.yaml} \
            icon_caption/{config.json,generation_config.json,model.safetensors}; do
    huggingface-cli download microsoft/OmniParser-v2.0 "$file" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence

Make sure the weights are downloaded into the weights directory and that the subdirectories are named icon_detect and icon_caption_florence, respectively.

To start the Gradio API of OmniParser, run the following command:

python gradio_demo.py

The Gradio API will start at localhost:<port> and a live sharing link will be generated.
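
As a quick check (not part of this repository), the running Gradio app can be inspected from Python with gradio_client. The exact endpoint name and argument list depend on how gradio_demo.py defines its interface, so treat the URL and endpoint below as assumptions to adapt.

from gradio_client import Client

# Point the client at the local URL or the generated live sharing link.
client = Client("http://localhost:7860")  # assumed port; use the URL printed by gradio_demo.py

# Print the available endpoints and their parameters; the OmniParser call is then
# client.predict(..., api_name="/<endpoint>") with the arguments shown there.
client.view_api()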

Configuration

Modify the config.py file to set up the API URLs, model names, and authentication keys.

OMNIPARSER_API_URL = "OMNIPARSER_Gradio_link"  # Set the OmniParser Gradio API link (see the OmniParser Setup section above)
VLM_MODEL_NAME = "OPENAI/LOCAL_MODEL_NAME"     # Define the vision-language model
BASE_URL = "BASE_URL"                          # Set the base URL for the API
API_KEY = "API_KEY"                            # Provide the API key

The SYSTEM_PROMPT in config.py defines the AI agent's behavior, guiding it to interact with the system using various actions like mouse movements, clicks, typing, and screenshots. Modify it as needed for custom AI interactions.
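
As an illustration of how these settings fit together (a sketch assuming an OpenAI-compatible backend, not the project's actual code), the VLM client could be created from config.py like this:

from openai import OpenAI

from config import API_KEY, BASE_URL, SYSTEM_PROMPT, VLM_MODEL_NAME

# Works with OpenAI itself or any local OpenAI-compatible server,
# since both the base URL and the model name come from config.py.
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

response = client.chat.completions.create(
    model=VLM_MODEL_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Open Google Chrome and search for google stock price"},
    ],
)
print(response.choices[0].message.content)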

Running the AI Desktop

To start the AI Desktop, run the following command:

python main.py

You can modify the user_query in main.py to test different queries.
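
For example, assuming user_query is a plain string defined near the top of main.py (its exact form may differ):

# In main.py -- swap in any task you want the agent to attempt.
user_query = "Open Google Chrome and search for google stock price"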

