PRITHIVSAKTHIUR/Fara-7B-GUI-Operator


A Gradio-based demonstration for the Microsoft Fara-7B model, designed as a computer use agent. Users upload UI screenshots (e.g., desktop or app interfaces), provide task instructions (e.g., "Click on the search bar"), and receive parsed actions (clicks, types) with visualized indicators (circles and labels) overlaid on the image. Supports JSON-formatted tool calls for precise coordinate-based interactions.

Demo: https://huggingface.co/spaces/prithivMLmods/CUA-GUI-Operator

[Screenshot: CUA GUI Operator, the demo running as a Hugging Face Space by prithivMLmods]

Features

  • UI Image Processing: Upload screenshots; model analyzes and suggests actions like clicks or text input at specific coordinates.
  • Task-Driven Inference: Natural language instructions generate structured JSON actions (e.g., {"action": "click", "coordinate": [400, 300]}).
  • Action Visualization: Overlays red circles for clicks and blue for others, with labels (e.g., "Click" or "Type: 'Hello'") on the output image.
  • Response Parsing: Extracts tool calls from model output using regex and handles multiple actions per task (see the sketch after this list).
  • Custom Theme: OrangeRedTheme with gradients for an intuitive interface.
  • Examples Integration: Pre-loaded samples for quick testing (e.g., Windows start menu, search box).
  • Queueing Support: Handles up to 50 concurrent inferences for efficient use.
  • Error Resilience: Fallbacks for model loading failures or invalid inputs; console logging for debugging.
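
The parsing and overlay steps can be illustrated with a short sketch. This is a minimal reconstruction, not the exact code in app.py: the regex, the {"action": ..., "coordinate": [x, y]} schema (taken from the feature list above), and both helper names are illustrative assumptions.

    import json
    import re
    from PIL import Image, ImageDraw

    # Assumed action schema from the feature list:
    #   {"action": "click", "coordinate": [400, 300]}
    # The regex is illustrative; the real app may wrap tool calls differently.
    TOOL_CALL_RE = re.compile(r"\{[^{}]*\"action\"[^{}]*\}")

    def parse_actions(model_output: str) -> list[dict]:
        """Extract every JSON tool call found in the raw model response."""
        actions = []
        for match in TOOL_CALL_RE.finditer(model_output):
            try:
                actions.append(json.loads(match.group(0)))
            except json.JSONDecodeError:
                continue  # skip malformed fragments instead of failing
        return actions

    def draw_actions(image: Image.Image, actions: list[dict]) -> Image.Image:
        """Overlay a labeled circle per action: red for clicks, blue otherwise."""
        annotated = image.convert("RGB").copy()
        draw = ImageDraw.Draw(annotated)
        for act in actions:
            x, y = act.get("coordinate", (0, 0))
            color = "red" if act.get("action") == "click" else "blue"
            label = act.get("action", "?").capitalize()
            if act.get("action") == "type":
                label = f"Type: '{act.get('text', '')}'"
            r = 15
            draw.ellipse((x - r, y - r, x + r, y + r), outline=color, width=3)
            draw.text((x + r + 4, y - r), label, fill=color)
        return annotated

Note that draw_actions annotates a copy, so the uploaded screenshot itself is never modified in place.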

Prerequisites

  • Python 3.10 or higher.
  • CUDA-compatible GPU (recommended for float16; the app falls back to CPU; see the loading sketch after this list).
  • Git for cloning dependencies.
  • Hugging Face account (optional, for model caching via huggingface_hub).
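
For reference, the dtype selection and model loading can look like the sketch below. This is a minimal sketch, not the code in app.py: the checkpoint id "microsoft/Fara-7B" and the AutoModelForImageTextToText loader are assumptions; substitute whatever the app actually uses.

    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    MODEL_ID = "microsoft/Fara-7B"  # assumed checkpoint id; app.py may differ

    # float16 on GPU, float32 on CPU, matching the fallback described above.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_ID, torch_dtype=dtype
    ).to(device)
    model.eval()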

Installation

  1. Clone the repository:

    git clone https://github.com/PRITHIVSAKTHIUR/Fara-7B-Action-Points-Demo.git
    cd Fara-7B-Action-Points-Demo
  2. Install dependencies: Create a requirements.txt file with the following content, then run:

    pip install -r requirements.txt

    requirements.txt content:

    transformers==4.57.1
    webdriver-manager
    huggingface_hub
    python-dotenv
    sentencepiece
    qwen-vl-utils
    gradio_modal
    torchvision
    matplotlib
    accelerate
    num2words
    pydantic
    requests
    pillow
    openai
    spaces
    einops
    torch
    peft
  3. Start the application:

    python app.py

    The demo launches at http://localhost:7860 (or the provided URL if using Spaces).

Usage

  1. Upload Image: Provide a UI screenshot (e.g., PNG of a desktop or app window).

  2. Enter Task: Describe the action in the textbox (e.g., "Click on the start menu" or "Type 'Hello World' in the search box").

  3. Execute: Click "Execute Agent" to run inference.

  4. View Results:

    • Text: Raw model response with parsed JSON actions.
    • Image: Annotated screenshot showing action points (circles with labels).

Example Workflow

  • Upload a Windows desktop image.
  • Task: "Click on the start menu."
  • Output: Response with click action at coordinates; image with red circle labeled "Click" on the start button.
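
Programmatically, this workflow maps onto the standard Qwen-VL chat pipeline (Fara-7B builds on a Qwen vision-language backbone, which is why qwen-vl-utils is a dependency). A minimal sketch, reusing the model and processor from the loading snippet in Prerequisites; the prompt and generation settings are illustrative:

    from qwen_vl_utils import process_vision_info

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "desktop.png"},  # uploaded screenshot
            {"type": "text", "text": "Click on the start menu."},
        ],
    }]

    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[prompt], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=512)
    response = processor.batch_decode(
        output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0]
    print(response)  # raw text containing the JSON tool call(s)

The decoded response is what the demo feeds to the parsing and visualization helpers sketched under Features.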

Troubleshooting

  • Model Loading Errors: Ensure transformers==4.57.1 is installed; check CUDA availability with torch.cuda.is_available(). Use torch.float32 if float16 runs out of memory.
  • No Actions Parsed: Verify the task is clearly phrased; the raw model output is logged to the console. Increase max_new_tokens if the response is truncated.
  • Visualization Issues: PIL font errors fall back to the default font; ensure input images are RGB.
  • Queue Full: Increase max_size in demo.queue() for higher traffic (see the sketch after this list).
  • Vision Utils: Install qwen-vl-utils for image preprocessing; test with the bundled examples.
  • UI Rendering: Set ssr_mode=True if the gradient theme fails to render; check the custom CSS styles.
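
The queue and server settings referenced above are configured in a couple of lines. A minimal sketch, assuming the Gradio Blocks object in app.py is named demo:

    import gradio as gr

    with gr.Blocks() as demo:
        ...  # UI components and the inference callback are defined here

    demo.queue(max_size=50)     # raise for higher traffic
    demo.launch(ssr_mode=True)  # ssr_mode per the UI Rendering tip above;
                                # serves at http://localhost:7860 by default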

Contributing

Contributions are encouraged! Fork the repository, create a feature branch (e.g., for multi-step tasks), and submit PRs with tests. Focus areas:

  • Support for video inputs or real-time GUI control.
  • Additional action types (e.g., scroll, drag).
  • Integration with browser automation.

Repository: https://github.com/PRITHIVSAKTHIUR/Fara-7B-Action-Points-Demo.git

License

Apache License 2.0. See LICENSE for details.

Built by Prithiv Sakthi. Report issues via the repository.
