OmniParser is a method for parsing user interface screenshots into structured, easy-to-understand elements. This significantly improves the ability of GPT-4V to generate actions that are accurately grounded in the corresponding regions of the interface.
Project repository: https://github.com/microsoft/OmniParser
OmniParser includes two main components:
OmniTool: Control a Windows 11 virtual machine using OmniParser + your choice of vision model.
Supported Features:
Installation:
git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
mv weights/icon_caption weights/icon_caption_florence
python gradio_demo.py
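
The demo relies on all six downloaded weight files being present, with the caption model under the renamed `weights/icon_caption_florence` directory. As a minimal sketch, a check like the following could be run before `python gradio_demo.py` to catch an incomplete download. The function name `check_weights` is hypothetical and not part of OmniParser; the file list is taken from the download command above.

```shell
# Hypothetical helper (not part of OmniParser): report any expected
# weight file that is missing from the given directory.
check_weights() {
    dir="${1:-weights}"   # weights directory, defaults to ./weights
    missing=0
    for f in \
        icon_detect/train_args.yaml \
        icon_detect/model.pt \
        icon_detect/model.yaml \
        icon_caption_florence/config.json \
        icon_caption_florence/generation_config.json \
        icon_caption_florence/model.safetensors
    do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    # Exit status 0 when every file exists, 1 otherwise.
    return "$missing"
}
```

For example, `check_weights weights && python gradio_demo.py` would start the demo only when every file is in place.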