
This demo showcases zero-shot object detection with the OWLv2 model, driven by either text or visual prompts. Provide a text prompt, or an image to use as a visual prompt, to detect objects in the target image. The demo also compares different approaches for selecting the query embedding from a visual prompt. The default method in Hugging Face's transformers library often underperforms because of how it selects the visual prompt embedding (see README.md for more details).
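To make the idea of an objectness × IoU selection concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the demo's actual code: the exact scoring rule lives in README.md, which is not reproduced here. The sketch assumes that each candidate box predicted on the prompt image has an objectness score in [0, 1], that boxes are in normalized corner format, and that the candidate with the highest product of objectness and IoU against a target region (the whole prompt image by default) supplies the query embedding. The names `box_iou` and `select_query_index` are hypothetical.

```python
from typing import Sequence, Tuple

# Corner-format box (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_query_index(
    boxes: Sequence[Box],
    objectness: Sequence[float],
    target: Box = (0.0, 0.0, 1.0, 1.0),
) -> int:
    """Return the index of the box maximizing objectness * IoU(target, box).

    Assumption: this product is the selection score; the real demo may
    weight or threshold the two terms differently.
    """
    if not boxes or len(boxes) != len(objectness):
        raise ValueError("boxes and objectness must be non-empty and equal length")
    scores = [obj * box_iou(target, box) for box, obj in zip(boxes, objectness)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical candidates predicted on a prompt image.
candidate_boxes = [
    (0.0, 0.0, 1.0, 1.0),  # covers the whole image, but low objectness
    (0.1, 0.1, 0.9, 0.9),  # tight on the object, high objectness
    (0.0, 0.0, 0.2, 0.2),  # small corner patch
]
candidate_objectness = [0.1, 0.9, 0.8]
best_index = select_query_index(candidate_boxes, candidate_objectness)
```

The embedding at `best_index` would then serve as the query for detection in the target image. Multiplying the two terms means a box must both cover the prompt region well and look like an object to be chosen, so a full-image box with low objectness does not win on overlap alone.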

Annotated Image (HF default)

Annotated Image (Objectness × IoU variant)

Prompt Image with Selected Embeddings and Objectness Score