This demo showcases the OWLv2 model's ability to perform zero-shot object detection using visual and text prompts. You can provide either a text prompt or an image as a visual prompt to detect objects in the target image. The demo also compares different approaches for selecting a query embedding from a visual prompt. The default method in Hugging Face's transformers often underperforms because of how it selects the visual prompt embedding (see README.md for more details).
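
To make the idea of "selecting a query embedding" concrete, here is a minimal sketch of one plausible selection rule suggested by the "Objectness × IoU" panel title: score each box predicted on the prompt image by its objectness multiplied by its IoU with the user-drawn prompt box, and use the embedding of the top-scoring box as the query. This is an assumption for illustration, not the demo's confirmed implementation (README.md is the authoritative description). The names `box_iou` and `select_query_index` are hypothetical, and the inputs are plain Python lists rather than model outputs.

```python
from typing import Sequence

# A box is (x_min, y_min, x_max, y_max) in any consistent coordinate system.
Box = Sequence[float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_query_index(
    pred_boxes: Sequence[Box],
    objectness: Sequence[float],
    prompt_box: Box,
) -> int:
    """Return the index of the predicted box maximizing objectness * IoU.

    Hypothetical helper: assumes one objectness score per predicted box,
    and that the embedding at the returned index becomes the query.
    """
    scores = [
        obj * box_iou(box, prompt_box)
        for box, obj in zip(pred_boxes, objectness)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: a whole-image box with high objectness, an exact match with
# low objectness, and a close match with high objectness.
pred_boxes = [(0.0, 0.0, 1.0, 1.0), (0.2, 0.2, 0.6, 0.6), (0.25, 0.25, 0.65, 0.65)]
objectness = [0.9, 0.3, 0.8]
prompt_box = (0.2, 0.2, 0.6, 0.6)
best = select_query_index(pred_boxes, objectness, prompt_box)
```

In this toy input the third box wins: it overlaps the prompt box well and has a high objectness score, whereas the whole-image box overlaps poorly and the exact match is penalized by its low objectness.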

Model

Annotated Image (HF default)

Annotated Image (Objectness × IoU variant)

Prompt Image with Selected Embeddings and Objectness Score

Box Visual Prompt Examples