This demo showcases the OWLv2 model's ability to perform zero-shot object detection using visual and text prompts.
You can provide either a text prompt or an image as a visual prompt to detect objects in the target image.
Additionally, it compares different approaches for selecting a query embedding from a visual prompt. The default method in Hugging Face's transformers library often underperforms because of how it selects the visual prompt embedding (see README.md for more details).
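To make the idea of query-embedding selection concrete, here is a minimal sketch of one plausible reading of the "Objectness × IoU variant" label used in this demo: each box predicted on the prompt image is scored by its objectness multiplied by its IoU with the user-drawn prompt box, and the embedding of the highest-scoring box becomes the query. The function names `box_iou` and `select_query_index`, the corner box format, and the sample numbers are all assumptions made for illustration. This sketch reproduces neither the transformers default nor necessarily the demo's exact implementation; README.md remains the reference.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in normalized coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_query_index(
    pred_boxes: Sequence[Box],
    objectness: Sequence[float],
    prompt_box: Box,
) -> int:
    """Hypothetical selector: index of the box maximizing objectness * IoU."""
    scores = [
        obj * box_iou(prompt_box, box)
        for box, obj in zip(pred_boxes, objectness)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up predictions on a prompt image, for illustration only.
pred_boxes = [
    (0.0, 0.0, 1.0, 1.0),      # covers the whole image, low objectness
    (0.2, 0.2, 0.6, 0.6),      # matches the prompt box, high objectness
    (0.25, 0.25, 0.65, 0.65),  # overlaps the prompt box, medium objectness
]
objectness = [0.10, 0.90, 0.40]
prompt_box = (0.2, 0.2, 0.6, 0.6)

best = select_query_index(pred_boxes, objectness, prompt_box)
print(best)
```

Multiplying the two terms means a box is only chosen when it both overlaps the prompt region and looks like an object, so a large background box with high overlap but low objectness is penalized.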
[Interactive demo interface: "Model" selector; output panels "Annotated Image (HF default)", "Annotated Image (Objectness × IoU variant)", and "Prompt Image with Selected Embeddings and Objectness Score"; example gallery "Box Visual Prompt Examples".]