Grounding DINO

Also known as: GroundingDINO, Open-vocabulary detector

Definition

Grounding DINO is an open-vocabulary detector: instead of a fixed class list, it accepts free-form text prompts such as 'worker without hard hat' or 'oil leak' and locates matching regions in the image. It pairs a Swin Transformer image encoder with a BERT text encoder and cross-modal fusion. When training data is scarce, its zero-shot performance on novel industrial classes can exceed that of a supervised YOLO model. FI Tech uses Grounding DINO as an autolabeler: it bootstraps training sets for new classes (uncovered manhole, broken scaffold pipe, missing toe-board) without weeks of manual annotation. Once 5,000 labels accumulate, we distill them into a faster YOLO student model.
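
The autolabeling step can be sketched roughly as follows. This is a minimal illustration and not FI Tech's actual pipeline: the Detection record, the CLASS_IDS mapping, the 0.35 score cutoff, and the function names are all hypothetical, and the call to the detector itself is left out. The sketch only shows how text-prompted detections might be turned into YOLO-style label lines (class id, then box centre, width, and height normalised to the image size) and how the 5,000-label trigger for distillation might be checked.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical prompt-to-class mapping for the new hazard classes named above.
CLASS_IDS = {
    "uncovered manhole": 0,
    "broken scaffold pipe": 1,
    "missing toe-board": 2,
}

# Label count at which distillation into the YOLO student begins (from the text).
DISTILL_THRESHOLD = 5_000


@dataclass
class Detection:
    """One detector output (assumed shape): prompt phrase, confidence, pixel box."""
    phrase: str
    score: float
    box_xyxy: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_yolo_lines(detections: List[Detection], img_w: int, img_h: int,
                  min_score: float = 0.35) -> List[str]:
    """Convert confident detections into YOLO label lines.

    Each line is 'class_id x_center y_center width height', with the four
    box values normalised to [0, 1]. The 0.35 cutoff is an assumed default.
    """
    lines = []
    for det in detections:
        class_id = CLASS_IDS.get(det.phrase)
        # Skip phrases outside the target classes and low-confidence boxes.
        if class_id is None or det.score < min_score:
            continue
        x0, y0, x1, y1 = det.box_xyxy
        x_center = (x0 + x1) / 2 / img_w
        y_center = (y0 + y1) / 2 / img_h
        width = (x1 - x0) / img_w
        height = (y1 - y0) / img_h
        lines.append(
            f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
        )
    return lines


def ready_to_distill(label_count: int) -> bool:
    """True once enough autolabels have accumulated to train the student."""
    return label_count >= DISTILL_THRESHOLD
```

In practice the label lines for each image would be written to a text file next to that image, and a running count of accepted labels would be passed to ready_to_distill. Some human spot-checking of the autolabels before distillation is a sensible assumption, since the student model inherits any systematic errors in them.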