DINOv3
Definition
DINOv3 (Meta, 2025) is a self-supervised vision foundation model trained on 1.7 billion images. Unlike CLIP, it learns purely from image structure, with no captions, and produces dense per-patch features that transfer well to detection, segmentation, and depth estimation. A frozen DINOv3-Large backbone plus a small task head matches or beats fully supervised baselines on most computer-vision benchmarks. FI Tech uses DINOv3 features as a starting point for low-data Saudi industrial niches (uniformed Aramco workers, distinctive haul-truck variants on NEOM construction sites) where collecting 50,000 labels would take a year. A linear probe, meaning a single linear classifier trained on the frozen features, reaches usable accuracy with 100 to 500 labels per class.
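To make the linear-probe idea concrete, here is a minimal sketch in Python with NumPy. It is not FI Tech's pipeline and it does not load DINOv3. It assumes the frozen backbone has already turned each image into a fixed-length embedding, and it stands in for those embeddings with synthetic vectors. The helper names (`train_linear_probe`, `predict`, `synthetic_embeddings`), the 64-dimensional embedding size, the three classes, and the hyperparameters are all illustrative assumptions. The only thing being trained is one weight matrix and one bias vector.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length, a common step before a linear probe."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def train_linear_probe(features, labels, num_classes,
                       lr=1.0, epochs=300, weight_decay=1e-4):
    """Fit a softmax linear classifier on fixed feature vectors.

    The backbone is never touched here: only W and b are learned,
    using full-batch gradient descent on the cross-entropy loss.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad_logits = (probs - onehot) / n
        W -= lr * (features.T @ grad_logits + weight_decay * W)
        b -= lr * grad_logits.sum(axis=0)
    return W, b


def predict(features, W, b):
    """Return the highest-scoring class index for each feature vector."""
    return np.argmax(features @ W + b, axis=1)


def synthetic_embeddings(centers, per_class, noise, rng):
    """Hypothetical stand-in for backbone output: noisy points around class centers."""
    num_classes, dim = centers.shape
    labels = np.repeat(np.arange(num_classes), per_class)
    feats = centers[labels] + noise * rng.normal(size=(labels.size, dim))
    return l2_normalize(feats), labels


rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 64))  # 3 classes, 64-dim embeddings (assumed sizes)
train_x, train_y = synthetic_embeddings(centers, 100, 0.5, rng)  # 100 labels per class
test_x, test_y = synthetic_embeddings(centers, 50, 0.5, rng)

W, b = train_linear_probe(train_x, train_y, num_classes=3)
accuracy = float(np.mean(predict(test_x, W, b) == test_y))
print(f"held-out accuracy: {accuracy:.3f}")
```

In a real setup, `train_x` and `test_x` would be replaced by embeddings computed once from the frozen backbone and cached, so retraining the probe on new labels takes seconds rather than a full fine-tuning run. The synthetic clusters here are far cleaner than real image features, so the accuracy printed by this sketch says nothing about accuracy on real data.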