1. What is DINOv3?


DINOv3 (Distillation with No Labels v3) is a state-of-the-art self-supervised Vision Transformer (ViT) that learns visual representations without any human annotations. Developed by Meta AI as a self-supervised vision foundation model, DINOv3 substantially scales up the earlier DINO/DINOv2 work, growing the model to roughly 6.7 billion parameters and the training data to a curated collection of 1.7 billion images.

2. Obtaining and Testing the Model Weights
2.1 Model weights

(1). ViT models pretrained on the web dataset (LVD-1689M).


(2). ConvNeXt models pretrained on the web dataset (LVD-1689M).


(3). ViT models pretrained on the satellite dataset (SAT-493M).


2.2 Loading the weights

The official repository offers the pretrained backbones in two ways (via PyTorch Hub and via Hugging Face Transformers). Owing to certain limitations, some checkpoints may fail to download; a resource bundle is provided at the end of this article for readers who need it.

2.2.1 PyTorch Hub
  • The official repository provides the following code for loading the models:
import torch

REPO_DIR = <PATH/TO/A/LOCAL/DIRECTORY/WHERE/THE/DINOV3/REPO/WAS/CLONED>

# DINOv3 ViT models pretrained on web images
dinov3_vits16 = torch.hub.load(REPO_DIR, 'dinov3_vits16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vits16plus = torch.hub.load(REPO_DIR, 'dinov3_vits16plus', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vitb16 = torch.hub.load(REPO_DIR, 'dinov3_vitb16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vitl16 = torch.hub.load(REPO_DIR, 'dinov3_vitl16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vith16plus = torch.hub.load(REPO_DIR, 'dinov3_vith16plus', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vit7b16 = torch.hub.load(REPO_DIR, 'dinov3_vit7b16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)

# DINOv3 ConvNeXt models pretrained on web images
dinov3_convnext_tiny = torch.hub.load(REPO_DIR, 'dinov3_convnext_tiny', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_convnext_small = torch.hub.load(REPO_DIR, 'dinov3_convnext_small', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_convnext_base = torch.hub.load(REPO_DIR, 'dinov3_convnext_base', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_convnext_large = torch.hub.load(REPO_DIR, 'dinov3_convnext_large', source='local', weights=<CHECKPOINT/URL/OR/PATH>)

# DINOv3 ViT models pretrained on satellite imagery
dinov3_vitl16 = torch.hub.load(REPO_DIR, 'dinov3_vitl16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)
dinov3_vit7b16 = torch.hub.load(REPO_DIR, 'dinov3_vit7b16', source='local', weights=<CHECKPOINT/URL/OR/PATH>)

(1). Clone the repository locally with git clone.

(2). Download the weights and save them to the corresponding directory.

import torch

# Path to the locally cloned DINOv3 repository and the downloaded checkpoint
REPO_DIR = "project/dinov3"
dinov3_vitb16 = torch.hub.load(REPO_DIR, 'dinov3_vitb16', source='local', weights="dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth")

(3). Run the code to test it; if a download/loading progress bar appears, the checkpoint has been loaded successfully.


2.2.2 Hugging Face Transformers
  • Similar to the approach above, except the model files need to be downloaded from Hugging Face; likewise, the corresponding resources are provided at the end of this article for anyone who needs them.
import torch
from transformers import AutoImageProcessor, AutoModel
from transformers.image_utils import load_image

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)

pretrained_model_name = "facebook/dinov3-convnext-tiny-pretrain-lvd1689m"
processor = AutoImageProcessor.from_pretrained(pretrained_model_name)
model = AutoModel.from_pretrained(
    pretrained_model_name, 
    device_map="auto", 
)

inputs = processor(images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model(**inputs)

pooled_output = outputs.pooler_output
print("Pooled output shape:", pooled_output.shape)

(1). Clone the repository locally with git clone.

(2). Download the weights and save them to the corresponding directory.

(3). Run the code to test it, where pretrained_model_name can be any of the following:

  • facebook/dinov3-vits16-pretrain-lvd1689m
  • facebook/dinov3-vits16plus-pretrain-lvd1689m
  • facebook/dinov3-vitb16-pretrain-lvd1689m
  • facebook/dinov3-vitl16-pretrain-lvd1689m
  • facebook/dinov3-vith16plus-pretrain-lvd1689m
  • facebook/dinov3-vit7b16-pretrain-lvd1689m
  • facebook/dinov3-convnext-base-pretrain-lvd1689m
  • facebook/dinov3-convnext-large-pretrain-lvd1689m
  • facebook/dinov3-convnext-small-pretrain-lvd1689m
  • facebook/dinov3-convnext-tiny-pretrain-lvd1689m
  • facebook/dinov3-vitl16-pretrain-sat493m
  • facebook/dinov3-vit7b16-pretrain-sat493m
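  • Once the model loads, the outputs object from the snippet above exposes more than the pooled embedding: outputs.last_hidden_state holds the token-level (ViT) or spatial (ConvNeXt) features that dense downstream heads consume. A minimal follow-up sketch, appended after the forward pass above (the exact shapes depend on which checkpoint you load):
# Patch/token-level features for dense tasks (a token sequence for ViT checkpoints,
# a spatial feature map for ConvNeXt ones)
patch_features = outputs.last_hidden_state
print("Last hidden state shape:", patch_features.shape)

# L2-normalized global embedding, e.g. for retrieval or k-NN classification
global_embedding = torch.nn.functional.normalize(pooled_output, dim=-1)
print("Global embedding shape:", global_embedding.shape)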
2.3 Image transforms
  • For models using the LVD-1689M weights (pretrained on web images), use the following transform (the standard ImageNet evaluation transform):
from torchvision import transforms

def make_transform(resize_size: int = 224):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return transforms.Compose([to_tensor, resize, normalize])
  • For models using the SAT-493M weights (pretrained on satellite imagery), use the following transform:
from torchvision import transforms

def make_transform(resize_size: int = 224):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.430, 0.411, 0.296),
        std=(0.213, 0.156, 0.143),
    )
    return transforms.Compose([to_tensor, resize, normalize])
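  • To sanity-check a locally loaded backbone with these statistics, apply the transform to a test image and run a forward pass. A minimal sketch, assuming dinov3_vitb16 was loaded via torch.hub as in section 2.2.1, that its plain forward returns a global image embedding (as with DINOv2), and that example.jpg is any local test image:
from PIL import Image
import torch

transform = make_transform(resize_size=224)        # LVD-1689M (web) statistics from above
img = Image.open("example.jpg").convert("RGB")     # placeholder: any local test image
batch = transform(img).unsqueeze(0)                # shape: (1, 3, 224, 224)

# dinov3_vitb16 is the backbone loaded via torch.hub in section 2.2.1; as with DINOv2,
# the plain forward is assumed to return a global image embedding
dinov3_vitb16.eval()
with torch.inference_mode():
    features = dinov3_vitb16(batch)
print("Feature shape:", features.shape)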
3. Downstream Tasks
3.1 Image classification


  • Full-model (backbone + linear head) weights are provided; the classifier model can be loaded via PyTorch Hub:
import torch
# DINOv3
dinov3_vit7b16_lc = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_lc', source="local", weights=<CLASSIFIER/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)
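  • Only the loading call is shown above. A rough inference sketch, under the assumption that the full classifier accepts a normalized image batch (see the transforms in section 2.3) and returns ImageNet-1k logits; example.jpg and the 224-pixel resize are placeholders:
from PIL import Image
import torch

transform = make_transform(resize_size=224)
img = Image.open("example.jpg").convert("RGB")
batch = transform(img).unsqueeze(0)

dinov3_vit7b16_lc.eval()
with torch.inference_mode():
    logits = dinov3_vit7b16_lc(batch)        # assumed shape: (1, 1000)

top5 = logits.softmax(dim=-1).topk(5)
print("Top-5 class indices:", top5.indices.tolist())
print("Top-5 probabilities:", top5.values.tolist())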


3.2 Depth estimation trained on the SYNTHMIX dataset


  • Full-model (backbone + depth head) weights are provided:
depther = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_dd', source="local", weights=<DEPTHER/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)
  • Full example code for depth estimation on an image:
from PIL import Image
import torch
from torchvision import transforms
import matplotlib.pyplot as plt
from matplotlib import colormaps

def get_img():
    import requests
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

def make_transform(resize_size: int | list[int] = 768):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return transforms.Compose([to_tensor, resize, normalize])

depther = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_dd', source="local", weights=<DEPTHER/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

img_size = 1024
img = get_img()
transform = make_transform(img_size)
with torch.inference_mode():
    with torch.autocast('cuda', dtype=torch.bfloat16):
        batch_img = transform(img)[None]
        batch_img = batch_img.to(next(depther.parameters()).device)  # move the input to the model's device
        depths = depther(batch_img)

plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.imshow(img)
plt.axis("off")
plt.subplot(122)
plt.imshow(depths[0,0].cpu(), cmap=colormaps["Spectral"])
plt.axis("off")
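  • The snippet above draws the input image and the predicted depth map side by side. When running it as a plain Python script rather than in a notebook, display or save the figure explicitly (the same applies to the segmentation example in section 3.4):
plt.tight_layout()
plt.show()   # or: plt.savefig("depth_prediction.png", dpi=150)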

3.3 Object detector trained on the COCO2017 dataset


detector = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_de', source="local", weights=<DETECTOR/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)
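  • Only the loading call is shown here; the full detection inference pipeline (image pre-processing and box post-processing) is covered by the official examples (see section 4). A minimal sanity check to confirm that the backbone and detection head were assembled:
detector.eval()
n_params = sum(p.numel() for p in detector.parameters())
print(f"Detector parameters: {n_params / 1e9:.2f} B")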
3.4 Segmentor trained on the ADE20K dataset


segmentor = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_ms', source="local", weights=<SEGMENTOR/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)
  • Full example code for segmentation on an image:
import sys
sys.path.append(REPO_DIR)

from PIL import Image
import torch
from torchvision import transforms
import matplotlib.pyplot as plt
from matplotlib import colormaps
from functools import partial
from dinov3.eval.segmentation.inference import make_inference


def get_img():
    import requests
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

def make_transform(resize_size: int | list[int] = 768):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return transforms.Compose([to_tensor, resize, normalize])

segmentor = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_ms', source="local", weights=<SEGMENTOR/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

img_size = 896
img  = get_img()
transform = make_transform(img_size)
with torch.inference_mode():
    with torch.autocast('cuda', dtype=torch.bfloat16):
        batch_img = transform(img)[None]
        pred_vit7b = segmentor(batch_img)  # raw predictions  
        # actual segmentation map
        segmentation_map_vit7b = make_inference(
            batch_img,
            segmentor,
            inference_mode="slide",
            decoder_head_type="m2f",
            rescale_to=(img.size[-1], img.size[-2]),
            n_output_channels=150,
            crop_size=(img_size, img_size),
            stride=(img_size, img_size),
            output_activation=partial(torch.nn.functional.softmax, dim=1),
        ).argmax(dim=1, keepdim=True)
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.imshow(img)
plt.axis("off")
plt.subplot(122)
plt.imshow(segmentation_map_vit7b[0,0].cpu(), cmap=colormaps["Spectral"])
plt.axis("off")
3.5 dino.txt zero-shot tasks


import torch
# DINOv3
dinov3_vitl16_dinotxt_tet1280d20h24l, tokenizer = torch.hub.load(REPO_DIR, 'dinov3_vitl16_dinotxt_tet1280d20h24l', source="local", weights=<CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)
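  • The dino.txt model pairs the vision backbone with a text encoder for CLIP-style zero-shot tasks. Check the exact inference API against the repository's dino.txt example; the sketch below only illustrates the general pattern and assumes hypothetical CLIP-style encode_image / encode_text methods and a tokenizer that turns a list of prompts into model-ready tensors:
import torch

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
model = dinov3_vitl16_dinotxt_tet1280d20h24l

model.eval()
with torch.inference_mode():
    # Hypothetical CLIP-style calls -- verify the actual method names and the
    # image preprocessing against the repository before use
    text_tokens = tokenizer(prompts)
    text_features = model.encode_text(text_tokens)
    image_features = model.encode_image(batch_img)     # batch_img: a preprocessed image batch

    # Cosine similarity -> zero-shot probabilities over the prompts
    image_features = torch.nn.functional.normalize(image_features, dim=-1)
    text_features = torch.nn.functional.normalize(text_features, dim=-1)
    probs = (image_features @ text_features.T).softmax(dim=-1)
print("Zero-shot probabilities:", probs.tolist())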
4. Running Online

The official repository also provides several Jupyter notebooks. They still require downloading the corresponding model weights to run; after setting up a local environment, you can edit the paths and test them.


5. Weight Resources


Follow the WeChat official account [TensorLink] and reply "DINOv3" to obtain all of the weights.
