ExPLoRA: Parameter-Efficient Extended Pre-training to Adapt Vision Transformers under Domain Shifts

Stanford University
*Correspondence to samarkhanna [at] cs.stanford.edu.

ExPLoRA creates state-of-the-art foundation models for new domains by extending unsupervised pre-training of ViTs (like DinoV2 and MAE) in a parameter-efficient manner.

Abstract

Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question in PEFT is extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on that domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights from large natural-image datasets, such as those of DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model on this new domain for supervised learning, using only LoRA. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-trained and fine-tuned ViTs. Using the DinoV2 training objective, we demonstrate up to a 7.5% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the parameters used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT and simply unfreezing more ViT blocks.

Motivation

ExPLoRA creates effective visual foundation models for new domains inexpensively, given existing pre-trained weights.

Consider two fairly different image domains, DS and DT (such as natural images vs. satellite images). On the left, the traditional approach is to pre-train foundation models from scratch for each domain, yielding weights WDS and WDT. Then, these weights are fine-tuned via supervised learning on each target dataset i to yield weights Δsi and Δti for domains DS and DT, respectively. Pre-training for each new domain is very expensive and can require large amounts of compute and data.

ExPLoRA challenges this paradigm. On the right, our method initializes with WDS and learns unsupervised weights ΔDT for domain DT in a parameter-efficient manner. These new weights ΔDT are then used for fine-tuning on specific datasets ti, resulting in even better downstream performance than WDT.

Our key insight is to find the right combination of parameter-efficient methods that works for unsupervised pre-training, rather than relying on PEFT methods designed for traditional supervised fine-tuning.

Method

ExPLoRA works on a ViT with L layers as follows:

  1. Initialize a frozen ViT with pre-trained weights WDS from source domains DS (e.g., DinoV2 or MAE weights from natural images).
  2. Unfreeze all parameters of a subset U of the L ViT blocks (usually just 1 or 2 blocks).
  3. Apply LoRA with rank r to the Q and V projection matrices in the attention layers of the remaining L - |U| frozen blocks.
  4. Train these unfrozen parameters (collectively denoted ΔDT) on an unlabeled dataset XDT from the target domain DT, using the same unsupervised objective that was used to obtain WDS (e.g., DinoV2 or MAE).

The output of this process is a new pre-trained foundation model for the target domain DT, which can then be used for feature extraction or for further fine-tuning on downstream tasks!
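Below is a minimal PyTorch sketch of steps 1-3, assuming a ViT whose blocks live in `vit.blocks` and whose attention modules expose separate `q`, `k`, and `v` linear projections; `LoRALinear`, `configure_explora`, and these attribute names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init => no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

def configure_explora(vit: nn.Module, unfrozen_blocks=(-1,), rank: int = 8):
    """Freeze the ViT, fully unfreeze a few blocks, and add LoRA to Q/V everywhere else."""
    for p in vit.parameters():                 # step 1: start from frozen pre-trained weights W_DS
        p.requires_grad = False

    n = len(vit.blocks)
    unfrozen = {i % n for i in unfrozen_blocks}
    for i, block in enumerate(vit.blocks):
        if i in unfrozen:
            for p in block.parameters():       # step 2: fully train this block
                p.requires_grad = True
        else:                                  # step 3: low-rank updates on Q and V projections
            block.attn.q = LoRALinear(block.attn.q, rank)
            block.attn.v = LoRALinear(block.attn.v, rank)
    return vit
```

The trainable parameters (the unfrozen blocks plus the LoRA matrices, collectively ΔDT) are then optimized with the original unsupervised objective (e.g., the DinoV2 or MAE loss) on unlabeled target-domain images, corresponding to step 4 above.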

Since we only train a small fraction (5-10%) of the original ViT weights, ExPLoRA can create powerful foundation models for new domains using only 4 A4000-16GB GPUs! For comparison, pre-training a ViT-L DinoV2 from scratch required 96 A100-80GB GPUs!

Results

Using ExPLoRA, we are able to create state-of-the-art foundation models for satellite images, outperforming all prior fully pre-trained models on the fMoW-RGB benchmark.

| Method        | Backbone | #Pre-train Params | Top-1 Acc. (%) |
|---------------|----------|-------------------|----------------|
| SatMAE        | ViT-L    | 303M              | 65.94          |
| ScaleMAE      | ViT-B    | 86M               | 67.30          |
| CrossScaleMAE | ViT-B    | 86M               | 69.20          |
| DinoV2        | ViT-L    | -                 | 69.00          |
| Ours          | ViT-B    | 9M                | 75.11          |
| Ours          | ViT-L    | 18M               | 77.48          |

Table 1: Linear probing results on fMoW-RGB.
Here, we show a sneak peek of our results on the fMoW-RGB validation set. In Table 1, #Pre-train Params refers to the number of trainable parameters used on the new domain (i.e., satellite images from fMoW-RGB). Our ExPLoRA-tuned ViT learns strong unsupervised representations: with linear probing, we achieve a large 8.28% improvement in top-1 accuracy over prior state-of-the-art fully pre-trained backbones, while using a fraction of the ViT parameters.

For further results on satellite image datasets and on the WILDS benchmark, please read our paper!

Analysis

A key design choice of ExPLoRA is to fully train a small subset of the ViT layers (i.e., the unfrozen blocks) while applying low-rank updates to the remaining frozen layers. Why is this combination so effective? We conduct experiments to analyze the output feature maps of each ViT block on the target domain (a minimal sketch of these measurements follows the list below). On these feature maps, we:

  1. Apply PCA to calculate the mean and variance of the eigenvalues
  2. Train linear classifiers to predict the relative position of each patch
  3. Train linear classifiers to predict the image class from each patch
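The following sketch illustrates these three measurements for a single block, assuming a hypothetical `features` matrix of per-patch embeddings extracted from that block along with each patch's grid position and its image's class label; the function name and inputs are illustrative, not the paper's exact protocol.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def analyze_block_features(features: np.ndarray,
                           patch_positions: np.ndarray,
                           image_labels: np.ndarray):
    """features: [N, D] patch embeddings from one ViT block.
    patch_positions: [N] integer grid index of each patch.
    image_labels: [N] class label of the image each patch came from."""
    # 1. Spectral statistics of the block's feature map via PCA.
    pca = PCA().fit(features)
    eig_mean = pca.explained_variance_.mean()
    eig_var = pca.explained_variance_.var()

    # 2. Linear probe for local information: predict each patch's position.
    pos_probe = LogisticRegression(max_iter=1000).fit(features, patch_positions)
    pos_acc = pos_probe.score(features, patch_positions)

    # 3. Linear probe for global information: predict the image class from each patch.
    cls_probe = LogisticRegression(max_iter=1000).fit(features, image_labels)
    cls_acc = cls_probe.score(features, image_labels)
    return eig_mean, eig_var, pos_acc, cls_acc
```

In practice the probes would be fit on a training split and evaluated on held-out images; this sketch only shows what each of the three quantities measures.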

Figure: Mean eigenvalues per block (left), linear probing for patch position (middle), and linear probing for image class (right).

These results show that the spectral properties (mean eigenvalues) of a block's feature map and its ability to retrieve local information (like texture or patch position) are correlated. The middle layers of the Dino models are responsible for extracting local information from the input image patches, as their feature-map eigenvalues and localization accuracies are high (left and middle figures). Conversely, the final layers of the ViT are responsible for semantic understanding, as their feature maps contain more global information, reflected in higher image-class accuracies (rightmost figure).

ExPLoRA amplifies the global information stored in each block's feature map towards the final layers of the ViT, while preserving strong localization in the middle layers. This can also be seen in the attention maps below, as ExPLoRA's attention highlights the central object more clearly.
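As an illustration, here is a minimal sketch of how one might extract the [CLS]-token attention map from the final block of a DINO-style ViT for such a visualization; `prepare_tokens`, the fused `qkv` projection, and the absence of register tokens are assumptions that may differ from a given implementation.

```python
import torch

@torch.no_grad()
def cls_attention_map(vit, image: torch.Tensor) -> torch.Tensor:
    """image: [1, 3, H, W]. Returns [num_heads, num_patches] attention of [CLS] over patches."""
    tokens = vit.prepare_tokens(image)         # patchify + prepend [CLS] + positional embeddings
    for block in vit.blocks[:-1]:              # run all but the last transformer block
        tokens = block(tokens)

    last = vit.blocks[-1]
    x = last.norm1(tokens)
    B, N, C = x.shape
    h = last.attn.num_heads
    qkv = last.attn.qkv(x).reshape(B, N, 3, h, C // h).permute(2, 0, 3, 1, 4)
    q, k = qkv[0], qkv[1]                      # each [B, heads, N, head_dim]
    attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
    attn = attn.softmax(dim=-1)
    return attn[0, :, 0, 1:]                   # [CLS] row, dropping its self-attention entry
```

Reshaping the returned map to the patch grid and overlaying it on the input image yields attention visualizations of the kind referenced above.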

For more detailed analysis and experiments, please read our paper!

BibTeX

@article{khanna2024explora,
  title={ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts},
  author={Khanna, Samar and Irgau, Medhanie and Lobell, David B and Ermon, Stefano},
  journal={arXiv preprint arXiv:2406.10973},
  year={2024}
}