ExPLoRA: Parameter-Efficient Extended Pre-training to Adapt Vision Transformers under Domain Shifts

Stanford University
*Correspondence to samarkhanna [at] cs.stanford.edu.

ExPLoRA creates state-of-the-art foundation models for new domains by extending unsupervised pre-training of ViTs (like DinoV2 and MAE) in a parameter-efficient manner.

Abstract

Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question in PEFT is extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on that domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights from large natural-image datasets, such as those of DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model on this new domain for supervised learning, using only LoRA. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-trained and fine-tuned ViTs. Using the DinoV2 training objective, we demonstrate up to a 7.5% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the parameters used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT and simply unfreezing more ViT blocks.

Motivation

ExPLoRA creates effective visual foundation models for new domains inexpensively, given existing pre-trained weights.

Consider two fairly different image domains, DS and DT (such as natural images vs. satellite images). On the left, the traditional approach is to pre-train foundation models from scratch for each domain, yielding weights WDS and WDT. Then, these weights are fine-tuned via supervised learning on each target dataset i to yield weights Δsi and Δti for domains DS and DT, respectively. Pre-training for each new domain is very expensive and can require large amounts of compute and data.

ExPLoRA challenges this paradigm. On the right, our method initializes with WDS and learns unsupervised weights ΔDT for domain DT in a parameter-efficient manner. These new weights ΔDT are then used for fine-tuning on specific datasets ti, resulting in even better downstream performance than WDT.

Our key insight is to find the right combination of parameter-efficient methods that works for unsupervised pre-training, rather than relying on PEFT methods designed for traditional supervised fine-tuning.

Method

ExPLoRA works on a ViT with L layers as follows:

  1. Initialize a frozen ViT with pre-trained weights WDS from source domains DS (e.g., DinoV2 or MAE weights from natural images).
  2. Unfreeze all parameters of a subset U of the L ViT blocks (usually just 1 or 2 blocks).
  3. Apply LoRA with rank r to the Q and V projection matrices in the attention layers of the remaining L - |U| frozen blocks.
  4. Train these unfrozen parameters (collectively denoted ΔDT) on an unlabeled dataset XDT from the target domain DT, using the same unsupervised objective that was used to obtain WDS (e.g., DinoV2 or MAE).

The output of this process is a new pre-trained foundation model for the target domain DT, which can then be used for feature extraction or for further fine-tuning on downstream tasks!
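Below is a minimal PyTorch sketch of steps 1-3, assuming a ViT whose blocks live in `vit.blocks` and whose attention modules expose separate `q`, `k`, and `v` linear projections; `LoRALinear`, `configure_explora`, and these attribute names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init => no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

def configure_explora(vit: nn.Module, unfrozen_blocks=(-1,), rank: int = 8):
    """Freeze the ViT, fully unfreeze a few blocks, and add LoRA to Q/V everywhere else."""
    for p in vit.parameters():                 # step 1: start from frozen pre-trained weights W_DS
        p.requires_grad = False

    n = len(vit.blocks)
    unfrozen = {i % n for i in unfrozen_blocks}
    for i, block in enumerate(vit.blocks):
        if i in unfrozen:
            for p in block.parameters():       # step 2: fully train this block
                p.requires_grad = True
        else:                                  # step 3: low-rank updates on Q and V projections
            block.attn.q = LoRALinear(block.attn.q, rank)
            block.attn.v = LoRALinear(block.attn.v, rank)
    return vit
```

The trainable parameters (the unfrozen blocks plus the LoRA matrices, collectively ΔDT) are then optimized with the original unsupervised objective (e.g., the DinoV2 or MAE loss) on unlabeled target-domain images, corresponding to step 4 above.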

Since we only train a small fraction (5-10%) of the original ViT weights, ExPLoRA can create powerful foundation models for new domains using only 4 A4000-16GB GPUs! For comparison, pre-training a ViT-L DinoV2 from scratch required 96 A100-80GB GPUs!

Results

Using ExPLoRA, we are able to create state-of-the-art foundation models for satellite images, outperforming all prior fully pre-trained models on the fMoW-RGB benchmark.

| Method        | Backbone | #Pre-train Params | Top-1 Acc. (%) |
|---------------|----------|-------------------|----------------|
| SatMAE        | ViT-L    | 303M              | 65.94          |
| ScaleMAE      | ViT-B    | 86M               | 67.30          |
| CrossScaleMAE | ViT-B    | 86M               | 69.20          |
| DinoV2        | ViT-L    | -                 | 69.00          |
| Ours          | ViT-B    | 9M                | 75.11          |
| Ours          | ViT-L    | 18M               | 77.48          |

Table 1: Linear probing results on fMoW-RGB.
Here, we show a sneak peek of our results on the fMoW-RGB validation set. In Table 1, #Pre-train Params refers to the number of trainable parameters used on the new domain (i.e., satellite images from fMoW-RGB). Our ExPLoRA-tuned ViT learns strong unsupervised representations: with linear probing, we achieve a large 8.28% improvement in top-1 accuracy over prior state-of-the-art fully pre-trained backbones, while using a fraction of the ViT parameters.

For further results on satellite image datasets and on the WILDS benchmark, please read our paper!

Analysis

A key design choice of ExPLoRA is to fully train a small subset of the ViT layers (i.e., the unfrozen blocks) while applying low-rank updates to the remaining frozen layers. Why is this combination so effective? We conduct experiments to analyze the output feature maps of each ViT block on the target domain (a minimal sketch of these measurements follows the list below). On these feature maps, we:

  1. Apply PCA to calculate the mean and variance of the eigenvalues
  2. Train linear classifiers to predict the relative position of each patch
  3. Train linear classifiers to predict the image class from each patch
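The following sketch illustrates these three measurements for a single block, assuming a hypothetical `features` matrix of per-patch embeddings extracted from that block along with each patch's grid position and its image's class label; the function name and inputs are illustrative, not the paper's exact protocol.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def analyze_block_features(features: np.ndarray,
                           patch_positions: np.ndarray,
                           image_labels: np.ndarray):
    """features: [N, D] patch embeddings from one ViT block.
    patch_positions: [N] integer grid index of each patch.
    image_labels: [N] class label of the image each patch came from."""
    # 1. Spectral statistics of the block's feature map via PCA.
    pca = PCA().fit(features)
    eig_mean = pca.explained_variance_.mean()
    eig_var = pca.explained_variance_.var()

    # 2. Linear probe for local information: predict each patch's position.
    pos_probe = LogisticRegression(max_iter=1000).fit(features, patch_positions)
    pos_acc = pos_probe.score(features, patch_positions)

    # 3. Linear probe for global information: predict the image class from each patch.
    cls_probe = LogisticRegression(max_iter=1000).fit(features, image_labels)
    cls_acc = cls_probe.score(features, image_labels)
    return eig_mean, eig_var, pos_acc, cls_acc
```

In practice the probes would be fit on a training split and evaluated on held-out images; this sketch only shows what each of the three quantities measures.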

Figure: Mean eigenvalues per block (left), linear probing for patch position (middle), and linear probing for image class (right).

These results show that the spectral properties (mean eigenvalues) of a block's feature map and its ability to retrieve local information (like texture or patch position) are correlated. The middle layers of the Dino models are responsible for extracting local information from the input image patches, as their feature-map eigenvalues and localization accuracies are high (left and middle figures). Conversely, the final layers of the ViT are responsible for semantic understanding, as their feature maps contain more global information, reflected in higher image-class accuracies (rightmost figure).

ExPLoRA amplifies the global information stored in each block's feature map towards the final layers of the ViT, while preserving strong localization in the middle layers. This can also be seen in the attention maps below, as ExPLoRA's attention highlights the central object more clearly.
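As an illustration, here is a minimal sketch of how one might extract the [CLS]-token attention map from the final block of a DINO-style ViT for such a visualization; `prepare_tokens`, the fused `qkv` projection, and the absence of register tokens are assumptions that may differ from a given implementation.

```python
import torch

@torch.no_grad()
def cls_attention_map(vit, image: torch.Tensor) -> torch.Tensor:
    """image: [1, 3, H, W]. Returns [num_heads, num_patches] attention of [CLS] over patches."""
    tokens = vit.prepare_tokens(image)         # patchify + prepend [CLS] + positional embeddings
    for block in vit.blocks[:-1]:              # run all but the last transformer block
        tokens = block(tokens)

    last = vit.blocks[-1]
    x = last.norm1(tokens)
    B, N, C = x.shape
    h = last.attn.num_heads
    qkv = last.attn.qkv(x).reshape(B, N, 3, h, C // h).permute(2, 0, 3, 1, 4)
    q, k = qkv[0], qkv[1]                      # each [B, heads, N, head_dim]
    attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
    attn = attn.softmax(dim=-1)
    return attn[0, :, 0, 1:]                   # [CLS] row, dropping its self-attention entry
```

Reshaping the returned map to the patch grid and overlaying it on the input image yields attention visualizations of the kind referenced above.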

For more detailed analysis and experiments, please read our paper!

BibTeX

@article{khanna2024explora,
  title={ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts},
  author={Khanna, Samar and Irgau, Medhanie and Lobell, David B and Ermon, Stefano},
  journal={arXiv preprint arXiv:2406.10973},
  year={2024}
}