ExPLoRA: Parameter-Efficient Extended Pre-training to Adapt Vision Transformers under Domain Shifts

Stanford University
*Correspondence to samarkhanna [at] cs.stanford.edu.

ExPLoRA creates state-of-the-art foundation models for new domains by extending unsupervised pre-training of ViTs (like DinoV2 and MAE) in a parameter-efficient manner.

Abstract

Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question in PEFT is extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on that domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights from large, natural-image datasets, such as those of DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model on this new domain for supervised learning, using only LoRA. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-trained and fine-tuned ViTs. Using the DinoV2 training objective, we demonstrate up to an 8% improvement in linear-probing top-1 accuracy on downstream tasks while using <10% of the parameters used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, such as PEFT alone. Code is available at https://github.com/samar-khanna/ExPLoRA.

Motivation

Pre-training foundation models such as DinoV2 or MAE on large, natural-image datasets is very expensive. For example, pre-training a ViT-L DinoV2 from scratch required 96 A100-80GB GPUs! Supervised learning via parameter-efficient fine-tuning (PEFT) updates a small fraction of a model's parameters to generalize pre-trained weights to downstream tasks. This paradigm works well when the downstream dataset is sufficiently in-distribution with respect to the pre-training data. However, PEFT struggles to adapt models to downstream datasets that exhibit a large domain gap from natural images, such as satellite or medical imagery.

To address large domain gaps, prevailing approaches (e.g., SatMAE, ScaleMAE, RayDINO) spend similarly large amounts of compute to pre-train new foundation models from scratch on the new domain. This approach is expensive, difficult to scale to each new domain, and does not make use of the rich semantic information already captured in weights pre-trained on natural images.

Instead, can we build foundation models for new domains without pre-training from scratch? We propose ExPLoRA, which performs parameter-efficient extended pre-training on unlabeled images from the new domain. Our approach has two key benefits: it is computationally inexpensive, and it leverages knowledge transfer from natural images.

Method

ExPLoRA works on a ViT with L layers as follows (see also our top-most figure):

  1. Initialize a frozen ViT with pre-trained weights WS from source domains S (e.g., DinoV2 or MAE weights from natural images).
  2. Unfreeze all parameters of a subset U of the L ViT blocks (usually just 1 or 2 blocks).
  3. Apply LoRA with rank r to the Q and V weight matrices in the attention layers of the remaining L - |U| frozen blocks.
  4. Train the unfrozen parameters (collectively denoted ΔT) on an unlabeled dataset XT from the target domain T, using the same unsupervised objective that produced WS (e.g., DinoV2 or MAE).

The output of this process is a new pre-trained foundation model for the target domain T, which can then be used for feature extraction or for further fine-tuning on downstream tasks!
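
To make the procedure concrete, here is a minimal PyTorch sketch of the ExPLoRA setup (illustrative, not the exact code from our repository). It assumes a timm-style ViT whose blocks expose a fused attn.qkv linear layer; the names LoRAQKV and apply_explora are assumptions made for this sketch.

```python
import torch
import torch.nn as nn


class LoRAQKV(nn.Module):
    """Wrap a frozen fused qkv projection and add low-rank (LoRA) updates
    to the Q and V slices only; K is left untouched."""

    def __init__(self, qkv: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.qkv = qkv                      # frozen pre-trained projection
        self.dim = qkv.in_features
        self.scale = alpha / rank
        self.q_a = nn.Linear(self.dim, rank, bias=False)
        self.q_b = nn.Linear(rank, self.dim, bias=False)
        self.v_a = nn.Linear(self.dim, rank, bias=False)
        self.v_b = nn.Linear(rank, self.dim, bias=False)
        nn.init.zeros_(self.q_b.weight)     # LoRA update starts as a no-op
        nn.init.zeros_(self.v_b.weight)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q + self.scale * self.q_b(self.q_a(x))
        v = v + self.scale * self.v_b(self.v_a(x))
        return torch.cat([q, k, v], dim=-1)


def apply_explora(vit: nn.Module, unfrozen_blocks=(-1,), rank: int = 8):
    """Freeze the ViT, fully unfreeze the chosen blocks (the subset U),
    and attach LoRA to the Q/V projections of every other block."""
    for p in vit.parameters():
        p.requires_grad = False
    unfrozen = {i % len(vit.blocks) for i in unfrozen_blocks}
    for i, block in enumerate(vit.blocks):
        if i in unfrozen:
            for p in block.parameters():
                p.requires_grad = True      # fully trained block(s)
        else:
            # New LoRA parameters are trainable by default; the wrapped qkv stays frozen.
            block.attn.qkv = LoRAQKV(block.attn.qkv, rank=rank)
    return vit
```

Pre-training then continues with the same DinoV2 or MAE objective on unlabeled target-domain images, so that only the LoRA matrices and the unfrozen block(s) receive gradients.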

Since we only train a small fraction (5-10%) of the original ViT weights, ExPLoRA can create powerful foundation models for new domains using only 4 A4000-16GB GPUs! For comparison, pre-training a ViT-L DinoV2 from scratch required 96 A100-80GB GPUs!

Results

Using ExPLoRA, we are able to create state-of-the-art foundation models for satellite images, outperforming all prior fully pre-trained models on the fMoW-RGB benchmark.


Figure 1: Fine-tuning results on fMoW-RGB.

In this figure, we demonstrate ExPLoRA's efficiency and performance on satellite image classification on the fMoW-RGB validation set. ExPLoRA uses 8x-10x less compute (in GPU-hours) than full pre-training methods and achieves a new state-of-the-art of 79.3% top-1 accuracy when pre-trained with the DinoV2 objective. Here, (LoRA-r8) denotes supervised fine-tuning of a pre-trained ViT with LoRA of rank 8.


Method         Backbone   #Pre-train Params   Top-1 Acc. (%)
SatMAE         ViT-L      303M                65.94
ScaleMAE       ViT-B      86M                 67.30
CrossScaleMAE  ViT-B      86M                 69.20
DinoV2         ViT-L      -                   69.00
Ours           ViT-B      9M                  75.11
Ours           ViT-L      18M                 77.48
Table 1: Linear probing results on fMoW-RGB.

We also show that our ExPLoRA-tuned ViT learns strong unsupervised representations: with linear probing, we achieve a large 8.28% improvement in top-1 accuracy over prior state-of-the-art fully pre-trained backbones, while using only a fraction of the ViT parameters. In Table 1, #Pre-train Params refers to the number of parameters trained on the new domain (i.e., fMoW-RGB satellite images).
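
As a rough sketch of the linear-probing protocol (illustrative, not our exact evaluation code): the ExPLoRA-tuned backbone is frozen and only a linear classifier is trained on its output features. This assumes the backbone's forward pass returns a (B, D) feature vector, as with a DinoV2-style encoder; the name linear_probe is an assumption for this sketch.

```python
import torch
import torch.nn as nn


def linear_probe(backbone, train_loader, num_classes, epochs=10, lr=1e-3, device="cuda"):
    """Freeze the backbone and train only a linear head on its features
    (standard linear-probing protocol; assumes backbone(images) -> (B, D))."""
    backbone.eval().to(device)
    for p in backbone.parameters():
        p.requires_grad = False

    images, _ = next(iter(train_loader))                # infer the feature dimension
    feat_dim = backbone(images.to(device)).shape[-1]
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = backbone(images)                # frozen features
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```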

For further results on satellite image datasets and on the WILDS benchmark, please read our paper!

Analysis

A key design choice of ExPLoRA is to fully train a small subset of the ViT blocks (i.e., the unfrozen blocks) while applying low-rank updates to the remaining frozen blocks. Why is this combination so effective? To find out, we analyze the output feature maps of each ViT block on the target domain. On these feature maps, we carry out three analyses (a minimal code sketch follows the list):

  1. Apply PCA to calculate the mean and variance of the eigenvalues
  2. Train linear classifiers to predict the relative position of each patch
  3. Train linear classifiers to predict the image class from each patch
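
Below is a minimal sketch of the eigenvalue analysis (illustrative; the hook-based feature capture and the function name are assumptions, not our exact code). It only assumes the ViT exposes its transformer blocks as vit.blocks, as in timm- or DINO-style implementations.

```python
import torch


@torch.no_grad()
def blockwise_eigenvalue_stats(vit, images):
    """Capture each block's output tokens with forward hooks and compute the
    mean and variance of the PCA eigenvalues of that block's feature map."""
    captured = []
    hooks = [blk.register_forward_hook(lambda m, inp, out: captured.append(out))
             for blk in vit.blocks]
    vit(images)                                  # one forward pass fills `captured`
    for h in hooks:
        h.remove()

    stats = []
    for tokens in captured:                      # tokens: (B, N, D), including [CLS]
        feats = tokens.flatten(0, 1).float()     # flatten batch and token dims
        feats = feats - feats.mean(dim=0, keepdim=True)
        s = torch.linalg.svdvals(feats)          # PCA via SVD of the centered features
        eig = s ** 2 / (feats.shape[0] - 1)      # eigenvalues of the covariance matrix
        stats.append((eig.mean().item(), eig.var().item()))
    return stats
```

For the position and class probes (items 2 and 3), a linear classifier is trained on each block's patch tokens, in the same spirit as the linear-probing sketch above.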

Figure: per-block mean PCA eigenvalues (top row), linear probing accuracy for patch position (bottom left), and linear probing accuracy for image class (bottom right).

These results show that the spectral properties (mean eigenvalues) of a block's feature map are correlated with its ability to retrieve local information (such as texture or patch position). The middle layers of the Dino models are responsible for extracting local information from the input image patches, as both their feature-map eigenvalues and their localization accuracies are high (top-left and bottom-left plots). Conversely, the final layers of the ViT are responsible for semantic understanding: their feature maps contain more global information, as seen in the higher image-class accuracies (bottom right).

ExPLoRA amplifies the global information stored in each block's feature map towards the final layers of the ViT, while preserving strong localization in the middle layers. This can also be seen in the attention maps below, as ExPLoRA's attention highlights the central object more clearly.

For more detailed analysis and experiments, please read our paper!

BibTeX

@inproceedings{khanna2025explora,
  title={Ex{PL}o{RA}: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts},
  author={Samar Khanna and Medhanie Irgau and David B. Lobell and Stefano Ermon},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=OtxLhobhwb}
}