Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question in PEFT is extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on that domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with weights pre-trained on large natural-image datasets, such as those from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model on this new domain with LoRA alone for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-trained and fine-tuned ViTs. Using the DinoV2 training objective, we demonstrate up to a 7.5% improvement in linear-probing top-1 accuracy on downstream tasks while using <10% of the parameters used by prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT methods and unfreezing more ViT blocks.
ExPLoRA creates effective visual foundation models for new domains inexpensively, given existing pre-trained weights.
Consider two fairly different image domains, D_S and D_T (such as natural images vs. satellite images). On the left, the traditional approach is to pre-train foundation models from scratch for each domain, yielding weights W_DS and W_DT. These weights are then fine-tuned via supervised learning on target datasets i to yield weights Δ_si and Δ_ti for domains D_S and D_T, respectively. Pre-training for each new domain is very expensive and can require large amounts of compute and data.
ExPLoRA challenges this paradigm. On the right, our method initializes with W_DS and learns unsupervised weights Δ_DT for domain D_T in a parameter-efficient manner. These new weights Δ_DT are then used for fine-tuning on specific datasets t_i, resulting in even better downstream performance than W_DT.
Our key insight is to find the right combination of parameter-efficient methods that works for unsupervised pre-training, rather than relying on PEFT methods designed for traditional supervised fine-tuning.
ExPLoRA works on a ViT with L layers as follows (a code sketch of the parameter selection is shown after the list):
1. Initialize the ViT with existing pre-trained weights, e.g., from DinoV2 or MAE trained on natural images.
2. Unfreeze 1-2 pre-trained ViT blocks and apply LoRA to the remaining frozen layers.
3. Continue the unsupervised pre-training objective on the new domain, updating only the unfrozen blocks and the LoRA weights.
4. Fine-tune the resulting model on the new domain with LoRA alone for supervised downstream tasks.
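To make the recipe concrete, here is a minimal PyTorch sketch of the parameter selection. It is an illustration, not the official implementation: `LoRALinear` and `prepare_explora` are hypothetical helpers, it assumes a timm-style ViT whose transformer blocks live in `model.blocks` with `nn.Linear` attention projections at `attn.qkv` / `attn.proj`, and applying LoRA to the fused qkv and output projections is a simplification of the paper's exact placement.

```python
# Minimal sketch of ExPLoRA-style parameter selection (hypothetical helper names;
# assumes a timm-style ViT whose transformer blocks live in `model.blocks` and
# whose attention projections are nn.Linear modules at `attn.qkv` / `attn.proj`).
import math
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # keep the pre-trained weight frozen
        self.lora_A = nn.Parameter(torch.zeros(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))  # B = 0, so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)


def prepare_explora(model: nn.Module, unfrozen_blocks=(-1,), r: int = 8):
    """Freeze the ViT, fully unfreeze the chosen block(s), and add LoRA everywhere else."""
    for p in model.parameters():
        p.requires_grad = False

    n = len(model.blocks)
    unfrozen = {i % n for i in unfrozen_blocks}                # e.g. (-1,) -> last block only
    for i, block in enumerate(model.blocks):
        if i in unfrozen:
            for p in block.parameters():
                p.requires_grad = True                         # this block is trained in full
        else:
            block.attn.qkv = LoRALinear(block.attn.qkv, r=r)
            block.attn.proj = LoRALinear(block.attn.proj, r=r)
    return model
```

With a DinoV2- or MAE-initialized backbone (e.g., loaded through timm with pre-trained weights), `prepare_explora` leaves only the unfrozen block(s) and the LoRA matrices trainable; the extended self-supervised objective is then optimized over just those parameters before the supervised, LoRA-only fine-tuning stage.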
Since we only train a small fraction (5-10%) of the original ViT weights, ExPLoRA can create powerful foundation models for new domains using only 4 A4000-16GB GPUs! For comparison, pre-training a ViT-L DinoV2 from scratch required 96 A100-80GB GPUs!
Using ExPLoRA, we are able to create state-of-the-art foundation models for satellite images, outperforming all prior fully pre-trained models on the fMoW-RGB benchmark.
Method | Backbone | # Pre-train Params | Top-1 Acc. (%) |
---|---|---|---|
SatMAE | ViT-L | 303M | 65.94 |
ScaleMAE | ViT-B | 86M | 67.30 |
CrossScaleMAE | ViT-B | 86M | 69.20 |
DinoV2 | ViT-L | - | 69.00 |
Ours | ViT-B | 9M | 75.11 |
Ours | ViT-L | 18M | 77.48 |
For further results on satellite image datasets and on the WILDS benchmark, please read our paper!
A key design choice of ExPLoRA is to fully train a small subset of the ViT layers (i.e., the unfrozen blocks) while applying low-rank updates to the remaining frozen layers. Why is this combination so effective? We conduct experiments analyzing the output feature map of each ViT block on the target domain. For each block's feature map, we:
- measure its spectral properties (the mean eigenvalue of the feature covariance),
- test how well local information (e.g., texture or patch position) can be recovered from it, and
- test how well global, image-level class information can be recovered from it.
These results show that the spectral properties (mean eigenvalues) of a block's feature map are correlated with its ability to retrieve local information (such as texture or patch position). The middle layers of the Dino models are responsible for extracting local information from the input image patches, as both their feature-map eigenvalues and localization accuracies are high (left and middle figures). Conversely, the final layers of the ViT are responsible for semantic understanding: their feature maps contain more global information, as seen in the higher image-class accuracies (rightmost figure).
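As a rough illustration of how such a probe can be set up, here is a minimal PyTorch sketch that summarizes each block's feature map by the mean eigenvalue of its token covariance. This is our reading of the measurement above, not the paper's exact code: `mean_eigenvalue_per_block` is a hypothetical helper and a timm-style `model.blocks` layout is assumed.

```python
# Sketch of a per-block spectral probe in the spirit of the analysis above
# (hypothetical helper name; assumes a timm-style ViT with a `blocks` list whose
# outputs are token-feature tensors of shape (batch, tokens, dim)).
import torch


@torch.no_grad()
def mean_eigenvalue_per_block(model, images):
    """Return {block index: mean eigenvalue of the token-feature covariance} for one batch."""
    feats = {}

    def make_hook(i):
        def hook(module, inputs, output):
            feats[i] = output.detach()                     # capture this block's feature map
        return hook

    handles = [blk.register_forward_hook(make_hook(i)) for i, blk in enumerate(model.blocks)]
    model(images)                                          # one forward pass fills `feats`
    for h in handles:
        h.remove()

    results = {}
    for i, f in feats.items():
        tokens = f.flatten(0, 1).float()                   # (batch * tokens, dim)
        tokens = tokens - tokens.mean(dim=0)               # center the features
        cov = tokens.T @ tokens / (tokens.shape[0] - 1)    # (dim, dim) covariance
        results[i] = torch.linalg.eigvalsh(cov).mean().item()
    return results
```

Local and global retrieval can then be probed on the same captured features, e.g., with linear heads predicting patch position versus image class.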
ExPLoRA amplifies the global information stored in each block's feature map towards the final layers of the ViT, while preserving strong localization in the middle layers. This can also be seen in the attention maps below, as ExPLoRA's attention highlights the central object more clearly.
For more detailed analysis and experiments, please read our paper!
@article{khanna2024explora,
title={ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts},
author={Khanna, Samar and Irgau, Medhanie and Lobell, David B and Ermon, Stefano},
journal={arXiv preprint arXiv:2406.10973},
year={2024}
}