[ICASSP 2025] Open-source code for the paper "Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification"
Welcome to the GitHub repository for Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification.
Authors:
K. El Khoury*, M. Zanella*, B. Gérin*, T. Godelaine*, B. Macq, S. Mahmoudi, C. De Vleeschouwer, I. Ben Ayed
*Denotes equal contribution
- Paper accepted to ICASSP 2025. [December 20, 2024]
- Paper uploaded on arXiv. [September 1, 2024]
We introduce RS-TransCLIP, a transductive approach inspired by TransCLIP, that enhances Remote Sensing Vision-Language Models without requiring any labels, incurring only a negligible computational cost on the overall inference time.

Figure 1: Top-1 accuracy of RS-TransCLIP, on ViT-L/14 Remote Sensing Vision-Language Models, for zero-shot scene classification across 10 benchmark datasets.
NB: the Python version used is 3.10.12.
Create a virtual environment and activate it:
# Example using the virtualenv package on linux
python3 -m pip install --user virtualenv
python3 -m virtualenv RS-TransCLIP-venv
source RS-TransCLIP-venv/bin/activate.csh
Install PyTorch:
pip3 install torch==2.2.2 torchaudio==2.2.2 torchvision==0.17.2
Clone the GitHub repository and move to the appropriate directory:
git clone https://github.com/elkhouryk/RS-TransCLIP
cd RS-TransCLIP
Install the remaining Python package requirements:
pip3 install -r requirements.txt
You are ready to start! 🎉
10 Remote Sensing Scene Classification datasets are already available for evaluation:
The WHURS19 dataset is already uploaded to the repository for reference and can be used directly.
The following 6 datasets (EuroSAT, OPTIMAL31, PatternNet, RESISC45, RSC11, RSICB256) will be automatically downloaded and formatted from Hugging Face using the run_dataset_download.py script.
# <dataset_name> can take the following values: EuroSAT, OPTIMAL31, PatternNet, RESISC45, RSC11, RSICB256
python3 run_dataset_download.py --dataset_name <dataset_name>
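As a rough illustration of what this step does (the Hugging Face repo id and the "image"/"label" field names below are placeholders, not the script's actual values), a download pass essentially pulls the images and writes them into the per-dataset layout described next:

```python
# Illustrative sketch only -- run_dataset_download.py implements the real logic.
# The Hugging Face repo id ("your-namespace/EuroSAT") and the "image"/"label"
# field names are hypothetical placeholders.
from pathlib import Path
from datasets import load_dataset

dataset_name = "EuroSAT"
out_dir = Path("datasets") / dataset_name / "images"
out_dir.mkdir(parents=True, exist_ok=True)

ds = load_dataset("your-namespace/EuroSAT", split="train")  # hypothetical repo id
class_names = ds.features["label"].names

counters = {}
for example in ds:
    cls = class_names[example["label"]].replace(" ", "")
    counters[cls] = counters.get(cls, 0) + 1
    example["image"].convert("RGB").save(out_dir / f"{cls}_{counters[cls]}.jpg")

# classes.txt lists one class name per line
(Path("datasets") / dataset_name / "classes.txt").write_text("\n".join(class_names))
```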
Dataset directory structure should be as follows:
$datasets/
└── <dataset_name>/
    └── classes.txt
    └── class_changes.txt
    └── images/
        └── <classname>_<id>.jpg
        └── ...
- You must download the AID, MLRSNet and RSICB128 datasets manually from Kaggle and place them in the '/datasets/' directory. You can either format them manually to follow the dataset directory structure listed above and use them for evaluation, OR use the run_dataset_formatting.py script by placing the .zip files from Kaggle in the '/datasets/' directory (a quick sanity check for manual formatting is sketched after the download links below).
# <dataset_name> can take the following values: AID, MLRSNet, RSICB128
python3 run_dataset_formatting.py --dataset_name <dataset_name>
- Download links: AID | RSICB128 | MLRSNet
- NB: On the Kaggle website, click on the download arrow in the center of the page instead of the Download button to preserve the data structure needed by the run_dataset_formatting.py script (see the figure below).
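If you format a Kaggle dataset by hand, a quick check like the following (a minimal sketch that only assumes the directory layout shown above) lets you compare the class names encoded in the image filenames against classes.txt:

```python
# Minimal sanity check for a manually formatted dataset; assumes the
# datasets/<dataset_name>/ layout shown above (classes.txt + images/<classname>_<id>.jpg).
from pathlib import Path

dataset_dir = Path("datasets/AID")  # example: a manually formatted Kaggle dataset

listed = sorted(dataset_dir.joinpath("classes.txt").read_text().splitlines())
found = sorted({p.stem.rsplit("_", 1)[0]  # strip the trailing _<id>
                for p in dataset_dir.joinpath("images").glob("*.jpg")})

print("classes listed in classes.txt:", listed)
print("classes found in filenames:  ", found)
```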
Notes:
- The class_changes.txt file inserts a space between combined class names. For example, the class name "railwaystation" becomes "railway station." This change is applied consistently across all datasets (see the small sketch after these notes).
- The WHURS19 dataset is already uploaded to the repository for reference.
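As a small illustration of the renaming described above (the dictionary below is hypothetical; the actual mapping is read from each dataset's class_changes.txt):

```python
# Hypothetical illustration of the class-name changes described in the notes;
# only the "railwaystation" -> "railway station" example comes from this README.
class_changes = {"railwaystation": "railway station"}

def readable_class_name(raw_name: str) -> str:
    """Return the human-readable class name used when building text prompts."""
    return class_changes.get(raw_name, raw_name)

print(readable_class_name("railwaystation"))  # -> "railway station"
```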
Running RS-TransCLIP consists of three major steps:
- Generating Image and Text Embeddings
- Generating the Average Text Embedding
- Running Transductive Zero-Shot Classification
We consider 10 scene classification datasets (AID, EuroSAT, MLRSNet, OPTIMAL31, PatternNet, RESISC45, RSC11, RSICB128, RSICB256, WHURS19), 4 VLMs (CLIP, GeoRSCLIP, RemoteCLIP, SkyCLIP50) and 4 model architectures (RN50, ViT-B-32, ViT-L-14, ViT-H-14) for our experiments.
To generate Image embeddings for each dataset/VLM/architecture trio:
python3 run_featuregeneration.py --image_fg
To generate Text embeddings for each dataset/VLM/architecture trio:
python3 run_featuregeneration.py --text_fg
All results for each dataset/VLM/architecture trio will be stored as follows:
$results/
└── <dataset_name>/
    └── <model_name>/
        └── <model_architecture>/
            └── images.pt
            └── classes.pt
            └── texts_<prompt1>.pt
            └── ...
            └── texts_<prompt106>.pt
Notes:
- Text feature generation produces 106 individual text embeddings for each VLM/dataset combination; the exhaustive list of all text prompts can be found in run_featuregeneration.py.
- When generating Image embeddings, the run_featuregeneration.py script will also generate the ground truth labels and store them in "classes.pt". These labels will be used for evaluation.
- Please refer to run_featuregeneration.py to control all the respective arguments.
- The embeddings for the WHURS19 dataset are already uploaded to the repository for reference.
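For intuition, here is a rough sketch of what a single dataset/VLM/architecture pass amounts to, using plain open_clip with the OpenAI ViT-L-14 weights. The prompt template, file paths, and label parsing are simplified assumptions; run_featuregeneration.py handles the remote-sensing VLM checkpoints and the full list of 106 prompts.

```python
# Simplified sketch of image/text feature generation (see run_featuregeneration.py
# for the actual implementation and the RS-specific VLM weights).
from pathlib import Path
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model = model.to(device).eval()

dataset_dir = Path("datasets/WHURS19")
classes = dataset_dir.joinpath("classes.txt").read_text().splitlines()

image_feats, labels = [], []
with torch.no_grad():
    for path in sorted(dataset_dir.joinpath("images").glob("*.jpg")):
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        image_feats.append(model.encode_image(image).squeeze(0).cpu())
        labels.append(classes.index(path.stem.rsplit("_", 1)[0]))  # <classname>_<id>.jpg

    # One example prompt template; the repo sweeps 106 of them.
    tokens = tokenizer([f"a satellite photo of a {c}." for c in classes]).to(device)
    text_feats = model.encode_text(tokens).cpu()

torch.save(torch.stack(image_feats), "images.pt")
torch.save(torch.tensor(labels), "classes.pt")
torch.save(text_feats, "texts_exampleprompt.pt")
```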
To generate the Average Text embedding for each dataset/VLM/architecture trio:
python3 run_averageprompt.py
Notes:
- The run_averageprompt.py script will average out all embeddings with the following name structure "texts_*.pt" for each dataset/VLM/architecture trio and create a file called "texts_averageprompt.pt".
- The Average Text embeddings for the WHURS19 dataset are already uploaded to the repository for reference.
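Conceptually, this step is just a mean over the per-prompt text embeddings. A minimal sketch, assuming the results layout shown above (the real script iterates over every dataset/VLM/architecture trio):

```python
# Minimal sketch of prompt averaging (run_averageprompt.py is the real implementation);
# the trio path below is just an example.
from pathlib import Path
import torch

trio_dir = Path("results/WHURS19/CLIP/ViT-L-14")
prompt_files = [f for f in sorted(trio_dir.glob("texts_*.pt"))
                if f.name != "texts_averageprompt.pt"]  # skip a previous output, if any

# Stack the per-prompt (num_classes, dim) embeddings and average them.
average = torch.stack([torch.load(f) for f in prompt_files]).mean(dim=0)
torch.save(average, trio_dir / "texts_averageprompt.pt")
```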
To run Transductive zero-shot classification using RS-TransCLIP:
python3 run_TransCLIP.py
Notes:
- The run_TransCLIP.py script will use the Image embeddings "images.pt", the Average Text embedding "texts_averageprompt.pt" and the class ground truth labels "classes.pt" to run Transductive zero-shot classification using RS-TransCLIP.
- The run_TransCLIP.py script will also generate the Inductive zero-shot classification for performance comparison.
- Both Inductive and Transductive results will be stored in "results/results_averageprompt.csv".
- The results for the WHURS19 dataset are already uploaded to the repository for reference.
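As a point of reference, the inductive baseline reported alongside RS-TransCLIP boils down to cosine similarity between each image embedding and the average text embedding, followed by an argmax over classes. A minimal sketch using the stored files (paths are illustrative; the transductive refinement itself is left to run_TransCLIP.py):

```python
# Sketch of the inductive zero-shot baseline; RS-TransCLIP's transductive step
# (implemented in run_TransCLIP.py) refines these predictions jointly over the
# whole unlabeled test set.
import torch
import torch.nn.functional as F

trio = "results/WHURS19/CLIP/ViT-L-14"
images = F.normalize(torch.load(f"{trio}/images.pt").float(), dim=-1)
texts = F.normalize(torch.load(f"{trio}/texts_averageprompt.pt").float(), dim=-1)
labels = torch.load(f"{trio}/classes.pt")

logits = images @ texts.T          # (num_images, num_classes) cosine similarities
preds = logits.argmax(dim=-1)
top1 = (preds == labels).float().mean().item()
print(f"Inductive zero-shot top-1 accuracy: {100 * top1:.2f}%")
```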

Table 1: Top-1 accuracy for zero-shot scene classification without (white) and with (blue) RS-TransCLIP on 10 RS datasets.
Support our work by citing our paper if you use this repository:
@inproceedings{el2025enhancing,
  title={Enhancing remote sensing vision-language models for zero-shot scene classification},
  author={El Khoury, Karim and Zanella, Maxime and G{\'e}rin, Beno{\^\i}t and Godelaine, Tiffanie and Macq, Beno{\^\i}t and Mahmoudi, Sa{\"\i}d and De Vleeschouwer, Christophe and Ben Ayed, Ismail},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}
Please also consider citing the original TransCLIP paper:
@article{zanella2024boosting,
  title={Boosting vision-language models with transduction},
  author={Zanella, Maxime and G{\'e}rin, Beno{\^\i}t and Ben Ayed, Ismail},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={62223--62256},
  year={2024}
}
For more details on transductive inference in VLMs, visit the comprehensive TransCLIP repository.
Feel free to open an issue or pull request if you have any questions or suggestions.
You can also contact us by Email:
karim.elkhoury@uclouvain.be
maxime.zanella@uclouvain.be
benoit.gerin@uclouvain.be
tiffanie.godelaine@uclouvain.be