BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
Srikumar Sastry, Subash Khanal, Aayush Dhakal, Di Huang, Nathan Jacobs
WACV 2024
Abstract
We propose a metadata-aware self-supervised learning (SSL) framework useful for fine-grained classification and ecological mapping of bird species around the world. Our framework unifies two SSL strategies, Contrastive Learning (CL) and Masked Image Modeling (MIM), while also enriching the embedding space with metadata available with ground-level imagery of birds. We separately train uni-modal and cross-modal ViTs on a novel cross-view global bird species dataset containing ground-level imagery, metadata (location, time), and corresponding satellite imagery. We demonstrate that our models learn fine-grained and geographically conditioned features of birds by evaluating on two downstream tasks: fine-grained visual classification (FGVC) and cross-modal retrieval. Pre-trained models learned using our framework achieve SotA performance on FGVC of iNAT-2021 birds as well as in transfer learning settings for the CUB-200-2011 and NABirds datasets. Moreover, the impressive cross-modal retrieval performance of our model enables the creation of species distribution maps across any geographic region. The dataset and source code will be released on GitHub.
🦢 Dataset Released: Cross-View iNAT Birds 2021
This cross-view bird species dataset consists of paired ground-level bird images and satellite images, along with the meta-information associated with the iNaturalist-2021 dataset. It can serve as a benchmark for the following tasks:

1. Fine-grained image classification
2. Satellite-to-bird image retrieval
3. Bird-to-satellite image retrieval
4. Geolocalization of bird species
![](https://sites.wustl.edu/srikumarsastry/files/2023/10/data5-2-1-1024x490.png)
Example of bird-to-satellite image retrieval:
![](https://sites.wustl.edu/srikumarsastry/files/2023/10/ret_results-1024x752.jpg)
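As a rough illustration of how bird-to-satellite retrieval can be scored, the sketch below ranks satellite embeddings by cosine similarity to a bird-image embedding and computes recall@k. It is a minimal sketch rather than the released evaluation code: the function names (`cosine`, `retrieve`, `recall_at_k`), the choice of cosine similarity, and the toy embedding values are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(query, gallery, k=1):
    """Return the indices of the k gallery embeddings most similar to the query."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )
    return ranked[:k]

def recall_at_k(queries, gallery, k=1):
    """Fraction of queries whose paired gallery item (same index) is in the top k."""
    hits = sum(1 for i, q in enumerate(queries) if i in retrieve(q, gallery, k))
    return hits / len(queries)

# Hypothetical 3-d embeddings; in practice these would come from the trained
# ground-level and satellite encoders. Row i of each list forms a true pair.
bird_emb = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
sat_emb = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.5, 0.6, 0.1]]

if __name__ == "__main__":
    for k in (1, 3):
        print(f"recall@{k} = {recall_at_k(bird_emb, sat_emb, k):.3f}")
```

Swapping the roles of `bird_emb` and `sat_emb` gives the satellite-to-bird direction under the same assumptions.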
Method
We systematically evaluate several cross-view training strategies for masked autoencoders on ground-level bird images, satellite images, and metadata. We also compare our models against state-of-the-art methods on fine-grained image classification and cross-modal retrieval.
![](https://sites.wustl.edu/srikumarsastry/files/2023/10/arch-2-1024x461.png)
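To make the idea of unifying contrastive learning with masked image modeling concrete, the sketch below adds an InfoNCE-style contrastive term over paired ground-level and satellite embeddings to a reconstruction error computed on masked patches only. It is an illustrative sketch under stated assumptions, not the paper's training objective: the function names, the temperature of 0.1, the `weight` parameter, and the simple sum of the two terms are all hypothetical choices.

```python
import math

def info_nce(ground, sat, temperature=0.1):
    """InfoNCE-style loss: row i of `sat` is the positive for row i of `ground`."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    g = [normalize(v) for v in ground]
    s = [normalize(v) for v in sat]
    loss = 0.0
    for i, gi in enumerate(g):
        logits = [sum(a * b for a, b in zip(gi, sj)) / temperature for sj in s]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(g)

def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only (mask[i] == 1 means hidden)."""
    errors = [
        sum((p - t) ** 2 for p, t in zip(pred[i], target[i])) / len(target[i])
        for i in range(len(target))
        if mask[i] == 1
    ]
    return sum(errors) / len(errors) if errors else 0.0

def combined_loss(ground, sat, pred, target, mask, weight=1.0):
    """Assumed objective: reconstruction term plus a weighted contrastive term."""
    return masked_mse(pred, target, mask) + weight * info_nce(ground, sat)

# Hypothetical toy inputs: two paired embeddings and two 2-pixel patches.
ground = [[1.0, 0.0], [0.0, 1.0]]
sat_aligned = [[1.0, 0.0], [0.0, 1.0]]
sat_swapped = [[0.0, 1.0], [1.0, 0.0]]
pred = [[0.0, 0.0], [1.0, 1.0]]
target = [[5.0, 5.0], [1.0, 3.0]]
mask = [0, 1]  # only the second patch was masked

if __name__ == "__main__":
    print("aligned pairs:", combined_loss(ground, sat_aligned, pred, target, mask))
    print("swapped pairs:", combined_loss(ground, sat_swapped, pred, target, mask))
```

In this toy setup the contrastive term is small when paired embeddings agree and large when the pairs are swapped, while the reconstruction term ignores the unmasked patch entirely.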
Generated Species Distribution Map
![](https://sites.wustl.edu/srikumarsastry/files/2023/10/Screenshot-2023-10-29-at-10.23.15 AM-1024x739.png)