Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li✉️, Xue Yang

Harbin Institute of Technology · Shanghai Jiao Tong University · University of Wollongong
InstructSAM is a training-free framework for Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS). We construct EarthInstruct, an InstructCDS benchmark for remote sensing. The three instruction settings in EarthInstruct are:

- Open-Vocabulary: recognition with user-specified categories (e.g., "soccer field", "football field", "parking lot").
- Open-Ended: recognition of all visible objects without specifying categories.
- Open-Subclass: recognition of objects within a super-category.

InstructSAM addresses InstructCDS by decomposing these tasks into several tractable steps.

- Create a new conda environment. `python>=3.10` and `torch>=2.5.1` are recommended for compatibility with SAM2.

```bash
conda create -n insam python=3.10
conda activate insam
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
```

- Install SAM2.

```bash
cd InstructSAM/third_party/sam2
pip install -e ".[notebooks]"
```

- Install `instruct_sam` and its dependencies.

```bash
cd ../../
pip install -e .
```

For double-blind review, this code only references publicly available third-party resources.
Download the pretrained models to the ./checkpoints directory or any preferred location.
[Important] Store the checkpoint paths of the CLIP models in the CLIP checkpoints config, so that the CLIP model can be switched between variants by changing only the model name.
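As a rough illustration of this lookup (the keys `georsclip` and `checkpoint_path` below are hypothetical; follow whatever schema checkpoints/config.yaml actually defines):

```python
# Sketch: resolve a CLIP checkpoint path by model name from the config.
# Key names are assumptions, not the repository's confirmed schema.
import yaml

with open("checkpoints/config.yaml") as f:
    clip_cfg = yaml.safe_load(f)

model_name = "georsclip"  # switch variants by changing this name
ckpt_path = clip_cfg[model_name]["checkpoint_path"]
print(ckpt_path)
```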
- LVLM Counter

  Download Qwen2.5-VL or use an API request in OpenAI format.

```bash
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct \
--local-dir ./checkpoints/Qwen2.5-VL-7B-Instruct \
--resume-download
```

- Mask Proposer
- CLIP Model
We provide a subset of DIOR with 4 images for quickly going through the entire pipeline. The detailed benchmark category split is stored in the dataset config.
| Dataset | Image | Annotation |
|---|---|---|
| DIOR-mini | datasets/dior/JPEGImages-trainval | dior_mini_ann.json |
| DIOR | Google Drive | dior_val_ann.json |
| NWPU-VHR-10 | OneDrive | nwpu_ann.json |
After initializing the pretrained models, inference can be run in a few lines. See inference_demo.ipynb and the example notebooks for more details.
```python
instruct_sam = InstructSAM()

# 1) Count objects with the LVLM counter
instruct_sam.count_objects(prompt, gpt_model="gpt-4o-2024-11-20", json_output=True)
print(f'response: \n{instruct_sam.response}')
```

```
>>> {
>>>     "car": 23,
>>>     "building": 5,
>>>     "basketball_court": 2,
>>>     "tennis_court": 1
>>> }
```

```python
# 2) Propose class-agnostic masks with SAM2
instruct_sam.segment_anything(mask_generator, max_masks=200)

# 3) Match mask proposals to the counted labels with CLIP
instruct_sam.calculate_pred_text_features(model, tokenizer, use_vocab=False)
instruct_sam.match_boxes_and_labels(model, preprocess, show_similarities=True)

# Visualize the final boxes, labels, and masks
visualize_prediction(instruct_sam.img_array, instruct_sam.boxes_final,
                     instruct_sam.labels_final, instruct_sam.segmentations_final)
```

Please prepare the metadata in COCO annotation format. For unannotated datasets, simply leave the "annotations" field blank. Replace the config files for datasets and CLIP checkpoints with your local paths. These config files are used by default in the following scripts.
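A minimal unannotated metadata file might look like the sketch below. Field names follow the COCO spec; the file name, image entry, and category list are placeholders:

```python
# Minimal COCO-format metadata with an empty "annotations" field.
# All concrete values (paths, ids, category names) are illustrative.
import json

coco_meta = {
    "images": [
        {"id": 1, "file_name": "demo.jpg", "width": 800, "height": 800},
    ],
    "annotations": [],  # leave blank for unannotated datasets
    "categories": [{"id": 1, "name": "airplane"}],
}

with open("my_dataset_ann.json", "w") as f:
    json.dump(coco_meta, f)
```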
First, prepare your prompts in the prompts folder.
The counting result for each image is saved in JSON format, following the structure:

```
{
    "category_name1": number1 (int),
    "category_name2": number2 (int),
    ...
}
```
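Loading a result back is straightforward; a small illustration (the file name below is a hypothetical per-image output):

```python
# Read one per-image counting result and sum the object counts.
import json

with open("object_counts/dior_mini/gpt-4o-2024-11-20_open_vocabulary/example.json") as f:
    counts = json.load(f)  # {"category_name": int, ...}

print(sum(counts.values()), "objects counted in total")
```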
- For API inference, it is recommended to perform asynchronous requests for faster results.

```bash
python inference_tools/async_count.py --dataset_name dior_mini \
--dataset_config datasets/config.json \
--base_url your_base_url \
--api_key your_api_key \
--model gpt-4o-2024-11-20 \
--prompt_path prompts/dior/open_vocabulary.txt
```

Add --skip_existing if requests occasionally fail. The script will skip images that have already been processed.
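Under the hood, one OpenAI-format counting request looks roughly like the sketch below; the prompt text, image encoding, and paths are assumptions, and async_count.py handles batching and retries for you:

```python
# Sketch of a single async OpenAI-format request with an image attached.
# base_url, api_key, and the prompt are placeholders.
import asyncio
import base64
from openai import AsyncOpenAI

async def count_one(image_path: str, prompt: str) -> str:
    client = AsyncOpenAI(base_url="your_base_url", api_key="your_api_key")
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = await client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# print(asyncio.run(count_one("demo.jpg", "Count the objects ...")))
```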
- Alternatively, use a locally deployed LVLM for inference. For example, use `Qwen2.5-VL-7B-Instruct` to count objects.

```bash
python inference_tools/qwen_count.py --dataset_name dior_mini \
--pretrained_model_name_or_path ./checkpoints/Qwen2.5-VL-7B-Instruct \
--prompt_path prompts/dior/open_ended.txt
```

The format of region (mask) proposals (e.g., sam2_hiera_l.json) follows a simple structure. The mask is stored in RLE format and can be decoded to a binary mask using the COCO API.
```
{
    "img_name": {
        "bboxes": [...],
        "labels": ["region_proposals", ...],
        "scores": [...],
        "segmentations": [...]
    },
    "img_name2": {...},
    ...
}
```
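For example, a proposal mask can be decoded with pycocotools ("img_name" below is a placeholder for an actual image key in the file):

```python
# Decode one RLE-encoded proposal mask to a binary numpy array.
import json
from pycocotools import mask as mask_utils

with open("region_proposals/dior_mini/sam2_hiera_l.json") as f:
    proposals = json.load(f)

entry = proposals["img_name"]
rle = entry["segmentations"][0]       # {"size": [H, W], "counts": "..."}
binary_mask = mask_utils.decode(rle)  # (H, W) uint8 array of 0/1
print(binary_mask.shape, binary_mask.sum())
```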
Generate mask proposals with your own checkpoint and config path (note: prepend '/' to an absolute config path):

```bash
python inference_tools/propose_regions.py --dataset_name dior_mini \
--sam2_checkpoint path_to_the_ckpt \
--sam2_cfg path_to_the_cfg
```

The results are saved in COCO format. For the open-ended and open-subclass settings, predictions do not have a "category_id"; instead, the "label" field stores the category name. See the predictions on dior_mini under the open-ended setting as an example.
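Concretely, an open-ended prediction entry looks roughly like this (the keys beyond "label" follow the usual COCO results format; all values are made up):

```python
# Illustrative open-ended prediction; "label" replaces "category_id".
pred = {
    "image_id": 1,
    "bbox": [120.0, 80.0, 45.0, 30.0],  # [x, y, w, h]
    "score": 0.87,
    "label": "vehicle",
    "segmentation": {"size": [800, 800], "counts": "..."},
}
print(pred["label"])
```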
Match mask proposals with the counted labels for each setting:

```bash
# Open-Vocabulary
python inference_tools/mask_label_matching.py --dataset_name dior_mini \
--dataset_config datasets/config.json \
--checkpoint_config checkpoints/config.yaml \
--count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_vocabulary \
--rp_path ./region_proposals/dior_mini/sam2_hiera_l.json \
--clip_model georsclip \
--setting open_vocabulary

# Open-Ended
python inference_tools/mask_label_matching.py --dataset_name dior_mini \
--dataset_config datasets/config.json \
--checkpoint_config checkpoints/config.yaml \
--count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_ended \
--rp_path region_proposals/dior_mini/sam2_hiera_l.json \
--clip_model georsclip \
--setting open_ended

# Open-Subclass
python inference_tools/mask_label_matching.py --dataset_name dior_mini \
--dataset_config datasets/config.json \
--checkpoint_config checkpoints/config.yaml \
--count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_subclass_means_of_transports \
--rp_path region_proposals/dior_mini/sam2_hiera_l.json \
--clip_model georsclip \
--setting open_subclass
```

The IoU threshold to determine whether a predicted box/mask is a true positive (TP) is set to 0.5.
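For reference, box IoU is intersection area over union area; a minimal sketch (not the repository's evaluation code):

```python
# Minimal box IoU in [x1, y1, x2, y2] format; a TP requires IoU >= 0.5.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 = 0.142...
```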
Object counting can be evaluated either via counting results or via detection/segmentation predictions.
- Evaluate using counting results:
```bash
# Open-Vocabulary
python evaluating_tools/eval_counting.py --count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_vocabulary \
--dataset_name dior_mini \
--setting open_vocabulary

# Open-Ended
python evaluating_tools/eval_counting.py --count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_ended \
--dataset_name dior_mini \
--setting open_ended

# Open-Subclass
python evaluating_tools/eval_counting.py --count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_subclass_means_of_transports \
--dataset_name dior_mini \
--setting open_subclass \
--extra_classes means_of_transport
```
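eval_counting.py computes the counting metrics for you; as a loose illustration of comparing predicted counts against ground truth (MAE here is just one common counting metric, not necessarily the script's exact output):

```python
# Illustrative per-category mean absolute error on counts; values are made up.
gt_counts = {"car": 25, "building": 6}
pred_counts = {"car": 23, "building": 5, "tennis_court": 1}

cats = set(gt_counts) | set(pred_counts)
mae = sum(abs(gt_counts.get(c, 0) - pred_counts.get(c, 0)) for c in cats) / len(cats)
print(f"count MAE: {mae:.2f}")  # (2 + 1 + 1) / 3 = 1.33
```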
- Evaluate using recognition results:

```bash
# Open-Vocabulary
python evaluating_tools/eval_counting.py --coco_pred_path results/dior_mini/open_vocabulary/coco_preds/gpt-4o-2024-11-20_open_vocabulary_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_vocabulary

# Open-Ended
python evaluating_tools/eval_counting.py --coco_pred_path results/dior_mini/open_ended/coco_preds/gpt-4o-2024-11-20_open_ended_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_ended

# Open-Subclass
python evaluating_tools/eval_counting.py --coco_pred_path results/dior_mini/open_subclass/coco_preds/gpt-4o-2024-11-20_open_subclass_means_of_transports_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_subclass \
--extra_classes means_of_transport
```

Evaluate the recall of mask proposals:

```bash
python evaluating_tools/eval_proposal_recall.py --mask_proposal region_proposals/dior_mini/sam2_hiera_l.json \
--dataset_name dior_mini
```

Evaluate detection and segmentation results:

```bash
# Open-Vocabulary
python evaluating_tools/eval_recognition.py --predictions results/dior_mini/open_vocabulary/coco_preds/gpt-4o-2024-11-20_open_vocabulary_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_vocabulary \
--extra_class unseen_classes
# Open-Ended
python evaluating_tools/eval_recognition.py --predictions results/dior_mini/open_ended/coco_preds/gpt-4o-2024-11-20_open_ended_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_ended
# Open-Subclass
python evaluating_tools/eval_recognition.py --predictions results/dior_mini/open_subclass/coco_preds/gpt-4o-2024-11-20_open_subclass_means_of_transports_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_subclass \
--extra_class means_of_transport
```

To evaluate methods with confidence scores, the confidence threshold is swept from 0 to 1 (step 0.02). The threshold that maximizes mF1 across categories is selected, and the scores at that threshold are reported.
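A minimal sketch of this sweep, assuming a `compute_mf1(kept)` helper that scores a filtered prediction set against the ground truth (hypothetical; eval_recognition.py implements the real metric):

```python
# Sweep the confidence threshold from 0 to 1 in steps of 0.02 and keep the
# threshold that maximizes mean F1 across categories.
import numpy as np

def sweep_threshold(predictions, compute_mf1):
    best_t, best_mf1 = 0.0, -1.0
    for t in np.arange(0.0, 1.0 + 1e-9, 0.02):
        kept = [p for p in predictions if p["score"] >= t]
        mf1 = compute_mf1(kept)  # hypothetical per-threshold evaluation
        if mf1 > best_mf1:
            best_t, best_mf1 = t, mf1
    return best_t, best_mf1
```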
Add `--score_sweeping` to enable this confidence threshold sweeping:

```bash
python evaluating_tools/eval_recognition.py --predictions results/dior_mini/open_vocabulary/coco_preds/gpt-4o-2024-11-20_open_vocabulary_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_vocabulary \
--extra_class unseen_classes \
--score_sweeping
```

If you find this work useful, please cite:

```bibtex
@inproceedings{zheng2025instructsam,
    title={Instruct{SAM}: A Training-free Framework for Instruction-Oriented Remote Sensing Object Recognition},
    author={Yijie Zheng and Weijie Wu and Qingyun Li and Xuehui Wang and Xu Zhou and Aiai Ren and Jun Shen and Long Zhao and Guoqing Li and Xue Yang},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025}
}
```

