Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li✉️, Xue Yang

Harbin Institute of Technology · Shanghai Jiao Tong University · University of Wollongong
InstructSAM is a training-free framework for Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS). We construct EarthInstruct, an InstructCDS benchmark for remote sensing. The three instruction settings in EarthInstruct are:

- Open-Vocabulary: recognition with user-specified categories (e.g., "soccer field", "football field", "parking lot").
- Open-Ended: recognition of all visible objects without specifying categories.
- Open-Subclass: recognition of objects within a super-category.

InstructSAM addresses InstructCDS by decomposing these tasks into several tractable steps.

- Create a new conda environment. `python>=3.10` and `torch>=2.5.1` are recommended for compatibility with SAM2.

```bash
conda create -n insam python=3.10
conda activate insam
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121
```

- Install SAM2.

```bash
cd InstructSAM/third_party/sam2
pip install -e ".[notebooks]"
```

- Install `instruct_sam` and its dependencies.

```bash
cd ../../
pip install -e .
```

For double-blind review, this code only references publicly available third-party resources.
Download the pretrained models to the ./checkpoints directory or any preferred location.
[Important] Store the checkpoint paths of the CLIP models in the CLIP checkpoints config, so that the CLIP model can be switched between variants by changing only the model name.
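As a rough illustration of this lookup (the keys `georsclip` and `checkpoint_path` below are hypothetical; follow whatever schema checkpoints/config.yaml actually defines):

```python
# Sketch: resolve a CLIP checkpoint path by model name from the config.
# Key names are assumptions, not the repository's confirmed schema.
import yaml

with open("checkpoints/config.yaml") as f:
    clip_cfg = yaml.safe_load(f)

model_name = "georsclip"  # switch variants by changing this name
ckpt_path = clip_cfg[model_name]["checkpoint_path"]
print(ckpt_path)
```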
- LVLM Counter

  Download Qwen2.5-VL or use an API request in OpenAI format.

```bash
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct \
--local-dir ./checkpoints/Qwen2.5-VL-7B-Instruct \
--resume-download
```

- Mask Proposer
- CLIP Model
We provide a subset of DIOR with 4 images for quickly going through the entire pipeline. The detailed benchmark category split is stored in the dataset config.
| Dataset | Image | Annotation |
|---|---|---|
| DIOR-mini | datasets/dior/JPEGImages-trainval | dior_mini_ann.json |
| DIOR | Google Drive | dior_val_ann.json |
| NWPU-VHR-10 | OneDrive | nwpu_ann.json |
After initializing the pretrained models, inference can be run in a few lines. See inference_demo.ipynb and the example notebooks for more details.
```python
instruct_sam = InstructSAM()

# 1) Count objects with the LVLM counter
instruct_sam.count_objects(prompt, gpt_model="gpt-4o-2024-11-20", json_output=True)
print(f'response: \n{instruct_sam.response}')
```

```
>>> {
>>>     "car": 23,
>>>     "building": 5,
>>>     "basketball_court": 2,
>>>     "tennis_court": 1
>>> }
```

```python
# 2) Propose class-agnostic masks with SAM2
instruct_sam.segment_anything(mask_generator, max_masks=200)

# 3) Match mask proposals to the counted labels with CLIP
instruct_sam.calculate_pred_text_features(model, tokenizer, use_vocab=False)
instruct_sam.match_boxes_and_labels(model, preprocess, show_similarities=True)

# Visualize the final boxes, labels, and masks
visualize_prediction(instruct_sam.img_array, instruct_sam.boxes_final,
                     instruct_sam.labels_final, instruct_sam.segmentations_final)
```

Please prepare the metadata in COCO annotation format. For unannotated datasets, simply leave the "annotations" field blank. Replace the config files for datasets and CLIP checkpoints with your local paths. These config files are used by default in the following scripts.
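A minimal unannotated metadata file might look like the sketch below. Field names follow the COCO spec; the file name, image entry, and category list are placeholders:

```python
# Minimal COCO-format metadata with an empty "annotations" field.
# All concrete values (paths, ids, category names) are illustrative.
import json

coco_meta = {
    "images": [
        {"id": 1, "file_name": "demo.jpg", "width": 800, "height": 800},
    ],
    "annotations": [],  # leave blank for unannotated datasets
    "categories": [{"id": 1, "name": "airplane"}],
}

with open("my_dataset_ann.json", "w") as f:
    json.dump(coco_meta, f)
```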
First, prepare your prompts in the prompts folder.
The counting result for each image is saved in JSON format, following the structure:

```
{
    "category_name1": number1 (int),
    "category_name2": number2 (int),
    ...
}
```
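Loading a result back is straightforward; a small illustration (the file name below is a hypothetical per-image output):

```python
# Read one per-image counting result and sum the object counts.
import json

with open("object_counts/dior_mini/gpt-4o-2024-11-20_open_vocabulary/example.json") as f:
    counts = json.load(f)  # {"category_name": int, ...}

print(sum(counts.values()), "objects counted in total")
```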
- For API inference, it is recommended to perform asynchronous requests for faster results.

```bash
python inference_tools/async_count.py --dataset_name dior_mini \
--dataset_config datasets/config.json \
--base_url your_base_url \
--api_key your_api_key \
--model gpt-4o-2024-11-20 \
--prompt_path prompts/dior/open_vocabulary.txt
```

Add --skip_existing if requests occasionally fail. The script will skip images that have already been processed.
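Under the hood, one OpenAI-format counting request looks roughly like the sketch below; the prompt text, image encoding, and paths are assumptions, and async_count.py handles batching and retries for you:

```python
# Sketch of a single async OpenAI-format request with an image attached.
# base_url, api_key, and the prompt are placeholders.
import asyncio
import base64
from openai import AsyncOpenAI

async def count_one(image_path: str, prompt: str) -> str:
    client = AsyncOpenAI(base_url="your_base_url", api_key="your_api_key")
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = await client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# print(asyncio.run(count_one("demo.jpg", "Count the objects ...")))
```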
- Alternatively, use a locally deployed LVLM for inference. For example, use `Qwen2.5-VL-7B-Instruct` to count objects.

```bash
python inference_tools/qwen_count.py --dataset_name dior_mini \
--pretrained_model_name_or_path ./checkpoints/Qwen2.5-VL-7B-Instruct \
--prompt_path prompts/dior/open_ended.txt
```

The format of region (mask) proposals (e.g., sam2_hiera_l.json) follows a simple structure. The mask is stored in RLE format and can be decoded to a binary mask using the COCO API.
```
{
    "img_name": {
        "bboxes": [...],
        "labels": ["region_proposals", ...],
        "scores": [...],
        "segmentations": [...]
    },
    "img_name2": {...},
    ...
}
```
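For example, a proposal mask can be decoded with pycocotools ("img_name" below is a placeholder for an actual image key in the file):

```python
# Decode one RLE-encoded proposal mask to a binary numpy array.
import json
from pycocotools import mask as mask_utils

with open("region_proposals/dior_mini/sam2_hiera_l.json") as f:
    proposals = json.load(f)

entry = proposals["img_name"]
rle = entry["segmentations"][0]       # {"size": [H, W], "counts": "..."}
binary_mask = mask_utils.decode(rle)  # (H, W) uint8 array of 0/1
print(binary_mask.shape, binary_mask.sum())
```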
Generate mask proposals with your own checkpoint and config path (note: prepend '/' to an absolute config path):

```bash
python inference_tools/propose_regions.py --dataset_name dior_mini \
--sam2_checkpoint path_to_the_ckpt \
--sam2_cfg path_to_the_cfg
```

The results are saved in COCO format. For the open-ended and open-subclass settings, predictions do not have a "category_id"; instead, the "label" field stores the category name. See the predictions on dior_mini under the open-ended setting as an example.
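Concretely, an open-ended prediction entry looks roughly like this (the keys beyond "label" follow the usual COCO results format; all values are made up):

```python
# Illustrative open-ended prediction; "label" replaces "category_id".
pred = {
    "image_id": 1,
    "bbox": [120.0, 80.0, 45.0, 30.0],  # [x, y, w, h]
    "score": 0.87,
    "label": "vehicle",
    "segmentation": {"size": [800, 800], "counts": "..."},
}
print(pred["label"])
```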
Match mask proposals with the counted labels for each setting:

```bash
# Open-Vocabulary
python inference_tools/mask_label_matching.py --dataset_name dior_mini \
--dataset_config datasets/config.json \
--checkpoint_config checkpoints/config.yaml \
--count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_vocabulary \
--rp_path ./region_proposals/dior_mini/sam2_hiera_l.json \
--clip_model georsclip \
--setting open_vocabulary

# Open-Ended
python inference_tools/mask_label_matching.py --dataset_name dior_mini \
--dataset_config datasets/config.json \
--checkpoint_config checkpoints/config.yaml \
--count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_ended \
--rp_path region_proposals/dior_mini/sam2_hiera_l.json \
--clip_model georsclip \
--setting open_ended

# Open-Subclass
python inference_tools/mask_label_matching.py --dataset_name dior_mini \
--dataset_config datasets/config.json \
--checkpoint_config checkpoints/config.yaml \
--count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_subclass_means_of_transports \
--rp_path region_proposals/dior_mini/sam2_hiera_l.json \
--clip_model georsclip \
--setting open_subclass
```

The IoU threshold to determine whether a predicted box/mask is a true positive (TP) is set to 0.5.
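For reference, box IoU is intersection area over union area; a minimal sketch (not the repository's evaluation code):

```python
# Minimal box IoU in [x1, y1, x2, y2] format; a TP requires IoU >= 0.5.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 = 0.142...
```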
Object counting can be evaluated either via counting results or via detection/segmentation predictions.
- Evaluate using counting results:
```bash
# Open-Vocabulary
python evaluating_tools/eval_counting.py --count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_vocabulary \
--dataset_name dior_mini \
--setting open_vocabulary

# Open-Ended
python evaluating_tools/eval_counting.py --count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_ended \
--dataset_name dior_mini \
--setting open_ended

# Open-Subclass
python evaluating_tools/eval_counting.py --count_dir object_counts/dior_mini/gpt-4o-2024-11-20_open_subclass_means_of_transports \
--dataset_name dior_mini \
--setting open_subclass \
--extra_classes means_of_transport
```
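eval_counting.py computes the counting metrics for you; as a loose illustration of comparing predicted counts against ground truth (MAE here is just one common counting metric, not necessarily the script's exact output):

```python
# Illustrative per-category mean absolute error on counts; values are made up.
gt_counts = {"car": 25, "building": 6}
pred_counts = {"car": 23, "building": 5, "tennis_court": 1}

cats = set(gt_counts) | set(pred_counts)
mae = sum(abs(gt_counts.get(c, 0) - pred_counts.get(c, 0)) for c in cats) / len(cats)
print(f"count MAE: {mae:.2f}")  # (2 + 1 + 1) / 3 = 1.33
```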
- Evaluate using recognition results:

```bash
# Open-Vocabulary
python evaluating_tools/eval_counting.py --coco_pred_path results/dior_mini/open_vocabulary/coco_preds/gpt-4o-2024-11-20_open_vocabulary_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_vocabulary

# Open-Ended
python evaluating_tools/eval_counting.py --coco_pred_path results/dior_mini/open_ended/coco_preds/gpt-4o-2024-11-20_open_ended_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_ended

# Open-Subclass
python evaluating_tools/eval_counting.py --coco_pred_path results/dior_mini/open_subclass/coco_preds/gpt-4o-2024-11-20_open_subclass_means_of_transports_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_subclass \
--extra_classes means_of_transport
```

Evaluate the recall of mask proposals:

```bash
python evaluating_tools/eval_proposal_recall.py --mask_proposal region_proposals/dior_mini/sam2_hiera_l.json \
--dataset_name dior_mini
```

Evaluate detection and segmentation results:

```bash
# Open-Vocabulary
python evaluating_tools/eval_recognition.py --predictions results/dior_mini/open_vocabulary/coco_preds/gpt-4o-2024-11-20_open_vocabulary_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_vocabulary \
--extra_class unseen_classes
# Open-Ended
python evaluating_tools/eval_recognition.py --predictions results/dior_mini/open_ended/coco_preds/gpt-4o-2024-11-20_open_ended_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_ended
# Open-Subclass
python evaluating_tools/eval_recognition.py --predictions results/dior_mini/open_subclass/coco_preds/gpt-4o-2024-11-20_open_subclass_means_of_transports_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_subclass \
--extra_class means_of_transport
```

To evaluate methods with confidence scores, the confidence threshold is swept from 0 to 1 (step 0.02). The threshold that maximizes mF1 across categories is selected, and the scores at that threshold are reported.
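A minimal sketch of this sweep, assuming a `compute_mf1(kept)` helper that scores a filtered prediction set against the ground truth (hypothetical; eval_recognition.py implements the real metric):

```python
# Sweep the confidence threshold from 0 to 1 in steps of 0.02 and keep the
# threshold that maximizes mean F1 across categories.
import numpy as np

def sweep_threshold(predictions, compute_mf1):
    best_t, best_mf1 = 0.0, -1.0
    for t in np.arange(0.0, 1.0 + 1e-9, 0.02):
        kept = [p for p in predictions if p["score"] >= t]
        mf1 = compute_mf1(kept)  # hypothetical per-threshold evaluation
        if mf1 > best_mf1:
            best_t, best_mf1 = t, mf1
    return best_t, best_mf1
```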
Add `--score_sweeping` to enable this confidence threshold sweeping:

```bash
python evaluating_tools/eval_recognition.py --predictions results/dior_mini/open_vocabulary/coco_preds/gpt-4o-2024-11-20_open_vocabulary_sam2_hiera_l_georsclip_preds_coco.json \
--dataset_name dior_mini \
--setting open_vocabulary \
--extra_class unseen_classes \
--score_sweeping
```

If you find this work useful, please cite:

```bibtex
@inproceedings{zheng2025instructsam,
    title={Instruct{SAM}: A Training-free Framework for Instruction-Oriented Remote Sensing Object Recognition},
    author={Yijie Zheng and Weijie Wu and Qingyun Li and Xuehui Wang and Xu Zhou and Aiai Ren and Jun Shen and Long Zhao and Guoqing Li and Xue Yang},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025}
}
```

