Prompt-Aware Region Proposal Networks and CLIP Based Zero Shot Object Detection
| International Journal of Computer Science and Engineering |
| © 2025 by SSRG - IJCSE Journal |
| Volume 12 Issue 11 |
| Year of Publication : 2025 |
| Authors : Sriram K. V, Sweta Kulkarni |
How to Cite?
Sriram K. V, Sweta Kulkarni, "Prompt-Aware Region Proposal Networks and CLIP Based Zero Shot Object Detection," SSRG International Journal of Computer Science and Engineering, vol. 12, no. 11, pp. 9-14, 2025. Crossref, https://doi.org/10.14445/23488387/IJCSE-V12I11P102
Abstract:
Conventional object detection models are fully supervised and rely on large-scale labeled datasets with bounding-box annotations for every object category. Collecting such labeled data for every possible class is costly and impractical. To address this drawback, the proposed method introduces a Zero-Shot Object Detection (ZSD) framework that detects unseen object categories using only natural-language descriptions, without requiring additional training or labeled data. The method integrates CLIP, a vision-language model trained on image-text pairs, with a prompt-aware Region Proposal Network (RPN). The RPN is conditioned on CLIP's text embeddings, enabling it to generate proposals that are semantically aligned with the given text prompt. During inference, the model compares text and image features in a shared embedding space, allowing it to localize and classify previously unseen objects. Experimental results on the COCO and LVIS datasets demonstrate that the approach achieves competitive performance under zero-shot settings, generalizing effectively to novel object classes based solely on textual input.
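The core inference step described in the abstract — scoring region proposals against text-prompt embeddings in a shared space — can be sketched as below. This is an illustrative sketch only, not the authors' implementation: NumPy random vectors stand in for CLIP's image and text embeddings, and `score_proposals` is a hypothetical helper name introduced here for clarity.

```python
import numpy as np

def score_proposals(region_feats, text_feats):
    """Cosine-similarity scores between region proposals and text prompts.

    region_feats: (num_proposals, d) image-side embeddings
                  (stand-in for CLIP visual features of cropped regions)
    text_feats:   (num_classes, d) prompt embeddings
                  (stand-in for CLIP text features, e.g. "a photo of a zebra")
    Returns a (num_proposals, num_classes) score matrix.
    """
    # L2-normalize both sides so the dot product is cosine similarity
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return r @ t.T

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 512))   # 5 proposals, 512-d embeddings
prompts = rng.normal(size=(3, 512))   # 3 text prompts (unseen classes)

scores = score_proposals(regions, prompts)
labels = scores.argmax(axis=1)        # best-matching prompt per proposal
print(scores.shape, labels.shape)     # (5, 3) (5,)
```

Because classification reduces to nearest-prompt lookup in the shared space, no class-specific detector weights are needed — which is what allows the framework to handle categories never seen during training.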
Keywords:
Zero-Shot Detection, CLIP, Region Proposal Network, Vision-Language Models, Prompt-Guided Detection.
References:
[1] K.V. Sriram, and R.H. Havaldar. “Analytical Review and Study on Object Detection Techniques in the Image,” International Journal of Modeling, Simulation, and Scientific Computing, vol. 12, no. 5, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[2] Zhicheng Tang, Yang Liu, and Yi Shang, “A New GNN-Based Object Detection Method for Multiple Small Objects in Aerial Images,” IEEE/ACIS 23rd International Conference on Computer and Information Science, Wuxi, China, pp. 14-19, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[3] Kazutoshi Akita, and Norimichi Ukita, “Context-Aware Region-Dependent Scale Proposals for Scale-Optimized Object Detection Using Super-Resolution,” IEEE Access, vol. 11, pp. 122141-122153, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[4] Min Qiao et al., “Salient Object Detection: An Accurate and Efficient Method for Complex Shape Objects,” IEEE Access, vol. 9, pp. 169220-169230, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[5] Keong-Hun Choi, and Jong-Eun Ha, “Object Detection Method Using Image and Number of Objects on Image as Label,” IEEE Access, vol. 12, pp. 121915-121931, 2024.
[CrossRef] [Google Scholar] [Publisher Link]
[6] Shripad Bhatlawande et al., “Study of Object Detection with Faster RCNN,” 2nd International Conference on Intelligent Technologies, Hubli, India, pp. 1-6, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[7] N. Murali Krishna et al., “Object Detection and Tracking Using Yolo,” Third International Conference on Inventive Research in Computing Applications, Coimbatore, India, pp. 1-7, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[8] Yiwu Zhong et al., “RegionCLIP: Region-Based Language-Image Pretraining,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16793-16803, 2022.
[Google Scholar] [Publisher Link]
[9] Chengjian Feng et al., “PromptDet: Towards Open-vocabulary Detection using Uncurated Images,” Computer Vision – ECCV, pp. 701-717, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[10] H. Song and J. Bang, “Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection,” arXiv preprint arXiv:2303.14386, pp. 1-12, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
