Review on latest approaches used in Natural Language Processing for generation of Image Captioning

International Journal of Computer Science and Engineering
© 2017 by SSRG - IJCSE Journal
Volume 4 Issue 6
Year of Publication : 2017
Authors : M. A. Bhalekar, Dr. M. V. Bedekar

How to Cite?

M. A. Bhalekar, Dr. M. V. Bedekar, "Review on latest approaches used in Natural Language Processing for generation of Image Captioning," SSRG International Journal of Computer Science and Engineering, vol. 4, no. 6, pp. 41-48, 2017.


Image captioning has recently received considerable attention from researchers, particularly since the development of deep learning. Automatically generating a caption for an image requires integrating computer vision and natural language processing, since describing the content of an image is inherently a task for both fields. Many image captioning systems have shown that it is possible to describe the most salient information conveyed by an image with accurate and meaningful sentences. This paper surveys recent approaches to image captioning and discusses the datasets they use.


Image captioning, computer vision, natural language processing, deep learning

