Towards Mapping Images to Text Using Deep-Learning Architectures
Daniela Onita,
Adriana Birlutiu and
Liviu P. Dinu
Additional contact information
Daniela Onita: Department of Computer Science, University of Bucharest, 90 Panduri Street, Sector 5, 050663 Bucharest, Romania
Adriana Birlutiu: Department of Computer Science and Engineering, “1 Decembrie 1918” University of Alba Iulia, 5 Gabriel Bethlen Street, 515900 Alba Iulia, Romania
Liviu P. Dinu: Department of Computer Science, University of Bucharest, 90 Panduri Street, Sector 5, 050663 Bucharest, Romania
Mathematics, 2020, vol. 8, issue 9, 1-18
Abstract:
Images and text are types of content that are often used together to convey a message. Mapping images to text can provide very useful information and can be used in many applications, for example in the medical domain, in assistive tools for visually impaired people, and in social networking. In this paper, we investigate an approach for mapping images to text using a Kernel Ridge Regression model. We considered two types of features: simple RGB pixel-value features and image features extracted with deep-learning approaches. We investigated several neural network architectures for image feature extraction: VGG16, Inception V3, ResNet50, and Xception. The experimental evaluation was performed on three data sets from different domains. The texts associated with the images are objective descriptions for two of the three data sets and subjective descriptions for the third. The experimental results show that the more complex deep-learning approaches used for feature extraction perform better than simple RGB pixel-value features. Moreover, ResNet50 performs best among the four deep network architectures considered for extracting image features: the model error obtained with ResNet50 features is approximately 0.30 lower than with the other architectures. We extracted natural-language descriptors of the images and compared the original and generated descriptive words. Furthermore, we investigated whether performance differs with the type of text associated with the images: subjective or objective. The proposed model generated descriptions closer to the original ones for the data set containing objective descriptions, whose vocabulary is simpler, larger, and clearer.
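The abstract describes a two-stage pipeline: a pretrained convolutional network (e.g., ResNet50) extracts image features, and a Kernel Ridge Regression model maps those features to a textual representation. The Python sketch below illustrates one plausible implementation with Keras and scikit-learn; it is not the authors' code, and the bag-of-words target matrix, file names, and hyperparameter values are assumptions made for illustration.

# A minimal sketch (not the authors' implementation) of the pipeline
# described in the abstract: extract image features with a pretrained
# ResNet50, then fit Kernel Ridge Regression mapping features to a
# text representation. The bag-of-words targets below are assumptions.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.kernel_ridge import KernelRidge

# Pretrained ResNet50 without the classification head; global average
# pooling yields a 2048-dimensional feature vector per image.
feature_extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(paths):
    """Load images, preprocess them for ResNet50, and return feature vectors."""
    batch = np.stack([
        image.img_to_array(image.load_img(p, target_size=(224, 224)))
        for p in paths
    ])
    return feature_extractor.predict(preprocess_input(batch))

# Hypothetical training data: image paths and a target matrix Y with one
# row per image and one column per vocabulary word of its description.
X_paths = ["img_001.jpg", "img_002.jpg"]           # placeholder file names
Y = np.array([[1, 0, 1], [0, 1, 1]], dtype=float)  # placeholder word indicators

X = extract_features(X_paths)

# Kernel Ridge Regression with an RBF kernel; alpha and gamma would be
# tuned by cross-validation in practice.
krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=1e-4)
krr.fit(X, Y)

# Predicted scores over the vocabulary for a new image; the top-scoring
# words would be read off as the generated description.
scores = krr.predict(extract_features(["img_003.jpg"]))
top_words = np.argsort(scores[0])[::-1]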
Keywords: kernel ridge regression; image captioning; image description; deep learning; convolutional neural network
JEL-codes: C
Date: 2020
Downloads: (external link)
https://www.mdpi.com/2227-7390/8/9/1606/pdf (application/pdf)
https://www.mdpi.com/2227-7390/8/9/1606/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:8:y:2020:i:9:p:1606-:d:415345