{"id":72771,"date":"2019-11-14T12:23:48","date_gmt":"2019-11-14T06:53:48","guid":{"rendered":"https:\/\/data-flair.training\/blogs\/?p=72771"},"modified":"2020-08-07T00:00:27","modified_gmt":"2020-08-06T18:30:27","slug":"python-based-project-image-caption-generator-cnn","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/","title":{"rendered":"Python based Project &#8211; Learn to Build Image Caption Generator with CNN &amp; LSTM"},"content":{"rendered":"<p><strong>Project based on Python &#8211; Image Caption Generator\u00a0<\/strong><\/p>\n<p>You saw an image and your brain can easily tell what the image is about, but can a computer tell what the image is representing? Computer vision researchers worked on this a lot and they considered it impossible until now! With the advancement in Deep learning techniques, availability of huge datasets and computer power, we can build models that can generate captions for an image.<\/p>\n<p>This is what we are going to implement in this Python based project where we will use deep learning techniques of Convolutional Neural Networks and a type of Recurrent Neural Network (LSTM) together.<\/p>\n<p>Below are some of the Python Data Science projects on which you can work later on:<\/p>\n<ol>\n<li><a href=\"https:\/\/data-flair.training\/blogs\/advanced-python-project-detecting-fake-news\/\">Fake News Detection Python Project<\/a><\/li>\n<li><a href=\"https:\/\/data-flair.training\/blogs\/python-machine-learning-project-detecting-parkinson-disease\/\">Parkinson\u2019s Disease Detection Python Project<\/a><\/li>\n<li><a href=\"https:\/\/data-flair.training\/blogs\/project-in-python-colour-detection\/\">Color Detection Python Project<\/a><\/li>\n<li><a href=\"https:\/\/data-flair.training\/blogs\/python-mini-project-speech-emotion-recognition\/\">Speech Emotion Recognition Python Project<\/a><\/li>\n<li><a 
href=\"https:\/\/data-flair.training\/blogs\/project-in-python-breast-cancer-classification\/\">Breast Cancer Classification Python Project<\/a><\/li>\n<li><a href=\"https:\/\/data-flair.training\/blogs\/python-project-gender-age-detection\/\">Age and Gender Detection Python Project<\/a><\/li>\n<li><a href=\"https:\/\/data-flair.training\/blogs\/python-deep-learning-project-handwritten-digit-recognition\/\">Handwritten Digit Recognition Python Project<\/a><\/li>\n<li><a href=\"https:\/\/data-flair.training\/blogs\/python-chatbot-project\/\">Chatbot Python Project<\/a><\/li>\n<li><a href=\"https:\/\/data-flair.training\/blogs\/python-project-driver-drowsiness-detection-system\/\">Driver Drowsiness Detection Python Project<\/a><\/li>\n<li><a href=\"https:\/\/data-flair.training\/blogs\/python-project-traffic-signs-recognition\/\">Traffic Signs Recognition Python Project<\/a><\/li>\n<li>Image Caption Generator Python Project<\/li>\n<\/ol>\n<p>Now, let&#8217;s quickly start the Python based project by defining the image caption generator.<\/p>\n<h3>What is Image Caption Generator?<\/h3>\n<p>Image caption generator is a task that involves computer vision and natural language processing concepts to recognize the context of an image and describe them in a natural language like English.<\/p>\n<h3>Image Caption Generator with CNN &#8211; About the Python based Project<\/h3>\n<p>The objective of our project is to learn the concepts of a CNN and LSTM model and build a working model of Image caption generator by implementing CNN with LSTM.<\/p>\n<p>In this Python project, we will be implementing the caption generator using <em><a href=\"https:\/\/data-flair.training\/blogs\/convolutional-neural-networks-tutorial\/\"><strong>CNN (Convolutional Neural Networks)<\/strong> <\/a><\/em>and LSTM (Long short term memory). 
The image features will be extracted with Xception, a CNN model trained on the ImageNet dataset, and then fed into the LSTM model, which is responsible for generating the image captions.<\/p>\n<h3>The Dataset of Python based Project<\/h3>\n<p>For the image caption generator, we will be using the Flickr_8K dataset. There are larger datasets such as Flickr_30K and MSCOCO, but training the network on them can take weeks, so we will use the smaller Flickr8k dataset. The advantage of a larger dataset is that it lets us build better models.<\/p>\n<p>Thanks to Jason Brownlee for providing a direct link to download the dataset (Size: 1GB).<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/jbrownlee\/Datasets\/releases\/download\/Flickr8k\/Flickr8k_Dataset.zip\">Flicker8k_Dataset\u00a0<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/jbrownlee\/Datasets\/releases\/download\/Flickr8k\/Flickr8k_text.zip\">Flickr_8k_text\u00a0<\/a><\/li>\n<\/ul>\n<p>The Flickr_8k_text folder contains the file Flickr8k.token.txt, the main file of our dataset, which lists each image name with its captions, one entry per line (separated by \u201c\\n\u201d).<\/p>\n<h3>Pre-requisites<\/h3>\n<p>This project requires good knowledge of Deep learning, Python, working on Jupyter notebooks, Keras library, Numpy, and <a href=\"https:\/\/data-flair.training\/blogs\/nlp-natural-language-processing\/\"><em><strong>Natural language processing<\/strong><\/em><\/a>.<\/p>\n<p>Make sure you have installed the following libraries (for example, with pip install):<\/p>\n<ul>\n<li>tensorflow<\/li>\n<li>keras<\/li>\n<li>pillow<\/li>\n<li>numpy<\/li>\n<li>tqdm<\/li>\n<li>jupyterlab<\/li>\n<\/ul>\n<h2>Image Caption Generator &#8211; Python based Project<\/h2>\n<h3>What is CNN?<\/h3>\n<p>Convolutional Neural networks are specialized deep neural networks which can process the data that has input shape like a 2D matrix. 
Images are easily represented as a 2D matrix and CNN is very useful in working with images.<\/p>\n<p>CNN is basically used for image classifications and identifying if an image is a bird, a plane or Superman, etc.<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/working-of-Deep-CNN-Python-project.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-72798 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/working-of-Deep-CNN-Python-project.png\" alt=\"working of Deep CNN - Python based project\" width=\"876\" height=\"318\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/working-of-Deep-CNN-Python-project.png 876w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/working-of-Deep-CNN-Python-project-150x54.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/working-of-Deep-CNN-Python-project-300x109.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/working-of-Deep-CNN-Python-project-768x279.png 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/working-of-Deep-CNN-Python-project-520x189.png 520w\" sizes=\"auto, (max-width: 876px) 100vw, 876px\" \/><\/a><\/p>\n<p>It scans images from left to right and top to bottom to pull out important features from the image and combines the feature to classify images. 
It can cope with images that have been translated and, to some extent, rotated, scaled, or viewed from a different perspective.<\/p>\n<p class=\"df-text-bold df-text-red\" style=\"text-align: center;\">Practise the important Python topics<\/p>\n<p class=\"df-text-bold\" style=\"text-align: center;\">Check out the <a href=\"https:\/\/data-flair.training\/blogs\/python-tutorials-home\/\">240+ Python Tutorials<\/a><\/p>\n<h3>What is LSTM?<\/h3>\n<p>LSTM stands for <strong>Long Short-Term Memory<\/strong>. It is a type of RNN (<strong>recurrent neural network<\/strong>) that is well suited to sequence prediction problems: based on the previous text, we can predict what the next word will be. It improves on the traditional RNN by overcoming its short-term memory limitation. An LSTM carries relevant information forward while it processes the input sequence, and its forget gate discards information that is no longer relevant.<\/p>\n<p>This is what an LSTM cell looks like &#8211;<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/LSTM-Cell-Structure-project-in-python.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72800\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/LSTM-Cell-Structure-project-in-python.png\" alt=\"LSTM Cell Structure - simple python project\" width=\"1300\" height=\"853\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/LSTM-Cell-Structure-project-in-python.png 1300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/LSTM-Cell-Structure-project-in-python-150x98.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/LSTM-Cell-Structure-project-in-python-300x197.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/LSTM-Cell-Structure-project-in-python-768x504.png 768w, 
https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/LSTM-Cell-Structure-project-in-python-1024x672.png 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/LSTM-Cell-Structure-project-in-python-520x341.png 520w\" sizes=\"auto, (max-width: 1300px) 100vw, 1300px\" \/><\/a><\/p>\n<h3>Image Caption Generator Model<\/h3>\n<p>So, to make our image caption generator model, we will be merging these architectures. It is also called a CNN-RNN model.<\/p>\n<ul>\n<li>CNN is used for extracting features from the image. We will use the pre-trained model Xception.<\/li>\n<li>LSTM will use the information from CNN to help generate a description of the image.<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/Model-of-Image-Caption-Generator-python-project.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72812\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/Model-of-Image-Caption-Generator-python-project.png\" alt=\"Model of Image Caption Generator - python based project\" width=\"1648\" height=\"868\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/Model-of-Image-Caption-Generator-python-project.png 1648w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/Model-of-Image-Caption-Generator-python-project-150x79.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/Model-of-Image-Caption-Generator-python-project-300x158.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/Model-of-Image-Caption-Generator-python-project-768x405.png 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/Model-of-Image-Caption-Generator-python-project-1024x539.png 1024w, 
https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/Model-of-Image-Caption-Generator-python-project-520x274.png 520w\" sizes=\"auto, (max-width: 1648px) 100vw, 1648px\" \/><\/a><\/p>\n<h3>Project File Structure<\/h3>\n<p>Downloaded from dataset:<\/p>\n<ul>\n<li><strong>Flicker8k_Dataset &#8211;<\/strong> Dataset folder which contains 8091 images.<\/li>\n<li><strong>Flickr_8k_text &#8211;<\/strong> Dataset folder which contains text files and captions of images.<\/li>\n<\/ul>\n<p>The below files will be created by us while making the project.<\/p>\n<ul>\n<li><strong>Models &#8211;<\/strong> It will contain our trained models.<\/li>\n<li><strong>Descriptions.txt &#8211;<\/strong> This text file contains all image names and their captions after preprocessing.<\/li>\n<li><strong>Features.p &#8211;<\/strong> Pickle object that contains an image and their feature vector extracted from the Xception pre-trained CNN model.<\/li>\n<li><strong>Tokenizer.p &#8211;<\/strong> Contains tokens mapped with an index value.<\/li>\n<li><strong>Model.png &#8211;<\/strong> Visual representation of dimensions of our project.<\/li>\n<li><strong>Testing_caption_generator.py &#8211;<\/strong> Python file for generating a caption of any image.<\/li>\n<li><strong>Training_caption_generator.ipynb &#8211;<\/strong> Jupyter notebook in which we train and build our image caption generator.<\/li>\n<\/ul>\n<p>You can download all the files from the link:<\/p>\n<p><a href=\"https:\/\/drive.google.com\/open?id=13oJ_9jeylTmW7ivmuNmadwraWceHoQbK\"><strong>Image Caption Generator &#8211; Python Project Files<\/strong><\/a><\/p>\n<p><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/structure-python-data-science-project.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72801\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/structure-python-data-science-project.png\" 
alt=\"structure - python based project\" width=\"761\" height=\"429\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/structure-python-data-science-project.png 761w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/structure-python-data-science-project-150x85.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/structure-python-data-science-project-300x169.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/structure-python-data-science-project-520x293.png 520w\" sizes=\"auto, (max-width: 761px) 100vw, 761px\" \/><\/a><\/p>\n<p class=\"df-text-bold df-text-red\" style=\"text-align: center;\">Want to become a Python expert?<\/p>\n<p class=\"df-text-bold\" style=\"text-align: center;\">Enroll for the <a href=\"https:\/\/data-flair.training\/python-course\/\">Certified Python Training Course<\/a><\/p>\n<h3>Building the Python based Project<\/h3>\n<p>Let\u2019s start by initializing the jupyter notebook server by typing jupyter lab in the console of your project folder. It will open up the interactive Python notebook where you can run your code. 
Create a Python3 notebook and name it <strong>training_caption_generator.ipynb<\/strong><\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/jupyter-lab-advanced-python-project-1.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-74143 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/jupyter-lab-advanced-python-project-1.jpg\" alt=\"jupyter lab - python based project \" width=\"846\" height=\"517\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/jupyter-lab-advanced-python-project-1.jpg 846w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/jupyter-lab-advanced-python-project-1-150x92.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/jupyter-lab-advanced-python-project-1-300x183.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/jupyter-lab-advanced-python-project-1-768x469.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/jupyter-lab-advanced-python-project-1-520x318.jpg 520w\" sizes=\"auto, (max-width: 846px) 100vw, 846px\" \/><\/a><\/p>\n<p><strong>1. 
First, we import all the necessary packages<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import string\r\nimport numpy as np\r\nfrom PIL import Image\r\nimport os\r\nfrom pickle import dump, load\r\n\r\nfrom keras.applications.xception import Xception, preprocess_input\r\nfrom keras.preprocessing.image import load_img, img_to_array\r\nfrom keras.preprocessing.text import Tokenizer\r\nfrom keras.preprocessing.sequence import pad_sequences\r\nfrom keras.utils import to_categorical\r\nfrom keras.models import Model, load_model\r\n# add() merges the image and text branches of the model\r\nfrom keras.layers import Input, Dense, LSTM, Embedding, Dropout, add\r\n\r\n# small library for seeing the progress of loops.\r\nfrom tqdm import tqdm_notebook as tqdm<\/pre>\n<p><strong>2. Getting and performing data cleaning<\/strong><\/p>\n<p>The main text file, which contains all the image captions, is <strong>Flickr8k.token.txt<\/strong> in our <strong>Flickr_8k_text<\/strong> folder.<\/p>\n<p>Have a look at the file &#8211;<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/token-file-project-in-python.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72803\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/token-file-project-in-python.png\" alt=\"token file - project in python\" width=\"903\" height=\"440\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/token-file-project-in-python.png 903w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/token-file-project-in-python-150x73.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/token-file-project-in-python-300x146.png 300w, 
https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/token-file-project-in-python-768x374.png 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/token-file-project-in-python-520x253.png 520w\" sizes=\"auto, (max-width: 903px) 100vw, 903px\" \/><\/a><\/p>\n<p>Each line of the file holds an image name and a caption separated by a tab (\u201c\\t\u201d), and the entries are separated by new lines (\u201c\\n\u201d).<\/p>\n<p>Each image has 5 captions, and we can see that a number from #0 to #4 is appended to the image name to identify each caption.<\/p>\n<p>We will define 5 functions:<\/p>\n<ul>\n<li><strong>load_doc( filename ) &#8211;<\/strong> Loads the document file and reads its contents into a string.<\/li>\n<li><strong>all_img_captions( filename ) &#8211;<\/strong> This function will create a <strong>descriptions<\/strong> dictionary that maps each image to a list of its 5 captions. The descriptions dictionary will look something like this:<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/descriptions-python-project-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-72834 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/descriptions-python-project-1.png\" alt=\"descriptions - python based project \" width=\"1148\" height=\"639\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/descriptions-python-project-1.png 1148w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/descriptions-python-project-1-150x83.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/descriptions-python-project-1-300x167.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/descriptions-python-project-1-768x427.png 768w, 
https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/descriptions-python-project-1-1024x570.png 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/descriptions-python-project-1-520x289.png 520w\" sizes=\"auto, (max-width: 1148px) 100vw, 1148px\" \/><\/a><\/p>\n<ul>\n<li><strong>cleaning_text( descriptions) &#8211;<\/strong> This function takes all descriptions and performs data cleaning. This is an important step when we work with textual data, according to our goal, we decide what type of cleaning we want to perform on the text. In our case, we will be removing punctuations, converting all text to lowercase and removing words that contain numbers.<br \/>\nSo, a caption like \u201cA man riding on a three-wheeled wheelchair\u201d will be transformed into \u201cman riding on three wheeled wheelchair\u201d<\/li>\n<li><strong>text_vocabulary( descriptions ) &#8211;<\/strong> This is a simple function that will separate all the unique words and create the vocabulary from all the descriptions.<\/li>\n<li><strong>save_descriptions( descriptions, filename ) &#8211;<\/strong> This function will create a list of all the descriptions that have been preprocessed and store them into a file. We will create a descriptions.txt file to store all the captions. 
It will look something like this:<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/save-descriptions-python-project.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72805\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/save-descriptions-python-project.png\" alt=\"save descriptions - python project\" width=\"690\" height=\"291\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/save-descriptions-python-project.png 690w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/save-descriptions-python-project-150x63.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/save-descriptions-python-project-300x127.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/save-descriptions-python-project-520x219.png 520w\" sizes=\"auto, (max-width: 690px) 100vw, 690px\" \/><\/a><\/p>\n<p><strong>Code :<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\"># Loading a text file into memory\r\ndef load_doc(filename):\r\n    # Opening the file as read only\r\n    file = open(filename, 'r')\r\n    text = file.read()\r\n    file.close()\r\n    return text\r\n\r\n# get all imgs with their captions\r\ndef all_img_captions(filename):\r\n    file = load_doc(filename)\r\n    captions = file.split('\\n')\r\n    descriptions ={}\r\n    for caption in captions[:-1]:\r\n        img, caption = caption.split('\\t')\r\n        if img[:-2] not in descriptions:\r\n            descriptions[img[:-2]] = [ caption ]\r\n        else:\r\n            descriptions[img[:-2]].append(caption)\r\n    return descriptions\r\n\r\n#Data cleaning- lower casing, removing puntuations and words containing numbers\r\ndef cleaning_text(captions):\r\n    table = str.maketrans('','',string.punctuation)\r\n 
   for img,caps in captions.items():\r\n        for i,img_caption in enumerate(caps):\r\n\r\n            #str.replace returns a new string, so the result must be assigned\r\n            img_caption = img_caption.replace(\"-\",\" \")\r\n            desc = img_caption.split()\r\n\r\n            #converts to lowercase\r\n            desc = [word.lower() for word in desc]\r\n            #remove punctuation from each token\r\n            desc = [word.translate(table) for word in desc]\r\n            #remove hanging 's and single-letter words such as a\r\n            desc = [word for word in desc if(len(word)&gt;1)]\r\n            #remove tokens with numbers in them\r\n            desc = [word for word in desc if(word.isalpha())]\r\n            #convert back to string\r\n\r\n            img_caption = ' '.join(desc)\r\n            captions[img][i]= img_caption\r\n    return captions\r\n\r\ndef text_vocabulary(descriptions):\r\n    # build vocabulary of all unique words\r\n    vocab = set()\r\n\r\n    for key in descriptions.keys():\r\n        for d in descriptions[key]:\r\n            vocab.update(d.split())\r\n\r\n    return vocab\r\n\r\n#All descriptions in one file \r\ndef save_descriptions(descriptions, filename):\r\n    lines = list()\r\n    for key, desc_list in descriptions.items():\r\n        for desc in desc_list:\r\n            lines.append(key + '\\t' + desc )\r\n    data = \"\\n\".join(lines)\r\n    file = open(filename,\"w\")\r\n    file.write(data)\r\n    file.close()\r\n\r\n\r\n# Set these paths according to the project folder on your system\r\n# (raw strings keep the Windows backslashes literal)\r\ndataset_text = r\"D:\dataflair projects\Project - Image Caption Generator\Flickr_8k_text\"\r\ndataset_images = r\"D:\dataflair projects\Project - Image Caption Generator\Flicker8k_Dataset\"\r\n\r\n#we prepare our text data\r\nfilename = dataset_text + \"\/\" + \"Flickr8k.token.txt\"\r\n#loading the file that contains all data\r\n#mapping them into descriptions dictionary img to 5 captions\r\ndescriptions = all_img_captions(filename)\r\nprint(\"Length of descriptions =\" ,len(descriptions))\r\n\r\n#cleaning the descriptions\r\nclean_descriptions = 
cleaning_text(descriptions)\r\n\r\n#building vocabulary \r\nvocabulary = text_vocabulary(clean_descriptions)\r\nprint(\"Length of vocabulary = \", len(vocabulary))\r\n\r\n#saving each description to file \r\nsave_descriptions(clean_descriptions, \"descriptions.txt\")<\/pre>\n<p><strong>3. Extracting the feature vector from all images<\/strong><\/p>\n<p>This technique is called transfer learning: instead of training everything ourselves, we take a model that has already been trained on a large dataset, extract features from it, and use those features for our own task. We are using the Xception model, which was trained on the ImageNet dataset to classify images into 1000 classes. We can import this model directly from keras.applications. Make sure you are connected to the internet, as the weights are downloaded automatically. Since the Xception model was originally built for ImageNet classification, we make a small change to integrate it with our model. Note that the Xception model expects an input image of size 299*299*3. We remove the last classification layer and apply global average pooling to get a 2048-dimensional feature vector.<\/p>\n<p>model = Xception( include_top=False, pooling='avg' )<\/p>\n<p>The function <strong>extract_features()<\/strong> will extract features for all images, and we will map each image name to its feature array. 
Then we will dump the features dictionary into a \u201cfeatures.p\u201d pickle file.<\/p>\n<p><strong>Code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">def extract_features(directory):\r\n        model = Xception( include_top=False, pooling='avg' )\r\n        features = {}\r\n        for img in tqdm(os.listdir(directory)):\r\n            filename = directory + \"\/\" + img\r\n            image = Image.open(filename)\r\n            image = image.resize((299,299))\r\n            image = np.expand_dims(image, axis=0)\r\n            #image = preprocess_input(image)\r\n            image = image\/127.5\r\n            image = image - 1.0\r\n\r\n            feature = model.predict(image)\r\n            features[img] = feature\r\n        return features\r\n\r\n#2048 feature vector\r\nfeatures = extract_features(dataset_images)\r\ndump(features, open(\"features.p\",\"wb\"))<\/pre>\n<p style=\"text-align: center;\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/extracting_features.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72806\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/extracting_features.png\" alt=\"extracting features - python based project\" width=\"805\" height=\"641\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/extracting_features.png 805w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/extracting_features-150x119.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/extracting_features-300x239.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/extracting_features-768x612.png 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/extracting_features-520x414.png 520w\" sizes=\"auto, (max-width: 805px) 100vw, 805px\" 
\/><\/a><\/p>\n<p>This process can take a long time depending on your system. I am using an Nvidia 1050 GPU, and it took around 7 minutes to complete. On a CPU, it might take 1-2 hours. Once the features.p file exists, you can comment out the extraction code and load the features directly from the pickle file.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">features = load(open(\"features.p\",\"rb\"))<\/pre>\n<p><strong>4. Loading dataset for Training the model<\/strong><\/p>\n<p>In our <strong>Flickr_8k_text<\/strong> folder, we have the <strong>Flickr_8k.trainImages.txt<\/strong> file, which contains a list of 6000 image names that we will use for training.<\/p>\n<p>For loading the training dataset, we need three more functions:<\/p>\n<ul>\n<li><strong>load_photos( filename ) &#8211;<\/strong> This will load the text file into a string and return the list of image names.<\/li>\n<li><strong>load_clean_descriptions( filename, photos ) &#8211;<\/strong> This function will create a dictionary that contains the captions for each photo in the list of photos. We also add the &lt;start&gt; and &lt;end&gt; identifiers to each caption. 
We need this so that our LSTM model can identify the starting and ending of the caption.<\/li>\n<li><strong>load_features(photos) &#8211;<\/strong> This function will give us the dictionary for image names and their feature vector which we have previously extracted from the Xception model.<\/li>\n<\/ul>\n<p><strong>Code :<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">#load the data \r\ndef load_photos(filename):\r\n    file = load_doc(filename)\r\n    photos = file.split(\"\\n\")[:-1]\r\n    return photos\r\n\r\n\r\ndef load_clean_descriptions(filename, photos): \r\n    #loading clean_descriptions\r\n    file = load_doc(filename)\r\n    descriptions = {}\r\n    for line in file.split(\"\\n\"):\r\n\r\n        words = line.split()\r\n        if len(words)&lt;1 :\r\n            continue\r\n\r\n        image, image_caption = words[0], words[1:]\r\n\r\n        if image in photos:\r\n            if image not in descriptions:\r\n                descriptions[image] = []\r\n            desc = '&lt;start&gt; ' + \" \".join(image_caption) + ' &lt;end&gt;'\r\n            descriptions[image].append(desc)\r\n\r\n    return descriptions\r\n\r\n\r\ndef load_features(photos):\r\n    #loading all features\r\n    all_features = load(open(\"features.p\",\"rb\"))\r\n    #selecting only needed features\r\n    features = {k:all_features[k] for k in photos}\r\n    return features\r\n\r\n\r\nfilename = dataset_text + \"\/\" + \"Flickr_8k.trainImages.txt\"\r\n\r\n#train = loading_data(filename)\r\ntrain_imgs = load_photos(filename)\r\ntrain_descriptions = load_clean_descriptions(\"descriptions.txt\", train_imgs)\r\ntrain_features = load_features(train_imgs)<\/pre>\n<p><strong>5. Tokenizing the vocabulary\u00a0<\/strong><\/p>\n<p>Computers don\u2019t understand English words, for computers, we will have to represent them with numbers. So, we will map each word of the vocabulary with a unique index value. 
The Keras library provides the Tokenizer class, which we will use to create tokens from our vocabulary and save them to a <strong>\u201ctokenizer.p\u201d<\/strong> pickle file.<\/p>\n<p><strong>Code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">#converting dictionary to clean list of descriptions\r\ndef dict_to_list(descriptions):\r\n    all_desc = []\r\n    for key in descriptions.keys():\r\n        all_desc.extend(descriptions[key])\r\n    return all_desc\r\n\r\n#creating tokenizer class \r\n#this will vectorise text corpus\r\n#each integer will represent token in dictionary\r\n\r\nfrom keras.preprocessing.text import Tokenizer\r\n\r\ndef create_tokenizer(descriptions):\r\n    desc_list = dict_to_list(descriptions)\r\n    tokenizer = Tokenizer()\r\n    tokenizer.fit_on_texts(desc_list)\r\n    return tokenizer\r\n\r\n# give each word an index, and store that into tokenizer.p pickle file\r\ntokenizer = create_tokenizer(train_descriptions)\r\ndump(tokenizer, open('tokenizer.p', 'wb'))\r\nvocab_size = len(tokenizer.word_index) + 1\r\nvocab_size<\/pre>\n<p>Our vocabulary contains 7577 words.<\/p>\n<p>We calculate the maximum length of the descriptions. This is important for deciding the model structure parameters. Max_length of description is 32.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">#calculate maximum length of descriptions\r\ndef max_length(descriptions):\r\n    desc_list = dict_to_list(descriptions)\r\n    return max(len(d.split()) for d in desc_list)\r\n    \r\n#note: this rebinds the name max_length from the function to its integer result,\r\n#so re-run the function definition above before running this line again\r\nmax_length = max_length(descriptions)\r\nmax_length<\/pre>\n<p><strong>6. Create Data generator<\/strong><\/p>\n<p>Let us first see what the input and output of our model will look like. To turn this into a supervised learning task, we have to provide both the input and the expected output to the model for training. 
We have to train our model on 6000 images; each image is represented by a feature vector of length 2048, and each caption is also represented as numbers. This much data for 6000 images cannot be held in memory at once, so we will use a generator that yields batches.<\/p>\n<p>The generator will yield the input and output sequence.<\/p>\n<p><strong>For example:<\/strong><\/p>\n<p>The input to our model is [x1, x2] and the output will be y, where x1 is the 2048-length feature vector of that image, x2 is the input text sequence and y is the next word that the model has to predict.<\/p>\n<table>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">x1(feature vector)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">x2(Text sequence)<\/span><\/td>\n<td><span style=\"font-weight: 400;\">y(word to predict)<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">feature<\/span><\/td>\n<td><span style=\"font-weight: 400;\">start,<\/span><\/td>\n<td><span style=\"font-weight: 400;\">two<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">feature<\/span><\/td>\n<td><span style=\"font-weight: 400;\">start, two<\/span><\/td>\n<td><span style=\"font-weight: 400;\">dogs<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">feature<\/span><\/td>\n<td><span style=\"font-weight: 400;\">start, two, dogs<\/span><\/td>\n<td><span style=\"font-weight: 400;\">drink<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">feature<\/span><\/td>\n<td><span style=\"font-weight: 400;\">start, two, dogs, drink<\/span><\/td>\n<td><span style=\"font-weight: 400;\">water<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">feature<\/span><\/td>\n<td><span style=\"font-weight: 400;\">start, two, dogs, drink, water<\/span><\/td>\n<td><span style=\"font-weight: 400;\">end<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">#create input-output sequence pairs from 
the image description.\r\n\r\n#data generator, used by model.fit_generator()\r\ndef data_generator(descriptions, features, tokenizer, max_length):\r\n    while 1:\r\n        for key, description_list in descriptions.items():\r\n            #retrieve photo features\r\n            feature = features[key][0]\r\n            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)\r\n            yield [[input_image, input_sequence], output_word]\r\n\r\ndef create_sequences(tokenizer, max_length, desc_list, feature):\r\n    X1, X2, y = list(), list(), list()\r\n    # walk through each description for the image\r\n    for desc in desc_list:\r\n        # encode the sequence\r\n        seq = tokenizer.texts_to_sequences([desc])[0]\r\n        # split one sequence into multiple X,y pairs\r\n        for i in range(1, len(seq)):\r\n            # split into input and output pair\r\n            in_seq, out_seq = seq[:i], seq[i]\r\n            # pad input sequence\r\n            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]\r\n            # encode output sequence\r\n            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]\r\n            # store\r\n            X1.append(feature)\r\n            X2.append(in_seq)\r\n            y.append(out_seq)\r\n    return np.array(X1), np.array(X2), np.array(y)\r\n\r\n#You can check the shape of the input and output for your model\r\n[a,b],c = next(data_generator(train_descriptions, features, tokenizer, max_length))\r\na.shape, b.shape, c.shape\r\n#((47, 2048), (47, 32), (47, 7577))<\/pre>\n<p><strong>7. Defining the CNN-RNN model<\/strong><\/p>\n<p>To define the structure of the model, we will be using the Keras Model from Functional API. 
It will consist of three major parts:<\/p>\n<ul>\n<li><strong>Feature Extractor &#8211;<\/strong> The feature vector extracted from the image has a size of 2048; a dense layer reduces it to 256 nodes.<\/li>\n<li><strong>Sequence Processor &#8211;<\/strong> An embedding layer will handle the textual input, followed by the LSTM layer.<\/li>\n<li><strong>Decoder &#8211;<\/strong> The outputs of the two parts above are merged and passed through a dense layer to make the final prediction. The final layer has as many nodes as our vocabulary size.<\/li>\n<\/ul>\n<p>A visual representation of the final model is given below:<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/model-python-machine-learning-project.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72807\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/model-python-machine-learning-project.png\" alt=\"final model - python data science project\" width=\"829\" height=\"737\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/model-python-machine-learning-project.png 829w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/model-python-machine-learning-project-150x133.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/model-python-machine-learning-project-300x267.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/model-python-machine-learning-project-768x683.png 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/model-python-machine-learning-project-520x462.png 520w\" sizes=\"auto, (max-width: 829px) 100vw, 829px\" \/><\/a><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">from keras.utils import plot_model\r\n\r\n# define the 
captioning model\r\ndef define_model(vocab_size, max_length):\r\n\r\n    # features from the CNN model squeezed from 2048 to 256 nodes\r\n    inputs1 = Input(shape=(2048,))\r\n    fe1 = Dropout(0.5)(inputs1)\r\n    fe2 = Dense(256, activation='relu')(fe1)\r\n\r\n    # LSTM sequence model\r\n    inputs2 = Input(shape=(max_length,))\r\n    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)\r\n    se2 = Dropout(0.5)(se1)\r\n    se3 = LSTM(256)(se2)\r\n\r\n    # Merging both models\r\n    decoder1 = add([fe2, se3])\r\n    decoder2 = Dense(256, activation='relu')(decoder1)\r\n    outputs = Dense(vocab_size, activation='softmax')(decoder2)\r\n\r\n    # tie it together [image, seq] [word]\r\n    model = Model(inputs=[inputs1, inputs2], outputs=outputs)\r\n    model.compile(loss='categorical_crossentropy', optimizer='adam')\r\n\r\n    # summarize model\r\n    print(model.summary())\r\n    plot_model(model, to_file='model.png', show_shapes=True)\r\n\r\n    return model<\/pre>\n<p><strong>8. Training the model<\/strong><\/p>\n<p>To train the model, we will be using the 6000 training images by generating the input and output sequences in batches and fitting them to the model using model.fit_generator() method. We also save the model to our models folder. 
This will take some time depending on your system capability.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\"># train our model\r\nprint('Dataset: ', len(train_imgs))\r\nprint('Descriptions: train=', len(train_descriptions))\r\nprint('Photos: train=', len(train_features))\r\nprint('Vocabulary Size:', vocab_size)\r\nprint('Description Length: ', max_length)\r\n\r\nmodel = define_model(vocab_size, max_length)\r\nepochs = 10\r\nsteps = len(train_descriptions)\r\n# making a directory models to save our models (no error if it already exists)\r\nos.makedirs(\"models\", exist_ok=True)\r\nfor i in range(epochs):\r\n    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)\r\n    model.fit_generator(generator, epochs=1, steps_per_epoch= steps, verbose=1)\r\n    model.save(\"models\/model_\" + str(i) + \".h5\")<\/pre>\n<p><strong>9. Testing the model<\/strong><\/p>\n<p>Now that the model has been trained, we will create a separate file, testing_caption_generator.py, which loads the model and generates predictions. The model predicts index values, up to max_length of them per caption, so we will use the same tokenizer.p pickle file to map those index values back to words.<\/p>\n<p><strong>Code:<\/strong><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"null\">import numpy as np\r\nfrom PIL import Image\r\nimport matplotlib.pyplot as plt\r\nimport argparse\r\nfrom pickle import load\r\nfrom keras.models import load_model\r\nfrom keras.applications.xception import Xception\r\nfrom keras.preprocessing.sequence import pad_sequences\r\n\r\n\r\nap = argparse.ArgumentParser()\r\nap.add_argument('-i', '--image', required=True, help=\"Image Path\")\r\nargs = vars(ap.parse_args())\r\nimg_path = args['image']\r\n\r\ndef extract_features(filename, model):\r\n        try:\r\n            image = Image.open(filename)\r\n\r\n        except:\r\n            print(\"ERROR: Couldn't open image! 
Make sure the image path and extension is correct\")\r\n            # stop here: without an image there is nothing to process\r\n            raise\r\n        image = image.resize((299,299))\r\n        image = np.array(image)\r\n        # for images that have 4 channels, we convert them into 3 channels\r\n        if image.shape[2] == 4: \r\n            image = image[..., :3]\r\n        image = np.expand_dims(image, axis=0)\r\n        image = image\/127.5\r\n        image = image - 1.0\r\n        feature = model.predict(image)\r\n        return feature\r\n\r\ndef word_for_id(integer, tokenizer):\r\n    # look up the word that the tokenizer mapped to this index\r\n    for word, index in tokenizer.word_index.items():\r\n        if index == integer:\r\n            return word\r\n    return None\r\n\r\n\r\ndef generate_desc(model, tokenizer, photo, max_length):\r\n    in_text = 'start'\r\n    for i in range(max_length):\r\n        sequence = tokenizer.texts_to_sequences([in_text])[0]\r\n        sequence = pad_sequences([sequence], maxlen=max_length)\r\n        pred = model.predict([photo,sequence], verbose=0)\r\n        pred = np.argmax(pred)\r\n        word = word_for_id(pred, tokenizer)\r\n        if word is None:\r\n            break\r\n        in_text += ' ' + word\r\n        if word == 'end':\r\n            break\r\n    return in_text\r\n\r\n\r\n#path = 'Flicker8k_Dataset\/111537222_07e56d5a30.jpg'\r\nmax_length = 32\r\ntokenizer = load(open(\"tokenizer.p\",\"rb\"))\r\nmodel = load_model('models\/model_9.h5')\r\nxception_model = Xception(include_top=False, pooling=\"avg\")\r\n\r\nphoto = extract_features(img_path, xception_model)\r\nimg = Image.open(img_path)\r\n\r\ndescription = generate_desc(model, tokenizer, photo, max_length)\r\nprint(\"\\n\\n\")\r\nprint(description)\r\nplt.imshow(img)\r\n# when run as a script, show() is needed to open the image window\r\nplt.show()\r\n\r\n<\/pre>\n<p><strong>Results:<\/strong><\/p>\n<p><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-standing-on-rock.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72819\" 
src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-standing-on-rock.png\" alt=\"image caption generator - man standing on rock\" width=\"1366\" height=\"522\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-standing-on-rock.png 1366w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-standing-on-rock-150x57.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-standing-on-rock-300x115.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-standing-on-rock-768x293.png 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-standing-on-rock-1024x391.png 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-standing-on-rock-520x199.png 520w\" sizes=\"auto, (max-width: 1366px) 100vw, 1366px\" \/><\/a><\/p>\n<p><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-girls-playing.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72820\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-girls-playing.png\" alt=\"image caption generator - girls playing\" width=\"1366\" height=\"532\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-girls-playing.png 1366w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-girls-playing-150x58.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-girls-playing-300x117.png 300w, 
https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-girls-playing-768x299.png 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-girls-playing-1024x399.png 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-girls-playing-520x203.png 520w\" sizes=\"auto, (max-width: 1366px) 100vw, 1366px\" \/><\/a><\/p>\n<p><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-on-kayak.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-72821\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-on-kayak.png\" alt=\"python project on image caption generator - man on kayak\" width=\"1366\" height=\"531\" srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-on-kayak.png 1366w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-on-kayak-150x58.png 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-on-kayak-300x117.png 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-on-kayak-768x299.png 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-on-kayak-1024x398.png 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/image-caption-generator-man-on-kayak-520x202.png 520w\" sizes=\"auto, (max-width: 1366px) 100vw, 1366px\" \/><\/a><\/p>\n<h2>Summary<\/h2>\n<p>In this advanced Python project, we have implemented a CNN-RNN model by building an image caption generator. 
One key point to note is that our model depends on its training data, so it cannot predict words that are outside its vocabulary. We used a small dataset consisting of 8000 images. Production-level models need to be trained on datasets larger than 100,000 images, which can produce more accurate models.<\/p>\n<p class=\"df-text-bold df-text-red\" style=\"text-align: center;\">Rock the Python interview round<\/p>\n<p class=\"df-text-bold\" style=\"text-align: center;\">Practise <a href=\"https:\/\/data-flair.training\/blogs\/top-python-interview-questions-answer\/\">150+ Python Interview Questions<\/a><\/p>\n<p>We hope you enjoyed building this Python based project with us. You can ask your questions in the comment section below.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Project based on Python &#8211; Image Caption 
Generator\u00a0 You saw an image and your brain can easily tell what the image is about, but can a computer tell what the image is representing? Computer&#46;&#46;&#46;<\/p>\n","protected":false},"author":7,"featured_media":72832,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[46],"tags":[21075,21444,21442,21443,21082],"class_list":["post-72771","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python","tag-advanced-python-project","tag-image-caption-generator","tag-python-based-project","tag-python-data-science-project","tag-python-project"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Python based Project - Learn to Build Image Caption Generator with CNN &amp; LSTM - DataFlair<\/title>\n<meta name=\"description\" content=\"Python based project on image caption generator - Learn to build a working model of image caption generator by implementing CNN &amp; a type of RNN (LSTM) together.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Python based Project - Learn to Build Image Caption Generator with CNN &amp; LSTM - DataFlair\" \/>\n<meta property=\"og:description\" content=\"Python based project on image caption generator - Learn to build a working model of image caption generator by implementing CNN &amp; a type of RNN (LSTM) together.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/\" \/>\n<meta property=\"og:site_name\" 
content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2019-11-14T06:53:48+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-08-06T18:30:27+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/python-project-image-caption-generator-with-CNN-and-LSTM.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"802\" \/>\n\t<meta property=\"og:image:height\" content=\"420\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"18 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Python based Project - Learn to Build Image Caption Generator with CNN &amp; LSTM - DataFlair","description":"Python based project on image caption generator - Learn to build a working model of image caption generator by implementing CNN & a type of RNN (LSTM) together.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/","og_locale":"en_US","og_type":"article","og_title":"Python based Project - Learn to Build Image Caption Generator with CNN &amp; LSTM - DataFlair","og_description":"Python based project on image caption generator - Learn to build a working model of image caption generator by implementing CNN & a type of RNN (LSTM) together.","og_url":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2019-11-14T06:53:48+00:00","article_modified_time":"2020-08-06T18:30:27+00:00","og_image":[{"width":802,"height":420,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/python-project-image-caption-generator-with-CNN-and-LSTM.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. 
reading time":"18 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/beb0cab24b7aa54423a3b50e669a9dcd"},"headline":"Python based Project &#8211; Learn to Build Image Caption Generator with CNN &amp; LSTM","datePublished":"2019-11-14T06:53:48+00:00","dateModified":"2020-08-06T18:30:27+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/"},"wordCount":2211,"commentCount":146,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/python-project-image-caption-generator-with-CNN-and-LSTM.jpg","keywords":["Advanced python project","Image Caption Generator","python based project","Python data science project","Python project"],"articleSection":["Python Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/","url":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/","name":"Python based Project - Learn to Build Image Caption Generator with CNN &amp; LSTM - 
DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/python-project-image-caption-generator-with-CNN-and-LSTM.jpg","datePublished":"2019-11-14T06:53:48+00:00","dateModified":"2020-08-06T18:30:27+00:00","description":"Python based project on image caption generator - Learn to build a working model of image caption generator by implementing CNN & a type of RNN (LSTM) together.","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/python-project-image-caption-generator-with-CNN-and-LSTM.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2019\/11\/python-project-image-caption-generator-with-CNN-and-LSTM.jpg","width":802,"height":420,"caption":"python based project - image caption generator with CNN and LSTM"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/python-based-project-image-caption-generator-cnn\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"Python 
Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/python\/"},{"@type":"ListItem","position":3,"name":"Python based Project &#8211; Learn to Build Image Caption Generator with CNN &amp; LSTM"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/beb0cab24b7aa54423a3b50e669a9dcd","name":"DataFlair 
Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/c322416204232f4dd97ef3901b0a499a5d34d7ba7fe333f4bfe53a907873d293?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/c322416204232f4dd97ef3901b0a499a5d34d7ba7fe333f4bfe53a907873d293?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/c322416204232f4dd97ef3901b0a499a5d34d7ba7fe333f4bfe53a907873d293?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"DataFlair Team specializes in creating clear, actionable content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Backed by industry expertise, we make learning easy and career-oriented for beginners and pros alike.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam3\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/72771","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=72771"}],"version-history":[{"count":31,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/72771\/revisions"}],"predecessor-version":[{"id":78617,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/72771\/revisions\/78617"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/72832"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=72771"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=72771"},{"
taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=72771"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}