70+ Machine Learning Datasets – Gain real-world experience with Data Science projects
Finding the right dataset while researching for machine learning or data science projects is quite a difficult task. To build accurate models, you also need a large amount of data. Fortunately, many researchers, organizations, and individuals have shared their work, and we can use their datasets in our projects. In this article, we will discuss more than 70 machine learning datasets that you can use to build your next data science project.
Machine Learning Datasets
These are the datasets that you will probably use while working on any data science or machine learning project:
Machine Learning Datasets for Data Science Beginners
1. Mall Customers Dataset
The Mall Customers dataset contains information about people visiting a mall. It includes gender, customer ID, age, annual income, and spending score. You can use it to draw insights from the data and group customers based on their behavior.
1.1 Data Link: mall customers dataset
1.2 Data Science Project Idea: Segment the customers based on age, gender, and interests. Customer segmentation is the practice of dividing a customer base into groups of similar individuals. It is useful in customized marketing.
1.3 Source Code: Customer Segmentation Project with Machine Learning
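One common way to segment customers is k-means clustering. The snippet below is a minimal sketch in plain Python, not the linked project's code; the six (annual income, spending score) pairs are made-up values, and the function names are our own.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct customers
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New centroid is the mean of the cluster's members.
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]

# Hypothetical customers: (annual income in thousands, spending score).
customers = [(15, 20), (18, 15), (20, 25), (80, 85), (85, 90), (90, 80)]
centroids, labels = kmeans(customers, k=2)
print(labels)
```

With the real dataset you would load the income and spending-score columns from the CSV and try several values of k.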
2. Iris Dataset
The iris dataset is a simple and beginner-friendly dataset that contains information about the flower petal and sepal sizes. The dataset has 3 classes with 50 instances in each class, therefore, it contains 150 rows with only 4 columns.
2.1 Data Link: Iris dataset
2.2 Data Science Project Idea: Implement a machine learning classification or regression model on the dataset. Classification is the task of assigning each item to its corresponding class.
3. MNIST Dataset
This is a database of handwritten digits. It contains 60,000 training images and 10,000 testing images. This is a perfect dataset to start implementing image classification where you can classify a digit from 0 to 9.
3.1 Data Link: MNIST dataset
3.2 Data Science Project Idea: Implement a machine learning classification algorithm on the images to recognize handwritten digits.
4. The Boston Housing Dataset
This is a popular dataset used in pattern recognition. It contains information about the different houses in Boston based on crime rate, tax, number of rooms, etc. It has 506 rows and 14 different variables in columns. You can use this dataset to predict house prices.
4.1 Data Link: Boston dataset
4.2 Data Science Project Idea: Predict the price of a new house using linear regression. Linear regression predicts the output value for an unseen input when the input and output variables have a roughly linear relationship.
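To make the idea of linear regression concrete, here is a small sketch of ordinary least squares with a single input variable. The room counts and prices are invented so that the relationship is exactly linear; they are not rows from the Boston dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: y is approximated by slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: average number of rooms and price, with price = 5 * rooms - 8.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [12.0, 17.0, 22.0, 27.0, 32.0]

slope, intercept = fit_line(rooms, prices)
predicted = slope * 6.5 + intercept  # price estimate for a 6.5-room house
print(slope, intercept, predicted)
```

Real housing data is noisy and has many input columns, so in practice you would fit a multi-variable model with a library and check the error on held-out rows.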
5. Fake News Detection Dataset
It is a CSV file that has 7796 rows with 4 columns. The first column identifies the news item, the second holds the title, the third holds the news text, and the fourth is the label, REAL or FAKE.
5.1 Data Link: Fake news detection dataset
5.2 Data Science Project Idea: Build a fake news detection model with the Passive Aggressive Classifier algorithm. The Passive Aggressive algorithm can classify massive streams of data, and it can be implemented quickly.
5.3 Source Code: Fake News Detection Python Project
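The Passive Aggressive idea is simple enough to sketch by hand: stay passive when an example is classified with enough margin, and update aggressively when it is not. The code below is an illustrative version of the PA-I update on bag-of-words counts, not the linked project's implementation; the vocabulary and headlines are made up, and labels are encoded as +1 for fake and -1 for genuine news.

```python
def featurize(text, vocabulary):
    """Bag-of-words counts for the words in a fixed vocabulary."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

def train_passive_aggressive(samples, labels, C=1.0, epochs=5):
    """PA-I training; labels are +1 (fake) or -1 (genuine)."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(w * xi for w, xi in zip(weights, x))
            loss = max(0.0, 1.0 - margin)  # hinge loss
            norm_sq = sum(xi * xi for xi in x)
            if loss > 0.0 and norm_sq > 0.0:
                # Aggressive step: just large enough to fix the mistake, capped by C.
                tau = min(C, loss / norm_sq)
                weights = [w + tau * y * xi for w, xi in zip(weights, x)]
            # Otherwise stay passive: the example already has margin >= 1.
    return weights

def predict(weights, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1

# Hypothetical vocabulary and headlines, for illustration only.
vocabulary = ["shocking", "miracle", "secret", "report", "official", "study"]
headlines = [
    ("shocking miracle cure doctors hide", 1),
    ("secret shocking truth revealed", 1),
    ("official report on budget", -1),
    ("study report published today", -1),
]
samples = [featurize(text, vocabulary) for text, _ in headlines]
labels = [label for _, label in headlines]
weights = train_passive_aggressive(samples, labels)
print(predict(weights, featurize("shocking secret", vocabulary)))
```

Because each update touches only one example, the same loop works on a stream of articles that never fits in memory at once.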
6. Wine quality dataset
The dataset contains chemical information about wine. The white-wine file has 4898 instances, each with 11 physicochemical input variables and a quality score. The dataset is good for classification and regression tasks. The model can be used to predict wine quality.
6.1 Data Link: Wine quality dataset
6.2 Data Science Project Idea: Apply several machine learning algorithms such as regression, decision trees, and random forests, then compare the models and analyse their performance.
7. SOCR data – Heights and Weights Dataset
This is a simple dataset to start with. It contains only the heights (inches) and weights (pounds) of 25,000 different 18-year-old humans. This dataset can be used to build a model that can predict the height or weight of a human.
7.1 Data Link: Heights & weights dataset
7.2 Data Science Project Idea: Build a predictive model for determining height or weight of a person. Implement a linear regression model that will be used for predicting height or weight.
8. Parkinson Dataset
Parkinson’s disease is a nervous system disorder that affects movement. The dataset contains 195 records with 23 attributes, which hold biomedical measurements. The data is used to separate healthy people from people with Parkinson’s disease.
8.1 Data Link: Parkinson dataset
8.2 Data Science Project Idea: The model can be used to differentiate healthy people from people having Parkinson’s disease. An algorithm that is useful for this purpose is XGBoost, which stands for extreme gradient boosting and is based on decision trees.
8.3 Source Code: Machine Learning Project on Detecting Parkinson’s Disease
9. Titanic Dataset
On 15 April 1912, the supposedly unsinkable Titanic sank, killing 1502 of the 2224 passengers and crew on board. The dataset contains information such as name, age, sex, and number of siblings aboard for 891 passengers in the training set and 418 passengers in the testing set.
9.1 Data Link: Titanic dataset
9.2 Data Science Project Idea: Build a fun model to predict whether a person would have survived on the Titanic or not. Because the outcome is yes or no, logistic regression is a good fit for this purpose.
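Survival is a yes/no outcome, so a classifier such as logistic regression is a natural choice. The sketch below trains one with plain gradient descent; the eight passengers and the two features (is_female, is_first_class) are invented for illustration and are not rows from the Titanic dataset.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def survival_probability(weights, bias, x):
    """Predicted probability of survival for one passenger."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic_regression(rows, labels, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = survival_probability(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Hypothetical passengers: (is_female, is_first_class) and whether they survived.
passengers = [(1, 1), (1, 1), (1, 0), (1, 0), (0, 1), (0, 1), (0, 0), (0, 0)]
survived = [1, 1, 1, 0, 1, 0, 0, 0]

weights, bias = train_logistic_regression(passengers, survived)
print(survival_probability(weights, bias, (1, 1)))  # female, first class
print(survival_probability(weights, bias, (0, 0)))  # male, not first class
```

On the real data you would also encode age, fare, and family size, and evaluate on the separate test set.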
10. Uber Pickups Dataset
The dataset has information on about 4.5 million Uber pickups in New York City from April 2014 to September 2014 and 14 million more from January 2015 to June 2015. Users can perform data analysis and gather insights from the data.
10.1 Data Link: Uber pickups dataset
10.2 Data Science Project Idea: Analyse the customer ride data and visualise it to find insights that can help improve the business. Data analysis and visualisation are an important part of data science. Analysis gathers insights from the data, and visualisation lets you take in that information quickly.
10.3 Source Code: Uber Data Analysis Project in R
11. Chars74k Dataset
The dataset contains images of character symbols used in the English and Kannada languages. The English set has 62 classes (0-9, A-Z, a-z), with 7.7k characters from natural images, 3.4k hand-drawn characters, and 62k computer-synthesized fonts.
11.1 Data Link: Chars 74k dataset
11.2 Data Science Project Idea: Implement character recognition in natural images. Character recognition is the process of automatically identifying characters from written papers or printed texts.
12. Credit Card Fraud Detection Dataset
The dataset contains transactions made by credit cards, each labeled as fraudulent or genuine. This is important for companies that have transaction systems and want to build a model for detecting fraudulent activities.
12.1 Data Link: Credit card fraud detection dataset
12.2 Data Science Project Idea: Implement different algorithms such as decision trees, logistic regression and artificial neural networks to see which gives better accuracy. Compare the results of each algorithm and understand the behaviour of the models.
12.3 Source Code: Credit Card Fraud Detection Machine Learning Project
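When comparing models on fraud data, accuracy alone can mislead if fraudulent transactions are rare (an assumption here, since the class balance is not stated above). The sketch below computes accuracy, precision, and recall for two hypothetical models on 100 made-up transactions, 2 of which are fraudulent.

```python
def evaluate(actual, predicted):
    """Accuracy, precision and recall for binary labels (1 = fraud, 0 = genuine)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up ground truth: 2 frauds followed by 98 genuine transactions.
actual = [1, 1] + [0] * 98

# Hypothetical model A never flags anything; model B catches both frauds
# but raises 3 false alarms.
always_genuine = [0] * 100
cautious = [1, 1] + [1, 1, 1] + [0] * 95

scores_a = evaluate(actual, always_genuine)
scores_b = evaluate(actual, cautious)
print(scores_a)
print(scores_b)
```

Model A has the higher accuracy even though it finds no fraud at all, which is why recall and precision are worth reporting alongside accuracy.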
Machine Learning Datasets for Natural Language Processing
1. Enron Email Dataset
This Enron dataset is popular in natural language processing. It contains around 0.5 million emails from over 150 users, most of whom were senior management at Enron. The size of the data is around 432 MB.
1.1 Data Link: Enron email dataset
1.2 Machine Learning Project Idea: Use k-means clustering to build a model to detect fraudulent activities. K-means clustering is a popular unsupervised learning algorithm. It partitions the observations into k number of clusters by observing similar patterns in the data.
2. The Yelp Dataset
Yelp has made its dataset publicly available, but you have to fill in a form first to access the data. It contains 1.2 million tips by 1.6 million users, over 1.2 million business attributes and photos for natural language processing tasks.
2.1 Data Link: Yelp dataset
2.2 Machine Learning Project Idea: You can build a model that detects whether a restaurant review is fake or real. With text processing and the additional features in the dataset, you can build an SVM model that classifies reviews as fake or real.
3. Jeopardy Dataset
Jeopardy! is an American television game show in which general knowledge questions are asked with a twist. The dataset contains 200k+ questions and answers in a CSV or JSON file.
3.1 Data Link: Jeopardy dataset
3.2 Machine Learning Project Idea: Build a question answering system and implement it in a bot that can play the game of Jeopardy! with users. The bot can be used on any platform like Telegram, Discord, Reddit, etc.
4. Recommender Systems Dataset
This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites, such as Goodreads book reviews, Amazon product reviews, bartending data, and data from social media, that are used in building recommender systems.
4.1 Data Link: Recommender systems dataset
4.2 Machine Learning Project Idea: Build a product recommendation system like Amazon’s. A recommendation system can suggest products, movies, etc. based on your interests and the things you have liked and used earlier.
4.3 Source Code: Movie Recommendation System Project in R
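One simple way to suggest items based on what similar people liked is user-based collaborative filtering with cosine similarity. This is a toy sketch rather than the linked project's method; the users, titles, and ratings are all made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(ratings, user, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    items = sorted({item for user_ratings in ratings.values() for item in user_ratings})

    def vector(name):
        # Unrated items count as 0 in the rating vector.
        return [ratings[name].get(item, 0) for item in items]

    scores = {}
    for item in items:
        if item in ratings[user]:
            continue  # only recommend unseen items
        weighted, total = 0.0, 0.0
        for other in ratings:
            if other != user and item in ratings[other]:
                similarity = cosine(vector(user), vector(other))
                weighted += similarity * ratings[other][item]
                total += similarity
        if total > 0:
            scores[item] = weighted / total
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana": {"Inception": 5, "Interstellar": 4, "Titanic": 1},
    "ben": {"Inception": 5, "Interstellar": 5, "The Notebook": 1, "Arrival": 5},
    "cara": {"Titanic": 5, "The Notebook": 5, "Inception": 1},
}
print(recommend(ratings, "ana", top_n=2))
```

Here "ana" rates films much like "ben" does, so the title "ben" liked and "ana" has not seen is ranked first.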
5. UCI Spambase Dataset
Classifying emails as spam or non-spam is a very common and useful task. The dataset contains 4601 emails, each described by 57 attributes. You can build models to filter out the spam.
5.1 Data Link: UCI spambase dataset
5.2 Machine Learning Project Idea: You can build a model that can identify your emails as spam or non-spam.
6. Flickr 30k Dataset
The Flickr 30k dataset is similar to the Flickr 8k dataset and it contains more labeled images. This has over 30,000 images and their captions. This dataset is used to build more accurate models than the Flickr 8k dataset.
6.1 Data Link: Flickr image dataset
6.2 Machine Learning Project Idea: Use the same model as for Flickr 8k and make it more accurate with more training data. The CNN model is great for extracting features from the image, and we then feed the features to a recurrent neural network that generates the caption.
7. IMDB reviews
The large movie review dataset consists of movie reviews from the IMDB website, with 25,000 reviews for training and 25,000 for testing.
7.1 Data Link: IMDB reviews dataset
7.2 Machine Learning Project Idea: Perform sentiment analysis on the data to see what types of movies users like. Sentiment analysis is the process of analysing textual data and identifying the emotion of the user as positive or negative.
7.3 Source Code: Sentiment Analysis Data Science Project
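As an illustration of how sentiment analysis can work, here is a tiny multinomial Naive Bayes classifier with add-one smoothing. It is a sketch under simple assumptions (whitespace tokenization, four invented reviews) and is not the linked project's approach.

```python
import math
from collections import Counter

def train_naive_bayes(reviews):
    """Count words and documents per label from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in reviews:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability for the text."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training reviews.
reviews = [
    ("a great and moving film", "positive"),
    ("great acting and a wonderful story", "positive"),
    ("boring plot and terrible acting", "negative"),
    ("a dull and terrible film", "negative"),
]
word_counts, doc_counts = train_naive_bayes(reviews)
print(classify("wonderful moving story", word_counts, doc_counts))
```

With the full review set you would add proper tokenization and stop-word handling, and measure accuracy on the held-out test reviews.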
8. MS COCO dataset
Microsoft’s COCO is a huge database for object detection, segmentation and image captioning tasks. It has around 330,000 images containing about 1.5 million labeled object instances. The dataset is great for building production-ready models.
8.1 Data Link: MS COCO dataset
8.2 Machine Learning Project Idea: Detect objects in an image and then generate captions for them. An LSTM (long short-term memory) network generates the English sentences, and a CNN extracts features from the image. To build a caption generator, we have to combine these two models.
9. Flickr 8k Dataset
The Flickr 8k dataset contains 8000 images and each image is labeled with 5 different captions. The dataset is used to build an image caption generator.
9.1 Data Link: Flickr 8k dataset
9.2 Machine Learning Project Idea: Build an image caption generator using a CNN-RNN model. An image caption generator model analyses the features of an image and generates an English-like sentence that describes it.
9.3 Source Code: Image Caption Generator Python Project
Machine Learning Datasets for Computer Vision and Image Processing
1. CIFAR-10 and CIFAR-100 dataset
These are two datasets. The CIFAR-10 dataset contains 60,000 tiny images of 32×32 pixels. They are labeled from 0 to 9, with each digit representing a class. CIFAR-100 is similar to CIFAR-10, but it has 100 classes instead of 10. These datasets are good for implementing image classification.
1.1 Data Link: CIFAR dataset
1.2 Artificial Intelligence Project Idea: Perform image classification on different objects and build a model. In image classification, we take an image as input and the goal is to determine which category the image belongs to.
2. GTSRB (German traffic sign recognition benchmark) Dataset
The GTSRB dataset contains around 50,000 images of traffic signs belonging to 43 different classes and contains information on the bounding box of each sign. The dataset is used for multiclass classification.
2.1 Data Link: GTSRB dataset
2.2 Artificial Intelligence Project Idea: Build a model using a deep learning framework that classifies traffic signs and also recognises the bounding box of each sign. Traffic sign classification is also useful in autonomous vehicles for identifying signs and then taking appropriate actions.
2.3 Source Code: Traffic Signs Recognition Python Project
3. ImageNet dataset
ImageNet is a large image database that is organized according to the WordNet hierarchy. It has over 100,000 phrases and an average of 1000 images per phrase. The size exceeds 150 GB. It is suitable for image recognition, face recognition, object detection, etc. It also hosts a challenging competition named ILSVRC in which people build increasingly accurate models.
3.1 Data Link: Imagenet Dataset
3.2 Artificial Intelligence Project Idea: Implement image classification on this huge database and recognise objects. CNN (convolutional neural network) models are the usual choice for getting accurate results on this project.
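The building block of a CNN is a small kernel that slides over the image and produces a feature map. The sketch below shows that sliding-window operation (technically a cross-correlation without padding) in plain Python on a made-up 4×4 image; a real project would use a deep learning framework with learned kernels.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kernel_h) for j in range(kernel_w))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Made-up image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, kernel)
print(feature_map)
```

The feature map is non-zero only in the column where the edge sits, which is the kind of local pattern early CNN layers pick up.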
4. Breast Histopathology Images Dataset
This dataset contains 277,524 image patches of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. There are 198,738 patches that are negative for IDC and 78,786 that are positive.
4.1 Data Link: Breast histopathology dataset
4.2 Artificial Intelligence Project Idea: Build a model that can classify breast cancer. You can build the image classification model with convolutional neural networks.
4.3 Source Code: Breast Cancer Classification Python Project
5. Cityscapes Dataset
This is an open-source dataset for computer vision projects. It contains high-quality pixel-level annotations of video sequences recorded in the streets of 50 different cities. The dataset is useful in semantic segmentation and in training deep neural networks to understand urban scenes.
5.1 Data Link: Cityscapes dataset
5.2 Artificial Intelligence Project Idea: Perform image segmentation and detect different objects in a video of the road. Image segmentation is the process of digitally partitioning an image into categories like cars, buses, people, trees, roads, etc.
6. Kinetics Dataset
There are three different datasets for Kinetics: Kinetics 400, Kinetics 600 and Kinetics 700. This is a large-scale dataset that contains URL links to around 650,000 high-quality video clips.
6.1 Data Link: Kinetics dataset
6.2 Artificial Intelligence Project Idea: Build a human action recognition model and detect the action of a human. A human action is recognized from a series of observations, such as consecutive video frames.
7. MPII human pose dataset
The MPII human pose dataset contains 25,000 images with over 40,000 people with annotated body joints. The overall dataset covers over 410 human activities. The dataset is 12.9 GB in size.
7.1 Data Link: MPII human pose dataset
7.2 Artificial Intelligence Project Idea: Detect different human poses based on the alignment of a person’s body joints. Human pose detection, also known as the localization of human joints, tracks the position of each joint as the body moves.
8. 20BN-something-something dataset v2
This is a huge dataset of high-quality video clips that show humans performing actions like picking something up, putting something down, opening something, closing something, etc. It has 220,847 videos in total.
8.1 Data Link: Something-something dataset
8.2 Artificial Intelligence Project Idea: Implement a human action recognition model and detect different activities performed by a human. Such a model can be used to detect activities while driving, in surveillance, etc.
9. Object 365 Dataset
The Object 365 dataset is a large collection of high-quality images with bounding boxes of objects. It has 365 object categories, 600k images, and 10 million bounding boxes. This is good for making object detection models.
9.1 Data Link: Object 365 dataset
9.2 Artificial Intelligence Project Idea: Classify images captured from the camera and detect objects present in the image. Object detection deals with recognizing which object is present in the image along with the coordinates of the object.
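Since object detection predicts box coordinates, predictions are commonly scored by how well a predicted box overlaps the labeled one, using intersection over union (IoU). The sketch below assumes boxes in (x_min, y_min, x_max, y_max) format; check the dataset's documentation for the format it actually uses.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    inter_x_min = max(box_a[0], box_b[0])
    inter_y_min = max(box_a[1], box_b[1])
    inter_x_max = min(box_a[2], box_b[2])
    inter_y_max = min(box_a[3], box_b[3])

    inter_w = max(0, inter_x_max - inter_x_min)
    inter_h = max(0, inter_y_max - inter_y_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0

# Made-up boxes: a labeled box and a prediction shifted by 5 pixels.
labeled = (0, 0, 10, 10)
predicted_box = (5, 5, 15, 15)
print(iou(labeled, predicted_box))
```

A detection is often counted as correct when its IoU with the labeled box exceeds a chosen threshold such as 0.5.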
10. Photo sketching dataset
The dataset contains images paired with their contour drawings. It has 1000 outdoor images, each paired with 5 rough contour drawings that represent the outline of the image.
10.1 Data Link: Photo sketching dataset
10.2 Artificial Intelligence Project Idea: Build a model that can develop sketches automatically from the images. This will take an image as an input and generate a sketch image using computer vision techniques.
11. CQ500 Dataset
This publicly available dataset has 491 head CT scans with 193,317 slices. It contains the opinions of three different radiologists on each scan. The dataset can be used to build models that can detect bleeding, fractures and mass effect in the head.
11.1 Data Link: CQ 500 dataset
11.2 Artificial Intelligence Project Idea: Make a model for hospitals that can automatically generate a report of a fracture, bleeding or other things by analyzing the CT scan dataset.
12. IMDB-Wiki dataset
The IMDB-Wiki dataset is one of the largest open-source datasets of face images with labeled gender and age. The images are collected from IMDB and Wikipedia. It has over 500,000 labeled images.
12.1 Data Link: IMDB wiki dataset
12.2 Artificial Intelligence Project Idea: Make a model that will detect faces and predict their gender and age. You can group ages into ranges like 0-10, 10-20, 20-30, 30-40, etc.
12.3 Source Code: Gender & Age Detection Python Project
Machine Learning Datasets for Deep Learning
1. Youtube 8M Dataset
The YouTube 8M dataset is a large-scale labeled video dataset that has 6.1 million YouTube video IDs, 350,000 hours of video, 2.6 billion audio/visual features, 3862 classes, and an average of 3 labels per video. It is used for video classification purposes.
1.1 Data Link: Youtube 8M
1.2 Machine Learning Project Idea: Use the dataset for video classification and build a model that can describe what a video is about. The model takes a series of frames as input and classifies which category the video belongs to.
2. Urban Sound 8K dataset
The urban sound dataset contains 8732 urban sounds from 10 classes like an air conditioner, dog bark, drilling, siren, street music, etc. The dataset is popular for urban sound classification problems.
2.1 Data Link: Urban Sound 8K dataset
2.2 Machine Learning Project Idea: We can build a sound classification system to detect the type of urban sound playing in the background. This will help you get started with audio data and understand how to work with unstructured data.
3. LSUN Dataset
Large-scale Scene Understanding (LSUN) is a dataset of millions of colored images of scenes and objects. It is much bigger than the ImageNet dataset. There are around 59 million images, 10 different scene categories, and 20 different object categories.
3.1 Data Link: LSUN dataset
3.2 Machine Learning Project Idea: Build a model to detect what scene is in the image, for example a classroom, bridge, bedroom, church_outdoor, etc. The goal of scene understanding is to gather as much knowledge of a given scene image as possible. It includes categorization, object detection, and object segmentation.
4. RAVDESS Dataset
RAVDESS is the acronym of The Ryerson Audio-Visual Database of Emotional Speech and Song. It contains audio files of 24 actors (12 male, 12 female) with different emotions like calm, angry, sad, happy, fearful, etc. The expressions have two intensity levels, normal and strong. The dataset is useful for speech emotion recognition.
4.1 Data Link: RAVDESS dataset
4.2 Machine Learning Project Idea: Build a speech emotion recognition classifier to detect the emotion of the speaker. The audio clips of people are classified into emotions like angry, happy, sad, etc.
4.3 Source Code: Speech Emotion Recognition Python Project
5. Librispeech Dataset
This dataset contains a large number of English speech recordings derived from the LibriVox project. It has 1000 hours of read English speech in various accents. It is used for speech recognition projects.
5.1 Data Link: Librispeech dataset
5.2 Machine Learning Project Idea: Build a speech recognition model to detect what is being said and convert it into text. The objective of speech recognition is to automatically identify what is being said in audio.
6. Baidu Apolloscape Dataset
The dataset is designed to promote the development of self-driving technologies. It contains high-resolution color videos with hundreds of thousands of frames and their pixel annotations, stereo image, dense point cloud, etc. The dataset has 25 different semantic items like cars, pedestrians, cycles, street lights, etc.
6.1 Data Link: Baidu apolloscape dataset
6.2 Machine Learning Project Idea: Build a self-driving robot that can identify different objects on the road and take action accordingly. The model segments the objects in the image, which helps the robot avoid collisions and plan its own path.
Machine Learning Datasets for Finance and Economics
1. Quandl Data Portal
Quandl is a vast repository of economic and financial data. Some of the datasets are free, while others need to be purchased. The large quantity and good quality of the data make this platform a strong choice for finding datasets for production-ready models.
1.1 Data Link: Quandl datasets
2. The World Bank Open Data Portal
The World Bank is a global development organization that offers loans to developing countries. It holds a huge amount of data on all its programs, which is publicly available. The data has many missing values, so it gives you experience with real-world data.
2.1 Data Link: World bank open datasets
3. IMF Data Portal
The IMF (International Monetary Fund) publishes data on international finances, debt rates, investments, foreign exchange reserves, and commodities.
3.1 Data Link: IMF datasets
4. American Economic Association (AEA) Data Portal
The American Economic Association has a wealth of data available online and is a great resource for finding US macroeconomic data.
4.1 Data Link: AEA datasets
5. Google Trends Data Portal
Google Trends lets you examine and analyze search data visually. You can also download the data as CSV files with a simple click. We can find out what’s trending and what people are searching for.
5.1 Data Link: Google trends datasets
6. Financial Times Market Data Portal
The Financial Times market data is a good resource for finding up-to-date information on financial markets from all over the world. You can find stock prices, indexes, commodities, and foreign exchange rates.
6.1 Data Link: Financial times market datasets
Machine Learning Datasets for Public Government
1. Data.gov Portal
This site is the home of the US government’s open data. You can find data on various domains like agriculture, health, climate, education, energy, finance, science and research, etc. Many software applications use the website to collect data and build consumer products.
1.1 Data Link: Data.gov datasets
2. Data Portal: Open government data (India)
The Open Government Data platform gives us access to government-owned shareable data. It is part of the Digital India initiative and is built on an open-source stack. It publishes many datasets, tools, APIs, etc.
2.1 Data Link: Open government datasets
3. Food environment Atlas Data Portal
The platform contains data on US food and how local US food affects the diet of the people. It contains information about the research on food choices and diet quality which will help in determining the accessibility to healthy food choices.
3.1 Data Link: Food environment atlas datasets
4. Health Data Portal
This is a portal of the US Department of Health and Human Services. It has over 3000 valuable datasets available. It also provides an API.
4.1 Data Link: Health datasets
5. Centers for Disease Control and Prevention Data Portal
The CDC has a wide variety of health-related datasets on topics like diabetes, cancer, obesity, etc. It also links to more resources where you can find data on diseases.
5.1 Data Link: CDC statistics datasets
6. London Datastore Portal
This contains data about the life of people in London, for example how much the population has increased in 5 years or the number of tourists visiting London. It has over 700 datasets for getting insights into the city.
6.1 Data Link: London datastore datasets
7. Canada Government Open Data Portal
This is a portal to the data related to Canadians. You can find datasets related to subjects like agriculture, art, music, education, government, health, etc.
7.1 Data Link: Canada government open datasets
In this article, we saw more than 70 machine learning datasets that you can use to practice machine learning or data science. Creating a dataset on your own is expensive, so we can use other people’s datasets to get our work done. But we should read each dataset’s documentation carefully, because some datasets are free to use, while for others you have to give credit to the owner as stated by them.
If you would like to add any other machine learning datasets, do share them in the comment section. Hope this article was resourceful to you.