Machine Learning with Small Datasets


Machine learning is inherently predicated on the collection and processing of data. Usually, a model requires a significant amount of training data in order to learn patterns that it can recognize in future samples. It is also commonly assumed that presenting a model with more data to learn from will make it learn more effectively. However, neither of these assumptions always holds.

Consider a scenario where acquiring and processing a large amount of data is not possible. This could be due to data regulations regarding privacy and safety, or it simply may not be practical to acquire and annotate a large dataset given time and resource constraints. When faced with this problem, data for machine learning cannot be acquired through the usual means, such as web scraping, public datasets, or manual collection.

This doesn’t mean that productive machine learning is not possible in these situations. On the contrary, models can learn very effectively from small datasets. But it is important to understand how to use small data in this case.

When Small is Bad: Overfitting vs Underfitting

The principal objective of any machine learning algorithm is to identify patterns in a given set of data. The model then uses those patterns to predict and identify similar patterns in new, unseen data.

But what happens when a machine learning model has too little data to make accurate generalizations? There are two failure modes that can make its predictions inaccurate: underfitting and overfitting.

Underfitting is when the model fails to capture the dominant trend in the training data. Because it has not learned enough, it produces poor results even on data similar to what it was trained on.

On the other hand, overfitting is when a machine learning model learns the training data too well. It memorizes noise and identifies trends that don't actually exist in the broader data, so it fails to generalize and its output on new data is unreliable.
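In practice, you can spot both failure modes by comparing training and validation accuracy. Below is a minimal sketch of that diagnostic, using scikit-learn's bundled digits dataset and a decision tree purely as illustrative stand-ins for your own data and model.

```python
# Diagnosing underfitting vs overfitting by comparing training and
# validation accuracy. Dataset and model are illustrative only.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in (1, 5, None):  # shallow, moderate, unbounded tree depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Low train accuracy -> underfitting; high train accuracy but much
    # lower validation accuracy -> overfitting.
    print(f"max_depth={depth}: "
          f"train={model.score(X_train, y_train):.2f}, "
          f"val={model.score(X_val, y_val):.2f}")
```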

To help counteract the risks of underfitting and overfitting when data is scarce, an N-shot approach can be taken.

When Less is More: N-shot Learning Explained

N-shot learning is a type of machine learning that works with only a few data samples, organized into what are known as support sets. These support sets differ from the training sets used by ordinary deep neural networks: while a training set must contain many samples for each class of object, a support set contains only a few samples per class.

Based on the size of the support set, i.e. the number N of labelled samples per class, N-shot learning can be classified into several categories: few-shot learning, one-shot learning, and even zero-shot learning.
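To make the terminology concrete, here is a minimal sketch of how an N-shot "episode" is often assembled in practice: a support set with a few labelled samples per class, plus query samples to classify. The toy dataset and the make_episode helper are hypothetical, for illustration only.

```python
# Assembling an N-way, k-shot episode from a labelled dataset.
import random

def make_episode(dataset, n_way, k_shot, n_query):
    """dataset: mapping class_label -> list of samples."""
    classes = random.sample(list(dataset), n_way)
    support, query = [], []
    for label in classes:
        samples = random.sample(dataset[label], k_shot + n_query)
        support += [(s, label) for s in samples[:k_shot]]  # the support set
        query += [(s, label) for s in samples[k_shot:]]    # to be classified
    return support, query

# Toy dataset: 5 classes with 20 samples each.
toy = {c: [f"sample_{c}_{i}" for i in range(20)] for c in "ABCDE"}
support, query = make_episode(toy, n_way=3, k_shot=1, n_query=2)  # one-shot
print(support)
```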

Meta-learning vs Supervised Learning

The concept of meta-learning involves teaching a machine how to learn on its own. Rather than presenting the machine with a set of training data to learn from, as in traditional supervised learning, meta-learning presents the model with a previously unseen query sample belonging to an unknown class. The model must then learn how to learn from this data. In this way, N-shot learning is a kind of meta-learning.

For example, say you bring a small child to the farmer’s market. Amazed by the colorful displays of fresh fruits and vegetables, the child’s curiosity is piqued by some that he has never seen before. He wants to learn to identify the unknown produce in the market, and he is smart enough to learn them himself.

However, being a child, he does need some help. So you give the child a deck of flashcards. Each card has a picture of a fruit or vegetable along with its name. Even though the child has never seen either the produce in the stands or the produce on the cards, he is able to name each real vegetable by matching it to the most similar-looking card. The cards are helping the child perform meta-learning.

In this example, the cards represent the support set. The unknown vegetables in the farmer’s market collectively make up the query sample. Meta-learning uses support sets like this in order to identify query samples. In this case, if there is only one card per fruit or vegetable then this is an example of one-shot learning.
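The flashcard analogy maps closely onto metric-based one-shot learning: embed each support sample (each "card") and the query, then assign the query the label of the most similar card. The sketch below assumes that approach; the embed function is a hypothetical stand-in for a pretrained feature extractor.

```python
# One-shot classification by nearest neighbour in embedding space.
import numpy as np

def embed(sample):
    # Hypothetical stand-in: a real system would use a pretrained network.
    rng = np.random.default_rng(sum(map(ord, sample)))
    return rng.normal(size=64)

def one_shot_classify(query, support):
    """support: list of (sample, label) pairs, one 'card' per class."""
    q = embed(query)
    def sim(sample):
        e = embed(sample)
        # Cosine similarity between query and support embeddings.
        return float(q @ e) / (np.linalg.norm(q) * np.linalg.norm(e))
    return max(support, key=lambda pair: sim(pair[0]))[1]

support = [("card_apple", "apple"), ("card_beet", "beet")]
print(one_shot_classify("unknown_vegetable", support))
```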

But what about the case of zero-shot learning? In these applications, a program is able to correctly identify the query sample with no support sets. How is that possible? Consider another example, where you try to help the same child identify Mars in the night sky.

You tell the child that he will see a bright, reddish object in the night sky if he looks toward the moon. By doing so, the child is able to spot Mars. Even though he has never seen a picture of it before, he is able to correctly identify the query sample from a description alone.
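One common way to realize this in software is attribute-based zero-shot learning, which mirrors the Mars example: classes are described by attributes rather than by example images, and the query is matched to the closest description. The attribute values below are invented purely for illustration.

```python
# Zero-shot classification by matching observed attributes to
# class descriptions; no labelled examples of either class are used.
import numpy as np

# Class descriptions over attributes: [bright, reddish, near_the_moon]
descriptions = {
    "Mars":  np.array([1.0, 1.0, 1.0]),
    "Venus": np.array([1.0, 0.0, 0.0]),
}

# Observed attributes of the unknown object in the sky.
query = np.array([1.0, 0.9, 1.0])

def closest_class(query, descriptions):
    return min(descriptions,
               key=lambda c: np.linalg.norm(query - descriptions[c]))

print(closest_class(query, descriptions))  # -> "Mars"
```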

How Much (Less) Data is Enough?

It’s great to know that you can use less data for machine learning applications, but you still need to know how much is necessary for your own specific use case. Therefore, every machine learning project should include its own Proof of Concept (PoC) stage to test whether enough data is being used to achieve accurate results.

You may be surprised at how few samples are needed. For example, research indicates that for an image classification task only four samples per class are needed to reach 70% accuracy.

This experiment was done using images of coins; your own application may require more or less data in its support sets. This is why it is important to run your own iterative experiments: form a hypothesis about how much data is enough, test it, and if the hypothesis fails, go back and start a new cycle with a new hypothesis.
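A simple way to run such an experiment is to train on k samples per class for increasing k and watch how validation accuracy grows. The sketch below uses scikit-learn's digits dataset and logistic regression as illustrative stand-ins for your own data and model.

```python
# PoC-style learning-curve experiment: accuracy vs samples per class.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

for k in (1, 2, 4, 8, 16):  # samples per class in the training subset
    idx = np.concatenate(
        [np.flatnonzero(y_train == c)[:k] for c in np.unique(y_train)])
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    print(f"{k} samples/class -> "
          f"validation accuracy {model.score(X_val, y_val):.2f}")
```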

Small Data Machine Learning: Benefits and Use Cases

N-shot learning can be used in computer vision, natural language processing, healthcare, and IoT applications.

The main benefit of N-shot learning for these applications is that it reduces the resources necessary to perform accurate machine learning. Smaller datasets require less time to collect and annotate. N-shot learning also allows datasets to be more easily reused for similar applications.

Logistics and Warehouse Applications

Object detection is an important task powered by machine learning. It can be applied to warehouses, where inventory, security, and inspection can all be automated using AI. Object detection is a prerequisite for item recognition: if you cannot detect an object of a certain type, you cannot identify the specific object either. Accomplishing this with small datasets can help companies introduce automation efficiently.

Manufacturing and supply chain operations also benefit greatly from machine learning automation. Inventory, sorting, and assembly can all be improved with recognition. Human error in counting and tracking items can be reduced while the speed at which items are checked improves. This also benefits quality assurance, where defects can be quickly identified and removed from production lines.

Linguistic Applications

Language modelling is one application that usually requires very large datasets, and reusing pre-trained models for NLP tasks can be a solution. For example, the GPT-2 model was trained on 40 GB of text; given a prefix, it can generate the next word, phrase, or sentence.
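As a sketch of that approach, the Hugging Face transformers library (assuming it is installed, e.g. via pip install transformers) exposes pretrained GPT-2 through a one-line pipeline:

```python
# Reusing a pretrained language model instead of training from scratch.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prefix = "Machine learning with small datasets"
# Continue the prefix with up to 20 newly generated tokens.
print(generator(prefix, max_new_tokens=20)[0]["generated_text"])
```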

Translation tasks normally require a model to be trained on a set of samples for each language you wish to translate into. This can become expensive in terms of time and resources, especially if you are trying to reach a broad audience that speaks many different languages. Fortunately, a study has shown that translation can be accomplished with zero-shot learning: by training a model on one language (English), researchers were able to successfully translate text into 93 different languages.

Other research has identified linguistic use cases for few-shot learning. One study used a few-shot learning approach to perform sentiment analysis of Twitter posts. This is an impressive accomplishment because these posts are short and often use informal structure and syntax.

Finally, both one-shot and few-shot learning have been used to identify Chinese characters. These approaches succeeded at identifying both printed and handwritten characters.

Medical Applications

The advancements made in applying machine learning to medical applications are incredible. For such an important industry, it is vital that any information used is accurate. For this reason, medical machine learning systems must operate at the highest levels of accuracy and reliability, because lives can depend on their output.

Unfortunately, deep neural networks require very large sets of data in order to learn to identify patterns correctly. This data must all be annotated, which is not only costly but sometimes impossible without input from experts and strict attention to health data privacy regulations.

However, an experiment on abdominal CT/MRI images applied few-shot learning to this domain. The researchers successfully trained a model to generalize from a small support set. This is a promising achievement for the continuing advancement of medical AI technology.

Wrap up

As you can see, the answer to the question “How much data is enough?” depends on a number of factors, such as the diversity of production data, the availability of open-source datasets, and the expected performance of the system. When we have a limited dataset, the N-shot learning approach may be useful for achieving acceptable levels of accuracy.
