

{"id":111332,"date":"2023-01-07T09:00:53","date_gmt":"2023-01-07T03:30:53","guid":{"rendered":"https:\/\/data-flair.training\/blogs\/?p=111332"},"modified":"2023-01-07T09:42:17","modified_gmt":"2023-01-07T04:12:17","slug":"pytorch-datasets-and-dataloaders","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/","title":{"rendered":"PyTorch Datasets and Dataloaders"},"content":{"rendered":"<p><span style=\"font-weight: 400\">Datasets are one of the most important parts of any deep learning project. Much of the time spent building a model goes into preparing data. Before feeding the collected data to a model, several operations, such as imputing missing values and encoding text data into numerical form, need to be performed so that the model can learn meaningful patterns.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Processing data can require a lot of code. It is therefore preferable to keep this code separate from the model code for better readability. Fortunately, PyTorch has us covered. It provides two classes, Dataset and DataLoader, that help us use the available data efficiently. A Dataset gives us access to the data, whether pre-loaded or custom-made, and a DataLoader makes it convenient to access this data by wrapping an iterable around the Dataset.<\/span><\/p>\n<h3><span style=\"font-weight: 400\">Creating a custom Dataset in PyTorch<\/span><\/h3>\n<p><span style=\"font-weight: 400\">PyTorch\u2019s Dataset class lets us build our own dataset by inheriting its properties, which makes referring to individual samples easy. We can then use a DataLoader to iterate through the dataset and train our model.<\/span><\/p>\n<h4><span style=\"font-weight: 400\">a. 
Importing the required libraries:<\/span><\/h4>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import torch\r\nfrom torch.utils.data import Dataset, DataLoader\r\n<\/pre>\n<h4><span style=\"font-weight: 400\">b. Creating our own dataset class:<\/span><\/h4>\n<p><span style=\"font-weight: 400\">We will create a class that inherits from PyTorch\u2019s Dataset class, so the custom dataset can be indexed, measured with len(), and passed to a DataLoader like any other dataset.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">class DataFlair_dataset(Dataset):<\/pre>\n<p><span style=\"font-weight: 400\">First, we will define a constructor with default values to build our dataset.<\/span><\/p>\n<p><span style=\"font-weight: 400\">i.\u00a0 __init__<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">    def __init__(self, length=100, transform=None):\r\n        self.len = length\r\n        # Features: a (length, 2) tensor filled with 2s\r\n        self.x = 2 * torch.ones(length, 2)\r\n        # Labels: a (length, 1) tensor filled with 1s\r\n        self.y = torch.ones(length, 1)\r\n        self.transform = transform<\/pre>\n<p><span style=\"font-weight: 400\">Now, we can define a getter method that retrieves the sample at a given index and applies the optional transform.<\/span><\/p>\n<p><span style=\"font-weight: 400\">ii. __getitem__<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">    def __getitem__(self, index):\r\n        sample = self.x[index], self.y[index]\r\n        if self.transform:\r\n            sample = self.transform(sample)\r\n        return sample<\/pre>\n<p><span style=\"font-weight: 400\">iii. __len__<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">    # Get Length\r\n    def __len__(self):\r\n        return self.len<\/pre>\n<p><span style=\"font-weight: 400\">Finally, an instance of our DataFlair_dataset class can be created.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" 
data-enlighter-language=\"generic\">our_dataset=DataFlair_dataset()<\/pre>\n<h4><span style=\"font-weight: 400\">c. Printing our dataset:<\/span><\/h4>\n<p><span style=\"font-weight: 400\">To check that our dataset has been constructed, we will print the first few samples.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">j=0\r\nfor i in our_dataset:\r\n    print(\"x: \",i[0],\"y: \",i[1])\r\n    j+=1\r\n    if j==5:\r\n        break<\/pre>\n<p><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/dataloaders-printing-dataset.webp\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-111376\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/dataloaders-printing-dataset.webp\" alt=\"dataloaders printing dataset\" width=\"1920\" height=\"138\" \/><\/a><\/p>\n<h4><span style=\"font-weight: 400\">d. Preprocessing the dataset using collate_fn:<\/span><\/h4>\n<p>collate_fn is a DataLoader parameter that takes a function used to assemble a list of samples into a batch, which also makes it a convenient place for batch-level preprocessing. To demonstrate it, we will build a function that divides each x by 10, computes each y modulo 5, and stacks the results into batch tensors.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\"> \r\ndef collate_fun(batch):\r\n    # Build new tensors so the samples stored in the dataset are not modified in place\r\n    xs = torch.stack([x \/ 10 for x, _ in batch])\r\n    ys = torch.stack([y % 5 for _, y in batch])\r\n    # Return the whole batch, not just its last sample\r\n    return xs, ys\r\n<\/pre>\n<p><span style=\"font-weight: 400\">Once the preprocessing function is defined, we can load the dataset using a DataLoader and set the collate_fn parameter to the function we have built.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">DLoader = DataLoader(our_dataset, batch_size=2, collate_fn=collate_fun)<\/pre>\n<h4><span style=\"font-weight: 400\">\u00a0<\/span><span style=\"font-weight: 400\">e. 
Printing the processed data using the dataloader<\/span><\/h4>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">i=0\r\nfor data in DLoader:\r\n  print(data)\r\n  i+=1\r\n  if i&gt;5:\r\n    break<\/pre>\n<h3><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/dataloader-preprocessed-data.webp\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-111374\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/dataloader-preprocessed-data.webp\" alt=\"dataloader preprocessed data\" width=\"1920\" height=\"185\" \/><\/a><\/h3>\n<h3><span style=\"font-weight: 400\">Using torchvision\u2019s built-in datasets with DataLoaders<\/span><\/h3>\n<p><span style=\"font-weight: 400\">We will load the MNIST dataset to see how a built-in dataset is loaded. It contains images of handwritten digits, which we can use to train a deep learning model that identifies digits in new images.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">import torch\r\nfrom torch.utils.data import Dataset\r\nfrom torchvision import datasets,transforms\r\nfrom torchvision.transforms import ToTensor\r\nimport matplotlib.pyplot as plt\r\n<\/pre>\n<h4><span style=\"font-weight: 400\">a. 
Wrapping an iterable around our dataset<\/span><\/h4>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">train=datasets.MNIST(\"\", train=True,download=True,transform=transforms.Compose([transforms.ToTensor()]))\r\ntest=datasets.MNIST(\"\", train=False,download=True,transform=transforms.Compose([transforms.ToTensor()]))\r\n<\/pre>\n<p><span style=\"font-weight: 400\">The commands above load the MNIST dataset.<\/span><\/p>\n<p><b>train=True\/False&#8211;<\/b><span style=\"font-weight: 400\"> selects the training or the test split.<\/span><\/p>\n<p><b>download=True&#8211;<\/b><span style=\"font-weight: 400\"> downloads the dataset if it is not already available on disk.<\/span><\/p>\n<p><b>transform=transforms.Compose([transforms.ToTensor()])&#8211;<\/b><span style=\"font-weight: 400\"> converts each image into a tensor, which can then be moved to a GPU if needed.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">trainset = torch.utils.data.DataLoader(train, batch_size=20, shuffle=True)\r\ntestset = torch.utils.data.DataLoader(test, batch_size=20, shuffle=False)\r\n<\/pre>\n<p><span style=\"font-weight: 400\">Here we specify how we are going to iterate over the dataset. The batch size is 20, which means each iteration yields only 20 samples at a time. 
Training on mini-batches keeps memory usage manageable, and shuffling the training batches can help the model generalise.<\/span><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">for data in trainset:\r\n    print(data)\r\n    break  # one batch is enough to inspect the output\r\n<\/pre>\n<p><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/dataloader-loaded-data.webp\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-111375\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/dataloader-loaded-data.webp\" alt=\"dataloader loaded data\" width=\"1920\" height=\"752\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400\">There are many more datasets available in torchvision, such as FashionMNIST, Caltech and Cityscapes. We can perform the same operations on any of them. Without Datasets and DataLoaders, we would need considerably more code to load the data, split it into batches and convert the samples to tensors, making our code more complicated and harder to read.<\/span><\/p>\n<h3><span style=\"font-weight: 400\">Summary<\/span><\/h3>\n<p><span style=\"font-weight: 400\">PyTorch does a great job of helping us build our own datasets and access them efficiently. With DataLoaders on top, we save a lot of coding effort and get batched, shuffled iteration with very little code.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Datasets are the most important part of any deep learning algorithm. Most of the time of a model building process is consumed by data. 
Before feeding the data we have collected to the model,&#46;&#46;&#46;<\/p>\n","protected":false},"author":5,"featured_media":111378,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26498],"tags":[27186],"class_list":["post-111332","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-pytorch-tutorials","tag-pytorch-datasets-and-dataloaders"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>PyTorch Datasets and Dataloaders - DataFlair<\/title>\n<meta name=\"description\" content=\"PyTorch helps us in building our datasets and refer to it efficiently. DataLoaders save our coding efforts. learn more about them.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"PyTorch Datasets and Dataloaders - DataFlair\" \/>\n<meta property=\"og:description\" content=\"PyTorch helps us in building our datasets and refer to it efficiently. DataLoaders save our coding efforts. 
learn more about them.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2023-01-07T03:30:53+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-01-07T04:12:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/pytorch-datasets-and-dataloaders.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"PyTorch Datasets and Dataloaders - DataFlair","description":"PyTorch helps us in building our datasets and refer to it efficiently. DataLoaders save our coding efforts. learn more about them.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/","og_locale":"en_US","og_type":"article","og_title":"PyTorch Datasets and Dataloaders - DataFlair","og_description":"PyTorch helps us in building our datasets and refer to it efficiently. 
DataLoaders save our coding efforts. learn more about them.","og_url":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2023-01-07T03:30:53+00:00","article_modified_time":"2023-01-07T04:12:17+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/pytorch-datasets-and-dataloaders.webp","type":"image\/webp"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/7f83c342f5d1632d6f7b4b0b0f447823"},"headline":"PyTorch Datasets and Dataloaders","datePublished":"2023-01-07T03:30:53+00:00","dateModified":"2023-01-07T04:12:17+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/"},"wordCount":641,"commentCount":0,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/pytorch-datasets-and-dataloaders.webp","keywords":["pytorch datasets and dataloaders"],"articleSection":["PyTorch 
Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/","url":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/","name":"PyTorch Datasets and Dataloaders - DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/pytorch-datasets-and-dataloaders.webp","datePublished":"2023-01-07T03:30:53+00:00","dateModified":"2023-01-07T04:12:17+00:00","description":"PyTorch helps us in building our datasets and refer to it efficiently. DataLoaders save our coding efforts. 
learn more about them.","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/pytorch-datasets-and-dataloaders.webp","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2023\/01\/pytorch-datasets-and-dataloaders.webp","width":1200,"height":628,"caption":"pytorch datasets and dataloaders"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/pytorch-datasets-and-dataloaders\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"PyTorch Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/pytorch-tutorials\/"},{"@type":"ListItem","position":3,"name":"PyTorch Datasets and Dataloaders"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. 
Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/7f83c342f5d1632d6f7b4b0b0f447823","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/4cf3a74600d131330b8c481d519afd1574093ed89f6d3396a95393ad223eb7cd?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/4cf3a74600d131330b8c481d519afd1574093ed89f6d3396a95393ad223eb7cd?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/4cf3a74600d131330b8c481d519afd1574093ed89f6d3396a95393ad223eb7cd?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"DataFlair Team creates expert-level guides on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. 
Our goal is to empower learners with easy-to-understand content. Explore our resources for career growth and practical learning.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam1\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/111332","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=111332"}],"version-history":[{"count":4,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/111332\/revisions"}],"predecessor-version":[{"id":111377,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/111332\/revisions\/111377"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/111378"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=111332"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=111332"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=111332"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}