

{"id":2183,"date":"2017-04-22T06:35:26","date_gmt":"2017-04-22T06:35:26","guid":{"rendered":"http:\/\/data-flair.training\/blogs\/?p=2183"},"modified":"2018-11-14T16:38:35","modified_gmt":"2018-11-14T11:08:35","slug":"data-locality-in-hadoop-mapreduce","status":"publish","type":"post","link":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/","title":{"rendered":"Data locality in Hadoop: The Most Comprehensive Guide"},"content":{"rendered":"<h2>1. Data Locality in Hadoop &#8211; Objective<\/h2>\n<p>In <a href=\"http:\/\/data-flair.training\/blogs\/hadoop-tutorial-for-beginners\/\"><strong>Hadoop<\/strong><\/a>, data locality is the process of moving the computation close to the node where the actual data resides, instead of moving large data to the computation. This minimizes network congestion and increases the overall throughput of the system. We will discuss this feature of Hadoop in detail in this tutorial. We will learn <strong>what<\/strong> <strong>data locality in Hadoop<\/strong> is, how Hadoop exploits data locality, why Hadoop needs it, the various types of data locality in Hadoop <a href=\"http:\/\/data-flair.training\/blogs\/hadoop-mapreduce-tutorial\/\"><strong>MapReduce<\/strong><\/a>, data locality optimization in Hadoop, and the advantages of Hadoop data locality.<\/p>\n<div id=\"attachment_42042\" style=\"width: 1210px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-42042\" class=\"size-full wp-image-42042\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01.jpg\" alt=\"Data locality in Hadoop\" width=\"1200\" height=\"628\" 
srcset=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01.jpg 1200w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01-150x79.jpg 150w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01-300x157.jpg 300w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01-768x402.jpg 768w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01-1024x536.jpg 1024w, https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01-520x272.jpg 520w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/a><p id=\"caption-attachment-42042\" class=\"wp-caption-text\">Data locality in Hadoop: The Most Comprehensive Guide<\/p><\/div>\n<div class=\"mceTemp\"><\/div>\n<h2>2. The Concept of Data locality in Hadoop<\/h2>\n<p>Let us understand the <strong>Data Locality concept<\/strong> and what data locality means in MapReduce.<br \/>\nA major <a href=\"http:\/\/data-flair.training\/blogs\/limitations-of-hadoop\/\"><strong>drawback of Hadoop<\/strong><\/a> was cross-switch network traffic due to the huge volume of data. To overcome this drawback,<strong> Data Locality<\/strong> in Hadoop came into the picture. Data locality in MapReduce refers to the ability to move the computation close to the node where the actual data resides, instead of moving large data to the computation. This minimizes network congestion and increases the overall throughput of the system.<br \/>\nIn Hadoop, datasets are stored in <strong><a href=\"http:\/\/data-flair.training\/blogs\/comprehensive-hdfs-guide-introduction-architecture-data-read-write-tutorial\/\">HDFS<\/a><\/strong>. 
Datasets are divided into blocks and stored across the datanodes in the <a href=\"http:\/\/data-flair.training\/blogs\/install-hadoop-2-x-ubuntu-hadoop-multi-node-cluster\/\"><strong>Hadoop <\/strong><strong>cluster<\/strong><\/a><a href=\"http:\/\/data-flair.training\/blogs\/install-hadoop-1-x-on-multi-node-cluster\/\">.<\/a> When a user runs a <a href=\"http:\/\/data-flair.training\/blogs\/how-hadoop-mapreduce-works\/\"><strong>MapReduce job<\/strong><\/a>, the job scheduler uses the block locations reported by the NameNode to send the MapReduce code to the datanodes that hold the input data for that job.<\/p>\n<div id=\"attachment_2189\" style=\"width: 605px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/data-locality-hadoop-mapreduce.gif\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-2189\" class=\"wp-image-2189 size-full\" src=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/data-locality-hadoop-mapreduce.gif\" alt=\"data locality in hadoop mapreduce\" width=\"595\" height=\"313\" \/><\/a><p id=\"caption-attachment-2189\" class=\"wp-caption-text\">Data Locality in Hadoop &#8211; MapReduce<\/p><\/div>\n<h2>3. Requirements for Data locality in MapReduce<\/h2>\n<p>To get the full benefit of data locality, the system architecture needs to satisfy the following conditions:<\/p>\n<ul>\n<li>First, the cluster should have an appropriate topology, and the Hadoop code must be able to read data locality information.<\/li>\n<li>Second, Hadoop must be aware of the topology of the nodes where tasks are executed, and it must know where the data is located.<\/li>\n<\/ul>\n<h2>4. Categories of Data Locality in Hadoop<\/h2>\n<p>Data locality in Hadoop falls into the following categories:<\/p>\n<h3>i. 
Data local data locality in Hadoop<\/h3>\n<p>When the data is located on the same node as the mapper working on it, this is known as data local data locality. In this case, the data is as close to the computation as it can be. This is the most preferred scenario.<\/p>\n<h3>ii. Intra-Rack data locality in Hadoop<\/h3>\n<p>It is not always possible to execute the <strong><a href=\"http:\/\/data-flair.training\/blogs\/mapper-in-hadoop-mapreduce\/\">mapper<\/a><\/strong> on the same datanode due to resource constraints. In such a case, it is preferred to run the mapper on a different node in the same rack.<\/p>\n<h3>iii. Inter-Rack data locality in Hadoop<\/h3>\n<p>Sometimes it is not possible to execute the mapper even on a different node in the same rack due to resource constraints. In such a case, the mapper is executed on a node in a different rack. This is the least preferred scenario.<\/p>\n<h2>5. Hadoop Data Locality Optimization<\/h2>\n<p>Data locality is the main advantage of Hadoop MapReduce, as map code is executed on the same data node where the data resides. However, this is not always achieved in practice, for reasons such as <strong><a href=\"http:\/\/data-flair.training\/blogs\/speculative-execution-in-hadoop-mapreduce\/\">speculative execution in Hadoop<\/a><\/strong>, heterogeneous clusters, data distribution and placement, and data layout and the input splitter.<br \/>\nThese challenges become more prevalent in large clusters, because the more data nodes and data there are, the lower the locality. In larger clusters, some nodes are newer and faster than others, which puts the data-to-compute ratio out of balance; thus, large clusters tend not to be completely homogeneous. In speculative execution, the duplicate task uses computing power even though the data might not be local to it. The root cause also lies in the data layout\/placement and the input splitter used. 
Non-local data processing puts a strain on the network, which limits scalability. Thus the network becomes the bottleneck.<br \/>\nWe can improve data locality by first detecting which jobs have a data locality problem or degrade over time. Solving the problem is more complex and involves changing the data placement and data layout, using a different scheduler, or simply changing the number of mapper and<a href=\"http:\/\/data-flair.training\/blogs\/hadoop-reducer-in-mapreduce\/\"><strong> reducer<\/strong> <\/a>slots for a job. Then we have to verify whether a new execution of the same workload has a better data locality ratio.<\/p>\n<h2>6. Advantages of Hadoop Data locality<\/h2>\n<p>There are two benefits of data locality in MapReduce. Let&#8217;s discuss them one by one:<\/p>\n<h3>i. Faster Execution<\/h3>\n<p>With data locality, the program is moved to the node where the data resides instead of moving large data to the node, which makes Hadoop faster. Because the program is typically far smaller than the data, moving the data would make network transfer the bottleneck.<\/p>\n<h3>ii. High Throughput<\/h3>\n<p>Data locality increases the overall throughput of the system.<\/p>\n<h2>7. Data Locality in Hadoop &#8211; Conclusion<\/h2>\n<p>In conclusion, data locality improves the overall execution of the system and makes Hadoop faster. It also reduces network congestion. We hope this blog helped you understand this core concept of Hadoop. If you have any questions about Hadoop data locality, feel free to share them with us. 
Hope we will help you.<br \/>\n<strong>See Also-<\/strong><\/p>\n<ul>\n<li><a href=\"http:\/\/data-flair.training\/blogs\/how-hadoop-mapreduce-works\/\">How Hadoop Mapreduce Works?<\/a><\/li>\n<li><a href=\"http:\/\/data-flair.training\/blogs\/mapreduce-interview-questions\/\">Top 50 MapReduce Interview Questions and Answers<\/a><\/li>\n<\/ul>\n<p>In case of any queries or feedback feel free to drop a comment in the comment section below and we will get back to you.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. Data Locality in Hadoop &#8211; Objective In Hadoop, Data locality is the process of moving the computation close to where the actual data resides on the node, instead of moving large data to&#46;&#46;&#46;<\/p>\n","protected":false},"author":6,"featured_media":42042,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[37],"tags":[3326,3327,3328,3329,5238],"class_list":["post-2183","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-mapreduce","tag-data-locality-in-hadoop","tag-data-locality-in-hadoop-mapreduce","tag-data-locality-in-mapreduce","tag-data-locality-optimization","tag-hadoop-data-locality"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Data locality in Hadoop: The Most Comprehensive Guide - DataFlair<\/title>\n<meta name=\"description\" content=\"What is Hadoop Data Locality definition,MapReduce data locality optimization,type of data locality in MapReduce hadoop,advantages of data locality in Hadoop\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" 
content=\"article\" \/>\n<meta property=\"og:title\" content=\"Data locality in Hadoop: The Most Comprehensive Guide - DataFlair\" \/>\n<meta property=\"og:description\" content=\"What is Hadoop Data Locality definition,MapReduce data locality optimization,type of data locality in MapReduce hadoop,advantages of data locality in Hadoop\" \/>\n<meta property=\"og:url\" content=\"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/\" \/>\n<meta property=\"og:site_name\" content=\"DataFlair\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/DataFlairWS\/\" \/>\n<meta property=\"article:published_time\" content=\"2017-04-22T06:35:26+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-11-14T11:08:35+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"628\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"DataFlair Team\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:site\" content=\"@DataFlairWS\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"DataFlair Team\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Data locality in Hadoop: The Most Comprehensive Guide - DataFlair","description":"What is Hadoop Data Locality definition,MapReduce data locality optimization,type of data locality in MapReduce hadoop,advantages of data locality in Hadoop","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/","og_locale":"en_US","og_type":"article","og_title":"Data locality in Hadoop: The Most Comprehensive Guide - DataFlair","og_description":"What is Hadoop Data Locality definition,MapReduce data locality optimization,type of data locality in MapReduce hadoop,advantages of data locality in Hadoop","og_url":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/","og_site_name":"DataFlair","article_publisher":"https:\/\/www.facebook.com\/DataFlairWS\/","article_published_time":"2017-04-22T06:35:26+00:00","article_modified_time":"2018-11-14T11:08:35+00:00","og_image":[{"width":1200,"height":628,"url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01.jpg","type":"image\/jpeg"}],"author":"DataFlair Team","twitter_card":"summary_large_image","twitter_creator":"@DataFlairWS","twitter_site":"@DataFlairWS","twitter_misc":{"Written by":"DataFlair Team","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/#article","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/"},"author":{"name":"DataFlair Team","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89"},"headline":"Data locality in Hadoop: The Most Comprehensive Guide","datePublished":"2017-04-22T06:35:26+00:00","dateModified":"2018-11-14T11:08:35+00:00","mainEntityOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/"},"wordCount":934,"commentCount":3,"publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01.jpg","keywords":["Data Locality in hadoop","data locality in Hadoop MapReduce","data locality in MapReduce","data locality optimization","Hadoop data locality"],"articleSection":["MapReduce Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/","url":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/","name":"Data locality in Hadoop: The Most Comprehensive Guide - 
DataFlair","isPartOf":{"@id":"https:\/\/data-flair.training\/blogs\/#website"},"primaryImageOfPage":{"@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/#primaryimage"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/#primaryimage"},"thumbnailUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01.jpg","datePublished":"2017-04-22T06:35:26+00:00","dateModified":"2018-11-14T11:08:35+00:00","description":"What is Hadoop Data Locality definition,MapReduce data locality optimization,type of data locality in MapReduce hadoop,advantages of data locality in Hadoop","breadcrumb":{"@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/#primaryimage","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01.jpg","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2017\/04\/Data-Locality-in-Hadoop-MapReduce-01.jpg","width":1200,"height":628,"caption":"Data locality in Hadoop: The Most Comprehensive Guide"},{"@type":"BreadcrumbList","@id":"https:\/\/data-flair.training\/blogs\/data-locality-in-hadoop-mapreduce\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Blog Home","item":"https:\/\/data-flair.training\/blogs\/"},{"@type":"ListItem","position":2,"name":"MapReduce Tutorials","item":"https:\/\/data-flair.training\/blogs\/category\/mapreduce\/"},{"@type":"ListItem","position":3,"name":"Data locality in Hadoop: The Most Comprehensive 
Guide"}]},{"@type":"WebSite","@id":"https:\/\/data-flair.training\/blogs\/#website","url":"https:\/\/data-flair.training\/blogs\/","name":"DataFlair","description":"Learn Today. Lead Tomorrow.","publisher":{"@id":"https:\/\/data-flair.training\/blogs\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/data-flair.training\/blogs\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/data-flair.training\/blogs\/#organization","name":"DataFlair","url":"https:\/\/data-flair.training\/blogs\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/","url":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","contentUrl":"https:\/\/data-flair.training\/blogs\/wp-content\/uploads\/sites\/2\/2016\/07\/Data-Flair.png","width":106,"height":48,"caption":"DataFlair"},"image":{"@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/DataFlairWS\/","https:\/\/x.com\/DataFlairWS","https:\/\/www.linkedin.com\/company\/dataflair-web-services-pvt-ltd\/","https:\/\/www.youtube.com\/user\/DataFlairWS"]},{"@type":"Person","@id":"https:\/\/data-flair.training\/blogs\/#\/schema\/person\/2c58ecb4f73a39f0ef993f1ddfcd7b89","name":"DataFlair Team","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1ce4a0e3e542444fc73bbebf83e89e8b73e2d95ccb1fcee64da9945f078b97c5?s=96&d=mm&r=g","caption":"DataFlair Team"},"description":"The 
DataFlair Team provides industry-driven content on programming, Java, Python, C++, DSA, AI, ML, data Science, Android, Flutter, MERN, Web Development, and technology. Our expert educators focus on delivering value-packed, easy-to-follow resources for tech enthusiasts and professionals.","url":"https:\/\/data-flair.training\/blogs\/author\/dfteam2\/"}]}},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2183","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/comments?post=2183"}],"version-history":[{"count":5,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2183\/revisions"}],"predecessor-version":[{"id":42043,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/posts\/2183\/revisions\/42043"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media\/42042"}],"wp:attachment":[{"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/media?parent=2183"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/categories?post=2183"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/data-flair.training\/blogs\/wp-json\/wp\/v2\/tags?post=2183"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}