Project in R – Uber Data Analysis Project

Welcome to part 2 of the R and Data Science Projects series designed by DataFlair. In this series of R projects, we try to put concepts from machine learning, AI and data science into practice.

We recommend following every step in the project so that the techniques stick. In today’s R project, we will analyze the Uber Pickups in New York City dataset. This is primarily a data visualization project: it shows how to use the ggplot2 library to explore the data and to build an intuition about the customers who take the trips. Before starting, it helps to revise basic data visualization concepts.

R Data Science Project – Uber Data Analysis

Data storytelling is an important component of machine learning work, because it lets companies understand the background of their operations. Visualization helps companies make sense of complex data and gain insights that inform their decisions. In this project, you will apply ggplot2 to the Uber Pickups dataset and practice data visualization in R.

We will uncover the pattern of Uber pickups in New York City at different time scales. This helps in analyzing not only the general trend in the number of rides but also the shape of the curve, for example how many rides were taken on a particular day of the week or at a particular hour of the day.

Such patterns reveal periods of increased demand for Uber, how demand changes over the day or the month, and which of the bases is the most popular. These insights are useful for improving operations, marketing strategy and customer relations.

You can download the dataset utilized in this project here – Uber Dataset

1. Importing the Essential Packages

In the first step of our R project, we import the packages that we will use in this Uber data analysis project. The most important ones are –

ggplot2

This is the backbone of this project. ggplot2 is the most popular data visualization library in R and is widely used for creating polished plots.

ggthemes

This is an add-on to the main ggplot2 library. It provides extra themes and scales for use with ggplot2; we use its theme_map() in the final section.

lubridate

Our dataset involves various time-frames. In order to understand our data in separate time categories, we will make use of the lubridate package.

dplyr

This package is the lingua franca of data manipulation in R.

tidyr

This package helps you tidy your data. The basic principle of tidy data is that each variable forms a column, each observation forms a row and each value sits in its own cell.

DT

With the help of this package, we can interface with the JavaScript library DataTables and show data frames as interactive tables.

scales

This package maps data to well-formatted axes and legends. We use its comma formatter to label the y-axis of our plots.

library(ggplot2)
library(ggthemes)
library(lubridate)
library(dplyr)
library(tidyr)
library(DT)
library(scales)
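If one of these library() calls fails or is skipped, later steps stop with errors such as could not find function "ymd_hms". A minimal sketch for checking up front which packages are missing, using only base R; the helper name find_missing_packages is hypothetical and not part of the original project:

```r
# Packages loaded by this project.
required_packages <- c("ggplot2", "ggthemes", "lubridate", "dplyr",
                       "tidyr", "DT", "scales")

# Hypothetical helper: return the names in `pkgs` that are not installed.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1),
                      quietly = TRUE, USE.NAMES = FALSE)
  pkgs[!installed]
}

missing_packages <- find_missing_packages(required_packages)
if (length(missing_packages) > 0) {
  message("Not installed: ", paste(missing_packages, collapse = ", "))
  # install.packages(missing_packages)  # uncomment to install from CRAN
}
```

The install line is left commented out so that running the check never changes your library without you deciding to.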

2. Creating vector of colors to be implemented in our plots

In this step, we create a vector of colors to use in our plotting functions. You can also choose your own set of colors.

Code:

colors <- c("#CC1011", "#665555", "#05a399", "#cfcaca", "#f5e840", "#0683c9", "#e075b0")
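The vector holds seven colors, and that number matters later: scale_fill_manual() needs at least as many values as the fill variable has levels (six months, seven days of the week), otherwise ggplot2 stops with an "Insufficient values in manual scale" error. A quick sanity check, sketched in base R; the variable n_levels_needed is introduced here for illustration:

```r
colors <- c("#CC1011", "#665555", "#05a399", "#cfcaca",
            "#f5e840", "#0683c9", "#e075b0")

# Every entry should be a 6-digit hex color.
is_hex <- grepl("^#[0-9A-Fa-f]{6}$", colors)
stopifnot(all(is_hex))

# The fill variable with the most levels in this project is day of week.
n_levels_needed <- 7
stopifnot(length(colors) >= n_levels_needed)
```

If you swap in your own palette, keep at least seven entries so the day-of-week plots still work.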

3. Reading the Data into their designated variables

Now we read the six CSV files that contain the data from April 2014 to September 2014. We store them in corresponding data frames such as apr_data and may_data, and then combine them into a single data frame called data_2014.

This project relies heavily on data frames, so it helps to be comfortable with data frames in R.

Next, we format the Date.Time column and create factors for time components such as day, month and year.

Code:

apr_data <- read.csv("uber-raw-data-apr14.csv")
may_data <- read.csv("uber-raw-data-may14.csv")
jun_data <- read.csv("uber-raw-data-jun14.csv")
jul_data <- read.csv("uber-raw-data-jul14.csv")
aug_data <- read.csv("uber-raw-data-aug14.csv")
sep_data <- read.csv("uber-raw-data-sep14.csv")

data_2014 <- rbind(apr_data,may_data, jun_data, jul_data, aug_data, sep_data)

data_2014$Date.Time <- as.POSIXct(data_2014$Date.Time, format = "%m/%d/%Y %H:%M:%S")

data_2014$Time <- format(as.POSIXct(data_2014$Date.Time, format = "%m/%d/%Y %H:%M:%S"), format="%H:%M:%S")

data_2014$Date.Time <- ymd_hms(data_2014$Date.Time)

data_2014$day <- factor(day(data_2014$Date.Time))
data_2014$month <- factor(month(data_2014$Date.Time, label = TRUE))
data_2014$year <- factor(year(data_2014$Date.Time))
data_2014$dayofweek <- factor(wday(data_2014$Date.Time, label = TRUE))
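The six read.csv() calls can also be written as a loop over the file names. The sketch below is one way to do that in base R, not the original author's method. So that it runs without the dataset, it first writes two tiny stand-in files to a temporary folder; the rows in them are made up, and the header is assumed to match the real files.

```r
# Create two tiny stand-in files in a temporary folder (made-up rows).
data_dir <- file.path(tempdir(), "uber_demo")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c('"Date/Time","Lat","Lon","Base"',
             '"4/1/2014 0:11:00",40.769,-73.9549,"B02512"'),
           file.path(data_dir, "uber-raw-data-apr14.csv"))
writeLines(c('"Date/Time","Lat","Lon","Base"',
             '"5/1/2014 0:02:00",40.7521,-73.9914,"B02512"'),
           file.path(data_dir, "uber-raw-data-may14.csv"))

# Find every monthly file, read each one, and stack the results.
files <- list.files(data_dir, pattern = "^uber-raw-data-.*14\\.csv$",
                    full.names = TRUE)
monthly <- lapply(files, read.csv)
demo_2014 <- do.call(rbind, monthly)

names(demo_2014)  # read.csv turns "Date/Time" into "Date.Time"
nrow(demo_2014)
```

With the real files, list.files() returns them in alphabetical rather than calendar order. That changes the row order of the combined data frame but not any of the counts computed later.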

Code:

data_2014$hour <- factor(hour(hms(data_2014$Time)))
data_2014$minute <- factor(minute(hms(data_2014$Time)))
data_2014$second <- factor(second(hms(data_2014$Time)))
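To see what these columns contain, the same idea can be tried on a few hand-written timestamps. This is a sketch under two assumptions: the raw strings use a month/day/year layout, and the time zone is fixed to UTC to keep the example independent of local settings. It parses with lubridate's mdy_hms() in one step, which is a substitute for the as.POSIXct() and ymd_hms() pair used above, not the same code.

```r
library(lubridate)

# Three made-up timestamps in the assumed month/day/year layout.
raw <- c("4/1/2014 0:11:00", "4/1/2014 17:45:30", "9/30/2014 22:59:00")

parsed <- mdy_hms(raw, tz = "UTC")

demo <- data.frame(
  Date.Time = parsed,
  day       = factor(day(parsed)),
  month     = factor(month(parsed, label = TRUE)),
  dayofweek = factor(wday(parsed, label = TRUE)),
  hour      = factor(hour(parsed))
)
demo

# Any NA here means a string did not match the expected layout.
sum(is.na(parsed))
```

Checking for NA values right after parsing is a cheap way to catch "Some strings failed to parse" problems before they reach the plots.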

Plotting the trips by the hours in a day

In this step, we use ggplot() to plot the number of trips made at each hour of the day, using dplyr to aggregate the data first. The resulting visualizations show how the number of trips varies through the day: it is highest in the evening, around 5:00 and 6:00 PM.

Code:

hour_data <- data_2014 %>%
           group_by(hour) %>%
               dplyr::summarize(Total = n()) 
datatable(hour_data)
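group_by() followed by summarize(Total = n()) counts the rows in each group. A toy sketch with invented pickups shows the shape of the result. The explicit dplyr:: prefix is presumably there to guard against another loaded package masking summarize() (plyr, for example, exports a function with the same name that ignores the grouping); that reading is an assumption, not something the project states.

```r
library(dplyr)

# Made-up pickups: one row per trip, with the hour it started.
trips <- data.frame(hour = factor(c(17, 17, 18, 8, 17, 18)))

# One row per hour, with the number of trips in that hour.
hour_counts <- trips %>%
  group_by(hour) %>%
  dplyr::summarize(Total = n())
hour_counts

# count() is a shorthand that gives the same totals.
hour_counts_short <- trips %>% count(hour, name = "Total")
```

If the summary ever comes back as a single row holding the overall total, a masked summarize() is one thing worth checking.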

Code:

ggplot(hour_data, aes(hour, Total)) + 
        geom_bar( stat = "identity", fill = "steelblue", color = "red") +
           ggtitle("Trips Every Hour") +
            theme(legend.position = "none") +
            scale_y_continuous(labels = comma)

month_hour <- data_2014 %>%
          group_by(month, hour) %>%
             dplyr::summarize(Total = n())

ggplot(month_hour, aes(hour, Total, fill = month)) + 
       geom_bar( stat = "identity") +
          ggtitle("Trips by Hour and Month") +
           scale_y_continuous(labels = comma)

Plotting data by trips during every day of the month

In this section, we plot the data by day of the month. The resulting visualization shows that the 30th had the most trips over the period, driven largely by April.

Code:

day_group <- data_2014 %>%
          group_by(day) %>%
             dplyr::summarize(Total = n()) 
datatable(day_group)

Code:

ggplot(day_group, aes(day, Total)) + 
        geom_bar( stat = "identity", fill = "steelblue") +
           ggtitle("Trips Every Day") +
            theme(legend.position = "none") +
            scale_y_continuous(labels = comma)
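When reading this chart, keep in mind that the day numbers do not all occur equally often. Between April and September 2014, days 1 to 30 each occur six times, but day 31 occurs only three times (May, July and August), so its total is bound to be lower. A base-R sketch that counts the occurrences, which could be used to turn the totals into per-day averages:

```r
# Every calendar date covered by the dataset (April to September 2014).
dates <- seq(as.Date("2014-04-01"), as.Date("2014-09-30"), by = "day")

# How many times does each day-of-month number occur?
day_number <- as.integer(format(dates, "%d"))
occurrences <- table(day_number)
occurrences

sum(day_number == 30)  # six occurrences, one per month
sum(day_number == 31)  # three occurrences: May, July, August
```

Dividing each day's Total by its number of occurrences would give average trips per calendar day, which makes day 31 comparable with the rest.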

Code:

day_month_group <- data_2014 %>%
         group_by(month, day) %>%
             dplyr::summarize(Total = n())

ggplot(day_month_group, aes(day, Total, fill = month)) + 
        geom_bar( stat = "identity") +
           ggtitle("Trips by Day and Month") +
            scale_y_continuous(labels = comma) +
            scale_fill_manual(values = colors)

Number of Trips taking place during months in a year

In this section, we visualize the number of trips in each month. The output shows that most trips were made in September. We also plot the number of trips on each day of the week, split by month.

Code:

month_group <- data_2014 %>%
          group_by(month) %>%
             dplyr::summarize(Total = n()) 
datatable(month_group)

Code:

ggplot(month_group, aes(month, Total, fill = month)) + 
        geom_bar( stat = "identity") +
           ggtitle("Trips by Month") +
            theme(legend.position = "none") +
            scale_y_continuous(labels = comma) +
            scale_fill_manual(values = colors)

Code:

month_weekday <- data_2014 %>%
         group_by(month, dayofweek) %>%
             dplyr::summarize(Total = n())

ggplot(month_weekday, aes(month, Total, fill = dayofweek)) + 
       geom_bar( stat = "identity", position = "dodge") +
          ggtitle("Trips by Day and Month") +
           scale_y_continuous(labels = comma) +
           scale_fill_manual(values = colors)

Finding out the number of Trips by bases

In the following visualizations, we plot the number of trips taken from each of the bases. There are five bases in all, of which B02617 had the highest number of trips. The second chart breaks the counts down by month, and the third by day of the week: Thursday saw the most trips at three of the bases, B02598, B02617 and B02682.

Code:

ggplot(data_2014, aes(Base)) + 
 geom_bar(fill = "darkred") +
 scale_y_continuous(labels = comma) +
 ggtitle("Trips by Bases")

Code:

ggplot(data_2014, aes(Base, fill = month)) + 
 geom_bar(position = "dodge") +
 scale_y_continuous(labels = comma) +
 ggtitle("Trips by Bases and Month") +
 scale_fill_manual(values = colors)

Code:

ggplot(data_2014, aes(Base, fill = dayofweek)) + 
 geom_bar(position = "dodge") +
 scale_y_continuous(labels = comma) +
 ggtitle("Trips by Bases and DayofWeek") +
 scale_fill_manual(values = colors)
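Unlike the earlier plots, these calls pass the raw data to geom_bar() without stat = "identity", so ggplot2 counts the rows for each base itself. To read off exact numbers rather than bar heights, the same counts can be computed explicitly. A base-R sketch on invented rows; only the base codes come from the text above:

```r
# Made-up trips; the base codes are the ones named in the text.
trips <- data.frame(Base = c("B02617", "B02598", "B02617",
                             "B02682", "B02617", "B02598"))

# Count trips per base and sort from busiest to quietest.
base_counts <- sort(table(trips$Base), decreasing = TRUE)
base_counts

# The first name is the busiest base in this toy data.
busiest_base <- names(base_counts)[1]
busiest_base
```

On the real data, replacing trips with data_2014 would give the per-base totals behind the first chart.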

Creating a Heatmap visualization of day, hour and month

In this section, we will learn how to plot heatmaps using ggplot(). We will plot five heatmap plots –

First, we will plot Heatmap by Hour and Day.
Second, we will plot Heatmap by Month and Day.
Third, a Heatmap by Month and Day of the Week.
Fourth, a Heatmap that delineates Month and Bases.
Finally, we will plot the heatmap, by bases and day of the week.

Code:

day_and_hour <- data_2014 %>%
         group_by(day, hour) %>%
            dplyr::summarize(Total = n())

datatable(day_and_hour)

Code:

ggplot(day_and_hour, aes(day, hour, fill = Total)) +
            geom_tile(color = "white") +
              ggtitle("Heat Map by Hour and Day")
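geom_tile() draws one cell per day and hour combination and colors it by Total. With no fill scale specified, ggplot2 falls back to its default continuous gradient. A sketch of the same kind of heat map on made-up totals, with an explicit gradient added through scale_fill_gradient(); the two colors are arbitrary choices for illustration:

```r
library(ggplot2)

# Made-up totals for 3 days x 4 hours (12 cells).
toy <- expand.grid(day = factor(1:3), hour = factor(c(0, 8, 17, 23)))
toy$Total <- c(5, 7, 6, 20, 25, 22, 40, 45, 41, 10, 12, 9)

# Same structure as the plot above, plus an explicit low-to-high gradient.
heat <- ggplot(toy, aes(day, hour, fill = Total)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  ggtitle("Heat Map by Hour and Day (toy data)")

heat  # print the plot
```

A light-to-dark gradient like this tends to make the busiest cells stand out more than the default.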

Code:

ggplot(day_month_group, aes(day, month, fill = Total)) +
            geom_tile(color = "white") +
              ggtitle("Heat Map by Month and Day")

Code:

ggplot(month_weekday, aes(dayofweek, month, fill = Total)) +
            geom_tile(color = "white") +
              ggtitle("Heat Map by Month and Day of Week")

Code:

month_base <- data_2014 %>%
                    group_by(Base, month) %>%
                        dplyr::summarize(Total = n())

dayofweek_bases <- data_2014 %>%
                    group_by(Base, dayofweek) %>%
                        dplyr::summarize(Total = n())

ggplot(month_base, aes(Base, month, fill = Total)) +
            geom_tile(color = "white") +
              ggtitle("Heat Map by Month and Bases")

Code:

ggplot(dayofweek_bases, aes(Base, dayofweek, fill = Total)) +
            geom_tile(color = "white") +
              ggtitle("Heat Map by Bases and Day of Week")

Creating a map visualization of rides in New York

In the final section, we visualize the rides in New York City with a geo-plot of the pickups from April to September 2014, first for all rides and then colored by base.

Code:

min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004

ggplot(data_2014, aes(x=Lon, y=Lat)) +
  geom_point(size=1, color = "blue") +
     scale_x_continuous(limits=c(min_long, max_long)) +
      scale_y_continuous(limits=c(min_lat, max_lat)) +
        theme_map() +
           ggtitle("NYC MAP BASED ON UBER RIDES DURING 2014 (APR-SEP)")

ggplot(data_2014, aes(x=Lon, y=Lat, color = Base)) +
  geom_point(size=1) +
     scale_x_continuous(limits=c(min_long, max_long)) +
      scale_y_continuous(limits=c(min_lat, max_lat)) +
       theme_map() +
          ggtitle("NYC MAP BASED ON UBER RIDES DURING 2014 (APR-SEP) by BASE")
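Drawing every pickup as a point can be slow on the full dataset, which one reader counted at over 4.5 million rows, and rows outside the axis limits are dropped with a "Removed ... rows containing missing values" warning. One way to keep the plot responsive is to filter to the bounding box first and then plot a random sample. The sketch below uses synthetic coordinates in place of data_2014, and the sample size of 5000 is an arbitrary assumption.

```r
library(ggplot2)

min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004

# Synthetic stand-in for data_2014: random points around New York City.
set.seed(42)
demo <- data.frame(Lon = runif(20000, -74.3, -73.6),
                   Lat = runif(20000, 40.5, 41.0))

# Keep only points inside the plotting window, so nothing is dropped later.
in_box <- subset(demo, Lon >= min_long & Lon <= max_long &
                       Lat >= min_lat & Lat <= max_lat)

# Plot a random sample instead of every row.
sample_size <- min(5000, nrow(in_box))
map_sample <- in_box[sample(nrow(in_box), sample_size), ]

map_plot <- ggplot(map_sample, aes(x = Lon, y = Lat)) +
  geom_point(size = 0.5, alpha = 0.3, color = "blue") +
  ggtitle("Sampled pickups (synthetic data)")

map_plot  # print the plot
```

Semi-transparent points (alpha) keep dense areas readable. A larger sample gives a more detailed map at the cost of rendering time.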

Summary

In this Uber data analysis R project, we saw how to create data visualizations. We used packages like ggplot2 to plot trips across several time frames of the year, which let us see how time affected customer trips. Finally, we made a geo-plot of New York that showed how users made trips from the different bases.

We hope you enjoyed this R data science project. Keep visiting DataFlair for more projects on technologies like Big Data, R and Data Science. If you run into any issue while working through it, leave a comment below and we will help.

DataFlair Team

JeongHwa Yoo says:
July 23, 2019 at 1:25 pm
uber-raw-data-apr14.csv
uber-raw-data-may14.csv
uber-raw-data-jun14.csv
uber-raw-data-jul14.csv
uber-raw-data-aug14.csv
uber-raw-data-sep14.csv
Please
I want to study with Uber samples.
I want.
I want uber data.
Thank you.
Reply
DataFlair Team says:
July 25, 2019 at 2:59 pm
Hi JeongHwa,
Apologies for the problem you faced. We have added the dataset now. You can check the blog and continue your project in R.
Happy learning.
Reply
shahid says:
September 2, 2019 at 10:54 pm
sir i want the uber data set
Reply
- DataFlair Team says:
  September 3, 2019 at 4:23 pm
  Hey Shahid,
  Thanks for the comment, but we already added a link for Uber dataset. Please refer the link in the 1st heading and download the dataset.
  Reply
Hector Alvaro Rojas says:
September 9, 2019 at 5:02 am
Hi guys:
Great article!
Anyway, there is still a problem to download the datasets from https://drive.google.com/file/d/1emopjfEkTt59jJoBH9L9bSdmlDC4AR87/view
This error message appear by the time I try to download:
The page isn’t redirecting properly
An error occurred during a connection to doc-10-c4-docs.googleusercontent.com
Any chance for you to fix the situation?
Reply
- DataFlair Team says:
  September 9, 2019 at 10:33 am
  Hi Hector,
  Sorry for the inconvenience.
  We checked the same link at our end and it is working properly. If you are getting the same error repeatedly, I suggest you to please delete your browsing history and cached memory and then try opening the link. It will surely work fine then.
  If you have any other queries, feel free to comment back.
  Happy to help.
  Reply
meenu says:
September 27, 2019 at 11:50 am
can you add more explanation about the coding and output
Reply
- pavithiran says:
  September 29, 2019 at 3:28 pm
  ok
  Reply
santhosh says:
September 29, 2019 at 3:27 pm
ok
Reply
Group H Tech Service says:
October 24, 2019 at 4:24 pm
Hi DataFlair,
Thanks for the greate tutorial on Uber Data analysis.
But I am getting an error when I run the plotting trips by the hours in a day (“Error in is.list(val) : object ‘hour_data’ not found”) I don’t know what it refers to because the hour_data object points to data_2014 which is populated with 4534327 observations.
Need help, thanks!
Reply
Saptarshi says:
November 3, 2019 at 1:44 pm
ggplot(data_2014, aes(x = Lon, y = Lat))+
geom_point(size=1, color = “blue”)+
scale_x_continuous(limits = c(min_long, max_long))+
scale_y_continuous(limits = c(min_lat, max_lat))+
ggtitle(“NYC map based on Uber rides during 2014 (Apr-Sep)”)
Warning message:
Removed 71701 rows containing missing values (geom_point).
The map is not generating and R is getting hanged. Can you tell me the reason?
Reply
- Ashish Kumar Singh says:
  April 21, 2020 at 10:13 am
  Hey Saptarshi, Are you able to get the solve “Warning message:
  Removed 71701 rows containing missing values (geom_point).”
  Reply
Padma says:
November 4, 2019 at 8:49 pm
Hi please can I get the architecture diagram of Uber data analysis using R
Reply
paddy says:
November 5, 2019 at 8:22 pm
hello,which data science algorithm are you using in this R project .
Reply
- DataFlair Team says:
  November 7, 2019 at 9:46 am
  Hi paddy,
  In this R project, we have showcased various data visualization techniques used for data analysis. Using the plots, we can use several data analysis algorithms to find the relationship between the variables used in the graphs.
  Reply
  - Saptarshi says:
    November 8, 2019 at 11:43 am
    Hi,
    Can you pls reply to my query?
    Reply
Toni Alicia Yungai says:
November 25, 2019 at 9:06 am
There are parts of the code missing after: 3. Reading the Data into their designated variables
data_2014$hour <- factor(hour(hms(data_2014$Time)))
data_2014$minute <- factor(minute(hms(data_2014$Time)))
data_2014$second <- factor(second(hms(data_2014$Time)))
Reply
Akshay says:
November 30, 2019 at 4:46 pm
I am getting this error:
Error in FUN(if (length(d.call) < 2L) newX[, 1] else array(newX[, 1L], :
length(Lab) == 3L is not TRUE
Reply
safeer says:
December 25, 2019 at 7:26 pm
ggplot(data_2014, aes(x = Lon, y = Lat))+
geom_point(size=1, color = “blue”)+
scale_x_continuous(limits = c(min_long, max_long))+
scale_y_continuous(limits = c(min_lat, max_lat))+
ggtitle(“NYC map based on Uber rides during 2014 (Apr-Sep)”)
Warning message:
Removed 71701 rows containing missing values (geom_point).
The map is not generating and R is getting hanged. Can you tell me the reason thnx
Reply
- Jc says:
  February 27, 2020 at 7:53 pm
  I got same problem.. my R hanged :/
  Reply
- kejora says:
  March 25, 2020 at 9:33 pm
  to admin, please give solution for this problem
  Reply
Sai says:
January 5, 2020 at 12:34 pm
I want abstract for this project right now immediately
Reply
kejora says:
March 25, 2020 at 9:23 pm
to admin, please give solution for this problem
Reply
Arsh says:
April 2, 2020 at 11:43 am
data_2014$Date.Time <- ymd_hms(data_2014$Date.Time)
when I execute this command error message appears
"cannot allocate vector size 1.3 MB"
please help me what is issue in it
Reply
Arsh says:
April 2, 2020 at 11:45 am
data_2014$Date.Time <- ymd_hms(data_2014$Date.Time)
when i run this command an error message appears
" cannot allocate vector size of 1.3 MB" please help me to resolve this issue
Reply
Tanu Dharam says:
April 6, 2020 at 6:18 pm
which Mining Algorithm is used on Datasets???
Reply
Sayali says:
April 17, 2020 at 4:36 pm
please can you tell which methodology is used ?
Reply
Aniket says:
May 22, 2020 at 9:50 pm
I’m getting error during hours trip plot as my data table reading na strings givin only one value 45 thousand something that means it only adding all values how to solve this problem I checked I write the same code as of u give .
Reply
geeky beast says:
August 26, 2020 at 3:19 pm
> data_2014$Date.Time <- ymd_hms(data_2014$Date.Time)
Error in ymd_hms(data_2014$Date.Time) : could not find function "ymd_hms"
Please help me to solve this error
Reply
Pravallika says:
November 2, 2020 at 11:14 pm
Hy i have a question can you tell me the algorithm name that you have used in this Uber data Analysis project?
Reply
Pravallika says:
November 2, 2020 at 11:16 pm
what does Lat an lon refers to? in the datasets
Reply
KIRTI KADAM says:
November 7, 2020 at 5:56 pm
ggplot(data_2014, aes(x = Lon, y = Lat))+
geom_point(size=1, color = “blue”)+
scale_x_continuous(limits = c(min_long, max_long))+
scale_y_continuous(limits = c(min_lat, max_lat))+
ggtitle(“NYC map based on Uber rides during 2014 (Apr-Sep)”)
Warning message:
Removed 71701 rows containing missing values (geom_point).
The map is not generating and R is getting hanged. Can you tell me the reason ?
Reply
Chetanc says:
November 16, 2020 at 1:32 pm
Which algorithm is used in this project
Reply
Akash says:
November 25, 2020 at 6:41 pm
Can anyone tell is there any possibility of using Machine learning over the database and if yes,what techniques to use?
Reply
matias says:
January 1, 2021 at 10:46 pm
The code for #11 does not match up with the visualization. Instead, the code should look like this:
month_day <- data_2014 %>%
group_by(month, dayofweek) %>%
dplyr::summarize(Total=n())
datatable(month_day)
ggplot(month_day, aes(month, Total, fill=dayofweek)) +
geom_bar(stat="identity", position="dodge") +
ggtitle("Trips by Day and Month") +
scale_y_continuous(labels=comma) +
scale_fill_manual(values=colors)
Reply
Manika says:
January 31, 2021 at 10:26 pm
Hi,
I am unable to run the below mentioned lines of codes. Gives me parsing error.
data_2014$hour = factor(hour(hms(data_2014$Time)))
data_2014$minute = factor(minute(hms(data_2014$Time)))
data_2014$second = factor(second(hms(data_2014$Time)))
Error is as follows :
Warning message:
In .parse_hms(…, order = “HMS”, quiet = quiet) :
Some strings failed to parse, or all strings are NAs
Reply
Manika says:
January 31, 2021 at 10:34 pm
Hey there! Hope you’re doing okay !!
Could you please suggest a way to find the missing parts of these codes?
Reply
Pavani Pratapa says:
February 9, 2021 at 6:23 am
Hi! Is there any way I could access a Python Version of this same project?
Reply
