class: center, middle, inverse, title-slide

# Chicago Taxi Data Analysis
## SCV Group
###
### 27 Oct 2021

---
class: title-slide,middle
background-image: url("assets/taxi.jpeg")
background-position: 10% 90%, 100% 50%
background-size: 100px, 100% 100%

<!-- background-color: #0148A4 -->

## .black[Hi! During your Journey,]
## .black[Drive safe &]
## .white[Scroll each page please]
## .black[Enjoy! :-)]

---
## Introduction

.scroll-output[

## Statement of the goal

In this report we analyze taxi trips made in Chicago in 2020. Our data comes from the Chicago Data Portal and can be found [here](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew).

The goal of this project is to give taxi companies and their drivers in Chicago a guide to which areas they should operate in to maximize their profit at different times of the day. This translates to two questions:

1. Which areas generate the most revenue for taxi companies and how does this change throughout the day?
2. What can taxi drivers expect to earn in the different areas?

## Brief summary of the approach

We start by performing exploratory data analysis to familiarize ourselves with the data. This includes studying the distribution of continuous variables, the proportion of levels in categorical variables and the relationships between them.

We combine OpenStreetMap, ggplot2, leaflet and sf to construct interactive maps of total revenue, population, average pickup time and average hourly rate in Chicago. In addition, standalone Shiny apps let us visualize changes throughout the day. This type of visualization gives an overview of basic statistics for `trip_total` across the whole day, lets us focus on a particular period of the day, and makes it easy to compare these statistics between two or more periods.
To explain our findings, we build predictive models using multiple regression, and to quantify the uncertainty of our findings, we construct confidence intervals with the bootstrap method. Finally, we attempt to uncover the unexplained part of the differences by applying principal component analysis to the demographic data.

## Introduction to the problem

Since Uber was introduced, the taxi industry has been under massive pressure, which has put many drivers and taxi companies in a tough economic position. There are many solid analyses of taxi trips. However, most of these focus on the big trends in the taxi industry, often as time series of the steady decline in activity. By giving a thorough analysis of which areas are most profitable, our goal is to give taxis an extra edge against the competition.

## Exploratory Data Analysis

### Metadata

**Taxi_Trips_2020.csv**

- `trip_start_timestamp`: Character representation of when the trip started, rounded to the nearest 15 minutes.
- `trip_end_timestamp`: Character representation of when the trip ended, rounded to the nearest 15 minutes.
- `trip_seconds`: Numeric representation of the time of the trip in seconds.
- `trip_miles`: Numeric representation of the distance of the trip in miles.
- `pickup_community_area`: Discrete integer uniquely defining in which Community Area the trip began.
- `dropoff_community_area`: Discrete integer uniquely defining in which Community Area the trip ended.
- `trip_total`: Numeric value indicating the total cost of the trip, in USD.

**Chicago_Community_Areas.shp**

- `geometry`: MultiPolygon object, list of coordinates representing the borders of each area.
- `area_num_1`: Discrete integer uniquely defining each Community Area.
Corresponds to `pickup_community_area` and `dropoff_community_area` in Taxi Trips.
- `community_name`: Character variable representing the name of each Community Area.

**wiki.csv**

- `no.`: Equivalent to `area_num_1`.
- `population`: Numeric representation of the population of each zone.
- `area(km2)`: Numeric representation of the area of each zone, in km2.
- `density(km2)`: Numeric representation of the population density of each zone, per km2.

**Per_Capita_Income.csv**

- `community_area_number`: Equivalent to `area_num_1`.
- `percent_of_housing_crowded`: Numeric representation of the percent of occupied housing units with more than one person per room.
- `percent_households_below_poverty`: Numeric representation of the percent of households living below the federal poverty level.
- `percent_aged_16_unemployed`: Numeric representation of the percent of persons aged 16 years or older in the labor force that are unemployed.
- `percent_aged_25_without_high_school_diploma`: Numeric representation of the percent of persons aged 25 years or older without a high school diploma.
- `percent_aged_under_18_or_over_64`: Numeric representation of the percent of the population under 18 or over 64 years of age.
- `per_capita_income`: Numeric representation of per capita income.

.

]

---
## Exploratory Data Analysis

.scroll-output[

### Starting off

The plots in this part are based on a random sample of 10 000 trips from our dataset. We first look at the distribution of `trip_total` by plotting a histogram and the corresponding boxplot.

.pull-left[
<img src="./assets/Plots_taxis/3.png" width="1867" />
]

.pull-right[
<img src="./assets/Plots_taxis/1.png" width="1867" />
]

We observe that the data is mostly concentrated close to zero, with a heavy tail stretching to 600 USD. This could indicate a power-law distribution; however, plotting on a logarithmic scale shows no sign of this. We still note that we should be aware of extreme outliers, as they might impact calculations of descriptive statistics further on.
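
A first look of this kind could be sketched as follows. This is a minimal illustration on simulated fares, not our original code: the data frame name `trips`, the simulated log-normal values and the log-log check for a power law are all assumptions made for the example.

```r
set.seed(1)
# Stand-in for the real data: simulated heavy-tailed fares (NOT the Chicago data)
trips <- data.frame(trip_total = rlnorm(50000, meanlog = 2.5, sdlog = 0.8))

# Draw a random sample of 10 000 trips, as in the text
trips_sample <- trips[sample(nrow(trips), 10000), , drop = FALSE]

# Histogram and boxplot of the raw totals
hist(trips_sample$trip_total, breaks = 100, main = "trip_total", xlab = "USD")
boxplot(trips_sample$trip_total, horizontal = TRUE, xlab = "USD")

# Rough power-law check (one common choice): on log-log axes a power law
# would show up as an approximately straight line
h <- hist(trips_sample$trip_total, breaks = 100, plot = FALSE)
keep <- h$counts > 0
plot(log10(h$mids[keep]), log10(h$counts[keep]),
     xlab = "log10(trip_total)", ylab = "log10(count)")
```
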
## Visualization of Categorical Variables - Frequency of Areas

A natural next step is plotting the distribution for each area. Since the trip total is so prone to outliers, we instead investigate the activity in each area, i.e. the frequency of trips. The following plots show the 10 most common areas for pickup and dropoff respectively, in addition to the frequency in all other areas.

.pull-left[
<img src="./assets/Plots_taxis/7.png" width="1867" />
]

.pull-right[
<img src="./assets/Plots_taxis/8.png" width="1867" />
]

The areas "Near North Side", "The Loop" and "Near West Side" account for more than 60% of the activity.

## Distribution of log trip_total per Area

The following plots present the distribution of the natural logarithm of `trip_total`. By considering the log we can see differences between the areas more clearly, as we now compare orders of magnitude instead of raw sizes. We notice a difference in total cost between the most common areas: Garfield Ridge and O'Hare seem to have a higher cost for pickup, whereas for dropoff Morgan Park seems to have the highest cost.

.pull-left[
<img src="./assets/Plots_taxis/9.png" width="1867" />
]

.pull-right[
<img src="./assets/Plots_taxis/11.png" width="1867" />
]

## QQ plots

To explore the behaviour of `trip_total` we plot QQ plots. As expected, `trip_total` doesn't follow a normal distribution. The logarithm of `trip_total` fits better, but still has multiple outliers that deviate from the theoretical quantiles. When we study the distribution of the log of `trip_total` within each of the most common areas, the fit improves further.
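
The comparison between the raw and the log-transformed QQ plots can be sketched in base R. This is an illustration on simulated values, not the Chicago data; the helper `qq_cor()` is a hypothetical name introduced here to summarize how close each QQ plot is to a straight line.

```r
set.seed(2)
# Simulated heavy-tailed fares standing in for trip_total (assumption)
trip_total <- rlnorm(10000, meanlog = 2.5, sdlog = 0.8)

# QQ plots against the normal distribution, raw and log-transformed
op <- par(mfrow = c(1, 2))
qqnorm(trip_total, main = "trip_total")
qqline(trip_total)
qqnorm(log(trip_total), main = "log(trip_total)")
qqline(log(trip_total))
par(op)

# Hypothetical helper: correlation between theoretical and sample quantiles.
# Values close to 1 mean the points lie close to a straight line.
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)
  cor(q$x, q$y)
}
qq_cor(trip_total)
qq_cor(log(trip_total))
```
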
.pull-left[
<img src="./assets/Plots_taxis/19.png" width="1867" />
]

.pull-right[
<img src="./assets/Plots_taxis/20.png" width="1867" />
]

.pull-left[
<img src="./assets/Plots_taxis/18.png" width="1867" />
]

.pull-right[
<img src="./assets/Plots_taxis/21.png" width="1867" />
]

## Time series

The following plots show how the price changes by month for each pickup area and dropoff area. For example, we see that the pickup price seems to be higher in every month in Garfield Ridge and O'Hare. We also notice that almost all areas follow a similar trend over time; for example, the cost of trips dropped in all areas in April.

.pull-left[
<img src="./assets/Plots_taxis/23.png" width="1867" />
]

.pull-right[
<img src="./assets/Plots_taxis/27.png" width="1867" />
]

.

]

---
## Visualisation of Total Revenue by Area

.scroll-output[

To visualize and highlight the differences between the areas we have used interactive maps provided by the leaflet library. We used [this](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6) dataset provided by the Chicago Data Portal to draw the borders between community areas. We attribute a color to each area, which indicates the size of the difference relative to the other areas. Our goal was to get a clear understanding of how area and time of day affect revenue, and to communicate these findings as clearly as possible.

As a basis for this analysis we have plotted a map centered around Chicago. This was done by using the Leaflet library to extract a map of Chicago through Google’s map API. Thereafter we extracted `pickup_community_area`, `community_name` and `geometry` from the dataset of community borders.

We will first look into how total revenue is distributed among the areas. The relevant variables we will use are `pickup_community_area` and `trip_total`.
To indicate the total revenue in each area, we summed `trip_total` over all trips starting in that area. We then colored the areas in shades of blue representing their respective revenues. Since revenue often differs between areas by several orders of magnitude, we made the bins logarithmic. The coloring must therefore be interpreted as an indication of the order of magnitude of the revenue, not the exact revenue.

]

---
.scroll-output[

## Total Revenue by Area
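
The aggregation and logarithmic binning described on the previous slide could be sketched as follows. This is a minimal example on a handful of made-up trips; the column name `total_revenue` and the choice of bin edges at powers of ten are assumptions for illustration.

```r
# Made-up trips (NOT the Chicago data), using the column names from the metadata
trips <- data.frame(
  pickup_community_area = c(8, 8, 32, 32, 32, 76, 56, 44),
  trip_total            = c(12.5, 30, 8, 950, 42, 60000, 700, 9)
)

# Sum trip_total over all trips starting in each area
revenue <- aggregate(trip_total ~ pickup_community_area, data = trips, FUN = sum)
names(revenue)[2] <- "total_revenue"

# Logarithmic bins: edges at powers of ten covering the observed range
lo <- floor(log10(min(revenue$total_revenue)))
hi <- ceiling(log10(max(revenue$total_revenue)))
breaks <- 10^(lo:hi)

# Each area is assigned the order-of-magnitude bin its revenue falls into
revenue$bin <- cut(revenue$total_revenue, breaks = breaks,
                   include.lowest = TRUE, dig.lab = 10)
revenue
```

Bin edges like these could then be handed to a binned color palette for the map, for example `leaflet::colorBin()`; whether the original map was built exactly this way is not stated here.
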
From the map, it is clear that a few areas generate a disproportionate amount of the revenue, with Loop, O'Hare and Near West Side being the most prominent. This distribution is not, however, necessarily static throughout the day. The following slide investigates whether there are significant changes from hour to hour.

.

]

---
## Shiny App: Analysis of Total Revenue by Area

.scroll-output[

In this section we investigate how the total, mean and max trip revenue change by the hour of the day. We also look for differences between pickup areas and dropoff areas.

.orange[For a better view of this page, you can go to https://uberlu.shinyapps.io/report1_thomas/]

### by pickup areas of Chicago

.orange[Please click: https://uberlu.shinyapps.io/Pickup/]

For the total revenue, the central areas (Near West Side, Loop and Near North Side) and the airports (O'Hare and Garfield Ridge) generally have the highest values. The only exception is during the night, when the values go down for the airports. This could be because there is less flight activity during the night. The city center is more stable, as the night life is concentrated there.

The mean revenue per area is fairly homogeneous along the shore of Lake Michigan. As before, the airports have some of the highest means during the day.

For the max of the revenue per area, we see that the maximal value at each time of the day comes from the city center, particularly from Near North Side and Loop, with values above 5000 USD most of the time. This is probably because, as we’ve seen before, the city centre has the highest activity. It is thus more likely to contain extreme outliers.

### by drop-off areas of Chicago

This section is similar to the previous one, but instead of looking at pickup areas, we look at dropoff areas and the end time of each trip. As before, we can take a look at each hour of the day from 1 A.M.
to 12 A.M.

.orange[Please click: https://uberlu.shinyapps.io/Dropoff/]

For the total revenue, the change in dropoff values is quite similar to the change in pickup values. However, the highest values are less concentrated in the city center and more scattered around it. We observe the same pattern for the airports.

For the mean, the values are a lot more homogeneous than for pickups. They are dispersed all over the city.

For the maximum values, we see that the higher values stretch from the city center to Lake View, an area in the north. As earlier, the airports have high values.

This was a brief overview of the basic data, to give an idea of the situation for both pickup and dropoff. Considering that taxi drivers have little control over the dropoff area they end up in, we will consider only the pickup case in the further analysis.

.

]

---
## Total Revenue vs. Zone Analysis by Maps

.scroll-output[

With the revenue map in mind, we will now investigate why some areas generate more revenue than others. To address this, we included population data from Wikipedia as well as an OpenStreetMap dataset of amenities, i.e. restaurants, theaters, hospitals, etc.

## Population

Population can potentially be one of the main factors affecting activity in each area, and hence the total revenue as well. We investigate this by extracting population, density and area data from the [Wikipedia page on Chicago's community areas](https://en.wikipedia.org/wiki/Community_areas_in_Chicago), and combining it with our original taxi data set. We apply a logarithmic transformation to the population to highlight the differences between the areas on the map.

By comparing the two maps, we spot similarities between the population map and the trip total map. This initially suggests that population might be an explanatory variable for total revenue. This relationship will later be investigated by multiple regression.

## Population by Area
## Combine with Amenities

Apart from population, the location of certain city amenities might be another factor which affects the total revenue of each area. We therefore extract amenity information from OpenStreetMap and plot it using semitransparent points. By plotting on top of Chicago's toner background, we find that the places with the most densely clustered orange dots are the most popular areas; these roughly define the central areas.

<img src="./assets/2.png" width="1867" />

We would now like to reduce redundant information on the map. We therefore extract the street data from OpenStreetMap and manually select only the roads that allow cars to drive through. In addition we exclude paths such as living streets and footways. These roads are plotted .brown[in black], where darker lines indicate a bigger street, e.g. a highway.

Then, we need to identify which zones these dots belong to. We use the shape file with the geometry information of each area to plot the boundaries .blue[in blue] by calling `geom_sf()`, which allows us to draw different geometric objects.

Finally, we again plot the amenities using points .orange[in orange]. Now we can clearly identify which zones are the most popular ones.

<img src="./assets/3.png" width="1867" />

By comparing this graph with the earlier total revenue map, we have good reason to believe that higher revenue in certain zones might be related to the density of certain amenities. The only exception is O’Hare, a hot spot for taxis with few amenities. O’Hare is the location of Chicago O'Hare International Airport, which explains why it attracts many taxis and produces such a high revenue.

As an aside, due to the limitations of the OpenStreetMap data set, we cannot capture all the places where people are likely to take a taxi. However, the current information is enough for us to roughly locate the city center.
## Visualise 24hrs Changes of Total Revenue by Animation

The previous visualizations depict a static model. To depict changes throughout the day, we split one day into 24 hours and plot the total revenue per area, per hour. This can potentially help taxi drivers find the most profitable zone within a specific time frame.

We plot the polygons representing the communities as before. From the time stamp we get the hour that each record belongs to. We then group the records by hour and plot the corresponding total revenue.

We use the .green["Greens"] palette as the background, a continuous color scale that indicates changes in the base-10 logarithm of the total revenue. This is more sensitive to changes in areas with lower total revenue. The yellow bubbles use the original value of the total revenue in that hour, and are more sensitive to areas with larger total revenue. Combining the two, we obtain a better visual perception of how the total revenue changes throughout the 24 hours.

The animation is generated by concatenating 24 static maps, one for each hour.

<!-- -->

From the animation above, we spot the same areas of highest interest throughout most of the day. These are located, as depicted earlier, mostly in and around the city center and the airport.

.

]

---
## Analysis & Discussion

.scroll-output[

## Investigation on profitability for individual driver

From the map it is clear that a few areas generate a disproportionate amount of the revenue, with Loop, O'Hare and Near West Side being the most prominent. Does this imply that every driver should strive to spend as much time as possible in these areas? Not necessarily.

The total revenue depends on several factors which differ between areas. For each area there are e.g. differences in population, movement of people and number of active taxis.
Although an area generates the most money overall, it doesn't necessarily imply that each taxi driver is better off in that area. Just think of the large queues of taxis which are normal to see at airports - not exactly time efficient!

If we start by neglecting the variations of average trip time and revenue between areas, the average waiting time in each area will be a fairly good representation of how attractive the area is for each individual driver, as the drivers only earn money when they have a passenger.

We represent the average waiting time as the average time from when a taxi drops someone off in an area until the same taxi picks someone else up in the same area. We only consider pick-ups and drop-offs which happen in the same area. This is because taxis which travel from one area to another between drop-off and pick-up would inflate the average time. This could potentially lead to the central areas getting higher averages, as taxi drivers are inclined to travel back to the city center after a drop-off.

We extract `taxi_id`, `trip_start_timestamp`, `trip_end_timestamp`, `pickup_community_area` and `dropoff_community_area` from our data set. `taxi_id` is a unique string identifier for each taxi, and `pickup_community_area` and `dropoff_community_area` are as described earlier. `trip_start_timestamp` and `trip_end_timestamp` are date strings in the format **MM/DD/YYYY hh:mm:ss**, which indicate the time of trip start and end. We note that the time stamps have been rounded to the nearest 15 minutes for privacy purposes. Therefore the result will not be 100% accurate, but it will give a good estimate of the true time to pickup. The results are shown below.
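
The waiting-time calculation described above could be sketched in base R as follows. This is a minimal illustration on six made-up trips, not our original code: the helper columns `start`, `end` and `wait_min` are names introduced for the example, and the timestamp format follows the **MM/DD/YYYY hh:mm:ss** description in the text.

```r
# Six made-up trips for two taxis (NOT the Chicago data)
trips <- data.frame(
  taxi_id = c("a", "a", "a", "b", "b", "b"),
  trip_start_timestamp = c("01/06/2020 08:00:00", "01/06/2020 08:45:00",
                           "01/06/2020 10:00:00", "01/06/2020 09:00:00",
                           "01/06/2020 09:30:00", "01/06/2020 11:00:00"),
  trip_end_timestamp   = c("01/06/2020 08:15:00", "01/06/2020 09:15:00",
                           "01/06/2020 10:30:00", "01/06/2020 09:15:00",
                           "01/06/2020 10:00:00", "01/06/2020 11:15:00"),
  pickup_community_area  = c(8, 32, 76, 32, 32, 8),
  dropoff_community_area = c(32, 28, 8, 32, 8, 76),
  stringsAsFactors = FALSE
)

# Parse the date strings
fmt <- "%m/%d/%Y %H:%M:%S"
trips$start <- as.POSIXct(trips$trip_start_timestamp, format = fmt, tz = "UTC")
trips$end   <- as.POSIXct(trips$trip_end_timestamp,   format = fmt, tz = "UTC")

# Order each taxi's trips in time, then pair every trip with the next one
trips <- trips[order(trips$taxi_id, trips$start), ]
n    <- nrow(trips)
prev <- trips[-n, ]
nxt  <- trips[-1, ]

# Keep only pairs from the same taxi where the drop-off area of the first
# trip equals the pickup area of the next trip
same_taxi <- prev$taxi_id == nxt$taxi_id
same_area <- prev$dropoff_community_area == nxt$pickup_community_area
ok <- same_taxi & same_area & !is.na(same_area)

waits <- data.frame(
  area     = nxt$pickup_community_area[ok],
  wait_min = as.numeric(difftime(nxt$start[ok], prev$end[ok], units = "mins"))
)

# Average waiting time per area, in minutes
avg_wait <- aggregate(wait_min ~ area, data = waits, FUN = mean)
avg_wait
```

In this toy example, taxi `a` waits 30 minutes and taxi `b` waits 15 minutes in area 32, and taxi `b` waits 60 minutes in area 8; the pair where taxi `a` changes area between drop-off and pickup is excluded.
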
By considering average waiting time we get a fairly different picture. The revenue and activity plots previously pointed towards O'Hare and the central areas around Loop as the most attractive - or at least as generating the most money. We can now see that the taxis in these areas are more prone to waiting time, with O’Hare averaging 95 minutes between trips. This is probably an indication of higher competition.

However, none of these plots paints the whole picture on its own. Although the waiting time at O’Hare airport is large, a trip from the airport to the city center would generate much more revenue than a trip from Loop to West Town. To take this into account we go back to the factors we previously neglected: average trip total and average trip time. By combining these, our goal is to get a measure of *expected* revenue per hour.

We can decompose revenue per hour into:

\begin{align}
\text{revenue per hour} = \frac{\text{revenue per trip}}{\text{time per trip}}
\end{align}

We further view the time per trip as not only the length of the trip, but the sum of trip time and waiting time. We then obtain:

\begin{align}
\text{revenue per hour} = \frac{\text{revenue per trip}}{\text{waiting time} + \text{trip time}}
\end{align}

To obtain the expected revenue per hour for each area, we simply calculate the average of the three components by area and combine them.

To measure the accuracy of our estimator we used the `boot` library to generate 1000 random samples from each area, and calculated the average hourly rate for each. Since the function `boot.ci` didn't work on the data set, we extracted a 95% confidence interval by sorting the 1000 sampled values in each area and taking the 25th and 975th values.

The results and the code for bootstrapping can be viewed on the next page.

]

---
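
.scroll-output[

The percentile bootstrap described on the previous page can be sketched as follows. This is not the original bootstrapping code, which is not reproduced in this document; it is a minimal illustration on simulated trips for a single area, using plain resampling with `sample()` instead of the `boot` library. The column `wait_seconds` and the helper `hourly_rate()` are names assumed for the example.

```r
set.seed(42)
# Simulated trips for ONE area (NOT the Chicago data)
n <- 200
area <- data.frame(
  trip_total   = rlnorm(n, meanlog = 2.7, sdlog = 0.5),  # USD per trip
  trip_seconds = runif(n, 300, 2400),                    # trip duration
  wait_seconds = runif(n, 0, 3600)                       # waiting time before the trip
)

# revenue per hour = revenue per trip / (waiting time + trip time),
# using the area averages of the three components
hourly_rate <- function(d) {
  mean(d$trip_total) / ((mean(d$wait_seconds) + mean(d$trip_seconds)) / 3600)
}

# 1000 bootstrap samples: resample the trips with replacement and recompute
B <- 1000
boot_rates <- replicate(B, hourly_rate(area[sample(n, replace = TRUE), ]))

# 95% percentile interval: 25th and 975th of the sorted bootstrap values
sorted <- sort(boot_rates)
ci <- c(lower = sorted[25], upper = sorted[975])

estimate <- hourly_rate(area)
estimate
ci
```

Repeating this for every area gives one interval per area; areas with few trips will tend to produce wider intervals.

]
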
---
.scroll-output[

<em><strong>Ⅰ</strong> The 95% confidence interval is shown when hovering over an area</em><br/>
<em><strong>Ⅱ</strong> The estimator is represented as the bootstrapped average of the 25th and 75th percentiles</em>

The bootstrapped confidence intervals give us a good measure of the uncertainty in each area. By considering not only our estimator but also the uncertainty of our measurements, we obtain a more representative model of expected revenue. We observe, for example, that there are great differences between the sizes of the confidence intervals. The central areas around Loop have very tight confidence intervals, which is due to high activity, i.e. many samples. On the other hand, there are also areas with huge confidence intervals.

The model also reveals interesting patterns. The areas with the highest expected hourly rate often have the largest confidence intervals. However, in the category below, i.e. 35-40 USD, there are many areas with an expected hourly rate close to 40 USD which also have tight confidence intervals. Even more interestingly, these areas do not coincide with the areas with the most activity.

These findings indicate that it is not necessarily the most active areas which will earn the taxi drivers the most money. To assess the strength of this hypothesis we would need to test it on more data from different years. An idea for further development would be to calculate the expectations for each hour of the day as well. By combining more data with this segmentation, the model could have real practical value.

.

]

---
## Connecting to Outside Dataset

.scroll-output[

To investigate why specific areas generate the most revenue, as well as why some areas have less waiting time, we inspect the outside data set [Per Capita Income](https://data.cityofchicago.org/Health-Human-Services/Per-Capita-Income/r6ad-wvtk) from the Chicago government website. We will look for correlations among the factors.
## Correlations

When we first plot the correlations, we can clearly see outliers which greatly affect the correlation coefficients. These are dropped before re-plotting. For this plot, .orange[you can click on each cell and see the correlation between the variables].

Our dependent variable `trip_total` is placed in the last column. It is negatively correlated with `percent_aged_under_18_or_over_64`, `percent_aged_25_without_high_school_diploma` and `percent_of_housing_crowded`. It is positively correlated with `per_capita_income`, `population` and `density`. As for `avg_pickup_time`, placed in the second-to-last column, it is negatively correlated with `percent_households_below_poverty` and `percent_aged_16_unemployed`.
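
The effect of a few extreme areas on the correlation coefficient can be illustrated with a small simulation. Everything below is an assumption made for the example: the data are simulated, and the 1.5 x IQR rule is only one possible way of flagging outliers, not necessarily the rule used for the plot above.

```r
set.seed(7)
n <- 77  # number of community areas in Chicago

# Simulated area-level data (NOT the real data): revenue grows with income
areas <- data.frame(per_capita_income = rlnorm(n, meanlog = 10, sdlog = 0.4))
areas$trip_total <- 5 * areas$per_capita_income + rnorm(n, sd = 40000)

# Two extreme areas whose revenue is unrelated to income,
# placed on the two lowest-income areas
idx <- order(areas$per_capita_income)[1:2]
areas$trip_total[idx] <- 5e6

cor_before <- cor(areas$per_capita_income, areas$trip_total)

# Assumed outlier rule: drop points more than 1.5 IQR outside the quartiles
q   <- quantile(areas$trip_total, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
is_outlier <- areas$trip_total < q[[1]] - 1.5 * iqr |
              areas$trip_total > q[[2]] + 1.5 * iqr
clean <- areas[!is_outlier, ]

cor_after <- cor(clean$per_capita_income, clean$trip_total)

# Full correlation matrix of the cleaned data, rounded for display
round(cor(clean), 2)
```
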
## Visualisation of Outside Dataset

To get a general idea of what each factor looks like across the areas, we plot the factors onto the area polygons.

.pull-left[
<img src="./assets/8.png" width="1867" />
]

.pull-right[
<img src="./assets/9.png" width="1867" />
]

.

]

---
## Analysis & Discussion: What can we find under the hood?

.scroll-output[

## Multiple Regression by Backward Model Selection

As mentioned earlier, we will investigate the relationship between favorable traits of areas and the demographic data of the same areas.

First, we perform multiple regression for `trip_total`. To find the most appropriate model, we start with the model containing all possible explanatory variables. Progressively, we remove the least informative variable, and stop when all the variables in the current model are significant at the level `\(\alpha = 0.05\)`. We do a backward search using the Akaike Information Criterion (AIC) and get the following model.

```
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                      700722.131 212619.909   3.296    0.002
## population                            3.136      1.379   2.273    0.026
## percent_of_housing_crowded       -27518.550   6744.633  -4.080    0.000
## percent_aged_16_unemployed        19959.164   4129.041   4.834    0.000
## percent_aged_under_18_or_over_64 -19294.557   5914.440  -3.262    0.002
```

Therefore, we get

`$${\text{trip_total}} = 700722.131 + 3.136 \times \text{population} - 27518.550 \times \text{percent_of_housing_crowded} + 19959.164 \times \text{percent_aged_16_unemployed} - 19294.557 \times \text{percent_aged_under_18_or_over_64}$$`

Since the p-values for each variable and the intercept are all less than 0.05, we keep all of them in our formula for calculating `trip_total`.

```
## # A tibble: 1 x 3
##   r.squared adj.r.squared p.value
##       <dbl>         <dbl>   <dbl>
## 1      0.45          0.42       0
```

The adjusted R squared is 0.42, which means that roughly 42% of the variance of the dependent variable `trip_total` can be explained by our independent variables; we consider this a relatively good model.
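
A backward search by AIC of this kind can be sketched with `lm()` and `step()` from base R. The example below runs on simulated area-level data, not on our data set, and the coefficients used to simulate `trip_total` are arbitrary values chosen for illustration.

```r
set.seed(3)
n <- 77  # number of community areas in Chicago

# Simulated predictors (NOT the real data), named as in the metadata
areas <- data.frame(
  population                 = round(runif(n, 3000, 100000)),
  percent_of_housing_crowded = runif(n, 0, 15),
  percent_aged_16_unemployed = runif(n, 4, 35),
  per_capita_income          = rlnorm(n, meanlog = 10, sdlog = 0.4)
)

# Simulated response: depends only on the first two predictors
areas$trip_total <- 2e5 + 3 * areas$population -
  2e4 * areas$percent_of_housing_crowded + rnorm(n, sd = 5e4)

# Start from the full model and let step() drop variables while AIC improves
full    <- lm(trip_total ~ ., data = areas)
reduced <- step(full, direction = "backward", trace = 0)

# Coefficient table and adjusted R squared of the selected model
round(summary(reduced)$coefficients, 3)
summary(reduced)$adj.r.squared
```

Note that `step()` selects by AIC alone, so a variable with a p-value above 0.05 can remain in the selected model.
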
Then, we apply the same process to `avg_pickup_time`.

```
##                                             Estimate Std. Error t value
## (Intercept)                                   15.213      3.098   4.910
## percent_of_housing_crowded                     0.674      0.298   2.262
## percent_aged_25_without_high_school_diploma   -0.184      0.101  -1.813
## percent_households_below_poverty              -0.165      0.051  -3.199
## percent_aged_under_18_or_over_64               0.130      0.093   1.392
##                                             Pr(>|t|)
## (Intercept)                                    0.000
## percent_of_housing_crowded                     0.027
## percent_aged_25_without_high_school_diploma    0.074
## percent_households_below_poverty               0.002
## percent_aged_under_18_or_over_64               0.169
```

`$${\text{avg_pickup_time}} = 15.2134 + 0.6735 \times \text{percent_of_housing_crowded} - 0.1839 \times \text{percent_aged_25_without_high_school_diploma} - 0.1646 \times \text{percent_households_below_poverty} + 0.1300 \times \text{percent_aged_under_18_or_over_64}$$`

```
## # A tibble: 1 x 3
##   r.squared adj.r.squared p.value
##       <dbl>         <dbl>   <dbl>
## 1      0.21          0.16       0
```

We get a formula for `avg_pickup_time`; however, the adjusted R squared is just 0.16, which indicates that it is not a good model. We need to find other factors to better explain `avg_pickup_time`, which also suggests a direction for our future work: we can look for more sensible factors in outside data sets to perform further analysis.

<!-- ## PCA -->

<!-- Since we have many variables in our dataset, and from the correlation plot, we find multi-colinearity existing between the features, e.g. `percent_households_below_poverty` and `percent_aged_16_unemployed`. We apply PCA to reduce our input dimension, which might uncover the relationship that's not presented in our multiple regression model. -->

<!-- Currently, we don't find a satisfying model to explain the average pick up time. We think it might be related to traffic etc. We consider using larger dataset where we have more sensible variables that can be used to predict our total revenue, pickup time, as well as hourly rate.
However, once a larger dataset involved, we will get more dimensional data which leads to the redundancy of variables. We already spot multi-colinearity existing between the features, e.g. `percent_households_below_poverty` and `percent_aged_16_unemployed` so far, which also indicates the necessity of performing PCA. -->

<!-- For example, we did elementary PCA for our current dataset, including the `pick up time` which does not have a good regression model associates with. -->

<!-- ```{r,echo = FALSE} -->
<!-- knitr::include_graphics("./assets/a.png") -->
<!-- ``` -->

<!-- From the plot, we find out that when we have 3 principal components, all variance will be explained. -->

<!-- .pull-left[ -->
<!-- ```{r,echo = FALSE} -->
<!-- knitr::include_graphics("./assets/b.png") -->
<!-- ``` -->
<!-- ] -->

<!-- .pull-right[ -->
<!-- ```{r,echo = FALSE} -->
<!-- knitr::include_graphics("./assets/c.png") -->
<!-- ``` -->
<!-- ] -->

<!-- .pull-left[ -->
<!-- ```{r,echo = FALSE} -->
<!-- knitr::include_graphics("./assets/d.png") -->
<!-- ``` -->
<!-- ] -->

.

]

---
## Conclusion and the way forward

.scroll-output[

Through this project we have visualized the differences between the areas in Chicago. The parameters we have looked into might interest both taxi companies and individual taxi drivers. By investigating variables such as total revenue, average revenue and average waiting time, we have obtained a good indication of the taxi situation in Chicago. We further developed this understanding by viewing how the hour of the day affected these factors, and how the density of amenities affected the activity in the areas.

By subsequently quantifying the uncertainties in our dataset, we became aware of the limits of our models. This has also led us to believe that the areas with the most activity are not necessarily the most profitable areas for each individual taxi driver.
This has made us interested in developing the model further, by taking a larger dataset into account and then trying to make the model relevant for different times of the day.

We finally investigated factors which could explain the varying waiting times and total revenues in each area. The total revenue was explained fairly well by our demographic data, whilst the average waiting time was not.

This outlines a path for our future work. By further developing our model for expected hourly rate, for example by segmenting by hour of the day, we can give clearer insights to taxi drivers in Chicago. This naturally demands a bigger dataset. By also developing our predictive model based on demographics and amenities, we can potentially build a model which takes into account the development of neighbourhoods when judging how favorable each area is for taxi drivers.

]

<!-- --- -->

<!-- ```{r} -->
<!-- olav3 -->
<!-- ``` -->