Exploring New York City Transit

December 5, 2016


Is there a relationship between subway ridership and CitiBike usage in New York City? This project aims to illustrate patterns between these two major forms of commute. First, we will compare overall CitiBike usage with subway usage. Next, we’ll analyze the effect of pedestrian traffic on station usage by classifying subway and CitiBike stations by designated regions. Lastly, we’ll observe the effect of weather on these modes of transit. With this information, we can see whether it is worth adding or removing CitiBike stations where usage is disproportionate to pedestrian traffic, and during which months maintenance of these services is most important. This project has the potential to help identify smart locations for bike hubs based on proximity to subways. The initial hypothesis was that it would be smarter to place more bike hubs around overcrowded subway stations.


The following datasets were sourced from NYC Open Data.


Data Transformations and Algorithms

The following outlines the series of data transformations, computations, and/or algorithms applied. These were used to clean data, build useable datasets, compute metrics, and generate resources used for visualizations.

Statistical Analysis

With subway usage, CitiBike usage, annual weather, and regional pedestrian counts, we can examine their relationships using simple statistical methods like correlation and regression. With our datasets, we see how CitiBike and subway usage vary with weather, as well as how they are affected by the region’s population and pedestrian traffic.

I aimed to answer the following family of problems:

  1. How does CitiBike/subway usage vary with weather? Do riders prefer one over the other in certain weather conditions? Before statistical analysis, the following hypotheses were formed:
    • There is a positive correlation between precipitation and subway ridership
    • There is a slight positive correlation between temperature and subway ridership
    • There is a negative correlation between precipitation and CitiBike ridership
    • There is a positive correlation between temperature and CitiBike ridership

    We can solve this problem by combining subway usage, CitiBike usage, and weather by date. Once we have the total subway usage and total CitiBike usage for each day, we can compute the correlation between each of these usages and weather.

  2. Can we predict subway usage from pedestrian count of a region using a linear regression model?
    A regression was run on these two variables. The results section shows our regression parameters and a discussion of the regression’s (lack of) predictive power.

    To run this regression, I needed to overcome the challenge of the subway usage and subway coordinates being in separate datasets. The main difficulty with this data separation, however, was the difference in naming convention for each station! The initial data transformations did not involve this combining. I wrote algorithms to narrow down station name matches as closely as possible. But for the remaining station names that failed to match, I found the name discrepancies and matched them manually.

  3. How does overall CitiBike usage vary with subway usage? To identify the relationship between total subway usage and total CitiBike usage, the correlation between the daily usage of each transit type was calculated, and the paired daily values were plotted on a scatter plot.
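
The station-name matching described in item 2 could be approached roughly as follows. This is only a minimal sketch, not the project's actual algorithm: it assumes the names differ mainly in case, punctuation, and common abbreviations, and it uses Python's standard difflib for fuzzy matching. The function names, the abbreviation table, the 0.8 cutoff, and the sample station names are all hypothetical.

```python
import difflib
import re

# Hypothetical abbreviation table; a real one would be built from the data.
ABBREVIATIONS = {"street": "st", "avenue": "av", "ave": "av", "square": "sq"}

def normalize(name):
    """Lowercase, replace punctuation with spaces, and unify abbreviations."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in name.split())

def match_stations(usage_names, coord_names, cutoff=0.8):
    """Map each usage-dataset name to its closest coordinate-dataset name.

    Returns (matches, unmatched). Names in `unmatched` had no candidate
    above the similarity cutoff and are left for manual review.
    """
    normalized = {normalize(name): name for name in coord_names}
    matches, unmatched = {}, []
    for name in usage_names:
        candidates = difflib.get_close_matches(
            normalize(name), list(normalized), n=1, cutoff=cutoff
        )
        if candidates:
            matches[name] = normalized[candidates[0]]
        else:
            unmatched.append(name)
    return matches, unmatched

# Hypothetical station names, for illustration only.
usage = ["14 ST-UNION SQ", "GRAND CENTRAL-42 ST", "WORLD TRADE CTR"]
coords = ["14th Street - Union Square", "Grand Central - 42nd St", "Cortlandt St"]
matches, unmatched = match_stations(usage, coords)
print(matches)
print(unmatched)
```

A higher cutoff produces fewer false matches but leaves more names for the manual pass, which mirrors the two-stage process described above.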


Regression of Subway Usage on Pedestrian Count

The following regression parameters were generated from our regression of subway station entries on region pedestrian count:

Regression Component    Value
slope                   18.908
intercept               517937.435
r-squared               0.0855

This yields the regression equation:
Subway_Usage = 517937 + 18.9*(Region_Pedestrian_Count)
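
For readers who want to reproduce this kind of fit, the sketch below computes the slope, intercept, and r-squared of a simple least-squares line in plain Python, and evaluates the reported equation for a given pedestrian count. The function names and the sample numbers are hypothetical; only the default slope and intercept come from the table above.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a single predictor, r-squared is the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

def predict_usage(pedestrian_count, slope=18.908, intercept=517937.435):
    """Evaluate the reported regression equation for one region."""
    return intercept + slope * pedestrian_count

# Hypothetical (pedestrian count, station entries) pairs, for illustration only.
pedestrians = [1000, 2000, 3000, 4000]
entries = [520000, 560000, 540000, 600000]
print(fit_line(pedestrians, entries))
print(predict_usage(10000))
```

An r-squared near zero from fit_line means the fitted line predicts usage barely better than the mean of y does.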

Unfortunately, the r-squared value of 0.0855 indicates that variation in pedestrian count by region explains only about 8.6% of the variation in subway usage. This is a surprising finding because subway station usage was expected to be heavily affected by the pedestrian traffic in that station’s region. Perhaps a more precise way of measuring pedestrian count in each station’s vicinity would produce better results. A limitation here is that the pedestrian counts are tied to the pre-selected regions, so the metric covers a wider radius that isn’t specific to individual stations.

Variation in Subway Usage with CitiBike Usage

To see how subway usage varies with CitiBike usage, a correlation was run on total usage by day for each of the transportation methods. The correlation between subway and CitiBike usage is 0.36 with a low p-value of 4.3e-06. The following scatterplot displays the relationship between daily usage for each station type.

Fig. 1 - Correlation Scatter Plot: CitiBike and Subway Usage.

Effect of Weather on CitiBike Usage

The strongest correlation found in this investigation was between CitiBike usage and temperature: 0.76, with a very low p-value of 1.1e-171 (much less than 0.05). Subway usage, by contrast, showed at most a weak correlation with weather, which illustrates how inelastic subway demand is in New York. People will use the subway regardless of external conditions.

Independent Variable    Dependent Variable    Correlation    p-value
Precipitation           Subway Usage          -0.078         0.0478
Temperature             Subway Usage          0.206          0.603
Precipitation           CitiBike Usage        -0.253         1.053e-14
Temperature             CitiBike Usage        0.760          1.100e-171

It is apparent that CitiBike usage is much more sensitive to weather than subway usage. The negative correlation between CitiBike usage and precipitation and the positive correlation between CitiBike usage and temperature were both expected. The weak subway correlations also make sense because New Yorkers rely heavily on the subway system. They are either willing to take it during cold temperatures, or they are simply reliant on it for commute.
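
Correlations like the ones in the table above come from joining daily usage totals with daily weather by date. The sketch below shows one way to do that in plain Python. It assumes each dataset has already been reduced to a mapping from date to a single number; the function names and sample values are hypothetical. It computes only the Pearson coefficient; a p-value would need a statistics library (for example, scipy.stats.pearsonr returns both).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

def correlate_by_date(usage_by_date, weather_by_date):
    """Join two date-keyed mappings on their shared dates.

    Returns (r, n), where n is the number of dates present in both.
    """
    dates = sorted(set(usage_by_date) & set(weather_by_date))
    weather = [weather_by_date[d] for d in dates]
    usage = [usage_by_date[d] for d in dates]
    return pearson_r(weather, usage), len(dates)

# Hypothetical daily values, for illustration only.
bike_trips = {"2016-01-01": 9000, "2016-01-02": 15000, "2016-01-03": 21000}
temperature = {"2016-01-01": 30.0, "2016-01-02": 41.0, "2016-01-03": 55.0}
r, n = correlate_by_date(bike_trips, temperature)
print(r, n)
```

Dates missing from either dataset are dropped by the join, so it is worth checking n before interpreting r.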

Fig. 2 - Temperature on CitiBike Usage.
Fig. 3 - Temperature on Subway Usage.

Subway and CitiBike Station Time Series

This time series shows a very distinct weekly trend in which there are peaks during the typical work week and valleys on the weekends. The only exception to this trend can be seen in CitiBike usage where usage varies based on the time of year and rises during the summer, when use is more even across the weeks.

Fig. 4 - Subway and CitiBike Usage Time Series.
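
The weekly pattern in the time series can also be checked numerically by averaging daily totals per day of the week. The sketch below assumes daily totals keyed by datetime.date objects; the function name and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

def mean_usage_by_weekday(daily_totals):
    """Average a {date: usage} mapping over each day of the week."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, usage in daily_totals.items():
        index = day.weekday()  # Monday is 0, Sunday is 6
        sums[index] += usage
        counts[index] += 1
    return {WEEKDAY_NAMES[i]: sums[i] / counts[i] for i in sorted(counts)}

# Hypothetical daily subway entries, for illustration only.
entries = {
    date(2016, 1, 4): 5200000,   # Monday
    date(2016, 1, 9): 2900000,   # Saturday
    date(2016, 1, 11): 5400000,  # Monday
}
print(mean_usage_by_weekday(entries))
```

If weekday means are consistently well above weekend means, that supports the commute-driven cycle seen in the plot.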

Map of Stations

Although there appear to be some clusters that support our hypothesis of a high correlation between a region’s pedestrian traffic and its stations’ traffic, the expected correlations were not found when statistical analyses were performed on this data. However, the 114 pedestrian count regions may have skewed the findings, since they are not dispersed evenly across the areas in which CitiBike and subway stations exist. A helpful further investigation would be to find the correlation between CitiBike stations and subway stations that are near each other, rather than near the same pre-defined regions.

Fig. 5 - Map of Stations.


Most of the expected correlations were not found. There was a strong relationship between CitiBike usage and temperature, but not precipitation. There was no meaningful correlation between subway turnstile usage and weather factors. Also, no correlation was found between pedestrian counts and station usage. The lack of correlation seems to point toward variables that were not investigated. The subway time series data shows a very strong weekly cycle, in which usage is much higher on weekdays than on weekends. This suggests that subway usage is driven by social and economic factors such as commuting on working days.

What implications does this investigation have for Boston?

The subway system in New York operates on a much larger scale than the T in Boston. Despite having lower complexity than other transportation systems, the MBTA is still viewed as “a system reaching for expansion even as its core deteriorated” (Boston Globe 2015). I hope these findings motivate Boston city planners to explore whether such expansion is more valuable than expansion of alternative modes of transport such as bikes, taxis, and ridesharing services. Whereas in New York City the subway is a necessity and CitiBikes are dispensable, in Boston alternative modes of transportation are often substituted for the T. Therefore, I suspect that further investment in the T may not be as beneficial as investment in alternative modes of transportation.

Looking Ahead

I hope to build on this project in a variety of ways.

First, this analysis would be stronger if I compared the usage of CitiBike stations and subway stations in the same geographic area. This would require a more specific nearest-neighbors approach. The challenge with this method is that each subway station would have multiple CitiBike stations near it.
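
One possible starting point for this nearest-neighbors comparison is to collect every CitiBike station within a fixed walking radius of a subway station, which handles the one-to-many case directly. The sketch below uses the haversine great-circle distance. The function names, the 300 m radius, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def bikes_near_subway(subway, bike_stations, radius_m=300.0):
    """Return names of bike stations within radius_m of one subway station.

    `subway` is a (lat, lon) pair; `bike_stations` maps name -> (lat, lon).
    """
    lat, lon = subway
    return sorted(
        name
        for name, (b_lat, b_lon) in bike_stations.items()
        if haversine_m(lat, lon, b_lat, b_lon) <= radius_m
    )

# Hypothetical coordinates, for illustration only.
subway_station = (40.7500, -73.9900)
bikes = {
    "Bike A": (40.7505, -73.9900),  # roughly 56 m north
    "Bike B": (40.7600, -73.9900),  # roughly 1.1 km north
}
print(bikes_near_subway(subway_station, bikes))
```

Summing the usage of all CitiBike stations returned for a subway station gives one number per subway station, which can then be compared against that station's entries.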

Next, an interesting optimization problem would be: If we had a budget reduction and had to remove a CitiBike station, which station would be the best choice to remove? To solve this, we could take the following approach:

In addition, we could apply the same analysis we used with subway and CitiBikes to taxis and ridesharing services. The challenge of doing this would be that there aren’t set stations for taxi cabs or Uber cars. However, we could define a taxi/Uber station as a coordinate radius with many pickups.

Also, since we have the start and end coordinates for each CitiBike trip, I’m curious to explore if there are patterns in trip paths. I expect that many of these CitiBike trips are of riders traveling toward subway stations.

Lastly, K-means clustering could be performed to find regional clusters of stations. After assessing the centroids of these clusters, we could name each one based on the region it ends up in. It would be interesting to modify our time series visualization to see station usage filtered by cluster.
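
As a rough illustration of the clustering idea, the sketch below runs a basic K-means (Lloyd's algorithm) on station coordinates in plain Python. It makes two simplifying assumptions that a real analysis would likely revisit: the first k points are used as the initial centroids, and plain Euclidean distance on latitude and longitude is treated as adequate at city scale. The function name and the coordinates are hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Cluster (lat, lon) points into k groups with Lloyd's algorithm.

    Returns (centroids, labels), where labels[i] is the cluster index
    assigned to points[i].
    """
    # Simplistic initialization (an assumption): the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# Hypothetical station coordinates, for illustration only.
stations = [(40.70, -74.01), (40.80, -73.95), (40.71, -74.00), (40.81, -73.94)]
centroids, labels = kmeans(stations, k=2)
print(centroids)
print(labels)
```

The labels could then serve as the filter key for the time series visualization, with each centroid named after the neighborhood it falls in.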