
CS 4641 Team 3 Final Report

Final Report Video

https://vimeo.com/582289598/22045a2130

Introduction

The WHO identifies physical inactivity as a top risk factor for global mortality, leading to 3 million deaths annually. Additionally, vehicle emissions are the largest source of air pollutants, which contribute to morbidity and mortality for drivers and commuters alike. Biking addresses both issues: it provides physical activity and reduces our carbon footprint. Mobile technology has allowed bike sharing to become increasingly popular in urban areas. Since the introduction of bike share systems (BSS) to the U.S. in 2010, the number of bikes in BSS across the country has increased by 2,500%. For example, Capital Bikeshare in the greater Washington, D.C. area grew from 400 to 4,300 bikes serving five jurisdictions. BSS provides affordable, flexible, and environmentally friendly transportation while encouraging users to exercise and have fun.

Problem Definition

Given the popularity of BSSs, it is increasingly important to analyze how people use bike-share systems. Understanding the service’s trends and demand can help inform bike station management to meet customer needs and guarantee accessibility.

Our goal is to analyze Capital Bikeshare’s Washington D.C. data to understand what factors influence the use of BSS bikes. Based on this data and different supervised and unsupervised machine learning models, we aim to forecast and predict bike usage and demand. The knowledge gained from this analysis can provide detailed insight for all stakeholders involved in BSS.

Data Collection

Our data set is composed of features provided and extracted from Capital Bikeshare’s Washington D.C. database. The primary dataset [4] contains the number of bikes used every hour along with information regarding the day of the week, holiday status, and weather conditions. The secondary dataset [5] contains aggregate trip history data at the hour level, including the duration of individual trips. The average trip duration at an hourly resolution was extracted to be compatible with the primary dataset. As a result, the dataset includes 16 features:

Table 0: Dataset Features Table

| Category | Feature | Values | Description |
|----------|---------|--------|-------------|
| Temporal | Year | 0 = 2011, 1 = 2012 | |
| Temporal | Month | 1 to 12 | |
| Temporal | Date | | Date of the bike trip. |
| Temporal | Hour | 0 to 23 | |
| Temporal | Season | 1 = winter, 2 = spring, 3 = summer, 4 = fall | |
| Temporal | Holiday | 0 = not holiday, 1 = holiday | Indicates whether the day is a holiday. Data taken from the Department of Human Resources. |
| Temporal | Weekday | 0 = Sunday, 1 = Monday, ..., 6 = Saturday | Day of the week. |
| Temporal | Working Day | 0 = weekend or holiday, 1 = otherwise | Indicates whether the day is a working day. |
| Weather | Temperature | [0, 1] | Normalized temperature in Celsius. |
| Weather | A-Temperature | [0, 1] | Normalized feeling temperature in Celsius. |
| Weather | Humidity | [0, 1] | Normalized humidity. |
| Weather | Windspeed | [0, 1] | Normalized wind speed. |
| Weather | Weather Situation | 1 = clear or few clouds, 2 = mist and few clouds, 3 = light rain & scattered clouds, 4 = heavy rain or thunderstorm | |
| Trip features | Avg Dur (min) | minutes | Average trip duration in minutes. |

These features will be used to estimate bike usage by predicting the number of active bike users at a particular hour.

Methods

Cleaning

To clean the data, missing and duplicate records were filtered out. Continuous features were standardized, and categorical features were converted to binary indicator variables via one-hot encoding.
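As an illustration, a minimal cleaning sketch in Python (pandas and scikit-learn) might look like the following; the file name and column names, including avg_dur for the extracted trip duration, are assumptions for illustration rather than the exact names used in our notebooks.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the merged hourly dataset (file and column names are assumptions).
df = pd.read_csv("hour.csv")

# Filter out missing and duplicate records.
df = df.dropna().drop_duplicates()

# Standardize the continuous features.
continuous = ["temp", "atemp", "hum", "windspeed", "avg_dur"]
df[continuous] = StandardScaler().fit_transform(df[continuous])

# One-hot encode the categorical features into binary indicator columns.
df = pd.get_dummies(df, columns=["season", "weathersit", "weekday"], drop_first=True)
```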

Pearson’s Correlation & Feature Selection

For feature selection (reducing the number of input variables), we use Pearson’s correlation. Since linear regression serves as our base model, the Pearson correlation coefficient is a natural criterion for determining which features are important.

Visualization

Pearson Correlation matrix:

Figure 0: Pearson Correlation Matrix

image

In a linear regression model, the independent variables should not be strongly correlated with one another (to avoid multicollinearity). From the Pearson correlation matrix above, we can see that the number of casual bike riders (casual), registered bike riders (registered), and total bike riders (cnt) provide redundant information, as shown by their very high correlation values. As a result, we can reduce the number of dimensions by dropping the casual and registered features. Additionally, day of the year, season, and month are all very highly correlated with each other. Of these three features, we keep season, since it is more highly correlated with the target variable (count of bike riders) than the other two. Finally, the normalized temperature (temp) and normalized feeling temperature (atemp) are highly correlated with one another; we drop atemp due to its more subjective measurement (compared to temp).
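A brief sketch of this filter step, assuming the raw hourly data uses the UCI column names (dteday, mnth, atemp, casual, registered, cnt); the exact code we ran may differ.

```python
import pandas as pd

# Pearson correlation on the raw hourly data (column names assumed).
raw = pd.read_csv("hour.csv")
corr = raw.select_dtypes("number").corr(method="pearson")
print(corr["cnt"].sort_values(ascending=False))  # each column's correlation with the target

# Keep the reduced feature set described below, dropping casual, registered,
# date/month, and atemp as discussed above.
kept = ["season", "yr", "hr", "holiday", "weekday", "workingday",
        "weathersit", "temp", "hum", "windspeed", "cnt"]
selected = raw[kept]
```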

Although we originally started with 16 features, the filter method (using the Pearson correlation) reduces the data to the following 11 columns: season, yr, hr, holiday, weekday, workingday, weathersit, temp, hum, windspeed, and cnt (the target variable). The visualizations below show the relationship between these features and cnt, the number of users in a given hour. Figure 1 provides insight into the time of day users are most active; the number of users per hour spikes around 8am and 5pm, which are typical work commute times. Figures 2, 3, and 4 reflect user count’s positive relationship with temperature, negative relationship with humidity, and weaker relationship with wind speed, respectively, as reflected in the correlation matrix above. Figure 5 shows the seasonality of the number of users: user count per hour is higher during the spring and summer months. Finally, as seen in Figure 6, as the weather situation value increases (i.e., conditions worsen), the number of users decreases, which is consistent with the correlation coefficient in the matrix above.

image

Model & Metric Selection

Linear regression analysis is often used to predict a dependent variable from independent variables. In our case, we have multiple independent variables, which led us to use multiple linear regression as a way of forecasting bikeshare usage. Mean squared error (MSE) is the average of the squared errors and indicates how closely the regression line fits our data; a lower MSE means a more accurate fit, which is the main objective of using linear regression. We noticed linear correlations in our data, as seen in the figures in the previous section, and therefore chose linear regression as our first means of analyzing and predicting bikeshare usage. Our linear regression produced a high MSE value, so we explored gradient boosting regression, which combines multiple simple models, also known as weak learners, into a single composite model. Decision trees serve as the weak learners in gradient boosting, and trees are added one at a time while the existing trees in the model remain unchanged. As more simple models are combined, the final model becomes a stronger predictor and further minimizes the loss, which here is the mean squared error.
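For illustration, a minimal sketch of this baseline comparison with scikit-learn; it assumes the 70:30 train/test split described under Implementation (X_train, X_test, y_train, y_test), and uses default gradient boosting settings rather than our exact configuration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Baseline: multiple linear regression on all selected features.
lin = LinearRegression().fit(X_train, y_train)

# Gradient boosting: shallow decision trees added one at a time as weak learners.
gbr = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

print("Linear regression MSE:", mean_squared_error(y_test, lin.predict(X_test)))
print("Gradient boosting MSE:", mean_squared_error(y_test, gbr.predict(X_test)))
```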

Additionally, we implemented random forest regression because it is known to work well with large datasets; it builds multiple decision trees and merges their outputs to obtain better predictions. Because random forest regression searches for the best feature within a random subset of features at each split, the model ends up being more accurate than a single decision tree. Its hyperparameters include the number of decision trees, the minimum number of samples per leaf node, and the maximum tree depth. Increasing the number of trees generally improves performance but is computationally costly. The minimum samples per leaf defines how many samples must remain in each child node for a split to be allowed; increasing it can improve generalization, but only up to a point before the model underfits. Conversely, greatly increasing the depth leads to overfitting and decreases the random forest’s performance on the test data.

We also implemented a neural network (NN) as another supervised learning method. Like the random forest, NNs are known to work well with large datasets and can outperform linear regression, especially on nonlinear datasets. The hyperparameters of an NN include the hidden layer size, learning rate, activation function, maximum number of iterations, and momentum. We started from the default Multi-Layer Perceptron Regressor in sklearn, which uses a single hidden layer of 100 neurons, a constant learning rate, the ReLU activation function, and a maximum of 200 iterations, and then changed some values to get a better-fitting model.

Finally, we incorporated K-Means Clustering as our unsupervised learning method. K-Means partitions our data into k clusters, where the hyperparameter k can be found through the use of the elbow method. The elbow method also serves as a validation step, so we know that we are choosing the correct number of clusters.

In order to compare the performance of these models, we analyze MSE and adjusted R² values. Adjusted R² measures the degree to which the input variables explain the variation in the output while accounting for variables that do not improve the model.
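For reference, adjusted R² follows the standard definition, for n samples and p predictors:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```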

Implementation

All models were trained on the cleaned data. For supervised models, the data was split into training and testing sets with a 70:30 split. Prior to formal model implementation, some models were subject to hyperparameter optimization. For supervised models, we optimized hyperparameters based on MSE; for the unsupervised model, we optimized the hyperparameter based on the sum of squared errors (SSE, also called inertia).

Each finalized supervised model design used K-Fold Cross Validation (5 folds) and test-set prediction to assess model bias. Adjusted R², MSE, and “Predicted vs Actual” plots were used to measure and compare prediction performance.
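A sketch of this evaluation loop for any one of the regressors above (denoted model here; variable names are illustrative, not our exact code):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 as defined in the Methods section."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# 5-fold cross-validation MSE on the training set.
cv_mse = -cross_val_score(model, X_train, y_train, cv=5, scoring="neg_mean_squared_error")

# Held-out prediction performance.
y_pred = model.fit(X_train, y_train).predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
test_adj_r2 = adjusted_r2(r2_score(y_test, y_pred), len(y_test), X_test.shape[1])
```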

Linear Regression

Linear regression did not require optimization.

Gradient Boosting Regression

Default parameters were selected for the gradient boosting regression model.

Table 2: Gradient Boosting Hyperparameters

image

Random Forest

A series of grid searches was used to evaluate the model hyperparameters: number of estimators/trees, minimum samples per leaf node, and maximum tree depth. For each set of parameters, the average K-Fold CV (5 folds) and prediction performances were measured. See Table 3 and Figures 7 - 10.

Figure 7

image

Figure 8

image

Figure 9

image

Figure 10

image

Table 3: Random Forest Hyperparameters

image

For the number of trees, the trend was not smooth; 100 was selected because it is a typical forest size and yielded reliable results without being a computational burden. For max depth, the trend was smooth, indicating that values greater than 16 provided consistently better results. Looking at that range in detail, 17 was selected based on generalizability and on consistently yielding greater prediction performance than validation performance. Minimum leaf samples showed a clear trend, indicating 2 was the best selection based on the elbow method.
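As a sketch, the search could be collapsed into a single scikit-learn GridSearchCV call like the one below; the grid values are illustrative rather than the exact series of searches we ran.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],      # number of trees
    "max_depth": [10, 14, 17, 20],       # maximum tree depth
    "min_samples_leaf": [1, 2, 4],       # minimum samples per leaf node
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,                                # 5-fold cross-validation
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_)               # our final choice: 100 trees, depth 17, min leaf samples 2
```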

Neural Network

Table 4: Neural Network Hyperparameters

image

The Neural Network MLP Regressor was designed to converge and minimize MSE. To do so, the max iteration parameter was increased from the default 200 to 300. Additionally, a grid search was used to evaluate the hidden layer size at intervals of 50; a hidden layer size of 200 resulted in the best performance. Default parameters were kept for the activation function and learning rate.
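A sketch of that search, assuming scikit-learn's MLPRegressor with the defaults noted above; the exact range of layer sizes shown here is illustrative.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

# Hidden layer sizes evaluated at intervals of 50; other parameters left at their defaults
# (ReLU activation, constant learning rate), with max_iter raised to 300 for convergence.
param_grid = {"hidden_layer_sizes": [(50,), (100,), (150,), (200,), (250,)]}
search = GridSearchCV(
    MLPRegressor(max_iter=300, random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_)   # a single hidden layer of 200 neurons performed best
```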

K-Means Clustering

Figure 11: Elbow Plot

image

Because multidimensional clusters are difficult to visualize, normalized temperature and normalized humidity were chosen as the two clustering features (both have relatively strong correlations with the bike count). The elbow method shows that the optimal number of clusters (k, our hyperparameter) is 4.
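A sketch of the clustering step (feature names assume the UCI columns; k = 4 comes from the elbow plot above):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X_clust = df[["temp", "hum"]].to_numpy()   # normalized temperature and humidity

# Elbow method: plot inertia (SSE) over a range of k values.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_clust).inertia_ for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (SSE)")
plt.show()

# Final model with the elbow-selected k = 4.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_clust)
```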

Results and Discussion

Supervised Models

Table 5: Summary of Results per Supervised Model (including average validation score and predictions scores)

image

Figure 12: Summary of Results per Supervised Model. Each row depicts one model’s performance. The first and second columns capture K-Fold Cross Validation (dashed line) and Prediction (solid circle) performance per metric. The third column captures Predicted vs Actual plots along with the target red dashed diagonal line.

image

The goal of the models is to maximize the adjusted R² value while minimizing MSE; these metrics indicate the significance of the correlation and the fit of the models. In all models, the predicted adjusted R² and MSE values lie within, or outperform, the range of those produced by K-Fold CV. Comparing the models, Random Forest had the greatest adjusted R² and the smallest MSE. Figure 12, Column 3 depicts the relationship between the predicted and actual values of the testing dataset. Ideally, the points would fall on a diagonal line, such that the actual and predicted values are equal. By visual inspection, the Random Forest and the Neural Network show a more consistent diagonal trend. Taking these metrics and visuals together, the Random Forest emerged as the superior prediction model for our data.

Unsupervised Model (K-Means Clustering)

Figure 13: K-Means Cluster Plot

image

Table 11: Mean Number of Bikes for all points in each cluster

image

Due to the unlabeled nature of the data in K-Means clustering, it is difficult to predict the exact number of bikes given the temperature and humidity. However, the results above show that it is possible to estimate the relative usage of bikes for a given temperature and humidity. For instance, when the temperature is low and the humidity is high, relatively few bikes are in use; conversely, when the temperature is high and the humidity is low, usage is relatively high. When both temperature and humidity are high, or both are low, usage usually falls somewhere in the middle.

Conclusion

Based on the results, all supervised models were robust with respect to bias: all prediction performances lie within the range of cross-validation performances.

Feature standardization and feature selection improved model performance compared to earlier attempts with linear regression and random forest regression. The nonlinear relationship between the features and the number of bike users likely explains why the nonlinear models outperform the linear model. Ultimately, the formally optimized models, random forest regression and the neural network, achieved the best MSE and R² among the supervised models.

Model hyperparameter optimization, performed at greater resolution, had a significant impact on performance. Both random forest regression and the neural network achieved an R² greater than 92% and an MSE less than 0.08 while completing training and prediction in under 16 seconds. Further analysis and optimization can be done on the other hyperparameters of these models. Additionally, testing on new data from future or past years could further compare performance, generalizability, and bias.

The unsupervised K-Means clustering model is capable of making broad predictions on the relative usage of bikes based on temperature and humidity. However, it does not compare to regression models predicting the exact number of bikes at a given time. Models such as the Random Forest Regression would likely be better suited for predicting the number of bikes, but K-Means may offer a fast prediction of the range of users using fewer features.

The goal of this project was to use Washington D.C.’s bikeshare data to forecast bike usage. Based on this study, hourly bike usage can be predicted with a supervised Random Forest Regression model using 10 of the 15 original features from the dataset. More research is necessary to evaluate the generalizability of the current model and how often it must be retrained to maintain consistent prediction accuracy. Additionally, further experimentation could evaluate performance while minimizing features in an attempt to simplify data collection strategies, simplify model design, and maximize performance.

References:

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243514/

[2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6286277/

[3] https://nacto.org/bike-share-statistics-2016/

[4] https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset#

[5] https://www.kaggle.com/chiragbangera/bike-sharing-data