Introduction

I am going to predict the number of bikes borrowed on a given day at a given station, using the four data sets below.

  • station.csv - Contains data that represents a station where users can pick up or return bikes.
  • status.csv - Data about the number of bikes and docks available for a given station and minute.
  • trip.csv - Data about individual bike trips.
  • weather.csv - Data about the weather on a specific day for a specific station id.

Hypothesis Generation

What could affect people's decision to rent a bike?

  • Number of bikes available
  • Condition of the bikes
  • Availability of bike lanes around the station
  • Daily trend: weekdays as compared to weekends or holidays
  • Hourly trend: office hours, i.e. early-morning and late-evening commutes
  • Rain: demand for bikes should be lower on a rainy day than on a sunny day
  • Temperature: I presume it has a positive correlation with demand
  • Pollution: government / company policies
  • Traffic: heavy traffic may push people toward bikes over other modes of transport

Understanding the Data Set and EDA

I’ve already performed EDA on the data sets; you can access it via this link.

Import Necessary Packages

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from datetime import datetime
from scipy.stats import pearsonr
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from pandas.tseries.offsets import CustomBusinessDay
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, median_absolute_error
from sklearn.preprocessing import LabelEncoder, StandardScaler

Loading Data Sets

df_trip = pd.read_csv("trip.csv")
df_station = pd.read_csv("station.csv")
df_weather = pd.read_csv("weather.csv")

Station Data Set

df_station

|    | id | name                              | lat       | long        | dock_count | city          | installation_date |
|----|----|-----------------------------------|-----------|-------------|------------|---------------|-------------------|
| 0  | 2  | San Jose Diridon Caltrain Station | 37.329732 | -121.901782 | 27         | San Jose      | 8/6/2013          |
| 1  | 3  | San Jose Civic Center             | 37.330698 | -121.888979 | 15         | San Jose      | 8/5/2013          |
| 2  | 4  | Santa Clara at Almaden            | 37.333988 | -121.894902 | 11         | San Jose      | 8/6/2013          |
| 3  | 5  | Adobe on Almaden                  | 37.331415 | -121.893200 | 19         | San Jose      | 8/5/2013          |
| 4  | 6  | San Pedro Square                  | 37.336721 | -121.894074 | 15         | San Jose      | 8/7/2013          |
| ... | ... | ...                             | ...       | ...         | ...        | ...           | ...               |
| 65 | 77 | Market at Sansome                 | 37.789625 | -122.400811 | 27         | San Francisco | 8/25/2013         |
| 66 | 80 | Santa Clara County Civic Center   | 37.352601 | -121.905733 | 15         | San Jose      | 12/31/2013        |
| 67 | 82 | Broadway St at Battery St         | 37.798541 | -122.400862 | 15         | San Francisco | 1/22/2014         |
| 68 | 83 | Mezes Park                        | 37.491269 | -122.236234 | 15         | Redwood City  | 2/20/2014         |
| 69 | 84 | Ryland Park                       | 37.342725 | -121.895617 | 15         | San Jose      | 4/9/2014          |

70 rows × 7 columns

len(df_station.id.unique())
70

There are 70 different stations.

Trip Data Set

# Have a look at the first 5 instances
df_trip.head()

|   | id   | duration | start_date      | start_station_name       | start_station_id | end_date        | end_station_name         | end_station_id | bike_id | subscription_type | zip_code |
|---|------|----------|-----------------|--------------------------|------------------|-----------------|--------------------------|----------------|---------|-------------------|----------|
| 0 | 4576 | 63       | 8/29/2013 14:13 | South Van Ness at Market | 66               | 8/29/2013 14:14 | South Van Ness at Market | 66             | 520     | Subscriber        | 94127    |
| 1 | 4607 | 70       | 8/29/2013 14:42 | San Jose City Hall       | 10               | 8/29/2013 14:43 | San Jose City Hall       | 10             | 661     | Subscriber        | 95138    |
| 2 | 4130 | 71       | 8/29/2013 10:16 | Mountain View City Hall  | 27               | 8/29/2013 10:17 | Mountain View City Hall  | 27             | 48      | Subscriber        | 97214    |
| 3 | 4251 | 77       | 8/29/2013 11:29 | San Jose City Hall       | 10               | 8/29/2013 11:30 | San Jose City Hall       | 10             | 26      | Subscriber        | 95060    |
| 4 | 4299 | 83       | 8/29/2013 12:02 | South Van Ness at Market | 66               | 8/29/2013 12:04 | Market at 10th           | 67             | 319     | Subscriber        | 94103    |
# Number of instances and variables
df_trip.shape
(669959, 11)
# Checking for Missing Values
df_trip.isnull().sum()
id                       0
duration                 0
start_date               0
start_station_name       0
start_station_id         0
end_date                 0
end_station_name         0
end_station_id           0
bike_id                  0
subscription_type        0
zip_code              6619
dtype: int64
#Check dtypes
df_trip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 669959 entries, 0 to 669958
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   id                  669959 non-null  int64 
 1   duration            669959 non-null  int64 
 2   start_date          669959 non-null  object
 3   start_station_name  669959 non-null  object
 4   start_station_id    669959 non-null  int64 
 5   end_date            669959 non-null  object
 6   end_station_name    669959 non-null  object
 7   end_station_id      669959 non-null  int64 
 8   bike_id             669959 non-null  int64 
 9   subscription_type   669959 non-null  object
 10  zip_code            663340 non-null  object
dtypes: int64(5), object(6)
memory usage: 56.2+ MB
#Create a new datetime variable from start_date
df_trip['date'] = pd.to_datetime(df_trip.start_date, format='%m/%d/%Y %H:%M')
# Summary statistics
df_trip.describe()

|       | id            | duration     | start_station_id | end_station_id | bike_id       |
|-------|---------------|--------------|------------------|----------------|---------------|
| count | 669959.000000 | 6.699590e+05 | 669959.000000    | 669959.000000  | 669959.000000 |
| mean  | 460382.009899 | 1.107950e+03 | 57.851876        | 57.837438      | 427.587620    |
| std   | 264584.458487 | 2.225544e+04 | 17.112474        | 17.200142      | 153.450988    |
| min   | 4069.000000   | 6.000000e+01 | 2.000000         | 2.000000       | 9.000000      |
| 25%   | 231082.500000 | 3.440000e+02 | 50.000000        | 50.000000      | 334.000000    |
| 50%   | 459274.000000 | 5.170000e+02 | 62.000000        | 62.000000      | 440.000000    |
| 75%   | 692601.000000 | 7.550000e+02 | 70.000000        | 70.000000      | 546.000000    |
| max   | 913460.000000 | 1.727040e+07 | 84.000000        | 84.000000      | 878.000000    |
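
One thing the summary statistics reveal: the maximum duration is 1.727040e+07 which, assuming duration is recorded in seconds (the Bay Area Bike Share convention), is roughly 200 days — almost certainly an outlier. An optional filter, not part of the original pipeline and with an assumed 24-hour threshold, could be:

# Optional: drop implausibly long trips (threshold is an assumption)
df_trip = df_trip[df_trip.duration < 24 * 60 * 60]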

Weather

df_weather

|       | date       | start_station_id | zip_code | mean_temperature_f | min_temperature_f | mean_humidity | max_humidity | mean_visibility_miles | min_visibility_miles | max_wind_Speed_mph | max_gust_speed_mph | precipitation_inches | events |
|-------|------------|------------------|----------|--------------------|-------------------|---------------|--------------|-----------------------|----------------------|--------------------|--------------------|----------------------|--------|
| 0     | 2013-08-29 | 27               | 94107    | 68.0               | 61.0              | 75.0          | 93.0         | 10.0                  | 10.0                 | 23.0               | 28.0               | 0.0                  | NaN    |
| 1     | 2013-08-29 | 39               | 94107    | 68.0               | 61.0              | 75.0          | 93.0         | 10.0                  | 10.0                 | 23.0               | 28.0               | 0.0                  | NaN    |
| 2     | 2013-08-29 | 41               | 94107    | 68.0               | 61.0              | 75.0          | 93.0         | 10.0                  | 10.0                 | 23.0               | 28.0               | 0.0                  | NaN    |
| 3     | 2013-08-29 | 47               | 94107    | 68.0               | 61.0              | 75.0          | 93.0         | 10.0                  | 10.0                 | 23.0               | 28.0               | 0.0                  | NaN    |
| 4     | 2013-08-29 | 48               | 94107    | 68.0               | 61.0              | 75.0          | 93.0         | 10.0                  | 10.0                 | 23.0               | 28.0               | 0.0                  | NaN    |
| ...   | ...        | ...              | ...      | ...                | ...               | ...           | ...          | ...                   | ...                  | ...                | ...                | ...                  | ...    |
| 52038 | 2015-08-31 | 71               | 94109    | 70.8               | 60.2              | 63.8          | 82.2         | 11.4                  | 9.8                  | 19.0               | 22.5               | 0.0                  | NaN    |
| 52039 | 2015-08-31 | 66               | 94103    | 70.8               | 60.2              | 63.8          | 82.2         | 11.4                  | 9.8                  | 19.0               | 22.5               | 0.0                  | NaN    |
| 52040 | 2015-08-31 | 67               | 94103    | 70.8               | 60.2              | 63.8          | 82.2         | 11.4                  | 9.8                  | 19.0               | 22.5               | 0.0                  | NaN    |
| 52041 | 2015-08-31 | 72               | 94103    | 70.8               | 60.2              | 63.8          | 82.2         | 11.4                  | 9.8                  | 19.0               | 22.5               | 0.0                  | NaN    |
| 52042 | 2015-08-31 | 25               | 95037    | 70.8               | 60.2              | 63.8          | 82.2         | 11.4                  | 9.8                  | 19.0               | 22.5               | 0.0                  | NaN    |

52043 rows × 13 columns

df_weather.events.unique()
array([nan, 'Fog', 'Rain', 'Fog-Rain', 'rain', 'Rain-Thunderstorm'],
      dtype=object)
df_weather.loc[df_weather.events == 'rain', 'events'] = "Rain"
df_weather.loc[df_weather.events.isnull(), 'events'] = "Normal"
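
An equivalent, more compact normalization is a one-liner — a sketch that relies on str.title() mapping each observed label (e.g. 'rain') to its canonical form:

# Title-case the event labels, then mark missing days as 'Normal'
df_weather['events'] = df_weather['events'].str.title().fillna('Normal')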
df_weather.isnull().sum()
date                       0
start_station_id           0
zip_code                   0
mean_temperature_f         0
min_temperature_f          5
mean_humidity              0
max_humidity             105
mean_visibility_miles      0
min_visibility_miles      26
max_wind_Speed_mph         0
max_gust_speed_mph         2
precipitation_inches     252
events                     0
dtype: int64
#Fill each column's missing values with that column's most frequent value (mode)
df_weather = df_weather.apply(lambda x:x.fillna(x.value_counts().index[0]))
df_weather.isnull().sum()
date                     0
start_station_id         0
zip_code                 0
mean_temperature_f       0
min_temperature_f        0
mean_humidity            0
max_humidity             0
mean_visibility_miles    0
min_visibility_miles     0
max_wind_Speed_mph       0
max_gust_speed_mph       0
precipitation_inches     0
events                   0
dtype: int64
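
Filling numeric columns with their mode is a blunt choice; a possible refinement (my suggestion, not what this notebook ran) is the median for numeric columns, keeping the mode only for categoricals:

# Median for numeric columns, mode for the rest (e.g. events)
num_weather = df_weather.select_dtypes(include='number').columns
df_weather[num_weather] = df_weather[num_weather].fillna(df_weather[num_weather].median())
df_weather = df_weather.apply(lambda x: x.fillna(x.value_counts().index[0]))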

Data Preparation

  • Prepare the data set for prediction based on the df_trip dataframe
  • Drop unnecessary variables
  • The data set should contain the date and the total number of daily trips from each station
  • Day, Month, Year, Weekday, Holiday, and Season variables need to be added
  • Station data should be merged
  • Weather data should be merged
#Drop unnecessary variables
df = df_trip.drop(['id','duration','start_date','start_station_name','end_date',
                   'end_station_name','end_station_id','bike_id','subscription_type',
                   'zip_code'], axis=1)
#Number of trips variable
df['date'] = df.date.dt.date
df = df.value_counts().to_frame('no_of_trips').reset_index()

#Sort by date
df = df.sort_values(by=['date'])
df.reset_index(drop=True, inplace=True)

df.head()

|   | start_station_id | date       | no_of_trips |
|---|------------------|------------|-------------|
| 0 | 22               | 2013-08-29 | 5           |
| 1 | 13               | 2013-08-29 | 4           |
| 2 | 30               | 2013-08-29 | 1           |
| 3 | 49               | 2013-08-29 | 18          |
| 4 | 55               | 2013-08-29 | 6           |
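
DataFrame.value_counts() only exists from pandas 1.1 onward; an equivalent groupby produces the same per-station daily counts — a sketch:

# Equivalent aggregation for older pandas versions
df = (df.groupby(['start_station_id', 'date'])
        .size()
        .to_frame('no_of_trips')
        .reset_index()
        .sort_values(by=['date'])
        .reset_index(drop=True))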
# We will make daily predictions, so the hour is not important
# Derive year/month/day from df's own date column; using df_trip here would
# misalign rows, because df has been aggregated and re-sorted
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

# Variables added
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43270 entries, 0 to 43269
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   start_station_id  43270 non-null  int64 
 1   date              43270 non-null  object
 2   no_of_trips       43270 non-null  int64 
 3   year              43270 non-null  int64 
 4   month             43270 non-null  int64 
 5   day               43270 non-null  int64 
dtypes: int64(5), object(1)
memory usage: 2.0+ MB
#Day name of the week
df['day_name'] = df['date'].dt.day_name()
#Weekend flag: dayofweek 5 and 6 are Saturday and Sunday
df['weekend'] = (df['date'].dt.dayofweek > 4).astype(object)
#Season variable: map month to season number (Dec-Feb=1, Mar-May=2, Jun-Aug=3, Sep-Nov=4)
df['season'] = (df['date'].dt.month % 12 + 3) // 3

seasons = {
             1: 'Winter',
             2: 'Spring',
             3: 'Summer',
             4: 'Autumn'
}

df['season'] = df['season'].map(seasons)
#Find holidays
cal = calendar()
dr = pd.date_range(start='2013-07-01', end='2016-07-31')
holidays = cal.holidays(start=dr.min(), end=dr.max())
#Holiday variable (df['date'] is already normalized to midnight, so isin works directly)
df['holiday'] = df['date'].isin(holidays).astype(object)
# align dtypes before merge
df_station['installation_date'] = pd.to_datetime(df_station['installation_date'])
df_weather['date'] = pd.to_datetime(df_weather['date'])
#Total number of docks installed system-wide on or before each date
no_of_docks = []
for day in df.date:
    no_of_docks.append(sum(df_station[df_station.installation_date <= day].dock_count))

df['no_of_docks'] = no_of_docks
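
The loop above re-scans df_station once per row (~43k times). A vectorized equivalent — a sketch assuming every trip date falls on or after the earliest installation date, which holds for this data — sorts the stations once and looks up a cumulative sum:

# Cumulative dock capacity, indexed by installation date
capacity = (df_station.sort_values('installation_date')
                      .set_index('installation_date')['dock_count']
                      .cumsum())
# For each date, find the last installation on or before it
idx = capacity.index.searchsorted(df['date'].to_numpy(), side='right') - 1
df['no_of_docks'] = capacity.to_numpy()[idx]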
#merge trips and weather data
df = pd.merge(df, df_weather, on=['start_station_id','date'], how='left')

df.head()

|   | start_station_id | date       | no_of_trips | year | month | day | day_name | weekend | season | holiday | ... | mean_temperature_f | min_temperature_f | mean_humidity | max_humidity | mean_visibility_miles | min_visibility_miles | max_wind_Speed_mph | max_gust_speed_mph | precipitation_inches | events |
|---|------------------|------------|-------------|------|-------|-----|----------|---------|--------|---------|-----|--------------------|-------------------|---------------|--------------|-----------------------|----------------------|--------------------|--------------------|----------------------|--------|
| 0 | 22               | 2013-08-29 | 5           | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 71.0               | 62.0              | 79.0          | 94.0         | 10.0                  | 10.0                 | 14.0               | 17.0               | 0.0                  | Normal |
| 1 | 13               | 2013-08-29 | 4           | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 70.4               | 62.8              | 73.2          | 88.8         | 10.0                  | 10.0                 | 17.8               | 21.6               | 0.0                  | Normal |
| 2 | 30               | 2013-08-29 | 1           | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 70.4               | 62.8              | 73.2          | 88.8         | 10.0                  | 10.0                 | 17.8               | 21.6               | 0.0                  | Normal |
| 3 | 49               | 2013-08-29 | 18          | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 70.4               | 62.8              | 73.2          | 88.8         | 10.0                  | 10.0                 | 17.8               | 21.6               | 0.0                  | Normal |
| 4 | 55               | 2013-08-29 | 6           | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 70.4               | 62.8              | 73.2          | 88.8         | 10.0                  | 10.0                 | 17.8               | 21.6               | 0.0                  | Normal |

5 rows × 22 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43270 entries, 0 to 43269
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   start_station_id       43270 non-null  int64         
 1   date                   43270 non-null  datetime64[ns]
 2   no_of_trips            43270 non-null  int64         
 3   year                   43270 non-null  int64         
 4   month                  43270 non-null  int64         
 5   day                    43270 non-null  int64         
 6   day_name               43270 non-null  object        
 7   weekend                43270 non-null  object        
 8   season                 43270 non-null  object        
 9   holiday                43270 non-null  object        
 10  no_of_docks            43270 non-null  int64         
 11  zip_code               43270 non-null  int64         
 12  mean_temperature_f     43270 non-null  float64       
 13  min_temperature_f      43270 non-null  float64       
 14  mean_humidity          43270 non-null  float64       
 15  max_humidity           43270 non-null  float64       
 16  mean_visibility_miles  43270 non-null  float64       
 17  min_visibility_miles   43270 non-null  float64       
 18  max_wind_Speed_mph     43270 non-null  float64       
 19  max_gust_speed_mph     43270 non-null  float64       
 20  precipitation_inches   43270 non-null  float64       
 21  events                 43270 non-null  object        
dtypes: datetime64[ns](1), float64(9), int64(7), object(5)
memory usage: 7.6+ MB

Feature Engineering

Standard scaler for continuous variables

num_cols = ['no_of_docks', 'mean_temperature_f', 'min_temperature_f',
            'mean_humidity', 'max_humidity', 'mean_visibility_miles',
            'min_visibility_miles', 'max_wind_Speed_mph', 'max_gust_speed_mph']

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols].to_numpy())
df

|       | start_station_id | date       | no_of_trips | year | month | day | day_name | weekend | season | holiday | ... | mean_temperature_f | min_temperature_f | mean_humidity | max_humidity | mean_visibility_miles | min_visibility_miles | max_wind_Speed_mph | max_gust_speed_mph | precipitation_inches | events |
|-------|------------------|------------|-------------|------|-------|-----|----------|---------|--------|---------|-----|--------------------|-------------------|---------------|--------------|-----------------------|----------------------|--------------------|--------------------|----------------------|--------|
| 0     | 22               | 2013-08-29 | 5           | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 1.431242           | 1.424143          | 1.100845      | 1.025795     | 0.214677              | 0.695694             | -0.568521          | -0.581315          | 0.0                  | Normal |
| 1     | 13               | 2013-08-29 | 4           | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 1.342652           | 1.539740          | 0.542780      | 0.393064     | 0.214677              | 0.695694             | 0.117123           | 0.001171           | 0.0                  | Normal |
| 2     | 30               | 2013-08-29 | 1           | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 1.342652           | 1.539740          | 0.542780      | 0.393064     | 0.214677              | 0.695694             | 0.117123           | 0.001171           | 0.0                  | Normal |
| 3     | 49               | 2013-08-29 | 18          | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 1.342652           | 1.539740          | 0.542780      | 0.393064     | 0.214677              | 0.695694             | 0.117123           | 0.001171           | 0.0                  | Normal |
| 4     | 55               | 2013-08-29 | 6           | 2013 | 8     | 29  | Thursday | False   | Summer | False   | ... | 1.342652           | 1.539740          | 0.542780      | 0.393064     | 0.214677              | 0.695694             | 0.117123           | 0.001171           | 0.0                  | Normal |
| ...   | ...              | ...        | ...         | ...  | ...   | ... | ...      | ...     | ...    | ...     | ... | ...                | ...               | ...           | ...          | ...                   | ...                  | ...                | ...                | ...                  | ...    |
| 43265 | 9                | 2015-08-31 | 2           | 2015 | 8     | 31  | Monday   | False   | Summer | False   | ... | 1.401712           | 1.164050          | -0.361670     | -0.410016    | 1.472286              | 0.619260             | 0.333642           | 0.115136           | 0.0                  | Normal |
| 43266 | 7                | 2015-08-31 | 2           | 2015 | 8     | 31  | Monday   | False   | Summer | False   | ... | 1.401712           | 1.164050          | -0.361670     | -0.410016    | 1.472286              | 0.619260             | 0.333642           | 0.115136           | 0.0                  | Normal |
| 43267 | 22               | 2015-08-31 | 5           | 2015 | 8     | 31  | Monday   | False   | Summer | False   | ... | 1.431242           | 1.424143          | -0.053772     | -1.042746    | 0.214677              | 0.695694             | 0.153210           | -0.201433          | 0.0                  | Normal |
| 43268 | 77               | 2015-08-31 | 49          | 2015 | 8     | 31  | Monday   | False   | Summer | False   | ... | 1.135943           | 1.135151          | -0.053772     | -0.190994    | 0.214677              | 0.313523             | 0.153210           | -0.074805          | 0.0                  | Normal |
| 43269 | 42               | 2015-08-31 | 30          | 2015 | 8     | 31  | Monday   | False   | Summer | False   | ... | 1.401712           | 1.164050          | -0.361670     | -0.410016    | 1.472286              | 0.619260             | 0.333642           | 0.115136           | 0.0                  | Normal |

43270 rows × 22 columns
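
Fitting the scaler on the entire data set lets test-set statistics leak into training. A safer pattern — a sketch that assumes the X_train/X_test split created under Model Building below — fits on the training rows only:

# Fit on training data only, then apply the same transform to the test data
scaler = StandardScaler()
X_train.loc[:, num_cols] = scaler.fit_transform(X_train[num_cols])
X_test.loc[:, num_cols] = scaler.transform(X_test[num_cols])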

One-hot encoding for categorical variables


# Make dummy variables for the categorical columns.
data_dummy = pd.get_dummies(df[['day_name','weekend','season', 'holiday', 'events']])

# Output the first five rows.
data_dummy.head()

|   | day_name_Friday | day_name_Monday | day_name_Saturday | day_name_Sunday | day_name_Thursday | day_name_Tuesday | day_name_Wednesday | weekend_False | weekend_True | season_Autumn | season_Spring | season_Summer | season_Winter | holiday_False | holiday_True | events_Fog | events_Fog-Rain | events_Normal | events_Rain | events_Rain-Thunderstorm |
|---|-----------------|-----------------|-------------------|-----------------|-------------------|------------------|--------------------|---------------|--------------|---------------|---------------|---------------|---------------|---------------|--------------|------------|-----------------|---------------|-------------|--------------------------|
| 0 | 0               | 0               | 0                 | 0               | 1                 | 0                | 0                  | 1             | 0            | 0             | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |
| 1 | 0               | 0               | 0                 | 0               | 1                 | 0                | 0                  | 1             | 0            | 0             | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |
| 2 | 0               | 0               | 0                 | 0               | 1                 | 0                | 0                  | 1             | 0            | 0             | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |
| 3 | 0               | 0               | 0                 | 0               | 1                 | 0                | 0                  | 1             | 0            | 0             | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |
| 4 | 0               | 0               | 0                 | 0               | 1                 | 0                | 0                  | 1             | 0            | 0             | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |
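
Several dummy pairs above are exact complements (weekend_False/weekend_True, holiday_False/holiday_True), so they carry redundant information. An optional variant, not used in the original run, keeps n-1 levels per category:

# drop_first=True drops one redundant level per categorical variable
data_dummy = pd.get_dummies(df[['day_name', 'weekend', 'season', 'holiday', 'events']],
                            drop_first=True)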
# Merging (concatenate) original data frame with 'dummy' dataframe.
df = pd.concat([df,data_dummy], axis=1)
# Dropping attributes for which we made dummy variables and zip_code
df = df.drop(['day_name','weekend','season', 'holiday', 'events','zip_code'], axis=1)
df.head()

|   | start_station_id | date       | no_of_trips | year | month | day | no_of_docks | mean_temperature_f | min_temperature_f | mean_humidity | ... | season_Spring | season_Summer | season_Winter | holiday_False | holiday_True | events_Fog | events_Fog-Rain | events_Normal | events_Rain | events_Rain-Thunderstorm |
|---|------------------|------------|-------------|------|-------|-----|-------------|--------------------|-------------------|---------------|-----|---------------|---------------|---------------|---------------|--------------|------------|-----------------|---------------|-------------|--------------------------|
| 0 | 22               | 2013-08-29 | 5           | 2013 | 8     | 29  | -2.172486   | 1.431242           | 1.424143          | 1.100845      | ... | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |
| 1 | 13               | 2013-08-29 | 4           | 2013 | 8     | 29  | -2.172486   | 1.342652           | 1.539740          | 0.542780      | ... | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |
| 2 | 30               | 2013-08-29 | 1           | 2013 | 8     | 29  | -2.172486   | 1.342652           | 1.539740          | 0.542780      | ... | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |
| 3 | 49               | 2013-08-29 | 18          | 2013 | 8     | 29  | -2.172486   | 1.342652           | 1.539740          | 0.542780      | ... | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |
| 4 | 55               | 2013-08-29 | 6           | 2013 | 8     | 29  | -2.172486   | 1.342652           | 1.539740          | 0.542780      | ... | 0             | 1             | 0             | 1             | 0            | 0          | 0               | 1             | 0           | 0                        |

5 rows × 36 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43270 entries, 0 to 43269
Data columns (total 36 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   start_station_id          43270 non-null  int64         
 1   date                      43270 non-null  datetime64[ns]
 2   no_of_trips               43270 non-null  int64         
 3   year                      43270 non-null  int64         
 4   month                     43270 non-null  int64         
 5   day                       43270 non-null  int64         
 6   no_of_docks               43270 non-null  float64       
 7   mean_temperature_f        43270 non-null  float64       
 8   min_temperature_f         43270 non-null  float64       
 9   mean_humidity             43270 non-null  float64       
 10  max_humidity              43270 non-null  float64       
 11  mean_visibility_miles     43270 non-null  float64       
 12  min_visibility_miles      43270 non-null  float64       
 13  max_wind_Speed_mph        43270 non-null  float64       
 14  max_gust_speed_mph        43270 non-null  float64       
 15  precipitation_inches      43270 non-null  float64       
 16  day_name_Friday           43270 non-null  uint8         
 17  day_name_Monday           43270 non-null  uint8         
 18  day_name_Saturday         43270 non-null  uint8         
 19  day_name_Sunday           43270 non-null  uint8         
 20  day_name_Thursday         43270 non-null  uint8         
 21  day_name_Tuesday          43270 non-null  uint8         
 22  day_name_Wednesday        43270 non-null  uint8         
 23  weekend_False             43270 non-null  uint8         
 24  weekend_True              43270 non-null  uint8         
 25  season_Autumn             43270 non-null  uint8         
 26  season_Spring             43270 non-null  uint8         
 27  season_Summer             43270 non-null  uint8         
 28  season_Winter             43270 non-null  uint8         
 29  holiday_False             43270 non-null  uint8         
 30  holiday_True              43270 non-null  uint8         
 31  events_Fog                43270 non-null  uint8         
 32  events_Fog-Rain           43270 non-null  uint8         
 33  events_Normal             43270 non-null  uint8         
 34  events_Rain               43270 non-null  uint8         
 35  events_Rain-Thunderstorm  43270 non-null  uint8         
dtypes: datetime64[ns](1), float64(10), int64(5), uint8(20)
memory usage: 6.4 MB

Model Building

labels = df.no_of_trips
train = df.drop(['no_of_trips', 'date'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(train, labels, test_size=0.2, random_state=2)
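
One caveat: a random split on time-indexed data lets the model train on days interleaved with the test days. For a demand-forecasting use case, a chronological split is arguably the fairer test — a sketch, where the cutoff date is an arbitrary assumption:

# Train on the past, evaluate on the future (cutoff is an assumption)
mask = df['date'] < '2015-03-01'
X_train, X_test = train[mask], train[~mask]
y_train, y_test = labels[mask], labels[~mask]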

Random Forest Regressor

rfr = RandomForestRegressor(n_estimators = 55,
                            min_samples_leaf = 3,
                            random_state = 2)
rfr = rfr.fit(X_train, y_train)

Ada Boost Regressor

abr = AdaBoostRegressor(n_estimators = 100,
                        learning_rate = 0.1,
                        loss = 'linear',
                        random_state = 2)

Gradient Boosting Regressor

gbr = GradientBoostingRegressor(learning_rate = 0.12,
                                n_estimators = 55,
                                max_depth = 5,
                                min_samples_leaf = 1,
                                random_state = 2)

Scoring the Models

scoring = ['r2', 'neg_mean_squared_error', 'neg_mean_absolute_error']
models = [rfr, abr, gbr]

for model in models:
    for metric in scoring:
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring=metric)
        if metric == 'r2':
            print(model, metric, ': ', scores.mean())
        elif metric == 'neg_mean_squared_error':
            # RMSE = square root of the (sign-flipped) mean of the fold MSEs
            rmse = math.sqrt(-scores.mean())
            print(model, 'RMSE: ', "%0.2f" % rmse)
        elif metric == 'neg_mean_absolute_error':
            mae = -scores.mean()
            print(model, metric, ": %0.2f (+/- %0.2f)" % (mae, scores.std() * 2))
RandomForestRegressor(min_samples_leaf=3, n_estimators=55, random_state=2) r2 :  0.9059889252171234
RandomForestRegressor(min_samples_leaf=3, n_estimators=55, random_state=2) RMSE:  5.43
RandomForestRegressor(min_samples_leaf=3, n_estimators=55, random_state=2) neg_mean_absolute_error : 3.39 (+/- 0.11)
AdaBoostRegressor(learning_rate=0.1, n_estimators=100, random_state=2) r2 :  0.6193340252543343
AdaBoostRegressor(learning_rate=0.1, n_estimators=100, random_state=2) RMSE:  10.93
AdaBoostRegressor(learning_rate=0.1, n_estimators=100, random_state=2) neg_mean_absolute_error : 7.74 (+/- 0.23)
GradientBoostingRegressor(learning_rate=0.12, max_depth=5, n_estimators=55,
                          random_state=2) r2 :  0.8729724087591497
GradientBoostingRegressor(learning_rate=0.12, max_depth=5, n_estimators=55,
                          random_state=2) RMSE:  6.32
GradientBoostingRegressor(learning_rate=0.12, max_depth=5, n_estimators=55,
                          random_state=2) neg_mean_absolute_error : 4.12 (+/- 0.08)
#Fit and predict with the best model
rfr = rfr.fit(X_train, y_train)
rfr_preds = rfr.predict(X_test)
#Visualize
y_test.reset_index(drop = True, inplace = True)
fs = 30
plt.figure(figsize=(50,20))
plt.plot(rfr_preds)
plt.plot(y_test)
plt.legend(['Prediction', 'Actual'])
plt.ylabel("Number of Trips", fontsize = fs)
plt.xlabel("Test Sample Index", fontsize = fs)
plt.title("Predicted Values vs Actual Values", fontsize = fs)
plt.show()

[Plot: Predicted Values vs Actual Values — predicted and actual numbers of trips across the test set]

#Features by importance
def plot_importances(model, model_name):
    importances = model.feature_importances_
    # Std of importances across the individual trees in the ensemble
    std = np.std([tree.feature_importances_ for tree in model.estimators_],
                 axis=0)
    indices = np.argsort(importances)[::-1]

    # Plot the feature importances of the forest
    plt.figure(figsize=(12, 5))
    plt.title("Feature importances of " + model_name)
    plt.bar(range(X_train.shape[1]), importances[indices],
            color="r", yerr=std[indices], align="center")
    plt.xticks(range(X_train.shape[1]), indices)
    plt.xlim([-1, X_train.shape[1]])
    plt.show()

print("Feature ranking:")

i = 0
for feature in X_train:
    print (i, feature)
    i += 1
    
plot_importances(rfr, "RandomForestRegressor")
Feature ranking:
0 start_station_id
1 year
2 month
3 day
4 no_of_docks
5 mean_temperature_f
6 min_temperature_f
7 mean_humidity
8 max_humidity
9 mean_visibility_miles
10 min_visibility_miles
11 max_wind_Speed_mph
12 max_gust_speed_mph
13 precipitation_inches
14 day_name_Friday
15 day_name_Monday
16 day_name_Saturday
17 day_name_Sunday
18 day_name_Thursday
19 day_name_Tuesday
20 day_name_Wednesday
21 weekend_False
22 weekend_True
23 season_Autumn
24 season_Spring
25 season_Summer
26 season_Winter
27 holiday_False
28 holiday_True
29 events_Fog
30 events_Fog-Rain
31 events_Normal
32 events_Rain
33 events_Rain-Thunderstorm

[Bar chart: feature importances of the RandomForestRegressor, with features indexed as in the ranking above]

Question: Can you predict the demand for bikes at a given station a day in advance?

If yes, what are the main factors contributing to the predictions? Would this model be actionable for the business? The model should be basic; the only requirement is that it be interpretable.

Answer: Yes.

The main factors are the station, weekday/weekend information, the number of docks, mean temperature, and day of the month. With some improvement, this model can be actionable for the business: bike and dock counts can be increased at stations where the model predicts higher demand, while stations with consistently low predicted demand can be considered for closure.

Question: What are 3 additional pieces of information you would recommend to add to these datasets to gain more insight into the business and usage of this product?

Answer:

  • Condition of the bikes
  • Population in the region of the station
  • Traffic data
  • Availability of bike lanes in the region of the station
  • Pollution
  • Average bike prices in the region