Bike Share Demand Prediction in San Francisco
Introduction
I am going to predict the number of borrowed bikes on a given day at a given station, using the 4 data sets below.
- station.csv - Contains data that represents a station where users can pick up or return bikes.
- status.csv - Data about the number of bikes and docks available for a given station and minute.
- trip.csv - Data about individual bike trips.
- weather.csv - Data about the weather on a specific day for a specific station id.
Hypothesis Generation
What could affect people's decision to rent a bike?
- Number of bikes
- Condition of the bikes
- Availability of bike lanes around the station
- Daily trend: weekdays as compared to weekends or holidays
- Hourly trend: office timings, i.e. early-morning and late-evening peaks
- Rain: demand for bikes will be lower on a rainy day than on a sunny day
- Temperature: I presume it has a positive correlation with demand
- Pollution: government / company policies
- Traffic: heavier traffic may push people toward bikes over other modes of transport
Understanding the Data Set and EDA
I’ve already performed EDA on the data sets, you can access it via this link.
Import Necessary Packages
import pandas as pd
import numpy as np
from scipy.stats.stats import pearsonr
from pandas.tseries.offsets import CustomBusinessDay
from datetime import datetime
from sklearn.model_selection import train_test_split
import math
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, median_absolute_error
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
Loading Data Sets
df_trip = pd.read_csv("trip.csv")
df_station = pd.read_csv("station.csv")
df_weather = pd.read_csv("weather.csv")
Station Data Set
df_station
id | name | lat | long | dock_count | city | installation_date | |
---|---|---|---|---|---|---|---|
0 | 2 | San Jose Diridon Caltrain Station | 37.329732 | -121.901782 | 27 | San Jose | 8/6/2013 |
1 | 3 | San Jose Civic Center | 37.330698 | -121.888979 | 15 | San Jose | 8/5/2013 |
2 | 4 | Santa Clara at Almaden | 37.333988 | -121.894902 | 11 | San Jose | 8/6/2013 |
3 | 5 | Adobe on Almaden | 37.331415 | -121.893200 | 19 | San Jose | 8/5/2013 |
4 | 6 | San Pedro Square | 37.336721 | -121.894074 | 15 | San Jose | 8/7/2013 |
... | ... | ... | ... | ... | ... | ... | ... |
65 | 77 | Market at Sansome | 37.789625 | -122.400811 | 27 | San Francisco | 8/25/2013 |
66 | 80 | Santa Clara County Civic Center | 37.352601 | -121.905733 | 15 | San Jose | 12/31/2013 |
67 | 82 | Broadway St at Battery St | 37.798541 | -122.400862 | 15 | San Francisco | 1/22/2014 |
68 | 83 | Mezes Park | 37.491269 | -122.236234 | 15 | Redwood City | 2/20/2014 |
69 | 84 | Ryland Park | 37.342725 | -121.895617 | 15 | San Jose | 4/9/2014 |
70 rows × 7 columns
len(df_station.id.unique())
70
There are 70 different stations.
Trip Data Set
# Have a look at the first 5 instances
df_trip.head()
id | duration | start_date | start_station_name | start_station_id | end_date | end_station_name | end_station_id | bike_id | subscription_type | zip_code | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4576 | 63 | 8/29/2013 14:13 | South Van Ness at Market | 66 | 8/29/2013 14:14 | South Van Ness at Market | 66 | 520 | Subscriber | 94127 |
1 | 4607 | 70 | 8/29/2013 14:42 | San Jose City Hall | 10 | 8/29/2013 14:43 | San Jose City Hall | 10 | 661 | Subscriber | 95138 |
2 | 4130 | 71 | 8/29/2013 10:16 | Mountain View City Hall | 27 | 8/29/2013 10:17 | Mountain View City Hall | 27 | 48 | Subscriber | 97214 |
3 | 4251 | 77 | 8/29/2013 11:29 | San Jose City Hall | 10 | 8/29/2013 11:30 | San Jose City Hall | 10 | 26 | Subscriber | 95060 |
4 | 4299 | 83 | 8/29/2013 12:02 | South Van Ness at Market | 66 | 8/29/2013 12:04 | Market at 10th | 67 | 319 | Subscriber | 94103 |
# Number of instances and variables
df_trip.shape
(669959, 11)
# Checking for Missing Values
df_trip.isnull().sum()
id 0
duration 0
start_date 0
start_station_name 0
start_station_id 0
end_date 0
end_station_name 0
end_station_id 0
bike_id 0
subscription_type 0
zip_code 6619
dtype: int64
#Check dtypes
df_trip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 669959 entries, 0 to 669958
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 669959 non-null int64
1 duration 669959 non-null int64
2 start_date 669959 non-null object
3 start_station_name 669959 non-null object
4 start_station_id 669959 non-null int64
5 end_date 669959 non-null object
6 end_station_name 669959 non-null object
7 end_station_id 669959 non-null int64
8 bike_id 669959 non-null int64
9 subscription_type 669959 non-null object
10 zip_code 663340 non-null object
dtypes: int64(5), object(6)
memory usage: 56.2+ MB
#Create a new datetime variable from start_date
df_trip['date'] = pd.to_datetime(df_trip.start_date, format='%m/%d/%Y %H:%M')
# Summary statistics
df_trip.describe()
id | duration | start_station_id | end_station_id | bike_id | |
---|---|---|---|---|---|
count | 669959.000000 | 6.699590e+05 | 669959.000000 | 669959.000000 | 669959.000000 |
mean | 460382.009899 | 1.107950e+03 | 57.851876 | 57.837438 | 427.587620 |
std | 264584.458487 | 2.225544e+04 | 17.112474 | 17.200142 | 153.450988 |
min | 4069.000000 | 6.000000e+01 | 2.000000 | 2.000000 | 9.000000 |
25% | 231082.500000 | 3.440000e+02 | 50.000000 | 50.000000 | 334.000000 |
50% | 459274.000000 | 5.170000e+02 | 62.000000 | 62.000000 | 440.000000 |
75% | 692601.000000 | 7.550000e+02 | 70.000000 | 70.000000 | 546.000000 |
max | 913460.000000 | 1.727040e+07 | 84.000000 | 84.000000 | 878.000000 |
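The summary statistics flag a data-quality issue: the maximum duration is about 1.7e7 seconds (roughly 200 days), far beyond any plausible trip. A minimal sketch of how such outliers could be filtered, on a toy frame with an assumed 24-hour cutoff (the threshold is a judgment call, not something the data dictates):

```python
import pandas as pd

# Toy durations in seconds; 90000 s (> 24 h) stands in for a forgotten dock-in
trips = pd.DataFrame({"duration": [63, 517, 755, 90000]})

# Keep trips shorter than 24 hours (86400 s)
clean = trips[trips["duration"] < 86400]
print(len(clean))  # 3
```

Since the target here is daily trip counts rather than durations, this mainly matters if duration-derived features are added later.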
Weather
df_weather
date | start_station_id | zip_code | mean_temperature_f | min_temperature_f | mean_humidity | max_humidity | mean_visibility_miles | min_visibility_miles | max_wind_Speed_mph | max_gust_speed_mph | precipitation_inches | events | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-08-29 | 27 | 94107 | 68.0 | 61.0 | 75.0 | 93.0 | 10.0 | 10.0 | 23.0 | 28.0 | 0.0 | NaN |
1 | 2013-08-29 | 39 | 94107 | 68.0 | 61.0 | 75.0 | 93.0 | 10.0 | 10.0 | 23.0 | 28.0 | 0.0 | NaN |
2 | 2013-08-29 | 41 | 94107 | 68.0 | 61.0 | 75.0 | 93.0 | 10.0 | 10.0 | 23.0 | 28.0 | 0.0 | NaN |
3 | 2013-08-29 | 47 | 94107 | 68.0 | 61.0 | 75.0 | 93.0 | 10.0 | 10.0 | 23.0 | 28.0 | 0.0 | NaN |
4 | 2013-08-29 | 48 | 94107 | 68.0 | 61.0 | 75.0 | 93.0 | 10.0 | 10.0 | 23.0 | 28.0 | 0.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52038 | 2015-08-31 | 71 | 94109 | 70.8 | 60.2 | 63.8 | 82.2 | 11.4 | 9.8 | 19.0 | 22.5 | 0.0 | NaN |
52039 | 2015-08-31 | 66 | 94103 | 70.8 | 60.2 | 63.8 | 82.2 | 11.4 | 9.8 | 19.0 | 22.5 | 0.0 | NaN |
52040 | 2015-08-31 | 67 | 94103 | 70.8 | 60.2 | 63.8 | 82.2 | 11.4 | 9.8 | 19.0 | 22.5 | 0.0 | NaN |
52041 | 2015-08-31 | 72 | 94103 | 70.8 | 60.2 | 63.8 | 82.2 | 11.4 | 9.8 | 19.0 | 22.5 | 0.0 | NaN |
52042 | 2015-08-31 | 25 | 95037 | 70.8 | 60.2 | 63.8 | 82.2 | 11.4 | 9.8 | 19.0 | 22.5 | 0.0 | NaN |
52043 rows × 13 columns
df_weather.events.unique()
array([nan, 'Fog', 'Rain', 'Fog-Rain', 'rain', 'Rain-Thunderstorm'],
dtype=object)
df_weather.loc[df_weather.events == 'rain', 'events'] = "Rain"
df_weather.loc[df_weather.events.isnull(), 'events'] = "Normal"
df_weather.isnull().sum()
date 0
start_station_id 0
zip_code 0
mean_temperature_f 0
min_temperature_f 5
mean_humidity 0
max_humidity 105
mean_visibility_miles 0
min_visibility_miles 26
max_wind_Speed_mph 0
max_gust_speed_mph 2
precipitation_inches 252
events 0
dtype: int64
#Fill missing values with the most frequent value of each column
df_weather = df_weather.apply(lambda x:x.fillna(x.value_counts().index[0]))
df_weather.isnull().sum()
date 0
start_station_id 0
zip_code 0
mean_temperature_f 0
min_temperature_f 0
mean_humidity 0
max_humidity 0
mean_visibility_miles 0
min_visibility_miles 0
max_wind_Speed_mph 0
max_gust_speed_mph 0
precipitation_inches 0
events 0
dtype: int64
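The one-liner above fills each column with its most frequent value. An equivalent and slightly more robust form uses `DataFrame.mode`, which avoids the `IndexError` that `value_counts().index[0]` raises on an all-null column. A sketch on toy data:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"t": [60.0, np.nan, 60.0, 72.0],
                    "events": ["Rain", "Rain", None, "Fog"]})

# mode() works column by column across mixed dtypes; iloc[0] takes the top mode
filled = toy.fillna(toy.mode().iloc[0])
print(filled.isnull().sum().sum())  # 0
```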
Data Preparation
- Prepare data set for prediction based on df_trip dataframe
- Drop unnecessary variables
- Data set should contain date and total number of daily trips from each station
- Day, Month, Year, Weekday, Holiday, Season variables need to be added
- Station data should be merged
- Weather data should be merged
#Drop unnecessary variables
df = df_trip.drop(['id','duration','start_date','start_station_name','end_date', 'end_station_name','end_station_id','bike_id','subscription_type', 'zip_code'], axis=1)
#Number of trips variable
df['date'] = df.date.dt.date
df = df.value_counts().to_frame('no_of_trips').reset_index()
#Sort by date
df = df.sort_values(by=['date'])
df.reset_index(drop=True, inplace=True)
df.head()
start_station_id | date | no_of_trips | |
---|---|---|---|
0 | 22 | 2013-08-29 | 5 |
1 | 13 | 2013-08-29 | 4 |
2 | 30 | 2013-08-29 | 1 |
3 | 49 | 2013-08-29 | 18 |
4 | 55 | 2013-08-29 | 6 |
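`DataFrame.value_counts` (added in pandas 1.1) is a concise way to get these daily counts; the same table can be built with a groupby, which also works on older pandas versions. A sketch on toy rows:

```python
import pandas as pd

toy = pd.DataFrame({"start_station_id": [22, 22, 13],
                    "date": ["2013-08-29", "2013-08-29", "2013-08-29"]})

# One row per (station, date) with the number of trips
counts = (toy.groupby(["start_station_id", "date"])
             .size()
             .reset_index(name="no_of_trips"))
print(counts)
```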
# We will make daily predictions, so the hour is not important
# Note: derive the parts from df itself; indexing df_trip here would misalign rows,
# because df was reordered by the value_counts/sort steps above
dates = pd.to_datetime(df['date'])
df['year'] = dates.dt.year
df['month'] = dates.dt.month
df['day'] = dates.dt.day
# Variables added
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43270 entries, 0 to 43269
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 start_station_id 43270 non-null int64
1 date 43270 non-null object
2 no_of_trips 43270 non-null int64
3 year 43270 non-null int64
4 month 43270 non-null int64
5 day 43270 non-null int64
dtypes: int64(5), object(1)
memory usage: 2.0+ MB
#Day name of the week
df['date'] = pd.to_datetime(df['date'])
df['day_name'] = df['date'].dt.day_name()
#Weekend
df['weekend'] = df['date'].dt.dayofweek > 4
df['weekend'] = df['weekend'].astype(object)
#Season variable
df['season'] = (df['date'].dt.month%12 + 3)//3
seasons = {
1: 'Winter',
2: 'Spring',
3: 'Summer',
4: 'Autumn'
}
df['season'] = df['season'].map(seasons)
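The `(month % 12 + 3) // 3` formula is compact but easy to misread; a quick check over the twelve months confirms the mapping used above:

```python
import pandas as pd

months = pd.Series(range(1, 13))
season_no = (months % 12 + 3) // 3
# Dec-Feb -> 1 (Winter), Mar-May -> 2 (Spring),
# Jun-Aug -> 3 (Summer), Sep-Nov -> 4 (Autumn)
print(dict(zip(months, season_no)))
```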
#Find holidays
cal = calendar()
dr = pd.date_range(start='2013-07-01', end='2016-07-31')
holidays = cal.holidays(start=dr.min(), end=dr.max())
#Holiday variable
df['holiday'] = df['date'].isin(holidays)
df['holiday'] = df['holiday'].astype(object)
# align dtypes before merge
df['date'] = pd.to_datetime(df['date'])
df_station['installation_date'] = pd.to_datetime(df_station['installation_date'])
df_weather['date'] = pd.to_datetime(df_weather['date'])
#merge station data: total number of docks installed system-wide by each date
no_of_docks = []
for day in df.date:
no_of_docks.append(sum(df_station[df_station.installation_date <= day].dock_count))
df['no_of_docks'] = no_of_docks
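The loop above rescans `df_station` once per row (~43k sums). A vectorized alternative precomputes a cumulative dock count and aligns it with `merge_asof`; a sketch on toy data (the dates and counts are illustrative):

```python
import pandas as pd

stations = pd.DataFrame({
    "installation_date": pd.to_datetime(["2013-08-05", "2013-08-06", "2014-01-22"]),
    "dock_count": [15, 27, 15],
})

# Running total of docks installed, keyed by installation date
docks = (stations.sort_values("installation_date")
                 .groupby("installation_date")["dock_count"].sum()
                 .cumsum()
                 .reset_index(name="no_of_docks"))

# For each trip date, take the latest cumulative total at or before that date
days = pd.DataFrame({"date": pd.to_datetime(["2013-08-05", "2013-09-01"])})
merged = pd.merge_asof(days.sort_values("date"), docks,
                       left_on="date", right_on="installation_date")
print(merged["no_of_docks"].tolist())  # [15, 42]
```

Both key columns must be sorted for `merge_asof`; the default `direction='backward'` matches "installed on or before this day".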
#merge trips and weather data
df = pd.merge(df, df_weather, on=['start_station_id','date'], how='left')
df.head()
start_station_id | date | no_of_trips | year | month | day | day_name | weekend | season | holiday | ... | mean_temperature_f | min_temperature_f | mean_humidity | max_humidity | mean_visibility_miles | min_visibility_miles | max_wind_Speed_mph | max_gust_speed_mph | precipitation_inches | events | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22 | 2013-08-29 | 5 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 71.0 | 62.0 | 79.0 | 94.0 | 10.0 | 10.0 | 14.0 | 17.0 | 0.0 | Normal |
1 | 13 | 2013-08-29 | 4 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 70.4 | 62.8 | 73.2 | 88.8 | 10.0 | 10.0 | 17.8 | 21.6 | 0.0 | Normal |
2 | 30 | 2013-08-29 | 1 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 70.4 | 62.8 | 73.2 | 88.8 | 10.0 | 10.0 | 17.8 | 21.6 | 0.0 | Normal |
3 | 49 | 2013-08-29 | 18 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 70.4 | 62.8 | 73.2 | 88.8 | 10.0 | 10.0 | 17.8 | 21.6 | 0.0 | Normal |
4 | 55 | 2013-08-29 | 6 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 70.4 | 62.8 | 73.2 | 88.8 | 10.0 | 10.0 | 17.8 | 21.6 | 0.0 | Normal |
5 rows × 22 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43270 entries, 0 to 43269
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 start_station_id 43270 non-null int64
1 date 43270 non-null datetime64[ns]
2 no_of_trips 43270 non-null int64
3 year 43270 non-null int64
4 month 43270 non-null int64
5 day 43270 non-null int64
6 day_name 43270 non-null object
7 weekend 43270 non-null object
8 season 43270 non-null object
9 holiday 43270 non-null object
10 no_of_docks 43270 non-null int64
11 zip_code 43270 non-null int64
12 mean_temperature_f 43270 non-null float64
13 min_temperature_f 43270 non-null float64
14 mean_humidity 43270 non-null float64
15 max_humidity 43270 non-null float64
16 mean_visibility_miles 43270 non-null float64
17 min_visibility_miles 43270 non-null float64
18 max_wind_Speed_mph 43270 non-null float64
19 max_gust_speed_mph 43270 non-null float64
20 precipitation_inches 43270 non-null float64
21 events 43270 non-null object
dtypes: datetime64[ns](1), float64(9), int64(7), object(5)
memory usage: 7.6+ MB
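A left merge silently fills NaNs for any (station, date) pair missing from the weather table; here every row happens to match, but `indicator=True` makes that easy to verify. A sketch on toy frames:

```python
import pandas as pd

left = pd.DataFrame({"start_station_id": [22, 13],
                     "date": pd.to_datetime(["2013-08-29"] * 2)})
right = pd.DataFrame({"start_station_id": [22],
                      "date": pd.to_datetime(["2013-08-29"]),
                      "mean_temperature_f": [71.0]})

# The _merge column records whether each row found a weather match
checked = pd.merge(left, right, on=["start_station_id", "date"],
                   how="left", indicator=True)
print(checked["_merge"].value_counts())
```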
Feature Engineering
Standard Scaler for continuous variables
scaler = StandardScaler()
cont_cols = ['no_of_docks', 'mean_temperature_f', 'min_temperature_f', 'mean_humidity',
             'max_humidity', 'mean_visibility_miles', 'min_visibility_miles',
             'max_wind_Speed_mph', 'max_gust_speed_mph']
df[cont_cols] = scaler.fit_transform(df[cont_cols].to_numpy())
df
start_station_id | date | no_of_trips | year | month | day | day_name | weekend | season | holiday | ... | mean_temperature_f | min_temperature_f | mean_humidity | max_humidity | mean_visibility_miles | min_visibility_miles | max_wind_Speed_mph | max_gust_speed_mph | precipitation_inches | events | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22 | 2013-08-29 | 5 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 1.431242 | 1.424143 | 1.100845 | 1.025795 | 0.214677 | 0.695694 | -0.568521 | -0.581315 | 0.0 | Normal |
1 | 13 | 2013-08-29 | 4 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 1.342652 | 1.539740 | 0.542780 | 0.393064 | 0.214677 | 0.695694 | 0.117123 | 0.001171 | 0.0 | Normal |
2 | 30 | 2013-08-29 | 1 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 1.342652 | 1.539740 | 0.542780 | 0.393064 | 0.214677 | 0.695694 | 0.117123 | 0.001171 | 0.0 | Normal |
3 | 49 | 2013-08-29 | 18 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 1.342652 | 1.539740 | 0.542780 | 0.393064 | 0.214677 | 0.695694 | 0.117123 | 0.001171 | 0.0 | Normal |
4 | 55 | 2013-08-29 | 6 | 2013 | 8 | 29 | Thursday | False | Summer | False | ... | 1.342652 | 1.539740 | 0.542780 | 0.393064 | 0.214677 | 0.695694 | 0.117123 | 0.001171 | 0.0 | Normal |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43265 | 9 | 2015-08-31 | 2 | 2013 | 10 | 17 | Monday | False | Summer | False | ... | 1.401712 | 1.164050 | -0.361670 | -0.410016 | 1.472286 | 0.619260 | 0.333642 | 0.115136 | 0.0 | Normal |
43266 | 7 | 2015-08-31 | 2 | 2013 | 10 | 17 | Monday | False | Summer | False | ... | 1.401712 | 1.164050 | -0.361670 | -0.410016 | 1.472286 | 0.619260 | 0.333642 | 0.115136 | 0.0 | Normal |
43267 | 22 | 2015-08-31 | 5 | 2013 | 10 | 17 | Monday | False | Summer | False | ... | 1.431242 | 1.424143 | -0.053772 | -1.042746 | 0.214677 | 0.695694 | 0.153210 | -0.201433 | 0.0 | Normal |
43268 | 77 | 2015-08-31 | 49 | 2013 | 10 | 17 | Monday | False | Summer | False | ... | 1.135943 | 1.135151 | -0.053772 | -0.190994 | 0.214677 | 0.313523 | 0.153210 | -0.074805 | 0.0 | Normal |
43269 | 42 | 2015-08-31 | 30 | 2013 | 10 | 17 | Monday | False | Summer | False | ... | 1.401712 | 1.164050 | -0.361670 | -0.410016 | 1.472286 | 0.619260 | 0.333642 | 0.115136 | 0.0 | Normal |
43270 rows × 22 columns
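One caveat: the scaler is fit on the full data set before the train/test split below, so test-set statistics leak into the transform. A sketch of the leakage-free order, fitting on the training rows only (toy data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = np.arange(20, dtype=float).reshape(-1, 1)
X_tr, X_te = train_test_split(X, test_size=0.2, random_state=2)

scaler = StandardScaler().fit(X_tr)          # statistics from training data only
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
print(round(float(X_tr_s.mean()), 6))        # training mean is centered to 0
```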
One hot encoding for categorical variables
# Making dummy variables for categorical data with more inputs.
data_dummy = pd.get_dummies(df[['day_name','weekend','season', 'holiday', 'events']])
# Output the first five rows.
data_dummy.head()
day_name_Friday | day_name_Monday | day_name_Saturday | day_name_Sunday | day_name_Thursday | day_name_Tuesday | day_name_Wednesday | weekend_False | weekend_True | season_Autumn | season_Spring | season_Summer | season_Winter | holiday_False | holiday_True | events_Fog | events_Fog-Rain | events_Normal | events_Rain | events_Rain-Thunderstorm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
# Merging (concatenate) original data frame with 'dummy' dataframe.
df = pd.concat([df,data_dummy], axis=1)
# Dropping attributes for which we made dummy variables and zip_code
df = df.drop(['day_name','weekend','season', 'holiday', 'events','zip_code'], axis=1)
df.head()
start_station_id | date | no_of_trips | year | month | day | no_of_docks | mean_temperature_f | min_temperature_f | mean_humidity | ... | season_Spring | season_Summer | season_Winter | holiday_False | holiday_True | events_Fog | events_Fog-Rain | events_Normal | events_Rain | events_Rain-Thunderstorm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22 | 2013-08-29 | 5 | 2013 | 8 | 29 | -2.172486 | 1.431242 | 1.424143 | 1.100845 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 13 | 2013-08-29 | 4 | 2013 | 8 | 29 | -2.172486 | 1.342652 | 1.539740 | 0.542780 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 30 | 2013-08-29 | 1 | 2013 | 8 | 29 | -2.172486 | 1.342652 | 1.539740 | 0.542780 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 49 | 2013-08-29 | 18 | 2013 | 8 | 29 | -2.172486 | 1.342652 | 1.539740 | 0.542780 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 55 | 2013-08-29 | 6 | 2013 | 8 | 29 | -2.172486 | 1.342652 | 1.539740 | 0.542780 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 36 columns
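Note that dummy pairs such as weekend_False/weekend_True and holiday_False/holiday_True are perfectly collinear; tree models don't mind, but for linear models `get_dummies(drop_first=True)` would drop one level per category. A sketch:

```python
import pandas as pd

toy = pd.DataFrame({"weekend": ["False", "True", "False"]})

# drop_first removes the first level, leaving only weekend_True
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))  # ['weekend_True']
```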
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43270 entries, 0 to 43269
Data columns (total 36 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 start_station_id 43270 non-null int64
1 date 43270 non-null datetime64[ns]
2 no_of_trips 43270 non-null int64
3 year 43270 non-null int64
4 month 43270 non-null int64
5 day 43270 non-null int64
6 no_of_docks 43270 non-null float64
7 mean_temperature_f 43270 non-null float64
8 min_temperature_f 43270 non-null float64
9 mean_humidity 43270 non-null float64
10 max_humidity 43270 non-null float64
11 mean_visibility_miles 43270 non-null float64
12 min_visibility_miles 43270 non-null float64
13 max_wind_Speed_mph 43270 non-null float64
14 max_gust_speed_mph 43270 non-null float64
15 precipitation_inches 43270 non-null float64
16 day_name_Friday 43270 non-null uint8
17 day_name_Monday 43270 non-null uint8
18 day_name_Saturday 43270 non-null uint8
19 day_name_Sunday 43270 non-null uint8
20 day_name_Thursday 43270 non-null uint8
21 day_name_Tuesday 43270 non-null uint8
22 day_name_Wednesday 43270 non-null uint8
23 weekend_False 43270 non-null uint8
24 weekend_True 43270 non-null uint8
25 season_Autumn 43270 non-null uint8
26 season_Spring 43270 non-null uint8
27 season_Summer 43270 non-null uint8
28 season_Winter 43270 non-null uint8
29 holiday_False 43270 non-null uint8
30 holiday_True 43270 non-null uint8
31 events_Fog 43270 non-null uint8
32 events_Fog-Rain 43270 non-null uint8
33 events_Normal 43270 non-null uint8
34 events_Rain 43270 non-null uint8
35 events_Rain-Thunderstorm 43270 non-null uint8
dtypes: datetime64[ns](1), float64(10), int64(5), uint8(20)
memory usage: 6.4 MB
Model Building
labels = df.no_of_trips
train = df.drop(['no_of_trips', 'date'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(train, labels, test_size=0.2, random_state = 2)
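A random split lets the model train on days that come after some test days; for the "day in advance" question at the end, a chronological split may be a more honest evaluation. A sketch on toy date-ordered data:

```python
import pandas as pd

toy = pd.DataFrame({"date": pd.date_range("2013-08-29", periods=10, freq="D"),
                    "no_of_trips": range(10)})

# Hold out the last 20% of days as a forward-looking test set
cut = int(len(toy) * 0.8)
train_t, test_t = toy.iloc[:cut], toy.iloc[cut:]
print(train_t["date"].max() < test_t["date"].min())  # True
```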
Random Forest Regressor
rfr = RandomForestRegressor(n_estimators = 55,
min_samples_leaf = 3,
random_state = 2)
rfr = rfr.fit(X_train, y_train)
Ada Boost Regressor
abr = AdaBoostRegressor(n_estimators = 100,
learning_rate = 0.1,
loss = 'linear',
random_state = 2)
GradientBoostingRegressor
gbr = GradientBoostingRegressor(learning_rate = 0.12,
n_estimators = 55,
max_depth = 5,
min_samples_leaf = 1,
random_state = 2)
Scoring the Models
scoring = ['r2', 'neg_mean_squared_error', 'neg_mean_absolute_error']
models = [rfr, abr, gbr]
for model in models:
    for i in scoring:
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring=i)
        if i == 'r2':
            print(model, i, ': ', scores.mean())
        elif i == 'neg_mean_squared_error':
            x = -1 * scores.mean()
            y = math.sqrt(x)
            print(model, 'RMSE: ', "%0.2f" % y)
        elif i == 'neg_mean_absolute_error':
            x = -1 * scores.mean()
            print(model, i, ": %0.2f (+/- %0.2f)" % (x, scores.std() * 2))
RandomForestRegressor(min_samples_leaf=3, n_estimators=55, random_state=2) r2 : 0.9059889252171234
RandomForestRegressor(min_samples_leaf=3, n_estimators=55, random_state=2) RMSE: 5.43
RandomForestRegressor(min_samples_leaf=3, n_estimators=55, random_state=2) neg_mean_absolute_error : 3.39 (+/- 0.11)
AdaBoostRegressor(learning_rate=0.1, n_estimators=100, random_state=2) r2 : 0.6193340252543343
AdaBoostRegressor(learning_rate=0.1, n_estimators=100, random_state=2) RMSE: 10.93
AdaBoostRegressor(learning_rate=0.1, n_estimators=100, random_state=2) neg_mean_absolute_error : 7.74 (+/- 0.23)
GradientBoostingRegressor(learning_rate=0.12, max_depth=5, n_estimators=55,
random_state=2) r2 : 0.8729724087591497
GradientBoostingRegressor(learning_rate=0.12, max_depth=5, n_estimators=55,
random_state=2) RMSE: 6.32
GradientBoostingRegressor(learning_rate=0.12, max_depth=5, n_estimators=55,
random_state=2) neg_mean_absolute_error : 4.12 (+/- 0.08)
#Fit and predict with the best model
rfr = rfr.fit(X_train, y_train)
rfr_preds = rfr.predict(X_test)
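The imported metrics can also score the held-out predictions directly (the real call would pass y_test and rfr_preds). A worked toy example:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

y_true = np.array([5.0, 4.0, 18.0, 6.0])
y_pred = np.array([6.0, 4.0, 15.0, 7.0])

# RMSE = sqrt(mean of squared errors); errors are 1, 0, 3, 1
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# Median absolute error is robust to the occasional large miss
mae_med = median_absolute_error(y_true, y_pred)
print(round(rmse, 3), mae_med)  # 1.658 1.0
```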
#Visualize
y_test.reset_index(drop = True, inplace = True)
fs = 30
plt.figure(figsize=(50,20))
plt.plot(rfr_preds)
plt.plot(y_test)
plt.legend(['Prediction', 'Actual'])
plt.ylabel("Number of Trips", fontsize = fs)
plt.xlabel("Predicted Date", fontsize = fs)
plt.title("Predicted Values vs Actual Values", fontsize = fs)
plt.show()
#Features by importance
def plot_importances(model, model_name):
    importances = model.feature_importances_
    # Std of importances across the individual trees of the ensemble
    std = np.std([tree.feature_importances_ for tree in model.estimators_],
                 axis=0)
    indices = np.argsort(importances)[::-1]
    # Plot the feature importances of the forest
    plt.figure(figsize=(12, 5))
    plt.title("Feature importances of " + model_name)
    plt.bar(range(X_train.shape[1]), importances[indices],
            yerr=std[indices], color="r", align="center")
    plt.xticks(range(X_train.shape[1]), indices)
    plt.xlim([-1, X_train.shape[1]])
    plt.show()
    print("Feature ranking:")
    for i, feature in enumerate(X_train.columns):
        print(i, feature)
plot_importances(rfr, "RandomForestRegressor")
Feature ranking:
0 start_station_id
1 year
2 month
3 day
4 no_of_docks
5 mean_temperature_f
6 min_temperature_f
7 mean_humidity
8 max_humidity
9 mean_visibility_miles
10 min_visibility_miles
11 max_wind_Speed_mph
12 max_gust_speed_mph
13 precipitation_inches
14 day_name_Friday
15 day_name_Monday
16 day_name_Saturday
17 day_name_Sunday
18 day_name_Thursday
19 day_name_Tuesday
20 day_name_Wednesday
21 weekend_False
22 weekend_True
23 season_Autumn
24 season_Spring
25 season_Summer
26 season_Winter
27 holiday_False
28 holiday_True
29 events_Fog
30 events_Fog-Rain
31 events_Normal
32 events_Rain
33 events_Rain-Thunderstorm
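The importance chart labels bars with column indices, which forces the reader back to this printout. Pairing importances with column names in a Series makes the ranking self-describing; a toy sketch where only feature 'a' carries signal:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data: y depends only on 'a'; 'b' is constant and should rank last
toy = pd.DataFrame({"a": range(100), "b": [0] * 100})
y = toy["a"] * 2

model = RandomForestRegressor(n_estimators=10, random_state=2).fit(toy, y)
ranking = (pd.Series(model.feature_importances_, index=toy.columns)
             .sort_values(ascending=False))
print(ranking.index[0])  # 'a'
```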
Question: Can you predict the demand for bikes at a given station a day in advance?
If yes, what are the main factors contributing to the predictions? Would this model be actionable for the business? The model should be basic; the only requirement is that it should be interpretable.
Answer: Yes.
The main factors are the station, weekday/weekend information, the number of docks, mean temperature and the day. With further improvement the model can be actionable for the business: the number of bikes and docks can be increased at stations where the model predicts higher demand, while stations with consistently low predicted demand could be considered for closure.
Question: What are 3 additional pieces of information you would recommend to add to these datasets to gain more insight into the business and usage of this product?
Answer:
- Condition of the bikes
- Population in the region of the station
- Traffic data
- Availability of bike lanes in the region of the station
- Pollution
- Average bike prices in the region