Thursday, March 28, 2013

Capital Bikeshare: Time Series Clustering

Weekdays on the bike share network are very different from weekends. This becomes obvious when you plot the total number of rentals per hour for each day of the week. In the elegant rainbow plot below, it is clear that the weekdays (Monday to Friday) are incredibly similar. Each has two sharp peaks, the first at 8am (most likely the morning commute) and the second at 6pm (most likely the evening commute). Lunchtime (12 to 1) also shows increased activity. The weekend, on the other hand, produces a completely different graph: there are no sharp peaks, simply a steady increase in activity until midday and then a steady decrease.


This got me thinking: each station must have its own temporal profile, a characteristic time-series that says something about how and when that station is used. If a station peaks early on weekday mornings and then never again, it could mean that this station's primary purpose is transporting people to work. Likewise, if a station peaks at 6pm but never again, then it could be considered an evening commute station. Other patterns might also exist that lend themselves to easy interpretation.

So this is exactly what I did. For each of the 191 stations, I computed a time-series by calculating the average number of rentals per hour for that station. I also split the data into weekday and weekend because as we saw above this is very important. 
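
A minimal sketch of this step, assuming the trip data is available as (station ID, start time) pairs. The function name `hourly_profiles`, the input layout, and the choice to average over every calendar day of that type seen in the data are assumptions for illustration, not details taken from the original analysis.

```python
from collections import defaultdict
from datetime import datetime

def hourly_profiles(rentals):
    """Return {(station_id, day_type): list of 24 average rentals per hour}.

    `rentals` is an iterable of (station_id, start_time) pairs, where
    start_time is a datetime. day_type is "weekday" or "weekend".
    """
    counts = defaultdict(lambda: [0] * 24)  # total rentals per hour of day
    days_seen = defaultdict(set)            # distinct calendar days per day type

    for station_id, start_time in rentals:
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        day_type = "weekend" if start_time.weekday() >= 5 else "weekday"
        counts[(station_id, day_type)][start_time.hour] += 1
        days_seen[day_type].add(start_time.date())

    profiles = {}
    for (station_id, day_type), per_hour in counts.items():
        # Assumption: average over all days of this type present in the data.
        n_days = len(days_seen[day_type])
        profiles[(station_id, day_type)] = [c / n_days for c in per_hour]
    return profiles
```

Each resulting profile has exactly 24 values, one per hour of the day, which is what makes the stations directly comparable later on.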

Normalization


Next I normalized each time-series. I did this because I am interested in the characteristics of the time-series and not the difference in rental volumes. As an example, before normalization the following two time-series have a Euclidean distance of 2025.28 but after normalization they have a Euclidean distance of only 0.39.

Raw rental counts: Station 31245 is approximately twice as busy/popular as station 31018

Normalized rental counts: When we take rental volume out of the picture, we see that these two stations are in fact very similar.
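
The post does not say which normalization was applied, so the sketch below assumes min-max scaling (each series rescaled to the range 0 to 1) purely for illustration. The two profiles are made up, not the real data for stations 31245 and 31018; they only show how a difference in volume disappears once the series are normalized.

```python
import numpy as np

def min_max_normalize(series):
    """Rescale a series to the range [0, 1]; a flat series maps to all zeros."""
    series = np.asarray(series, dtype=float)
    span = series.max() - series.min()
    if span == 0:
        return np.zeros_like(series)
    return (series - series.min()) / span

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

# Made-up hourly profiles: the busy station has the same shape at twice the volume.
hours = np.arange(24)
quiet = 10 + 40 * np.exp(-0.5 * ((hours - 8) / 1.5) ** 2)
busy = 2 * quiet

raw_distance = euclidean_distance(busy, quiet)
normalized_distance = euclidean_distance(
    min_max_normalize(busy), min_max_normalize(quiet)
)
```

Here `raw_distance` is large because one station is simply busier, while `normalized_distance` is zero because the two shapes are identical. Other choices, such as z-scoring or dividing by the series total, would serve the same purpose.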

Distance functions


Okay, so after we have normalized each station's time-series, we need to choose a distance function. Our time-series objects have a really nice property: they are all of equal length. Each one contains exactly 24 points. There are an infinite number of ways to compute the difference between two sets of 24 points, so there are an infinite number of distance functions we could use. To name but a few: Euclidean distance (ED), Mean Absolute Error (MAE), Dynamic time warping (DTW). I have decided to use the Euclidean distance; after all, there's only so much 24 points can tell us, so there is no sense in overcomplicating things.
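
For comparison, here is a rough sketch of the three distance functions named above, written with NumPy. The function names are illustrative, and the DTW variant shown (unconstrained, with absolute difference as the step cost) is one common textbook formulation chosen as an assumption, not something the analysis relied on.

```python
import numpy as np

def euclidean(a, b):
    """Square root of the summed squared differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def mean_absolute_error(a, b):
    """Average absolute difference per point."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean(np.abs(a - b)))

def dtw(a, b):
    """Unconstrained dynamic time warping with absolute-difference cost."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] is the cheapest alignment of a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return float(cost[n, m])

# Two made-up profiles with a single peak, one hour apart.
peak_at_8 = np.zeros(24)
peak_at_8[8] = 1.0
peak_at_9 = np.zeros(24)
peak_at_9[9] = 1.0
```

On these two toy profiles the Euclidean distance and MAE both register the one-hour shift, whereas DTW warps the time axis and treats the series as identical. For commute profiles, where an 8am peak and a 9am peak arguably should count as different, that is one more argument for the simpler pointwise distances.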

Clustering algorithms


Okay, it's clustering time. Once again, there are many clustering algorithms we could choose from. To name but a few: K-Means, DBSCAN, OPTICS. I have chosen K-Means. Why? Well, there's a really nice implementation of k-means in SciPy and I just love Python.

So I ran k-means with k = 3, chose 3 nice colors and plotted the time-series belonging to each cluster in the following plots. I plotted each individual time-series with a transparency of 0.5 and then plotted the average of these time-series (sometimes referred to as the signature of the cluster) with no transparency.

Cluster 1: The morning commute cluster?

Cluster 2: The evening commute cluster?

Cluster 3: Both morning and evening commute cluster?

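
The clustering step might look roughly like the sketch below. It assumes the `scipy.cluster.vq` module (the post only says "SciPy"), uses synthetic profiles rather than the real station data, and passes explicit starting centroids so the run is repeatable. Passing the integer 3 instead would let SciPy pick random starting centroids from the observations. The helper name `cluster_profiles` is hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def cluster_profiles(profiles, initial_centroids):
    """Cluster equal-length profiles with SciPy's k-means.

    profiles: array of shape (n_stations, 24), already normalized.
    initial_centroids: array of shape (k, 24) used as the starting guess.
    Returns (labels, signatures), where signatures[c] is the mean profile
    of the stations assigned to cluster c.
    """
    profiles = np.asarray(profiles, dtype=float)
    guess = np.asarray(initial_centroids, dtype=float)
    codebook, _distortion = kmeans(profiles, guess)
    # Assign every profile to its nearest centroid (Euclidean distance).
    labels, _dists = vq(profiles, codebook)
    signatures = np.array(
        [profiles[labels == c].mean(axis=0) for c in range(len(codebook))]
    )
    return labels, signatures

# Synthetic demo data (NOT the real station profiles): three shapes plus noise.
hours = np.arange(24)
morning = np.exp(-0.5 * ((hours - 8) / 1.5) ** 2)
evening = np.exp(-0.5 * ((hours - 18) / 1.5) ** 2)
both = np.maximum(morning, evening)

rng = np.random.RandomState(0)
shapes = [morning, evening, both]
profiles = np.array(
    [s + rng.normal(0, 0.02, 24) for s in shapes for _ in range(5)]
)

# Seed each cluster with one example of each shape.
labels, signatures = cluster_profiles(profiles, profiles[[0, 5, 10]])
```

The `signatures` array holds the per-cluster averages, which correspond to the opaque lines in the plots, while the rows of `profiles` selected by `labels == c` are the semi-transparent individual series.
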
So what do these clusters look like on a map? Well, they seem to make sense. One interpretation of these results is as follows: "The majority of stations that are used for both morning and evening commutes (the blue stations) are in the city center. The morning commute stations (the green stations) are presumably used primarily by people travelling to the orange stations. In the evening, the evening commute stations (the orange stations) are then used by the same people to travel back home to the green stations."

Map of bike stations, colored by cluster membership.

A word of warning: the above is simply one interesting interpretation of these results. There are many possible explanations for such structure. Perhaps there are three types of cyclist: one group that likes to cycle in the mornings but never in the evenings (the green stations), another that likes to cycle in the evenings but never in the mornings (the orange stations), and a third that loves to cycle both in the mornings and in the evenings (the blue stations).

Is that it?


Of course not, but this time-series clustering has given me an even better idea, so my next post will be on something related but very different. At some stage I will try to post results that are harder to interpret, such as when k increases to 4, 5, 6, etc. I may also experiment with OPTICS and see how much the results differ. It would also be nice to do the exact same analysis for bike returns instead of bike rentals.

About the visualization


As always, the above visualizations require a mash-up of tools and frameworks. All of the plots were created using Python and matplotlib. All of the processing (normalization, Euclidean distance and clustering) was done using Python and a combination of NumPy and SciPy. The map is once again powered by Leaflet and D3.

If for some reason you want to explore the live example, you can find it here.

3 comments:

  1. Did you try just using a correlation coefficient as the distance metric (pearson/spearman)?

  2. It's good information. Could you share source code of your work? Link you noted seems to be broken.

  3. excuse me, where I can find the database?
