Of Data and Science: March 2013

Thursday, March 28, 2013

Capital Bikeshare: Time Series Clustering

Weekdays on the bike share network are very different from weekends. This becomes obvious when you plot the total number of rentals per hour, for each day of the week. In the rainbow plot below, it is clear that the weekdays (Monday to Friday) are incredibly similar. Two sharp peaks are present, the first at 8am (most likely the morning commute) and the second at 6pm (most likely the evening commute). Lunchtime (12 to 1) also shows increased activity. The weekend, on the other hand, is a completely different graph: there are no sharp peaks, simply a steady increase in activity until midday and then a steady decrease.
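For anyone who wants to reproduce this kind of plot, here is a minimal sketch of the counting step. It assumes each rental is available as a start timestamp string; the sample timestamps and the function name `rentals_by_day_and_hour` are made up for illustration and are not taken from the real data dumps.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical sample of rental start times (the real data has one row per trip).
start_times = [
    "2012-06-04 08:05",  # a Monday
    "2012-06-04 08:40",
    "2012-06-04 18:10",
    "2012-06-09 12:30",  # a Saturday
    "2012-06-10 13:15",  # a Sunday
]

def rentals_by_day_and_hour(timestamps):
    """Count rentals for every (day of week, hour of day) pair."""
    counts = Counter()
    for stamp in timestamps:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        counts[(DAY_NAMES[t.weekday()], t.hour)] += 1
    return counts

counts = rentals_by_day_and_hour(start_times)
print(counts[("Monday", 8)])
```

Plotting one line per day of the week, with the hour on the x-axis and the count on the y-axis, would give a plot of the kind described above.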

This got me thinking: each station must have its own temporal profile, a characteristic time-series that says something about how and when that station is used. If a station peaks early on weekday mornings and then never again, it could mean that this station's primary purpose is transporting people to work. Likewise, if a station peaks at 6pm but at no other time, it could be considered an evening commute station. Other patterns might also exist that lend themselves to easy interpretation.

So this is exactly what I did. For each of the 191 stations, I computed a time-series by calculating the average number of rentals per hour for that station. I also split the data into weekday and weekend because, as we saw above, this distinction is very important.
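A rough sketch of how one station's profile could be computed is shown below. The exact averaging used for the post is not spelled out, so this is an assumption: rentals are counted per hour of the day and divided by the number of distinct days (of the requested type) that appear in the data. The function name `station_profile` and the sample trips are hypothetical.

```python
from datetime import datetime

import numpy as np

def station_profile(start_times, weekend=False):
    """Average rentals per hour of day (24 values) for one station.

    Assumption: the average is taken over the distinct weekday (or weekend)
    dates present in start_times. Days with no rentals at all are not
    counted, which is a simplification.
    """
    hourly = np.zeros(24)
    days = set()
    for stamp in start_times:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        is_weekend = t.weekday() >= 5  # Saturday = 5, Sunday = 6
        if is_weekend != weekend:
            continue
        hourly[t.hour] += 1
        days.add(t.date())
    return hourly / max(len(days), 1)

# Made-up rentals from a single station.
trips = [
    "2012-06-04 08:05", "2012-06-04 08:40",  # Monday
    "2012-06-05 08:15", "2012-06-05 18:20",  # Tuesday
    "2012-06-09 12:30",                      # Saturday
]
weekday_profile = station_profile(trips, weekend=False)
weekend_profile = station_profile(trips, weekend=True)
```

Running this once per station gives one 24-point weekday series and one 24-point weekend series per station.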

Normalization

Next I normalized each time-series. I did this because I am interested in the characteristics of the time-series and not the difference in rental volumes. As an example, before normalization the following two time-series have a Euclidean distance of 2025.28 but after normalization they have a Euclidean distance of only 0.39.

Raw rental counts: Station 31245 is approximately twice as busy/popular as station 31018

Normalized rental counts: When we take rental volume out of the picture, we see that these two stations are in fact very similar.
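The normalization scheme is not named above, so the following is only a sketch under one common assumption: each profile is divided by its own total, so that every station's 24 values sum to 1. The two synthetic profiles are made up (one is exactly twice the other) and do not reproduce the distances quoted for stations 31245 and 31018.

```python
import numpy as np

def normalize(profile):
    """Scale a profile so its values sum to 1 (an assumed scheme)."""
    profile = np.asarray(profile, dtype=float)
    total = profile.sum()
    return profile / total if total > 0 else profile

hours = np.arange(24)
# Synthetic morning-peak profile, and a station that is exactly twice as busy.
quiet = 10 * np.exp(-0.5 * ((hours - 8) / 1.5) ** 2)
busy = 2 * quiet

raw_distance = np.linalg.norm(busy - quiet)
normalized_distance = np.linalg.norm(normalize(busy) - normalize(quiet))
print(raw_distance, normalized_distance)
```

With this scheme the raw distance is large while the normalized distance is essentially zero, because the two shapes are identical. Other choices, such as dividing by the maximum or z-scoring, would also remove the volume difference.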

Distance functions

Okay, so after we have normalized each station's time-series, we need to choose a distance function. Our time-series objects have a really nice property: they are all of equal length. Each one contains exactly 24 points, one per hour of the day. There are an infinite number of ways to compute the difference between two sets of 24 points, so there are an infinite number of distance functions we could use. To name but a few: Euclidean distance (ED), Mean Absolute Error (MAE) and Dynamic Time Warping (DTW). I have decided to use the Euclidean distance. After all, there's only so much 24 points can tell us, so there is no sense in overcomplicating things.
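To make two of these distance functions concrete, here is a small sketch of the Euclidean distance and the Mean Absolute Error for equal-length series. The function names and the two toy profiles are illustrative only.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the summed squared point-wise differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def mean_absolute_error(a, b):
    """Average of the absolute point-wise differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean(np.abs(a - b)))

# Two toy 24-point profiles that differ only at 8am and 6pm.
flat = np.zeros(24)
peaky = np.zeros(24)
peaky[8] = 3.0
peaky[18] = 4.0

print(euclidean_distance(flat, peaky))   # sqrt(3^2 + 4^2) = 5.0
print(mean_absolute_error(flat, peaky))  # (3 + 4) / 24
```

Both functions compare the series hour by hour, which only works because every series has the same 24 points. DTW relaxes that hour-by-hour alignment, at the cost of extra complexity.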

Clustering algorithms

Okay, it's clustering time. Once again there are many clustering algorithms we could choose from. To name but a few: K-Means, DBSCAN and OPTICS. I have chosen K-Means. Why? Well, there's a really nice implementation of k-means in SciPy and I just love Python.

So I ran k-means with k = 3, chose 3 nice colors and plotted the time-series belonging to each cluster in the following plots. I plotted each individual time-series with a transparency of 0.5 and then plotted the average of these time-series (sometimes referred to as the signature of the cluster) with no transparency.
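The clustering step can be sketched roughly as follows. This is not the original script: it uses `scipy.cluster.vq.kmeans2` on synthetic profiles, and it passes explicit starting centroids (`minit="matrix"`) purely to keep the example deterministic. The helper names `bump` and `make_group` and all of the data are made up.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

hours = np.arange(24)
rng = np.random.default_rng(0)

def bump(center, width=1.5):
    """A smooth peak centered on the given hour."""
    return np.exp(-0.5 * ((hours - center) / width) ** 2)

def make_group(shape, n=10):
    """n noisy copies of a profile shape, each normalized to sum to 1."""
    profiles = np.clip(shape + rng.normal(0, 0.02, size=(n, 24)), 0, None)
    return profiles / profiles.sum(axis=1, keepdims=True)

# Synthetic stations: morning peak, evening peak, and both peaks.
profiles = np.vstack([
    make_group(bump(8)),
    make_group(bump(18)),
    make_group(bump(8) + bump(18)),
])

# Start from one profile per group so the run is repeatable.
initial_centroids = profiles[[0, 10, 20]]
centroids, labels = kmeans2(profiles, initial_centroids, minit="matrix")

# Each row of centroids is the mean of its cluster: the cluster "signature".
print(labels)
```

For plotting, matplotlib's `alpha` argument controls transparency, so the members of a cluster could be drawn with `alpha=0.5` and the corresponding row of `centroids` drawn on top at full opacity.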

Cluster 1: The morning commute cluster?

Cluster 2: The evening commute cluster?

Cluster 3: Both morning and evening commute cluster?

So what do these clusters look like on a map? Well, they seem to make sense. One interpretation of these results is as follows: "The majority of stations that are used for both morning and evening commutes (the blue stations) are in the city center. The morning commute stations (the green stations) are presumably used primarily by people travelling to the orange stations. Then, in the evening, the evening commute stations (the orange stations) are used by the same people to travel back home to the green stations."

Map of bike stations, colored by cluster membership.

A word of warning: the above is simply one interesting interpretation of these results. There are many possible explanations for such structure. Perhaps there are three types of cyclist: one group that likes to cycle in the mornings but never in the evenings (the green stations), another that likes to cycle in the evenings but never in the mornings (the orange stations) and a third that loves to cycle both in the mornings and in the evenings (the blue stations).

Is that it?

Of course not, but this time-series clustering has given me an even better idea, so my next post will be on something related but very different. At some stage I will try to post results which are harder to interpret, when k increases to 4, 5, 6, etc. I may also experiment with OPTICS and see how much the results differ. Also it would be nice to do the exact same analysis for bike returns instead of bike rentals.

About the visualization

As always, the above visualizations required a mash-up of tools and frameworks. All of the plots were created using Python and matplotlib. All of the processing (normalization, Euclidean distance and clustering) was done using Python and a combination of NumPy and SciPy. The map is once again powered by Leaflet and D3.

If for some reason you want to explore the live example, you can find it here.

Monday, March 11, 2013

Capital Bikeshare: Space & Time

This week I've been studying the spatial and temporal components of the Capital Bikeshare data-set. First off what do I mean by the spatial component? Well the bike stations have fixed locations (actually they don't, I'm lying). The stations were actually designed to be movable, they are not fixed to the ground at all (apart from the obvious effect of gravity). They use solar panels to power their rental system software and to log data to a remote server which provides the live availability feed.

According to Capital Bikeshare "Typically a station is only moved due to road construction or some other temporary issue and then it is moved back". Unfortunately this movement data is not contained in the quarterly data dumps so unless you've been logging it for the past few years (which I haven't) then you won't have access to it. If anyone does have access to this data and they would like to share it with me then please leave a comment below.

Anyways back to our discussion about space and time.

This is not a simulator

The first thing I did in order to study the spatial and temporal components was to build a tool which simulates the bikes moving on the network. In reality this is not really a simulator at all, because it's using the real historic rental data.

The blue dots are the bike stations and the moving red dots are bikes travelling from station to station. The data-set does not contain GPS traces, it contains tuples of (origin, destination, duration) so the red dots are simply moving along an interpolated path from origin to destination where the entire journey takes the correct duration of time.
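The interpolation idea can be sketched in a few lines. The actual tool animates the dots in the browser with D3, and the shape of its path is not specified, so this Python version assumes a straight line between the two stations. The function name `bike_position` and the coordinates are hypothetical.

```python
def bike_position(origin, destination, duration, elapsed):
    """Position of a bike `elapsed` seconds into a trip of `duration` seconds.

    Assumes straight-line (linear) interpolation between two (lat, lon)
    points; the fraction of the trip completed is clamped to [0, 1].
    """
    if duration <= 0:
        return destination
    fraction = min(max(elapsed / duration, 0.0), 1.0)
    lat = origin[0] + fraction * (destination[0] - origin[0])
    lon = origin[1] + fraction * (destination[1] - origin[1])
    return (lat, lon)

# Made-up coordinates for two stations and a 10 minute trip.
start = (38.90, -77.04)
end = (38.92, -77.00)
print(bike_position(start, end, duration=600, elapsed=300))  # roughly halfway
```

Because the position depends only on the elapsed fraction of the trip, speeding the playback up to 10x or 100x just means advancing `elapsed` faster.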

This is a really interesting exploratory tool. You can move the simulator to any date and time you want, and you can choose from a variety of speeds (1x, 10x, 100x and 1000x). At 100x you can clearly see different spatial and temporal patterns. You can watch people moving towards the city center during the morning commute and home again during the evening commute. You can watch certain areas of the city light up with activity at different times. You can also use the tool for general exploration: you can watch, for example, the very first bike rental, which was on Sept 15th 2010, or you can see how federal holidays such as Thanksgiving disturb typical travel patterns.

If you're not satisfied by simply watching the video, then you can explore the full data-set yourself by following this link.

You might have noticed that different stations appear as the simulator approaches December 2012. This is because Capital Bikeshare is an ever-growing network. If you go all the way back to the beginning of the data-set, you will see all of the stations fade away except for the very first station at 27th & Crystal Dr.

Conclusion

Okay, so the simulator is cool. You could use it to watch the entire data-set play out from the very first rental on Sept 15th 2010 until the very last rental released as of this date, on December 31st 2012. It's unlikely that anyone will ever do this though (I certainly won't). I think the simulator is better suited to exploring rental activity on specific dates and/or times.

About the visualization

The visualization combines quite a few different elements. The map is provided by Leaflet, an awesome JavaScript library for mobile-friendly interactive maps. You can change the map provider by hovering over the icon in the top right corner of the map. My favorite alternative is the visually appealing Stamen Watercolor.

The data has been stored in a PostgreSQL database with an index on the time column. This allows the simulator to quickly change to any date/time. Why PostgreSQL? I have found PostGIS and pgrouting very useful for some of my other projects.
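To illustrate why an index on the time column helps, here is a small self-contained sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, so that it runs without a database server. The table layout, column names, index name and rows are all assumptions and not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips ("
    "start_time TEXT, origin INTEGER, destination INTEGER, duration INTEGER)"
)
# The index lets the database jump straight to a time range
# instead of scanning every trip.
conn.execute("CREATE INDEX idx_trips_start_time ON trips (start_time)")

# Made-up trips; ISO-formatted timestamps sort correctly as text.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?)",
    [
        ("2012-11-21 17:45:00", 31018, 31245, 720),
        ("2012-11-22 09:00:00", 31245, 31018, 540),
        ("2012-11-22 14:30:00", 31018, 31245, 900),
        ("2012-11-23 08:10:00", 31245, 31018, 480),
    ],
)

def trips_starting_between(connection, start, end):
    """Return all trips with start <= start_time < end, oldest first."""
    cursor = connection.execute(
        "SELECT start_time, origin, destination, duration FROM trips "
        "WHERE start_time >= ? AND start_time < ? ORDER BY start_time",
        (start, end),
    )
    return cursor.fetchall()

rows = trips_starting_between(conn, "2012-11-22 00:00:00", "2012-11-23 00:00:00")
print(rows)
```

Jumping the simulator to a new date and time then amounts to running one range query of this kind for the window being played back.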

The transition animations are being powered by D3. I love D3, it makes this sort of thing so simple and elegant to code. The bike stations and bikes themselves are not actually Leaflet layers. D3 is being used to manipulate an SVG element that is sitting on top of an equally sized Leaflet canvas.

The datepicker is provided by Pikaday. I haven't used this in a project before but I quite like it.

Last but not least, the timepicker is provided by jQuery.

Monday, March 4, 2013

The Capital Bikeshare Data-set

Capital Bikeshare (also abbreviated CaBi) is a bicycle sharing system that serves Washington, D.C.; Arlington County, Virginia; and the city of Alexandria, Virginia. At the time of this writing (Feb 26th 2013), the network contains 1,670+ bicycles serving 175+ stations.

Station Map (Feb 26th 2013)

I really like Capital Bikeshare, not because I love cycling (I do) or because I live in Washington D.C. (I don't), but because of the data. Capital Bikeshare releases all of its historical data on system usage in quarterly data dumps. At the time of this writing, 9 quarters of data have been released: the last quarter of 2010, all of 2011 and all of 2012.

The 9 quarters of Capital Bikeshare data

I've been studying this data for the past few months now. The main purpose of this blog post is to share some of my work with you. After all, what's the point in doing something cool if you don't share it?

The effects of weather on bike rentals

The following is a screenshot of an interactive calendar I developed. This calendar displays a square for each day in the Capital Bikeshare data set. The squares are grouped by day, week and month to produce a beautiful mosaic describing bike rentals. The color of each square represents the number of bikes rented from the system on that day. Dark Red means few rentals, Dark Green means lots of rentals. Hovering over any square will display all of the corresponding weather information for that day in the table below the mosaic.

What does this interactive calendar teach us? Well, firstly, if we forget about the weather information for a moment, we see some general trends. In 2011 and 2012, the bulk of renting happens in the summer months, with reduced activity on either side. We also see that the system is increasing in popularity with time: all of the dark green is in 2012.

More interesting, I think, are the oddly colored squares. A red square surrounded by green squares indicates lower than expected rentals, and the more contrast, the more peculiar the day. The highest contrast square is Saturday Aug 27th 2011, which was the day of Hurricane Irene. Other days of high contrast (to choose but a few) include:

Sat Oct 29th 2011: 29.97mm of rain.

Sun Apr 22nd 2012: 32.26mm of rain.

Mon Oct 29th 2012: 97.79mm of rain.

Tue Oct 30th 2012: 21.34mm of rain.
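The days above were picked out by eye from the calendar. As a hypothetical automated analogue, the sketch below flags any day whose rental count falls well below the median of its neighbouring days. The function name `low_rental_days`, the window size, the threshold and the sample counts are all made up for illustration.

```python
from statistics import median

def low_rental_days(daily_counts, window=3, ratio=0.5):
    """Indices of days with fewer than ratio * median(neighbouring days) rentals.

    Neighbours are the `window` days before and after, excluding the day itself.
    """
    flagged = []
    for i, count in enumerate(daily_counts):
        before = daily_counts[max(0, i - window):i]
        after = daily_counts[i + 1:i + 1 + window]
        neighbours = before + after
        if neighbours and count < ratio * median(neighbours):
            flagged.append(i)
    return flagged

# Made-up week of daily totals with one stormy day in the middle.
counts = [5200, 5400, 5100, 900, 5300, 5500, 5000]
print(low_rental_days(counts))
```

A red square surrounded by green ones corresponds to exactly this situation: a low count in the middle of otherwise busy days.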

Federal holidays also disrupt normal bike rental usage, most notably Thanksgiving, Christmas and New Year's.

Click here to view the interactive calendar for yourself. If you find anything interesting that you would like to share, please post it as a comment.

Conclusion

In conclusion, weather has a rather dramatic effect on bike rentals. Thorough exploration of the calendar above shows that rainfall is the largest dampening factor (no pun intended) on rentals. This is followed closely by low temperature and then high wind speed. Most federal holidays also appear to have an effect on usage, some positive and some negative. The most influential are, of course, Thanksgiving, Christmas and New Year's.

So a note to Capital Bikeshare: if you need to do maintenance or upgrades on your network, please, please wait until a rainy day.

About the visualization

The calendar layout I used was created by the seriously talented Mike Bostock. I simply re-purposed it for my own needs. The weather data was downloaded from wunderground.