Visualizing the Vélib Stations in Paris using pandas and bokeh
In this post, we will visualize the Paris Vélib bicycle stations using pandas and then, to do interactive exploration, bokeh. The goal is to get familiar with the plotting syntax of bokeh, which is quite different from matplotlib, the classic plotting package in the Python scientific stack.
Fetching the data
JC Decaux, the company responsible for the Paris shared biking system Vélib, has an open-data service available here: https://developer.jcdecaux.com/#/opendata/vls?page=static. We can use it to fetch the static data describing the different stations.
import pandas as pd
df = pd.read_json("https://developer.jcdecaux.com/rest/vls/stations/Paris.json")
Let's look at the head of the data:
df.head()
| | address | latitude | longitude | name | number |
|---|---|---|---|---|---|
| 0 | RUE DES CHAMPEAUX (PRES DE LA GARE ROUTIERE) -... | 48.864528 | 2.416171 | 31705 - CHAMPEAUX (BAGNOLET) | 31705 |
| 1 | 52 RUE D'ENGHIEN / ANGLE RUE DU FAUBOURG POISS... | 48.872420 | 2.348395 | 10042 - POISSONNIÈRE - ENGHIEN | 10042 |
| 2 | 74 BOULEVARD DES BATIGNOLLES - 75008 PARIS | 48.882149 | 2.319860 | 08020 - METRO ROME | 8020 |
| 3 | 37 RUE CASANOVA - 75001 PARIS | 48.868217 | 2.330494 | 01022 - RUE DE LA PAIX | 1022 |
| 4 | 139 AVENUE JEAN LOLIVE / MAIL CHARLES DE GAULL... | 48.893269 | 2.412716 | 35014 - DE GAULLE (PANTIN) | 35014 |
Now, let's see what we can do with it!
Examining the data
A first question we can ask is "how many stations are there in each city or neighbourhood?". It turns out that we can extract a five-digit postcode from each address field quite easily, because the pandas Series.str.findall method accepts a regular expression as its pattern.
df['postcode'] = [item[0] for item in df.address.str.findall(r"\d{5}")]
df.head()
| | address | latitude | longitude | name | number | postcode |
|---|---|---|---|---|---|---|
| 0 | RUE DES CHAMPEAUX (PRES DE LA GARE ROUTIERE) -... | 48.864528 | 2.416171 | 31705 - CHAMPEAUX (BAGNOLET) | 31705 | 93170 |
| 1 | 52 RUE D'ENGHIEN / ANGLE RUE DU FAUBOURG POISS... | 48.872420 | 2.348395 | 10042 - POISSONNIÈRE - ENGHIEN | 10042 | 75010 |
| 2 | 74 BOULEVARD DES BATIGNOLLES - 75008 PARIS | 48.882149 | 2.319860 | 08020 - METRO ROME | 8020 | 75008 |
| 3 | 37 RUE CASANOVA - 75001 PARIS | 48.868217 | 2.330494 | 01022 - RUE DE LA PAIX | 1022 | 75001 |
| 4 | 139 AVENUE JEAN LOLIVE / MAIL CHARLES DE GAULL... | 48.893269 | 2.412716 | 35014 - DE GAULLE (PANTIN) | 35014 | 93500 |
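The list comprehension above takes the first match (`item[0]`), which raises an IndexError if an address happens to contain no five-digit run. As a minimal sketch of a more forgiving variant, here is the same pattern applied with the standard `re` module to two addresses from the table; the helper name `extract_postcode` is hypothetical and only serves the illustration.

```python
import re

# Same idea as the pattern used with str.findall: five consecutive digits.
POSTCODE_RE = re.compile(r"\d{5}")


def extract_postcode(address):
    """Return the first five-digit run in `address`, or None if there is none."""
    match = POSTCODE_RE.search(address)
    return match.group(0) if match else None


# Two addresses copied from the table, plus one made-up string without a postcode.
samples = [
    "74 BOULEVARD DES BATIGNOLLES - 75008 PARIS",
    "37 RUE CASANOVA - 75001 PARIS",
    "ADDRESS WITHOUT POSTCODE",
]
postcodes = [extract_postcode(address) for address in samples]
print(postcodes)
```

Returning None instead of raising keeps the rows with a missing postcode visible, so they can be inspected or dropped explicitly afterwards.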
This allows us to easily count the number of stations in given locations:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('bmh')
plt.figure(figsize=(10, 6))
df.groupby(by='postcode').size().plot(kind='bar')
plt.tight_layout()
The bar chart shows that the 15th arrondissement of Paris has the most stations.
We can also decide to plot each station as a dot on a map. Let's try that:
fig, ax = plt.subplots(figsize=(10, 8))
df.plot(ax=ax, kind='scatter', x='longitude', y='latitude')
plt.tight_layout()
We can faintly distinguish the contour of the Seine, where there are no Vélib stations.
Finally, a last visualization could be to compute the mean coordinates of stations for each postcode and plot them on a map:
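# numeric_only=True skips the text columns (address, name), which recent pandas versions refuse to average
mean_stations = df.groupby('postcode').mean(numeric_only=True)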
mean_stations.head()
| postcode | latitude | longitude | number |
|---|---|---|---|
| 75001 | 48.862984 | 2.339561 | 1021.384615 |
| 75002 | 48.868225 | 2.342684 | 2026.416667 |
| 75003 | 48.862940 | 2.359657 | 3013.733333 |
| 75004 | 48.855631 | 2.356978 | 4030.791667 |
| 75005 | 48.845414 | 2.348918 | 5029.631579 |
mean_stations.describe()
| | latitude | longitude | number |
|---|---|---|---|
| count | 51.000000 | 51.000000 | 51.000000 |
| mean | 48.856930 | 2.350670 | 23765.439084 |
| std | 0.029586 | 0.062251 | 13170.681267 |
| min | 48.808535 | 2.222189 | 1021.384615 |
| 25% | 48.832531 | 2.309834 | 13470.615741 |
| 50% | 48.856693 | 2.348918 | 21703.833333 |
| 75% | 48.881089 | 2.399400 | 33554.500000 |
| max | 48.909302 | 2.474920 | 44101.500000 |
mean_stations['station_count'] = df.groupby(by='postcode').size()
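The mean coordinates and the station count above come from two separate groupby calls. As a sketch of an alternative, the same summary can be built in one pass with named aggregation (available in reasonably recent pandas versions). The toy rows below are invented for illustration and are not real Vélib data; the names `toy`, `summary` and `busiest` are hypothetical as well.

```python
import pandas as pd

# Toy rows for illustration only (not real Vélib data).
toy = pd.DataFrame({
    "postcode": ["75001", "75001", "75002"],
    "latitude": [48.0, 49.0, 48.5],
    "longitude": [2.0, 3.0, 2.5],
    "number": [1001, 1002, 2001],
})

# One groupby, three output columns: mean position and number of stations.
summary = toy.groupby("postcode").agg(
    latitude=("latitude", "mean"),
    longitude=("longitude", "mean"),
    station_count=("number", "count"),
)

# Postcode with the largest number of stations.
busiest = summary["station_count"].idxmax()

print(summary)
print(busiest)
```

Applied to the real DataFrame, `idxmax` on the count column would give the postcode that stands out in the bar chart.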
We can also label the points, as described in this Stack Overflow thread.
def label_point(x, y, val, ax):
    a = pd.DataFrame({'x': x, 'y': y, 'val': val})
    for i, point in a.iterrows():
        ax.text(point['x'], point['y'], str(point['val']))
fig, ax = plt.subplots(figsize=(10, 8))
mean_stations.plot(ax=ax, kind='scatter', x='longitude', y='latitude', s=mean_stations['station_count'], color='red')
label_point(mean_stations.longitude.values, mean_stations.latitude.values, mean_stations.index, ax)
plt.tight_layout()
mean_stations.latitude.values
array([ 48.86298384, 48.86822491, 48.86294019, 48.85563069,
48.84541393, 48.84970745, 48.85669329, 48.87365634,
48.87717112, 48.8757906 , 48.85910098, 48.8404467 ,
48.82890592, 48.83029111, 48.84110915, 48.86006093,
48.88597374, 48.89068584, 48.88586487, 48.8625977 ,
48.83477101, 48.90304026, 48.81480097, 48.82522672,
48.86982978, 48.82117249, 48.88387195, 48.84312621,
48.81895035, 48.89330857, 48.88121858, 48.85665367,
48.86730718, 48.90918588, 48.90858672, 48.88096003,
48.90930234, 48.88471616, 48.90529932, 48.89585992,
48.80853453, 48.84670081, 48.83630349, 48.84275142,
48.81425975, 48.82415844, 48.81336175, 48.81130206,
48.84703447, 48.8199541 , 48.81861291])
Finally, we can put everything together: stations and mean locations of stations.
s = df.groupby(by='postcode').size()
cmap = list(s.index.values)
fig, ax = plt.subplots(figsize=(10, 8))
df.plot(ax=ax, kind='scatter', x='longitude', y='latitude',
        c=[cmap.index(item) + 1 for item in df.postcode.values],
        colormap='cubehelix', label='index of location')
mean_stations.plot(ax=ax, kind='scatter', x='longitude', y='latitude', s=100, color='red')
label_point(mean_stations.longitude.values, mean_stations.latitude.values, mean_stations.index, ax)
plt.tight_layout()
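In the plot above, `cmap.index(item)` turns each postcode into an integer color index by scanning the list of unique postcodes once per station. A minimal sketch of the same mapping with a dictionary lookup follows; the name `index_of` is hypothetical, and the sample list simply reuses the postcodes of the first five rows shown earlier.

```python
# Postcodes of the first five stations, as shown by df.head().
postcodes = ["93170", "75010", "75008", "75001", "93500"]

# groupby sorts its keys, so the list of unique postcodes is sorted too.
unique_sorted = sorted(set(postcodes))

# Build the postcode -> position mapping once, then look each station up.
index_of = {code: position for position, code in enumerate(unique_sorted)}
color_index = [index_of[code] + 1 for code in postcodes]

print(color_index)
```

The result is identical to the list comprehension based on `list.index`; the dictionary only avoids rescanning the list for every station.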
Using Bokeh
The maps I plotted in the previous section were static. This is a limiting factor when exploring a dataset. To really come to grips with the data, it is often useful to make it interactive, which is what we will do using bokeh. We will follow the quickstart guide to Bokeh and try to obtain the same plots as above using this framework.
To get a feeling for how bokeh works, we will first use the high-level bokeh.charts interface and then the medium- and low-level bokeh.plotting and bokeh.models.
High-level version
First, we import the different elements we need for bokeh.
import bokeh.plotting as bp
Let's tell bokeh to show things in the notebook:
bp.output_notebook()
Now, let's use the high-level function found in the charts module:
import bokeh.charts
p = bokeh.charts.Scatter(df, x='longitude', y='latitude', color='postcode',
tools="crosshair, hover, wheel_zoom, pan")
bp.show(p)
That was easy! The visualization is interesting and we didn't have much to do to obtain it.
What if we want a hover tool displaying the address over each station? I didn't find any easy way to extend the previous chart, so let's switch to a lower level of plotting and do this in detail.
Medium and low-level bokeh
We now need to do the following things to make our plot, from the medium or low-level perspective:
- create a figure
- add renderers (points in our case)
- show the plot
Let's do a simple scatter plot to show how this goes:
p = bp.figure(title="simple scatter plot")
p.scatter(x=df.longitude.values, y=df.latitude.values)
bp.show(p)
Now, let's customize this plot a little more:
- add colors to each dot according to postcode
- add labels showing the address of a station when hovering over it
We will start with the colors. I didn't figure out how to apply this easily with bokeh, so I had to resort to a manual generation of each color code using matplotlib classes, in particular a ScalarMappable.
import matplotlib as mpl
color_index = pd.Series([cmap.index(item) for item in df.postcode.values])
norm = mpl.colors.Normalize()
norm.autoscale(color_index)
sm = mpl.cm.ScalarMappable(norm, 'hot')
We can test the mapping to RGBA space using to_rgba:
sm.to_rgba(0.1, bytes=True)
(10, 0, 0, 255)
Finally, let's just generate the list of colors we need:
colors = [
    "#%02x%02x%02x" % (int(r), int(g), int(b))
    for r, g, b, a in [sm.to_rgba(item, bytes=True) for item in color_index]
]
colors[:10]
['#ffb700', '#830000', '#660000', '#0a0000', '#ffff22', '#ff0a00', '#ff9d00', '#c40000', '#9d0000', '#730000']
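The string formatting above turns each RGBA byte tuple into the hex color code that bokeh accepts. As a small sketch of that single step, here is a standalone helper (the name `rgba_bytes_to_hex` is hypothetical) applied to the tuple returned by `to_rgba` earlier.

```python
def rgba_bytes_to_hex(rgba):
    """Convert an (r, g, b, a) tuple of 0-255 values to '#rrggbb', dropping alpha."""
    r, g, b, _alpha = rgba
    return "#{:02x}{:02x}{:02x}".format(int(r), int(g), int(b))


# The tuple obtained from sm.to_rgba(0.1, bytes=True) above.
print(rgba_bytes_to_hex((10, 0, 0, 255)))
# A brighter color for comparison.
print(rgba_bytes_to_hex((255, 183, 0, 255)))
```

The alpha channel is discarded here, as in the original list comprehension. matplotlib also provides `matplotlib.colors.to_hex`, which performs a similar conversion starting from float RGBA values between 0 and 1.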
Let's now customize the tooltip shown while hovering. The way to do this is well described in the Bokeh tutorial about interactions:
- we need to build a datasource containing a description field
- and a hover tool, based on this description field from the data source
import bokeh.models as bm
source = bm.ColumnDataSource(
    data=dict(
        x=df.longitude.values,
        y=df.latitude.values,
        c=colors,
        desc=df.address.values,
    )
)
hover = bm.HoverTool(
    tooltips=[
        ("address", "@desc"),
    ]
)
pan = bm.PanTool()
zoom = bm.WheelZoomTool()
Finally, here's the scatter plot, in low-level plotting language, with hovering tooltips!
p = bp.figure(title="Vélib stations in Paris",
tools=[hover, pan, zoom])
p.circle(x='x', y='y', fill_color='c', size=10, source=source)
bp.show(p)
I've just found out that it is possible to plot markers on top of a Google Map using bokeh. Let's try and do this:
geo_source = bm.GeoJSONDataSource(
    data=dict(
        x=df.longitude.values,
        y=df.latitude.values,
        c=colors,
        desc=df.address.values,
    )
)
hover = bm.HoverTool(
    tooltips=[
        ("address", "@desc"),
    ]
)
pan = bm.PanTool()
zoom = bm.WheelZoomTool()
p = bp.figure(title="Vélib stations in Paris",
tools=[hover, pan, zoom])
p.circle(x='x', y='y', fill_color='c', size=10, source=geo_source)
bp.show(p)
Unfortunately, this doesn't work yet. There are several bug reports describing this behaviour (one of them is here: https://github.com/bokeh/bokeh/issues/3737). Hopefully, this will get fixed soon!
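Independently of the linked bug reports, one thing may be worth checking: as far as I can tell from the Bokeh reference, GeoJSONDataSource is configured through a `geojson` property holding a GeoJSON string, rather than through a `data` dict of columns. The sketch below only builds such a string with the standard `json` module, using two stations from the table; the helper name `stations_to_geojson` is hypothetical, and whether the resulting source renders correctly depends on the installed Bokeh version.

```python
import json


def stations_to_geojson(rows):
    """Build a GeoJSON FeatureCollection string from (lng, lat, desc) tuples."""
    features = [
        {
            "type": "Feature",
            # GeoJSON stores coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lng, lat]},
            "properties": {"desc": desc},
        }
        for lng, lat, desc in rows
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})


# Two stations copied from the table above.
rows = [
    (2.319860, 48.882149, "74 BOULEVARD DES BATIGNOLLES - 75008 PARIS"),
    (2.330494, 48.868217, "37 RUE CASANOVA - 75001 PARIS"),
]
geojson_str = stations_to_geojson(rows)
print(geojson_str[:60])

# Assumed usage, not run here: bm.GeoJSONDataSource(geojson=geojson_str)
```

With a source built this way, each feature's properties (here `desc`) should be usable in the hover tooltip in the same way as a column of a ColumnDataSource.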
Using Folium
A last thing I wanted to try was to use Folium for displaying interactive maps. It makes it very simple to put markers on a map with OpenStreetMap tiles.
import folium
map_osm = folium.Map(location=[48.86, 2.35])
for lng, lat, desc in zip(df.longitude.values,
                          df.latitude.values,
                          df.address.values):
    map_osm.circle_marker([lat, lng], radius=100, popup=desc)
map_osm
That's it for today! I hope you had fun!
This post was entirely written using the IPython notebook. Its content is BSD-licensed. You can see a static view or download this notebook with the help of nbviewer at 20160205_VisualizingVelibStations.ipynb.