# Hands-on mobility data workshop ### 31.07.2019, Alexandra Kapp und Fabian Dinklage
## Slides https://technologiestiftung.github.io/citylab-slides/mobility-hands-on/ Notes: - During the input we will share some links for you
## Agenda - Context: Why bike-sharing data - Data collection - Data cleaning and data quality - Data Visualization - Pseudonymisation - **Data Dive:** Brainstorm Ideas - **Data Dive:** Start Analysis
## Context - **Why bike-sharing?** - Detailed mobility data for whole Berlin - (Non)-existent obligations of providers - Many possibilities: - urban planning, - businesses (e.g. mobility apps), - monitoring of compliance with city laws (e.g. prohibited parking zones) - **Data:** - trips of three bike-sharing providers for Berlin over three months period
## Use Cases What are interesting use cases, analyses and visualizations that are possible with this data?
## Procedure
## Data collection - APIs of bike-sharing providers: real-time locations of all available bikes - **Challenge 1**: Get Data from APIs - [Details on different APIs](https://lab.technologiestiftung-berlin.de/projects/bike-sharing/de/) - **Challenge 2**: Collecting all trips - Query all APIs every 4 minutes - Run script via Uberspace server with a cronjob - store data in a database (AWS RDS Postgres Database)
## Data collection
## Data Collection - **Challenge 3**: Amount of Data - With queries every 4 minutes and approx. 10.000 bikes thats: - 150.000 rows each hour - 3.500.000 rows each day - 25.000.000 rows each week
## Data Collection - **Challenge 3**: Amount of Data - After three month 20 GB storage was full - Start of daily cleaning script (again: cronjob on Uberspace): Delete all rows, were the location of the bike did not change (Reduction from 1.5 billion to 5 million rows) _
## Data Cleaning [Jupyter Notebook with cleaning code](https://github.com/technologiestiftung/bike-sharing/blob/master/src/analysis/preprocess.ipynb)
## Data Cleaning: Implausible Peaks over time
## Data Cleaning: Implausible Provider
## Data Cleaning - **Raw:** 1,5 Billion rows - **After cleaning of database:** 4,9 Mio rows - **Remove data outside Berlin:** 4,8 Mio rows - **Merge locations to trips:** 2,4 Mio. trips - **Remove trips < 200 meters OR trips longer than 24 h OR trips faster than 30 km/h:** 0,5 Mio. trips - **Remove provider with implausible data:** 0,3 Mio. trips
## How to approach large datasets: - visually appealing - with good performance - in it's completeness Note: - In the next slides we dive into the process of visualizing this dataset - Now we're at the point where we're generating tons of cleaned data in our db - So how handle and visualize these in an appropriate way?
## Introducing Deck.gl - visualize large-scale Geospatial Data - variety of visualization layers out of the box available - hardware accelerated (Web.gl) Note: - If you want to handle large geospational datasets - visualizing them as nodes on the browser can be a pain in the ass - Fortunately there are hardware accelerated solutions (Web.gl) - One approach is the framework deck.gl
## [Deck.gl examples](https://deck.gl/#/examples/core-layers/geojson-layer-paths) Note: - the framework makes it possible to visualize 100.0000s of datapoints - as you can see at this example of mobility networks in the states - The framework provides a set of different layers for geospatial data - please check out other example such as the geojson layer mapping property values on districts of the city - all layers use the hardware accelerated map service mapbox as well - The documentation is extensive - Preformed containers where you can just throw your data in - For our data we decided to start with the trips layers - Before we go into detail about data preprocessing let's have a look into the visualisation
## [Our prototype](https://fabiandinklage.com/projects/bikesharing/) Note: - We wanted to make the topic of shared mobility more tangible - This visualization is based on the [trips layer](https://github.com/uber/deck.gl/tree/master/examples/website/trips) - Buildings layer added - Camera Handler to show specific spots on the map - daily data is made available through some magic in the backround - Daily Trips peaks Histogram for analysis purposes in d3 - Now let's check out how things work under the hood.
## Data output - bikes-accessible.geojson - bikes-trips.geojson - bikes-trails.json Note: - To make the data most accessible we're generating a set of three distinct datasets from the database - In order to accomplish that, a set of calculation steps are necessary
## Required steps - aggregate bike rows to trips - Creates (possible) routes with start- and endpoints - Calculate distances between waypoints ([Turf.js](https://turfjs.org/)) - Speed of bike trips - Save in appropriate data format for visualization Note: - we created a node script that accomplishes all these steps - we aggregate the data from the database - sorted the db by bikeid and time - if the time changes but location doesn't, the bike is accessible during that time - all rows where location doesn't change are removed - if time and location changes, the bike is being used - First we used a service provider (Here, Osm, Google) to calculate routes between - let me show you what's meant by that: - we have start and endpoint of one bike with two timestamps - but we don't know how the bike came from a to b - The Api just needs a start and end location and delivers waypoints of a possible route - in our special case we also need to calculate the distances between the routed waypoints - compare distances with total trip length to find out the speed of a bike - this information is only interesting if we want to animate bike over time - which is what we are actually doing in our viz - after doing the calculations - Transfer the data into appropiate geojson files - In the first step we did it for one weekend - Make use of the geojson format - As a little refreshment here an overview of the gejson
```javascript const geojson = { "type": "FeatureCollection", "features": [ { "type": "Feature", "geometry": { "type": featureType, // point, polygon, line and more ... "coordinates": [] // [lat,lng] here. }, "properties": { ... } } ] } ``` Notes: - Object with a key that holds an array - Each object in array represents one item, depending on the type your geojson has - Inside the properties key more data about the item that can be stored
```javascript [ { "vendor": 0, "segments": [ // all segments represent a trip [ 13.3782979, // longitude 52.5076544, // latitude 39497.8 // timestamp ], ... ] } ] ``` Notes: - in case of animted trips, the data format has to be changed a little - you have an array of objects - the vis will interpolate between two locations and timestamps - bikes will move in between - first we did this fow two days and it was already more than 20mb to load - which leads to the next question:
## How does it scale? - make newest data inside visualization available (another **[cronjob](https://gist.github.com/fdnklg/2c784e82735a3122acf1c07009795fd4)**) - transform and write data daily (cronjob at **[uberspace](https://uberspace.de/en/)**) - create own routing service (**[osrm-docker](https://hub.docker.com/r/osrm/osrm-backend/)** at **[aws](https://docs.aws.amazon.com/efs/latest/ug/gs-step-one-create-ec2-resources.html)**) Note: - Automate your processes - If you want to visualize up to date mobility data for each day a couple of steps are necessary - To become independent from request limits of routing providers it's possible to use open source solutions like osrm - I just want to briefly tell you how the different services are working together - If you're interested how things work on a deeper level we can later dive into the code together
## Pseudonymisation - **NO** unique bike ID and **NO** provider ID given - Fields: - TripID - a unique ID - Start date & time rounded to the nearest 15 minutes - End date & time rounded to the nearest 15 minutes - StartLatitude - rounded to nearest 3 decimal places - StartLongitude - rounded to nearest 3 decimal places - EndLatitude - rounded to nearest 3 decimal places - EndLongitude - rounded to nearest 3 decimal places - TripDuration - duration of the trip minutes - TripDistance - distance of trip in miles based on company route data [Standard of Harvard Civic Analytics Network](https://data.louisvilleky.gov/dataset/dockless-vehicles)
## Pseudonymisation - **pseudonomysed_raw.csv**: anononymised raw data (2.5M trips) - **pseudonomysed.csv**: cleaned anonymised raw data (0.3M trips)
## Data Dive - Ideas **get into small groups** **15 min** define research questions or sketch visualization ideas
## Data Dive - Data ### Raw Data: https://send.firefox.com/download/1d2460498f803b0f/#euosKRfI1-2si_il9sbqpg ###### . ### Preprocessed for DeckGL: https://send.firefox.com/download/471fca96e8e734d3/#3LPMEYJawiE7j8cHR0f-Zg ###### . ### GitHub Repo Data Collection and Analysis: https://github.com/technologiestiftung/bike-sharing ###### . ### GitHub Repo Visualization: https://github.com/technologiestiftung/deck-gl-geojson-example
## Data analysis - Option 1: Work with data yourself - Option 2: Follow examples of analysis in Jupyter Notebook - Option 3: Try out visualizations with DeckGL
## Results
## Contact: ### Alexandra Kapp kapp@technologiestiftung-berlin.de\ [@lxndrkp](https://twitter.com/lxndrkp) ###### . ###### . ### Fabian Dinklage dinklage@technologiestiftung-berlin.de\ [@fdnklg](https://twitter.com/fdnklg) ###### . ###### . [@citylabberlin](https://twitter.com/citylabberlin)\ [citylab-berlin.org](https://citylab-berlin.org)