LynxKite tutorial on Bike-sharing dataset

fiddlefoddle
7 min read · Jul 2, 2020

Case-study: Baywheels Bike-sharing using LynxKite

Introduction

In our first article, we illustrated how graph theory can be handy in analysing IoT data. In this article, we explore how to generate the graph visualisations and analyses we saw there to answer key questions from business stakeholders. We will be using the LynxKite UI, an open-source tool which has just been made free and accessible for anyone to try! As a drag-and-drop platform, it has a gentle learning curve and is easy to pick up. I will demonstrate how the analyses can be done on this platform.

The finished workspace can be found here: Bike-sharing Tutorial.

Building the graph

  1. Import the data with the import operations. LynxKite can take in 7 different data formats (CSV, JSON, Neo4j etc.). Use Import CSV for the stations and Import Parquet for the trips.
  2. Create the stations as vertices using Table as Vertices and the trips between the stations as edges using Table as Edges.
  3. Next, convert the latitude and longitude of the stations into a position attribute based on their geolocations. This can be done with Convert vertex attributes to Position.
  4. Connect the vertices to the edges by merging them on the station ids present in both datasets, and the graph is built! There are 70 vertices and 670k edges. The default visualization shows only the crucial vertices, though it can also show all vertices if selected.

  5. To plot the vertices onto a physical map, visualise the position attribute as geo-coordinates, showing the actual positions of the stations on the map. As there are 5 clusters on the map, colouring the vertices by the ‘city’ label helps differentiate the clusters in a more visual manner.
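The graph-building steps above can be sketched in plain pandas. The column names (station_id, start_station_id, etc.) and the toy data are assumptions for illustration, not the exact schema of the Baywheels files:

```python
import pandas as pd

# Hypothetical stand-ins for the imported stations CSV and trips Parquet.
stations = pd.DataFrame({
    "station_id": [1, 2, 3],
    "lat": [37.79, 37.78, 37.33],
    "long": [-122.39, -122.40, -121.89],
    "city": ["San Francisco", "San Francisco", "San Jose"],
})
trips = pd.DataFrame({
    "trip_id": [10, 11, 12],
    "start_station_id": [1, 1, 2],
    "end_station_id": [2, 2, 3],
})

# Attach a (lat, long) position attribute to each vertex,
# mirroring "Convert vertex attributes to Position".
stations["position"] = list(zip(stations["lat"], stations["long"]))

# Connect edges to vertices by joining on the station ids in both datasets.
edges = (trips
         .merge(stations, left_on="start_station_id", right_on="station_id")
         .merge(stations, left_on="end_station_id", right_on="station_id",
                suffixes=("_src", "_dst")))
print(len(stations), len(edges))  # vertex and edge counts
```

In LynxKite each of these steps is a box on the workspace; the joins here play the role of the merge on station ids.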

Factors for building the graph
Initial Map

Using six operations, the graph is built. LynxKite UI looks like this so far:

LynxKite steps for setting up vertices and edges

Business Objective 1: User Journey

In this section, the way the trips are mapped out will be explored to derive insights that may be useful for the business.

As shown in the visualisation above, very few trips happen between the cities, so the edges between the cities are so weak as to be nearly invisible.

  1. Apply a soft filter to view the trips happening within San Francisco (SF), the city with the biggest cluster. For a company managing multiple cities which may exhibit different trends, it makes sense to analyse at the city level.

The resulting visualization is shown below:

Trips taken in San Francisco
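A soft filter keeps non-matching elements but de-emphasises them in the visualisation, rather than dropping them outright. The idea can be sketched in pandas with hypothetical column names and made-up rows:

```python
import pandas as pd

# Hypothetical trips with a city attribute on the start station.
trips = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "city": ["San Francisco", "San Jose", "San Francisco"],
})

# A hard filter would drop rows; a soft filter keeps them all but marks
# which ones match, so the view can fade out the rest.
trips["matches_filter"] = trips["city"] == "San Francisco"
sf_trips = trips[trips["matches_filter"]]
print(len(sf_trips))  # trips inside SF
```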

Popular Routes

  1. The thickness of the routes shows the popularity of the trips taken. As the visualisation can be cluttered, a filter can be applied to sieve out the more popular trips.
  2. Using Convert edge attributes to vertex, aggregate the number of trips as the count of the trip IDs.
  3. Plot the graph with the attribute edge_ID_count as size, so that routes are differentiated by the size of the vertices. The histogram shows the distribution of the number of trips taken; filtering above a threshold highlights the more frequently taken trips.

Filtering to trips with more than 30,000 instances, the routes and directions become clearer. The most popular routes are between Harry Bridges Plaza (Ferry Building), Embarcadero at Sansome, San Francisco Caltrain (Townsend at 4th) and San Francisco Caltrain (330 Townsend).
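The counting-and-thresholding logic behind edge_ID_count can be approximated in pandas. The data and the threshold below are toys (the article's actual threshold is 30,000 trips):

```python
import pandas as pd

# Hypothetical trip records; one row per trip, like the Parquet edges.
trips = pd.DataFrame({
    "start_station": ["Ferry Building"] * 4 + ["Caltrain"] * 2,
    "end_station": ["Embarcadero"] * 4 + ["Ferry Building"] * 2,
})

# Count trips per (start, end) route, mirroring edge_ID_count.
route_counts = (trips.groupby(["start_station", "end_station"])
                .size().reset_index(name="edge_ID_count"))

# Keep only routes above a popularity threshold (3 here for the toy data).
popular = route_counts[route_counts["edge_ID_count"] > 3]
print(popular)
```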

Most popular trips

The most popular route indicated by the thickest edge is from Harry Bridges Plaza (Ferry Building) to Embarcadero at Sansome, while the most popular station indicated by label size is San Francisco Caltrain (Townsend at 4th).

Duration of the trips

Given that there are so many trips happening, how can the business find the time taken for the routes and understand the distribution of the stations better?

It takes just 4 steps on LynxKite:

LynxKite UI for duration
  1. Use Merge parallel edges by Attribute: start_station_id to aggregate all the edges that start at one vertex and end on another. There are different kinds of aggregation (median, first, average); in this case, the average of the duration is taken to get a feel for how long a route takes.
  2. It is important to create insights that are useful at first glance: the duration is better comprehended in minutes, but it is currently in seconds, which is less intuitive. Using Derive Edge Attribute to calculate the duration in minutes with the formula (duration_average/60).round creates another edge attribute named duration_mins.
Derive Edge Attribute function

  3. When visualising, select a number of centres, for example 10. Use the duration as the edge label to visualise the time taken.

Duration (mins) on the routes taken
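The merge-and-derive steps above can be sketched as follows, with hypothetical column names and toy durations standing in for the real trips table:

```python
import pandas as pd

# Hypothetical trips with duration recorded in seconds.
trips = pd.DataFrame({
    "start_station_id": [1, 1, 2],
    "end_station_id": [2, 2, 3],
    "duration": [300, 420, 600],  # seconds
})

# Merge parallel edges: one edge per (start, end) pair, averaging duration.
merged = (trips
          .groupby(["start_station_id", "end_station_id"], as_index=False)
          .agg(duration_average=("duration", "mean")))

# Derive the minutes attribute, like (duration_average/60).round in LynxKite.
merged["duration_mins"] = (merged["duration_average"] / 60).round()
print(merged)
```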

Business Objective 2: Segmentation

Breaking the visualisations down by different segments helps us understand what the relationships look like among different groups.

User Type

There are two kinds of users: Subscribers and Customers. Subscribers pay a regular fee depending on their subscription type (annual or monthly). They rent the bikes at a lower price than customers but generate more income. Therefore, it is important to understand the relationship between the trips taken and the user type. This helps the business understand which factors can convert existing customers into subscribers.

One way of doing so is to create two filters side by side to visualise both diagrams:

Customer

Customer’s Journeys

Subscriber

Subscriber’s Journeys

We can tell the difference between the types of routes taken by subscribers and customers, which allows us to find an effective targeting strategy.
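The paired side-by-side views boil down to filtering the same graph on the user type attribute. A minimal pandas sketch with made-up data:

```python
import pandas as pd

# Hypothetical trips tagged with user type.
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Customer", "Subscriber", "Customer"],
    "start_station": ["Caltrain", "Ferry Building", "Caltrain", "Embarcadero"],
})

# One filtered view per user type, mirroring the two side-by-side diagrams.
customers = trips[trips["user_type"] == "Customer"]
subscribers = trips[trips["user_type"] == "Subscriber"]
print(len(customers), len(subscribers))
```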

Business Objective 3: Inventory Management

To understand how the business operates, knowing the inventory is important. There are a few aspects that can be delved into further.

Centrality

Centrality is a good way of quantifying how central a vertex is relative to the other vertices. Using one operation, Compute Centrality, we can derive the centrality of each station. Different methods are available, such as the Average distance and Harmonic measures.

LynxKite UI for the centrality mapping
  1. Visualise the centrality as a vertex label and label colour in its spectrum.
Centrality degrees of each station (vertices)

The lighter the colour, the closer the centrality is to 1, meaning a more central vertex. From the graph above, it can be seen that the familiar Station 70, San Francisco Caltrain (Townsend at 4th), is the most central node. This indicates that it is frequently visited and a good station to spread information from.
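For intuition, a harmonic-style centrality (one of the measures LynxKite offers; its exact definition in the product may differ) can be computed by hand on a toy directed graph. The station ids below are made up:

```python
from collections import deque

# Toy directed station graph as an adjacency list (hypothetical ids).
graph = {70: [50, 55], 50: [70], 55: [70, 50]}

def harmonic_centrality(g, node):
    """Sum of 1/d(other, node) over vertices that can reach `node`."""
    # Build reversed edges so BFS measures distances TO `node`.
    rev = {v: [] for v in g}
    for u, nbrs in g.items():
        for v in nbrs:
            rev[v].append(u)
    dist = {node: 0}
    q = deque([node])
    while q:
        u = q.popleft()
        for v in rev[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return sum(1 / d for n, d in dist.items() if n != node)

scores = {v: harmonic_centrality(graph, v) for v in graph}
print(scores)
```

Vertices that many others can reach in few hops score higher, which is exactly why central stations light up in the visualisation.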

Bottleneck Stations

Using the in- and out-degrees of the stations, we can identify stations where outgoing edges outnumber incoming ones (or vice versa); these may require more bike replenishments or more docks to balance their supply and demand.

This can be calculated in 6 steps each:

LynxKite UI for Bottleneck stations
  1. Compute the incoming degrees with Compute Degree by the incoming edges.
  2. Compute the outgoing degrees with Compute Degree by the outgoing edges.
  3. Use the operation Derive vertex attributes to derive the following two ratios:

In-out ratio: incoming/outgoing

A high in-out ratio means that a lot more bikes park here than leave, indicating a high supply of bikes. In other words, there may be a lack of docks.

Out-in ratio: outgoing/incoming

A high out-in ratio means that a lot more bikes leave here than arrive, which could result in a shortage, indicating a high demand for bikes. In other words, there may be a lack of bikes.

  4. Rank the ratios in descending order with Add Rank Attribute.

  5. Filter to the top 10 vertices to see which stations are most at risk and why.

  6. Visualising the in_out_degree or out_in_degree helps in understanding the stations more easily. Choose the colour scale with { } if dealing with absolute values.

Highest in-out ratio
Highest out-in ratio
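The degree-and-ratio steps can be approximated in pandas on toy data (the column names are assumptions standing in for the Compute Degree and Derive vertex attributes boxes):

```python
import pandas as pd

# Hypothetical trips between stations.
trips = pd.DataFrame({
    "start_station_id": [1, 1, 1, 2, 3],
    "end_station_id":   [2, 2, 3, 1, 1],
})

# In- and out-degree per station, mirroring the two Compute Degree boxes.
out_deg = trips["start_station_id"].value_counts()
in_deg = trips["end_station_id"].value_counts()
degrees = pd.DataFrame({"outgoing": out_deg, "incoming": in_deg}).fillna(0)

# Derive both ratios, as in "Derive vertex attributes".
degrees["in_out"] = degrees["incoming"] / degrees["outgoing"]
degrees["out_in"] = degrees["outgoing"] / degrees["incoming"]

# Rank descending and keep the top stations, like Add Rank Attribute + filter.
print(degrees.sort_values("out_in", ascending=False).head(1))
```

Station 1 here sends out more bikes than it receives, so it tops the out-in ranking: a candidate for replenishment.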

Conclusion

Understanding the data is an important step for a business to gain insights into its operations and make data-driven decisions towards its objectives. Especially when the data has multiple components, it is useful to break them down and dissect them. Being able to generate insights quickly helps in making strategic decisions for the company. An efficient drag-and-drop tool that achieves the same effect as coding from scratch can save the business time and effort. Anyone on the team can handle the software once they learn the basics.

If you are interested in the business insights that accompany these analyses, you may read them over here.
