Using iterated local alignment to aggregate GPS trajectories into a traffic flow map

Tarn Duong (Paris, France F-75000. Email: [email protected])

Abstract

Desire line maps are widely deployed for traffic flow analysis by virtue of their ease of interpretation and computation. They can be considered to be simplified traffic flow maps, whereas the computational challenges in aggregating small scale traffic flows prevent the wider dissemination of high resolution flow maps. GPS trajectories are a promising data source to solve this challenging problem. The solution begins with the alignment (or map matching) of the GPS trajectories to the road network. However even the state-of-the-art map matching APIs produce sub-optimal results with small misalignments. While these misalignments are negligible for large scale flow aggregation in desire line maps, they pose substantial obstacles for small scale flow aggregation in high resolution maps. To remove these remaining misalignments, we introduce innovative local alignment algorithms, where we infer road segments to serve as local reference segments, and proceed to align nearby road segments to them. With each local alignment iteration, the misalignments of the GPS trajectories with each other and with the road network are reduced, and so converge closer to a minimal flow map. By analysing a set of empirical GPS trajectories collected in Hannover, Germany, we confirm that our minimal flow map has high levels of spatial resolution, accuracy and coverage.

Keywords: Desire lines/spider diagram, floating car data FCD, map matching, route finding

1 Introduction

One of the fundamental quantities in transport planning is a traffic flow map, i.e. a map of the traffic flow levels on the road segments in a road network (Ortúzar and Willumsen, 2011). Whilst traffic flow maps are a rich source of information about vehicle mobility patterns, they are costly in terms of time and resources to compute for any reasonably sized road network. To alleviate this cost burden, most approaches restrict the spatial coverage and resolution of the flow map. One of the most common is to place sensors at fixed locations in the road network, whose results are then visualised as a traffic count map. Another is to use trip intent/recall questionnaires to inform large scale properties such as trajectory origins and destinations, which can be visualised as a desire line flow map/spider diagram (Tobler, 1987). Both of these are simplified flow maps, since the detailed mobility patterns outside of the sensor locations or origin/destination pairs are not known. These unknown patterns can be inferred from other data sources, such as route assignment models (Evans, 1976). While these model-assigned routes are highly detailed, the trade-off is that they are not guaranteed to correspond closely to empirical mobility patterns.

Thus an ideal data source combines the empirical information of road sensor counts or questionnaires with the small scale details of model-assigned trajectories. This gap in the market can be filled by GPS trajectory data. Due to the prevalence of GPS-enabled devices, such as vehicle navigation guides and mobile telephones, GPS trajectory data can be acquired with low marginal cost whilst at the same time offering extensive spatial coverage and resolution of empirical mobility patterns (Herrera et al., 2010, Andrienko and Andrienko, 2013). Throughout, we employ the term ‘GPS trajectory’ data rather than Floating Car Data (FCD) or Floating Mobile Data (FMD), as our analysis is not restricted to trajectory data from cars or from mobile phones, so these terms are equivalent for our purposes. An example is the set of 1183 GPS trajectories collected from a GPS-enabled mobile phone, from December 2017 to March 2019, by a single driver in Hannover, Germany, with an overall average sampling rate of about 1 GPS point per second (Zourlidou et al., 2022). They are plotted as the green circles in Figure 1.

Figure 1: GPS trajectories in Hannover, Germany. (a) City level. (b) Neighbourhood level, zoom of black rectangle. (c) Small neighbourhood level, zoom of blue rectangle. Trajectory ID = 7 (orange circles), 315 (purple diamonds), others (green circles).

In Figure 1(a), at the city level, the GPS points (green circles) appear to be aligned to the road network. If we zoom in on the small black rectangle in the central region, then in Figure 1(b) at the neighbourhood level, we observe that the GPS points deviate from the road network. This deviation is clearer in the closer zoom in Figure 1(c). Moreover, if we focus on the orange circles (Trajectory ID = 7) and purple diamonds (ID = 315), then we observe that the vehicle location between the recorded GPS points is unknown. These maps illustrate the errors in GPS trajectories. These are broadly classified as ‘measurement errors’ where the recorded GPS coordinates are not the true locations, and ‘sampling errors’ where the information about the trajectory is lost in between recorded GPS coordinates (Ortúzar and Willumsen, 2011, Saki and Hagen, 2022).

Our goal is to produce a traffic flow map from these noisy GPS trajectories, which can be utilised at any scale, from the city/regional level to the individual road segment level. This requires us to minimise the errors in GPS trajectories. Our approach is composed of two stages. The first stage is to align the GPS trajectories to the road network, which is known as map matching (Quddus et al., 2007, Chao et al., 2020). It produces a route, which is a connected sequence of road segments in the road network, that is consistent with the GPS trajectory. Our contribution is an improvement to standard map matching by adding post hoc route finding. In common with many open source transport planning tools, we employ APIs based on the OpenStreetMap (OSM) network. While we are able to improve the overall alignment to the road network, these map matched routes inherit incompressible misalignments, ranging from several centimetres to several metres, from the OSM road network. These small misalignments prevent the accurate aggregation of traffic flows at this scale.

The second stage is to resolve these misalignments, and our contribution here is the proposed local alignment of map matched routes. In contrast to global alignment approaches, we locally infer which road segments should serve as a local reference, and then proceed to align other nearby road segments to it. To accomplish this, we introduce several novel algorithms which employ a mix of advanced statistical and geospatial methods. These include ‘node snapping’ where the nearby boundary points of road segments are combined via statistical clustering, and ‘line blending’ where road segments, which are near to each other but do not share overlapping road sub-segments, are aligned to maximise their overlapping sub-segments. Inputting these locally aligned routes into a flow aggregation API leads to a more accurate flow map. Iterating these local alignments in turn leads to a minimal flow map.

The outline of the paper is as follows. In Section 2, we describe the computation of map matched routes using off-the-shelf map matching and route finding APIs. In Section 3, we describe the local alignment of the map matched routes and their aggregation into a flow map. In Section 4, we demonstrate that the locally aligned flow map from the Hannover GPS trajectories is well-aligned to the OSM road network, and has a high level of accuracy and spatial coverage of estimated traffic flows compared to reference traffic flows. We then discuss some software implementation issues, and some concluding remarks.

2 Alignment of GPS trajectories with map matching and route finding

We introduce some mathematical notation to state precisely the problem of map matching. We represent the road network by a graph $\mathcal{N}=\mathcal{N}(\mathcal{E},\mathcal{V})$ , where the $n_{\mathcal{E}}$ edges $\mathcal{E}=\{{\bm{e}}_{1},\dots,{\bm{e}}_{n_{\mathcal{E}}}\}$ of this graph are the road segments and the $n_{\mathcal{V}}$ nodes/vertices $\mathcal{V}=\{{\bm{v}}_{1},\dots,{\bm{v}}_{n_{\mathcal{V}}}\}$ indicate that two (or more) different road segments are accessible to/from each other at this node point. These nodes ${\bm{v}}_{i}$ are single GPS points. Each road segment is composed of a sequence of connected piecewise linear segments, so ${\bm{e}}_{i}=\{{\bm{e}}_{i,1},\dots,{\bm{e}}_{i,n_{i}}\}$ is an ordered sequence of $n_{i}$ GPS points ${\bm{e}}_{i,j},j=1,\dots,n_{i}$ . We refer to this as a ‘linestring’ geometry, following the Open Geospatial Consortium terminology (OGC, 2010). We set $\mathcal{N}$ to be the OSM network (https://www.openstreetmap.org), which is a freely available road network with global coverage.

We denote a single GPS trajectory $G=\{{\bm{g}}_{1},\dots,{\bm{g}}_{n_{G}}\}$ as a temporally ordered sequence of $n_{G}$ GPS points ${\bm{g}}_{i},i=1,\dots,n_{G}$ . This is known as a ‘multipoint’ geometry (OGC, 2010). Whilst some authors require that each GPS point ${\bm{g}}_{i}$ be accompanied by their timestamp $t_{i}$ to be considered a GPS trajectory, this is not strictly required since ${\bm{g}}_{i}$ are ordered according to the timestamps, even if the timestamps themselves are not recorded in the GPS trajectory. Due to measurement error, the points of a GPS trajectory $G$ are not necessarily coincident with the road network $\mathcal{N}$ , and due to sampling error, there is no information about the vehicle in between the GPS points of $G$ .

We represent the output of a map matching algorithm $M$ from an empirical GPS trajectory $G$ as $M(G)=\{{\bm{m}}_{1},\dots,{\bm{m}}_{n_{M}}\}$ which is an ordered, connected sequence of $n_{M}$ edges. The goal is that the map matched route follows closely the road network $\mathcal{N}(\mathcal{E},\mathcal{V})$ . Whilst it is straightforward to ensure that all boundary points of the segments $\{{\bm{m}}_{1},\dots,{\bm{m}}_{n_{M}}\}$ coincide with the nodes $\mathcal{V}$ of the road network graph $\mathcal{N}$ , it is more challenging to ensure that the segments $\{{\bm{m}}_{1},\dots,{\bm{m}}_{n_{M}}\}$ themselves coincide with the edges $\mathcal{E}$ . In comparison to an empirical trajectory $G$ , all boundary points of the edges of a map matched route $M(G)$ are aligned to the road network (reduced measurement error) and the vehicle position is estimated by the linestring connecting the boundary points of the road segment (reduced sampling error).

There is a large body of research on this difficult problem of map matching. We focus on the popular class of Hidden Markov Models (HMM) map matching algorithms. HMM methods iteratively build the map matched route by selecting the most likely next segment to connect to the current route using a probabilistic model. According to a review of map matching algorithms (Chao et al., 2020), HMM is a state-transition method. The other three classes are similarity, candidate-evolving and scoring methods. Further details of alternative map matching algorithms are found in Quddus et al. (2007), Chao et al. (2020). We leave this discussion here since the improvements offered by our proposed methods are valid for any map matching algorithm, and concentrate on HMM map matching due to its accuracy and computational efficiency. Even if we restrict ourselves to HMM algorithms on the OSM road network, there are many off-the-shelf map matching APIs available. We focus on the Valhalla routing engine (https://valhalla.github.io/valhalla), which includes its highly recommended map matching API (Saki and Hagen, 2022).

Figure 2 displays the $n=1147$ map matched routes by the Valhalla map matching API. We discard 36 trajectories (3.04%) from our original data set of 1183 trajectories. In Figure 2(a), the map matched routes (blue lines) overall are well-aligned to the road network. For the orange map matched route, all its segments are aligned to the road network, whereas the purple route appears to be displaced by several metres from the road centreline. The measurement and sampling errors of the map matched routes $M(G)$ are reduced in comparison to those for the empirical trajectories $G$ , though these errors remain sizeable at the road segment level in Figure 2(b).

Our first contribution is to better align the edges of the map matched routes $M(G)$ to the road network edges $\mathcal{E}$ . We propose a post hoc adjustment of the map matching output by an additional call to a route finding API, as outlined in Algorithm 1. The inputs of ST_ROUTE are the empirical GPS trajectory $G$ , the map matching API $M$ , the route finding API $R$ , and the number of waypoints ${\bm{n}}_{W}$ for the route finding. Since we do not have an a priori single optimal value for the number of waypoints, we consider a range of $w$ values ${\bm{n}}_{W}=(n_{W,1},\dots,n_{W,w})$ . In Step 1, we compute the initial map matched route $M(G)$ from the empirical GPS trajectory $G$ by calling the map matching API $M$ . This initial map matched route $M(G)=\{{\bm{m}}_{1},\dots,{\bm{m}}_{n_{M}}\}$ is a linestring with $n_{M}$ edges, and so $n_{M}+1$ points. In Steps 2–5, we loop over the $w$ numbers of waypoints in ${\bm{n}}_{W}$ . In Steps 3–4, we take a sample of $n_{W,i}$ waypoints, where the first waypoint ${\bm{w}}_{1}$ is the start point of ${\bm{m}}_{1}$ , the $n_{W,i}$th waypoint is the end point of ${\bm{m}}_{n_{M}}$ , and the intermediate waypoints ${\bm{w}}_{2},\dots,{\bm{w}}_{n_{W,i}-1}$ are sampled from the start points of the edges ${\bm{m}}_{2},\dots,{\bm{m}}_{n_{M}}$ . In Step 5, we call the route finding API $R$ with the waypoints ${\bm{w}}_{1},\dots,{\bm{w}}_{n_{W,i}}$ . The result is the map matched route $M_{i}^{*}(G)=R({\bm{w}}_{1},\dots,{\bm{w}}_{n_{W,i}})=\{{\bm{m}}_{1}^{*},\dots,{\bm{m}}_{n_{M^{*}}}^{*}\}$ , which is a route of $n_{M^{*}}$ connected edges. In Step 6, we select the route with the smallest dynamic time warping (DTW) normalised distance between the routes $M_{i}^{*}(G)$ and the empirical trajectory $G$ , for $i=1,\dots,w$ . The DTW distance is based on the lengths of all the distortions of $M_{i}^{*}(G)$ to achieve a maximal alignment between $M_{i}^{*}(G)$ and $G$ (Sakoe and Chiba, 1978, Giorgino, 2009).

Algorithm 1 ST_ROUTE – Map matched route

Input: $G$ GPS trajectory, $M$ map matching API, $R$ route finding API, ${\bm{n}}_{W}$ #waypoints
Output: $M^{*}(G)$ map matched route
1: Compute initial map matched route $M(G)$ from empirical GPS trajectory $G$
2: for $i:=1$ to $w$ do
3:     Set $n_{W}:=n_{W,i}$
4:     Sample $n_{W}$ waypoints ${\bm{w}}_{1},\dots,{\bm{w}}_{n_{W}}$ from $M(G)$ , with ${\bm{w}}_{1}:=\operatorname{Start}(M)$ , ${\bm{w}}_{n_{W}}:=\operatorname{End}(M)$
5:     Compute map matched route $M^{*}_{i}(G):=R({\bm{w}}_{1},\dots,{\bm{w}}_{n_{W}})$
6: Select minimal route $M^{*}(G):=\operatorname{argmin}_{i\in\{1,\dots,w\}}\,\operatorname{DTW}(M^{*}_{i}(G),G)$

We set $R$ to be the Valhalla Odin turn-by-turn route finding API to be consistent with our choice of $M$ as the Valhalla Meili map matching API. We set the numbers of waypoints as ${\bm{n}}_{W}=(3,13,23,33,43,63,83)$ . Figure 3 displays the results $M^{*}(G_{1}),\dots,M^{*}(G_{1147})$ from ST_ROUTE. The minimal DTW route is given by $n_{W}=83$ for the trajectory ID = 7 (orange), and by $n_{W}=23$ for ID = 315 (purple). The overall impression of the map matched routes $M^{*}$ in Figure 3(a) is that the misalignment has been reduced, especially in the purple line since it now aligns more closely to the road centreline. In Figure 3(b), at the level of road segments, whilst the map matched routes $M^{*}$ tend to be contained inside the road segments, they are not exactly coincident with each other. This is in part because the road network graph $\mathcal{N}$ itself contains small misalignments, and so they are propagated into any map matching or route finding algorithm based on it. These small misalignments lead to a lack of overlapping sub-segments, which in turn leads to inaccurate flow aggregation.
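The selection logic of ST_ROUTE can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: `map_match` and `find_route` are hypothetical stand-ins for the Valhalla APIs, the waypoints are assumed to be evenly spaced along the initial route, and the DTW distance is assumed to be normalised by the sum of the two sequence lengths.

```python
import math

def dtw_normalised(a, b):
    """DTW distance between two point sequences, divided by len(a) + len(b)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def sample_waypoints(route, n_w):
    """Keep the first and last route points plus evenly spaced interior points
    (even spacing is an assumption; requires n_w >= 2)."""
    if n_w >= len(route):
        return list(route)
    idx = sorted({round(k * (len(route) - 1) / (n_w - 1)) for k in range(n_w)})
    return [route[i] for i in idx]

def st_route(gps, map_match, find_route, n_waypoints):
    """Re-route the initial map matched route for each number of waypoints,
    then keep the re-routed candidate closest to the GPS trajectory in DTW."""
    initial = map_match(gps)
    candidates = [find_route(sample_waypoints(initial, n_w)) for n_w in n_waypoints]
    return min(candidates, key=lambda route: dtw_normalised(route, gps))

# Toy stand-ins (hypothetical): matching snaps points to the x-axis,
# routing joins the waypoints by straight lines.
gps = [(0, 0.1), (1, -0.1), (2, 0.1), (3, 0.0)]
snap_to_axis = lambda g: [(x, 0.0) for x, _ in g]
straight = lambda waypoints: list(waypoints)
best = st_route(gps, snap_to_axis, straight, n_waypoints=(2, 4))
```

In this toy case the route built from four waypoints follows the GPS points more closely than the two-waypoint route, so it is the one selected.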

3 Local alignment of road segments for flow map aggregation

Our goal is to resolve the remaining misalignments from the map matching/route finding in the previous section, so we are able to aggregate accurately traffic flows on road segments. We aim to achieve this by local, internal alignment between the map matched routes. By internal alignment, we mean that we align the routes with each other, rather than to the external road network graph. Since an external reference road network is not required as an explicit input, our proposal can be deployed in more cases, e.g. when the quality of the road network graph is insufficient, or when the alignment to the road network graph is computationally intensive. By local alignment, we mean that we focus on aligning sub-segments of the routes, rather than complete routes.

Recall that we represent the road network by a graph $\mathcal{N}=\mathcal{N}(\mathcal{E},\mathcal{V})$ . To this representation, we add the traffic flows on each of the network edges. We consider, without loss of generality, only the road segments with positive traffic flow $\mathcal{F}=\{(f,{\bm{\ell}}):{\bm{\ell}}\in\mathcal{N}(\mathcal{E},\mathcal{V}),f>0\}$ where ${\bm{\ell}}$ is a road segment composed of edges in $\mathcal{E}$ with traffic flow $f$ . Furthermore, we denote ${\bm{f}}=(f,{\bm{\ell}})$ so we can write succinctly $\mathcal{F}=\{{\bm{f}}_{1},\dots,{\bm{f}}_{n_{\mathcal{F}}}\}$ for the $n_{\mathcal{F}}$ road segment flows in a flow map. Our objective is to estimate these road segment flows where $\mathcal{F}$ forms a minimal network graph.

We begin by illustrating the difference in flow aggregation between the $M$ and $M^{*}$ map matched routes. For the routes $M$ from Figure 2(b) and the routes $M^{*}$ from Figure 3(b), the flow maps are given in Figure 4(a–b) respectively. In these maps, the colour (purple to orange) and width of the road segments are proportional to the traffic flow. We observe that there are fewer, wider linestring segments in Figure 4(b) than in Figure 4(a).

The flow aggregation in Figure 4 was carried out using the overline function in the R package stplanr, which we refer to as ST_OVERLINE_PLANR (Lovelace and Ellison, 2018). Starting with the map matched routes $\mathcal{M}^{*}=\{M^{*}({G_{1}}),\dots,M^{*}(G_{n})\}$ , the flow map is $\mathcal{F}={\tt ST\_OVERLINE\_PLANR}(\mathcal{M}^{*})$ . This flow aggregation involves the search for all road segments from the routes which are exactly equal to each other. These exactly equal road segments are reduced to a single common segment, and the associated traffic flow is the number of exactly equal segments. Since it relies on exactly equal road segments, small misalignments are sufficient to make the flow aggregations inaccurate.
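The sensitivity of exact-match aggregation to small misalignments can be seen in a minimal Python sketch. It is not the stplanr implementation: routes are assumed to be lists of coordinate pairs, each route contributes a flow of 1, and segments are compared regardless of direction (an assumption made here for simplicity).

```python
from collections import Counter

def overline_exact(routes):
    """Count how many times each exactly equal two-point segment occurs."""
    flows = Counter()
    for route in routes:
        for start, end in zip(route, route[1:]):
            # Sort the end points so that direction is ignored (assumption).
            flows[tuple(sorted((start, end)))] += 1
    return dict(flows)

routes = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 0), (1, 0), (2, 0)],
    [(0, 0), (1, 0.00001), (2, 0)],  # same road, tiny misalignment
]
flows = overline_exact(routes)
```

The first two routes are aggregated into two segments with flow 2, whereas the third route, displaced by a negligible amount at its middle point, contributes two separate segments with flow 1 instead of raising the flow to 3.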

Our goal of producing a minimal flow map relies on resolving the crucial problem of how to aggregate similar, but not exactly overlapping, road segments. Many solutions have been offered, such as edge bundling (Zhou et al., 2013) and rasterisation (Wood et al., 2010, Morgan and Lovelace, 2021). Edge bundling consists of clustering trajectory linestrings and replacing all cluster members with a single representative linestring. These (and subsequent) authors conclude that it performs poorly when applied to noisy GPS trajectories at the road segment level, and remains mostly suited to coarser aggregations, such as desire lines. Rasterisation relies on converting the vectorial flow map into a raster matrix, and aggregating the flows within the same raster pixel neighbourhood. Whilst this is indeed able to improve flow aggregations, it depends highly on the raster pixel neighbourhood size, and the rasterisation of the vectorial flow map leads to a loss of spatial resolution. We propose an alternative aggregation which does not lose resolution. Due to the complexity of this aggregation, it is divided into several algorithms, so that after the application of each algorithm, we progress further towards a minimal flow map.

We develop our novel algorithms within the R statistical analysis environment, to take advantage of its integrated access to advanced statistical and geospatial analysis methods. Whilst R is not a bona fide GIS (Geographic Information System), its geospatial functionalities conform to the OGC standards (OGC, 2010) via the package sf (Pebesma, 2018), and it is a viable option for research in transport geospatial data analysis (Necula, 2015, Lovelace et al., 2019).

3.1 Node snapping with hierarchical clustering

The proposed algorithm is ST_SNAPNODE, where the boundary points of the traffic flow linestrings are snapped to each other. Since the former are also nodes of the flow map, this gives the name to the algorithm. We focus on snapping these nodes since the linestring misalignments are in part caused by the existence of nodes which are close to each other but not exactly equal.

Since we are searching for points which are close together, this task is well-suited to statistical clustering. There are many statistical clustering algorithms available, and we focus on hierarchical clustering (Gordon, 1999). A naive implementation where we consider all boundary points of all flow linestrings in a 1-pass complete linkage clustering is computationally intensive for any reasonable number of routes (Müllner, 2013). To resolve this computational bottleneck, we approximate the 1-pass complete linkage clustering by a nested 2-pass clustering. We begin with an efficient single linkage clustering of the boundary points of all linestrings in the R package fastcluster (Müllner, 2013). Since single linkage can result in chain-like clusters, we compute a subsequent complete linkage clustering to break these potential chains. In this nested approach, the complete linkage distance matrix is calculated only within each single linkage cluster, and so we are less likely to reach computational limits.

Algorithm 2 is a description of ST_SNAPNODE. The inputs are the $n_{\mathcal{F}}$ traffic flow linestrings $\mathcal{F}=\{{\bm{f}}_{1},\dots,{\bm{f}}_{n_{\mathcal{F}}}\}$ and the snap tolerance $\varepsilon_{S}$ . In Step 1, we extract the boundary points of all flow linestrings. In Steps 2–3, we compute a single linkage clustering on all boundary points, and cut the dendrogram at height $\varepsilon_{S}$ , resulting in $C^{\prime}$ clusters. In Steps 4–7, within each of these single linkage clusters, we compute a complete linkage clustering, and cut the dendrogram at height $\varepsilon_{S}$ . This divides each single linkage cluster into $C^{\prime\prime}$ clusters, where all cluster members are at most $\varepsilon_{S}$ distance from each other. In Steps 8–11, we compute the weighted spatial centroid in each of the $C^{\prime\prime}$ complete linkage clusters, weighted by the corresponding traffic flow values. In Step 12, we renumber the cluster labels from the above nested clusterings to approximate a 1-pass complete linkage clustering into $C^{*}$ clusters. In Steps 13–16, within each of these $C^{*}$ clusters, for all points of the corresponding flow linestrings which are closer than $\varepsilon_{S}$ distance to the centroid, we snap them to the centroid. In Steps 17–18, we collate and sort all the $n_{\mathcal{F}^{*}}$ snapped linestrings $\mathcal{F}^{*}=\{{\bm{f}}_{1}^{*},\dots,{\bm{f}}_{n_{\mathcal{F}^{*}}}^{*}\}$ . In comparison to the unsnapped linestrings $\mathcal{F}$ , there are fewer snapped linestrings $\mathcal{F}^{*}$ which have higher flow values, are more interconnected, and share more exactly overlapping sub-segments.

Algorithm 2 ST_SNAPNODE – Snap node clustering of linestrings

Input: $\mathcal{F}$ flow linestrings, $\varepsilon_{S}$ snap tolerance
Output: $\mathcal{F}^{*}$ snap node clustered flow linestrings
1: Extract boundary points $B(\mathcal{F}):=\{\operatorname{Start}({\bm{e}}_{1}),\operatorname{End}({\bm{e}}_{1}),\dots,\operatorname{Start}({\bm{e}}_{n_{\mathcal{F}}}),\operatorname{End}({\bm{e}}_{n_{\mathcal{F}}})\}$
2: Compute hierarchical clustering with single linkage on $B(\mathcal{F})$
3: Cut single linkage dendrogram at height $\varepsilon_{S}$ to compute $C^{\prime}$ cluster labels
4: for $i:=1$ to $C^{\prime}$ do
5:     Extract $i$th cluster of boundary points $B_{i}^{\prime}:=\{{\bm{b}}_{1}^{\prime},\dots,{\bm{b}}_{n_{B^{\prime}}}^{\prime}\}$
6:     Compute hierarchical clustering with complete linkage on $B_{i}^{\prime}$
7:     Cut complete linkage dendrogram at height $\varepsilon_{S}$ to compute $C^{\prime\prime}$ cluster labels
8:     for $j:=1$ to $C^{\prime\prime}$ do
9:         Extract $j$th cluster of boundary points $B_{j}^{\prime\prime}:=\{{\bm{b}}_{1}^{\prime\prime},\dots,{\bm{b}}_{n_{B^{\prime\prime}}}^{\prime\prime}\}$
10:         Extract corresponding flows $\{{\bm{f}}_{1}^{\prime\prime},\dots,{\bm{f}}_{n_{B^{\prime\prime}}}^{\prime\prime}\}$ from $\mathcal{F}$
11:         Compute ${\bm{b}}_{j}^{*}$ := weighted centroid of $\{{\bm{f}}_{1}^{\prime\prime},\dots,{\bm{f}}_{n_{B^{\prime\prime}}}^{\prime\prime}\}$ , weights := $f_{1}^{\prime\prime},\dots,f_{n_{B^{\prime\prime}}}^{\prime\prime}$
12: Renumber collated complete linkage cluster labels to unique $C^{*}$ labels
13: for $i:=1$ to $C^{*}$ do
14:     Extract corresponding flow linestrings $\mathcal{F}_{i}^{*}:=\{{\bm{f}}_{1}^{*},\dots,{\bm{f}}_{n_{B^{*}}}^{*}\}$ from $\mathcal{F}$
15:     Snap points of $\{{\bm{f}}_{1}^{*},\dots,{\bm{f}}_{n_{B^{*}}}^{*}\}$ within $\varepsilon_{S}$ distance of cluster centroid ${\bm{b}}^{*}_{i}$ to ${\bm{b}}^{*}_{i}$
16:     $\mathcal{F}_{i}^{*}$ := rejoin snapped and unsnapped points into linestrings
17: Collate snapped linestrings $\{\mathcal{F}_{1}^{*},\dots,\mathcal{F}_{C^{*}}^{*}\}$ into $\mathcal{F}^{*}:=\{{\bm{f}}_{1}^{*},\dots,{\bm{f}}_{n_{\mathcal{F}^{*}}}^{*}\}$
18: Sort $\mathcal{F}^{*}$ by descending flow and length

An illustration of Algorithm 2 is given in Figure 5. Figure 5(a) shows the flow map before node snapping, where we observe that the boundary points (black solid circles) of some of the purple linestrings are closer to each other than the snapping tolerance $\varepsilon_{S}=4$ m. Figure 5(b) shows an updated flow map after node snapping and applying ST_OVERLINE_PLANR. There are fewer thin purple road segments, since by snapping their intersection nodes, they share more exact sub-segments, and these shared sub-segments are merged during the flow aggregation to result in wide orange road segments.
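The nested 2-pass clustering behind ST_SNAPNODE can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' R code: it uses naive clustering routines rather than fastcluster, it handles only the boundary points (not all points of the linestrings), and each boundary point is assumed to carry the flow of its linestring.

```python
import math

def single_linkage_cut(points, eps):
    """Cluster labels from cutting a single linkage dendrogram at height eps,
    computed as connected components of the 'within eps' graph."""
    labels = [-1] * len(points)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and math.dist(points[i], points[j]) <= eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def complete_linkage_cut(points, eps):
    """Naive complete linkage: merge the closest pair of clusters until the
    smallest complete linkage distance exceeds eps. Returns index lists."""
    clusters = [[i] for i in range(len(points))]
    def linkage(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)
    while len(clusters) > 1:
        d, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d > eps:
            break
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters

def snap_nodes(points, flows, eps):
    """Snap each boundary point to the flow-weighted centroid of its cluster."""
    snapped = list(points)
    labels = single_linkage_cut(points, eps)
    for lab in set(labels):
        members = [i for i, l in enumerate(labels) if l == lab]
        sub = [points[i] for i in members]
        for cluster in complete_linkage_cut(sub, eps):
            idx = [members[k] for k in cluster]
            w = sum(flows[i] for i in idx)
            cx = sum(flows[i] * points[i][0] for i in idx) / w
            cy = sum(flows[i] * points[i][1] for i in idx) / w
            for i in idx:
                snapped[i] = (cx, cy)
    return snapped

points = [(0, 0), (3, 0), (6, 0), (100, 100)]
snapped = snap_nodes(points, flows=[1, 2, 1, 5], eps=4)
```

The first three points form a chain under single linkage, which the complete linkage pass breaks into two clusters because the outer points are 6 units apart. The first two points are snapped to their flow-weighted centroid (2, 0), and the other two are left in place.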

3.2 Node splitting to add missing intersection nodes

Node snapping is focused near the boundary points of the flow linestrings. So it does not address misalignments far from the boundary points. Moreover there remain some linestrings which do indeed intersect but whose intersection is not correctly computed by ST_OVERLINE_PLANR. The solution to both of these problems is the explicit addition of the missing intersection nodes to these linestrings.

ST_SPLITNODE is the collation of a couple of algorithms which explicitly add the nodes at the intersections in the interior of linestrings, and splits these linestrings into simple linestring segments at these added nodes. This gives the name to this method. The inputs to Algorithm 3 are the linestrings $\mathcal{F}$ and the node splitting $S$ . For computational stability and efficiency, our method draws on two existing algorithms: in Steps 1–3, to_spatial_subdivision ( $S=$ ‘subdivision’) in the sfnetworks package (van der Meer et al., 2023), and in Steps 4–5, geos_unary_union ( $S=$ ‘unary’) in the geos package (Dunnington and Pebesma, 2023). The first option tends to find fewer intersection nodes, but this can be an advantage for our proposed line blending (to be introduced in the next subsection) since many small linestrings with similar flow values do not provide a clear prioritisation of this blending. It is also similar to the intersection computation performed by ST_OVERLINE_PLANR. The second option finds more missing intersection nodes, and leads to more comprehensive flow aggregation. We require both types of node splitting to compute a minimal flow map.

Algorithm 3 ST_SPLITNODE – Split nodes at interior intersections of linestrings

Input: $\mathcal{F}$ flow linestrings, $S$ node splitting type
Output: $\mathcal{F}^{*}$ node split flow linestrings
1: Initialise local network $\mathcal{N}^{*}$ with linestrings $\mathcal{F}$
2: if $S==$ ‘subdivision’ then
3:     $\mathcal{F}^{*}:=$ sfnetworks::to_spatial_subdivision $(\mathcal{N}^{*})$
4: else if $S==$ ‘unary’ then
5:     $\mathcal{F}^{*}:=$ geos::geos_unary_union $(\mathcal{N}^{*})$

Figure 6(a) shows the flow map without any node splitting or node snapping. This map has missing intersection nodes, and intersection nodes which are close to each other. Figure 6(b) shows the flow map after node splitting ( $S=$ ‘unary’) to add the missing intersection nodes, and then node snapping (with snap tolerance $\varepsilon_{S}=4$ m). The result is that there are fewer road segments with higher flow values. Due to the combined action of ST_SPLITNODE and ST_SNAPNODE, Figure 6(b) is an improvement over Figures 5(a–b) and 6(a).
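To make the idea of node splitting concrete, the following Python sketch adds the missing intersection node between crossing straight segments and splits them there. It is only an illustration of the principle: it does not reproduce the behaviour of to_spatial_subdivision or geos_unary_union, and it assumes straight two-point segments that cross at interior points.

```python
def segment_intersection(p, q, r, s):
    """Interior crossing point of segments pq and rs, or None if there is none."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p, q, r, s
    denom = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if denom == 0:
        return None  # parallel or collinear segments are not handled here
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / denom
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / denom
    if 0 < t < 1 and 0 < u < 1:
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    return None

def split_nodes(segments):
    """Split each segment at its interior crossings with the other segments."""
    pieces = []
    for i, (p, q) in enumerate(segments):
        cuts = {segment_intersection(p, q, r, s)
                for j, (r, s) in enumerate(segments) if j != i}
        cuts.discard(None)
        # Order the new nodes by their distance from the segment start.
        ordered = sorted(cuts, key=lambda c: (c[0] - p[0]) ** 2 + (c[1] - p[1]) ** 2)
        nodes = [p] + ordered + [q]
        pieces.extend(zip(nodes, nodes[1:]))
    return pieces

segments = [((0, 0), (2, 2)), ((0, 2), (2, 0))]
pieces = split_nodes(segments)
```

The two diagonal segments cross at (1, 1), so each is split into two pieces which share this new node.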

3.3 Line blending to align similar linestrings with local reference

So far we have focused on improving the alignment of linestrings induced by resolving inconsistencies at their intersections. We now focus on aligning linestrings more generally. For this, we require a comparison relationship to establish an order of alignment of nearby linestrings. For two linestrings ${\bm{f}}_{1}$ and ${\bm{f}}_{2}$ from a flow map $\mathcal{F}$ , we define that ${\bm{f}}_{2}$ is a candidate to be aligned to the reference ${\bm{f}}_{1}$ at threshold $\varepsilon\geq 0$ if

$${\bm{f}}_{2}\subseteq{\tt ST\_BUFFER}({\bm{f}}_{1},\varepsilon),\quad f_{2}\leq f_{1}.\tag{1}$$

This relation is insensitive to any local complexities in ${\bm{f}}_{2}$ as long as they are all contained within the buffer zone around ${\bm{f}}_{1}$ . The buffer zone we employ has flat edges, e.g. for the R package sf, this corresponds to ST_BUFFER(endCapStyle="FLAT"), so the buffer zone ends at the boundary points of ${\bm{f}}_{1}$ . This avoids erroneously considering neighbouring segments of ${\bm{f}}_{1}$ , which are connected to ${\bm{f}}_{1}$ as a part of a longer sequence of road segments, to be candidates to be aligned to ${\bm{f}}_{1}$ . The condition on the flow values means that we place a higher priority on linestrings with higher flows. Since we can define a local reference linestring, a global road network graph is no longer required to align the candidate linestrings.
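The candidate relation in Equation (1) can be checked with elementary geometry when the reference is a single straight segment, since the flat-capped buffer is then a rectangle. The Python sketch below makes this simplifying assumption (a two-point reference with distinct end points), and the function names are hypothetical. Because a rectangle is convex, it suffices to test the vertices of the candidate linestring.

```python
import math

def in_flat_buffer(point, ref_start, ref_end, eps):
    """True if point lies in the flat-capped eps buffer of a straight segment."""
    (px, py), (ax, ay), (bx, by) = point, ref_start, ref_end
    dx, dy = bx - ax, by - ay
    length2 = dx * dx + dy * dy
    # Relative position of the projection along the reference segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length2
    # Perpendicular distance to the line through the reference segment.
    offset = abs((px - ax) * dy - (py - ay) * dx) / math.sqrt(length2)
    return 0.0 <= t <= 1.0 and offset <= eps

def is_candidate(cand, flow_cand, ref, flow_ref, eps):
    """Equation (1): candidate inside the buffer and with no larger flow."""
    inside = all(in_flat_buffer(p, ref[0], ref[1], eps) for p in cand)
    return inside and flow_cand <= flow_ref

ref = [(0, 0), (10, 0)]
nearby = [(1, 1), (5, 2), (9, 1)]
overhang = [(-1, 1), (5, 2)]  # starts before the flat end of the buffer
```

With a tolerance of 4, `nearby` with flow 2 is a candidate for a reference with flow 5, whereas `overhang` is rejected by the flat end cap, and `nearby` is also rejected if its flow exceeds that of the reference.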

Let ${\bm{f}}_{\rm ref}$ be a reference linestring from $\mathcal{F}$ . The $m=n_{\mathcal{F}_{\rm cand}}$ linestrings from $\mathcal{F}\backslash\{{\bm{f}}_{\rm ref}\}$ which satisfy Equation (1) form the set of candidate linestrings $\mathcal{F}_{\rm cand}=\{{\bm{f}}_{{\rm cand},1},\dots,{\bm{f}}_{{\rm cand},n_{\mathcal{F}_{\rm cand}}}\}$ , with the convention that if $n_{\mathcal{F}_{\rm cand}}=0$ then $\mathcal{F}_{\rm cand}$ is the empty set. We call our approach ‘line blending’, since we will blend the candidate linestrings onto the reference linestring. The inputs in Algorithm 4 are the reference linestring ${\bm{f}}_{\rm ref}$ , the set of $m$ candidate linestrings $\mathcal{F}_{\rm cand}$ , and the blend tolerance $\varepsilon$ . The outputs are the modified reference and $m$ modified candidate flow linestrings, all with added interior points for accurate calculation of exactly equal linestring segments. In Step 1, we initialise a local network graph $\mathcal{N}^{*}$ with the reference linestring. In Steps 2–3, we extract all points of the candidate linestrings, and use the network_blend function in the sfnetworks package (which we denote as ST_NETWORK_BLEND) to blend efficiently these points into the reference linestring (van der Meer et al., 2023). This network blending requires a blend threshold, which we set to be $\varepsilon$ . ST_NETWORK_BLEND projects the candidate linestrings onto the reference linestring, and explicitly adds them to the network, thereby creating new edges in $\mathcal{N}^{*}$ . The result is a local network graph with more, shorter edges and with nodes at the projected candidate points, and whose union is the reference linestring. In Steps 4–5, we extract the $m$ blended candidate linestrings with these added interior points by applying the shortest path search between the start and end point of each blended candidate linestring along the network graph $\mathcal{N}^{*}$ . Step 6 is the equivalent for the blended reference linestring. Step 7 involves collating the blended candidate linestrings.

Algorithm 4 ST_LINEBLEND – Blend candidate linestrings onto reference linestring

Input: ${\bm{f}}_{\rm ref}$ reference, $\mathcal{F}_{\rm cand}$ candidate flow linestrings, $\varepsilon$ blend tolerance
Output: $({\bm{f}}^{*}_{\rm ref},\mathcal{F}^{*}_{\rm cand})$ blended reference and $m$ candidate flow linestrings
1: Initialise local network graph $\mathcal{N}^{*}$ with ${\bm{f}}_{\rm ref}$
2: Extract all points $G_{\rm cand}$ of candidate linestrings and flow values $f_{\rm cand}$ from $\mathcal{F}_{\rm cand}$
3: Update network by blending candidate points $\mathcal{N}^{*}:={\tt ST\_NETWORK\_BLEND}(\mathcal{N}^{*},G_{\rm cand},\varepsilon)$
4: for $i:=1$ to $n_{\mathcal{F}_{\rm cand}}$ do
5:     ${\bm{f}}^{*}_{{\rm cand},i}$ := shortest path from $\operatorname{Start}({\bm{f}}_{{\rm cand},i})$ to $\operatorname{End}({\bm{f}}_{{\rm cand},i})$ along network $\mathcal{N}^{*}$
6: ${\bm{f}}^{*}_{{\rm ref}}$ := shortest path from $\operatorname{Start}({\bm{f}}_{{\rm ref}})$ to $\operatorname{End}({\bm{f}}_{{\rm ref}})$ along network $\mathcal{N}^{*}$
7: Collate $\mathcal{F}^{*}_{\rm cand}:=\{{\bm{f}}_{{\rm cand},1}^{*},\dots,{\bm{f}}_{{\rm cand},n_{\mathcal{F}_{\rm cand}}}^{*}\}$
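The blending steps can be mimicked without a network library. The sketch below is a simplified stand-in, not the sfnetworks implementation: it projects each candidate vertex onto the reference polyline, inserts the projected points as interior vertices, and reads off the blended candidate as the stretch of the blended reference between the projected start and end points. The function names and coordinates are hypothetical, and the reference is assumed to have no repeated vertices.

```python
from math import hypot


def project(p, line):
    """Return (distance, position along line, projected point) for the point of `line` nearest to p."""
    best, travelled = None, 0.0
    for a, b in zip(line, line[1:]):
        dx, dy = b[0] - a[0], b[1] - a[1]
        seg_len = hypot(dx, dy)
        t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len**2
        t = max(0.0, min(1.0, t))  # clamp the projection to the segment
        q = (a[0] + t * dx, a[1] + t * dy)
        d = hypot(p[0] - q[0], p[1] - q[1])
        if best is None or d < best[0]:
            best = (d, travelled + t * seg_len, q)
        travelled += seg_len
    return best


def blend(ref, cand, eps):
    """Return (blended reference, blended candidate) as vertex lists."""
    # Position of each existing reference vertex along the reference.
    nodes = [(0.0, ref[0])]
    for a, b in zip(ref, ref[1:]):
        nodes.append((nodes[-1][0] + hypot(b[0] - a[0], b[1] - a[1]), b))
    cand_pos = []
    for p in cand:
        d, pos, q = project(p, ref)
        if d > eps:
            raise ValueError("candidate vertex lies outside the blend tolerance")
        cand_pos.append(pos)
        # Add the projected point unless a vertex already sits there.
        if all(abs(pos - s) > 1e-9 for s, _ in nodes):
            nodes.append((pos, q))
    nodes.sort(key=lambda node: node[0])
    lo, hi = min(cand_pos[0], cand_pos[-1]), max(cand_pos[0], cand_pos[-1])
    blended_ref = [pt for _, pt in nodes]
    # The blended candidate follows the reference between its projected end points.
    blended_cand = [pt for s, pt in nodes if lo - 1e-9 <= s <= hi + 1e-9]
    return blended_ref, blended_cand


ref = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]   # hypothetical reference with three vertices
cand = [(0.0, 0.0), (12.0, 3.0)]               # hypothetical candidate within 4 m of it
blended_ref, blended_cand = blend(ref, cand, eps=4.0)
print(blended_ref)
print(blended_cand)
```

After blending, the candidate is an exact sub-sequence of the reference vertices, which is what allows a subsequent overlap-based flow aggregation to treat the shared sub-segments as identical. Note that this sketch always returns the blended candidate in the direction of the reference.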

The result from ${\tt ST\_LINEBLEND}$ is the modified reference linestring and the projected candidate linestrings with added interior points. These added interior points resolve a limitation of the flow aggregation of ST_OVERLINE_PLANR (Morgan and Lovelace, 2021). If there are many candidate linestrings, then this may lead to many sub-segments in the projected linestrings, each with their own flow values. So we assign the weighted mean flow, weighted by the sub-segment length, to all sub-segments. This single flow value takes into account the contribution of each candidate linestring to the flow along the reference linestring. Moreover, we take the rounded value of this weighted mean flow to expedite the aggregation computations and to reduce visual clutter of the flow map.
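The length-weighted mean flow can be illustrated as follows. This is a small sketch with hypothetical sub-segment flows and lengths; the function name is ours, and the handling of exact ties in the rounding (Python rounds halves to the nearest even integer) is an assumption rather than something specified in the text.

```python
def weighted_mean_flow(flows, lengths):
    """Length-weighted mean of sub-segment flows, rounded to the nearest integer."""
    total_length = sum(lengths)
    mean = sum(f * l for f, l in zip(flows, lengths)) / total_length
    return round(mean)


# Hypothetical sub-segments of one blended reference linestring.
flows = [9, 8, 7]
lengths = [10.0, 6.0, 4.0]  # metres
print(weighted_mean_flow(flows, lengths))
```

Longer sub-segments dominate the mean, so a short sub-segment with an atypical flow has little influence on the single flow value assigned to the whole linestring.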

In Figure 7(a), we illustrate this line blending with the linestrings $ABC$ with flow $f_{ABC}=7$ in blue, and $AD$ with flow $f_{AD}=2$ in orange. As these two linestrings do not share a common sub-segment, flow aggregation does not modify them. Since $AD\subset{\tt ST\_BUFFER}(ABC,4~\mathrm{m})$ (the pale blue rectangle) and $f_{AD}<f_{ABC}$, $ABC$ is the reference ${\bm{f}}_{\rm ref}$ and $AD$ is the candidate linestring ${\bm{f}}_{\rm cand}$, according to Equation (1). Line blending involves projecting $AD$ to $ABC$, so $D$ is projected to $D^{\prime}$ (which lies exactly on the reference linestring), and $B$ is added explicitly to this projected linestring. We also add $D^{\prime}$ to the reference linestring. The reference linestring becomes ${\bm{f}}^{*}_{\rm ref}=(7,ABD^{\prime}C)$, and the candidate ${\bm{f}}^{*}_{\rm cand}=(2,ABD^{\prime})$. Now ${\bm{f}}^{*}_{\rm ref}$ and ${\bm{f}}^{*}_{\rm cand}$ share the sub-segment $ABD^{\prime}$ exactly, and ST_OVERLINE_PLANR gives the aggregated flows as $(9,ABD^{\prime})$ and $(7,D^{\prime}C)$, since $D^{\prime}C$ is covered by the reference linestring only. We take the rounded value of the length-weighted mean of these two flows. The result of ST_LINEBLEND with ST_OVERLINE_PLANR is a single linestring $(9,ABD^{\prime}C)$, as shown in Figure 7(b).

More complex situations arise when other linestrings touch the candidate linestring, but are not themselves candidates for blending. Since line blending projects the candidates to the reference linestring, we also have to project these other linestrings to avoid leaving a gap in the updated flow map. This procedure, ST_SNAP_CAND_TOUCH, is outlined in Algorithm 5. The inputs are the reference linestring ${\bm{f}}_{\rm ref}$, the $n_{\mathcal{F}_{\rm cand}}$ candidate linestrings $\mathcal{F}_{\rm cand}$, the $n_{\mathcal{F}_{\rm ct}}$ candidate-touching linestrings $\mathcal{F}_{\rm ct}$, and the snap tolerance $\varepsilon_{S}$. In Steps 1–8, we iterate over each candidate-touching linestring. In Steps 2–3, we compute the intersection point between the candidate-touching linestring and the boundary points of the candidate linestrings, and its distances to the two boundary points of ${\bm{f}}_{\rm ref}$. In Steps 4–7, if this intersection point is within $\varepsilon_{S}$ distance of a boundary point of ${\bm{f}}_{\rm ref}$, then we snap the candidate-touching linestring to the closest boundary point. In Step 8, if the intersection point is not within $\varepsilon_{S}$ distance, then we snap it to the nearest point on ${\bm{f}}_{\rm ref}$. These snapping operations are similar to those in Steps 15–16 in ST_SNAPNODE in Algorithm 2, and ensure that we maintain connectivity between $\mathcal{F}_{\rm ct}$ and ${\bm{f}}_{\rm ref}$. In Step 9, we collate these snapped linestrings.

Algorithm 5 ST_SNAP_CAND_TOUCH – Snap candidate-touching linestrings onto reference

Input: ${\bm{f}}_{\rm ref}$ reference, $\mathcal{F}_{\rm cand}$ candidate, $\mathcal{F}_{\rm ct}$ candidate-touching flow linestrings, $\varepsilon_{S}$ snap tolerance
Output: $\mathcal{F}^{*}_{\rm ct}$ snapped candidate-touching flow linestrings
1: for $i:=1$ to $n_{\mathcal{F}_{\rm ct}}$ do
2:     ${\bm{g}}_{{\rm ct},i}^{*}:={\bm{f}}_{{\rm ct},i}\cap\{\operatorname{Start}(\mathcal{F}_{{\rm cand}}),\operatorname{End}(\mathcal{F}_{{\rm cand}})\}$
3:     $d_{i,\operatorname{Start}}^{*}:={\tt ST\_DIST}({\bm{g}}_{{\rm ct},i}^{*},\operatorname{Start}({\bm{f}}_{\rm ref}))$ ; $d_{i,\operatorname{End}}^{*}:={\tt ST\_DIST}({\bm{g}}_{{\rm ct},i}^{*},\operatorname{End}({\bm{f}}_{\rm ref}))$
4:     if ($d_{i,\operatorname{Start}}^{*}\leq\varepsilon_{S}$ and $d_{i,\operatorname{Start}}^{*}\leq d_{i,\operatorname{End}}^{*}$) then
5:         ${\bm{f}}^{*}_{{\rm ct},i}:=$ snap ${\bm{f}}_{{\rm ct},i}$ to $\operatorname{Start}({\bm{f}}_{\rm ref})$
6:     else if ($d_{i,\operatorname{End}}^{*}\leq\varepsilon_{S}$ and $d_{i,\operatorname{End}}^{*}\leq d_{i,\operatorname{Start}}^{*}$) then
7:         ${\bm{f}}^{*}_{{\rm ct},i}:=$ snap ${\bm{f}}_{{\rm ct},i}$ to $\operatorname{End}({\bm{f}}_{\rm ref})$
8:     else ${\bm{f}}^{*}_{{\rm ct},i}:=$ snap ${\bm{f}}_{{\rm ct},i}$ to nearest point of ${\bm{f}}_{\rm ref}$
9: Collate $\mathcal{F}^{*}_{\rm ct}:=\{{\bm{f}}^{*}_{{\rm ct},1},\dots,{\bm{f}}^{*}_{{\rm ct},n_{\mathcal{F}_{\rm ct}}}\}$

Figure 8(a) is an illustration of ST_SNAP_CAND_TOUCH with the reference $ABC$ with flow $f_{ABC}=7$ (blue), the candidate linestring $AD$ with flow $f_{AD}=2$ (orange), and the blending buffer zone with blend tolerance $\varepsilon=4$ m (pale blue). The candidate-touching linestring ${\bm{f}}_{\rm ct}=(5,DE)$ (purple) touches the candidate $AD$ at $D$. The orange line is entirely within the pale blue buffer zone around the blue reference line, whereas the purple line extends outside of the buffer zone and so is not a candidate for blending to the reference linestring. If we apply ST_LINEBLEND to blend the candidate linestring, then $D$ is projected to $D^{\prime}$ to lie on the reference linestring, i.e. ${\bm{f}}_{\rm cand}^{*}=(2,ABD^{\prime})$. If we apply ST_SNAP_CAND_TOUCH to the candidate-touching linestring, then as the intersection point $D$ is less than $\varepsilon_{S}=4$ m from the boundary point $C$ of ${\bm{f}}_{\rm ref}$, it is snapped to $C$, i.e. ${\bm{f}}^{*}_{\rm ct}=(5,CE)$. Thus line blending, candidate-touching snapping and flow aggregation together ensure that ${\bm{f}}^{*}_{\rm ct}=(5,CE)$ remains connected to ${\bm{f}}_{\rm ref}^{*}=(9,ABD^{\prime}C)$.
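The snapping rule for a candidate-touching linestring can be sketched as below. This is an illustrative reimplementation under simplifying assumptions (linestrings as vertex lists, a single touching point given explicitly); the function names and coordinates are hypothetical and are not taken from Figure 8.

```python
from math import hypot


def dist(p, q):
    """Euclidean distance between two points."""
    return hypot(p[0] - q[0], p[1] - q[1])


def nearest_point(p, line):
    """Point on the polyline `line` that is closest to p."""
    best = None
    for a, b in zip(line, line[1:]):
        dx, dy = b[0] - a[0], b[1] - a[1]
        t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
        t = max(0.0, min(1.0, t))  # clamp the projection to the segment
        q = (a[0] + t * dx, a[1] + t * dy)
        if best is None or dist(p, q) < dist(p, best):
            best = q
    return best


def snap_cand_touch(touch_line, touch_pt, ref, eps_s):
    """Move the touching vertex of `touch_line` onto the reference `ref`."""
    d_start, d_end = dist(touch_pt, ref[0]), dist(touch_pt, ref[-1])
    if d_start <= eps_s and d_start <= d_end:
        target = ref[0]        # close to the start boundary point
    elif d_end <= eps_s and d_end <= d_start:
        target = ref[-1]       # close to the end boundary point
    else:
        target = nearest_point(touch_pt, ref)
    return [target if pt == touch_pt else pt for pt in touch_line]


ref = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
# Touching point within 4 m of the end of the reference: snapped to that end point.
print(snap_cand_touch([(18.0, 3.0), (25.0, 10.0)], (18.0, 3.0), ref, eps_s=4.0))
# Touching point far from both end points: snapped to the nearest interior point.
print(snap_cand_touch([(12.0, 3.0), (15.0, 10.0)], (12.0, 3.0), ref, eps_s=4.0))
```

Preferring a boundary point when one is within tolerance keeps the snapped linestring attached at an existing node of the reference, rather than creating a new node a short distance away from it.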

So far we have only considered line blending where the reference linestring has a single flow. Due to the noisiness of the empirical GPS trajectories, this is insufficient to reach a minimal flow map. So we allow the reference to be a sequence of $k$ connected edges, each with its own flow. If we momentarily treat this edge sequence as a single linestring, with flow equal to the mean flow of the $k$ edges weighted by their lengths, then we can apply Equation (1) to search for potential candidate linestrings. Due to computational limitations, we retain the requirement that candidates are single edges with single flow values. The generalisation to $k$-edge reference linestrings allows us to blend candidate linestrings which exceed the buffer zones of 1-edge reference linestrings.

We have described the situation for blending candidate and candidate-touching linestrings to a single reference linestring (possibly composed of $k$ edges). The next step is to determine the priority for line blending for a set of reference linestrings. We require that a reference linestring cannot be a candidate linestring to another reference linestring, and that a candidate linestring is a candidate for one reference linestring only. These requirements ensure that if two linestrings satisfy ${\bm{f}}_{2}\subseteq{\tt ST\_BUFFER}({\bm{f}}_{1},\varepsilon)$ and ${\bm{f}}_{1}\subseteq{\tt ST\_BUFFER}({\bm{f}}_{2},\varepsilon)$, then we select only one of them as the reference. The flow condition $f_{2}\leq f_{1}$ usually distinguishes the reference linestring, except where $f_{1}=f_{2}$. For equal flow values, we designate the longer linestring to be the reference.

The inputs to ST_LINEBLEND_PRIORITY in Algorithm 6 are the flow linestrings $\mathcal{F}$, the number of connected edges for the reference linestrings $k$, and the blend tolerance $\varepsilon$. The output is a set of non-overlapping reference linestrings $\mathcal{F}_{\rm ref}$, a set of unique candidate linestrings $\mathcal{F}_{\rm cand}$ and a set of (potentially repeated) candidate-touching linestrings $\mathcal{F}_{\rm ct}$. In Steps 1–2, we set up a network graph from the linestrings $\mathcal{F}$, and then extract $\mathcal{K}$, all simple paths (i.e. all paths composed of connected, unique edges) with $k$ edges. In Steps 3–4, we compute the weighted mean of the $k$ flow values, weighted by the length of the individual edges, and assign this to the entire $k$-edge path. The function ${\rm Edges}(\cdot)$ extracts the edges from a $k$-edge path. In Step 5, we sort the $k$-edge paths in descending order of their flow values and length. In Steps 6–8, we construct a maximal set of non-overlapping $k$-edge paths $\mathcal{K}^{*}$. We initialise $\mathcal{K}^{*}$ to contain the first path. We then iterate through $\mathcal{K}$ and, if the current path in $\mathcal{K}$ does not overlap any path in the current $\mathcal{K}^{*}$, we add it to $\mathcal{K}^{*}$. For $k=1$, we bypass Steps 2–8. In Steps 9–17, we step through the ${\bm{k}}_{i}^{*}\in\mathcal{K}^{*}$, starting from the linestring with the highest flow and length. In Steps 11–12, if the current flow linestring ${\bm{k}}_{i}^{*}$ has no incident edges in the candidate linestrings $\mathcal{F}_{\rm cand}$, then we set it to be a reference linestring. In Step 13, we search for candidate linestrings for ${\bm{k}}_{i}^{*}$, from those linestrings which are not already reference or candidate linestrings, according to Equation (1). In Steps 14–17, if there is at least one candidate linestring, then we update $\mathcal{F}_{{\rm ref}},\mathcal{F}_{\rm cand}$ and $\mathcal{F}_{\rm ct}$. In Step 18, we extract the candidate $\mathcal{F}_{\rm cand}$ and candidate-touching $\mathcal{F}_{\rm ct}$ linestring sets, since the reference linestring set is already computed as $\mathcal{F}_{\rm ref}$.

Algorithm 6 ST_LINEBLEND_PRIORITY – Compute line blending priority

Input: $\mathcal{F}$ flow linestrings, $k$ #edges, $\varepsilon$ blend tolerance
Output: $(\mathcal{F}_{\rm ref},\mathcal{F}_{\rm cand},\mathcal{F}_{\rm ct})$ reference, candidate, candidate-touching linestrings
1: Initialise network graph $\mathcal{N}^{*}$ from linestrings $\mathcal{F}$
2: Extract all simple paths of length $k$ edges $\mathcal{K}:=\{{\bm{k}}_{1},\dots,{\bm{k}}_{n_{\mathcal{K}}}\}$ from $\mathcal{N}^{*}$
3: for $i:=1$ to $n_{\mathcal{K}}$ do
4:     $f_{i}$ := weighted mean of $k$ flows, weight := len(Edges(${\bm{k}}_{i}$))
5: Sort $\mathcal{K}$ by descending flow and combined length
6: Initialise $\mathcal{K}^{*}:=\{{\bm{k}}_{1}\}$
7: for $i:=2$ to $n_{\mathcal{K}}$ do
8:     if (${\bm{k}}_{i}\cap\mathcal{K}^{*}=\{\}$) then $\mathcal{K}^{*}:=\mathcal{K}^{*}\cup\{{\bm{k}}_{i}\}$
9: Initialise $\mathcal{Q}:=\mathcal{F}_{\rm ref}:=\mathcal{F}_{\rm cand}:=\mathcal{F}_{\rm ct}:=\{\}$
10: for $i:=1$ to $n_{\mathcal{K}^{*}}$ do
11:     if (${\rm Edges}({\bm{k}}_{i}^{*})\notin\mathcal{F}_{\rm cand}$) then
12:         Set reference linestring ${\bm{k}}_{{\rm ref}}:={\bm{k}}_{i}^{*}$
13:         Search candidate linestrings $\mathcal{F}_{{\rm cand}}^{*}:=\{{\bm{f}}\in\mathcal{F}\backslash({\rm Edges}(\mathcal{F}_{\rm ref})\cup\{{\bm{k}}_{\rm ref}\}\cup\mathcal{F}_{\rm cand}):{\bm{f}}\subseteq{\tt ST\_BUFFER}({\bm{k}}_{\rm ref},\varepsilon),f\leq f_{\rm ref}\}$
14:         if $\mathcal{F}_{\rm cand}^{*}\neq\{\}$ then
15:             Update $\mathcal{F}_{\rm ref}:=\mathcal{F}_{\rm ref}\cup\{{\bm{k}}_{\rm ref}\},\mathcal{F}_{\rm cand}:=\mathcal{F}_{\rm cand}\cup\mathcal{F}_{\rm cand}^{*}$
16:             Search touches $\mathcal{F}_{{\rm ct}}^{*}:=\{{\bm{f}}\in\mathcal{F}\backslash({\rm Edges}(\mathcal{F}_{\rm ref})\cup\mathcal{F}_{\rm cand}):{\tt ST\_TOUCHES}(\mathcal{F}_{{\rm cand}}^{*},{\bm{f}})\}$
17:             Update $\mathcal{Q}:=\mathcal{Q}\cup\{(\{{\bm{k}}_{\rm ref}\},\mathcal{F}_{\rm cand}^{*},\mathcal{F}_{\rm ct}^{*})\}$
18: Extract $\mathcal{F}_{\rm cand}:=\{\mathcal{F}^{*}_{{\rm cand},1},\dots,\mathcal{F}^{*}_{{\rm cand},n_{\mathcal{Q}}}\},\mathcal{F}_{\rm ct}:=\{\mathcal{F}^{*}_{{\rm ct},1},\dots,\mathcal{F}^{*}_{{\rm ct},n_{\mathcal{Q}}}\}$ from $\mathcal{Q}$
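The ranking and greedy selection of non-overlapping $k$-edge paths can be sketched as follows. This is a simplified illustration: paths are tuples of hypothetical edge identifiers, and two paths are taken to overlap when they share an edge, which is our assumption about the overlap test rather than a detail given in the text.

```python
def path_key(path, flow, length):
    """Sort key for a path: (length-weighted mean flow, combined length)."""
    total = sum(length[e] for e in path)
    mean_flow = sum(flow[e] * length[e] for e in path) / total
    return (mean_flow, total)


def select_reference_paths(paths, flow, length):
    """Greedily keep the highest-ranked paths that share no edge with those already kept."""
    ranked = sorted(paths, key=lambda p: path_key(p, flow, length), reverse=True)
    selected, used_edges = [], set()
    for path in ranked:
        if used_edges.isdisjoint(path):
            selected.append(path)
            used_edges.update(path)
    return selected


# Hypothetical chain of four edges and its three 2-edge paths.
flow = {"e1": 10, "e2": 8, "e3": 6, "e4": 2}
length = {"e1": 5.0, "e2": 5.0, "e3": 5.0, "e4": 5.0}
paths = [("e2", "e3"), ("e1", "e2"), ("e3", "e4")]
print(select_reference_paths(paths, flow, length))
```

Because the paths are visited in descending order of flow and then length, a path is only rejected in favour of one that ranks at least as highly, and ties in flow are broken in favour of the longer path.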

The local alignment of road segments, in Algorithm 7, is established by combining Algorithms 2–6. Its inputs are the flow linestrings $\mathcal{F}$, the split node type $S$, the number of connected edges for the reference linestrings $k$, the blend tolerance $\varepsilon$ and the snap tolerance $\varepsilon_{S}$. In Step 1, we node split the flow linestrings $\mathcal{F}$. In Step 2, we node snap the flow linestrings. In Step 3, we set up the priority for the line blending, and in Step 4 we store all the linestrings that will not be modified by the line blending in $\mathcal{F}^{c}$. In Steps 5–9, we iterate over each reference linestring. In Step 6, we update the reference and candidate flow linestrings by blending the candidate linestrings. In Step 7, we apply ST_OVERLINE_PLANR within each of the $k$ edges of the updated reference linestrings. In Step 8, we compute the weighted mean flow, weighted by the length of the corresponding sub-segments, and assign it as the flow value to all sub-segments. The result from Steps 6–8 is a modified $k$-edge reference linestring with $k$ aggregated flows. In Step 9, we snap the candidate-touching linestrings to the reference linestring to maintain connectivity. In Steps 10–11, we collate the modified linestrings from Steps 5–9 with the unmodified linestrings $\mathcal{F}^{c}$ from Step 4, and sort them in descending order of flow and length.

Algorithm 7 ST_OVERLINE_LINEBLEND – Locally align road segment flows using line blending

Input: $\mathcal{F}$ flow linestrings, $S$ split node, $k$ #edges, $\varepsilon$ blend tolerance, $\varepsilon_{S}$ snap tolerance
Output: $\mathcal{F}^{*}$ aggregated aligned flow linestrings
1: $\mathcal{F}:={\tt ST\_SPLITNODE}(\mathcal{F},S)$
2: $\mathcal{F}:={\tt ST\_SNAPNODE}(\mathcal{F},\varepsilon_{S})$
3: $(\mathcal{F}_{\rm ref},\mathcal{F}_{\rm cand},\mathcal{F}_{\rm ct}):={\tt ST\_LINEBLEND\_PRIORITY}(\mathcal{F},k,\varepsilon)$
4: $\mathcal{F}^{c}:=\mathcal{F}\backslash({\rm Edges}(\mathcal{F}_{\rm ref})\cup\mathcal{F}_{\rm cand}\cup\mathcal{F}_{\rm ct})$
5: for $i:=1$ to $n_{\mathcal{F}_{\rm ref}}$ do
6:     $({\bm{f}}_{{\rm ref},i}^{*},\mathcal{F}_{{\rm cand},i}^{*}):={\tt ST\_LINEBLEND}({\bm{f}}_{{\rm ref},i},\mathcal{F}_{{\rm cand},i},\varepsilon)$
7:     ${\bm{f}}_{{\rm ref},i}^{*}:={\tt ST\_OVERLINE\_PLANR}({\bm{f}}_{{\rm ref},i}^{*}\cup\mathcal{F}_{{\rm cand},i}^{*})$
8:     $f^{*}_{{\rm ref},i}$ := weighted mean of ${\bm{f}}^{*}_{{\rm ref},i}$, weight $:=\mathrm{len}({\rm Edges}({\bm{f}}_{{\rm ref},i}^{*}))$
9:     $\mathcal{F}_{{\rm ct},i}^{*}:={\tt ST\_SNAP\_CAND\_TOUCH}({\bm{f}}_{{\rm ref},i}^{*},\mathcal{F}_{{\rm cand},i},\mathcal{F}_{{\rm ct},i},\varepsilon_{S})$
10: $\mathcal{F}^{*}:=\mathcal{F}_{\rm ref}^{*}\cup\mathcal{F}_{\rm ct}^{*}\cup\mathcal{F}^{c}$
11: Sort $\mathcal{F}^{*}$ in descending order of flow and length

We outline the overall workflow ST_OVERLINE to compute a minimal flow map, by combining pre- and post-processing with the iteration of local alignment. For Algorithm 8, the inputs are the map matched routes $\mathcal{M}$, the split node type $S$, the blend tolerance $\varepsilon$, the simplify tolerance $\varepsilon_{D}$, the snap tolerance $\varepsilon_{S}$, the number of connected edges in reference linestrings ${\bm{k}}$, and the maximum number of iterations $j_{\max}$. In Step 1, we apply ST_OVERLINE_PLANR to the map matched routes to produce an initial flow map $\mathcal{F}$. In Step 2, we apply ST_SPLITNODE to this flow map, because applying ST_SPLITNODE to a flow map is more robust than applying it to the map matched routes. In Step 3, we employ the standard ST_SIMPLIFY with simplify tolerance $\varepsilon_{D}$. The simplified linestrings are such that all modified segments are at most $\varepsilon_{D}$ distance from the unmodified segments (Ramer, 1972; Douglas and Peucker, 1973). These simplified linestrings usually lead to more overlapping segments, which assist the flow aggregation in Step 4. Steps 5–10 are the iteration of ST_OVERLINE_LINEBLEND for the $k$-edge reference linestrings. For each $k$ in ${\bm{k}}$, we iterate until the maximum number of iterations $j_{\max}$ is reached or two consecutive flow maps are identical. Within each iteration, we search for the $k$-edge reference linestrings, blend the candidate linestrings, snap the candidate-touching linestrings, and compute the flow aggregation. Steps 11–14 carry out some housekeeping in ST_OVERLINE_PRUNE, where we remove pseudo nodes and replace their incident edges with a concatenated edge with the weighted mean flow, and also remove some loops and leaf edges with low flow values.

Algorithm 8 ST_OVERLINE – Compute locally aligned flow map from map matched routes

Input: $\mathcal{M}$ map matched routes, $S$ split node, $\varepsilon_{D}$ simplify tolerance, ${\bm{k}}$ #edges, $j_{\max}$ max #iterations, $\varepsilon$ blend tolerance, $\varepsilon_{S}$ snap tolerance
Output: $\mathcal{F}$ aligned road segment flows
1: $\mathcal{F}:={\tt ST\_OVERLINE\_PLANR}(\mathcal{M})$
2: $\mathcal{F}:={\tt ST\_SPLITNODE}(\mathcal{F},S)$
3: $\mathcal{F}:={\tt ST\_SIMPLIFY}(\mathcal{F},\varepsilon_{D})$
4: $\mathcal{F}:={\tt ST\_OVERLINE\_PLANR}(\mathcal{F})$
5: for $k$ in ${\bm{k}}$ do
6:     $\mathcal{F}_{\rm prev}:=\{\}$ ; $j:=0$
7:     while $j<j_{\max}$ and $\mathcal{F}_{\rm prev}\neq\mathcal{F}$ do
8:         $\mathcal{F}_{\rm prev}:=\mathcal{F}$ ; $j:=j+1$
9:         $\mathcal{F}:={\tt ST\_OVERLINE\_LINEBLEND}(\mathcal{F},S,k,\varepsilon,\varepsilon_{S})$
10:         $\mathcal{F}:={\tt ST\_OVERLINE\_PLANR}(\mathcal{F})$
11: $\mathcal{F}_{\rm prev}:=\{\}$ ; $j:=0$
12: while $j<j_{\max}$ and $\mathcal{F}_{\rm prev}\neq\mathcal{F}$ do
13:     $\mathcal{F}_{\rm prev}:=\mathcal{F}$ ; $j:=j+1$
14:     $\mathcal{F}:={\tt ST\_OVERLINE\_PRUNE}(\mathcal{F})$
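The stopping rule shared by the two loops of the workflow (stop when the iteration cap is reached or when two consecutive flow maps are identical) can be sketched generically. The `toy_step` function below is a hypothetical stand-in for one pass of local alignment or pruning, used only to make the example runnable.

```python
def iterate_until_stable(flow_map, step, j_max):
    """Apply `step` repeatedly until the flow map stops changing or j_max passes are done."""
    prev, j = None, 0
    while j < j_max and prev != flow_map:
        prev, j = flow_map, j + 1
        flow_map = step(flow_map)
    return flow_map, j


def toy_step(segments):
    """Hypothetical pass: drop the lowest-flow segment while more than two remain."""
    return segments if len(segments) <= 2 else segments - {min(segments)}


result, n_passes = iterate_until_stable(frozenset({1, 2, 3, 4, 5}), toy_step, j_max=20)
print(sorted(result), n_passes)
```

In this formulation the last pass is a confirming pass that leaves the flow map unchanged, so convergence costs one pass more than the number of passes that actually modify the map.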

4 Results

In this section we compute a minimal flow map for the Hannover GPS trajectories. From the complete set of 1183 trajectories, we keep 1177 trajectories with length greater than $100$ m. We input the 1177 trajectories into ST_ROUTE with $M$ as the map matching and $R$ the route finding APIs from the Valhalla routing engine, and ${\bm{n}}_{W}=3,13,23,33,43,63,83$ waypoints. We employ the dockerised version 3.4.0 of the Valhalla routing engine (GIS OPS, 2023). Of these input trajectories, 1147 yield a sufficiently high quality match, where the Hausdorff distance $d_{\operatorname{Haus}}(M^{*}(G),G)<100$ m, and the ratio $\operatorname{len}(M^{*}(G))/\operatorname{len}(G)<1.1$ . We continue the analysis with these 1147 map matched routes.
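The two match quality criteria can be sketched as below. This is a simplified illustration: the Hausdorff distance is approximated by its discrete, vertex-to-vertex version, which need not coincide with the distance computed by a geospatial library, and the function names and coordinates are hypothetical.

```python
from math import hypot


def line_length(line):
    """Total length of a polyline given as a list of (x, y) vertices in metres."""
    return sum(hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(line, line[1:]))


def directed_hausdorff(a, b):
    """Largest distance from a vertex of a to its nearest vertex of b."""
    return max(min(hypot(p[0] - q[0], p[1] - q[1]) for q in b) for p in a)


def hausdorff(a, b):
    """Discrete (vertex-based) Hausdorff distance between two polylines."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def is_good_match(traj, route, max_haus=100.0, max_ratio=1.1):
    """Apply the Hausdorff distance and length ratio thresholds to one matched route."""
    return (hausdorff(route, traj) < max_haus
            and line_length(route) / line_length(traj) < max_ratio)


traj = [(0.0, 0.0), (50.0, 5.0), (100.0, 0.0)]      # hypothetical GPS trajectory
close_route = [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0)]
detour_route = [(0.0, 0.0), (50.0, 80.0), (100.0, 0.0)]
print(is_good_match(traj, close_route))    # close and of similar length
print(is_good_match(traj, detour_route))   # within 100 m but much longer
```

The two criteria are complementary: the Hausdorff distance rejects routes that stray far from the trajectory, whereas the length ratio rejects routes with detours that stay nearby but add substantial length.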

For all iterations, we set $j_{\max}=20$. For the line blending, we begin with 1 iteration of ST_OVERLINE (Steps 1–4) with simplify tolerance $\varepsilon_{D}=1~\mathrm{m}$. We follow with ST_OVERLINE (Steps 5–14) with blend and snap tolerances $\varepsilon=\varepsilon_{S}=4~\mathrm{m}$, #edges ${\bm{k}}=1,2$, and node split type $S=$ ‘subdivision’. These simplify, blend and snap tolerances were chosen heuristically as a trade-off between being sufficiently large to account for the noisiness of the GPS trajectories and the map matching/route finding APIs, whilst not being so large as to obscure separate road segments within regions with a dense road network. We set ${\bm{k}}=1,2$ since the search for connected $k$-edges with $k>2$ is too computationally intensive for our setup due to the large number of road segments (13 495). We set $S=$ ‘subdivision’ node splitting, as it provides a more stable line blending priority at this early stage. We follow with 1 iteration of ST_OVERLINE (Steps 5–14), with the same tuning parameter choices, except with $S=$ ‘unary’. ‘Unary’ node splitting is usually applied after an iteration of ‘subdivision’ node splitting since the former can now add any missing intersection nodes without adversely affecting the line blending priority. We end with 2 iterations of ST_OVERLINE (Steps 5–14), with $\varepsilon=\varepsilon_{S}=5~\mathrm{m},{\bm{k}}=1,2,3,4,S=$ ‘unary’, since the larger values for $\varepsilon,\varepsilon_{S}$ capture any line blending missed at 4 m, and the connected 3-, 4-edge searches become computationally feasible with the lower number of road segments. These 5 iterations of ST_OVERLINE converge to a minimal flow map with 1 413 road segments, i.e. an 89.5% reduction of the 13 495 segments in the initial flow map.

Figure 9 illustrates the results of these iterations. In Figure 9(a) are the GPS trajectories (green circles) and their map matched routes resulting from ST_ROUTE (blue lines). Whilst the map matched routes no longer completely obscure the road network, they remain misaligned to each other and to the road network. So it is not possible to accurately estimate the traffic flow map directly from them. In Figure 9(b) is a flow map resulting from ST_OVERLINE (Steps 1–4) with $S=$ ‘subdivision’. This is similar to the flow map that would be obtained by following Morgan and Lovelace (2021) without rasterisation. Since each linestring is labelled by its flow value, the crowding of the labels indicates that this is unlikely to be a minimal flow map. In Figure 9(c), we complete the line blending iteration ST_OVERLINE (Steps 5–14) with $S=$ ‘subdivision’. The crowding of the flow value labels is reduced so we are progressing to a minimal flow map. In Figure 9(d), we carry out a further 3 line blending iterations ST_OVERLINE (Steps 5–14) with $S=$ ‘unary’ to arrive at a minimal flow map.

4.1 Validation

A visual inspection of the minimal flow map in Figure 9(d) reveals good alignment to the OSM road network in general, though there remain some data artefacts elsewhere. For example, in Figure 10(a), the green GPS trajectories give no indication of the blue wonky route output by ST_ROUTE, and in Figure 10(c), the GPS trajectories do not involve a loop, though the ST_ROUTE output contains a loop. Since our local alignment does not use an external road network to correct these wonky routes or extraneous loops in the map matched routes, they are propagated to the flow maps in Figure 10(a, d). Overall, these types of data artefacts are small and infrequent, and are associated with road segments with low flow.

For a more quantitative validation of the accuracy of our proposed flow map, we require a gold standard reference flow map. The experimental design of the Hannover GPS trajectories is to serve as a reference data set for learning turning rules at road junctions, rather than as a reference flow map (Zourlidou et al., 2022). So we have to compute a proxy reference flow map. For this, we first compute line transects $\mathcal{T}_{i}=\{{\bm{t}}_{i,1},\dots,{\bm{t}}_{i,n_{\mathcal{T}_{i}}}\}={\tt ST\_TRANSECT}({\bm{f}}_{i},\varepsilon_{T},\delta_{T})$ for a road linestring ${\bm{f}}_{i}\in\mathcal{F}$, where a line transect is a line segment (of length $2\varepsilon_{T}$ m) orthogonal to ${\bm{f}}_{i}$. If $\operatorname{len}({\bm{f}}_{i})>\delta_{T}$ m, then these transects are placed at every $\delta_{T}$ m of ${\bm{f}}_{i}$, or if $\operatorname{len}({\bm{f}}_{i})\leq\delta_{T}$ m then a single transect is placed at the centroid of ${\bm{f}}_{i}$. The empirical proxy flow at a line transect is the number of intersection points between the map matched routes in $\mathcal{M}$ and this transect, $f_{i,j}^{\mathrm{emp}}=\#\{{\tt ST\_INTERSECTION}(\mathcal{M},{\bm{t}}_{i,j}):{\bm{t}}_{i,j}\in\mathcal{T}_{i}\},j=1,\dots,n_{\mathcal{T}_{i}}$, and the mean empirical proxy flow over all line transects of ${\bm{f}}_{i}$ is $\bar{f}_{i}^{\mathrm{emp}}=\mathrm{mean}\{f_{i,j}^{\mathrm{emp}},j=1,\dots,n_{\mathcal{T}_{i}}\}$. Our error measure of ${\bm{f}}_{i}$ is the absolute difference $\operatorname{Err}_{i,j}=|f_{i}-\bar{f}_{i}^{\mathrm{emp}}|$ for all line transects ${\bm{t}}_{i,j},j=1,\dots,n_{\mathcal{T}_{i}}$ which intersect ${\bm{f}}_{i}$. We also compute the relative error $\operatorname{RErr}_{i,j}=\operatorname{Err}_{i,j}/f_{i}$, which is well-defined since $f_{i}\geq 1$ for all ${\bm{f}}_{i}\in\mathcal{F}$.

With the tuning parameters $\varepsilon_{T}=5$ m and $\delta_{T}=50$ m, we compute 7 380 line transects. The colour (purple to orange) of the line transects is proportional to the absolute error. In Figure 11(a), the vast majority of the error values are low (purple), with only a few high error values (orange), in/near the solid black rectangle. The zoom of the black rectangle is given in Figure 11(b), along with the road segment flows in blue. The road segment ${\bm{f}}_{11}$ (thick blue segment), with estimated flow $f_{11}=238$ and empirical proxy flow $\bar{f}_{11}^{\mathrm{emp}}=254$, has a high absolute error, i.e. $\operatorname{Err}_{11}=16,\operatorname{RErr}_{11}=0.07$. The road segment ${\bm{f}}_{331}$, with $f_{331}=25,\bar{f}_{331}^{\mathrm{emp}}=12$, has a higher relative error, i.e. $\operatorname{Err}_{331}=13,\operatorname{RErr}_{331}=0.52$. These errors arise because these road segments, at an earlier iteration of the flow map, comprised shorter sub-segments with higher flows, due to the presence of more than $100$ GPS trajectories passing near the intersection of ${\bm{f}}_{11}$ and ${\bm{f}}_{331}$. Since the nodes of these sub-segments are pseudo nodes, they were removed, and the concatenated sub-segments were assigned the single weighted mean flow = 25.
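The two error measures can be computed directly from the estimated flow of a road segment and the proxy flows counted at its transects. The sketch below uses a hypothetical function name, and the per-transect counts are invented for illustration; only their mean (254) and the estimated flow (238) correspond to values reported above.

```python
def flow_errors(f_est, f_emp):
    """Absolute and relative error of an estimated flow against its transect proxy flows."""
    mean_emp = sum(f_emp) / len(f_emp)   # mean empirical proxy flow over the transects
    err = abs(f_est - mean_emp)          # absolute error
    return err, err / f_est              # relative error, defined since f_est >= 1


# Invented per-transect counts with mean 254, against an estimated flow of 238.
err, rerr = flow_errors(238, [250, 254, 258])
print(err, round(rerr, 2))
```

The same absolute error weighs more heavily on a low-flow segment in relative terms, which is why both measures are reported.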

To supplement the visual examination of these flow estimation errors, Figure 12 shows the bivariate (Err, RErr) histogram plot of the errors. The vast majority (6518 or 88.3%), in the orange hexagonal bin, have zero absolute and relative error. Only 39 (0.53%) have Err $>4$, 364 (4.9%) have RErr $>0.1$, and only 16 (0.21%) have both Err $>4$ and RErr $>0.1$; the latter comprise most of the purple bins.

This demonstrates that our flow map is mostly accurate as it usually controls the estimation error. By accuracy, we mean how close the estimated flows are to the empirical proxy flows. However accuracy is insufficient on its own: if we take the extreme situation of a flow map with a single road segment with zero error, then this has the highest accuracy, but we have no estimated flows outside of this single segment. So we require that the spatial coverage of the flow map also be high. We cannot resolve this question of spatial coverage unambiguously since we do not have a gold standard flow map, though we can verify that all $n=1147$ map matched routes intersect at least one line transect from the estimated flow map. Whilst this calculation does not exclude that some regions of some map matched routes are without nearby line transects, we can further verify with a visual inspection that these regions tend to be small in area and/or comprise a low number of routes. So we claim that our minimal flow map has high levels of accuracy and spatial coverage.

To conclude the validation of our flow maps, the flow map at the city level is illustrated in Figure 13(a). The orange segments with high flows are apparent, whereas these high flows were not apparent from the scatter plot of the GPS trajectories in Figure 1(a). The desire line map (or spider diagram) is in Figure 13(b) and represents the traffic flows between the origin/destination hub nodes (black solid circles). These hub nodes are the centroids from a hierarchical clustering, similar to that in ST_SNAPNODE, of the boundary points of GPS trajectories with the cutoff at 5 000 m. The straight lines indicate the GPS trajectories whose origin and destination are associated with different hub nodes, and the circles indicate the trajectories whose origin and destination are associated with the same hub node. The desire line map is in effect a low resolution map of straight-line flows between the hub nodes only, whereas the high resolution map shows the traffic flows on all road segments.

4.2 Software

These analysis algorithms have been developed in R since the complexity of Algorithms 1–8 requires a mix of advanced geospatial and statistical methods. As an interpreted language, R can have slower execution times. There are two main computational bottlenecks. The first consists of the map matching/route finding in Algorithm 1. The Valhalla routing engine APIs are available as a web-based service (e.g. https://valhalla1.openstreetmap.de) or as a local dockerised image (e.g. https://github.com/gis-ops/docker-valhalla). We use the latter as it allows for faster computation since these local API requests are not sent to a remote web-based server, and can be parallelised on a stand-alone machine. We conduct a small study of the execution times based on 10 replicates of ST_ROUTE on 10 randomly selected GPS trajectories on an Intel i5 Quad core 3.10 GHz machine running Ubuntu 22.04 and R 4.4.0. Executing ST_ROUTE with a local Valhalla API is around 7.1 times faster than the web-based API, and a parallelisation (with 3 workers) is around 1.8 times faster than a serial computation. This is less than 3 because the dockerised image is not optimised for simultaneous API calls. Combining these together, a parallelised local API achieves around a 12.6 fold speed improvement in comparison to a serial web-based API.

The second bottleneck is the line blending in Algorithm 7. Based on the execution times of 10 replicates of ST_OVERLINE on a subset of 433 GPS trajectories, 3-worker parallelisation is around 1.8 times faster than a serial computation. This is less than 3 because only the repeated application of line blending (Algorithm 4) is parallelised, whilst the line blending priority (Algorithm 6) remains a serial computation. These speed factors are intended to be illustrative, since execution times involving remote web server APIs and parallelisation are difficult to predict on different internet connections and machines. We tentatively claim that a local Valhalla API reduces the execution time by an order of magnitude, whereas parallelisation reduces it almost linearly by the number of workers.

We anticipate releasing an add-on package on CRAN (https://cran.r-project.org), which is the main R package repository. Since our R add-on package is under development, in the meantime we provide a geopackage and QGIS project with the input GPS trajectories, map matched routes, iterated flow maps and desire lines, as listed in Table 1. The interested reader is able to explore interactively in QGIS the added value of our proposed high resolution minimal flow map flowmap4, in comparison to the input trajectories traj, the map matched routes route, the desire line flow map flowmap_desire, and the flow map computed according to a leading alternative without rasterisation (Morgan and Lovelace, 2021), which is similar to flowmap0.

Layer	Description	$n$
traj	Empirical GPS trajectories	1 147
route	Map matched routes ST_ROUTE	1 147
flowmap0	Flow map ST_OVERLINE(1–4), $\varepsilon_{D}=1$, $S=$ ‘subdivision’	13 495
flowmap1	Flow map ST_OVERLINE(5–14), $\bm{k}=1,2$, $\varepsilon=\varepsilon_{S}=4$, $S=$ ‘subdivision’	2 437
flowmap2	Flow map ST_OVERLINE(5–14), $\bm{k}=1,2$, $\varepsilon=\varepsilon_{S}=4$, $S=$ ‘unary’	1 953
flowmap3	Flow map ST_OVERLINE(5–14), $\bm{k}=1,2,3,4$, $\varepsilon=\varepsilon_{S}=5$, $S=$ ‘unary’	1 867
flowmap4	Flow map ST_OVERLINE(5–14), $\bm{k}=1,2,3,4$, $\varepsilon=\varepsilon_{S}=5$, $S=$ ‘unary’	1 413
flowmap_desire	Desire lines ST_DESIRELINE, $\varepsilon=5000$	41

Table 1: Geospatial layers in the geopackage. The first column is the layer name, the second is the description, and the third is the number of geospatial features $n$.
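The feature counts in Table 1 can be checked without a GIS, because a GeoPackage is an SQLite database whose layers are registered in the standard `gpkg_contents` table. The following is a minimal sketch using only the Python standard library. The function name and the file name in the usage note are hypothetical.

```python
import sqlite3


def layer_feature_counts(gpkg_path):
    """Return {layer name: number of features} for the feature layers in a GeoPackage."""
    con = sqlite3.connect(gpkg_path)
    try:
        # gpkg_contents lists every layer; data_type 'features' marks vector layers.
        names = [row[0] for row in con.execute(
            "SELECT table_name FROM gpkg_contents WHERE data_type = 'features'")]
        counts = {}
        for name in names:
            # Quote the table name, since layer names are stored as plain text.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = con.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        con.close()
```

For example, `layer_feature_counts("flowmaps.gpkg")`, where the file name is a placeholder for the provided geopackage, would return one entry per layer to compare with the third column of Table 1.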

5 Conclusion

We have introduced novel analysis algorithms to compute a flow map from empirical GPS trajectories. Our starting point is to focus on aligning segments of the map matched routes rather than the complete routes. We define a spatial relation to set up local reference road segments, which allows us to align other nearby road segments to these local reference segments. This local alignment is the key innovation in computing a minimal flow map that is aligned to the underlying road network. We have presented solid evidence for the high level of spatial resolution, accuracy and coverage of our proposed minimal flow map. Since it accurately shows the traffic flow on all road segments at all scales, it provides added value in comparison to the empirical GPS trajectories, to the low resolution desire lines map, and to existing high resolution flow map methodologies.

References

Andrienko and Andrienko (2013) Andrienko, N. and G. Andrienko (2013). Visual analytics of movement: An overview of methods, tools and procedures. Information Visualization 12, 3–24.
Chao et al. (2020) Chao, P., Y. Xu, W. Hua, and X. Zhou (2020). A survey on map-matching algorithms. In R. Borovica-Gajic, J. Qi, and W. Wang (Eds.), Databases Theory and Applications, Volume 12008, pp. 121–133. Springer International Publishing.
Douglas and Peucker (1973) Douglas, D. H. and T. K. Peucker (1973). Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica (The Canadian Cartographer) 10, 112–122.
Dunnington and Pebesma (2023) Dunnington, D. and E. Pebesma (2023). geos: Open Source Geometry Engine (’GEOS’) R API. R package version 0.2.4. https://github.com/paleolimbot/geos.
Evans (1976) Evans, S. P. (1976). Derivation and analysis of some models for combining trip distribution and assignment. Transportation Research 10, 37–57.
Giorgino (2009) Giorgino, T. (2009). Computing and visualizing dynamic time warping alignments in R: The dtw package. Journal of Statistical Software 31(7), 1–24.
GIS OPS (2023) GIS OPS (2023). Valhalla routing engine version 3.4.0 [Docker]. https://github.com/gis-ops/docker-valhalla.
Gordon (1999) Gordon, A. D. (1999). Classification (2nd ed.). London: Chapman and Hall/CRC.
Herrera et al. (2010) Herrera, J. C., D. B. Work, R. Herring, X. Ban, Q. Jacobson, and A. M. Bayen (2010). Evaluation of traffic data obtained via GPS-enabled mobile phones: The Mobile Century field experiment. Transportation Research Part C: Emerging Technologies 18, 568–583.
Lovelace and Ellison (2018) Lovelace, R. and R. Ellison (2018). stplanr: A package for transport planning. The R Journal 10, 7–23.
Lovelace et al. (2019) Lovelace, R., J. Nowosad, and J. Muenchow (2019). Geocomputation with R. Chapman and Hall/CRC.
Morgan and Lovelace (2021) Morgan, M. and R. Lovelace (2021). Travel flow aggregation: Nationally scalable methods for interactive and online visualisation of transport behaviour at the road network level. Environment and Planning B: Urban Analytics and City Science 48, 1684–1696.
Müllner (2013) Müllner, D. (2013). fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software 53(9), 1–18.
Necula (2015) Necula, E. (2015). Analyzing traffic patterns on street segments based on GPS data using R. Transportation Research Procedia 10, 276–285.
OGC (2010) OGC (2010). OpenGIS implementation standard for geographic information - Simple feature access - Part 1: Common architecture. Version 1.2.1.
Ortúzar and Willumsen (2011) Ortúzar, J. D. and L. G. Willumsen (2011). Modelling Transport (4th ed.). Hoboken: Wiley.
Pebesma (2018) Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal 10, 439–446.
Quddus et al. (2007) Quddus, M. A., W. Y. Ochieng, and R. B. Noland (2007). Current map-matching algorithms for transport applications: State-of-the art and future research directions. Transportation Research Part C: Emerging Technologies 15, 312–328.
Ramer (1972) Ramer, U. (1972). An iterative procedure for the polygonal approximation of plane curves. Computer Graphics and Image Processing 1, 244–256.
Saki and Hagen (2022) Saki, S. and T. Hagen (2022). A practical guide to an open-source map-matching approach for Big GPS Data. SN Computer Science 3(5), 415.
Sakoe and Chiba (1978) Sakoe, H. and S. Chiba (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 43–49.
Tobler (1987) Tobler, W. R. (1987). Experiments in migration mapping by computer. The American Cartographer 14, 155–163.
van der Meer et al. (2023) van der Meer, L., L. Abad, A. Gilardi, and R. Lovelace (2023). sfnetworks: Tidy Geospatial Networks. R package version 0.6.4. https://luukvdmeer.github.io/sfnetworks.
Wood et al. (2010) Wood, J., J. Dykes, and A. Slingsby (2010). Visualisation of origins, destinations and flows with OD maps. The Cartographic Journal 47, 117–129.
Zhou et al. (2013) Zhou, H., P. Xu, X. Yuan, and H. Qu (2013). Edge bundling in information visualization. Tsinghua Science and Technology 18, 145–156.
Zourlidou et al. (2022) Zourlidou, S., J. Golze, and M. Sester (2022). Dataset: GPS trajectory dataset of the region of Hannover, Germany. https://doi.org/10.25835/9bidqxvl.