Handling memory allocation with pgRouting in centrality analysis

Hello,

First of all, thank you for your time. I hope this problem is shared by others in the community and that this thread is helpful for them too.

I am using pgRouting to assess centrality before and after the flooding in Rio Grande do Sul (Brazil). Calculating the centrality was easy using an origin-destination (OD) matrix, as Daniel Patterson mentioned in his post and as Matthew Forrest and Regina Obe describe in their books. However, increasing the size of the network or the number of origins and destinations returned:

ERROR:  invalid memory alloc request size 2450771848

Apart from configuration tips (post), users have reported different methods to optimize pgr_dijkstra. The “array_agg” method uses the function array_agg() to build the start vid and end vid parameters of pgr_dijkstra. The “st_bbox()” method only applies Dijkstra where a bounding box intersects the start vids (origins) and end vids (destinations), reducing the number of edge rows. The simple or naive method uses ARRAY() subqueries that constitute the OD matrix, with a LEFT JOIN adding the geometry from the original network table via the edge ID.

--- simple or naive approach
SELECT
    b.id,
    b.the_geom,
    count(b.the_geom) AS centrality
FROM pgr_dijkstra(
        'SELECT id, source, target, cost
         FROM porto_alegre_net_largest',
        ARRAY(SELECT net_id AS start_id FROM weight_sampling_100_origin),
        ARRAY(SELECT net_id AS end_id FROM weight_sampling_100_destination),
        directed := TRUE) AS j
    -- attach the edge geometry to every traversed edge
    LEFT JOIN porto_alegre_net_largest AS b ON j.edge = b.id
GROUP BY
    b.id,
    b.the_geom
ORDER BY
    centrality DESC;
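For reference, this is a hedged sketch of the array_agg() variant as I understand it from those posts (not a query taken verbatim from them): the only change from the naive approach is that the vertex arrays are built with the aggregate array_agg() in scalar subqueries.

--- array_agg() variant (sketch)
SELECT
    b.id,
    b.the_geom,
    count(b.the_geom) AS centrality
FROM pgr_dijkstra(
        'SELECT id, source, target, cost
         FROM porto_alegre_net_largest',
        (SELECT array_agg(net_id) FROM weight_sampling_100_origin),
        (SELECT array_agg(net_id) FROM weight_sampling_100_destination),
        directed := TRUE) AS j
    LEFT JOIN porto_alegre_net_largest AS b ON j.edge = b.id
GROUP BY b.id, b.the_geom
ORDER BY centrality DESC;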

I expected lower computing costs using bbox or array_agg(); however, the preliminary performance results show otherwise. The simple way, using pgr_dijkstra() with an array of origins and destinations, achieved better performance, returning the results in less time.

Then I used this naive method with different algorithms. I expected pgr_astar() to be faster than pgr_dijkstra(), since pgr_astar() does not need to explore every segment of the network, unlike pgr_dijkstra(), which explores the whole network. The results indicated that pgr_dijkstra() was again the algorithm that required the least time to return the results.

This is the Google Drive link to the data to reproduce the results. These are the files:

  • network folder: contains “porto_alegre_net_largest.geojson”, which represents the network of the study area before the flooding, and its respective vertices table “porto_alegre_net_largest_vertices_pgr.geojson”. There are similar tables for running pgr_astar().
  • origin_destination_matrix: contains “weight_sampling_100_origin.geojson” & “weight_sampling_100_destination.geojson”, which represent the origin and destination points. There are similar tables for 200, 300 and 500 points; 500 points is the size I currently cannot run.
  • metrics: the CSV files “metrics_algorithm_naive_osgeo.csv” & “metrics_and_quries_method_dijkstra_osgeo.csv” contain the query used, its performance (EXPLAIN ANALYZE), and the method or algorithm.

Thank you very much for your time and attention! Your help would be very valuable for improving this analysis, which aims to identify the most central roads.

(Images of the flooding from global-flood.emergency.copernicus)

Hi @ricardo_rs

There are many centrality measures: degree centrality, closeness centrality, harmonic centrality, betweenness centrality, etc.

In the pgRouting develop branch there is a new function, pgr_betweennessCentrality. It’s the only centrality measure that we have; I don’t know if it’s the one you are looking for.
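A hedged sketch of how I would expect it to be called (the function is still in develop, so the exact signature may change):

--- sketch: betweenness centrality over the whole network (develop branch)
SELECT *
FROM pgr_betweennessCentrality(
    'SELECT id, source, target, cost, reverse_cost
     FROM porto_alegre_net_largest');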

A thing to mention is the way pgRouting 3.6.2 (3.6.3 soon to be released) and under works:
data on the database
→
data saved on a C structure
→
data saved on a C++ container
→
data saved on a boost graph
So there are 3 copies (saved one way or another) of the same data

Starting from pgRouting 3.7.0
data on the database
→
data saved on a C++ container
→
data saved on a boost graph
So there will be only 2 copies (saved one way or another) of the same data

Add to all of that the structures needed to store calculations and results.

So this error you are getting:

ERROR:  invalid memory alloc request size 2450771848

Might happen because of memory issues.
There is interesting documentation about the problem in the allpairs family documentation.
Note that in that documentation, for 3247 edges it gives 4868919 output rows.

Betweenness centrality works more or less in the same way as all pairs: it performs a Dijkstra from all vertices to all vertices, so the result size will approximately be:
bytes = vertices^2 rows * the byte size of a result row
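As a rough back-of-the-envelope illustration (the 64 bytes per result row is my assumption, not a figure from pgRouting):

2,450,771,848 bytes / 64 bytes per row ≈ 38 million rows ≈ 6,200^2 vertex pairs

Note also that PostgreSQL refuses any single memory allocation larger than 1 GB (MaxAllocSize), so a ~2.4 GB request fails with this error no matter how much RAM the machine has.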


To be able to test your work, can you give the information in the following form:

  • A database dump of the tables involved
  • A query script that executes the different sizes on that database
    (i.e., for me it is just a command to get the database working and ready to test)

Right now I do not have the time to figure out how to load so many files into a database, and then, from a single query, reconstruct all the queries that you have surely been running.

Regards
Vicky


Agree with Vicky, upload your database dump and scripts so we can test it.

Using the bbox method is not a sound strategy, because in some cases the best route might fall outside of your bbox.

You forgot to mention how much RAM you’re using. I’ve previously worked with Brazilian states and I didn’t have memory allocation issues.

Hi @cvvergara and @rnanclares,

Thank you so much for your quick response!

I can try pgr_betweennessCentrality to compare the results. We are using edge betweenness centrality, but instead of considering all possible pairs of nodes in the network, we created a set of origins and destinations. In any case, the new function seems totally relevant for the analysis! Thank you very much; I will see how I can use it (I still have a lot to learn).

It also sounds to me that this will be an important new feature. That’s great, looking forward to the 3.6.3 release.

Sharing the data as a database dump now sounds like the best option to me; I was not aware of it. Please find the link with the data here. There are two files:

  • A database dump of the tables involved: osgeo_pgrouting_dump.sql
  • A query script that executes the different sizes on that database: performance_quries.sql

Very well observed; it is indeed a limitation. There is probably a way to add some distance to the bounding box (line 252 of the script) depending on the scale, for example +0.01°.
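A minimal sketch of that idea, using PostGIS ST_Expand to buffer the box; the table and column names below are assumptions, not the exact ones from performance_quries.sql:

--- Sketch: restrict the edges SQL to the bounding box of the origin
--- and destination points, expanded by 0.01 degrees on each side.
SELECT id, source, target, cost
FROM porto_alegre_net_largest
WHERE the_geom && ST_Expand(
    (SELECT ST_Extent(geom)
     FROM (SELECT the_geom AS geom FROM weight_sampling_100_origin
           UNION ALL
           SELECT the_geom FROM weight_sampling_100_destination) AS pts),
    0.01);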

I tried two options:

  • Option A: local machine with a total memory of 15 GB
  • Option B: server machine with a total memory of 47 GB

Again, thank you very much for your prompt and kind responses. Please, if there is a problem with any query or the database dump, let me know.


Just a clarification:
3.6.3 (soon to be released, likely next week) and under is the version that makes the 3 copies;
starting with 3.7.0 (in the develop branch, to be released in about a month) is the version that makes the 2 copies.
Try the develop branch.

@ricardo_rs
Can you please post here the detailed information about the structure of the queries, and the question that you sent privately?

Of course.

The structure of the queries in the performance_quries.sql file included with the data is the following:

  1. Create spatial indexes
    1.1. Spatial index for the network used by pgr_dijkstra
    1.2. Spatial index for the network used by pgr_astar
  2. Different methods using pgr_dijkstra
    2.1. naive (measuring centrality)
      2.1.1. 100 origins x 100 destinations, and likewise for 200, 300 and 500
    2.2. array_agg()
      2.2.1. Create the arrays (100, 200, 300, 500)
      2.2.2. dijkstra using array_agg() (100, 200, 300, 500)
    2.3. bounding box method (100, 200, 300, 500)
  3. Different algorithms using the naive approach
    3.1. pgr_dijkstra()
    3.2. pgr_bdDijkstra() (200, 300, 500)
    3.3. pgr_astar() (200, 300, 500)
    3.4. pgr_bdAstar() (200, 300, 500)
To sum up, three methods are applied with pgr_dijkstra on different sizes (100x100, 200x200, 300x300 & 500x500). Similarly, the four shortest-path algorithms (using a cost provided by OpenRouteService) are also applied to these four different sizes of origin-destination matrix.

The question is the following:

If this is a memory allocation problem, is there a way to process the shortest paths in “batches”?

The book PostGIS in Action mentions this in listing “13.20 Function to load streets in batches”. There is also a post that shared the same problem, where using DECLARE and CURSOR was recommended, and there is another approach without cursors. Since it seems other GIS processes can be done in batches, I wonder whether this is possible with pgRouting.

I have some free time on Friday the 18th at noon Mexico time (18:00 UTC).
What about meeting while I run the tests on my computer?
Our meetings take place at meet

That would be great!

Thank you so much, I will be there.


I did some testing today and I’m getting the malloc error with both the 500 naive and the 500 array_agg examples. With array_agg it takes a lot longer to fail, around 10 minutes, while the naive method fails in only ~5 seconds. With the naive method I see RAM usage ramp up fast until it reaches 5.5 GB, and then the query crashes.

SQL Error [XX000]: ERROR: invalid memory alloc request size 2450771848

which is exactly the same error you get.

To me it feels like one origin-destination combination in particular is causing the issue.

Hi

Thank you very much, Raul Nanclares, for testing. Good to know that we got the same result and similar performance. I wonder if there is a way to check whether a single origin-destination pair may be causing the problem.

cvvergara and robe kindly helped me with the issue (a thousand thanks). I hope to post the solution and the detailed tests at the end of this week.

In summary:

  • OpenRouteService network structure: the network table can be reduced by adding a reverse_cost column instead of duplicating the streets that run both ways. For example, for the same “bidirectid” there are two rows, each with a unique “unidirectid”; these can be collapsed into a single row with a reverse_cost column (see the sketch after this list).
  • Batch queries: instead of querying the 100*100 shortest paths all at once, query them in batches. There are two methods for this: a) add WHERE clauses on one of the vertex arrays (e.g., the destinations), processing only those below a threshold, repeating until all are processed, and then merge everything into one table; b) use pl/pgSQL DO … COMMIT.
  • Use the develop branch (3.7.0) of pgRouting: as cvvergara already mentioned in this post, that version makes only 2 copies of the data instead of 3, reducing the memory cost.
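Here is a minimal sketch of the reverse_cost idea. The column names bidirectid and unidirectid come from the discussion above; the remaining names (fromid, toid, weight, and the table itself) are assumptions about the OpenRouteService export:

--- Sketch: collapse the two directed rows of a two-way street into one
--- row with cost and reverse_cost (a negative reverse_cost tells
--- pgRouting that a one-way edge cannot be traversed backwards).
CREATE TABLE porto_alegre_net_reduced AS
SELECT
    f.unidirectid          AS id,
    f.fromid               AS source,
    f.toid                 AS target,
    f.weight               AS cost,
    COALESCE(r.weight, -1) AS reverse_cost,
    f.the_geom
FROM porto_alegre_net_pre AS f
LEFT JOIN porto_alegre_net_pre AS r
       ON r.bidirectid  = f.bidirectid      -- same street,
      AND r.unidirectid <> f.unidirectid    -- opposite direction
WHERE f.unidirectid <= COALESCE(r.unidirectid, f.unidirectid);  -- keep one row per pair

Since reverse_cost is taken from the opposite-direction row, this also keeps the cases where the two directions have different costs.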

Hi, the batch queries were a success.

I was able to test them using WHERE clauses, and I plan to also try the pl/pgSQL option next week. These are the three steps:

  1. Create a first table, “centrality_500_first_250”, that covers the first 250 destinations using an auxiliary column named “nrow”, created with row_number() OVER () AS nrow. Probably ORDER BY plus LIMIT would have been enough; I decided to use nrow to be sure.
  2. Create a second table, “centrality_500_second_250”, with the remaining 250 destinations.
  3. Merge them into the table “centrality_500_merged” and add the geometry from the network in the table “centrality_500_merged_with_geom”.
--- Step 1: all origins x the first 250 of the 500 destinations
CREATE TABLE centrality_500_first_250 AS
SELECT *
FROM pgr_dijkstra(
        'SELECT ogc_fid AS id,
                fromid  AS source,
                toid    AS target,
                weight  AS cost
         FROM porto_alegre_net_pre',
        ARRAY(SELECT net_id AS start_id FROM weight_sampling_500_origin),
        -- batching only the destinations means the two batches together
        -- cover the complete 500 x 500 OD matrix
        ARRAY(SELECT net_id AS end_id FROM weight_sampling_500_destination WHERE nrow <= 250),
        directed := TRUE);

---- Step 2: all origins x the remaining 250 destinations
CREATE TABLE centrality_500_second_250 AS
SELECT *
FROM pgr_dijkstra(
        'SELECT ogc_fid AS id,
                fromid  AS source,
                toid    AS target,
                weight  AS cost
         FROM porto_alegre_net_pre',
        ARRAY(SELECT net_id AS start_id FROM weight_sampling_500_origin),
        ARRAY(SELECT net_id AS end_id FROM weight_sampling_500_destination WHERE nrow > 250),
        directed := TRUE);

--- Step 3: merge both tables
CREATE TABLE centrality_500_merged AS
SELECT * FROM centrality_500_first_250
UNION ALL
SELECT * FROM centrality_500_second_250;

--- Add the geometry and calculate centrality
CREATE TABLE centrality_500_merged_with_geom AS
SELECT
    b.ogc_fid,
    b.the_geom,
    count(b.the_geom) AS centrality
FROM centrality_500_merged AS j
    LEFT JOIN porto_alegre_net_pre AS b ON j.edge = b.ogc_fid
GROUP BY b.ogc_fid, b.the_geom
ORDER BY centrality DESC;

I hope to write the pl/pgSQL batch query and compare its performance with the develop branch. Regarding transforming the OpenRouteService network structure: initially the reverse_cost contained the same values as cost; however, this is not always the case. On occasion, two-way streets have a different cost depending on the direction, so that option requires further adjustments.
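For the record, this is a minimal, untested sketch of how I understand the pl/pgSQL DO … COMMIT option; the batch size of 100 and the assumption that nrow runs from 1 to 500 are mine. It has to be run outside an explicit transaction block, since COMMIT is otherwise not allowed inside DO:

--- Sketch: run the OD matrix in destination batches of 100, committing
--- after each batch so every pgr_dijkstra call stays small.
CREATE TABLE centrality_500_batched (LIKE centrality_500_merged);

DO $$
DECLARE
    batch CONSTANT integer := 100;
BEGIN
    FOR lo IN 0..400 BY batch LOOP
        INSERT INTO centrality_500_batched
        SELECT *
        FROM pgr_dijkstra(
            'SELECT ogc_fid AS id, fromid AS source,
                    toid AS target, weight AS cost
             FROM porto_alegre_net_pre',
            ARRAY(SELECT net_id FROM weight_sampling_500_origin),
            ARRAY(SELECT net_id FROM weight_sampling_500_destination
                  WHERE nrow > lo AND nrow <= lo + batch),
            directed := TRUE);
        COMMIT;  -- end the transaction for this batch before starting the next
    END LOOP;
END
$$;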
