[Geoserver-devel] Postgis estimated_extent completely off the mark?

Andrea_Aime3 · March 20, 2007, 1:32pm

Hi,
some of you may know that in geotools we want to
add the ability to compute the bounds using estimated
extents to have a nice speedup, should the user decide
that the approximation of estimated extents is ok for him.

Hum, well, I tried this on my local db with the tiger roads network,
and the result are the followings:

tiger2005fe=# vacuum analyze major_roads;
VACUUM
tiger2005fe=# select estimated_extent('major_roads','gen_full');
estimated_extent
----------------------------------------------------------------------------
BOX(-136.358154296875 17.6881694793701,-64.7429351806641 59.4498023986816)
(1 row)

tiger2005fe=# select extent(gen_full) from major_roads
tiger2005fe-# ;
extent
----------------------------------------------------------------------------
BOX(-170.837387084961 -14.3779020309448,144.914001464844 66.9222869873047)

Hem, this is not 5% error, it's totally off the mark.
What's going on? I'm on windows, postgres 8.1.5.
The difference is the same in other tables that do
use polygonal features.

Some more info about the tables:

tiger2005fe=# select postgis_version();
postgis_version
---------------------------------------
1.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
(1 row)

tiger2005fe=# \d major_roads;
Table "public.major_roads"
Column | Type | Modifiers

Hum, may it be because the real geometry type is not declared?
Cheers
Andrea

Justin_Deoliveira · March 20, 2007, 3:13pm

Hi Andrea,

I am not sure here but I think for the function to be accurate within 5% the stats for the table have to be up to date. This can be done with a an ANALYZE on the table. Any change if you try analyzing the tale first?

Andrea Aime wrote:

Hi,
some of you may know that in geotools we want to
add the ability to compute the bounds using estimated
extents to have a nice speedup, should the user decide
that the approximation of estimated extents is ok for him.

Hum, well, I tried this on my local db with the tiger roads network,
and the result are the followings:

tiger2005fe=# vacuum analyze major_roads;
VACUUM
tiger2005fe=# select estimated_extent('major_roads','gen_full');
                               estimated_extent
----------------------------------------------------------------------------
  BOX(-136.358154296875 17.6881694793701,-64.7429351806641 59.4498023986816)
(1 row)

tiger2005fe=# select extent(gen_full) from major_roads
tiger2005fe-# ;
                                    extent
----------------------------------------------------------------------------
  BOX(-170.837387084961 -14.3779020309448,144.914001464844 66.9222869873047)

Hem, this is not 5% error, it's totally off the mark.
What's going on? I'm on windows, postgres 8.1.5.
The difference is the same in other tables that do
use polygonal features.

Some more info about the tables:

tiger2005fe=# select postgis_version();
             postgis_version
---------------------------------------
  1.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
(1 row)

tiger2005fe=# \d major_roads;
                              Table "public.major_roads"
     Column | Type | Modifiers

--------------+----------+------------------------------------------------------
-----
  state | text |
  gen_full | geometry |
  gen_1 | geometry |
  gen_2 | geometry |
  gen_3 | geometry |
  interstate | integer |
  ushighway | integer |
  statehighway | integer |
  othername | text |
  fid | integer | not null default nextval('major_roads_fid_seq'::regcl
ass)
Indexes:
     "fid_pkey" PRIMARY KEY, btree (fid)
     "major_roads_spatial_ind" gist (gen_full)
     "major_roads_spatial_ind1" gist (gen_1)
     "major_roads_spatial_ind2" gist (gen_2)
     "major_roads_spatial_ind3" gist (gen_3)

tiger2005fe=# select * from geometry_columns;
  f_table_catalog | f_table_schema | f_table_name | f_geometry_column | coord
_dimension | srid | type
-----------------+----------------+------------------+-------------------+------
-----------+------+----------
          2 | 1 | GEOMETRY
                  | public | major_roads | gen_full |
          2 | 1 | GEOMETRY
                  | public | major_roads | gen_1 |
          2 | 1 | GEOMETRY
                  | public | major_roads | gen_2 |
          2 | 1 | GEOMETRY
                  | public | major_roads | gen_3

Hum, may it be because the real geometry type is not declared?
Cheers
Andrea

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Geoserver-devel mailing list
Geoserver-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geoserver-devel

!DSPAM:4007,45ffe306282334820651628!

--
Justin Deoliveira
The Open Planning Project
http://topp.openplans.org

Andrea_Aime3 · March 20, 2007, 3:13pm

Andrea Aime ha scritto:

tiger2005fe=# vacuum analyze major_roads;
VACUUM
tiger2005fe=# select estimated_extent('major_roads','gen_full');
                               estimated_extent
----------------------------------------------------------------------------
  BOX(-136.358154296875 17.6881694793701,-64.7429351806641 59.4498023986816)
(1 row)

tiger2005fe=# select extent(gen_full) from major_roads
tiger2005fe-# ;
                                    extent
----------------------------------------------------------------------------
  BOX(-170.837387084961 -14.3779020309448,144.914001464844 66.9222869873047)

Ah, if you're wondering about how this is possible, my guess is that
estimated_extent is for some reason not considering Hawaii (still USA, but very very far away)

Cheers
Andrea

Andrea_Aime3 · March 20, 2007, 3:15pm

Justin Deoliveira ha scritto:

Hi Andrea,

I am not sure here but I think for the function to be accurate within 5% the stats for the table have to be up to date. This can be done with a an ANALYZE on the table. Any change if you try analyzing the tale first?

...

tiger2005fe=# vacuum analyze major_roads;
VACUUM
tiger2005fe=# select estimated_extent('major_roads','gen_full');
estimated_extent
----------------------------------------------------------------------------

BOX(-136.358154296875 17.6881694793701,-64.7429351806641

Look at the first statement in the psql log
Oh, I did run a full vacuum analyze too, without the table name.

Cheers
Andrea

pramsey · March 20, 2007, 3:17pm

Right, sampling.
Small enough that a random sample has a chance of missing them. Northern islands in Alaska, Hawaii, etc.

P

On 20-Mar-07, at 8:13 AM, Andrea Aime wrote:

Andrea Aime ha scritto:

tiger2005fe=# vacuum analyze major_roads;
VACUUM
tiger2005fe=# select estimated_extent('major_roads','gen_full');
                               estimated_extent
----------------------------------------------------------------------------
  BOX(-136.358154296875 17.6881694793701,-64.7429351806641 59.4498023986816)
(1 row)
tiger2005fe=# select extent(gen_full) from major_roads
tiger2005fe-# ;
                                    extent
----------------------------------------------------------------------------
  BOX(-170.837387084961 -14.3779020309448,144.914001464844 66.9222869873047)

Ah, if you're wondering about how this is possible, my guess is that
estimated_extent is for some reason not considering Hawaii (still USA, but very very far away)

Cheers
Andrea
_______________________________________________
postgis-users mailing list
postgis-users@anonymised.com
http://postgis.refractions.net/mailman/listinfo/postgis-users

Andrea_Aime3 · March 20, 2007, 3:20pm

Paul Ramsey ha scritto:

Right, sampling.
Small enough that a random sample has a chance of missing them. Northern islands in Alaska, Hawaii, etc.

So this is a feature, not a bug, apparently.
Heh, then the docs should say that estimated_extent is 5% off the
proper bounds if features are uniformly distributed in the actual
bounds
If you have data with strange distribution patters (such as USA
states) better not rely on it.

Cheers
Andrea

Mark_Cave-Ayland · March 21, 2007, 8:03pm

On Tue, 2007-03-20 at 16:20 +0100, Andrea Aime wrote:

Paul Ramsey ha scritto:
> Right, sampling.
> Small enough that a random sample has a chance of missing them.
> Northern islands in Alaska, Hawaii, etc.

So this is a feature, not a bug, apparently.
Heh, then the docs should say that estimated_extent is 5% off the
proper bounds if features are uniformly distributed in the actual
bounds
If you have data with strange distribution patters (such as USA
states) better not rely on it.

Cheers
Andrea

Yes, it's due to the way in which the sampling works. Note that you can
increase the number of sampled rows using ALTER TABLE x ALTER COLUMN y
SET STATISTICS z and then re-ANALYZING (the default value is 10, so
perhaps a value of 100 would provide better results). I'm not sure where
the figure of 5% from proper bounds comes from though - I would have
imagined it depends on the sample size relative to the population size,
but then I haven't studied statistics properly for several years now

Kind regards,

Mark.

pramsey · March 21, 2007, 8:10pm

On 21-Mar-07, at 1:03 PM, Mark Cave-Ayland wrote:

On Tue, 2007-03-20 at 16:20 +0100, Andrea Aime wrote:

Paul Ramsey ha scritto:

Right, sampling.
Small enough that a random sample has a chance of missing them.
Northern islands in Alaska, Hawaii, etc.

So this is a feature, not a bug, apparently.
Heh, then the docs should say that estimated_extent is 5% off the
proper bounds if features are uniformly distributed in the actual
bounds
If you have data with strange distribution patters (such as USA
states) better not rely on it.

Cheers
Andrea

Yes, it's due to the way in which the sampling works. Note that you can
increase the number of sampled rows using ALTER TABLE x ALTER COLUMN y
SET STATISTICS z and then re-ANALYZING (the default value is 10, so
perhaps a value of 100 would provide better results). I'm not sure where
the figure of 5% from proper bounds comes from though - I would have
imagined it depends on the sample size relative to the population size,
but then I haven't studied statistics properly for several years now

Counter-intuitively it is more tightly linked to the absolute size of the sample. A 1000 person sample of a population of 10000 does not have a markedly higher performance than a 1000 person sample of a population of 1000000.

P

Andrea_Aime3 · March 22, 2007, 2:23pm

Mark Cave-Ayland ha scritto:

Yes, it's due to the way in which the sampling works. Note that you can
increase the number of sampled rows using ALTER TABLE x ALTER COLUMN y
SET STATISTICS z and then re-ANALYZING (the default value is 10, so
perhaps a value of 100 would provide better results). I'm not sure where
the figure of 5% from proper bounds comes from though - I would have
imagined it depends on the sample size relative to the population size,
but then I haven't studied statistics properly for several years now

Nice tip. I tried it out:
alter table major_roads alter column gen_full set statistics 1000;
vacuum analyze major_roads;

tiger2005fe=# select estimated_extent('major_roads','gen_full');
estimated_extent
----------------------------------------------------------------------------
BOX(-136.358154296875 17.6865196228027,-64.5742340087891 59.6282997131348)

Sigh... the result is the same as before...
I tried setting a value bigger than 1000, the result is:

tiger2005fe=# alter table major_roads alter column gen_full set statistics 10000;
WARNING: lowering statistics target to 1000
ALTER TABLE

It seems there is no way to make it close to being correct...
If any developer is curious and wants the data, I do have a compressed postgres dump of this table, it's around 125MB.
Cheers
Andrea