[GRASS-dev] r.in.xyz: really big ints

Hi,

a user has just successfully imported a 92GB LIDAR data file with r.in.xyz
(2.4 billion data points; 4.5hrs). This has exposed a cosmetic bug: the
number of points processed is reported to the user as -1871174186.

The raster output is fine AFAIK, but the broken status message ain't a
good look.

wc -l reports 2,424,200,605 points, which is bigger than (IIUC) the 32bit
limit for a C90 int of 2,147,483,647 (2^31-1). I do not know if the
hardware/OS/build was 32 or 64 bit. The "line" variable is defined simply
as "int".

Apparently `wc` on their system can count higher than 2^31, can we?

Is it as simple as replacing printf %d with %u ?? (seems to work)
In that case should the variable be defined as "unsigned int" for
correctness? (%u seems to work correctly with plain signed int in a
little test program I wrote) Then we wait for the first 160GB dataset...

I could rewrite it to store the number of lines as a double and printf
%.0f, but hope for a cleaner solution.

side idea:
Would it be possible to add a flag to g.version to report some build
info? Like: 32/64 bits, endianness, build date, svn checkout date (if
applicable), `uname -a` of build machine, LFS, nls, and in general
./configure feature report stuff, ...

?

thanks,
Hamish

On Aug 29, 2008, at 3:19 AM, Hamish wrote:

> wc -l reports 2,424,200,605 points, which is bigger than (IIUC) the 32bit
> limit for a C90 int of 2,147,483,647 (2^31-1). I do not know if the
> hardware/OS/build was 32 or 64 bit. The "line" variable is defined simply
> as "int".

It was a 64-bit system, with GRASS 6.4 compiled from the 8/22/08 version.
Doug, you could add more details and maybe also post your comparison
of GRASS performance on Linux versus MS Windows.

Helena

> side idea:
> Would it be possible to add a flag to g.version to report some build
> info? Like: 32/64 bits, endianness, build date, svn checkout date (if
> applicable), `uname -a` of build machine, LFS, nls, and in general
> ./configure feature report stuff, ...

having that would be greatly appreciated

_______________________________________________
grass-dev mailing list
grass-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/grass-dev

Helena Mitasova wrote:

On Aug 29, 2008, at 3:19 AM, Hamish wrote:

>> Apparently `wc` on their system can count higher than 2^31, can we?
>>
>> Is it as simple as replacing printf %d with %u ?? (seems to work)
>> In that case should the variable be defined as "unsigned int" for
>> correctness? (%u seems to work correctly with plain signed int in a
>> little test program I wrote) Then we wait for the first 160GB dataset...

%u only delays the problem to ~4 billion points. What you really want is
%lld and to store the line count as a 64-bit int. ISO C99 also includes
<inttypes.h>, which defines the PRId64 format macro. Is there an
agreed-upon 64-bit int type in GRASS? I have found a couple of places in
the past where row and column were 32-bit and the number of cells would
overflow because the calculation is done using 32-bit arithmetic. It
would be nice to have 64-bit support even on 32-bit architectures in
GRASS 7. I recall r.info had some issue in printing cell counts.

>> side idea:
>> Would it be possible to add a flag to g.version to report some build
>> info? Like: 32/64 bits, endianness, build date, svn checkout date (if
>> applicable), `uname -a` of build machine, LFS, nls, and in general
>> ./configure feature report stuff, ...
>
> having that would be greatly appreciated

+1

Hamish wrote:

> Apparently `wc` on their system can count higher than 2^31, can we?

"unsigned int" goes up to 2^32-1.

> Is it as simple as replacing printf %d with %u ?? (seems to work)

Yes.

> In that case should the variable be defined as "unsigned int" for
> correctness? (%u seems to work correctly with plain signed int in a
> little test program I wrote)

The nature of two's-complement representation means that whether a
variable is declared as "int" or "unsigned int" doesn't actually have
that much effect upon most arithmetic operations. It mainly affects
division, comparisons[1] and right shifts[2].

[1] If you compare a signed value to an unsigned value, the signed
value will be converted to unsigned. Comparing an unsigned value with
<0 or >=0 will always be false or true respectively.

[2] Right-shifting a signed value will typically shift in copies of the
topmost bit, while right-shifting an unsigned value will shift in zeroes.
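
To make the signed/unsigned point concrete, here is a minimal sketch. It
assumes a 32-bit int, modelled with C99's int32_t and uint32_t (which the
GRASS code itself does not rely on); the function names are made up for
illustration only.

```c
#include <stdint.h>

/* Reinterpret a wrapped 32-bit signed counter as unsigned.
 * Converting a negative value to an unsigned type is well defined in C:
 * the result is the original value modulo 2^32. */
uint32_t as_unsigned_count(int32_t wrapped)
{
    return (uint32_t)wrapped;
}

/* Footnote [1]: in a mixed comparison the signed operand is converted
 * to unsigned, so -1 becomes the largest unsigned value and the test
 * below can never succeed. */
int minus_one_is_less_than(unsigned int u)
{
    return -1 < u;
}
```

With these, as_unsigned_count(-1871174186) gives 2423793110, which is
close to (though not identical with) the line count reported by wc -l,
and minus_one_is_less_than() returns 0 for every argument.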

> Then we wait for the first 160GB dataset...
>
> I could rewrite it to store the number of lines as a double and printf
> %.0f, but hope for a cleaner solution.

You can use "long", which will typically be 64 bits on a 64-bit system
(but not Windows, where "long" is 32 bits even on the 64-bit versions,
to maintain binary compatibility).

Every version of gcc in widespread use supports "long long", as does
C99. This will be 64 bits even on 32-bit systems. OTOH, we might need
to support platforms which don't support this.

If LFS is enabled, off_t may be 64 bits (and if it isn't, you're
likely to have trouble importing a 92GiB file; Linux simply won't let
you open a file >2GiB unless you use the LFS features). This saves you
the trouble of having to perform explicit checks for "long long", and
conditionalising the variable definition. You still need to
conditionalise the printf() format, though, e.g.:

  const char *fmt = sizeof(off_t) > sizeof(long) ? "%lld" : "%ld";

Using double is certainly the easiest solution. That can represent
integers up to 2^53 exactly, after which .... well it doesn't really
matter; 2^53 is just short of 10^16.
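
The 2^53 limit can be checked with a small sketch like the following. It
assumes IEEE 754 doubles with the default rounding mode; count_up() is a
hypothetical helper, not anything in r.in.xyz.

```c
/* Increment a counter kept in a double `steps` times and return it. */
double count_up(double start, int steps)
{
    /* volatile forces each sum to be rounded to double precision,
     * even on hardware that computes in wider registers */
    volatile double n = start;
    int i;

    for (i = 0; i < steps; i++)
        n = n + 1.0;
    return n;
}
```

Starting from 2424200605.0, three steps give exactly 2424200608.0.
Starting from 9007199254740992.0 (2^53), adding 1.0 leaves the value
unchanged, because 2^53 + 1 is not representable as a double.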

> side idea:
> Would it be possible to add a flag to g.version to report some build
> info? Like: 32/64 bits, endianness, build date, svn checkout date (if
> applicable), `uname -a` of build machine, LFS, nls, and in general
> ./configure feature report stuff, ...

If you can figure out a command to print the information, you can call
it from the Makefile with $(shell ...) and add a -D flag, e.g.:

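  UNAME := $(shell uname -a)

  # quote the value: uname output contains spaces and must reach the
  # compiler as a single string literal
  CFLAGS += -DUNAME='"$(UNAME)"'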

OTOH, if you want a lot of information, it would be better to make the
configure script store it in config.h.

Or you could just add:

  system("\"$GISBASE/etc/grocat\" < \"$GISBASE/include/Make/Platform.make\"");
  system("\"$GISBASE/etc/grocat\" < \"$GISBASE/include/grass/config.h\"");
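
Two of the requested items, word size and endianness, can also be
determined by the program itself at run time. The following is only a
sketch of that idea; pointer_bits() and is_little_endian() are
hypothetical names and not part of g.version.

```c
#include <limits.h>

/* Width of a pointer in bits: typically 32 on a 32-bit build and
 * 64 on a 64-bit build. */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}

/* Returns 1 if the least significant byte of an integer is stored
 * first in memory (little-endian), 0 otherwise. */
int is_little_endian(void)
{
    unsigned int probe = 1;

    return *(unsigned char *)&probe == 1;
}
```

Build date, checkout date and configure options cannot be recovered this
way and would still have to come from the build system.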

--
Glynn Clements <glynn@gclements.plus.com>

Andrew Danner wrote:

>> Apparently `wc` on their system can count higher than 2^31, can we?
>>
>> Is it as simple as replacing printf %d with %u ?? (seems to work)
>> In that case should the variable be defined as "unsigned int" for
>> correctness? (%u seems to work correctly with plain signed int in a
>> little test program I wrote) Then we wait for the first 160GB dataset...

> %u only delays the problem to ~4 billion points. What you really want is
> a %lld and to store the line count as a 64-bit int. ISO C99 also
> includes <inttypes.h> which specifies %PRId64. Is there an agreed upon
> 64-bit int type in GRASS?

Currently, we try to avoid requiring anything beyond ANSI C89, which
means that we can't assume the existence of a 64-bit integer type.

> I have found a couple of places in the past
> where row and column were 32-bit and the number of cells would overflow
> because the calculation is done using 32-bit arithmetic. It would be
> nice to have 64-bit support even on 32-bit architectures in GRASS 7. I
> recall r.info had some issue in printing cell counts.

My inclination would be to declare e.g. count_t and COUNT_FMT, which
would be long long and %lld where available, long and %ld otherwise.
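
A minimal sketch of what such a pair of definitions might look like. The
macro HAVE_LONG_LONG_INT is assumed to be set by a configure check, and
format_count() is a hypothetical helper added only to show the format
macro in use.

```c
#include <stdio.h>

/* HAVE_LONG_LONG_INT is assumed to come from the configure script. */
#ifdef HAVE_LONG_LONG_INT
typedef long long count_t;
#define COUNT_FMT "%lld"
#else
typedef long count_t;
#define COUNT_FMT "%ld"
#endif

/* Write "<n> points" into buf, which must hold at least 32 bytes.
 * Returns the number of characters written, as sprintf() does. */
int format_count(char *buf, count_t n)
{
    /* adjacent string literals are concatenated by the compiler */
    return sprintf(buf, COUNT_FMT " points", n);
}
```

Callers never spell out %ld or %lld themselves, so the type and its
format string cannot drift apart.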

Even so, it will probably take some effort to catch bugs caused by
code which doesn't correctly handle integer types other than "int".
E.g.:

  offset = (off_t)((row * cols + col) * sizeof(CELL));

which should be e.g.:

  offset = ((off_t)row * cols + col) * sizeof(CELL);

If you use the former, the compiler won't complain, and you won't
discover the bug unless you actually test with >2GiB of data.
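
The difference between the two forms can be reproduced with the sketch
below. The buggy form is emulated with unsigned 32-bit arithmetic, since
overflowing a signed int is undefined behaviour; int64_t stands in for a
64-bit off_t, and both function names are made up for the example.

```c
#include <stdint.h>

/* Emulates the buggy form on a platform where the arithmetic is done
 * in 32 bits: the product wraps modulo 2^32 before it is widened. */
int64_t offset_narrow(int row, int cols, int col, int cell_size)
{
    uint32_t cells = (uint32_t)row * (uint32_t)cols + (uint32_t)col;
    uint32_t bytes = cells * (uint32_t)cell_size;

    return (int64_t)bytes;      /* widened too late */
}

/* Corrected form: widening the first operand makes the whole
 * expression 64-bit. */
int64_t offset_wide(int row, int cols, int col, int cell_size)
{
    return ((int64_t)row * cols + col) * cell_size;
}
```

For the last cell of a 50000 x 50000 map with 4-byte cells, offset_wide()
returns 9999999996 while offset_narrow() returns 1410065404; for small
maps the two agree, which is why the bug goes unnoticed.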

--
Glynn Clements <glynn@gclements.plus.com>

variables in the main loop that need looking at:
line, estimated_lines, count, count_total

line: at one point I use if(line % 10000 == 0), which I suspect would
prefer ints over FPs, and I'd like to keep that.

estimated_lines: is set to -1 when it can't be estimated (data from stdin)
and thus is safely ignored if int has overflowed to a negative value.
So it must remain signed or we need a new flag variable. The variable is
created with "estimated_lines = filesize / linesize;" where filesize
is off_t.

count, count_total just use ++ and += starting at 0.

G_alloc(rows*cols*sizeof(CELL)) is another thing, but unrelated to the
filesize so a matter for another thread & time.

If I change int to long and %d to %ld, life gets better on 64bit machines
but not 32bit machines -- they stay at 2^31-1.

If I use unsigned long int and %uld (does that exist?) would it mean we
can have 2^32-1 on 32bits and 2^64-1 on 64bits?

re. running times, r.in.xyz processing will be directly proportional to
the number of passes required. So it's not surprising that
percent=50 (2 passes) is 10 times faster than percent=5 (20 passes).
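
The pass counts quoted above follow from rounding 100/percent up to a
whole number. A small sketch of that arithmetic, assuming percent is an
integer between 1 and 100 (passes_needed() is a hypothetical name and is
not copied from the r.in.xyz source):

```c
/* Number of passes over the input when each pass keeps `percent`
 * percent of the map in memory. Integer ceiling of 100 / percent. */
int passes_needed(int percent)
{
    return (100 + percent - 1) / percent;
}
```

This gives 2 passes for percent=50 and 20 passes for percent=5, hence the
factor of 10 in running time.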

I would not hold my breath for widespread 64bit MS Windows XP support
from commercial vendors. It is hard to justify spending effort porting
your drivers/software to a niche system which is already EOL'd, even if
current users demand it. At the same time I think people are dragging
their feet waiting to see what replaces Vista. (probably just a
superficial PR name change in 2009 for SP2, but who knows for sure?)

Hamish

Hamish wrote:

> line: at one point I use if(line % 10000 == 0), which I suspect would
> prefer ints over FPs, and I'd like to keep that.

The % operator doesn't work for FP; you have to use fmod() instead.

> G_alloc(rows*cols*sizeof(CELL)) is another thing, but unrelated to the
> filesize so a matter for another thread & time.

This isn't applicable to memory allocation. You can always fit the
size of a region of memory into a size_t. On Unix, it's safe to assume
that you will always be able to fit it into a "long".

> If I change int to long and %d to %ld, life gets better on 64bit machines
> but not 32bit machines -- they stay at 2^31-1.
>
> If I use unsigned long int and %uld (does that exist?)

It's %lu. u is the unsigned version of d. u and d are specifiers, of
which you can only have one; l and ll are qualifiers, used in addition
to a specifier.
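
As a minimal sketch of the l qualifier combined with the u specifier
(format_ulong() is a hypothetical helper, not code from r.in.xyz):

```c
#include <stdio.h>

/* Write "<count> points" into buf, which must hold at least 32 bytes.
 * "%lu" = qualifier l (long) + specifier u (unsigned decimal).
 * Returns the number of characters written. */
int format_ulong(char *buf, unsigned long count)
{
    return sprintf(buf, "%lu points", count);
}
```

Called with 2424200605UL this produces "2424200605 points", a value that
still fits in a 32-bit unsigned long.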

> would it mean we can have 2^32-1 on 32bits and 2^64-1 on 64bits?

Yes.

> I would not hold my breath for widespread 64bit MS Windows XP support
> from commercial vendors. It is hard to justify spending effort porting
> your drivers/software to a niche system which is already EOL'd, even if
> current users demand it. At the same time I think people are dragging
> their feet waiting to see what replaces Vista. (probably just a
> superficial PR name change in 2009 for SP2, but who knows for sure?)

64-bit Windows is problematic because of Windows' reliance upon binary
compatibility. Ever since the 80286 became obsolete, Windows has
revolved around the 80386 architecture; 32-bit, little-endian, with no
alignment requirements.

OTOH, Unix has traditionally focused on source compatibility (although
the recent dominance of x86 has resulted in portability getting less
attention than it used to). Making architecture-specific assumptions
(e.g. fwrite()ing "struct"s) tends to be treated as a bug.
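
One common way to avoid such assumptions is to write each field byte by
byte in a fixed order instead of dumping the in-memory layout. A minimal
sketch, using C99's uint32_t; the function names are invented for the
example and are not GRASS library calls:

```c
#include <stdint.h>

/* Store v in out[0..3], most significant byte first, regardless of the
 * host's byte order or struct padding. */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Inverse of put_u32_be(). */
uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

The four bytes can then be passed to fwrite() and will read back
identically on any architecture.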

--
Glynn Clements <glynn@gclements.plus.com>

Hamish:

> Is it as simple as replacing printf %d with %u ?? (seems to work)

Glynn:

Yes.

ok, done in SVN r33163,4. I went with unsigned longs instead of floats for
the counts. This means 32bit users with datasets of more than 2^32-1
(about 4.3 billion) points will get garbage counts in the summary message.
I am willing to live with that possibility. Everything should be fine now
for 64bit users (untested).

regards,
Hamish