Markus Neteler wrote:
> The nature of the problems caused by enabling large files in code
> which can't handle them is such that it will typically take some
> investigation to determine that the problem is due to large file
> support.
>
> In the worst case, the user will silently get bad data without being
> aware that anything has gone wrong.
OK (I thought that it always prints an error message).
No. If you *don't* compile with -D_FILE_OFFSET_BITS=64, calls to
open() will fail with EOVERFLOW if the file is larger than 2GiB.
Defining that macro causes open() to be an alias for open64(), which
won't complain about large files. It also defines off_t to be an alias
for off64_t, and lseek() an alias for lseek64().
However, it won't magically convert arbitrary calculations involving
int/long to 64 bits.
So, if you have some code like:
    long row, row_bytes;
    ...
    lseek(fd, row * row_bytes, SEEK_SET);
the calculation will be done in 32 bits (wherever "long" is 32 bits,
e.g. Linux/x86), i.e. it will behave like:
    lseek(fd, (row * row_bytes) & 0xFFFFFFFF, SEEK_SET);
To fix this, you have to force the compiler to perform the
calculations in 64 bits, by casting values to off_t where necessary,
e.g.:
    lseek(fd, row * (off_t) row_bytes, SEEK_SET);
This needs to be done wherever file offsets are used.
Note: C determines the type of an expression by the type of its widest
operand, so you need to cast at least one of the values before the
calculation, as above. Using:
    (off_t) (row * row_bytes)
won't work; the calculation will be performed in 32 bits, then the
truncated value will be expanded to 64 bits (by which time, it's too
late).
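As a rough illustration of the difference (a sketch, not taken from
GRASS): int32_t stands in for a 32-bit "long" and int64_t for a 64-bit
off_t, so the result doesn't depend on the native sizes. The function
names are made up for the example, and the 32-bit product goes through
unsigned arithmetic so that the wraparound is well defined.

```c
#include <stdint.h>

/* Cast applied AFTER the multiplication: the product has already been
 * reduced modulo 2^32 by the time it is widened. */
int64_t offset_cast_late(int32_t row, int32_t row_bytes)
{
    uint32_t wrapped = (uint32_t) row * (uint32_t) row_bytes;

    return (int64_t) wrapped;
}

/* Cast applied to one operand BEFORE the multiplication: the other
 * operand is converted too, and the product is computed in 64 bits. */
int64_t offset_cast_early(int32_t row, int32_t row_bytes)
{
    return row * (int64_t) row_bytes;
}
```

With row = 100000 and row_bytes = 50000, the late cast gives 705032704,
while the early cast gives the correct 5000000000.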
> To get an idea of which files will have problems with large files,
> locate anything which uses lseek, fseek or ftell (either with grep, or
> use tools/sql.sh and query the import tables for those symbols).
>
> Also, one consequence of enabling large file support is that a raster
> may have more than 2^31 cells in total. Code which counts cells (e.g.
> r.statistics) will need to use "long long int" to handle that case.
In the long run it would be nice to have GRASS code polished.
Sure, but there are a lot of places where these issues apply. Fixing
them is simple enough; it's finding them all that's awkward.
The file offsets can be dealt with by locating all modules which use
lseek() etc, fixing them, then defining the macro locally, e.g.
    #include "config.h"
    ...
    #ifdef HAVE_LARGEFILE
    #define _FILE_OFFSET_BITS 64
    #endif
[Assuming that the configure script defines HAVE_LARGEFILE in
config.h when --enable-largefile is used.]
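A minimal sketch of what such a file might look like. The one firm
requirement is that the macro has to be defined before any system
header is included, otherwise it has no effect. To keep the example
self-contained, a plain #define stands in for config.h, and
seek_to_row() is a hypothetical helper, not an existing GRASS function.

```c
/* Stand-in for #include "config.h"; HAVE_LARGEFILE is assumed to be
 * defined there by the configure script. */
#define HAVE_LARGEFILE 1

#ifdef HAVE_LARGEFILE
#define _FILE_OFFSET_BITS 64	/* must precede every system header */
#endif

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: position fd at the start of the given row.
 * The cast makes the multiplication happen in off_t, not in long. */
off_t seek_to_row(int fd, long row, long row_bytes)
{
    return lseek(fd, (off_t) row * row_bytes, SEEK_SET);
}
```

On a system which honours the macro, sizeof(off_t) should be 8 within
this file, regardless of the size of "long".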
The cases where cell counts might wrap are harder to identify.
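For the counting case, the fix itself is just a wider counter. A
sketch, assuming a module which tallies cells row by row
(count_cells() is a made-up name for illustration):

```c
/* Hypothetical tally of all cells in a rows x cols raster, processed
 * one row at a time. The running total is "long long", so it can't
 * wrap at 2^31 even where int and long are both 32 bits. */
long long count_cells(int rows, int cols)
{
    long long total = 0;
    int row;

    for (row = 0; row < rows; row++)
	total += cols;	/* one row's worth of cells */

    return total;
}
```

A 50000 x 50000 raster gives 2500000000 cells, which is more than a
32-bit signed counter can hold. Printing the result needs "%lld"
rather than "%d" or "%ld".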
[Sooner or later, someone will release a version of Linux/x86 where
"long" is 64 bits, then many of the problems will vanish. But not all
of them; I've encountered code where file offsets are computed as
"int"s.]
--
Glynn Clements <glynn@gclements.plus.com>