[GRASS-dev] Right way to use G_getl2

Hi,

this relates to the recent fix for v.in.ascii (r57649) [1]. The skip parameter [2] required one line more because, when reading lines with the G_getl2 function [3] and counting them, one extra line was counted.

The documentation of the function says that it reads at most n-1 bytes. In other words, n is the size of the buffer, so it can hold n-1 useful characters plus the string-terminating character '\0'. Quite a standard convention, but…
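For illustration, a minimal sketch of the usual calling pattern under these semantics (assuming the whole buffer size is passed as n, and that the 0/1 return values are as discussed later in this thread):

#include <stdio.h>
#include <grass/gis.h>

void read_all_lines(FILE *fp)
{
    char buf[1024];     /* n = sizeof(buf); at most 1023 characters per line are stored */

    /* G_getl2() returns 0 when it reaches end of file without reading
     * anything, and 1 otherwise. */
    while (G_getl2(buf, sizeof(buf), fp)) {
        /* buf now holds one line without its end-of-line character(s),
         * terminated by '\0' */
    }
}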

The problem in v.in.ascii was that the caller passed the previously determined maximum number of characters on a line in the file, plus one (space for the terminating character). This number is the same for all lines, and it works well for all lines except the longest one.

The reason is that G_getl2 stops reading characters on the longest line and leaves the end-of-line character in the stream. So, in fact, the longest line is in the buffer and can be processed. However, the next "line" is then only the remaining end-of-line character. This is not what the caller expected, and there is not even a way to find out that the line was not read completely.
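A hypothetical illustration of this situation (the file content, the line length of 5 and the function name are made up; only the G_getl2 calls matter):

#include <stdio.h>
#include <grass/gis.h>

/* Suppose the longest line in the file is "hello" (5 characters), so the
 * caller allocates 5 + 1 = 6 bytes, as described above. */
void demonstrate_off_by_one(FILE *fp)
{
    int max_len = 5;                    /* longest line, without '\n' */
    char *buf = G_malloc(max_len + 1);  /* room for the text and '\0' only */

    /* First call: stores "hello", but stops before reading the '\n',
     * which therefore stays in the stream. */
    G_getl2(buf, max_len + 1, fp);

    /* Second call: immediately hits the leftover '\n' and returns an
     * empty line, so the caller counts one line too many. */
    G_getl2(buf, max_len + 1, fp);

    G_free(buf);
}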

So, finally: what is the right usage of G_getl2? Should the caller use a buffer size equal to the maximum number of expected characters on a line, plus the end-of-line character (plus 2 on MS Windows), plus the terminating character? Or should he just pass the n parameter increased by one (two on MS Windows), since in fact nothing more will be stored in the buffer anyway? Or should he just allocate a really huge buffer and then read the characters from the buffer if he wants to store them?

Thanks,
Vaclav

[1] https://trac.osgeo.org/grass/changeset/57649
[2] http://grass.osgeo.org/grass70/manuals/v.in.ascii.html

[3] http://grass.osgeo.org/programming7/getl_8c.html#a4bf86e666fda18059c42b0f5eca6f9bd

Note 1: Don't be confused by the documentation, since the description of the n parameter for G_getl2 is wrong.

Note 2: In the description I was talking only about Unix line endings. On Windows we need two more bytes to store it, as noted in the questions. The case of old Mac OS is not covered because it is even more tricky in the G_getl2 function.

I think the confusion arises because G_getl2(char *buf, int n, FILE *fd) reads in at most n - 1 characters, not n characters. The next character is always set to '\0' and is guaranteed to be at most the n-th character. That also means that G_getl2() does not check whether the buffer is large enough to hold the line to be read, which it should, IMHO.

Regarding the example with v.in.ascii: the module first finds the longest line, the length including the terminating character. This number is then passed to G_getl2() as n, which means that the end of the line is not read for the longest line, because the end-of-line character is the n-th character.

I would suggest modifying G_getl2() to read at most n (not n - 1) characters and to include a check whether the buffer is large enough:

--->
Index: getl.c
--- getl.c      (revision 57537)
+++ getl.c      (working copy)
@@ -67,7 +67,7 @@
     int c;
     int ret = 1;
 
-    while (i < n - 1) {
+    while (i < n) {
         c = fgetc(fd);
 
         if (c == EOF) {
@@ -88,6 +88,11 @@
             }
             break;
         }
+
+        if (i == n - 1) {
+            G_warning("G_getl2(): buffer too small");
+            break;
+        }
 
         buf[i] = c;
<---
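If it is easier to read outside of diff form, here is a self-contained sketch of the proposed behaviour (an illustration only, not the actual patch; the '\r'/MacOS handling of the real function is omitted for brevity and the function name is made up):

#include <stdio.h>
#include <grass/gis.h>

/* Sketch: warn and stop as soon as the buffer would overflow, instead
 * of silently leaving the rest of the line in the stream. */
static int getl2_sketch(char *buf, int n, FILE *fd)
{
    int i = 0;
    int c;
    int ret = 1;

    while (i < n) {
        c = fgetc(fd);

        if (c == EOF) {
            if (i == 0)
                ret = 0;        /* nothing was read at all */
            break;
        }

        if (c == '\n')
            break;              /* UNIX end of line, not stored */

        if (i == n - 1) {
            G_warning("G_getl2(): buffer too small");
            break;
        }

        buf[i] = c;
        i++;
    }
    buf[i] = '\0';

    return ret;
}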

Markus M

Markus Metz wrote:

I would suggest to modify G_getl2() to read at most n (not n - 1)
characters and include a check if the buffer is long enough:

mmmph, to me G_getl2() does the right thing; as long as people follow the given instructions all is fine. So I'd vote to keep it as-is.

FWIW, reading "at most one less than size characters from stream" follows the behaviour of fgets() and errs on the side of not overflowing the given output buffer.

The \n and \r chars are not stored in the output string, and are not left in the input stream for the next line. (If this weren't the case, it's pretty hard to imagine that we wouldn't have noticed corrupted ASCII file input over the last 4 years.)
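As a point of comparison, a minimal sketch of the fgets() convention mentioned above, in plain standard C (nothing GRASS-specific assumed); because fgets() keeps the '\n' in the buffer, the caller can at least detect a truncated line:

#include <stdio.h>
#include <string.h>

void read_with_fgets(FILE *fp)
{
    char buf[1024];

    /* fgets() reads at most sizeof(buf) - 1 characters per call. */
    while (fgets(buf, sizeof(buf), fp) != NULL) {
        size_t len = strlen(buf);

        if (len > 0 && buf[len - 1] == '\n')
            buf[len - 1] = '\0';    /* complete line: strip the '\n' */
        else if (!feof(fp))
            fprintf(stderr, "line longer than the buffer, rest follows\n");
    }
}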

Is there a ticket for the 'v.in.ascii skip=' overflow problem? If not, can someone open one with a problematic input file attached? I suspect the recent patch is perhaps fixing a problem actually rooted somewhere else...

thanks,
Hamish

Vaclav Petras wrote:

So, finally: what is the right usage of G_getl2? Should the caller use a buffer size equal to the maximum number of expected characters on a line, plus the end-of-line character (plus 2 on MS Windows), plus the terminating character? Or should he just pass the n parameter increased by one (two on MS Windows), since in fact nothing more will be stored in the buffer anyway? Or should he just allocate a really huge buffer and then read the characters from the buffer if he wants to store them?

The short answer is that the buffer needs to be more than large
enough. As it turns out, it needs at least two more characters: one
for the NUL and one to ensure that the line terminator is read even
though it isn't stored.
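As a sketch of that sizing rule (alloc_line_buffer is a hypothetical helper; max_len is the previously determined length of the longest line, without end-of-line characters):

#include <grass/gis.h>

/* Size the buffer so that G_getl2() can store the longest line and
 * still consume its end-of-line character. */
char *alloc_line_buffer(int max_len, int *n)
{
    /* + 1 for the terminating '\0',
     * + 1 so the end-of-line character is read even though it is not stored */
    *n = max_len + 2;

    return G_malloc(*n);
}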

FWIW, I suggest changing the existing behaviour so that it always
reads an entire line, even if it can't store it all. IOW, read until
EOL or EOF, stop storing characters once the buffer is full.
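A sketch of that alternative, to make the difference from the patch earlier in the thread concrete (again an illustration only, with the '\r' handling omitted and a made-up function name):

#include <stdio.h>

/* Sketch: always consume the whole line, but stop storing characters
 * once the buffer is full. */
static int getl2_consume_sketch(char *buf, int n, FILE *fd)
{
    int i = 0;
    int c;

    while ((c = fgetc(fd)) != EOF && c != '\n') {
        if (i < n - 1)
            buf[i++] = c;   /* store while there is room, discard the rest */
    }
    buf[i] = '\0';

    return (c == EOF && i == 0) ? 0 : 1;
}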

Unlike with fgets(), which stores the line terminator in the buffer,
there is no way for the caller of G_getl2() to determine that an
incomplete line was read. If the buffer is too small, you lose either
way; discarding the trailing characters is likely to be more robust
than returning them as if they were a subsequent line (the current
behaviour).

--
Glynn Clements <glynn@gclements.plus.com>

Glynn Clements wrote:

FWIW, I suggest changing the existing behaviour [of G_getl2()] so that it always
reads an entire line, even if it can't store it all. IOW, read until
EOL or EOF, stop storing characters once the buffer is full.

+1

Unlike with fgets(), which stores the line terminator in the buffer,
there is no way for the caller of G_getl2() to determine that an
incomplete line was read. If the buffer is too small, you lose either
way; discarding the trailing characters is likely to be more robust
than returning them as if they were a subsequent line (the current
behaviour).

The caller could know that a line was only partially read if G_getl2() returned -1 (anything but 0 or 1) in case the buffer is too small.
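A sketch of how a caller could use such a return value, if it were added (the -1 value is only a proposal in this thread, not part of the current API):

#include <stdio.h>
#include <grass/gis.h>

void read_lines_checked(FILE *fp, char *buf, int n)
{
    int ret;

    while ((ret = G_getl2(buf, n, fp)) != 0) {
        if (ret == -1) {
            /* hypothetical: line did not fit into the buffer */
            G_warning("line longer than %d characters was truncated", n - 1);
            continue;
        }
        /* process the complete line in buf */
    }
}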

Markus M