[GRASS-dev] r.in.xyz can now read from stdin

Hi,

I just added redirection from stdin support to r.in.xyz. Everything seems to
work, but of course testing is appreciated.

This bypasses the bulk of the code which may trigger LFS issues for
very large input files (many GB). ie it skips scanning the filesize --
which was only needed for G_percent() anyway.

You can't rewind a piped stream so the percent= multi-pass option won't
work, and it must keep 100% of the raster map in memory. (limits region
size) Hopefully if you are working with massive datasets you already
have lots of RAM installed.

I used G_clicker()- I'm pretty sure it does not have GUI hooks like
G_percent() does for the progress bar, but then you can't feed data from
stdin using the GUI so you can't trigger it anyway. (this is the
\b\b\b\b\b... GUI window output problem)

Q: is realloc() needed here? it works for me but I'm not sure if it'll
segfault for someone someday.

char *infile;
...
infile = input_opt->answer;
...
if (strcmp ("-", infile) == 0) {
   from_stdin = TRUE;
   ...
   strcpy(infile, "stdin"); /* filename for history metadata */

?

I found that the new r.colors inverse + equalized color flags make the
Jockey's Ridge, NC sample LIDAR dataset look really nice. It brings
out the ground features (roads, homes) well without having to manually
filter out the bogus outliers.

=============================
Re: [GRASS-user] Re: some interesting tools for working with LAS format

Dylan wrote:

> Perhaps we can talk to these people, and integrate a LAS reader into
> GRASS, or better yet figure out how to use it with r.in.xyz!

Bernhard Hoefle wrote:

the LAS Tools by M. Isenburg

...

The tool "las2txt" can be used to generate ASCII Files that can be
imported with r.in.xyz or v.in.ascii.

Dylan:

As a start we can include references to the LAS tools in the relevant
man pages and wiki, with build and usage examples. This would get
new users up to speed on the 'toolchain' approach to processing data
with GRASS+external tools.

now:
las2txt | r.in.xyz in=- fs=''

enjoy,

Hamish

Hamish wrote:

> I just added redirection from stdin support to r.in.xyz. Everything seems to
> work, but of course testing is appreciated.
>
> This bypasses the bulk of the code which may trigger LFS issues for
> very large input files (many GB). ie it skips scanning the filesize --
> which was only needed for G_percent() anyway.

Reading from stdin and the progress indication should be orthogonal. I
would have thought that the easiest solution would be to attempt to
determine the file size with fseek/ftell, and disable progress if that
fails, e.g. because the input is a pipe or is too large for a "long"
(fseek/ftell use long, not off_t).
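
A minimal sketch of that approach, assuming a hypothetical helper stream_size() (it is not part of r.in.xyz or the GRASS library): it tries fseek()/ftell() on the already-open stream and returns -1 when the stream cannot be seeked or the size cannot be reported, so the caller can simply switch the progress display off.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: return the size of an open stream in bytes,
 * or -1 if it cannot be determined (e.g. a pipe, or a size that
 * does not fit in a long). The read position is restored on success. */
long stream_size(FILE *fp)
{
    long pos, size;

    assert(fp != NULL);

    pos = ftell(fp);                    /* fails on a pipe */
    if (pos < 0)
        return -1;
    if (fseek(fp, 0L, SEEK_END) != 0)   /* jump to the end */
        return -1;
    size = ftell(fp);                   /* -1 if not representable */
    if (fseek(fp, pos, SEEK_SET) != 0)  /* go back where we were */
        return -1;
    return size;
}
```

With something like this, the same code path could serve both stdin and a named file: progress is shown only when stream_size() returns a non-negative value.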

> You can't rewind a piped stream so the percent= multi-pass option won't
> work,

If you redirect stdin from a file, stdin will be a file, not a pipe.
There is no inherent reason why stdin cannot be rewound. Conversely,
someone could use input=/dev/tape, which cannot be rewound (in the
sense of rewind() or fseek()).

If you want to determine whether it's possible to seek on a stream,
either try seeking on it, or use fstat(), e.g.:

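  #include <sys/stat.h>

  struct stat st;
  /* fstat() takes a file descriptor, not a name; fp is the open stream */
  if (fstat(fileno(fp), &st) != 0) {
    /* error */
  }
  else if (S_ISREG(st.st_mode)) {
    /* it's a regular file */
  }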

> and it must keep 100% of the raster map in memory. (limits region
> size) Hopefully if you are working with massive datasets you already
> have lots of RAM installed.
>
> I used G_clicker()- I'm pretty sure it does not have GUI hooks like
> G_percent() does for the progress bar, but then you can't feed data from
> stdin using the GUI so you can't trigger it anyway. (this is the
> \b\b\b\b\b... GUI window output problem)
>
> Q: is realloc() needed here? it works for me but I'm not sure if it'll
> segfault for someone someday.
>
> char *infile;
> ...
> infile = input_opt->answer;
> ...
> if (strcmp ("-", infile) == 0) {
>    from_stdin = TRUE;
>    ...
>    strcpy(infile, "stdin"); /* filename for history metadata */

Yes. There's no guarantee that infile will be large enough. I suggest:

  infile = G_store("stdin");

ISTR that the malloc() in GNU libc always effectively rounds up
allocations to multiples of 8 bytes (blocks are always aligned to
8-byte multiples, so nothing else will be stored in the 8 bytes
following the start of the block), so you'll get away with it on
Linux, but it's conceivable that other platforms might use 4 bytes
(the x86 architecture itself doesn't have *any* alignment
constraints).
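
To illustrate the safer pattern without depending on the GRASS library, here is a sketch using a hypothetical stand-in, store_copy(), which is assumed to behave like G_store() in that it returns a freshly allocated copy sized for the new string. The point is that the pointer is replaced, rather than writing six bytes ("stdin" plus the terminating NUL) into a buffer that may only have been allocated for the two bytes of "-". The helper history_name() is likewise invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for G_store(): duplicate s into new memory.
 * Returns NULL if the allocation fails. */
char *store_copy(const char *s)
{
    char *copy;

    assert(s != NULL);
    copy = malloc(strlen(s) + 1);   /* +1 for the terminating NUL */
    if (copy != NULL)
        strcpy(copy, s);            /* safe: copy is exactly large enough */
    return copy;
}

/* Return the name to record in the history metadata: "stdin" when
 * the user passed "-", otherwise a copy of the given file name. */
char *history_name(const char *answer)
{
    return store_copy(strcmp(answer, "-") == 0 ? "stdin" : answer);
}
```

The option's own buffer is never written to, so its allocated size no longer matters.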

--
Glynn Clements <glynn@gclements.plus.com>

Hello,

Hamish:

> I just added redirection from stdin support to r.in.xyz. Everything seems to
> work, but of course testing is appreciated.

Thanks very much for this. I've tested r.in.xyz on a massive Simrad EM710 swath bathymetry dataset. Results follow:

$ g.region -p
projection: 1 (UTM)
zone: 20
datum: wgs84
ellipsoid: wgs84
north: 5027110
south: 4878260
west: 169140
east: 404580
nsres: 10
ewres: 10
rows: 14885
cols: 23544
cells: 350452440

$ r.out.xyz input=Matthew_EM710_2007_July4_10m.grd fs=, output=- | r.in.xyz input=- output=TESTING_R_IN_XYZ fs=,
Scanning data ...
100%
Writing to map ...
100%
r.in.xyz complete. 35315212 points found in region.

$ r.info TESTING_R_IN_XYZ
+----------------------------------------------------------------------------+
| Layer: TESTING_R_IN_XYZ Date: Thu Jul 12 15:30:19 2007 |
| Mapset: 2007006 Login of Creator: epatton |
| Location: FundyBathy |
| DataBase: /home/epatton/Projects |
| Title: Raw x,y,z data binned into a raster grid by cell mean ( TESTING_ |
| Timestamp: none |
|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 255 |
| Data Type: FCELL |
| Rows: 14885 |
| Columns: 23544 |
| Total Cells: 350452440 |
| Projection: UTM (zone 20) |
| N: 5027110 S: 4878260 Res: 10 |
| E: 404580 W: 169140 Res: 10 |
| Range of data: min = -278.542236 max = -3.504807 |
| |
| Data Source: |
| stdin |
| |
| |
| Data Description: |
| generated by r.in.xyz |
| |
| Comments: |
| r.in.xyz input="-" output="TESTING_R_IN_XYZ" method="mean" type="FCE\ |
| LL" x=1 y=2 z=3 percent=100 |
| |
+----------------------------------------------------------------------------+

The newly-gridded bathy looks fine; identical to the original.

Before I leave work today I'm going to halve the resolution (quadruple the number of cells), and start another r.out.xyz --> r.in.xyz run, and let it run overnight. I'll let you know how it goes tomorrow morning (wee hours Saturday morning your time). Just for fun, I'll 'time' it as well.

Cheers,

~ Eric.

Well, never mind. We won't have to wait that long. I ran out of memory trying to pipe the text file in question to r.in.xyz:

$ g.region -p
projection: 1 (UTM)
zone: 20
datum: wgs84
ellipsoid: wgs84
north: 5027110
south: 4878260
west: 169140
east: 404580
nsres: 5
ewres: 5
rows: 29770
cols: 47088
cells: 1401809760

$ r.out.xyz input=Matthew_EM710_2007_July4_10m.grd fs=, output=TESTING_R_IN_XYZ_5m.txt
100%

# Input: TESTING_R_IN_XYZ_5m.txt Size: ~4.6GB

$ cat TESTING_R_IN_XYZ_5m.txt | r.in.xyz input=- output=TESTING_R_IN_XYZ_5m fs=, percent=100
ERROR: G_calloc: out of memory

I realize I'm using a ridiculously large region. Should r.in.xyz have been able to accept a file this large?

~ Eric.

Hamish:
>I just added redirection from stdin support to r.in.xyz. Everything
>seems to work, but of course testing is appreciated.

Eric wrote:

> Thanks very much for this. I've tested r.in.xyz on a massive Simrad
> EM710 swath bathymetry dataset. Results follow:
>
> $ g.region -p

..

> nsres: 10
> ewres: 10
> rows: 14885
> cols: 23544
> cells: 350452440
>
> $ r.out.xyz input=Matthew_EM710_2007_July4_10m.grd fs=, output=- | r.in.xyz input=- output=TESTING_R_IN_XYZ fs=,
> Scanning data ...
> 100%
> Writing to map ...
> 100%
> r.in.xyz complete. 35315212 points found in region.
>
> $ r.info TESTING_R_IN_XYZ

..

> The newly-gridded bathy looks fine; identical to the original.

r.univar results the same? (n and sum)
or
r.mapcalc diff_map=old-new
r.univar diff_map

maybe we need a r.md5sum option added to r.info?

> Before I leave work today I'm going to halve the resolution (quadruple
> the number of cells), and start another r.out.xyz --> r.in.xyz run,
> and let it run overnight. I'll let you know how it goes tomorrow
> morning (wee hours Saturday morning your time). Just for fun, I'll
> 'time' it as well.

..

> Well, never mind. We won't have to wait that long. I ran out of memory
> trying to pipe the text file in question to r.in.xyz:

..

> nsres: 5

..

> cells: 1401809760

..

> I realize I'm using a ridiculously large region. Should r.in.xyz have
> been able to accept a file this large?

It's not anything to do with the input file size, it's the region size
that is the problem. r.in.xyz from stdin has to keep the entire region
in memory at once. See the (updated) help page section about memory
issues.

If the input comes from a file and not a stream, you can use the input=
filename and percent=25 to run it using 4 passes using a quarter of the
memory. You can get the same effect by importing a number of subregions
and r.patch-ing them back together.
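
As a rough back-of-the-envelope aid, the sketch below estimates how much memory one in-memory array needs for a given region and percent= setting. Both function names are made up for illustration, the 4 bytes per cell is an assumption (a 32-bit float cell), and the rule for splitting rows into passes is a guess rather than the module's actual logic. The module may also hold more than one such array depending on the method, so treat the result as a lower bound.

```c
#include <assert.h>
#include <stdint.h>

/* Rows held in memory per pass when the region is processed in
 * horizontal strips covering `percent` of the rows (rounded up).
 * This splitting rule is an assumption for illustration. */
int64_t rows_per_pass(int64_t rows, int percent)
{
    assert(rows > 0 && percent >= 1 && percent <= 100);
    return (rows * percent + 99) / 100;
}

/* Bytes needed for one in-memory array covering one pass.
 * bytes_per_cell is an assumption (4 for a 32-bit float cell). */
int64_t bytes_per_array(int64_t rows, int64_t cols, int percent,
                        int64_t bytes_per_cell)
{
    return rows_per_pass(rows, percent) * cols * bytes_per_cell;
}
```

For the 5 m region above (29770 rows by 47088 columns), bytes_per_array(29770, 47088, 100, 4) comes to 5607239040 bytes, roughly 5.2 GiB for a single array; with percent=25 each pass covers 7443 rows and needs about 1.3 GiB.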

thanks for the feedback,
Hamish

Hamish:

I just added redirection from stdin support to r.in.xyz. Everything
seems to work, but of course testing is appreciated.

Eric wrote:

Thanks very much for this. I've tested r.in.xyz on a massive Simrad
EM710 swath bathymetry dataset. Results follow:

<snip>

The newly-gridded bathy looks fine; identical to the original.

r.univar results the same? (n and sum)
or
r.mapcalc diff_map=old-new
r.univar diff_map

Hamish:

maybe we need a r.md5sum option added to r.info?

I finally got around to running some long overdue r.univar comparisons between a bathymetry dataset and a copy of it
exported with r.out.xyz and piped via stdin into r.in.xyz:

~/Projects >g.region -pg
n=5030730
s=5006690
w=372650
e=404020
nsres=10
ewres=10
rows=2404
cols=3137
cells=7541348

~/Projects >r.out.xyz input=Matthew_2007006_AllBathy_10m.grd | r.in.xyz input=- output=testing_r_in_xyz

~/Projects >r.univar -g map=Matthew_2007006_AllBathy_10m.grd
n=2497361
null_cells=5043987
min=-171.591
max=-8.47771
range=163.113
mean=-46.3785
mean_of_abs=46.3785
stddev=21.4064
variance=458.235
coeff_var=-46.1559
sum=-115823957.9045209885

~/Projects >r.univar -g testing_r_in_xyz
n=2497361
null_cells=5043987
min=-171.591
max=-8.47771
range=163.113
mean=-46.3785
mean_of_abs=46.3785
stddev=21.4064
variance=458.235
coeff_var=-46.1559
sum=-115823957.9045209885

~/Projects >r.mapcalc diff_map = 'Matthew_2007006_AllBathy_10m.grd - testing_r_in_xyz'

~/Projects >r.univar -g diff_map
n=2497361
null_cells=5043987
min=0
max=0
range=0
mean=0
mean_of_abs=0
stddev=0
variance=0
coeff_var=nan
sum=0
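
The coeff_var=nan line is expected rather than a sign of trouble. The figures above are consistent with the coefficient of variation being reported as 100 * stddev / mean (21.4064 / -46.3785 gives about -46.156), and for an all-zero difference map that becomes 0/0. A small sketch of that arithmetic, where coeff_var() is an illustrative function inferred from the output and not r.univar's actual code:

```c
#include <assert.h>
#include <math.h>

/* Coefficient of variation in percent, as inferred from the output
 * above. With IEEE 754 doubles, 0.0 / 0.0 yields NaN. */
double coeff_var(double stddev, double mean)
{
    assert(stddev >= 0.0);
    return 100.0 * stddev / mean;
}
```

So a diff map with mean=0 and stddev=0 yields NaN here, while min=0, max=0 and sum=0 are the values that actually show the two maps agree.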

Looks pretty good to me!

Thanks again,

--
Eric Patton
Technologist
Geological Survey of Canada (Atlantic)
Bedford Institute of Oceanography
Dartmouth, Nova Scotia, Canada
902.426.7732
epatton@nrcan.gc.ca