[GeoNetwork-devel] Geonetwork 2.2 and 2.4RC0 gives StackOverflowError when validating metadata elements >3000 chars?

Fellow Geonetworkers,

I'm mystified!

If we have an FGDC record with an element, any element, longer than about 3000 characters, it won't validate. (And we have many metadata records we receive from other institutions like this.) We get the "Validation Error...Metadata is not valid." error page. The geonetwork.log file shows:

    StackOverflowError in com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression.matchString()

As soon as I delete enough characters to get below 3000 in length, the record validates.

Has anyone seen this error?

We have discovered that adding "-Xss2M" to start-geonetwork.sh to increase the Java stack size seems to solve the problem. Is this a reasonable solution, and should it be added to the 2.4 standard installation?
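As a rough illustration of what a larger stack buys, here is a minimal Java sketch that runs a deliberately recursive task in a thread created with an explicit stack size, rather than raising the stack for every thread with -Xss. The names `BigStackRunner`, `callWithStack` and `depth` are made up for this example and are not part of GeoNetwork or Jeeves. The JVM is allowed to treat the requested size as a hint only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

class BigStackRunner {

    /** Runs the task in a new thread that requests the given stack size (a hint to the JVM). */
    static <T> T callWithStack(long stackBytes, Callable<T> task) throws Exception {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(null, () -> {
            try {
                result.set(task.call());
            } catch (Throwable t) {          // also catches StackOverflowError
                failure.set(t);
            }
        }, "big-stack-worker", stackBytes);
        worker.start();
        worker.join();
        Throwable t = failure.get();
        if (t instanceof Exception) throw (Exception) t;
        if (t instanceof Error) throw (Error) t;
        if (t != null) throw new RuntimeException(t);
        return result.get();
    }

    /** Stand-in for a recursive matcher: uses one stack frame per step. */
    static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    public static void main(String[] args) throws Exception {
        int levels = 20_000;
        // Request a 64 MB stack for this one task only.
        int reached = callWithStack(64L * 1024 * 1024, () -> depth(levels));
        System.out.println("recursed " + reached + " levels");
    }
}
```

Scoping the larger stack to one worker thread avoids paying for it in every request thread, but it would mean changing the calling code, whereas -Xss is a one-line change to the start script.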

Thanks for any input,
--Rich

--

Richard Fozzard, Computer Scientist
  Geospatial Metadata at NGDC: http://www.ngdc.noaa.gov/metadata

Cooperative Institute for Research in Environmental Sciences (CIRES)
Univ. Colorado & NOAA National Geophysical Data Center, Enterprise Data Systems
325 S. Broadway, Skaggs 1B-305, Boulder, CO 80305
Office: 303-497-6487, Cell: 303-579-5615, Email: richard.fozzard@anonymised.com

Details:

We're running GN 2.2 using Tomcat/5.5.23, Java 1.6.0_13-b03, and Oracle 10.2 on Red Hat Linux EL3. I've confirmed that this bug also appears on a plain GN 2.2, Jetty, McKoi installation on Red Hat EL5, and on GN 2.2 and 2.4RC0 on Windows.

geonetwork.log:
2009-04-09 14:44:18,170 INFO [jeeves.request] - HTML Request (from 140.172.184.34) : /geonetwork/srv/en/metadata.update.validate
2009-04-09 14:44:18,170 DEBUG [jeeves.request] - Method : POST
2009-04-09 14:44:18,170 DEBUG [jeeves.request] - Content type : application/x-www-form-urlencoded
2009-04-09 14:44:18,170 DEBUG [jeeves.request] - Accept : text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2009-04-09 14:44:18,175 INFO [jeeves.service] - Dispatching : metadata.update.validate
2009-04-09 14:44:18,178 DEBUG [jeeves.service] - -> parameters are :
<request>
...
  <_73>LiDAR DEM Quality Control Report. The accuracy of a LiDAR DEM is estimated by determining the root mean square error (RMSE). RMSE is the square root of the average of the set of squared differences between dataset co-ordinate values and co-ordinate values from an independent source of higher accuracy for identical points. If those differences are normally distributed and average zero, 95 percent of any sufficiently large sample should be less than 1.96 times the RMSE. Therefore 15-centimeter RMSE is often referred to as &amp;quot;30-centimeter accuracy at the 95-percent confidence level&amp;quot;. Following that convention, the vertical accuracy of any DEM is defined as 1.96 times the RMSE of linearly interpolated elevations in the DEM, as compared with known elevations from high-accuracy test points. DEMs should have a maximum RMSE of 15 centimeters, which is roughly equivalent to 1-foot accuracy. Field verification of the vertical accuracy of this DEM to ensure that the 15-centimeter RMSE requirement was satisfied for all major vegetation categories that were predominate a) Bare-earth and low grass (plowed fields, lawns, golf courses); b) High grass and crops (hay fields, cornfields, wheat fields); c) Brush lands and low trees (chaparrals, mesquite, mangrove swamps); d) Fully covered by trees (hardwoods, evergreens, mixed forests); and e) Urban areas (high, dense man-made structures). An even distribution of sample points throughout each category area evaluated was collected and not grouped in a small subarea. The RMSE calculated from a sample of test points is not the RMSE of the DEM. The calculated value may be higher or it may be lower than that of the DEM. Confidence in the calculated value increases with the number of test points. If the errors (lack of accuracy) associated with the DEM are normally distributed and unbiased, the confidence in the calculated RMSE can be determined as a function of sample size. 
Similarly, the sample RMSE necessary to obtain 95-percent confidence that the DEM RMSE is less than 15 centimeters can also be determined as a function of sample size. For each major vegetation category, a sample of points was tested to show the test points have an RMSE less than where n is the number of test points in the sample. A minimum of 20 test points for each major vegetation category was identified. Therefore, a minimum of 100 test points was selected for the five major vegetation categories. The test points were to be selected in areas to evaluate DEM accuracy under trees and in vegetation representative of the study area. The PDOP during the LiDAR data collection was consistently less than 3.0 and was determined to be of no issue. Test points on sloping or irregular terrain would be unreasonably affected by the linear interpolation of test points from surrounding DEM points and, therefore, were not selected. Test points were collected by RTK (Real-Time Kinematic) GPS techniques. Three thousand Two Hundred and Sixty points were collected in total covering each of the five main categories of ground cover in the survey areas. Furthermore, six of the forty-eight control monuments falling within the project area and installed as part of the survey network were used as a further check. All RMSE calculations were performed on the bare-earth, orthometric surface. Results The comparisons between each validation point and the LiDAR DEM are shown in Appendix A. The comparisons between each control point and the LiDAR DEM are shown in Appendix B. The RMSE was determined for the project area. US Survey Feet Meters Average dz 0.144 0.044 Average magnitude</_73>
...
</request>
2009-04-09 14:44:18,544 DEBUG [jeeves.service] - Raised exception while executing service
<error id="error">
  <message />
  <class>StackOverflowError</class>
  <stack>
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
  </stack>
  <request>
    <language>en</language>
    <service>metadata.update.validate</service>
  </request>
</error>
2009-04-09 14:44:18,545 INFO [jeeves.service] - -> dispatching to error for : metadata.update.validate

Hi Richard,
Thanks for that report! Did you try any other stack sizes? http://java.sun.com/docs/hotspot/HotSpotFAQ.html#threads_oom lists the default sizes per OS, 32-bit or 64-bit. The default sizes seem quite a bit lower than 2M, so it may be worth not increasing the stack size too much. What default stack size did you experience the problems with?
Cheers,
Jeroen
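
One way to find out which stack size a running installation was started with is to ask the JVM for its startup options. The sketch below is only an illustration: `XssInspector` and `parseXss` are hypothetical names, while `RuntimeMXBean.getInputArguments()` is the standard management API. When no -Xss option is present, the platform default applies, and this sketch does not try to discover it.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class XssInspector {

    /** Returns the size in bytes of the last -Xss option found in args, or -1 if there is none. */
    static long parseXss(List<String> args) {
        long bytes = -1;
        for (String arg : args) {
            if (!arg.startsWith("-Xss") || arg.length() == 4) continue;
            String value = arg.substring(4);
            char unit = Character.toLowerCase(value.charAt(value.length() - 1));
            long factor = 1;
            if (unit == 'k') factor = 1024L;
            else if (unit == 'm') factor = 1024L * 1024;
            else if (unit == 'g') factor = 1024L * 1024 * 1024;
            String digits = factor == 1 ? value : value.substring(0, value.length() - 1);
            try {
                bytes = Long.parseLong(digits) * factor;   // keep the last one seen
            } catch (NumberFormatException e) {
                // not a size this sketch understands; ignore the argument
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        long xss = parseXss(ManagementFactory.getRuntimeMXBean().getInputArguments());
        System.out.println(xss < 0
                ? "no -Xss option given (platform default in use)"
                : "-Xss = " + xss + " bytes");
    }
}
```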

On Apr 10, 2009, at 1:20 AM, Richard Fozzard wrote:

------------------------------------------------------------------------------
This SF.net email is sponsored by:
High Quality Requirements in a Collaborative Environment.
Download a free trial of Rational Requirements Composer Now!
http://p.sf.net/sfu/www-ibm-com
_______________________________________________
GeoNetwork-devel mailing list
GeoNetwork-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork

Hi Rich,

Haven't seen this before, but I spent a bit of time looking at it as I changed the validation stuff in 2.4 and felt I might be vaguely responsible :-) - but I don't know how much extra I can add, as I suspect you've probably been through all the same hoops :-)

Fortunately it doesn't happen with large fields in other metadata standards (which is good) - but I suspect that's because fgdc has a schema definition (for some elements, e.g. abstract with type FGDCstring) that forces validation against regular expressions (hence the stack trace in a regular expression engine). A bit of a hack would be to remove/change the regular expression patterns on the fields with long content (e.g. abstract, where I tested mine with a field of about 19500 characters - I used a new simpleType in the fgdc schema.xsd that didn't have the regular expression check and everything worked fine then, as you'd expect).
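
For anyone who wants to try the with/without-pattern comparison outside GeoNetwork, a minimal sketch using the JDK's javax.xml.validation API follows. The pattern string is an assumption modelled on the kind of restriction described above, so check the real fgdc schema before relying on it. The tiny schema, the `text` type and the `PatternFacetDemo` class are invented for illustration. Whether the long input actually overflows depends on the JVM, its bundled Xerces and the stack size, so the code reports the outcome rather than assuming it.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class PatternFacetDemo {

    // ASSUMPTION: illustrative pattern, not copied from the fgdc schema files.
    static final String PATTERN = "\\s*\\S(.|\\n|\\r)*";

    /** Builds a one-element schema whose string type optionally carries the pattern facet. */
    static String schema(boolean withPattern) {
        String facet = withPattern ? "<xs:pattern value='" + PATTERN + "'/>" : "";
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
             + "<xs:simpleType name='text'><xs:restriction base='xs:string'>"
             + facet + "</xs:restriction></xs:simpleType>"
             + "<xs:element name='abstract' type='text'/>"
             + "</xs:schema>";
    }

    /** Returns "valid", "invalid" or "stack overflow". Content must not contain XML markup. */
    static String validate(boolean withPattern, String content) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(schema(withPattern))));
        Validator validator = schema.newValidator();
        String doc = "<abstract>" + content + "</abstract>";
        try {
            validator.validate(new StreamSource(new StringReader(doc)));
            return "valid";
        } catch (SAXException e) {
            return "invalid";
        } catch (StackOverflowError e) {
            return "stack overflow";
        }
    }

    public static void main(String[] args) throws Exception {
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 4000; i++) longText.append("word ");
        System.out.println("short, with pattern:   " + validate(true, "LiDAR DEM Quality Control Report."));
        System.out.println("long, with pattern:    " + validate(true, longText.toString()));
        System.out.println("long, without pattern: " + validate(false, longText.toString()));
    }
}
```

Dropping the facet removes the regular expression engine from the validation path for that type, at the cost of no longer rejecting whitespace-only content.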

I also thought it might be because the xalan stuff being used was the internal Sun stuff in the jdk/jre rt.jar (which is apparently an older xerces/xalan - 2.6.?/?), so I wrested control away from Sun's internal xalan stuff (which was very stubborn) and gave it back to xerces 2.9.0. I still got the same problem, so it's not a 'bug' in an older version of xerces/xalan.

Using saxon doesn't help because validation uses the SchemaEngine and the public domain saxon doesn't have one (I don't think anyway) so in that case xalan/xerces is used for validation and saxon for xslt. Didn't try saxonica saxon.

I also tried a different validationHandler in Jeeves, i.e. thinking that there might be a bug in the JDOM SAXOutputter - but no, same problem :-(

So yep, I agree it seems like upping the stack size is quickest. You could fiddle with the fgdc xsds/regular expressions or (last choice maybe) delve into the xerces code and see if you can make that work more efficiently (see matchString in src/org/apache/xerces/impl/xpath/regex/RegularExpression.java) :-)
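
To illustrate why recursion is not inherently needed for this kind of check, here is a small sketch. It assumes the pattern involved is of the form `\s*\S(.|\n|\r)*`, which is an assumption and not taken from the fgdc schema files. Such a pattern only says "the text contains at least one non-whitespace character", and that can be tested with a plain loop using constant stack. `NonBlankCheck` and its methods are hypothetical names. This is not a patch to Xerces.

```java
class NonBlankCheck {

    /** In XML Schema regular expressions, \s covers exactly space, tab, line feed and carriage return. */
    static boolean isXsdWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    /** True if the text has at least one non-whitespace character; a loop, so no stack growth. */
    static boolean hasNonWhitespace(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (!isXsdWhitespace(text.charAt(i))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 20_000; i++) longText.append("word ");
        System.out.println(hasNonWhitespace(longText));      // prints true
        System.out.println(hasNonWhitespace(" \t\r\n"));     // prints false
    }
}
```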

Cheers,
Simon

________________________________________
From: Richard Fozzard [Richard.Fozzard@anonymised.com]
Sent: Friday, 10 April 2009 9:20 AM
To: geonetwork-devel@lists.sourceforge.net
Subject: [GeoNetwork-devel] Geonetwork 2.2 and 2.4RC0 gives StackOverflowError when validating metadata elements >3000 chars?

Sorry Richard, I can't compete with such detailed analysis! Probably the best proof of the value of an open source project! :-) Thanks Simon!
Ciao,
Jeroen

On Apr 10, 2009, at 4:46 PM, <Simon.Pigot@anonymised.com> wrote:

geonetwork.log:
2009-04-09 14:44:18,170 INFO [jeeves.request] - HTML Request (from
140.172.184.34) : /geonetwork/srv/en/metadata.update.validate
2009-04-09 14:44:18,170 DEBUG [jeeves.request] - Method : POST
2009-04-09 14:44:18,170 DEBUG [jeeves.request] - Content type :
application/x-www-form-urlencoded
2009-04-09 14:44:18,170 DEBUG [jeeves.request] - Accept :
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2009-04-09 14:44:18,175 INFO [jeeves.service] - Dispatching :
metadata.update.validate
2009-04-09 14:44:18,178 DEBUG [jeeves.service] - -> parameters are :
<request>
...
<_73>LiDAR DEM Quality Control Report. The accuracy of a LiDAR DEM is
estimated by determining the root mean square error (RMSE). RMSE is the
square root of the average of the set of squared differences between
dataset co-ordinate values and co-ordinate values from an independent
source of higher accuracy for identical points. If those differences are
normally distributed and average zero, 95 percent of any sufficiently
large sample should be less than 1.96 times the RMSE. Therefore
15-centimeter RMSE is often referred to as &amp;quot;30-centimeter
accuracy at the 95-percent confidence level&amp;quot;. Following that
convention, the vertical accuracy of any DEM is defined as 1.96 times
the RMSE of linearly interpolated elevations in the DEM, as compared
with known elevations from high-accuracy test points. DEMs should have a
maximum RMSE of 15 centimeters, which is roughly equivalent to 1-foot
accuracy. Field verification of the vertical accuracy of this DEM to
ensure that the 15-centimeter RMSE requirement was satisfied for all
major vegetation categories that were predominate a) Bare-earth and low
grass (plowed fields, lawns, golf courses); b) High grass and crops (hay
fields, cornfields, wheat fields); c) Brush lands and low trees
(chaparrals, mesquite, mangrove swamps); d) Fully covered by trees
(hardwoods, evergreens, mixed forests); and e) Urban areas (high, dense
man-made structures). An even distribution of sample points throughout
each category area evaluated was collected and not grouped in a small
subarea. The RMSE calculated from a sample of test points is not the
RMSE of the DEM. The calculated value may be higher or it may be lower
than that of the DEM. Confidence in the calculated value increases with
the number of test points. If the errors (lack of accuracy) associated
with the DEM are normally distributed and unbiased, the confidence in
the calculated RMSE can be determined as a function of sample size.
Similarly, the sample RMSE necessary to obtain 95-percent confidence
that the DEM RMSE is less than 15 centimeters can also be determined as
a function of sample size. For each major vegetation category, a sample
of points was tested to show the test points have an RMSE less than
where n is the number of test points in the sample. A minimum of 20 test
points for each major vegetation category was identified. Therefore, a
minimum of 100 test points was selected for the five major vegetation
categories. The test points were to be selected in areas to evaluate DEM
accuracy under trees and in vegetation representative of the study area.
The PDOP during the LiDAR data collection was consistently less than 3.0
and was determined to be of no issue. Test points on sloping or
irregular terrain would be unreasonably affected by the linear
interpolation of test points from surrounding DEM points and, therefore,
were not selected. Test points were collected by RTK (Real-Time
Kinematic) GPS techniques. Three thousand Two Hundred and Sixty points
were collected in total covering each of the five main categories of
ground cover in the survey areas. Furthermore, six of the forty-eight
control monuments falling within the project area and installed as part
of the survey network were used as a further check. All RMSE
calculations were performed on the bare-earth, orthometric surface.
Results The comparisons between each validation point and the LiDAR DEM
are shown in Appendix A. The comparisons between each control point and
the LiDAR DEM are shown in Appendix B. The RMSE was determined for the
project area. US Survey Feet Meters Average dz 0.144 0.044 Average
magnitude</_73>
...
</request>
2009-04-09 14:44:18,544 DEBUG [jeeves.service] - Raised exception while executing service
<error id="error">
  <message />
  <class>StackOverflowError</class>
  <stack>
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1914" method="matchString" />
    <at class="com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression" file="RegularExpression.java" line="1940" method="matchString" />
  </stack>
  <request>
    <language>en</language>
    <service>metadata.update.validate</service>
  </request>
</error>
2009-04-09 14:44:18,545 INFO [jeeves.service] - -> dispatching to error for : metadata.update.validate

------------------------------------------------------------------------------
This SF.net email is sponsored by:
High Quality Requirements in a Collaborative Environment.
Download a free trial of Rational Requirements Composer Now!
http://p.sf.net/sfu/www-ibm-com
_______________________________________________
GeoNetwork-devel mailing list
GeoNetwork-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork

Simon, Jeroen

Thanks for the in-depth analysis and quick replies!

I did try some of the same things, and have since done some crude tests of different stack and XML element sizes (details below).

To sum up, the largest XML element we have in our collection (many thousands of records of very different kinds of science datasets) is about 16k. To correctly validate twice that size (for a safety margin), I found that a Java stack size of 10MB (-Xss10M) was needed. There seemed to be a roughly linear relationship of stack size to element size (I was hoping for an asymptote!), so for now we've settled on the 10MB stack.

It does have a big effect on the virtual memory usage of Jetty and Tomcat in our installation: up from 900MB to about 1800MB, but actual resident memory usage isn't much different: about 400MB. That's with the limited testing I've done. In a real production environment, we'll want to watch memory more carefully with this larger stack size!

So should we consider a larger default stack for Geonetwork 2.4? Or at least a "tech note" explaining how to solve the problem if users experience it?

Thanks for all the help!
--Rich

Details:

- testing: tried different stack sizes on a 32k element (twice size of lidar example)
    read out 'top' mem usage of jetty after validation, two browsers connected
    (results without a size are for the 32k element)
    (default stack size)
     - 32k FAILS, 920m virt, 380m res
    -Xss2M
     - 4k WORKS
     - 8k FAILS
     - 32k FAILS, 1030m virt, 400m res
    -Xss4M
     - 8k WORKS
     - 16k FAILS
     - 32k FAILS, 1150m virt, 390m res
    -Xss8M
     - 16k WORKS
     - 32k FAILS, 1320m virt, 410m res
    -Xss10M
     - 32k WORKS, 1470m virt, 420m res
     - 64k FAILS
    -Xss12M
     - 32k WORKS, 1610m virt, 430m res
     - 64k FAILS
    -Xss16M
     - 32k WORKS, 1830m virt, 430m res
     - 64k FAILS
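
For anyone who wants to avoid raising the stack for every thread with -Xss, here is a minimal sketch of an alternative: run only the deep work on one dedicated thread that asks for a larger stack. This is not how GeoNetwork or Jeeves does validation; the class BigStackRunner and the recursive depth() helper are hypothetical stand-ins for the recursive regex matching. The Thread constructor with a stackSize argument is standard Java, but the JVM treats the value as a hint and some platforms may ignore it.

```java
import java.util.concurrent.atomic.AtomicReference;

public class BigStackRunner {

    /** Hypothetical stand-in for deeply recursive work such as regex matching. */
    public static int depth(int n) {
        return n == 0 ? 0 : 1 + depth(n - 1);
    }

    /**
     * Runs the task on a dedicated thread that requests the given stack size.
     * Returns null on success, otherwise the Throwable that was raised
     * (a StackOverflowError if the stack was still too small).
     */
    public static Throwable runWithStack(Runnable task, long stackBytes) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread worker = new Thread(null, () -> {
            try {
                task.run();
            } catch (Throwable t) {
                failure.set(t);
            }
        }, "big-stack-worker", stackBytes);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
        return failure.get();
    }

    public static void main(String[] args) {
        long sixteenMb = 16L * 1024 * 1024;
        Throwable result = runWithStack(() -> depth(50_000), sixteenMb);
        System.out.println(result == null ? "completed" : "failed: " + result);
    }
}
```

The trade-off is the same one measured above: a bigger stack costs virtual memory, but here only for the one worker thread rather than for every thread in Jetty or Tomcat.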

Simon.Pigot@anonymised.com said the following on 04/10/2009 08:46 AM:

Hi Rich,

Haven't seen this before but I spent a bit of time looking at this as I changed the validation stuff in 2.4 and felt I might be vaguely responsible :-) - but I don't know how much extra I can add, as I suspect you've probably been through all the same hoops :-)

Fortunately it doesn't happen with large fields in other metadata standards (which is good), but I suspect that's because FGDC has a schema definition (for some elements, e.g. abstract with type FGDCstring) that forces validation against regular expressions (hence the stack trace in a regular expression engine). A bit of a hack would be to remove or change the regular expression patterns on the fields with long content. For example, I tested abstract with a field of about 19500 characters: I used a new simpleType in the FGDC schema.xsd that didn't have the regular expression check, and everything then worked fine, as you'd expect.
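
To try the pattern-versus-no-pattern comparison outside GeoNetwork, a small standalone test along these lines may help. It is only a sketch: the one-element schema and the pattern in it are illustrative assumptions, not copied from the FGDC schema.xsd, and the class name PatternValidationDemo is made up. It uses the JDK's built-in javax.xml.validation API, so whether the long text actually overflows depends on the JVM version and stack size in use.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class PatternValidationDemo {

    /** Hypothetical schema: one string element, optionally restricted by a regex pattern. */
    public static String schema(boolean withPattern) {
        // Illustrative pattern only (an assumption, not the real FGDCstring definition).
        String facet = withPattern ? "<xs:pattern value='\\s*\\S(.|\\n|\\r)*'/>" : "";
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
             + "<xs:element name='abstract'>"
             + "<xs:simpleType><xs:restriction base='xs:string'>"
             + facet
             + "</xs:restriction></xs:simpleType>"
             + "</xs:element></xs:schema>";
    }

    /** Returns "valid", "stack overflow", or "invalid: <message>". */
    public static String validate(String text, boolean withPattern) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(schema(withPattern))));
            Validator validator = schema.newValidator();
            String xml = "<abstract>" + text + "</abstract>";
            validator.validate(new StreamSource(new StringReader(xml)));
            return "valid";
        } catch (StackOverflowError e) {
            return "stack overflow";
        } catch (Exception e) {
            return "invalid: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Build roughly 20000 characters of plain text, similar in size to the test above.
        StringBuilder longText = new StringBuilder();
        while (longText.length() < 20_000) {
            longText.append("lidar accuracy report ");
        }
        System.out.println("short, with pattern:   " + validate("hello world", true));
        System.out.println("long, without pattern: " + validate(longText.toString(), false));
        System.out.println("long, with pattern:    " + validate(longText.toString(), true));
    }
}
```

Running it with different -Xss values should show whether the pattern facet is what drives the stack usage on a given JVM.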

I also thought it might be because the xalan stuff being used was the internal Sun stuff in the jdk/jre rt.jar (which is apparently an older xerces/xalan - 2.6.?/?), so I wrested control away from Sun's internal xalan stuff (which was very stubborn) and gave it back to xerces 2.9.0. I still got the same problem, so it's not a 'bug' in an older version of xerces/xalan.

Using saxon doesn't help because validation uses the SchemaEngine and the public domain saxon doesn't have one (I don't think so, anyway), so in that case xalan/xerces is used for validation and saxon for XSLT. I didn't try saxonica saxon.

I also tried a different validationHandler in Jeeves, thinking that there might be a bug in the JDOM SAXOutputter - but no, same problem :-(

So yep, I agree it seems like upping the stack size is quickest. You could fiddle with the FGDC xsds/regular expressions or (last choice maybe) delve into the xerces code and see if you can make that work more efficiently (see matchString in src/org/apache/xerces/impl/xpath/regex/RegularExpression.java) :-)

Cheers,
Simon

________________________________________
From: Richard Fozzard [Richard.Fozzard@anonymised.com]
Sent: Friday, 10 April 2009 9:20 AM
To: geonetwork-devel@lists.sourceforge.net
Subject: [GeoNetwork-devel] Geonetwork 2.2 and 2.4RC0 gives StackOverflowError when validating metadata elements >3000 chars?

Fellow Geonetworkers,

I'm mystified!

If we have an FGDC record with an element, any element, longer than
about 3000 characters, it won't validate. (And we have many metadata
records we receive from other institutions like this.) We get the
"Validation Error...Metadata is not valid." error page. The
geonetwork.log file shows:

    StackOverflowError in
com.sun.org.apache.xerces.internal.impl.xpath.regex.RegularExpression.matchString()

As soon as I delete enough characters to get below 3000 in length, the
record validates.

Has anyone seen this error?

We have discovered that adding "-Xss2M" to start-geonetwork.sh to
increase the Java stack size seems to solve the problem. Is this a
reasonable solution, and should it be added to the 2.4 standard
installation?

Thanks for any input,
--Rich
