[GeoNetwork-devel] A solution to search-related problems: operators, quotes, phrases, Chinese

Hi Acarboni, Ticheler, GN developers:
Have you run into search-related problems in the web Advanced Search?
I have.

  1. Problem.
    These are the search-related problems I found:
    1. Operators: the operators (and, or, not) have no effect.
    2. Quotes: they have no effect either.
    3. Phrase queries: these must use quotes, but quotes,…
    4. Queries in Asian languages such as Chinese:
       the results are not exact. GN finds the metadata that contains each character of the query, not the query phrase.
       The effect is like searching for “any more” and GeoNetwork finding “any” and “more”.
  2. WHY?
    What are the reasons?
    The analyzer is the main cause of these problems.
    In the Java class services.main.Search, the query string is passed to the
    MainUtil.splitWord function to be split into words, like this:
    if (any != null)
        any.setText(MainUtil.splitWord(any.getText()));
    The splitWord function uses StandardAnalyzer:
    public static String splitWord(String requestStr)
    {
        Analyzer a = new StandardAnalyzer();

    }
    StandardAnalyzer filters out stop words such as “and”, “or”, “not”, “as”…,
    and it also removes the quotes ("), so the string this function returns no longer contains the operators or the quotes.
    Because “and” is the default operator, GN then combines the remaining terms with “and” in the Lucene query.
    That is how the problems arise.
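
To make the effect described above concrete, here is a minimal sketch in plain Java. It does not call Lucene: StopWordDemo and its analyze method are hypothetical stand-ins that only imitate the two behaviours mentioned in the message (dropping stop words and dropping quotes), with an assumed stop-word list limited to the four words named above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an analyzer with a stop-word list.
// This is NOT Lucene's StandardAnalyzer, only an imitation of its effect.
class StopWordDemo {
    // Assumed stop words: just the ones named in the message.
    private static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("and", "or", "not", "as"));

    static String analyze(String query) {
        // Lowercase, then turn quotes and other punctuation into spaces.
        String cleaned = query.toLowerCase()
                              .replaceAll("[^\\p{L}\\p{N}]+", " ")
                              .trim();
        List<String> kept = new ArrayList<String>();
        for (String token : cleaned.split(" ")) {
            // Skip empty tokens and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(analyze("\"land cover\" or soil")); // prints: land cover soil
        System.out.println(analyze("rivers not lakes"));       // prints: rivers lakes
    }
}
```

Once the operator and the quotes are gone, the query builder only sees bare terms, so "or", "not" and the phrase grouping can no longer influence the search.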

  3. Solution.
How can this be resolved?
Should we simply stop using the StandardAnalyzer? No, we still need it to analyze the query string, for example
the phrase inside the quotes. So we must find the quotes before analyzing, and send only the phrase between
the quotes to the analyzer. With my solution the quotes, operators and phrases take effect,
and searching works for Chinese as well. Here is my solution:

if (any != null)
{
    any.setText( splitWord(any.getText()) );
}

The new splitWord replaces the direct call to MainUtil.splitWord; MainUtil.splitWord is now called from inside splitWord.
Below is the splitWord function in Search.java:

// The code starts here; it goes in the .service.main.Search.java file
private static final String OPER_AND = " and ";
private static final String OPER_OR = " or ";
private static final String OPER_NOT = " not ";

private String splitWord( String strValue )
{
    // basic string processing: normalize quotes, trim, collapse whitespace
    String strQuoteSg = "'";
    String strQuoteDb = "\"";
    // turn single quotes into double quotes
    strValue = strValue.replaceAll( strQuoteSg, strQuoteDb );

    // trim
    strValue = strValue.trim();
    // collapse runs of whitespace into a single space
    strValue = strValue.replaceAll( "\\s\\s+", " " );

    // lowercase, because the search is not case sensitive
    strValue = strValue.toLowerCase();

    if( strValue.length() > 0 )
    {
        int nFirstIndex = strValue.indexOf( strQuoteDb );

        if( nFirstIndex < 0 )
        {
            // no quotes in the query: add them around the terms,
            // using the operators and, or, not as separators
            strValue = replaceComponent( strValue );
        }
        return splitString( strValue );
    }
    else
        return strValue;
}
// " " → " and "
private String replaceComponent( String strValue )
{
    String strQuoteDb = "\"";
    String strWhitespace = " ";

    // add quotes at the head and tail
    strValue = strQuoteDb + strValue + strQuoteDb;

    // look for whitespace; a single term needs nothing more
    int nIndex = strValue.indexOf( strWhitespace );
    if( nIndex < 0 )
        return strValue;
    else
    {
        // put quotes around the operators and, or, not
        strValue = checkKeyword( strValue );

        // if no operator is included, use "and" as the default
        if( strValue.contains( OPER_AND ) || strValue.contains( OPER_OR )
            || strValue.contains( OPER_NOT ) )
        {
            return strValue;
        }
        else
        {
            return strValue.replace( strWhitespace,
                strQuoteDb + strWhitespace + "and" + strWhitespace + strQuoteDb );
        }
    }
}

private String checkKeyword( String strValue )
{
    strValue = checkKeywordComponent( strValue, OPER_AND );
    strValue = checkKeywordComponent( strValue, OPER_OR );
    strValue = checkKeywordComponent( strValue, OPER_NOT );
    return strValue;
}

// put a quote before and after each occurrence of the keyword
// strValue and keyword must both be lowercase
private String checkKeywordComponent( String strValue, String keyword )
{
    StringBuffer sb = new StringBuffer();
    sb.append( strValue );

    int nIndex = sb.indexOf( keyword );
    int offset = keyword.length();

    String strQuoteDb = "\"";
    while( nIndex >= 0 )
    {
        // make sure there is a quote before the keyword
        // (nIndex > 0 guards against a keyword at the very start)
        if( nIndex > 0 && !sb.substring( nIndex-1, nIndex ).equals( strQuoteDb ) )
        {
            sb.insert( nIndex, strQuoteDb );
            offset++;
        }
        // make sure there is a quote after the keyword
        // (the length check guards against a keyword at the very end)
        if( nIndex+offset < sb.length()
            && !sb.substring( nIndex+offset, nIndex+offset+1 ).equals( strQuoteDb ) )
        {
            sb.insert( nIndex+offset, strQuoteDb );
        }
        nIndex = sb.indexOf( keyword, nIndex+2 );
        offset = keyword.length();
    }

    return sb.toString();
}

private String splitString( String strValue )
{
    // trim leading and trailing whitespace
    strValue = strValue.trim();
    // collapse runs of whitespace into a single space
    strValue = strValue.replaceAll( "\\s\\s+", " " );
    // add quotes around the operators and, or, not
    strValue = checkKeyword( strValue );

    String strQuoteDb = "\"";

    StringBuffer sb = new StringBuffer();
    int nStartIndex = 0;
    int nFirstIndex = strValue.indexOf( strQuoteDb );

    while( nFirstIndex >= 0 )
    {
        // copy everything up to and including the opening quote
        sb.append( strValue.substring( nStartIndex, nFirstIndex+1 ) );

        int nSecondQuote = strValue.indexOf( strQuoteDb, nFirstIndex+1 );

        nStartIndex = (nFirstIndex < strValue.length()-1) ? nFirstIndex+1 : strValue.length();
        if( nSecondQuote < 0 ) // the closing quote is missing
        {
            // analyze the rest of the string and close the quote
            String strLast = strValue.substring( nStartIndex, strValue.length() );
            strLast = MainUtil.splitWord( strLast );
            sb.append( strLast );
            sb.append( strQuoteDb );
            nStartIndex = strValue.length();
            break;
        }
        else
        {
            // analyze only the phrase between the two quotes
            String strLast = strValue.substring( nStartIndex, nSecondQuote );
            strLast = MainUtil.splitWord( strLast );
            sb.append( strLast );
            sb.append( strQuoteDb );
            nStartIndex = nSecondQuote+1;
        }
        // find the next opening quote
        nFirstIndex = strValue.indexOf( strQuoteDb, nStartIndex );
    }

    // copy whatever follows the last closing quote
    if( nStartIndex < strValue.length() )
    {
        sb.append( strValue.substring( nStartIndex ) );
    }
    return sb.toString();
}
You can test it.
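
For readers who want to try the central idea in isolation (analyze only what is between quotes, copy operators and quotes through untouched), here is a compact sketch. It is not the code above: QuoteAwareSplitDemo and analyzeInsideQuotes are hypothetical names, a regular expression replaces the manual index bookkeeping, and the analyzer is passed in as a function so the example does not depend on MainUtil.splitWord.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: apply an analyzer only to double-quoted phrases.
class QuoteAwareSplitDemo {
    // Matches one double-quoted phrase; group 1 is the text between the quotes.
    private static final Pattern PHRASE = Pattern.compile("\"([^\"]*)\"");

    static String analyzeInsideQuotes(String query, Function<String, String> analyzer) {
        Matcher m = PHRASE.matcher(query);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Text outside the quotes (operators, spaces) is copied unchanged.
            out.append(query, last, m.start());
            // Only the phrase itself goes through the analyzer.
            out.append('"').append(analyzer.apply(m.group(1))).append('"');
            last = m.end();
        }
        // Copy anything after the last closing quote.
        out.append(query.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // Upper-casing is a visible placeholder for the real analyzer.
        String result = analyzeInsideQuotes("\"land cover\" or \"soil\"",
                                            s -> s.toUpperCase());
        System.out.println(result); // prints: "LAND COVER" or "SOIL"
    }
}
```

Unlike splitString, this sketch leaves an unmatched opening quote untouched instead of closing it.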

  4. COMMIT?
    Who can commit this to the GN source?
    Or how can I commit it myself?


hi Zhuhua,

as for your problems 1 - 3, you’re right, even though the documentation states that “or”, “not” and “phrase” operators in queries are supported, they are not. If we’d like these operators in that way (i.e. operating in a query from a single search field), we should use a Lucene QueryParser, and we don’t.

However some time ago I alternatively implemented “or”, “without” and “phrase” queries by adding extra search fields for each of these – kind of like Google’s advanced search page where you also have separate input fields for this. These fields are normally “hidden” (invisible) in the advanced search section; if you set their “display” property to either “inline” or “block”, you should be able to use them straight away.

As for your 4th problem, I’m not entirely sure what you mean…

Kind regards
Heikki Doeleman

(In reply to zhuhua zha's message of Tue, Sep 9, 2008, 6:19 AM.)


GeoNetwork-devel mailing list
GeoNetwork-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork

The 4th problem is specific to Chinese.
In English each word is one term: "hello body" is two
words. When we run it through StandardAnalyzer, as
MainUtil.splitWord does, the result is " hello body ".
"hello" is still "hello" and "body" is still "body".
But in Chinese, take "XX YY", where each "X" and "Y"
stands for one character: the result is " X X Y Y ",
so "XX" becomes " X X ". What we actually need is the
phrase "XX", so we need a phrase query in Lucene.
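
The difference between matching separate characters and matching a phrase can be sketched in plain Java. This does not use Lucene: CjkPhraseDemo and its methods are hypothetical, and one token per character is an assumed simplification of what the analyzer does with Chinese text. The sample strings are written as Unicode escapes: the query is 地图 ("map"), the first document is 世界地图 ("world map"), and the second is 地理与图像 ("geography and imagery"), which contains both characters but not next to each other.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: term matching versus phrase matching on
// text that has been split into one token per character.
class CjkPhraseDemo {
    // One token per character, skipping whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isWhitespace(cp)) {
                tokens.add(new String(Character.toChars(cp)));
            }
            i += Character.charCount(cp);
        }
        return tokens;
    }

    // "and" match: every query character occurs somewhere in the document.
    static boolean matchesAllTerms(String doc, String query) {
        return tokenize(doc).containsAll(tokenize(query));
    }

    // Phrase match: the query characters occur consecutively and in order.
    static boolean matchesPhrase(String doc, String query) {
        return Collections.indexOfSubList(tokenize(doc), tokenize(query)) >= 0;
    }

    public static void main(String[] args) {
        String query = "\u5730\u56fe";              // "map"
        String worldMap = "\u4e16\u754c\u5730\u56fe";   // "world map"
        String geography = "\u5730\u7406\u4e0e\u56fe\u50cf"; // "geography and imagery"

        System.out.println(matchesAllTerms(worldMap, query));  // prints: true
        System.out.println(matchesAllTerms(geography, query)); // prints: true
        System.out.println(matchesPhrase(worldMap, query));    // prints: true
        System.out.println(matchesPhrase(geography, query));   // prints: false
    }
}
```

The second document is the unwanted hit: it passes the per-character "and" match but fails the phrase match.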


Okay,

if this is the case, I think we had better start using a Lucene QueryParser, which should probably handle multilingual cases better than the MainUtil.splitWord approach we have now, which is somewhat unfortunate anyway.

Do you, or does anyone on this list, have experience using Lucene QueryParsers with languages in non-Western scripts such as Chinese?

This issue must surely have come up among users of the very popular Lucene search library. In my opinion, we should look for a solution that already exists.

Did my answer to your problems 1-3 address those issues, or is everything blocked by your problem 4?

Are there no other implementations of GeoNetwork in Chinese? If anyone has one, how did you solve this problem?

Kind regards
Heikki Doeleman

(In reply to zhuhua zha's message of Tue, Sep 9, 2008, 5:30 PM.)