SearchEngine - Frequently asked questions
- This chapter lists common problems encountered when using the SearchEngine, together
with their solutions. The questions are divided into two categories: the
SearchEngine and the Search applet.
The FAQ index
- Files are not being excluded
- The SearchEngine is reading files excluded with the -xu flag.
- SearchEngine: tags or tag attributes are being stored in the
database
- The SearchEngine is storing words which look suspiciously like tags or tag attributes.
- SearchEngine: keywords in titles and headers are missing
- The SearchEngine is not storing words which appear in HTML tags like <TITLE>,
<H1..H6>, etc.
- SearchEngine: runs fine for a while, then slows down
- The SearchEngine parses the first few hundred files, then slows down and starts
thrashing (repeatedly using) the hard-disk.
- SearchEngine: stops with an OutOfMemoryException
- The SearchEngine parses the first few hundred files, then displays a long list of error
messages, starting with OutOfMemoryException.
- SearchEngine: stops with a 'Too many files for the search
applet database' message
- The SearchEngine parses many hundreds of files, then displays a 'Too many files for the
search applet database' message.
- Applet: Search button remains gray, or an error message
appears
- The applet starts up, but after a few seconds, the search button appears grayed out, or
an error message is displayed.
- Applet: Clicking on a title causes the browser to issue a
'document not found' error
- When the user double clicks on a found document title, instead of the browser opening
the document, it issues a 'document not found' error message.
Questions about the SearchEngine
- Files are not being excluded
- The SearchEngine is reading files excluded with the -xu flag.
- Take care when using the wildcard character '*'.
- The wildcard character '*' is only recognized at the start of the URL,
at the end of the URL, or both. Anywhere else it is treated as an ordinary
character.
No other combinations of the wildcard character '*' are valid. For example, a filter
definition of */extawt/*remove.* results in a (probably useless) filter
that ignores all URLs containing the literal text /extawt/*remove., and not
the probable intention of ignoring all URLs that contain both /extawt/ and
remove..
- The SearchEngine uses case-sensitive URLs when filtering.
- Some operating systems (Windows) treat file names as case insensitive, but the
SearchEngine does not. If, for example, the filter
-xu *.zip
is used, then all files ending in .zip are excluded, but files
ending in .ZIP are not. Specify both lower case and upper case to filter
file extensions:
-xu *.zip
-xu *.ZIP
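The wildcard and case rules described above can be sketched in Java as follows. This is only an illustration of the behaviour stated in this FAQ, not the SearchEngine's own code: the class UrlFilterSketch and the method isExcluded are hypothetical names, and treating a pattern without wildcards as an exact match is an assumption.

```java
final class UrlFilterSketch {

    // Returns true if the URL matches the filter pattern. Only a leading and/or
    // trailing '*' acts as a wildcard; a '*' anywhere else is a literal character.
    // The comparison is case sensitive, as described in this FAQ.
    static boolean isExcluded(String pattern, String url) {
        boolean leading = pattern.startsWith("*");
        boolean trailing = pattern.length() > 1 && pattern.endsWith("*");
        String core = pattern.substring(leading ? 1 : 0,
                                        pattern.length() - (trailing ? 1 : 0));
        if (leading && trailing) {
            return url.contains(core);
        }
        if (leading) {
            return url.endsWith(core);
        }
        if (trailing) {
            return url.startsWith(core);
        }
        return url.equals(core); // assumption: no wildcard means exact match
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("*.zip", "docs/archive.zip"));                  // true
        System.out.println(isExcluded("*.zip", "docs/ARCHIVE.ZIP"));                  // false
        System.out.println(isExcluded("*/extawt/*", "a/extawt/x/remove.html"));       // true
        // The inner '*' is literal, so an ordinary URL does not match:
        System.out.println(isExcluded("*/extawt/*remove.*", "a/extawt/x/remove.html")); // false
    }
}
```

The last call shows why */extawt/*remove.* is probably useless: it only matches URLs that really contain an asterisk between /extawt/ and remove..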
- Tags or tag attributes are being stored in the
database
- The SearchEngine is storing words which look suspiciously like tags or tag attributes.
- The HTML documents may indeed contain the tag keywords as text, for example if the
document is about HTML.
- Check the documents for the offending keywords, and determine whether they are inside
or outside HTML markup. Watch out for incorrectly formed comment syntax.
- The HTML document may have syntax errors, which cause the SearchEngine
to store the words in the body, or to ignore them completely.
- Check the documents for the offending keywords, and ensure that they are inside the
correct HTML markup. Watch out for incorrectly formed comment syntax.
- Keywords in titles and headers are missing
- The SearchEngine is not storing words which appear in HTML tags like <TITLE>,
<H1..H6>, etc.
- The HTML document may have syntax errors, which cause the SearchEngine
to store the words in the body, or to ignore them completely.
- Check the documents for the offending keywords, and ensure that they are inside the
correct HTML markup. Watch out for incorrectly formed comment syntax.
- Runs fine for a while, then slows down
- The SearchEngine parses the first few hundred files, then slows down and starts
thrashing (repeatedly using) the hard-disk.
- The SearchEngine is running out of virtual memory.
- The SearchEngine requires virtual memory of about 1.5 to 2.0 times the total size of the
documents being parsed. If, say, you have 9 MB of documents, then you will require about
13.5 to 18 MB of virtual memory.
Start the Java interpreter with as much virtual memory
as needed using the -mx switch (the default is 16 MB):
java -mx24m ruptools.SearchEngine ...
- Not enough virtual memory.
- Possible solutions are:
- Split the files up into sub-groups, and create databases for each.
- Remove word groups using the -nb, -nl and -nh flags (in that order).
- Do both: provide a restricted global search together with complete sub-searches.
- Extend the word exclusion list (english.exclude.html is very generic).
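The sizing rule given for this problem (1.5 to 2.0 times the total document size) can be turned into a quick estimate. The following is a minimal sketch that assumes the rule holds as stated; the class and method names are hypothetical and are not part of the SearchEngine.

```java
final class HeapEstimateSketch {

    static final double LOW_FACTOR = 1.5;   // lower multiplier from this FAQ
    static final double HIGH_FACTOR = 2.0;  // upper multiplier from this FAQ

    // Lower estimate of the memory needed, in megabytes.
    static double lowEstimateMb(double documentsMb) {
        return documentsMb * LOW_FACTOR;
    }

    // Upper estimate in whole megabytes, rounded up so it can be used as a switch value.
    static long highEstimateMb(double documentsMb) {
        return (long) Math.ceil(documentsMb * HIGH_FACTOR);
    }

    public static void main(String[] args) {
        double documentsMb = 9.0; // the example size used in this FAQ
        System.out.println("low:  " + lowEstimateMb(documentsMb) + " MB");  // low:  13.5 MB
        System.out.println("high: " + highEstimateMb(documentsMb) + " MB"); // high: 18 MB
    }
}
```

If the upper estimate exceeds the 16 MB default, pass a larger value to the -mx switch, leaving some headroom as in the -mx24m example.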
- Stops with an OutOfMemoryException
- The SearchEngine parses the first few hundred files, then displays a long list of error
messages, starting with OutOfMemoryException.
- The SearchEngine ran out of virtual memory.
- The SearchEngine requires virtual memory of about 1.5 to 2.0 times the total size of the
documents being parsed. If, say, you have 9 MB of documents, then you will require about
13.5 to 18 MB of virtual memory.
Start the Java interpreter with as much virtual memory
as needed using the -mx switch (the default is 16 MB):
java -mx24m ruptools.SearchEngine
- Not enough virtual memory.
- Possible solutions are:
- Split the files up into sub-groups, and create databases for each.
- Remove word groups using the -nb, -nl and -nh flags (in that order).
- Do both: provide a restricted global search together with complete sub-searches.
- Extend the word exclusion list (english.exclude.html is very generic).
- Stops with a 'Too many files for the search
applet database' message
- The SearchEngine parses many hundreds of files, then displays a 'Too many files for the
search applet database' message.
- The SearchEngine exceeded the maximum number of documents that the applet database can hold.
- The applet database can hold information on a maximum of 4096 HTML
documents.
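Before compiling a database, it can help to check roughly how many HTML documents are involved. The sketch below counts .htm and .html files under a directory and works out how many databases of at most 4096 documents would be needed. It rests on assumptions: the SearchEngine may select files differently (for example through its -xu filters), so the count is only an approximation, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.stream.Stream;

final class DocumentCountSketch {

    static final int MAX_DOCUMENTS = 4096; // limit stated in this FAQ

    // Assumption: a document is any file whose name ends in .htm or .html, in any case.
    static boolean looksLikeHtml(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".htm") || lower.endsWith(".html");
    }

    // Counts HTML files in the directory tree below root.
    static long countHtmlFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> looksLikeHtml(p.getFileName().toString()))
                        .count();
        }
    }

    // Number of databases needed if each holds at most MAX_DOCUMENTS documents.
    static long databasesNeeded(long documentCount) {
        return (documentCount + MAX_DOCUMENTS - 1) / MAX_DOCUMENTS;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long count = countHtmlFiles(root);
        System.out.println(count + " HTML documents, "
                + databasesNeeded(count) + " database(s) needed");
    }
}
```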
Questions about the Search applet
- Search button remains gray, or an error message
appears
- The applet starts up, but after a few seconds, the search button appears grayed out, or
an error message is displayed.
- The cause of this problem is that the applet failed to find or load the database.
- Check that the file path is correct.
- The applet looks in the path made up of the codebase plus the database
parameter value. Suppose the applet definition is:
<applet codebase="../classes" archive="Search.zip"
code="ruptools.Search.class" width=100 height=20>
<param name=database value="docsearch">
</applet>
and that the HTML file containing the applet is in the /search directory. The applet
will then look for the file in
/search/../classes/docsearch.ws
or, when reduced,
/classes/docsearch.ws
If this is not the correct location of the database file, then either copy the database to
that location, or change the database parameter value.
Remember that the database file must appear in the codebase path of the
applet, otherwise some browsers may refuse access to the file, causing the applet to fail.
- Check the file path for spelling.
- On some operating systems (Windows) file names are case insensitive, whilst on others
(Unix) they are case sensitive. Ensure that the codebase path and database
parameter path have the same case as the directories and filename. The database file
extension is .ws, in lower case.
- Check that the file path is within the codebase path.
- As when checking the file path, ensure that the reduced file path is in the codebase
directory or one of its child directories, otherwise some browsers may refuse access to the
file, causing the applet to fail.
- Check the database file.
- The database file may have become corrupt, or have been replaced. Recompile the
database, and copy the file, then try running the applet again in the browser or appletviewer.
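The path checks above can be rehearsed outside the browser. The following sketch resolves the codebase and the database parameter against the URL of the applet's HTML page and tests whether the result lies within the codebase. It is an illustration only: the host example.com, the class name and the method names are hypothetical, and the sketch assumes that the applet appends the .ws extension to the database parameter value.

```java
import java.net.URI;

final class DatabaseLocationSketch {

    // Resolves the codebase attribute against the page URL. A codebase is a
    // directory, so a trailing slash is added if it is missing.
    static URI resolveCodebase(URI pageUrl, String codebase) {
        String dir = codebase.endsWith("/") ? codebase : codebase + "/";
        return pageUrl.resolve(dir);
    }

    // Assumption: the database file is <codebase>/<database parameter>.ws
    static URI resolveDatabase(URI codebaseUrl, String databaseParam) {
        return codebaseUrl.resolve(databaseParam + ".ws");
    }

    // True if the database file is in the codebase directory or below it.
    static boolean isWithinCodebase(URI codebaseUrl, URI databaseUrl) {
        return databaseUrl.toString().startsWith(codebaseUrl.toString());
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/search/docsearch.htm");
        URI codebase = resolveCodebase(page, "../classes");
        URI database = resolveDatabase(codebase, "docsearch");
        System.out.println(database); // http://example.com/classes/docsearch.ws
        System.out.println(isWithinCodebase(codebase, database)); // true

        // A database parameter that climbs out of the codebase fails the check.
        URI outside = resolveDatabase(codebase, "../data/docsearch");
        System.out.println(isWithinCodebase(codebase, outside)); // false
    }
}
```

The comparison is case sensitive, so it also catches a codebase or database value whose case differs from the real directory and file names.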
- Clicking on a title causes the browser to issue
a 'document not found' error
- When the user double clicks on a found document title, instead of the browser opening
the document, it issues a 'document not found' error message.
- The path parameter is probably wrong or missing.
- The path parameter is used to correct the database document URL
with respect to the search applet HTML file URL.
If, for example, when compiling the database the root file is specified as:
-f /rational/application/search/doc/index.htm
and the root URL as:
-u http://www.ruptools.com/rup/rational/application/search/doc/index.htm
then the root file URL will be stored in the database as:
rational/application/search/doc/index.htm
which is the part of the path that is identical in both options:
-f rational/application/search/doc/index.htm
-u http://www.ruptools.com/rup/rational/application/search/doc/index.htm
If we now suppose the search applet HTML file to be at:
/rational/application/search/doc/docsearch.htm
for the local file, or
http://www.ruptools.com/rup/rational/application/search/doc/docsearch.htm
for the Internet URL, then we need to correct the document URL
references in the applet database file to move back four directories:
<param name=path value="../../../../">
Now, when the user clicks on a link, the browser constructs the URL as
follows:
/rational/application/search/doc/../../../../rational/application/search/doc/index.htm
for the local file, or
http://www.ruptools.com/rup/rational/application/search/doc/../../../../rational/application/search/doc/index.htm
for the Internet URL, which reduces to:
/rational/application/search/doc/index.htm
for the local file, or
http://www.ruptools.com/rup/rational/application/search/doc/index.htm
for the Internet URL.
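The reduction shown above can be verified with java.net.URI, which normalizes the ../ segments in the same way as a browser. This is a minimal sketch using the URLs from the example; the class and method names are hypothetical, and it assumes that the applet simply concatenates the path parameter and the stored document URL before resolving them against the applet page.

```java
import java.net.URI;

final class PathParameterSketch {

    // Resolves <path parameter> + <stored document URL> against the applet page URL.
    static URI resolveDocument(URI appletPage, String pathParam, String storedUrl) {
        return appletPage.resolve(pathParam + storedUrl);
    }

    public static void main(String[] args) {
        URI appletPage = URI.create(
                "http://www.ruptools.com/rup/rational/application/search/doc/docsearch.htm");
        String stored = "rational/application/search/doc/index.htm";

        // Four levels back, one for each directory in the stored URL:
        System.out.println(resolveDocument(appletPage, "../../../../", stored));
        // http://www.ruptools.com/rup/rational/application/search/doc/index.htm

        // One level short produces a URL that does not exist:
        System.out.println(resolveDocument(appletPage, "../../../", stored));
        // http://www.ruptools.com/rup/rational/rational/application/search/doc/index.htm
    }
}
```

A path value with too few or too many ../ segments gives a URL like the second one, which is what leads to the 'document not found' error.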