Overview of the SearchEngine

This chapter explains how the SearchEngine works, and describes the iterative process of generating, and later regenerating, the final applet word database.

The SearchEngine reads one or more HTML files, parses the words within the markup tags, and then parses all linked HTML files. Each word is checked for word removal and word reduction, and the resulting word list for the HTML file is stored internally. When all the linked HTML files have been parsed, the word database is constructed, together with the applet tag for the HTML applet search page.
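The flow described above can be sketched roughly as follows. This is only an illustrative Python model, not the SearchEngine's actual code: the names PageParser, build_word_lists, pages and removed_words are hypothetical, the pages live in an in-memory dictionary rather than on disk, and word reduction and database construction are left out.

```python
from collections import deque
from html.parser import HTMLParser
import re


class PageParser(HTMLParser):
    """Collects the words and the href links found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Remember every <A HREF="..."> target so it can be parsed later.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Text between tags: split into lower-case alphanumeric words.
        self.words.extend(w.lower() for w in re.findall(r"[A-Za-z0-9]+", data))


def build_word_lists(pages, root, removed_words=frozenset()):
    """Walk the links starting at `root`; return {filename: sorted word list}.

    `pages` is a hypothetical in-memory stand-in for HTML files on disk.
    """
    word_lists = {}
    queue = deque([root])
    while queue:
        name = queue.popleft()
        if name in word_lists or name not in pages:
            continue  # already parsed, or a missing link
        parser = PageParser()
        parser.feed(pages[name])
        parser.close()
        # Word removal only; word reduction is not modelled in this sketch.
        word_lists[name] = sorted(set(parser.words) - set(removed_words))
        queue.extend(parser.links)
    return word_lists


pages = {
    "TOC.html": '<html><body><a href="intro.html">Intro</a> Contents</body></html>',
    "intro.html": "<html><body>The search engine</body></html>",
}
print(build_word_lists(pages, "TOC.html", removed_words={"the"}))
```

Each page is parsed once, even when several pages link to it, and a link to a page that does not exist is simply skipped.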

Though in theory this could be achieved the first time 'round the buoy', in practice, it is usually an iterative process. When compiling a new database, the parser may signal HTML syntax errors, which you may want to correct. There may be some non-text files linked, which the SearchEngine should be told not to parse, or sections of linked HTML documents which should be excluded. Finally, there may be filenames, acronyms, or other words which you may not wish to have appear in the database.

To run the command line application and perform the steps above, type:

java ruptools.SearchEngine -r search.response -gw search

or

SearchEngine.exe -r search.response -gw search

The SearchEngine has a rather lengthy, but necessary, list of options:

-f filename    the root HTML filename (required)
-gw filename   generate Web applet files
-lu filename   list dependency URLs to filename
-lw filename   list words to filename
-nt            exclude <TITLE> tagged words from database
-nh            exclude <H1..H6><CAPTION> tagged words from database
-nl            exclude <DT><LI> tagged words from database
-nb            exclude <BODY> tagged words from database
-p filepath    intermediate data filepath
-r filename    execute response file
-s             suppress HTML syntax error reporting
-u url         the WWW URL equivalent of the root HTML document
-xn            exclude numbers from word list
-xu url        exclude URL from dependency list
-xwf filename  word exclusion HTML filename
-xwu url       exclude URL from word list

Options are separated by white space, so if a filename or URL contains a white space character, you must place that parameter in double quotes:

-lu "/html/Site dependency list"
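The effect of the quoting rule can be illustrated with Python's standard shlex module, which splits a string on white space while keeping double-quoted text together. This is an approximation for illustration only: the SearchEngine's own tokenizer may differ, and shlex treats a backslash as an escape character, so the example deliberately avoids Windows-style paths.

```python
import shlex

# A command line tail using only options documented above.
command_line = '-f TOC.html -lu "/html/Site dependency list" -xn'

# White space separates tokens; the quoted parameter stays in one piece.
tokens = shlex.split(command_line)
print(tokens)
```

Without the double quotes, the same parameter would be split into three separate tokens.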

The following options control how the dependency list is constructed:

-f filename    the root HTML filename (required)
-u url         the WWW URL equivalent of the root HTML document
-xu url        exclude URL from dependency list

The resulting dependency list can be output to a file using:

-lu filename   list dependency URLs to filename

The intermediate parsed data files are stored in the directory specified by:

-p filepath    intermediate data filepath

If this option is not specified, the current working directory is used.

These options are further explained in the chapter Building the dependency list.

The following options control how the word list is constructed:

-nt            exclude <TITLE> tagged words from database
-nh            exclude <H1..H6><CAPTION> tagged words from database
-nl            exclude <DT><LI> tagged words from database
-nb            exclude <BODY> tagged words from database
-xwf filename  word exclusion HTML filename
-xwu url       exclude URL from word list
-xn            exclude numbers from word list
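As a rough illustration of how the tag options and -xn might combine, the Python sketch below filters a list of (word, tag) pairs. It rests on several assumptions that this overview does not confirm: that each word is attributed to a single enclosing tag, that "numbers" means words made up only of digits, and that the helper names TAG_CLASSES and filter_words are hypothetical.

```python
# Tags affected by each option, taken from the option descriptions above.
TAG_CLASSES = {
    "-nt": {"title"},
    "-nh": {"h1", "h2", "h3", "h4", "h5", "h6", "caption"},
    "-nl": {"dt", "li"},
    "-nb": {"body"},
}


def filter_words(tagged_words, options):
    """Return the words that survive the given set of options."""
    excluded_tags = set()
    for option in options:
        excluded_tags |= TAG_CLASSES.get(option, set())

    kept = []
    for word, tag in tagged_words:
        if tag in excluded_tags:
            continue  # the whole tag class is excluded
        if "-xn" in options and word.isdigit():
            continue  # assumed meaning of "exclude numbers"
        kept.append(word)
    return kept


tagged = [
    ("overview", "title"),
    ("options", "h2"),
    ("1999", "body"),
    ("applet", "li"),
    ("database", "body"),
]
print(filter_words(tagged, {"-nh", "-xn"}))
```

With -nh and -xn, the heading word and the number are dropped, while title, list and body words remain.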

The resulting word list can be output to a file using:

-lw filename   list words to filename

These options are further explained in the chapter Eliminating words.

The following option creates the applet tag file and the search database:

-gw filename   generate Web applet files

This option is explained in the chapter Building the applet database.

Since the SearchEngine acts on a series of options, these options can be placed, for convenience, in one or more text files. In addition to reducing keystrokes, these files can also contain comments. The following is an extract from the response file used to build the database for this manual:

Response file for the SearchEngine manual
(where on the hard disk)
-f \www\rational\application\search\search\TOC.html

(where on the World Wide Web)
-u http://www.ruptools.com/rup/rational/application/search/search/TOC.html

Dependency exclusions:
(ignore any links to zip files, java files, and the link to my java page)
-xu *.zip
-xu */javapage.html
-xu *.java

Word count exclusions:
(ignore the search page, and table of contents)
-xwu */docsearch.html
-xwu */TOC.html

Standard word exclusion filters:
(ignore all numbers)
-xn

(standard english language exclusion list)
-xwf exclude.english.html

(specific exclusion list for the manual)
-xwf search.exclude.html
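The exclusion entries in the extract above use a * wildcard. This overview does not define how the SearchEngine interprets it, so the following Python sketch simply assumes shell-style matching, where * stands for any run of characters, and uses the standard fnmatch module. The function name is_excluded and the example.com URLs are hypothetical.

```python
from fnmatch import fnmatchcase

# The -xu patterns from the response file extract above.
EXCLUDE_PATTERNS = ["*.zip", "*/javapage.html", "*.java"]


def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """True if the URL matches any pattern (assumed shell-style wildcards)."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


for url in (
    "http://example.com/docs/tools.zip",
    "http://example.com/docs/javapage.html",
    "http://example.com/docs/index.html",
):
    print(url, is_excluded(url))
```

Under this assumption, the first two URLs would be left out of the dependency list and the third would be kept.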

The SearchEngine parses a response file, ignoring all lines which do not begin with a hyphen as the first non-white space character. Any valid SearchEngine option can appear in a response file; invalid or illegal options produce an error message.

The -r filename option itself can also appear in a response file, so that, for example, you can create standard dependency or word file exclusion filters, which can be used to generate multiple databases.

Each option and its associated parameters must appear on a single, separate line of the response file.
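The response file rules above (only lines starting with a hyphen count, one option per line, and -r may nest) can be modelled with a short Python sketch. It is not the SearchEngine's implementation: the names split_option and expand_response_file are hypothetical, the files are held in a dictionary instead of being read from disk, filters.response is an invented filename, and the check against a file including itself is an added assumption.

```python
import re


def split_option(line):
    """Split one line on white space, keeping double-quoted text together."""
    pairs = re.findall(r'"([^"]*)"|(\S+)', line)
    return [quoted or bare for quoted, bare in pairs]


def expand_response_file(name, files, _active=()):
    """Return every option (as a token list) in order, following nested -r."""
    if name in _active:
        raise ValueError(f"response file includes itself: {name}")
    options = []
    for line in files[name].splitlines():
        stripped = line.strip()
        if not stripped.startswith("-"):
            continue  # comment or blank line: ignored
        tokens = split_option(stripped)
        if tokens[0] == "-r" and len(tokens) > 1:
            # A nested response file is expanded in place.
            options.extend(
                expand_response_file(tokens[1], files, _active + (name,))
            )
        else:
            options.append(tokens)
    return options


files = {
    "search.response": (
        "Response file for the manual\n"
        "(where on the hard disk)\n"
        "-f TOC.html\n"
        "\n"
        "  -r filters.response\n"
        "-xn\n"
    ),
    "filters.response": (
        "Standard filters\n"
        "-xu *.zip\n"
        "-xwf exclude.english.html\n"
    ),
}
for option in expand_response_file("search.response", files):
    print(option)
```

Note that under these rules a comment line must not start with a hyphen, otherwise it would be read as an option.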

The SearchEngine can generate several output files, and also writes HTML syntax error messages to the standard output device.

Command line errors appear on the standard output. Most errors are due to missing or incorrect options, or option parameters.

Syntax errors appear on the standard output. The line and column of each syntax error are also provided. This is described further in The HTML parser; syntax errors.

The dependency list shows all document links, external document links, data links (such as images, and applets), and missing links. This is described in detail in the chapter Building the dependency list.

The word list shows all parsed words in the database, after word removal and word filtering have been carried out. This is described in detail in the chapter Eliminating words.

The applet HTML file contains the SearchEngine-generated <APPLET> tag, which you can then cut and paste into your HTML document. This is described in detail in the chapter Personalizing the applet.

Rational Unified Process 5.1 (build 43)
