Overview of the SearchEngine

This chapter explains the workings of the SearchEngine, and the iterative process of
generating, and later regenerating, the final applet word database.

The purpose of the SearchEngine

The SearchEngine reads one or more HTML files, parses the words within the
markup tags, and then parses all linked HTML files. Each word is checked for
word removal and word reduction, and the resulting word list for the HTML
file is stored internally. When all the linked HTML files have been parsed,
the word database is constructed, together with the applet tag for the HTML
applet search page.

Though in theory this could be achieved on the first pass, in practice it is
usually an iterative process. When compiling a new database, the parser may
signal HTML syntax errors, which you may want to correct. Some linked files may
not be text files, and the SearchEngine should be told not to parse them; there
may also be sections of linked HTML documents which should be excluded. Finally,
there may be filenames, acronyms, or other words which you do not want to appear
in the database.

The command line application performs the steps above when you type:

java ruptools.SearchEngine -r search.response -gw search
or
SearchEngine.exe -r search.response -gw search

The SearchEngine options

The SearchEngine has a rather lengthy, but necessary, list of options:

-f filename    the root HTML filename (required)
-gw filename   generate Web applet files
-lu filename   list dependency URLs to filename
-lw filename   list words to filename
-nt	       exclude <TITLE> tagged words from database
-nh	       exclude <H1..H6><CAPTION> tagged words from database
-nl	       exclude <DT><LI> tagged words from database
-nb	       exclude <BODY> tagged words from database
-p filepath    intermediate data filepath
-r filename    execute response file
-s	       suppress HTML syntax error reporting
-u url	       the WWW URL equivalent of the root HTML document
-xn	       exclude numbers from word list
-xu url	       exclude URL from dependency list
-xwf filename  word exclusion HTML filename
-xwu url       exclude URL from word list

Options are separated by white space, so if a filename or URL contains a white
space character, you must enclose that parameter in double quotes:

-lu "/html/Site dependency list"
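
As an illustration of the quoting rule above, the following is a minimal Java sketch of splitting an option line on white space while keeping a double-quoted parameter together. The class name OptionSplitter and its split method are hypothetical names introduced for this example. It is not the SearchEngine's own parser, and it assumes that quotes are never escaped inside a parameter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the ruptools package.
class OptionSplitter {

    /** Splits a line on white space, keeping double-quoted text as one token. */
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;   // true while between a pair of double quotes
        boolean hasToken = false;   // true once the current token has started

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // Quotes delimit the parameter but are not part of it.
                inQuotes = !inQuotes;
                hasToken = true;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                // Unquoted white space ends the current token, if any.
                if (hasToken) {
                    tokens.add(current.toString());
                    current.setLength(0);
                    hasToken = false;
                }
            } else {
                current.append(c);
                hasToken = true;
            }
        }
        if (hasToken) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the option and its parameter as two separate tokens.
        System.out.println(split("-lu \"/html/Site dependency list\""));
    }
}
```

Without the double quotes, the same line would split into four tokens, and the option would receive only /html/Site as its parameter.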

Dependency options

The following options control how the dependency list is constructed:

-f filename    the root HTML filename (required)
-u url	       the WWW URL equivalent of the root HTML document
-xu url	       exclude URL from dependency list
 The resulting dependency list can be output to a file using:  
-lu filename   list dependency URLs to filename
 The intermediate parsed data files are stored in the directory specified by:  
-p filepath    intermediate data filepath

If this argument is not specified, the current working directory is used.

These options are further explained in the chapter Building the dependency list.

Word elimination options

The following options control how the word list is constructed:

-nt	       exclude <TITLE> tagged words from database
-nh	       exclude <H1..H6><CAPTION> tagged words from database
-nl	       exclude <DT><LI> tagged words from database
-nb	       exclude <BODY> tagged words from database
-xwf filename  word exclusion HTML filename
-xwu url       exclude URL from word list
-xn	       exclude numbers from word list
 The resulting word list can be output to a file using:  
-lw filename   list words to filename

These options are further explained in the chapter Eliminating words.

Applet generation options

The following option creates the applet tag file and the search database:

-gw filename   generate Web applet files

This option is explained in the chapter Building the applet database.

Using response files

Since the SearchEngine acts on a series of options, these options can be placed,
for convenience, in one or more text files. In addition to reducing keystrokes,
these files can also contain comments. The following is an extract from the
response file used to build the database for this manual:

Response file for the SearchEngine manual
(where on the hard disk)
-f \www\rational\application\search\search\TOC.html
(where on the World Wide Web)
-u http://www.ruptools.com/rup/rational/application/search/search/TOC.html
Dependency exclusions:
(ignore any links to zip files, java files, and the link to my java page)
-xu *.zip
-xu */javapage.html
-xu *.java
Word count exclusions:
(ignore the search page, and table of contents)
-xwu */docsearch.html
-xwu */TOC.html
Standard word exclusion filters:
(ignore all numbers)
-xn
(standard english language exclusion list)
-xwf exclude.english.html
(specific exclusion list for the manual)
-xwf search.exclude.html
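
A file like the one above mixes option lines with free-text comments. As a rough sketch of how the two might be separated, the Java fragment below keeps only the lines whose first non-white-space character is a hyphen. The class name ResponseLines and its optionLines method are hypothetical names introduced for this example; the fragment does not reproduce the SearchEngine's actual implementation, and it does not follow nested -r options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the ruptools package.
class ResponseLines {

    /** Returns the option lines of a response file, dropping comment lines. */
    static List<String> optionLines(List<String> lines) {
        List<String> options = new ArrayList<>();
        for (String line : lines) {
            // Leading white space is allowed before the hyphen.
            String trimmed = line.trim();
            if (trimmed.startsWith("-")) {
                options.add(trimmed);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        List<String> extract = Arrays.asList(
            "Standard word exclusion filters:",
            "(ignore all numbers)",
            "-xn",
            "(standard english language exclusion list)",
            "  -xwf exclude.english.html");
        // Only the two lines starting with a hyphen are kept.
        System.out.println(optionLines(extract));
    }
}
```

Because anything that does not start with a hyphen is skipped, comments need no special marker. The parentheses in the extract above are only a readability convention.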

The SearchEngine parses a response file, ignoring all lines which do not begin
with a hyphen as the first non-white space character. Any valid SearchEngine
option can appear in a response file; invalid or illegal options produce an
error message.

The -r filename option itself can also appear in a response file, so that, for
example, you can create standard dependency or word file exclusion filters,
which can be used to generate multiple databases.

Each option and its associated parameters must appear on a single, separate line
of the response file.

Understanding the output

The SearchEngine can generate several output files, and it also writes HTML
syntax error messages to the standard output device.

Command line errors

Command line errors appear on the standard output. Most errors are due to
missing or incorrect options or option parameters.

HTML syntax errors

Syntax errors appear on the standard output. The line and column of the syntax
error are also provided. This is described further in The HTML parser; syntax
errors.

The dependency list

The dependency list shows all document links, external document links, data
links (such as images and applets), and missing links. This is described in
detail in the chapter Building the dependency list.

The word list

The word list shows all parsed words in the database, after word removal and
word filtering have been carried out. This is described in detail in the chapter
Eliminating words.

The applet HTML file

The applet HTML file contains the SearchEngine-generated <APPLET> tag, which can
then be cut and pasted into your HTML document. This is described in detail in
the chapter Personalizing the applet.