Eliminating words
This chapter discusses the filters available for eliminating all words in entire files,
eliminating useless words such as "and" or "the", reducing words such as
"www.javasoft.com", and removing words within specific HTML tags. Before
looking at the various methods of eliminating words, it is necessary to describe what the
compiler considers a 'word' to be. The word parser, incorporated into the compiler, parses
words according to two separate algorithms.
Numbers
- Any numeric value (0 to 9 or a valid ISO-Latin1 numeric value) followed by other numeric
values, or "." or "," is considered to be a
number. Trailing "." or "," characters are
ignored.
Words
- Any letter, followed by letters, numeric values, ".", "-",
or "_" is considered to be a word. Trailing ".",
"-", or "_" characters are ignored.
If you want a hyphenated word to be split into its components, use the &shy;
(&#173;) ampersand entity, also known as a soft hyphen, instead of the
hyphen character '-', such as profit&shy;margin.
Values such as "1.0" or "1,000" or even
Dewey decimal values such as "1.2.3" would all be considered to be
numbers. Note, however, that "1..6" would also be considered to be a
number.
The compiler provides the -xn option, which removes all numbers from the
word list.
Values such as "wasn't" would be considered to be two separate
words: "wasn" and "t". The apostrophe is not
tested by the word parser, as the parser would then be required to understand single-quoted
phrases. Since there are no syntactical rules in HTML for #PCDATA
(the text within tags), it would be impossible to tell when an apostrophe marks the start
or end of a single quoted phrase, and when it is, well, just an apostrophe. Some people
also prefer to use the "`" character to start a single quoted
phrase.
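The parsing rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not the compiler's own parser: the names parse_tokens, LETTER and TOKEN are hypothetical, "letter" is assumed to mean ASCII letters plus the ISO-Latin1 accented letters, and "numeric value" is assumed to mean the digits 0 to 9 only.

```python
import re

# Assumption: letters are ASCII letters plus ISO-Latin1 accented letters;
# numeric values are the digits 0-9 only.
LETTER = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"

# A word starts with a letter; a number starts with a digit.
TOKEN = re.compile(rf"[{LETTER}][{LETTER}0-9._-]*|[0-9][0-9.,]*")


def parse_tokens(text):
    """Return (kind, token) pairs, where kind is "word" or "number"."""
    tokens = []
    for match in TOKEN.finditer(text):
        token = match.group()
        if token[0].isdigit():
            # Trailing "." and "," are ignored for numbers.
            tokens.append(("number", token.rstrip(".,")))
        else:
            # Trailing ".", "-" and "_" are ignored for words.
            tokens.append(("word", token.rstrip(".-_")))
    return tokens


print(parse_tokens("wasn't"))   # [('word', 'wasn'), ('word', 't')]
print(parse_tokens("1..6"))     # [('number', '1..6')]
```

Because the apostrophe belongs to neither rule, it simply acts as a separator, which is why "wasn't" yields two words.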
Removing documents from the word list
- A table of contents (TOC) document is an ideal candidate for word removal. Although
needed to generate the dependency list, it would be unproductive for the TOC document
contents to appear in the word database, since the descriptors (words) in that document
invariably link the user to other pages.
In this case, all words within a document can be removed from the word list in the same
way as documents are removed from the dependency list, described below.
Removing a specific document from the word list
- To remove all words in a specific document from the word list, use the -xwu
option, and specify the document's URL path and filename components, for
example:
-xwu /www/rational/application/search/doc/TOC.html
Removing multiple documents from the word list
- To remove all words in multiple documents from the word list, use the -xwu
option, and a filter using the wildcard character '*'. For example:
-xwu */TOC.html
In this example, all words in all URLs ending with /TOC.html
will be excluded from the word list.
A more dangerous example of filtering is:
-xwu /www/extawt/*
In the above example, all words in all URLs beginning with /www/extawt/
will be excluded from the word list.
Finally, an even more dangerous example of filtering is:
-xwu */extawt/*
In this example, all words in all URLs containing /extawt/
will be excluded from the word list.
No other combinations of the wildcard character '*' are valid. A filter
definition of */extawt/*remove.* will result in a (probably useless) filter
that removes all words from URLs containing the literal text /extawt/*remove., and not
the probable intention of removing all words in all URLs containing both /extawt/
and remove.
The wildcard character '*' can appear at the start of the URL
and/or at the end of the URL; anywhere else it is treated as an ordinary
character.
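The wildcard rule can be illustrated with a small Python sketch. The function name url_matches is hypothetical, and treating a filter without any wildcard as an exact match is an assumption; the compiler's actual matching code is not shown in this chapter.

```python
def url_matches(url_filter, url):
    """Match a URL against a filter whose only wildcards are a leading and/or trailing '*'."""
    leading = url_filter.startswith("*")
    trailing = len(url_filter) > 1 and url_filter.endswith("*")
    # Any '*' left inside the core is an ordinary character.
    core = url_filter[1 if leading else 0:len(url_filter) - (1 if trailing else 0)]
    if leading and trailing:
        return core in url            # "contains"
    if leading:
        return url.endswith(core)     # "ends with"
    if trailing:
        return url.startswith(core)   # "begins with"
    return url == core                # assumption: no wildcard means exact match


print(url_matches("*/TOC.html", "/www/rational/application/search/doc/TOC.html"))  # True
print(url_matches("*/extawt/*remove.*", "/www/extawt/remove.html"))                # False
```

The second call shows the pitfall described above: the inner '*' is matched literally, so a URL that merely contains /extawt/ and remove. is not caught.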
Generating a word list
- Before individual words can be removed, you have to know which words appear in the search
database. The compiler provides the -lw filename option, which writes
all filtered words, in HTML document format, to the specified file.
The
following is an excerpt from the generated word list:
<dl>
<dt>absolute
<dt>accept
<dt>acceptable
<dt>access
<dt>according
<dt>accumulates
<dt>achieve
<dt>achieved
<dt>acronyms
<dt>add
<dt>added
<dt>addition
<dt>address
...
</dl>
Creating word filter documents
- Commonly used words, or useless words, can be removed from the database using word lists,
which are stored in an HTML document known as a word filter document. It uses the
same format as the parsed documents of the dependency list, so that HTML
ampersand entities (&) can be used to represent ISO-Latin1 characters
in ASCII files. The current list of valid ampersand entities is given in the appendix Ampersand entities.
Since the word filter document (see
below) and generated word list file are both in HTML format, you can use your
favorite text editor to cut and paste words to be removed from the word list to the word
filter document.
Eliminating a word
- A specific word can be eliminated by simply having the word appear in a word filter
document. This is a file in HTML format, which lists the specific words or
word filters to be used when removing words. It is a good idea to list them one per line,
for readability, and ease of editing. The following is an excerpt from the exclude.english.html
file:
<dl>
<dt>a
<dt>able
<dt>about
<dt>above
<dt>accomplish
<dt>accomplished
<dt>accomplishes
<dt>across
<dt>act
<dt>acts
<dt>actual
...
</dl>
Word filter documents are specified using the -xwf option, for example:
-xwf exclude.english.html
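For illustration only, the entries of a word filter document laid out like the excerpt above could be read with a short Python sketch. The name read_filter_entries is hypothetical, one entry per <dt> tag is assumed, and Python's standard entity table stands in for the compiler's own list of valid ampersand entities.

```python
import html
import re

# Assumption: each entry follows a <dt> tag and contains no whitespace.
DT_ENTRY = re.compile(r"<dt>\s*([^<\s]+)", re.IGNORECASE)


def read_filter_entries(document_text):
    """Return the <dt> entries of a word filter document, with entities decoded."""
    return [html.unescape(entry) for entry in DT_ENTRY.findall(document_text)]


sample = "<dl>\n<dt>a\n<dt>able\n<dt>about\n</dl>"
print(read_filter_entries(sample))  # ['a', 'able', 'about']
```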
Reducing words
- The compiler also provides simple, though potentially dangerous, word reduction
filters, which trim or reduce words. Generally, word reduction filters should be avoided,
since they can have unexpected side effects, similar to the filters used for eliminating URLs
from the dependency list or word list.
In addition, word reduction filters slow down
compilation, since each word parsed (there may be several thousand of them)
has to be checked against each filter until a filter is matched or all the filters have
been checked.
Word reduction filters have the same form as URL filters, except that,
instead of being declared on the command line, they are placed in a word filter document.
If a word matches a filter, that word is not eliminated, but reduced and put back into the
word list.
For example, after a first compilation, the word list might produce words (taken from
the text of links), such as:
ftp.javasoft.com
...
splash.javasoft.com
...
www.javasoft.com
In this case, say you are interested in keeping the javasoft part as a
word in the database, and discarding the rest. You can achieve this by creating the word
reduction filter (in your word filter document) as follows:
<dl>
<dt>*javasoft*
...
</dl>
You might think that such filters can be used for reducing plurals, or reducing
adjectives, but this is not the case. If you create word reduction filters such as:
<dl>
<dt>*s
<dt>*ing
...
</dl>
they will reduce, for example, cards to card and playing
to play, but will also reduce miss to mis and king
to k. Caveat emptor.
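The chapter does not spell out the exact reduction rule, so the following Python sketch assumes one that reproduces the examples given: a filter with a wildcard at both ends keeps only the literal text between them, while a filter with a single wildcard strips the literal text and keeps the rest. The name reduce_word and this rule are assumptions, not the compiler's documented behaviour.

```python
def reduce_word(word, word_filter):
    """Return the reduced word, or None if the filter does not match (assumed rule)."""
    leading = word_filter.startswith("*")
    trailing = len(word_filter) > 1 and word_filter.endswith("*")
    core = word_filter[1 if leading else 0:len(word_filter) - (1 if trailing else 0)]
    if not core:
        return None                        # nothing literal to match
    if leading and trailing:
        return core if core in word else None
    if leading:
        return word[:-len(core)] if word.endswith(core) else None
    if trailing:
        return word[len(core):] if word.startswith(core) else None
    return None                            # no wildcard: not a reduction filter


print(reduce_word("www.javasoft.com", "*javasoft*"))  # javasoft
print(reduce_word("cards", "*s"))                     # card
print(reduce_word("king", "*ing"))                    # k
```

The last call shows why suffix filters are risky: the filter cannot tell a verb ending from a word that merely happens to end in the same letters.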
Removing words in specific HTML tags
- The compiler can remove words found in specific tags. There are four such tag groups:
- -nt: exclude <TITLE> tagged words.
- -nh: exclude <H1..H6> and <CAPTION> tagged words.
- -nl: exclude <DT> and <LI> tagged words.
- -nb: exclude words not inside the above listed tags.
The order of filtering
- The compiler takes the parsed word list and filters it to produce the final word list in the
following order:
- All words are converted to lower case.
- If any of the -nb, -nh, -nl, or -nt
flags is set, all words corresponding to those HTML tags are removed from the list.
- If the -xn flag is set, all numbers are removed from the list.
- The resulting word list is tested against word reduction filters; matching words are removed,
reduced, and put back into the list.
- The resulting word list is tested against the exclusion word lists, and matching words
are removed.
This ordering allows for words which were reduced to then be removed.
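The order of the steps above can be sketched as a small Python pipeline. Everything named here is hypothetical: filter_words, the tag group labels ("title", "heading", "list", "body", standing in for -nt, -nh, -nl and -nb), and the reduce_word callable, which is assumed to return the word unchanged when no reduction filter matches.

```python
def filter_words(words, excluded_groups, drop_numbers, reduce_word, excluded_words):
    """Apply the five filtering steps, in order, to (text, tag_group) pairs."""
    result = []
    for text, group in words:
        word = text.lower()                          # 1. convert to lower case
        if group in excluded_groups:                 # 2. tag-based removal
            continue
        if drop_numbers and word[:1].isdigit():      # 3. number removal (-xn)
            continue
        word = reduce_word(word)                     # 4. word reduction filters
        if word in excluded_words:                   # 5. exclusion word lists
            continue
        result.append(word)
    return result


parsed = [("WWW.JavaSoft.com", "body"), ("Contents", "title"),
          ("1.0", "body"), ("The", "body"), ("Search", "body")]
final = filter_words(
    parsed,
    excluded_groups={"title"},
    drop_numbers=True,
    reduce_word=lambda w: "javasoft" if "javasoft" in w else w,
    excluded_words={"the", "javasoft"},
)
print(final)  # ['search']
```

Here "WWW.JavaSoft.com" is first reduced to "javasoft" and only then removed by the exclusion list, which is the effect the ordering is meant to allow.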