June 2000
~ Altavista ~
Let's have a look at the STRUCTURE of Altavista.
Searchers should reverse their tools, eh... :-)
(this is still in fieri, of course)
[Basic recall]
[Advanced insight]
[AV querystrings]
[Altavista forms]
- !! ALWAYS use the advanced search with boolean operators and result ranking options.
- !! Phrase searching - through heavy use of "" - is the sine qua non when cutting through the web.
- !! In Altavista (and almost nowhere else) the
rule is: case sensitive name searching. Lower case retrieves either lower
or upper case. Upper case kills all other case occurrences. Note that Altavista IS
accent and character sensitive.
- !! Boolean logic (AND, OR, AND NOT, NEAR) should be
used only in advanced search. Do not use it in Altavista's
simple search form (which is almost useless for serious searching), nor in
the advanced search 'sort' box. Always remember that the so-called simple search
defaults to OR (you can use it nevertheless for a 'quick and dirty' seek).
- !! Subsearching inside sort box
Result ranking through the sort box... you MUST understand this! You should use
the "Results Ranking Criteria" box whenever possible or you'll get a stupid
unsorted list. For instance, if you're looking for all "searching and seeking" related
pages but have a particular interest in filtering, type filtering
in the "Results Ranking Criteria" space in order to bring those useful listings to the top.
Incredibly useful addition!
- !! Field limiting: title: ~ url: ~ link: ~ host: ~ domain: ~ anchor: ~ text: ~ image: ~ applet:
You will realize how important this "field limiting" approach is only once you have tried it,
therefore just try it: "Probieren geht über Studieren" (trying beats studying).
- !! Truncation: * ~ ** ~ ?
Either you use truncation or you are truncated.
Write it down and remember it: for Altavista's queries the best method is ALWAYS advanced search with boolean logic (and field limiting) and then
heavy use of the sort box.
For any search engine, ergo for Altavista as well, it is critically important to assign location values
properly during indexing. Inside Altavista, the assigning of
locations is fully automatic in the simplest case... where a function called
avs_addword does all the work.
In this
case the words of the document are laid out end to end and are numbered sequentially starting
with the value returned by other functions (avs_newdoc or avs_startdoc).
The same is true for field boundaries
and for values (indexed quantities like dates that can be range-searched).
The following diagram shows how two very short documents would be stored inside Altavista's
index database.
location | document | word
1 | 1 | here
2 | 1 | you
3 | 1 | have
4 | 1 | a
5 | 1 | short
6 | 1 | page
7 | 2 | Thisnotwithstanding, thisnotwithstanding
8 | 2 | here
9 | 2 | you
10 | 2 | have
11 | 2 | another
12 | 2 | short
13 | 2 | page
As the figure illustrates, each word is actually stored as a word-location pair. The index also
contains information about the beginning and ending locations of each document. Document1
starts at location 1, and Document2 starts at location 7.
In Document2, the first word contains an uppercase letter, so the word is indexed twice: once
with case preserved and once in all lowercase. Both versions of the word are at the same
location, so that the word would be found appropriately regardless of whether a query is case
sensitive or case-insensitive.
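The double indexing described above can be sketched in a few lines of C. This is only a toy model, assuming a small fixed-size in-memory table: struct toy_index, toy_add_word and toy_find are hypothetical names made up for illustration and have nothing to do with the real avs_ calls or with Altavista's actual storage format.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_POSTINGS 64
#define MAX_WORD_LEN 32

/* One word-location pair, as in the diagram. */
struct posting {
    char word[MAX_WORD_LEN];
    unsigned long location;
};

/* Toy in-memory index (hypothetical, not the Altavista format). */
struct toy_index {
    struct posting postings[MAX_POSTINGS];
    size_t count;
};

static void store(struct toy_index *idx, const char *w, unsigned long loc) {
    strcpy(idx->postings[idx->count].word, w);
    idx->postings[idx->count].location = loc;
    idx->count++;
}

/* Index one word at `loc` with its case preserved. If it contains any
 * uppercase letter, also store an all-lowercase copy at the SAME location.
 * Returns 1 on success, 0 if the word is too long or the table is full. */
int toy_add_word(struct toy_index *idx, const char *w, unsigned long loc) {
    char lower[MAX_WORD_LEN];
    size_t i, len = strlen(w);
    int upper = 0;

    assert(idx != NULL);
    if (len >= MAX_WORD_LEN || idx->count + 2 > MAX_POSTINGS) return 0;
    for (i = 0; i <= len; i++) {            /* copies the final NUL too */
        if (isupper((unsigned char)w[i])) upper = 1;
        lower[i] = (char)tolower((unsigned char)w[i]);
    }
    store(idx, w, loc);
    if (upper) store(idx, lower, loc);
    return 1;
}

/* Return the first location of `query`, or 0 if it is not indexed.
 * A plain exact comparison is enough: a lowercase query hits the lowercase
 * copy (so it behaves case-insensitively), while a query with uppercase
 * letters only hits the case-preserved form. */
unsigned long toy_find(const struct toy_index *idx, const char *query) {
    size_t i;
    for (i = 0; i < idx->count; i++)
        if (strcmp(idx->postings[i].word, query) == 0)
            return idx->postings[i].location;
    return 0;
}
```

Adding "Thisnotwithstanding" at location 7 creates two entries, so both thisnotwithstanding and Thisnotwithstanding are found at 7, while THISNOTWITHSTANDING finds nothing: upper case kills all other case occurrences.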
The words are added sequentially. Every so many documents,
or when the last document of a linked bunch has been processed,
the actual update to the index is made, using avs_makestable.
The avs_newdoc procedure defines a block of
text as a document and establishes an identifier
with which the document can be found in the index. The avs_newdoc procedure also defines a
filter, which does the bulk of the work of preparing the document to be indexed. It is at the filter
stage where any necessary document type conversion takes place.
The filter function is called using the following arguments:
IN  avshdl_t idx              (index handle)
IN  void *pFname              (information sufficient for the filter to access the document contents)
IN  unsigned long startloc    (starting location for adding words)
OUT unsigned long *pNumWords  (number of words added to the index)
Once the filter is finished processing a block of text, it can pass the text (in
the form of a line, a paragraph, or even the entire document) to the avs_addword
procedure. The avs_addword
procedure parses the text into words and adds those words to the index. It interprets as a word
any sequence of letters or digits that is surrounded by spaces or other non-alphanumeric
characters. When it adds a word to the index, the avs_addword procedure preserves the case of
the word as it appears in the document. If the word contains any uppercase letters, the software
also indexes a lowercase version of the word, to support case-insensitive searching.
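The word-splitting rule just described (a word is any run of letters or digits, everything else is a separator) can be tried out with a minimal sketch. toy_tokenize, word_sink and toy_print_word are hypothetical names used only here; the real avs_addword signature is not shown in this page and is not reproduced.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Called once per word found; a stand-in for the real index update. */
typedef void (*word_sink)(const char *word, size_t len,
                          unsigned long location, void *ctx);

/* Example sink: print "location<TAB>word" for every word. */
void toy_print_word(const char *word, size_t len,
                    unsigned long location, void *ctx) {
    (void)ctx;
    printf("%lu\t%.*s\n", location, (int)len, word);
}

/* Split `text` into words (runs of letters or digits) and number them
 * sequentially starting at `startloc`. Returns the number of words seen,
 * so the next block of text can start at startloc + that count. */
unsigned long toy_tokenize(const char *text, unsigned long startloc,
                           word_sink sink, void *ctx) {
    unsigned long n = 0;
    const char *p = text;

    assert(text != NULL);
    while (*p) {
        const char *start;
        while (*p && !isalnum((unsigned char)*p)) p++;  /* skip separators */
        start = p;
        while (*p && isalnum((unsigned char)*p)) p++;   /* consume a word */
        if (p > start) {
            if (sink) sink(start, (size_t)(p - start), startloc + n, ctx);
            n++;
        }
    }
    return n;
}
```

Calling toy_tokenize("here you have a short page", 1, toy_print_word, NULL) numbers the six words 1 to 6, so a following document would start at location 7, as in the diagram above. Note that punctuation splits words: numega/sice becomes two words under this rule.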
That was it... re-read the snippet above and you will know more about search engines than
many self-proclaimed experts do.
Of course there is MUCH more to learn, though: knowledge is a
never ending downhill run on your sledge, eh.
In fact many more "menial" tasks are performed, for instance the following ones:
~ Set a date for the document.
~ Specify a data string to be returned as a search result.
~ Set a date and time for the document.
~ Identify certain words to be indexed as fields.
~ Add a single word exactly as entered to a document index.
~ Index the supplied date at the specified location.
~ Index the supplied value at the specified location.
~ Add a numeric value to a document index that can be used for custom ranking.
Ranking values are very important when retrieving results. For example, suppose
you want a value type called rlines to order search results by the
number of lines per document. You must supply the name (rlines) and the lowest and
highest possible values. The following code example defines the value type for
extended ranking of search results, in this case the number of lines per document.
error = avs_define_valtype("rlines", 0, 10000, NULL, &rlinesvaltype);
if (error != AVS_OK) {
    printf("avs_define_valtype returned %s\n", avs_errmsg(error));
    return 1;
}
Dates
When indexing documents, a date can be set for each one through the avs_setdocdate
or the avs_setdocdatetime procedures. Once the dates are in the index, it is possible to
use dates or date ranges to limit searches. The date is returned in the search results.
Altavista is capable of storing dates from 1 January 0100 through 31 December 2148.
Searchers can limit a query with a date range added as an extra Boolean term.
The format of the date range is [dd/mm/yyyy-dd/mm/yyyy]. If a searcher omits the beginning
date, the query will return everything in the index with a date before the end date. If a
searcher omits the end date, the query result will contain all documents with dates after
the beginning date. If a searcher wants only the documents indexed on one date, he should
use the same beginning and ending dates.
The end dates are part of the range, i.e. the range is inclusive.
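To see how such a date range behaves, here is a small parser sketch. It assumes exactly the [dd/mm/yyyy-dd/mm/yyyy] shape given above, with either side optionally left out; the names (struct date_range, parse_date_range, date_in_range) and the trick of comparing dates as yyyymmdd numbers are my own illustration, not Altavista's code.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Dates are kept as yyyymmdd numbers so that they compare correctly. */
struct date_range {
    long lo;   /* LONG_MIN when the beginning date is omitted */
    long hi;   /* LONG_MAX when the end date is omitted */
};

/* Parse dd/mm/yyyy at `s`; store chars consumed and the yyyymmdd value. */
static int parse_date(const char *s, int *consumed, long *out) {
    int d, m, y, n = 0;
    if (!isdigit((unsigned char)*s)) return 0;
    if (sscanf(s, "%2d/%2d/%4d%n", &d, &m, &y, &n) != 3) return 0;
    /* Year limits follow the storable range quoted in the text. */
    if (d < 1 || d > 31 || m < 1 || m > 12 || y < 100 || y > 2148) return 0;
    *out = (long)y * 10000L + (long)m * 100L + (long)d;
    *consumed = n;
    return 1;
}

/* Parse "[begin-end]", "[-end]" or "[begin-]". Returns 1 on success. */
int parse_date_range(const char *s, struct date_range *r) {
    int n = 0;

    assert(s != NULL && r != NULL);
    r->lo = LONG_MIN;
    r->hi = LONG_MAX;
    if (*s++ != '[') return 0;
    if (*s != '-') {                       /* beginning date present */
        if (!parse_date(s, &n, &r->lo)) return 0;
        s += n;
    }
    if (*s++ != '-') return 0;
    if (*s != ']') {                       /* end date present */
        if (!parse_date(s, &n, &r->hi)) return 0;
        s += n;
    }
    return *s == ']' && s[1] == '\0';
}

/* Both end dates belong to the range. */
int date_in_range(const struct date_range *r, long yyyymmdd) {
    return yyyymmdd >= r->lo && yyyymmdd <= r->hi;
}
```

For instance [01/06/2000-30/06/2000] accepts 30 June 2000 but not 1 July 2000, and a single-day search simply repeats the same date on both sides.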
There are various types of possible searches:
- simple
- advanced - using Boolean terms
- advanced - using the ranking mechanism
- advanced - using a combination of Boolean and ranking
The search engine ranks the results of a search based on a weight value assigned to each word
in the query, and a resulting overall relevance rating of each document that meets the search
criteria.
A document earns a relevance rating based on the number of words in the search query that it
contains, and the weight value of each of those words. The document containing the most
words with the highest weight value is considered most relevant. The closer the relevance
rating is to a value of 1, the more likely it is that a document meets the search criteria.
The weight of a word is determined by the number of occurrences of that word in the entire
index. A word that occurs less frequently in the index earns a higher weight, based on the
assumption that it is more precise and specific than a word that occurs frequently.
For example, the word "searching" might occur many times in an index, whereas the word
"combing" would probably occur less frequently. "combing" would be given a higher weight
than "searching" in a search query containing both words, because a document containing
only the word "combing" would be more likely to match the searcher's interest than a document
containing only the word "searching." A document containing both "combing" and
"searching" would earn the highest relevancy ranking.
The position of the word in the document, and the frequency of occurrence of the word
in a single document, have some bearing on the ranking of a document. The most significant
factor in determining ranking is the combined weight of words in the search query. Also, the
search engine considers only words without an operator preceding them when it does ranking.
If operators precede all words in the search query, the results are returned in no particular order.
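The text above does not give the actual formula, so the following is only a sketch of the idea under loud assumptions: the weight of a word is taken as the plain inverse of its occurrence count in the index, and the relevance of a document is the matched weight divided by the total weight of the query words (which conveniently lands between 0 and 1). Word position and in-document frequency are ignored. toy_weight and toy_relevance are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: weight = 1 / (occurrences of the word in the whole index).
 * The real weighting function is not documented here; this only keeps
 * the stated property that rarer words weigh more. */
double toy_weight(unsigned long occurrences) {
    assert(occurrences > 0);
    return 1.0 / (double)occurrences;
}

/* occurrences[i]: index-wide count of query word i.
 * present[i]:     non-zero if the document contains query word i.
 * Returns a rating in [0, 1]; 1 means every query word is present. */
double toy_relevance(const unsigned long *occurrences, const int *present,
                     size_t nwords) {
    double matched = 0.0, total = 0.0;
    size_t i;

    for (i = 0; i < nwords; i++) {
        double w = toy_weight(occurrences[i]);
        total += w;
        if (present[i]) matched += w;
    }
    return total > 0.0 ? matched / total : 0.0;
}
```

With "searching" occurring 1000 times and "combing" 10 times (invented figures), a page holding only "combing" rates far above a page holding only "searching", and a page with both gets the top rating, which is the ordering described above.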
Basic searches
As you know, to perform a basic search, a seeker uses the operators plus (+) and minus (-)
to indicate words or phrases
that are required or prohibited in the search results. For example, the following query
expression requests documents that must
contain the word hints and can also contain the
phrase how to search:
"how to search" +hints
Boolean
Boolean Query Syntax
For Boolean searches, use the logic operators AND, OR, NOT, NEAR, and WITHIN. For
example, the following query requests that either of the
words find or target appear in the same
document with either of the words search or seek.
(find OR target) AND (search OR seek)
The following query requests that both the words search and seek appear in a
document's title: field.
title:(search AND seek)
Rules for Query Processing
Both the ranking and Boolean search procedures follow the same basic rules for processing
queries:
- Like the indexer, the search engine interprets a word as any string of letters and digits
that is delineated by non-alphanumeric characters. Consequently, AltaVista Search
ignores punctuation except to interpret it as a separator for words.
- A group of two or more words enclosed in double quotes indicates a phrase. Phrasing
ensures that the search engine finds the words together, instead of looking for separate
instances of each word individually.
- An asterisk (*), double asterisk (**), or question mark (?) following three or more
characters indicates a wildcard; the search engine will find all words that match the
specified pattern.
- Case sensitivity of a search is based on the case of each word in the query. A word in
all lowercase letters results in a case-insensitive search, whereas if a word contains
any uppercase letters, the software searches for an exact-case match.
- Fetching procedures assign to each matching document a
score based on how well that document matches the set of ranking terms provided in
the search call. If no ranking terms were provided, the results are presented in the
same order as they were added to the index.
- If the string you are looking for contains
special characters (for example, the forward slash (/)), you can use curly braces ({}) in the
query string as in the following example: {numega/sice}. All characters between the matching curly
braces are treated as part of a word except the asterisk (*), which still works as a wildcard.
- Asterisk (*): after 3 specified characters, matches up to 5 trailing letters.
- Question Mark (?): after 3 specified characters, matches exactly one more character.
- Double Asterisk (**): more flexible, as it matches an unlimited number of trailing characters.
- You can also use the wildcards interchangeably and more than once in the same search
string, for example:
sear*gl?r*
This could possibly find the word searchinglores :-)
You can also determine whether to limit to 50 the number of words found by the wildcard
character search or allow all instances of the word stem in the index you are searching. In the
avs_parameter block of your bots, set the unlimited_wild_words flag to 1 to avoid the
50 word limit.
- Note that the "unlimited wild word searching" has limits:
The normal behavior of the wild card search expansion is that each wild-carded term
will match a maximum of 50 words. If there are more than 50 words that match, the 50
most frequent words in the index will be used. THEREFORE BEYOND 50 POSSIBLE WILDCARD COMBINATIONS
YOU WON'T GET ALL POSSIBLE WILDCARD MATCHES.
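The three wildcards listed above can be played with through a small recursive matcher. It is a sketch of my reading of those rules (? is exactly one character, * is zero to five characters, ** is any number), not the engine's own expansion code: toy_wild_match is a hypothetical name, the "three leading characters" minimum is not enforced, and the 50-word expansion cap is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STAR_MAX 5   /* a single '*' covers at most 5 characters */

/* Return 1 if `word` matches the wildcard pattern `pat`, else 0.
 *   ?   exactly one character
 *   *   zero up to STAR_MAX characters
 *   **  zero or more characters, no upper bound */
int toy_wild_match(const char *pat, const char *word) {
    assert(pat != NULL && word != NULL);

    if (*pat == '\0')
        return *word == '\0';
    if (*pat == '?')
        return *word != '\0' && toy_wild_match(pat + 1, word + 1);
    if (*pat == '*') {
        int unlimited = (pat[1] == '*');
        const char *rest = pat + (unlimited ? 2 : 1);
        size_t avail = strlen(word);
        size_t max = (unlimited || avail < STAR_MAX) ? avail : STAR_MAX;
        size_t k;

        /* Try every allowed length for the wildcard, shortest first. */
        for (k = 0; k <= max; k++)
            if (toy_wild_match(rest, word + k))
                return 1;
        return 0;
    }
    /* Ordinary character: must match literally. */
    return *pat == *word && toy_wild_match(pat + 1, word + 1);
}
```

Under these rules sear*gl?r* does match searchinglores, while sear* misses searchlores (seven trailing characters are too many for a single asterisk) and sear** catches it.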
Altavista provides support for Boolean searches, including AND, OR,
NOT and NEAR (proximity) searches. This -as you know- allows for phrase searching and proximity
searching to be performed on indexed documents.
Note that you can use the WITHIN ## command (where ## is the number of words)
to control how many words apart the words in your query string can be. For
example, if you want to find the word fravia within 5 words of searchlores, use the Boolean query
string:
"fravia WITHIN 5 searchlores"
This query will return documents where fravia and searchlores are not more than
5 words apart, instead of the default of 10 words apart.
Thus using NEAR in your search is the same as using
WITHIN 10.
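Since every word is stored with its location, a proximity test boils down to comparing two lists of locations. The sketch below assumes both lists are sorted in ascending order and come from the same document, and it reads "n words apart" as a difference in location of at most n; that reading, and the names toy_within and toy_near, are assumptions of mine.

```c
#include <assert.h>
#include <stddef.h>

#define NEAR_DISTANCE 10UL   /* NEAR behaves like WITHIN 10 */

/* a[0..na) and b[0..nb): ascending locations of two words in one document.
 * Returns 1 if some occurrence of one word lies within `n` locations of
 * some occurrence of the other. Walks both lists once (two pointers). */
int toy_within(const unsigned long *a, size_t na,
               const unsigned long *b, size_t nb, unsigned long n) {
    size_t i = 0, j = 0;

    assert((a != NULL || na == 0) && (b != NULL || nb == 0));
    while (i < na && j < nb) {
        unsigned long lo = a[i] < b[j] ? a[i] : b[j];
        unsigned long hi = a[i] < b[j] ? b[j] : a[i];
        if (hi - lo <= n)
            return 1;
        /* Advance the list whose current location is smaller. */
        if (a[i] < b[j]) i++; else j++;
    }
    return 0;
}

int toy_near(const unsigned long *a, size_t na,
             const unsigned long *b, size_t nb) {
    return toy_within(a, na, b, nb, NEAR_DISTANCE);
}
```

If fravia sits at locations 3 and 40 and searchlores at 12 and 46, WITHIN 5 fails (the closest pair is 6 apart) while NEAR succeeds.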
How the Public AltaVista Search Site Sets the Virtual Memory Attributes
The AltaVista Search site on the web has the following settings for its
virtual memory attributes:
vm-mapentries = 1000
vm-maxvas = 1337438953472
ubc-maxpercent = 70
The following are settings for processes:
max-per-proc-address-space = 137438953472
max-per-proc-data-size = 17179869184
per-proc-address-space = 137438953472
per-proc-data-size = 17179869184
max-proc-per-user = 256
max-threads-per-user = 2048
Typically these machines are larger than average: 8-processor, 6-8 GB.
Note that there are limits to the "Ranking word maximum frequency":
the ignore_thresh parameter is expressed in hundredths of a percent, for example
1000 = 10%. Any word that occurs in the index more frequently than this percentage
is not counted for ranking purposes (but the word is still counted for Boolean ranking purposes).
This is intended to be a performance optimization: if this value were set
lower than the default
(1000), ranked searches would run faster but the ranking would be less precise.
If the value were set
higher than the default, the ranked search would be slower, but the ranking would be more precise.
The range for this parameter is 1-1000.
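The "hundredths of a percent" unit is easy to get wrong, so here is the arithmetic as a tiny sketch. It assumes the percentage is measured as the share of documents containing the word, which the text above does not actually specify; toy_counts_for_ranking is a hypothetical helper, not an SDK call.

```c
#include <assert.h>

/* ignore_thresh is in hundredths of a percent: 1000 means 10.00%.
 * ASSUMPTION: "frequency" = documents containing the word / all documents.
 * Returns 1 if the word still counts for ranking, 0 if it is too common.
 * Integer cross-multiplication avoids any floating point rounding:
 *   docs_with_word / total_docs <= ignore_thresh / 10000 */
int toy_counts_for_ranking(unsigned long docs_with_word,
                           unsigned long total_docs,
                           unsigned int ignore_thresh) {
    assert(total_docs > 0);
    assert(ignore_thresh >= 1 && ignore_thresh <= 1000);
    return (unsigned long long)docs_with_word * 10000ULL
           <= (unsigned long long)ignore_thresh * total_docs;
}
```

In an index of 1000 documents with the default of 1000 (10%), a word found in 100 documents still counts and a word found in 101 does not; lowering the parameter to 500 (5%) drops the first word as well.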
Shoot an Altavista query and then look at the resulting URL you asked for:
- pg=aq: page is query (Altavista main search page)
- what=web: not Usenet, duh
- text=yes: no graphic frills, thanks
- stype=stext: search type = s-text
- kl=en: language is english
- sc=on: site compression is on
- q=AltaVista: query is AltaVista
- stq=20: start query at 20 (see results #21 to 29)
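If you want your bots to take such a querystring apart, a sketch like the following is enough. It assumes the usual key=value pairs joined by & (the page above lists the pairs but not the full URL) and does no URL-decoding; toy_query_value is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the value of `key` from a "k1=v1&k2=v2" query string into `out`
 * (at most outsz-1 characters, always NUL-terminated; longer values are
 * truncated). Returns 1 if the key was found, 0 otherwise.
 * No %XX or '+' decoding is attempted. */
int toy_query_value(const char *query, const char *key,
                    char *out, size_t outsz) {
    const char *p = query;
    size_t klen;

    assert(query != NULL && key != NULL && out != NULL && outsz > 0);
    klen = strlen(key);
    while (*p) {
        const char *amp = strchr(p, '&');
        size_t len = amp ? (size_t)(amp - p) : strlen(p);

        /* The pair must start with "key=" exactly, so that looking for
         * "q" does not match "stq". */
        if (len > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = len - klen - 1;
            if (vlen >= outsz) vlen = outsz - 1;
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        p += len;
        if (*p == '&') p++;
    }
    return 0;
}
```

Fed with pg=aq&what=web&text=yes&stype=stext&kl=en&sc=on&q=AltaVista&stq=20, asking for q gives AltaVista and asking for stq gives 20. Changing stq in steps is then an obvious way to walk through the result pages.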
This is only an appetizer :-) Serious seekers may enjoy having a look at a special classroom:
[c_fourth.htm]: Spelunking altavista's acronyms by Humphrey P., Gregor Samsa & Iefaf, June 2000,
part of the [classroo.htm] section: A fundamental 'search engines reversing' classroom.
ALTAVISTA ADVANCED SEARCH
Very quick! Text-only version, of course!
Simple search -
Graphic Version
ALTAVISTA SIMPLE SEARCH
Very quick! Text-only version, of course!
Simple search: no boolean! defaults to OR, use advanced instead!
Still quite in fieri, I'm afraid...
(c) 2000: [fravia+], all rights reserved