altavi1.htm: How to search the web, by fravia+

June 2000

~ Altavista ~
Let's have a look at the STRUCTURE of Altavista.
Searchers should reverse their tools, eh... :-)

(this is still in fieri, of course)

[Basic recall] [Advanced insight] [AV querystrings] [Altavista forms]

Basic recall

!! ALWAYS use the advanced search with boolean operators and result ranking options.
!! Phrase searching - through heavy use of "" - is the sine qua non when cutting through the web.
!! In Altavista (and almost nowhere else) the rule is: case sensitive name searching. Lower case retrieves either lower or upper case. Upper case kills all other case occurrences. Note that Altavista IS accent and character sensitive.
!! Boolean logic (AND, OR, AND NOT, NEAR) should be used only in advanced search: do not use it in Altavista's simple search form (which is almost useless for serious searching), nor in the advanced search 'sort' box. Always remember that the so-called simple search defaults to OR (you can nevertheless use it for a 'quick and dirty' seek).
!! Subsearching inside sort box
Result ranking through the sort box... you MUST understand this! You should use the "Results Ranking Criteria" box whenever possible or you'll get a stupid unsorted list. For instance, if you're looking for all "searching and seeking" related pages but have a particular interest in filtering, type filtering in the "Results Ranking Criteria" space in order to bring those useful listings to the top. Incredibly useful addition!
!! Field limiting: title: ~ url: ~ link: ~ host: ~ domain: ~ anchor: ~ text: ~ image: ~ applet:
You will realize how important this "field limiting" approach is only once you have tried it, therefore just try it: probieren geht über studieren (trying beats studying).
!! Truncation: * ~ ** ~ ?
Either you use truncation or you are truncated.

Write it down and remember it: for Altavista's queries the best method is ALWAYS advanced search with boolean logic (and field limiting) and then heavy use of the sort box.

Advanced insight

For any search engine, ergo for Altavista as well, it is critically important to assign location values properly during indexing. Inside Altavista, the assigning of locations is fully automatic in the simplest case... where a function called avs_addword does all the work.
In this case the words of the document are laid out end to end and are numbered sequentially starting with the value returned by other functions (avs_newdoc or avs_startdoc). The same is true for field boundaries and for values (indexed quantities like dates that can be range-searched). The following diagram shows how two very short documents would be stored inside altavista's index database.

          document 1                        document 2
word      here  you  have  a  short  page   Thisnotwithstanding  here  you  have  another  short  page
                                            thisnotwithstanding
location  1     2    3     4  5      6      7                    8     9    10    11       12     13

As the figure illustrates, each word is actually stored as a word-location pair. The index also contains information about the beginning and ending locations of each document. Document1 starts at location 1, and Document2 starts at location 7. In Document2, the first word contains an uppercase letter, so the word is indexed twice: once with case preserved and once in all lowercase. Both versions of the word are at the same location, so that the word would be found appropriately regardless of whether a query is case sensitive or case-insensitive.
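
Since the index keeps the starting location of each document, a location can be mapped back to its document. Here is a minimal sketch, assuming the start locations sit in a sorted array (a hypothetical layout: the real index format is not described here, and doc_for_location is an invented name).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: starts[i] is the first location of document i,
 * in ascending order. This is NOT AltaVista's real data structure. */

/* Returns the index of the last document that starts at or before `loc`,
 * or -1 if `loc` lies before the first document. End locations are not
 * modelled, so a location past the last word maps to the last document. */
long doc_for_location(const unsigned long *starts, size_t ndocs,
                      unsigned long loc)
{
    assert(starts != NULL || ndocs == 0);
    size_t lo = 0, hi = ndocs;              /* search window [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] <= loc)
            lo = mid + 1;                   /* document mid starts at or before loc */
        else
            hi = mid;
    }
    return (long)lo - 1;
}
```

With the two documents of the diagram (starts 1 and 7), location 6 falls in the first document and location 7 in the second.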

The words are added sequentially. Every so many documents, or when the last document of a linked bunch has been processed, the actual update to the index is made, using avs_makestable.

The avs_newdoc procedure defines a block of text as a document and establishes an identifier with which the document can be found in the index. The avs_newdoc procedure also defines a filter, which does the bulk of the work of preparing the document to be indexed. It is at the filter stage where any necessary document type conversion takes place. The filter function is called using the following arguments:

IN avshdl_t idx (index handle)
IN void *pFname (information sufficient for the filter to access
                 the document contents)
IN unsigned long startloc (starting location for adding words)
OUT unsigned long *pNumWords (number of words added to the index)

Once the filter is finished processing a block of text, it can pass the text (in the form of a line, a paragraph, or even the entire document), to the avs_addword procedure. The avs_addword procedure parses the text into words and adds those words to the index. It interprets as a word any sequence of letters or digits that is surrounded by spaces or other non-alphanumeric characters. When it adds a word to the index, the avs_addword procedure preserves the case of the word as it appears in the document. If the word contains any uppercase letters, the software also indexes a lowercase version of the word, to support case-insensitive searching.
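
The parsing rule just described can be sketched in a few lines of C. This is only an illustration of the rule, not the real avs_addword (whose signature is not given here): sketch_addword, posting_t and MAX_WORD are hypothetical names, and the postings simply land in a caller-supplied array.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_WORD 64                     /* hypothetical limit for this sketch */

typedef struct {
    char word[MAX_WORD];
    unsigned long location;
} posting_t;                            /* one word-location pair */

/* Store one pair; returns the new count (unchanged if the array is full). */
static size_t emit(posting_t *out, size_t n, size_t max_out,
                   const char *w, size_t len, unsigned long loc, int lower)
{
    if (n >= max_out)
        return n;
    if (len >= MAX_WORD)
        len = MAX_WORD - 1;             /* truncate over-long words */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)w[i];
        out[n].word[i] = (char)(lower ? tolower(c) : c);
    }
    out[n].word[len] = '\0';
    out[n].location = loc;
    return n + 1;
}

/* Splits `text` into runs of letters and digits, numbers them from
 * `startloc`, and stores each word with its case preserved. A word with
 * any uppercase letter is stored a second time, lowercased, at the SAME
 * location. Returns the number of postings written. */
size_t sketch_addword(const char *text, unsigned long startloc,
                      posting_t *out, size_t max_out,
                      unsigned long *num_words)
{
    assert(text != NULL && out != NULL && num_words != NULL);
    size_t n = 0;
    unsigned long loc = startloc;
    const char *p = text;
    while (*p) {
        while (*p && !isalnum((unsigned char)*p))
            p++;                        /* skip separators */
        const char *start = p;
        int has_upper = 0;
        while (isalnum((unsigned char)*p)) {
            if (isupper((unsigned char)*p))
                has_upper = 1;
            p++;
        }
        size_t len = (size_t)(p - start);
        if (len == 0)
            break;                      /* only separators were left */
        n = emit(out, n, max_out, start, len, loc, 0);
        if (has_upper)
            n = emit(out, n, max_out, start, len, loc, 1);
        loc++;
    }
    *num_words = loc - startloc;
    return n;
}
```

Feeding it "This page" with a starting location of 7 gives three postings for two words: This and this both at 7, page at 8.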
That was it... re-read the snippet above and you will know more about search engines than many self-proclaimed experts do.
Of course there is MUCH more to learn, though: knowledge is a never ending downhill run on your sledge, eh.

In fact many more "menial" tasks are performed, for instance the following ones: Set a date for the document ~ Specify a data string to be returned as a search result ~ Set a date and time for the document ~ Identify certain words to be indexed as fields ~ Add a single word exactly as entered to a document index ~ Index the supplied date at the specified location ~ Index the supplied value at the specified location ~ Add a numeric value to a document index that can be used for custom ranking.

Ranking values are very important when retrieving results. For example, suppose you want a value type called rlines to order search results by the number of lines per document. You must supply the name (rlines) and the lowest and highest possible values. The following code example defines the value type for extended ranking of search results, in this case the number of lines per document.

/* Define "rlines" as a ranking value type with values from 0 to 10000. */
error = avs_define_valtype("rlines", 0, 10000, NULL, &rlinesvaltype);
if (error != AVS_OK) {
    printf("avs_define_valtype returned %s\n", avs_errmsg(error));
    return 1;
}

Dates
When indexing documents, a date can be set for each one through the avs_setdocdate or the avs_setdocdatetime procedures. Once the dates are in the index, it is possible to use dates or date ranges to limit searches. The date is returned in the search results.
Altavista is capable of storing dates from 01/01/0100 through 12/31/2148.
Searchers can limit a query with a date range added as an extra Boolean term. The format of the date range is [dd/mm/yyyy-dd/mm/yyyy]. If a searcher omits the beginning date, the query will return everything in the index with a date before the end date. If a searcher omits the end date, the query result will contain all documents with dates after the beginning date. A searcher who wants only the documents indexed on one date should use the same beginning and ending dates. Both end points are included in the range.
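
A small sketch of how such a date range term could be read and applied. The names (parse_date_range, date_in_range) and the yyyymmdd integer encoding are assumptions made for this example, not anything taken from Altavista; an omitted end simply becomes an open bound.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Encode a date as yyyymmdd so that integer comparison orders dates. */
static long date_key(int d, int m, int y)
{
    return (long)y * 10000L + m * 100L + d;
}

/* Parse one dd/mm/yyyy date at *pp; on success advance *pp and return 1. */
static int parse_date(const char **pp, long *key)
{
    int d, m, y, used = 0;
    if (sscanf(*pp, "%2d/%2d/%4d%n", &d, &m, &y, &used) != 3)
        return 0;
    if (d < 1 || d > 31 || m < 1 || m > 12 || y < 0)
        return 0;                       /* coarse sanity check only */
    *key = date_key(d, m, y);
    *pp += used;
    return 1;
}

/* Parse "[dd/mm/yyyy-dd/mm/yyyy]"; either date may be omitted.
 * Returns 1 on success and fills *lo and *hi with inclusive bounds. */
int parse_date_range(const char *s, long *lo, long *hi)
{
    assert(s != NULL && lo != NULL && hi != NULL);
    *lo = 0;                /* omitted start: everything up to the end date */
    *hi = LONG_MAX;         /* omitted end: everything from the start date on */
    if (*s++ != '[')
        return 0;
    if (*s != '-' && !parse_date(&s, lo))
        return 0;
    if (*s++ != '-')
        return 0;
    if (*s != ']' && !parse_date(&s, hi))
        return 0;
    return *s == ']';
}

/* Both end points belong to the range. */
int date_in_range(long key, long lo, long hi)
{
    return key >= lo && key <= hi;
}
```

For instance [01/06/2000-30/06/2000] accepts a document dated 30 June 2000 (the end date is inside the range) and rejects one dated 1 July 2000.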

There are various types of possible searches:

simple
advanced - using Boolean terms
advanced - using the ranking mechanism
advanced - using a combination of Boolean and ranking

The search engine ranks the results of a search based on a weight value assigned to each word in the query, and a resulting overall relevance rating of each document that meets the search criteria.
A document earns a relevance rating based on the number of words in the search query that it contains, and the weight value of each of those words. The document containing the most words with the highest weight value is considered most relevant. The closer the relevance rating is to a value of 1, the more likely it is that a document meets the search criteria.
The weight of a word is determined by the number of occurrences of that word in the entire index. A word that occurs less frequently in the index earns a higher weight, based on the assumption that it is more precise and specific than a word that occurs frequently. For example, the word "searching" might occur many times in an index, whereas the word "combing" would probably occur less frequently. "combing" would be given a higher weight than "searching" in a search query containing both words, because a document containing only the word "combing" would be more likely to match the searcher's interest than a document containing only the word "searching." A document containing both "combing" and "searching" would earn the highest relevancy ranking.
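
To make the idea tangible, here is a toy scoring sketch. The actual formula is not given above, so this one is a loud assumption: a word's weight is taken as the inverse of its occurrence count in the index, and a document's relevance as the matched weight divided by the total weight of the query words, which keeps the score between 0 and 1. The names word_weight and relevance are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMED weighting: the rarer the word in the index, the higher the weight. */
double word_weight(unsigned long index_count)
{
    return index_count == 0 ? 0.0 : 1.0 / (double)index_count;
}

/* counts[i]  : occurrences of query word i in the whole index
 * present[i] : non-zero if the document contains query word i
 * Returns matched weight / total weight, a score between 0 and 1. */
double relevance(const unsigned long *counts, const int *present,
                 size_t nwords)
{
    assert(counts != NULL && present != NULL);
    double total = 0.0, matched = 0.0;
    for (size_t i = 0; i < nwords; i++) {
        double w = word_weight(counts[i]);
        total += w;
        if (present[i])
            matched += w;
    }
    return total > 0.0 ? matched / total : 0.0;
}
```

With "searching" occurring 1000 times and "combing" 10 times (invented counts), a page holding only "combing" outranks a page holding only "searching", and a page holding both scores 1.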

The position of the word in the document, and the frequency of occurrence of the word in a single document, have some bearing on the ranking of a document. The most significant factor in determining ranking is the combined weight of words in the search query. Also, the search engine considers only words without an operator preceding them when it does ranking. If operators precede all words in the search query, the results are returned in no particular order.

Basic searches
As you know, to perform a basic search, a seeker uses the operators plus (+) and minus (-) to indicate words or phrases that are required or prohibited in the search results. For example, the following query expression requests documents that must contain the word hints and can also contain the phrase how to search:
"how to search" +hints

Boolean
Boolean query syntax: for Boolean searches, use the logic operators AND, OR, NOT, NEAR, and WITHIN. For example, the following query requests that either of the words find or target appear in the same document with either of the words search or seek.
(find OR target) AND (search OR seek)
The following query requests that both the words search and seek appear in a document's title: field.
title:(search AND seek)

Rules for Query Processing
Both the ranking and Boolean search procedures follow the same basic rules for processing queries:

Like the indexer, the search engine interprets a word as any string of letters and digits that is delineated by non-alphanumeric characters. Consequently, AltaVista Search ignores punctuation except to interpret it as a separator for words.
A group of two or more words enclosed in double quotes indicates a phrase. Phrasing ensures that the search engine finds the words together, instead of looking for separate instances of each word individually.
An asterisk (*), double asterisk (**), or question mark (?) following three or more characters indicates a wildcard; the search engine will find all words that match the specified pattern.
Case sensitivity of a search is based on the case of each word in the query. A word in all lowercase letters results in a case-insensitive search, whereas if a word contains any uppercase letters, the software searches for an exact-case match.
Fetching procedures assign to each matching document a score based on how well that document matches the set of ranking terms provided in the search call. If no ranking terms were provided, the results are presented in the same order as they were added to the index.
If the string you are looking for contains special characters (for example, the forward slash (/)), you can use curly braces ({}) in the query string as in the following example: {numega/sice}. All characters between the matching curly braces are treated as part of a word except the asterisk (*) which still works as a wildcard.
Asterisk (*): After 3 specified characters will search for matches in up to 5 trailing letters.
Question Mark (?): After 3 specified characters will match exactly one more character.
Double Asterisk (**): More flexible, as it will search for matches for an unlimited number of trailing characters.
You can also use the wildcards interchangeably and more than once in the same search string, for example:
sear*gl?r*
This could possibly find the word searchinglores :-)
You can also determine whether to limit to 50 the number of words found by the wildcard character search or allow all instances of the word stem in the index you are searching. In the avs_parameter block of your bots, set the unlimited_wild_words flag to 1 to avoid the 50 word limit.
Note that the "unlimited wild word searching" has limits: The normal behavior of the wild card search expansion is that each wild-carded term will match a maximum of 50 words. If there are more than 50 words that match, the 50 most frequent words in the index will be used. THEREFORE BEYOND 50 POSSIBLE WILDCARD-COMBINATIONS YOU WON'T GET ALL POSSIBLE WILDCARD MATCHES.
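
The case rule and the three wildcards described above can be put together in one small matcher. This is a sketch of the rules as stated, not Altavista's code: word_matches and its helpers are hypothetical names, the 50-word expansion limit is not modelled, and curly braces are ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Compare two characters, exactly or ignoring case. */
static int same_char(char p, char w, int exact)
{
    if (exact)
        return p == w;
    return tolower((unsigned char)p) == tolower((unsigned char)w);
}

/* Recursive matcher: '?' is exactly one character, '*' is up to 5
 * characters, '**' is any number of characters. */
static int match_here(const char *p, const char *w, int exact)
{
    if (*p == '\0')
        return *w == '\0';
    if (*p == '*') {
        int unlimited = (p[1] == '*');
        const char *rest = p + (unlimited ? 2 : 1);
        /* let the wildcard swallow 0, 1, 2, ... characters of the word */
        for (size_t k = 0; unlimited || k <= 5; k++) {
            if (match_here(rest, w + k, exact))
                return 1;
            if (w[k] == '\0')
                break;                  /* end of word reached */
        }
        return 0;
    }
    if (*w == '\0')
        return 0;
    if (*p == '?' || same_char(*p, *w, exact))
        return match_here(p + 1, w + 1, exact);
    return 0;
}

/* Returns 1 if `word` matches `pattern`. A pattern containing any
 * uppercase letter must match case exactly; an all-lowercase pattern
 * matches regardless of case. A wildcard needs 3+ leading characters. */
int word_matches(const char *pattern, const char *word)
{
    assert(pattern != NULL && word != NULL);
    size_t prefix = strcspn(pattern, "*?");   /* chars before first wildcard */
    if (pattern[prefix] != '\0' && prefix < 3)
        return 0;
    int exact = 0;
    for (const char *p = pattern; *p; p++)
        if (isupper((unsigned char)*p))
            exact = 1;
    return match_here(pattern, word, exact);
}
```

So sear*gl?r* does catch searchinglores, sear* misses searchlores (seven trailing letters are too many for a single asterisk) while sear** finds it, and altavista finds AltaVista but not the other way round.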

Altavista provides support for Boolean searches, including AND, OR, NOT and NEAR (proximity) searches. This -as you know- allows for phrase searching and proximity searching to be performed on indexed documents.

Note that you can use the WITHIN ## command (where ## is a number of words) to control how far apart the words in your query string can be. For example, if you want to find the word fravia within 5 words of searchlores, use the Boolean query string:
fravia WITHIN 5 searchlores
This query will bring a result for fravia and searchlores when they are not more than 5 words apart, instead of the default of 10 words apart.
Thus using NEAR in your search is the same as using WITHIN 10.
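
Since every word is stored with its location, a proximity test boils down to comparing locations. A minimal sketch, assuming each word's locations inside one document are available as a sorted array, and assuming "n words apart" means a location difference of at most n (within and near_default are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

/* a[] and b[] hold the ascending locations of two words inside the SAME
 * document. Returns 1 if some pair of locations differs by at most n. */
int within(const unsigned long *a, size_t na,
           const unsigned long *b, size_t nb, unsigned long n)
{
    assert((a != NULL || na == 0) && (b != NULL || nb == 0));
    size_t i = 0, j = 0;
    while (i < na && j < nb) {
        unsigned long lo = a[i] < b[j] ? a[i] : b[j];
        unsigned long hi = a[i] < b[j] ? b[j] : a[i];
        if (hi - lo <= n)
            return 1;
        if (a[i] < b[j])
            i++;                        /* advance the smaller location */
        else
            j++;
    }
    return 0;
}

/* NEAR behaves like WITHIN 10. */
int near_default(const unsigned long *a, size_t na,
                 const unsigned long *b, size_t nb)
{
    return within(a, na, b, nb, 10);
}
```

With one word at locations 3 and 40 and the other at 12 and 47, WITHIN 5 fails (the closest pair is 7 apart) while NEAR succeeds.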

How the Public AltaVista Search Site Sets the Virtual Memory Attributes
The AltaVista Search site on the web has the following setting for its virtual memory attributes:
vm-mapentries = 1000
vm-maxvas = 1337438953472
ubc-maxpercent = 70

The following are settings for processes:
max-per-proc-address-space = 137438953472
max-per-proc-data-size = 17179869184
per-proc-address-space = 137438953472
per-proc-data-size = 17179869184
max-proc-per-user = 256
max-threads-per-user = 2048

Typically these machines are larger than average: 8-processor, 6-8 GB.

Note that there are limits to the "Ranking word maximum frequency": the ignore_thresh parameter is expressed in hundredths of a percent (for example, 1000 = 10%). Any word that occurs in the index more frequently than this percentage is not counted for ranking purposes (but the word is still counted for Boolean ranking purposes).
This is intended to be a performance optimization: if this value is set lower than the default (1000), ranked searches run faster but the ranking is less precise. If the value is set higher than the default, the ranked search is slower, but the ranking is more precise. The range for this parameter is 1-1000.
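
The arithmetic behind ignore_thresh, as a tiny sketch. What the percentage is measured against is not spelled out above; this example assumes it is the share of indexed documents containing the word. The function name counts_for_ranking is hypothetical.

```c
#include <assert.h>

/* ignore_thresh is in hundredths of a percent: 1000 means 10%.
 * ASSUMPTION: frequency = documents containing the word / all documents.
 * Returns 1 if the word still counts for ranking, 0 if it is ignored. */
int counts_for_ranking(unsigned long long docs_with_word,
                       unsigned long long total_docs,
                       unsigned int ignore_thresh)
{
    assert(total_docs > 0 && docs_with_word <= total_docs);
    /* docs_with_word / total_docs > ignore_thresh / 10000, cross-multiplied
     * to stay in integer arithmetic */
    return !(docs_with_word * 10000ULL >
             (unsigned long long)ignore_thresh * total_docs);
}
```

At the default of 1000, a word found in 101 documents out of 1000 is dropped from ranking, while a word found in exactly 100 still counts.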

AV querystrings

Shoot an Altavista query and then look at the resulting URL you asked for:

pg=aq         page is query (Altavista main search page)
what=web      not Usenet, duh
text=yes      no graphic frills, thanks
stype=stext   search type = s-text
kl=en         language is english
sc=on         site compression is on
q=AltaVista   query is AltaVista
stq=20        start query at 20 (see results #21 to 29)
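
If you want your bots to pick such a querystring apart, a few lines of C suffice. A minimal sketch: query_param is a hypothetical helper, it expects only the part after the question mark with pairs joined by & (standard URL syntax), and it does no %-decoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Looks up `key` in a string like "pg=aq&what=web" and copies its value
 * into `value` (truncated to fit `size`). Returns 1 if found, else 0. */
int query_param(const char *query, const char *key,
                char *value, size_t size)
{
    assert(query != NULL && key != NULL && value != NULL && size > 0);
    size_t klen = strlen(key);
    const char *p = query;
    while (*p) {
        const char *end = strchr(p, '&');
        if (end == NULL)
            end = p + strlen(p);        /* last pair runs to the end */
        /* a pair matches when it starts with "key=" */
        if ((size_t)(end - p) > klen &&
            strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t vlen = (size_t)(end - p) - klen - 1;
            if (vlen >= size)
                vlen = size - 1;
            memcpy(value, p + klen + 1, vlen);
            value[vlen] = '\0';
            return 1;
        }
        p = (*end == '&') ? end + 1 : end;
    }
    return 0;
}
```

Called on pg=aq&what=web&text=yes&stype=stext&kl=en&sc=on&q=AltaVista&stq=20, asking for q gives AltaVista and asking for stq gives 20; the lookup compares whole parameter names, so q is never confused with stq.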

This is only an appetizer :-) Serious seekers may enjoy having a look at a special classroom:
[c_fourth.htm]: Spelunking altavista's acronyms by Humphrey P., Gregor Samsa & Iefaf, June 2000, part of the [classroo.htm] section: A fundamental 'search engines reversing' classroom.

Altavista forms

ALTAVISTA ADVANCED SEARCH
Very quick! Text-only version, of course!
(form fields: Boolean query ~ Sort by ~ Language ~ Show one result per Web site ~ From / To dates, e.g. 31/12/99)

ALTAVISTA SIMPLE SEARCH
Very quick! Text-only version, of course!
Simple search: no boolean! defaults to OR, use advanced instead!
(form fields: Ask AltaVista a question, or enter a few words ~ search ~ refine)

Still quite in fieri, I'm afraid...

(c) 2000: [fravia+], all rights reserved