
This is part one of the 're-ranking' trilogy.
two:[synecdoc.htm]: The synecdochical searching method
three:[epanalep.htm]: The epanaleptical approach

And a discussion about search engines' depth
by fravia+
first published @ searchlores in December 2001
Updated in January 2002

Tackling the 'down yonder' problem (general)

What I like to call the 'down yonder' problem is well known to searchers. Being easily spammed, the main search engines suffer a terrible drawback: in fact, thanks to their huge databases, some interesting results may indeed be listed somewhere along their huge, never-ending lists, yet -alas!- huge amounts of commercial and/or bogus 'spamcram' sites almost always pop up in the first positions... while the juicy targets you are looking for lie "buried" somewhere inside those huge lists of crammy results, somewhere 'down yonder'.

To tackle this problem I exploit an (IMO interesting) approach that can be used for most main search engines. After all they are always more or less subject to spam, the most infamous one being Altavista... "Hic alta, hic salta" is a well-known proverb among seekers, meaning that you should jump straight towards a 'lower' (deeper) page of results when searching with Altavista, since pages that are listed in the first positions are - almost without exception - irrelevant, or even abominable paid scum.
This -fairly old- observation is the base for the following seeking approach, valid for MANY among the 'main' search engines... by all means not only for the easily spammed Alta.

Since terminology is always crucially important ("Verba movent, exempla trahunt"), I decided to name this small snippet of 'evaluation lore' the "Yo-yo" approach, or the "Yo-yo technique".

Tackling the 'down yonder' problem (details)

Do you at times search the web using the [main] search engines? Altavista, Google, Fast, Whatever?
Yes of course: even accomplished seekers do use them. No one relies only on his own [bots] or [scrolls].
Yet you should never forget that the bastards that own these services are NOT graciously providing some free new searching algos to the web-population for the glory of knowledge and for the sake of the web of old... They are just trying to scrape as much money as possible out of your USE of such engines, taking advantage of lusers' ("user+loser") lack of knowledge and using all the dirtiest tricks of the trade: paid positions ranked first, snooping private data (gathering & grepping your data without telling you), "false" links weighting, noise increase/decrease in order to favour specific sites, active censorship against free 'open source' & sound knowledge sites, priority to .com crap versus more useful .edu and .org sites... And so on, sliding happily downhill on the awful commercial slope.

They are ALWAYS actively stealing your private data, or at the very least logging, gathering and grepping WHAT you did search for... in order to make some money out of it. That's the ONLY reason search engines exist and someone is paying money for their bandwidth, duh.

But - hear and behold! - search engines MUST nevertheless provide some useful snippets of information, else no one in his right mind would use them. Let's take advantage of this simple truth using the "yo-yo" approach in order to "re-rank" and ameliorate their biased ranking results...

(Tackling the 'down yonder' problem)
text explained

1.   Look CRITICALLY at the first page of results, every time you use a search engine.
2.   If the results look spammy, leave and jump to the middle of the whole list (50% of the total)
3.   Look at the page of results you got there.
4a.   If they DO NOT look spammy, begin using them and then move UPWARDS to investigate further results (go to the previous page: 50% of the whole list minus one page) and continue UPWARDS for as long as you keep finding the pages you are examining useful.
4b.   If the results 'at the middle' look spammy, leave this area as well and jump to the middle of the FOLLOWING part of the list (75% of the total... Smack in the middle of the "second half" of the list of ranked results you are allowed to peruse)
5.   Again, now, if the results situated at 3/4 (75%) of the ranked list DO NOT look spammy, use them and then move UPWARDS to investigate further results (go to the previous page: 75% of the whole list minus one page) and continue UPWARDS for as long as you keep finding the pages you are examining useful.
6.   If the results still look spammy, leave the search engine altogether: good riddance.
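
The steps above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: `fetch_page` and `looks_spammy` are hypothetical callbacks (your own page fetcher and your own judgement of what 'looks spammy'), not part of any real search engine API, and `total_pages` is the number of result pages you are actually allowed to browse.

```python
from typing import Callable, List


def yo_yo(fetch_page: Callable[[int], List[str]],
          looks_spammy: Callable[[List[str]], bool],
          total_pages: int) -> List[str]:
    """Collect useful results following the yo-yo steps.

    fetch_page(n) returns the results of page n (0-based); both callbacks
    are hypothetical and must be supplied by the searcher.
    """
    # Steps 1-5: try the first page, then 50%, then 75% of the list.
    for fraction in (0.0, 0.5, 0.75):
        page = int(total_pages * fraction)
        results = fetch_page(page)
        if looks_spammy(results):
            continue  # spammy at this depth: jump deeper
        # A good page: use it, then keep moving UPWARDS page by page
        # for as long as the pages stay useful.
        useful = list(results)
        page -= 1
        while page >= 0:
            results = fetch_page(page)
            if looks_spammy(results):
                break
            useful.extend(results)
            page -= 1
        return useful
    # Step 6: still spammy at 3/4 depth, leave the engine altogether.
    return []
```

With a list of 20 pages where only pages 7 to 10 are clean, the sketch lands on page 10 (the middle), then climbs through pages 9, 8 and 7 and stops at the spammy page 6.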


See the above   [textual] snippet for explanations...

A three-step approach in order to survive "ranking briberies" and PPC shames:

1.  Ok first results?
2.   No? Go down to the middle
3.   No? Descend to 3/4 of the list
Still Nothing? Leave, it's a crap engine anyway.


[Google]   [Altavista]   [Lycos]   [Fast]   [Wisenut]  
[Northernlight]   [Hotbot]   [Teoma]   [Excite]   [Yahoo]

Unfortunately the above is a tad 'theoretical'... In reality the main search engines DO NOT ALLOW YOU TO SEARCH THE WHOLE 'ALLEGED' REPORTED RESULTS. Hence the importance of refining queries to the range of the 'allowed' searching margin.
In other words: if your query claims to give you 5000 results, this DOES NOT mean that you can really examine them...

A bottomless chasm? An immeasurable depth? :-)

Let's search in GOOGLE for "advanced searching"
26100 (alleged) results, the first bunch at the very top is not that bad after all, lotta paid ads, but some .edu (always a signal of good relevance) and some 'classics'. Nevertheless, let's apply our yo-yo approach to these results...
As the first 10 results correspond to "start=0"
one would expect to get "at the middle" of the 26100 alleged results with something like "start=13050" in the string...
No way! The maximum depth you are allowed to inspect 'down yonder' with Google is -once you repeat the search with the omitted results included- 999 results!
Therefore Google's "allowed area" for this query corresponds to 999 out of 26100 results (3,82% out of the apparent total of results)

I will call this 3,82 Google's yo-yo index
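
The index itself is plain arithmetic: reachable results divided by alleged results, as a percentage. A minimal sketch follows; the function name `yo_yo_index` is my own choice for illustration. Note that conventional rounding of the Google figures gives a value a hair above the 3,82 quoted above, which corresponds to cutting the same ratio off after two decimals.

```python
def yo_yo_index(reachable: int, alleged: int) -> float:
    """Percentage of the alleged results that can actually be inspected."""
    if alleged <= 0:
        raise ValueError("alleged total must be positive")
    return 100.0 * reachable / alleged


# Google, "advanced searching": 999 reachable out of 26100 alleged results.
google = yo_yo_index(999, 26100)
print(f"Google yo-yo index: {google:.2f}")
```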

So let's go straight to start=490 (more or less in the middle of the 'allowed' area):
Not bad... is it?
Let's imagine you are still not satisfied (omne trinum est perfectum)...
Then let's go deeper to start=660 (more or less in the 3/4 of the 'allowed' area):
Admit it: juicy results... Even for such a simple - and after all not heavily spammed - query... The yo-yo approach seems to work!

As a sidenote, in order to calculate the MAXIMUM amount of pages Google is supposed to cover, you could use the following trick: search?&q=a, that will return all the pages where the vowel 'a' is present (~a milliard and a third)

Google was easy! They offer mostly relevant results anyway. Let's repeat the experiment with some other engines, beginning with the heavily spammed ALTAVISTA:
&nbq=10&pg=aq&search=Search&stq=0: lotta spam.

Unfortunately Altavista offers only 20 pages of results (400 out of supposedly 31834 pages). Thus the "allowed area" for this query corresponds to just 1,27% of the total results: that is how shallow 'down yonder' you can get in Alta. Hence Altavista's yo-yo index = 1,27 (worse than Google).
You cannot 'trick' altavista into deeper seeking fathoms:
&nbq=10&pg=aq&search=Search&stq=3200 will give you THE SAME SPAM that has been ranked in the first ten positions

Are all search engines limiting real browsing to their alleged results like Google and Altavista?
Not at all.
If you search LYCOS, for instance:
&first=1&lpv=1&query=%22advanced+searching%22&t=all, you'll get around 18240 ~ 23940 sites (yes, it varies THAT much at any given moment; it could depend on server overload or on the phase of the moon :-). See the ad hoc explanation of search engines' tides in part three of this trilogy.
If you use the query-string
you'll get results 1102-1111... and if you use the query
you'll get results 2122-2131. And so on.

So we would have in this case an OPTIMUM yo-yo index of 100%
Do not forget however that of course the RELEVANCE of the results we gather is much more important than any 'depth' index parameter we may dream of :-)

Let's give a spin at FAST (ALLTHEWEB), one of the best search engines around.
Yumm: 23419 results, first page is at &q=%22advanced+searching%22&c=web&o=0.
Let's try origin 13210 (more or less in the middle):
No way! You are limited to 4010 results with a yo-yo index of 17,08, hence the middle depth will be around results 2000-2010 and, trusting the yo-yo technique, we may also have a look at the "3/4" depth, around results 3000-3010.
Alas! Fast is a good engine, but quite heavily 'commercial spammed' if you don't 'trim' your searches. You will obtain MUCH more relevant results using its advanced search functions with a more complex search query like THIS ONE, that would exclude .com sites, ask for the preferred presence of the words "techniques", "tips", "hints" and "searching strategies" in the text and for the preferred presence of the word "searching" in the title of the retrieved pages. Note that in this way we would retrieve 8853 results and that, having chosen to display 100 results per page, we would be able to come to a maximum hyperbaric Alltheweb depth of 4100 results, with a high yo-yo index of 46,31. Moreover with such a 'seeker correct' query Alltheweb's results at the middle (allowed) depth (i.e. 2000-2100) and at 3/4 depth (i.e. 3000-3100) would be MUCH MORE relevant for you.
Thus, never forget that the yo-yo technique alone is no guarantee of good results, you must still know how to cut your web-mustard when seeking :-)
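
The depth arithmetic used for Fast above (middle and 3/4 of the 'allowed' area, snapped to the start of a results page) can be sketched as follows. The function name `depth_offsets` and the choice of rounding down to a page boundary are my own assumptions; each engine counts its pages in its own way.

```python
from typing import Tuple


def depth_offsets(allowed: int, per_page: int = 10) -> Tuple[int, int]:
    """Start offsets of the pages at 1/2 and 3/4 of the allowed area."""
    def page_start(fraction: float) -> int:
        # Round down to the first result of the page holding that depth.
        return int(allowed * fraction) // per_page * per_page

    return page_start(0.5), page_start(0.75)


# Fast, plain query: 4010 reachable results, 10 per page.
print(depth_offsets(4010, 10))
# Fast, trimmed query: 4100 reachable results, 100 per page.
print(depth_offsets(4100, 100))
```

Both calls give offsets of 2000 and 3000, which match the result ranges quoted in the text for the two Fast queries.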

And what about WISENUT? This relatively recent search engine is trying to attack Google and Fast positions as the best main search engines.
For "advanced searching" wisenut fetches 41776 pages! First page is q=%22advanced+searching%22&p=0, so let's try q=%22advanced+searching%22&p=20800 (should be more or less in the middle)
No way! Once more we are limited (in this case to 300 results: 30 pages), so to swing to our "middle depth" we'll have to calibrate our yo-yo at 15.
The discrepancy between alleged results and reality is even more striking in Wisenut, with the consequence that our yo-yo index will be set at 0,72 for Wisenut.

On the other hand NORTHERNLIGHT will give us 19344 items, but their inside-URL counting method is weird (to say the least), for instance modifying &nth=1 into &nth=155+129+109+90+79+68+49+39+26+11, we get results 101-110 (that is: the pages from position 101 to position 110) but this sequence is valid for THIS SPECIFIC QUERY only.
This makes it extremely awkward to navigate Northernlight towards the bottom: for instance
&nth=371+359+343+321+278+229+214+202+180+167+155+129+109+90+79+68+49+39+26+11 corresponds to results 201-210, whereas
&nth=475+465+455+445+435+425+415+405+395+382+371+... will correspond to results 301-310, and since
&nth=575+565+555+545+535+525+515+505+495+485+475+... corresponds to 401-410, we realize that the numeration 'has stabilized' from &nth=395 downwards, probably due to the end - after that depth - of any results that have been automatically put inside separate folders (the 'speciality' of Northernlight).
This URL-numeration makes it impossible for us to search Northernlight depths effectively.
For instance results 101-110 of a search for the term searchlores will have a "specific" &nth=461+451+441+431+421+411+401+391+367+93 which differs - probably due to the different folders configuration - from the &nth sequence of our previous query for "advanced searching", where results 101-110 in the same depth positions corresponded to &nth=155+129+109+90+79+68+49+39+26+11.

Of course once the count-strings (the lists of "&nth=" numbers above) 'stabilize' in your Northernlight searches, you can add depth by hand.
Searching for the term "searchlores", for instance, given that the counter stabilized at &nth=391, you could input
&nth=501+491+481+471+461+451+441+431+421+411+401+391+367+93 in order to jump to results 141-150. Add 511+ to the previous string and you'll get results 151-160, and so on... page after page, towards Northernlight's mysterious depths.
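
Adding depth by hand, as described above, amounts to prepending counters that grow by 10 for each extra page. Here is a minimal sketch, assuming the counter has already stabilized and that the step of 10 observed in these two queries holds; the helper `extend_nth` is hypothetical and only manipulates the string, it does not talk to Northernlight.

```python
def extend_nth(nth: str, extra_pages: int, step: int = 10) -> str:
    """Prepend extra_pages counters to a stabilized &nth= value."""
    counters = [int(x) for x in nth.split("+")]
    top = counters[0]
    # Highest counter first, as in the strings shown in the text.
    new = [top + step * i for i in range(extra_pages, 0, -1)]
    return "+".join(str(x) for x in new + counters)


# "searchlores", results 101-110, pushed four pages deeper (141-150).
page_101 = "461+451+441+431+421+411+401+391+367+93"
print(extend_nth(page_101, 4))
```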

But this seems rather awkward and I am not sure we will be able to implement this approach in our bots (see the [yo-yo wand] for an example of automated yo-yo approach).
Moreover it would not surprise me if Northernlight's depth-searches linked above did NOT work correctly in a couple of months' time, given that any 'new folder' (which may appear through Northernlight's algos every time their databases are updated and reindexed) may screw up the first part of the counting string (in which case you will get by default just the 'surface' results 1-10).

As a sidenote, in order to calculate the MAXIMUM amount of pages that Northernlight is supposed to cover, you could use the following trick: search OR NOT search, that will return all the pages with or without the term 'search' (~400 million)

The good Inktomi-based HOTBOT counts 19200 matches. If you use &first=201 &recordcount=100 in the string you'll fetch records 201-300, and if you use &first=901 &recordcount=100 you'll get results 901-1000; since &first=1001 does not work, 1000 would "seem" to be the maximum depth allowed. In fact 1000 is not the limit: you can reach a MAXIMUM DEPTH of 1397 results using &first=999 &recordcount=399: try jumping into this hyperbaric hotbot search.
It would surely be interesting to know why hotbot put a limit at &recordcount=399...
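
The Hotbot depths quoted above follow from simple window arithmetic: the last record shown is first + recordcount - 1. A tiny sketch, where the function name `last_result` is my own and the parameter names simply mirror the &first= and &recordcount= values in the query string:

```python
def last_result(first: int, recordcount: int) -> int:
    """Position of the last record in a window starting at `first`."""
    return first + recordcount - 1


# The three windows mentioned in the text.
for first, recordcount in ((201, 100), (901, 100), (999, 399)):
    print(first, "-", last_result(first, recordcount))
```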

The relatively recent search engine TEOMA will give us a meager 7540 results for "advanced searching", yet we'll be allowed to check only the (even meagerer) first 194 results. This gives us a 'middle depth' at Search&f=100&i=0&s=100&y=0&l= and a '3/4 depth' position at Search&f=150&i=0&s=150&y=0&l= with a yo-yo index of 2,57. Note also how Teoma does not respect the 'Find this phrase' constraint even if you check it (and is therefore quite imprecise and not very useful for 'depth' fishing).

Poor old Excite is DYING (or at least seriously ill) due to the crap 'merging' plans of the commercial bastards.
It will give you just 2303 hits (if correctly approached), but you'll be able to investigate a maximum of 1000 ranked results. Hence 'middle depth' will be at 501 and '3/4 depth' will be at 751, both -alas- with "squalid to average" results. This is a 43,42 yo-yo index value, yet, due to the limited sampling and to the fact that Excite is a shadow of itself, we would be well advised to ditch it for the time being.

Good ole YAHOO claims to deliver us 13200 results. But, hear and behold, the total of alleged results remains indicated as 13200 until result 500 and goes back to 594 if you try to go deeper. Strangely enough you may descend deeper nevertheless to 659, and I managed to reach the maximum depth of 677, but this is the real maximum depth. There's no air here, we must go back to the surface :-)
This probably has to do with the fact that YAHOO now draws its results from Google, and Google is probably castrating Yahoo's results just in case.

(Based on the broad query: "advanced searching")

s.e.            Yo-yo index     real max     middle       3/4          Alleged Total
Fast            17,08 (46,31)   4010         2005         2670         23419 (8853)
Northernlight   n/a (high)      n/a (high)   n/a (high)   n/a (high)   19344

Broad results quantity, average variations (alleged results per engine, and the same figure as a percentage of the average)
Average = 20866

Wisenut 41776 200.21
Altavista 31834 152.57
Google 26100 125.09
Lycos 23940 114.73
Fast 23419 112.24
Northernlight 19344 92.71
Hotbot 19200 92.02
Yahoo 13200 63.26
Teoma 7540 36.14
Excite 2303 11.04
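
The percentages in the list above can be recomputed from the alleged totals alone. The sketch below assumes they are each engine's total divided by the unrounded mean of the ten totals; that assumption reproduces every figure in the list.

```python
# Alleged totals for the broad query "advanced searching", from the list above.
alleged = {
    "Wisenut": 41776, "Altavista": 31834, "Google": 26100, "Lycos": 23940,
    "Fast": 23419, "Northernlight": 19344, "Hotbot": 19200, "Yahoo": 13200,
    "Teoma": 7540, "Excite": 2303,
}

# Unrounded mean of the ten totals (shown rounded as 20866 in the text).
average = sum(alleged.values()) / len(alleged)


def percent_of_average(count: int) -> float:
    """An engine's alleged total as a percentage of the average."""
    return round(100 * count / average, 2)


for engine, count in alleged.items():
    print(f"{engine:<14}{count:>6} {percent_of_average(count):>7.2f}")
```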

The YO-YO wand
(Another of Laurent's incredible deeds)

In order to obtain more information and examples about searching bots you may want to visit the 'scroll room' and the bots section of searchlores.

Among the snowy mountains of our PHP lab, wizard Laurent has produced a beautiful
[yo-yo wand]
that will allow you to search directly, at 1/4, 1/2 and 3/4 depth levels, a whole series of important search engines (fast, google, lycos, wisenut).

Note that at the bottom of the wand, once you execute a query, you'll also find interesting STATISTICAL DATA about the commercial spamming of a given engine (how many .com sites, how many .org sites, and so on, in each search depth level).
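
The kind of per-domain statistics described above can be approximated with a short tally of top-level domains. This is not Laurent's PHP code, just a minimal Python sketch of the idea; `tld_counts` is a hypothetical helper and the URLs are placeholders.

```python
from collections import Counter
from urllib.parse import urlparse


def tld_counts(urls):
    """Count result URLs per top-level domain (.com, .org, .edu, ...)."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        if "." in host:
            counts["." + host.rsplit(".", 1)[1]] += 1
    return counts


# Placeholder results for one depth level of one engine.
sample = [
    "http://www.example.com/search-tips",
    "http://shop.example.com/buy-now",
    "http://www.example.org/lore",
]
print(tld_counts(sample).most_common())
```

Running the tally once per depth level (first page, 1/4, 1/2, 3/4) shows at a glance where the .com share drops.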

(look spammy): An explanation
Query results 'look spammy' when:
  • There are a lot of *.com sites ranked (always a bad sign)
  • There are repetitions of the same basic URL with slight variants ("spammers' triumph" parameter)
  • There are obvious "ad hoc" pages with huge lists of 'catching' words. Note that pages plagued by any sort of "flash" crap and/or the "" javascript snippets are also classical spam results, that you should by all means avoid.
  • There are few Russian, Chinese, Indian, Korean, and few ".jp", ".de", ".fr", ".pl", or ".it" pages among the ranked results (hence either "content" that does not pay enough money to the search engine owners has been excluded... or the ranking algos have been "castrated" to give priority to English-language sites... in both cases you had better ditch such a search engine for the time being :-).
  • Many more clues and parameters apply. You'll have to 'feel' it (and send some feedback here, eh :-)
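
The first two clues in the list above (too many *.com sites, the same host repeated over and over) are mechanical enough to sketch in code. The thresholds below are arbitrary assumptions of mine, not values from this essay, and the remaining clues still need a human eye.

```python
from urllib.parse import urlparse


def looks_spammy(urls, max_com_share=0.7, max_repeats=2):
    """Crude check of a results page; thresholds are illustrative guesses."""
    hosts = [(urlparse(u).hostname or "").lower() for u in urls]
    hosts = [h for h in hosts if h]
    if not hosts:
        return False
    # Clue 1: share of *.com hosts among the ranked results.
    com_share = sum(h.endswith(".com") for h in hosts) / len(hosts)
    # Clue 2: the same host showing up again and again.
    most_repeated = max(hosts.count(h) for h in set(hosts))
    return com_share > max_com_share or most_repeated > max_repeats
```

A function like this could be passed as the spam test to any automated yo-yo loop, keeping in mind that a low score is no proof of relevance.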

fravia+, 17 November 2001