~ London ~
Version February 2003
    
A workshop in London
I have been invited to hold a two-and-a-half-hour workshop at a London university on 28 February 2003; these data and images could prove useful.


Contents
~ Introduction ~ Learning to transform questions into effective queries
~ Searching for disappeared sites ~ Netcraft
~ rabbits.htm ~ slides ~ snippets ~ Opera "opensourcing"
~ fast, google and teoma: pro and contra ~ Some reading material


How to find anything on the web  (London, end February 2003)

I am going to make you all a present: I will give you today COSMIC POWER. No kidding :-)

"I can fix problems I don't understand.  There are no words to convey the feeling of omnipotence this generates.  The combined knowledge of the entire human species is at my beck and call."

Yes, you will be able to quickly find solutions to problems you did not even know existed a few minutes before you started your search...
You have a noisy airport nearby, with annoying night flights, that you want shut down? Learn how people did it elsewhere, all over the globe. Find the best methods, guaranteed to give some results.
You have bought a second-hand camera, and you do not have the instructions? Find the manuals, all possible editions, in all languages, and find people discussing on some messageboard, maybe many years ago, the advantages and the problems of your recent purchase.
Your kid would like to have the 'Las Ketchup' song? Your mother vaguely remembers an old lullaby text? Your father seeks the image of a Caravaggio painting? You want an ebook edition of Pratchett's Discworld? It's all there. Everything is there.

"I've heard legends about information that's supposedly 'not online', but have never managed to locate any myself. I've concluded that this is merely a rationalization for inadequate search skills. Poor searchers can't find some piece of information and so they conclude it's 'not online'."

As we have seen, even the best among the largest search engines (google, teoma and fast) index only a small proportion of the "static pages" Web bulk (around a quarter: 3 billion versus 13 billion) and do not search at all (with rare exceptions) the hidden databases, which are estimated to be hundreds of times larger than the Web bulk. Since the "main" search engines (google, fast and company) DO NOT provide a comprehensive and up-to-date search service of the Web, we'll HAVE TO resort to other techniques, and as you will see, some of them are not 'kosher'.

The generic "main" search engines are currently among the most important means used to find information on the Web, but there are compelling reasons for using a battery of various OTHER tools in order to search the web more effectively: local groups, regional (i.e. geographically or language-specific) and "specialized" search engines.
The resulting improvements to your "search ability" are manifold.
The point is that it is becoming technically impossible for one single search engine to ever hope to index the entire contents of the Web. There is of course also a major economic hurdle: it is simply not cost-effective, given the rather meager revenues a search engine can hope to receive from its commercial prostitution, to build such a complete index.
Moreover, the main search engines often do not thoroughly search through a Web site. They may (may) index all pages that are a few links down from the site's entrance, but often do not go beyond. Hence the 'deep content' of many sites is not searchable from the main search engines.
Hence the need for, and the utility of, those "local" (or 'specialized') search engines: these will (try to) index the whole of all relevant and pertinent sites.

A "specialized" search index can be created in a number of ways. The simplest method is to constrain the very contents of the search index to a specific topic, that is, to retrieve and index only those documents that are fully related to the specific topic or category of interest.
You may retrieve pages from known relevant sites, using focused bots or even, simply, taking advantage of the main search engines to find relevant documents.

The second approach is to build an index of specialized messageboards and databases, relevant to the topic of interest. In both cases, some targeted crawling at query time will go a long way towards giving a sensation of 'freshness' to the presented data. This in fieri indexing is typically difficult for all main search engines, since there are -simply put- too many possible pages to attempt any real crawl at query time. It may however be possible when a local, specialized search engine is dealing with a specific topic and knows where to look.
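The "focused bot" idea above can be sketched in a few lines. What follows is a minimal, offline sketch, not a real crawler: the relevance test (keyword counting with an arbitrary threshold) and the injected `fetch` function are my own assumptions for illustration, so that no actual network crawling happens.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical relevance test: a page is "on topic" if enough
# topic keywords appear in its text (threshold chosen arbitrarily).
def is_relevant(text, keywords, threshold=2):
    text = text.lower()
    return sum(text.count(k) for k in keywords) >= threshold

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def focused_crawl(seeds, fetch, keywords, max_pages=50):
    """Breadth-first crawl that only indexes, and only expands,
    pages judged relevant. `fetch(url)` returns HTML or None."""
    frontier, seen, index = deque(seeds), set(seeds), []
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None or not is_relevant(html, keywords):
            continue            # off-topic: index nothing, follow no links
        index.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index
```

The key design choice is that off-topic pages are neither indexed nor expanded, which is what keeps a focused crawl cheap enough to run per topic, or even per query.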

According to onestat this is the traffic distribution (measured in Amsterdam) among the 7 largest search engines of the web:
1. Google 54.7% 
2. Yahoo 22.1% 
3. MSN Search 9.5% 
4. AOL Search 3.7% 
5. Terra Lycos 2.8% 
6. Altavista 2.5% 
7. Askjeeves 1.5%


The www is a hypertext corpus of enormous complexity, and it continues to expand at a phenomenal rate. Moreover, it can be viewed as an intricate form of populist hypermedia, in which millions of on-line participants, with diverse and often conflicting goals, are continuously creating hyperlinked content. Thus, while individuals can impose order at an extremely local level, its global organization is utterly unplanned — high-level structure can emerge only through a posteriori analysis.

We could define roughly "Searching on the www" as the process of discovering pages that are relevant to a given query.

The problem is that if the query is too specific, there will be too few pages containing the required information; if, on the other hand, the query is too broad, the opposite problem arises: correctly filtering the far too many pages that will be given to you as results.
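You can feel this trade-off on a toy corpus (the documents and query terms below are invented for illustration, and real engines rank results rather than merely filter them): each extra required term narrows the result set, trading recall for precision.

```python
# Toy corpus: three "documents", one of them off-topic noise.
docs = [
    "opera browser review",
    "opera browser advertisement removal",
    "soap opera episode guide",
]

def search(query_terms, corpus):
    """Return documents containing ALL query terms (AND semantics,
    which is what the main engines apply by default)."""
    return [d for d in corpus if all(t in d.split() for t in query_terms)]

print(len(search(["opera"], docs)))                        # broad: 3 hits, one off-topic
print(len(search(["opera", "browser"], docs)))             # narrower: 2 hits
print(len(search(["opera", "browser", "removal"], docs)))  # specific: 1 hit
```

Going from one term to three cuts the noise, but one term too many and the single relevant hit disappears as well: that is the whole art of query refinement in miniature.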

Anyway, even when all your searching techniques have failed, when all your cunning approaches have fished out nothing, when your lonely searches seem endless, when all your tricks have brought you no reward... even when a rude database dares deny you access, even when your target has been pulled off the web, jailed, destroyed, censored, annihilated by the powers that be... even in those dire moments you will always know that you can find what you are looking for, even if it is no longer there!

SEARCHING FOR DISAPPEARED SITES

http://webdev.archive.org/ ~ The 'Wayback' machine, explore the Net as it was!


Visit The 'Wayback' machine at Alexa.


Alternatively learn how to navigate through [Google's cache]!
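Both services can also be reached by constructing the lookup address directly. A small sketch, with the caveat that the URL patterns below are the publicly used ones as far as I know, and such formats do change over time, so double-check them:

```python
from urllib.parse import quote

def wayback_url(page, timestamp="*"):
    """Wayback machine address for a page: '*' lists all archived
    snapshots; a YYYYMMDDhhmmss timestamp asks for the copy
    nearest that moment."""
    return "http://web.archive.org/web/%s/%s" % (timestamp, page)

def google_cache_query(page):
    """Google search URL using the cache: operator."""
    return "http://www.google.com/search?q=" + quote("cache:" + page)

print(wayback_url("www.opera.com"))
# http://web.archive.org/web/*/www.opera.com
```

Handy when a site has vanished: first ask Google for its cached copy, and if that too has expired, walk back through the archive's snapshots.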

NETCRAFT SITE SEARCH

(http://www.netcraft.com/ ~ Explore 15,049,382 web sites)

VERY useful: you find a lot of sites based on their own name and then, as an added convenience, you also discover immediately what they are running on... verbum sapienti sat est eheh, I mean... a word to the wise is enough...
Search Tips
Example: site contains [searchengi] (a thousand sites eh!)





Learning to transform questions into effective queries ~ slides

Slides
~ Languages on the web
~ Growth of the web * 1000
~ Structure of the web
~ Short and long term seeking

Opera 6.1 (windoze opensourcing)


Traditionally I conclude my workshops by examining ways to remove advertisement from this FANTASTIC Web-browser, which beats hands down any other browser on the market.
You'll find a Linux version as well (at
http://www.opera.com/linux/), where you can use the ad-infested version or purchase the "non-ad" version instead.
Below I'll explain how to eliminate advertisement in Opera 6.1 for windoze, but I would advise you to purchase, as I did, your own version, and help slick, good and fast Opera against the awful Netscapian and Microblowian browsersaurii.

I do not believe that the following information will damage Opera, on the contrary: