Long term searching: rules and advice
Estote parati!
by Fravia+
(With some sound linguistic corrections by Ann)

First published at searchlores in September 2005
Version 0.35, January 2008
Part of the searching essays & of the evaluating results sections.


Time and again, seekers realize that many fellow humans don't even know how to use the exclusion operator on google - this usually leaves us feeling hollow, sick, and ashamed that we inhabit the same planet (let alone that we belong to the same species) as such scum :-) Time to throw some knowledgeballs down the webhills!

Introduction
Rules and advice

Prepare a written plan   Prune your query   Feel the searchscape   Comb the deep deep web
Mow the grey areas   Try different approaches...   ..& different languages   Keep records   Check your mistakes


Caveat   

The following 'tips' are intended mostly for LONGTERM searching. If you embark on a shortterm searching project, be aware that some of the searching techniques will be different. Ditto for "deep web" searching purposes.

Introduction   

I think we should slowly try to 'structure & organize' our searching Academia a little. During my lectures, many keep telling me that some of the techniques we explain are "waay too complicated for their needs": they want to know what I would do "in practice" when searching for this or that specific target.
The answer is of course that, first of all, your searching approach depends on your targets; second, that your needs shouldn't be too modest (finding at once any book, any music and any software you want is the minimum to start with); and that -besides- there's nothing complicated at all in our techniques, since the web was -from the beginning- MADE for sharing :-)

On the contrary, our searching and seeking techniques usually do appear rather simple -even banal- once explained and understood (In fact a posteriori most things are quite banal, duh).

Yet the request for more "systematic" explanations remains legitimate. So I will do my best to explain our approaches a little more, and I will begin with some points I believe searchers should take account of, when embarking on their "long term searches". Note that this is of course an essay in fieri: other seekers' contributions are urgently needed. Note also that most advice applies to any kind of search, not only web-related searches.

What is a long term search?
There is a huge difference between the "usual" "on the fly" searches (say when you want a specific book, see the shortterm searching tips) and your own (usually few) "long term" "thematic" searches. These are the two/three "searching passions" that anyone has in a given year (or given life): those targets that one most cherishes, maybe for work purposes, maybe (better imho) for his/her pleasure and for the knowledge.
While for the usual "everyday" searches if you do not find what you are looking for in 15 minutes it probably means that your search strategy is wrong, for any long term search, the strategies themselves will change, evolve and be refined along the process. Hence for a long term search a week is not enough, a month is not enough, a year is not enough. For my own searches I fear that my life won't be enough.

For long term searches the old truth that searching is a process applies in spades.
In fact you need to know something to find out something more: this means that for those targets you really cherish you will have to become an expert, both for searching and for evaluation purposes.
Anyway the most important thing is to prepare -at least a little- your long term searches: Estote parati!


Long term searching: rules and advice   


1. Develop your search strategy: prepare a written plan

This is VERY important. A simple five-minute "brainstorming" before beginning your search, writing down your scope and a preliminary list of keywords, will pay jackpots whenever you risk running amok and getting lost inside the dark web-woods.
Break the query down into its components, record the various terms (or concepts) and their definitions and prepare a first small list of synonyms and/or synecdoches.
The synecdochical searching method may (may) help when some components are too difficult to describe linguistically and cannot therefore be searched effectively.
For instance, imagine you're searching for ways and techniques to understand which is the real mother tongue of a writer of an English web-snippet or email (there's an old essay on my site about this). This is a concept so vague and complex that it cannot be captured with a specific set of terms. In such cases you will have to try "to spiral" around your signal, finding querystrings along the way and sharpening them while you go. Use clear and specific terms at the beginning of your search process: you'll refine your query later, again and again.
However you should always try your best to prepare a written list and a quiver full of sharp querystrings before even beginning your search, even if we all know how the very definition of the topic will change gradually in the course of the search process.
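For those who prefer their written plan in a file rather than on paper, here is a minimal sketch of such a plan in Python. Everything in it is an assumption made for illustration: the `SearchPlan` class, its field names and the sample terms are not part of any tool mentioned in this essay.

```python
from dataclasses import dataclass, field


@dataclass
class SearchPlan:
    """A written plan: scope, concepts with their terms, and things to leave out."""
    scope: str
    # concept -> list of terms (synonyms, synecdoches) that may express it
    concepts: dict[str, list[str]] = field(default_factory=dict)
    exclusions: list[str] = field(default_factory=list)

    def add_terms(self, concept: str, *terms: str) -> None:
        """Record new terms for a concept, skipping ones already listed."""
        known = self.concepts.setdefault(concept, [])
        for term in terms:
            if term not in known:
                known.append(term)

    def as_columns(self) -> str:
        """Render concepts on the left and their terms on the right."""
        width = max((len(c) for c in self.concepts), default=0)
        return "\n".join(
            f"{concept.ljust(width)} | {', '.join(terms)}"
            for concept, terms in self.concepts.items()
        )


# Hypothetical plan for the "mother tongue" example; the terms are illustrative only.
plan = SearchPlan(scope="detecting a writer's mother tongue from English text")
plan.add_terms("mother tongue", "native language", "first language", "L1")
plan.add_terms("detection", "identification", "stylometry")
print(plan.as_columns())
```

Since the plan will change during the search, `add_terms` can be called again and again as new terms turn up; duplicates are silently ignored.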


2. Prioritize queryterms and concepts: determine whether the feathers in your arrows are really necessary or are just optional criteria. Prune your query!

Now that you have that list of terms, you need to put some order into it.
You should specify not only your scope, but also the limits of your search (careful when limiting to specific document types!). You should list as many concepts that should not be included as possible, and limit your query, if necessary, by publication date (using any good daterange utility). Be careful with date limits as well: systematic searches should try to gather all relevant material regardless of date.
So the most effective "limiting" approach will mostly be term-related, and based on exclusion through some specific operators. For instance, for google: -intitle:, -allintitle:, -inurl:, -allinurl:, -allintext:, -inanchor: and so on. For yahoo, -path:, -inurl: and so on.
Here is a classical (short term and book related) "four-pruned" example:
"title:index title:of" -originurlextension:htm -originurlextension:html -papers -copyright +Oreilly
A gardener prunes the branches of his trees to contain their growth to the proper scale of the garden. Prune your queries!
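If you keep your terms and exclusions in lists, the pruning itself can be sketched in a few lines of Python. This is only an illustration: `build_query` is a hypothetical helper, and operator syntax differs between search engines and changes over time, so check what your engine actually accepts.

```python
def build_query(required, phrases=(), excluded_terms=(), excluded_ops=()):
    """Assemble a query string from terms, exact phrases and exclusions.

    excluded_ops is a sequence of (operator, value) pairs such as
    ("inurl", "shop"), rendered as -inurl:shop.
    """
    parts = list(required)
    parts += [f'"{p}"' for p in phrases]            # exact phrases in quotes
    parts += [f"-{t}" for t in excluded_terms]      # plain term exclusions
    parts += [f"-{op}:{val}" for op, val in excluded_ops]  # operator exclusions
    return " ".join(parts)


# Illustrative values only, loosely modelled on the book-related example above.
query = build_query(
    ["Oreilly"],
    phrases=["index of"],
    excluded_terms=["papers", "copyright"],
    excluded_ops=[("inurl", "htm"), ("inurl", "html")],
)
print(query)
```

Keeping the exclusions as data makes it easy to prune one branch at a time and to record which pruning actually helped.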
Search term selection and usage are as important as the selection and usage of operators or booleans, and maybe even more: in fact simply increasing the complexity of a given query by adding advanced operators usually has a less remarkable effect on your query results. To make a simple example, while adding to your querystrings the operator -"site:com" will often improve your search (eliminating all ".com" sites: good riddance :-), a careful choice of a new search term (related to your specific query) will probably "cut" the web to your signal waay better, pushing your target silhouette much more clearly against the background noise of the commercial (crap) web.
Therefore strong attention must be paid to all issues related to the selection and use of search terms. Clearly the selection of search terms matters quite a lot, and should be well thought out (and evaluated carefully) when looking for the best possible search results. If you are not pleased with your SERPs, even some small swapping and/or change may lead to a significantly different (and potentially better) outcome. You'll of course get relevance differences through the use of complex operators as well, but the results may not always be worth the time invested.
A graphical approach may be very useful in this context: a simple column structure (on the left the concepts, on the right the terms), will help you as a first quick "frame reference", while some Euler diagrams will allow you to quickly grasp the "form" and the "color" of your query.
The terms should also depend on your recall needs (recall is the percentage of all relevant documents available that is retrieved by your query). For high recall results, it is often necessary to include terms that will -alas- capture many unwanted items in order to fetch a few relevant ones.
To gather more (relevant) terms you may also try to use a couple of graphical and clustering search engines (For instance Kart00, Touch, Ujiko, Dicy or Mooter). These will widen the "image" of your query. In fact you should now begin to smell your "scent" even before beginning!


3. Run preliminary searches in order to "feel the searchscape"

Use at least three among the main search engines. Check the results, you will soon notice that they do not overlap very much. Check if your "quiver" of queries is rich enough or if you need more arrows to begin with.
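To put a number on how little the engines overlap, you could paste the top results of each into a small script like the following sketch. The engine names and URLs are placeholders, and the Jaccard measure (shared URLs divided by all distinct URLs) is just one reasonable choice of overlap measure, not something prescribed here.

```python
from itertools import combinations


def overlap(a, b):
    """Jaccard overlap of two result lists: shared URLs / all distinct URLs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_overlap(results_by_engine):
    """Return {(engine1, engine2): overlap} for every pair of engines."""
    return {
        (e1, e2): overlap(results_by_engine[e1], results_by_engine[e2])
        for e1, e2 in combinations(sorted(results_by_engine), 2)
    }


# Hypothetical top results, pasted in by hand from three engines.
results = {
    "engine_a": ["http://a.example/1", "http://b.example/2", "http://c.example/3"],
    "engine_b": ["http://b.example/2", "http://d.example/4"],
    "engine_c": ["http://e.example/5", "http://c.example/3", "http://f.example/6"],
}
scores = pairwise_overlap(results)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

A score near 0 means the two engines showed you almost entirely different pages for the same querystring, which is exactly why you should not rely on a single one.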
Remember, however, that expanding a search including broader and/or narrower concepts will (mostly) increase the number of results. There's a trade off you must be aware of: in some cases this will be necessary in order to fetch all relevant documents, in other cases this may mean that you will have many irrelevant results.
It is often best to begin with those querystrings (collection of terms) that are the most specific, most important and/or the least frequently occurring.
Not all concepts identified in the search plan need to be included in all searches. When you increase the number of terms required in your target documents, the number of documents retrieved by your queries will of course decrease. Your imagination in choosing and/or swapping the query terms when cross-searching, and your capacity to adapt the choice of terms to the fields of data you will be mowing (the web at large? A specific database? An on-line library?) will often mean success or failure. Note that -as a rule of thumb- if you're searching FULL CORPORA of texts you should include a lot of terms and be as specific as you manage (without tilting the query of course), while if you're searching among TITLES (say a Journal titles-database) you should be much more careful when adding feathers to your search-arrows: the number of terms included in a query should in general be proportional to the amount of data being searched.
During this phase the seeker is still just 'shooting arrows': testing different terminological combinations. Basically you run a search, examine the overall patterns in the SERPs, closely study the several highly relevant results, and then reformulate the search to improve its effectiveness.
Just one change at a time! Then test it, and add something more only when happy with the results. If you make several changes at once to your querystring it will be difficult to assess the impact of each individual change, duh.
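The "one change at a time" discipline can be sketched as a small generator that, given your current querystring, lists every variant differing from it by exactly one addition, removal or swap. The function name and the sample terms below are hypothetical.

```python
def single_change_variants(base_terms, additions=(), swaps=None):
    """Yield (description, terms) pairs that differ from base_terms by one change.

    additions: candidate terms to append, one at a time.
    swaps: mapping of an existing term to alternative terms to try in its place.
    """
    base = list(base_terms)
    swaps = swaps or {}
    for term in additions:
        if term not in base:
            yield f"add {term}", base + [term]
    for i, term in enumerate(base):
        if len(base) > 1:  # never produce an empty query
            yield f"drop {term}", base[:i] + base[i + 1:]
        for alt in swaps.get(term, ()):
            yield f"swap {term} -> {alt}", base[:i] + [alt] + base[i + 1:]


# Illustrative terms only.
variants = list(
    single_change_variants(
        ["canaletto", "engraving"],
        additions=["veduta"],
        swaps={"engraving": ["etching"]},
    )
)
for description, terms in variants:
    print(description, "=>", " ".join(terms))
```

Run each variant, note what it did to your SERPs, and only then decide which single change to keep.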
We are still inside the iterative, cyclical, trial and error phase of the seeking process, but you should now begin to smell the "scent" of your quarry. You should now hear the signal loud among the noise. The "conceptual terms swapping" approach should now be finished. Should you have to change your terms during the following phases, you will need to re-run your preliminary searches as well. Enough playing with words. Now it's time to switch over to some "operational" techniques.


4. Let's now wade into the morasses. Comb the deep deep web

First of all we will search through combing: we will search people that have searched and people in the know.
For instance (but of course it depends on your target) we will find all electronic literature databases relevant to the topic, as many relevant messageboards and key journals as we can, and all the related books we manage to fetch.
One possible way to do this is to investigate the various reference lists you find and then search for the quoted references you have found inside your SERPs.
Chances are that if a journal denies you access to its articles, for some petty commercial reasons, those very articles will still exist in other copies around the web: the web -remember- was made for sharing. If you don't find your target in the wild, you'll go back to the database that keeps it captive and either find some entrance or guess some entrance or, eventually, gosh, even almost pay for it.
Slowly, a pattern of journals, messageboards, databases and experts will emerge. The scent will be very strong. Still social engineering can be extremely useful when you get 'stuck' and are not able to find a given book: get in touch and consult experts doing research in your target topic area. Most of the time they will be ready and glad to help you. As a rule of thumb, experts and writers are extremely annoyed that nobody cares for their knowledge/books and will mostly bend over backwards in order to help the few that seem to be interested.
Successful Web searches rest on a combination of experience and domain knowledge. To be successful in information retrieval one must be (or get) knowledgeable about the search topic.
Little by little you will and must become familiar with your target field and be capable of identifying, evaluating and assessing your target's experts, messageboards, webrings, articles, reviews and books. The evaluation part is a most important part of this search process.
But evaluating means evaluating USING YOUR OWN SOUND PARAMETERS, not somebody else's ones.
It is also important to remain responsive to all new information that emerges during the search, and not just fetch the type of information you expected to find before beginning your query. You should try to maximize any feedback potential examining your SERPs with a synthetically oriented, completely open approach.
Always question "authority" and don't ever accept the "official" searchscape if your results do appear to contradict it. Remember that you know what you are doing and how to search deeper and further, while chances are that those you found do not know zilch about how to search the web. This brings us to the next point.


5. Identify relevant unpublished or not widely distributed literature and sources (mow the "grey areas")

This is more important than many (would be & self called) experts would like to admit. Nowadays journals' articles and books are often just concoctions of material ripped from the web. There is a plethora of books about "google searching techniques" and "google hacking" all around the web (and also in any bookshop, but there they seem even to demand money for them, go figure :-)
Some of these books are so full of evident mistakes and of stolen -but not understood- snippets that, seriously, with a bit of urine and a big ball of clay, any good seeker could completely recreate their (soi-disant) "authors".
Plagiarism (and plagiarism-proofing) has now reached paroxysmal dimensions. While seekers can (and often do) counter this, almost no one else is able to do it. There are even agencies that let themselves be paid "to search for plagiarisms" (through google, no less :-)
This is ludicrous. I believe that a sound thrashing followed by inadequate medical attention is the way to deal with people willing to pay for such "services" instead of checking on their own whether someone did or did not make use of plagiarism.
Fact is that the grey areas of the web (conference papers and proceedings, unpublished dissertations on relevant topics, "unofficial" messageboards, IRC channels, and even -gosh- not widely known blogs) offer most of the time top-notch information.
That's one of the reasons it could be quite dangerous to limit searches to pdf or doc files: a lot of information is published in just plain ole html (like this very file you're reading :-)
As all searchers realize -sooner or later- there's a lot to learn from people that are completely or partly outside the 'official' academic salons. Of course, once again, your evaluation skills are of critical (in all senses of the word) importance.
Moreover there are "web tides": moments when suddenly there are encounters between the "official" academic fauna and the many grey hats that abound on the web. The waves roll slowly on the Web. On the Internet seekers can frequently observe micro-communities, working along similar paths and with interests in similar fields, that remain unaware of each other's existence for an inexplicably long time. This has of course to do with both the vastness of the web and the fact that people do not know how to search.


6. Review the written plan regularly to incorporate new discoveries. Try different approaches just in case

Searching is a PROCESS. It is of paramount importance that you constantly update your frames of reference using the material you have found. Possible promising alternative approaches should be listed as well (for a rainy day).
Also, for "very long" long-term searches, remember, when reviewing results, to checklist the sources searched and the approaches you used, to prevent wasting time in duplicating efforts.
A good idea is to "re-do" your search -just in case- without using google at all. This will force you to follow completely different paths, since the main search engines don't overlap that much.
Another possibility is to try to use different approaches, like ftp searches, usenet & blogs searches, irc searches, trolling and all the other many seekers' paraphernalia in order to try to get to your target "from behind".
Here's a simple trick: once every -say- three months try your longterm search queries NOT ON GOOGLE but on A9, or on exalead, or even on a meta search engine like seekz... yep, indeed, they don't overlap that much, yep, of course, I knew you would have enjoyed it :-)


7. Re-run your query using different languages

This regional approach is often underestimated. There's a widespread belief that if something is important enough, it will exist in English as well. This is not only untrue, but quite far off the mark. Many language-impaired English-speaking friends, especially, discover with amazement the wealth of results available in -say- German, French, Spanish, Russian, Japanese, Korean, Italian, Hindi and Chinese (just to name a few important languages... I could continue this list). There are many translation tools available on the web, even for Japanese :-)
It should be noted en passant that knowing -at least passively- some foreign languages is, for Web-seekers, almost a sine qua non.
Never underestimate the importance of going regional with your long term queries. The wealth of results that local, country specific and relatively "obscure" search engines may give -for some targets- can really be impressive, and will often make all the difference between an average performance and an excellent search.


8. Keep records of all your search activities

Systematic record keeping is OF PARAMOUNT IMPORTANCE when searching. The classical mistake of almost all newbie seekers is to 'forget' to keep records during their long-term searches.
For this purpose I suggest you simply use the NOTE function in Opera: just highlight the target text you are interested in, rightclick, and then choose copy to note (or use the keyboard shortcuts, either CTRL+SHIFT+C or CTRL+ALT+E depending on the version of Opera you're using): the URL of the page you're viewing at that moment *and the date* will be automatically stored in your note *together with the highlighted text*.
You should create ad hoc note folders (for instance "research_on_canaletto_29SEP2005") and, at the end of your search, before switching the box off and going to sleep, just move all your related notes inside the correct folders. Opera's Notes are just text format, very easy to edit, cat, search or prune.
Alternatively use something else, even a pen and a sheet of paper will do. DO NOT rely on your memory alone (or on your extraordinary seeking capabilities to re-find at once what you may have lost :-)
If you do, you will regret it. Sooner than you believe.
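If you would rather script your record keeping, here is a minimal sketch that stores the same three things a note should hold (date, URL, excerpt) as plain text. The function names, the `notes.txt` file name and the sample URL are assumptions made for this example; they have nothing to do with how Opera stores its own notes.

```python
from datetime import date
from pathlib import Path


def format_note(url, text, when=None):
    """Return one plain-text note: date, source URL, then the excerpt."""
    when = when or date.today()
    return f"[{when.isoformat()}] {url}\n{text.strip()}\n"


def save_note(folder, url, text, when=None):
    """Append a note to notes.txt inside the given folder and return the file path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "notes.txt"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(format_note(url, text, when) + "\n")
    return path


# Hypothetical excerpt; calling save_note("research_on_canaletto_29SEP2005", ...)
# with the same arguments would append it to that folder's notes.txt.
note = format_note(
    "http://example.org/canaletto",
    "  A passage worth keeping.  ",
    when=date(2005, 9, 29),
)
print(note)
```

Because the result is plain text, it remains easy to edit, cat, search or prune with whatever tools you like.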


9. How do you know if your search is/was effective or not? Check your mistakes

Well the short answer is: you will know, because you will "feel" it :-)
The long answer is -instead- that you will never know for sure. The effectiveness of a given search is generally measured in terms of precision (how many retrieved documents are relevant to your query? 100%? 50%? 10%?) and recall (how many of all the theoretically available relevant documents have been retrieved through your specific query? 100%? 50%? 10%?). Alas! Both parameters, and especially the second one, recall, are IMPOSSIBLE to gauge on the wide web, because it is impossible to check -manually or automatically- all existing target documents in order to determine how many are relevant. But you can gauge them on specific web-subsets, say a given (small) database.
Poor seekers believe the relationship between precision and recall to be always a trade off: more precision means less recall and vice versa.
In fact this very much depends on your target and on the 'form' of your queries. Usually the trade off is very strong during the first phase of a query, but while refining and swapping terms, later on, you may be able -if you know what you do- to increase both precision and recall.
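On a small database where you have judged every document yourself, both measures are easy to compute. The sketch below uses made-up document identifiers and a hypothetical `precision_recall` helper; it also shows a refined query improving both figures at once.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a query against a fully judged collection."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # precision: share of what we fetched that is relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # recall: share of all relevant documents that we fetched
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical small database in which d1..d4 are the relevant documents.
relevant_docs = {"d1", "d2", "d3", "d4"}
first_try = precision_recall(["d1", "d2", "d7", "d8", "d9"], relevant_docs)
refined = precision_recall(["d1", "d2", "d3", "d7"], relevant_docs)
print("first query:  ", first_try)
print("refined query:", refined)
```

Here the first query scores 0.4 precision and 0.5 recall, while the refined one reaches 0.75 on both: swapping terms dropped noise and fetched one more relevant document.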
Another possibility is to use FRIENDS. Each seeker seeks in a different way. Each human being sees different (sets of) patterns when investigating the same amount of data. Conducting multiple human investigations on the same target will bring a considerable amount of "freshness" to your original query. Individual approaches may completely determine the output of the search, often more than any other factor. Simply consulting with others can help you to obtain new valuable arrows and improve your overall results. Of course the web is the ideal medium for this.
Checking for possible mistakes INSIDE your queries (or gaps) may also be useful in order to make your search much more effective.
There are many common errors: Spelling errors, terminological & conceptual errors (forgetting important terms), boolean (AND and OR) errors, definition 'scope' errors, combing errors (forgetting important resources), reviewing errors, evaluation errors, forgetting truncation possibilities, insufficient 'regionalization' of the query, and so on. Don't be too snotty when assessing your own queries. Chances are you DID forget something :-)

Estote parati!



(c) 3rd Millennium: [fravia+], all rights reserved, reversed, revealed and reviled