How to access and exploit the shallow deep web
by fravia+
Introduction (de rerum profundae telae)
Is the "deep" web still so invisible? ••
"Bad" and "good" databases
Quality on the "shallow" deep web ••
"Deep web searching" strategies and techniques
Estote parati & anonymity matters ••
Wget, scapy and other wonder tools
Be wary of he who would deny you access to information,
for in his heart he dreams himself your master.
"Deep web"(1) searching seems
to be all the rage since a couple of years, mainly because of the
supposed "undetectability" of this part of Internet by the main search engines.
This was -and still often is- indeed true,
probably because of a
vast quantity of "proprietary databases" that did not, and do not, want their content to be indexed.
These commercial ventures have always used a broad choice of
"on the fly" server tricks and streaming techniques in order to further restrict
the indexing capabilities of almost all bots and search engines.
Not that the main search engines' "indexing capabilities"
deserved much praise to start with:
in fact -and this is a real problem for seekers(2)-
the most used search engines(3) unfortunately have an "indexed web-content" that:
a) is quite americanocentrically skewed;
b) often offers obsolete results due to said search engines' own algorithm priorities;
c) gets heavily and actively spammed by the
SEO_Beasts;
and moreover, d) leans anyway quite heavily
towards all those useless "popularity" sites that the unwashed love and/or towards
commercial sites whose purpose lies only in
selling commercial trash to the unwashed themselves.
No-one knows how big the "deep web" might be,
for the simple reason that no-one knows how big the web "lato sensu" is.
The most commonly (and uncritically) cited study(4) claims a 550 to 1 (sic!) ratio between
"unindexed deep web" and "indexed web" documents. This was probably an exaggeration from the beginning,
and is surely not true nowadays.
This paper argues that the many "open access" indexable databases that have recently flourished on what was
the "unindexed web"
have already dented the supposed huge dimensions of the ancient "deep web", whether we intend this concept,
as we probably should,
as "the not always indexed, but potentially indexable, web of
speciality databases with specialistic knowledge", or whether we intend the deep web stricto sensu
as the "unindexed web", i.e. the web of "proprietary databases" plus the web "that cannot currently be indexed".
This said,
a correct use of ad hoc searching techniques -as we will see- might help fellow seekers
to overcome such indexing limitations for the shrinking, but still alive and kicking,
"pay per view" proprietary databases.
After all, winnowing out commercial trash has always been a sine qua non for searchers browsing the web:
everyone for instance knows how adding even the simplistic and
rather banal -".com" parameter ameliorates any query's result :-)
The problem of the deep web was very simple: many a database (and many a commercial venture)
chose the obsolete "proprietary"
business model and decided to
restrict knowledge flows in order to scrape together commercial gains. Even
some "universities" (if institutes that make such choices still deserve such a name)
have often rakishly betrayed their original knowledge-spreading mission.
This paper's apparent oxymoron about the "shallow deep web" reflects the fact that this
dire situation is nowadays slowly ameliorating thanks to the many worthy "open access"
initiatives. The deep web of old is shallowing out, getting more and more indexed. The critical mass
already reached
by the open access databases harbingers a well-deserved doom for the "proprietary knowledge"
model of the ancient deep web.
While this text will nevertheless examine some ad hoc "deep web" searching approaches, it is
opportune to recall that -anywhere on the web-
some
long-term web-searching
techniques might require some consideration as well,
while for quick queries other, different kinds of short-term tips
and advice can also prove useful.
On the other hand,
some of the ad hoc deep web searching techniques explained in the following could probably come in handy
for many queries on the indexed part of the web as well.
Note also that this text, for examples and feedback purposes, will concentrate mainly on a specific subset of
the deep web databases: academic scientific journals, since out there a huge quantity
of specialised databases and repositories list everything that can be digitized,
from images to music, from books to films, from government files to statistics,
from personal, private data, to dangerous exploits.
The databases of the deep web have been subdivided into
various groups(5): word-oriented, number-oriented,
images and video, audio, electronic services
and software oriented. For our "academic" search examples, we will aim mainly at word-oriented
databases (libraries, journals and academic texts): after all, they are thought to
represent around 70% of all
databases.
This said, the real bulk of the "huge deep invisible web"
consists mainly of raw data: pictures of galaxies and stars, satellite
images of the earth, DNA structures,
molecular chemical combinations and so
on. Such data might be indexed or not, but for searchers they clearly represent stuff
of minor interest compared with the scholarly treasure troves that some specialised
"academic" databases might offer(6).
Is the "deep" web still so invisible? |
It's worth keeping in mind that the "deep", "invisible", "unindexed" web is
nowadays
indeed getting more and more indexed.
Panta rhei, as Heraclitus
is supposed to have stated:
"tout change":
indeed the
main
search engines are now indexing parts of the "deep web"
stricto sensu: for instance non-HTML
page formats (pdf, word, excel, etc.) are now routinely translated into HTML and therefore do appear
inside the SERPs. The same now holds true for dynamic pages generated
by database applications like CGI or ASP
(id est: software à la Cold
Fusion and Active Server Pages), which can be indexed by search engines'
crawlers as long as
there exists a stable URL
somewhere. Ditto for all those
script pages containing a "?" inside their URLs.
As an example, the
mod_oai
Apache module, that uses the
Open Archives Initiative Protocol for
Metadata Harvesting, allows search engines' crawlers (and also all assorted
bots that fellow
searchers might
decide to use) to discover new, modified
and/or deleted web resources from many a server.
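As a rough sketch of what such metadata harvesting looks like (the endpoint below is pure guesswork: every mod_oai installation publishes its own base URL), an OAI-PMH request is nothing more than an HTTP GET carrying a "verb" parameter:
# first ask the repository to identify itself, then list its Dublin Core records
me@mybox:~$ wget -q -O - "http://www.example.org/modoai?verb=Identify"
me@mybox:~$ wget -q -O - "http://www.example.org/modoai?verb=ListRecords&metadataPrefix=oai_dc"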
Some useful indexing "light" is now therefore being thrown onto that dark and invisible deep web of old.
Global web search engines have nowadays evolved into direct -and successful- competitors of library catalogues, even
if many librarians do not seem to have noticed. What's even more important: various large "old" library databases
have been fully translated
into HTML pages
in order to be indexed by the
main web search engines.
Note that, according to the
vox populi, all
the
main search engines' indexes
together (which do not overlap that much among them, btw) just cover (at best) less than 1/5
of the total probable content of the web at large, and thus miss a great part of the deep web's content.
In fact most proprietary databases (and many openly accessible ones) still carry
complex forms, or assorted javascript and flash scripts -and
sometimes even simple cookies- that
(if any content at all) can be gathered by the search engines' crawlers.
As seekers that use
bots well know,
some nasty scripts can even trap the visiting bots, and the
search engines'
spiders and
crawlers, and send them into infinite loops, or worse.
These indexing difficulties surely persist even today.
Yet, as this text will show, the deep web landscape is clearly undergoing deep changes, and the "deep", unindexed,
proprietary and closed
web is shallowing out in a very positive sense :-)
"Bad" and "good" databases |
The bad databases
Script tricks are only a tiny
part of the indexing difficulties that search engines encounter on the deep web:
it is worth underlining that great part of the deep web is "invisible" only because
of copyright obsession and excessive privatisation of common knowledge: billions of law-related
articles are for instance there, but
must be BOUGHT through
Lexis/Nexis,
billions of other articles are indexed inside a huge quantity of privately licensed databases
that "sell" the right to access such knowledge.
Among the most important "bad" databases are precisely such commercial licensing services.
Add to these the zillions of licensed articles, magazines, references, archives,
and other research resources reserved elsewhere to searchers
"authorized" to use them and you'll indeed have a lot of "dark commercial matter" on the deep web.
Note that the fact that the content of these databases is NOT freely available defies all economic logic:
libraries, governments and
corporations have to buy the rights to view such content for their "authorized users", and
visitors can mostly search, but not view, such content.
This classical nasty and
malignant "commercial tumor", which on the deep web was allowed to spread for years, is now -fortunately-
in retreat, thanks to the recent growth of the "
Open
Access" databases.
"Proprietary" information
Before starting any search it is always worth spending
some time looking for databases in the fields or topics of
study or research that are of interest to you.
There are real gems on the unindexed web, where -fortunately- not
all databases are locked and enslaved. Nowadays searchers will often be able to gather, next to the
list of "closed" interesting proprietary databases,
another list of openly accessible useful
databases and resources.
If you are interested in, say,
history, you are nowadays for instance simply "not allowed"
to ignore the existence of open repositories of databases like
EuroDocs: Online Sources for European History
Selected Transcriptions, Facsimiles and Translations. I mean, as a medieval buff,
I myself wish I could have had something like
this 30
years ago,
during my Frühmittelalterliche studies :-)
Sadly, however, in this silly society of ours "proprietary" information
has
not always been made
freely available to humankind.
Thus on the deep web
stricto sensu users will often be prompted
to pay for knowledge that should obviously be (and of course will be in the future) freely available.
Why such services should still be left to rot in the hands of
commercial
profiteers with almost prehistoric business models, instead of being developed, ameliorated and offered
for free to anyone by some association
of universities or by some international institution, defies
comprehension and any sound macroeconomic logic.
As we will see, searchers might however use some creative (if at times slightly dubious)
approaches
in order to access such proprietary databases "from behind", provided this is not forbidden by the legislation of their
respective countries of residence or of the countries of their proxies.
The good databases
Luckily, while the deep web
stricto sensu is still infested by proprietary databases,
some nice counter-tendencies are already snowballing down the web-hills:
there is in fact an emerging, and powerful, movement towards
open access to research
information(7).
The following examples
regard mostly repositories of journals databases
and newspapers' archives, but the same developments can be observed in all other specialistic fields
(for instance for music:
http://www.jamendo.com/en/).
Moreover, not all newspapers and magazines have locked their archives with subscription databases
(usually, especially in the States, such archives can be accessed for money through
Lexis or
Factiva).
There are COMPLETE newspaper and magazine archives searchable for free (for instance
the
Guardian's).
Curiously enough, it is not easy to find listings of such "open access archives".
Therefore -as a proof of concept- such a list was produced
in proprio :-)
It is worth pointing out that great parts of the deep web (intended both
lato and stricto sensu)
can be accessed
for academic research purposes
through some well known (and quite useful)
ad hoc search engines, repositories, directories and databases.
There are also some useful ad hoc "deep web" search engines, for instance
-
incywincy, which
incidentally has also
a quite useful (and interesting)
cache (and offers also the
rather practical
possibility to search for forms).
-
For exploring purposes you could also use
engines like http://findarticles.com/, which
indexes millions of magazine articles and offers free full-text articles from hundreds of publications
through its advanced search engine.
For instance:
"invisible web" and
"deep web".
- But there is an incredible palette
of specialized search engines, gosh: there's even a special "uncle sam" google for US "government and military" sites...
http://www.google.com/ig/usgov
Yet, despite such promising open access developments,
the still impressive dimensions of the "proprietary part" of the deep web
might push some seekers to use some
unorthodox retrieval techniques whenever they need
to harvest and/or zap the knowledge that has been buried
behind commercial locks.
Fortunately (for us!
Unfortunately for the proprietary content providers)
the
very structure of the web was made for sharing, not for hoarding nor for selling, so there's no real way
to block for long any "subscription database" against a determined seeker who knows the basic
web-
protocols AND
has managed to fetch some sound "angles" for his queries. He'll be homing in on his target wherever it might have been hidden.
Quality on the "shallow" deep web
On the
web and elsewhere seekers possess an
almost natural -and quite sound-
visceral mistrust of hype:
no searcher in his right mind would for instance
readily give his private data
to "hyper-hyped" sniffing social networks à la facebook.
And the"deep web" has for years represented just another example of an "hyper-hyped" web phenomenon. It is
pheraphs worth pointing out how a overweening excitement for all kind of hype
is
eo ipso a useful
negative evaluation
parameter.
The resources of the deep web are generally claimed to be
of better quality and relevance than those offered by the indexed web, since in an ideal world
they should have been written or validated by expert scholars and authorities
in their particular area of expertise.
Yet, as anyone who visits the proprietary deep web can notice,
many supposed "authoritative" and "knowledge rich" proprietary databases encompass, often enough,
capriciously incomplete collections, full of banal truths, repetitive and pleonastic
texts, obsolete and partisan positions
and
unfounded, unvalidated and at times rather unscientific theories. More generally, thanks
to the open access databases and repositories, the "academic" scientific content of the
invisible web and its resources are nowadays not much "deeper"
than what a researcher can find -for free- on the indexed (or indexable) web.
This incompleteness of resources is not a fault of the deep web in itself; it's just
due to the fact that we are still in a
transition period, where the forces
of old still brake the free flow of knowledge and
only a relatively limited part of the content of
the deep web of old is already indexed and thus
really available for global open peer review on the broadest scale, review
which represents
the only real possible, and necessary, quality guarantee.
The situation will be
brighter as soon as the full content of the deep web's proprietary databases is finally
opened and indexed
in toto, maybe
by new search engines that will hopefully be more competitive (and less spam-prone)
than the ones most used
(8) today.
Moreover, given the amazing rate of growth
of the open access
databases, the
proprietary databases
will have to open their content more and more
to the search engines' indexing spiders, or
risk sinking into irrelevance.
The great epochal battle for knowledge will be won when
anyone, wherever he might be, and whatever economic resources he might have, will have the possibility of
accessing at once and for free ANY book or text (or image, or music, or film, or poem)
produced by
the human race during its whole history, in any language.
This is a dream that the web could already fulfil now, and this is
the dream unfolding right now before our eyes.
Another interesting point is that the old
quality
distinction between "authorities" & "experts" on one side and "dedicated individuals" on the other is
nowadays slowly disappearing. We could even state -paradoxically and taking account of all due exceptions-
that those that study and publish
their take on a given matter for money and career purposes (most of those deep web
"authoritative experts" and almost all the young sycophants
from minor and/or unknown universities that hover around many proprietary
databases) will seldom be able to match the knowledge depth (and
width) offered by those that work on
a given subject out of sheer love and passion. If, as is true, more and more
scholarly
content is nowadays provided exclusively on the web,
this is also -and in considerable part- due to the "scholarly level" contributions
of an army of "non academic" specialists. Electronic printing differs from traditional printing, duh.
Of course seekers will have to carefully
evaluate what they have found, whoever
might have written it: the old saying
caveat emptor (and/or
caveat fur in some cases :-) should always rule our queries.
Yet he is in for a rude surprise whoever really believes that -say- a university assistant who has worked at best a couple
of years on a given matter and
who simply HAS to publish his trash in order to survive and prosper economically
could really offer more quality knowledge than a "dilettante" specialist
who has dedicated his whole life to his "thematic passion", out of sheer interest and love.
Alas, academic publications have always been infested by
wannabe experts whose
only interesting dowry is to be found in the extraordinary amount of
battological and homological repetitions you'll discover reading their writings.
The content of the invisible and unindexed deep web is often as poor as the content of the indexed "outside" web,
and maybe even more deceiving, given its aura of supposed "deep"
trustworthiness. For this reason the
nuggets and jewels indeed lurking inside many deep web's databases
need to be dug out
cum grano salis, i.e., again,
using the same sound
evaluation techniques
that searchers should always use, whenever and wherever they perform their queries.
"Deep web searching" strategies and techniques |
The use of some of the deep web searching strategies and techniques listed in the following
may be illegal in
your country, so apply them only if you are sure that you are allowed to, or change country.
Before starting
Let's state the obvious: deep web searching is a two-step process: first you have to locate the
target database(s), then you have to search within said database(s). However,
for many fellow searchers
without sufficient resources,
a further step, once the target
database on the "proprietary web" has been found, is how to enter it "from behind".
Readers might learn elsewhere the fine art of breaking into servers; here it will
suffice to recall that on the web all pages and texts, be they static or dynamically created,
must have a URL. Such a URL may
appear complex or encrypted, but it's still just a
URL.
Again: most "bad" databases have a single sign-on system: once validated, a browser session cookie
with a four-, six- or eight-hour lifetime will be stored inside the user's browser.
If the user then goes to another "bad database" protected resource,
the cookie will just be checked and the user is not required to type in his username and password again. Therefore
you can EITHER find the correct username/password combo, OR mimic a correct cookie,
which is something that seekers can easily sniff and, if need be, recreate,
using the proper
tools.
It's as simple as the
old nomen est omen truth. It is also worth pointing out that
there's no real necessity to
enter a database using its official entrance gates,
as this old searching
essay (about the Louvre) demonstrated long ago.
Moreover,
when really necessary, a searcher might decide to use
some
artillery in order to open closed web-doors.
Lighter techniques like
guessing and
luring (social engineering) or
heavier approaches like
password breaking,
database exploiting,
and/or using free VPN servers (
Virtual
Private
Networks, which offer two protocols
you can manipulate: PPTP and IPSec)
can
help seekers to find an entrance whenever simpler approaches should fail :-)
Since, as strange
as this might and should
appear,
there is still no reliable collection of the largest and most important deep web
databases,
it is worth using, for broad database investigations, the
amazing locating and harvesting power offered by various ad hoc tools
like
wget or scapy.
A list of the existing dubious and legal approaches follows.
The most obvious (if slightly dubious) approach
The most obvious approach in order to
gain access to the deep web proprietary databases, has always been to use
unprotected proxy servers
(9) located on the campus networks of
legit participating institutions.
Once even just one working "campus proxy" has been found, a searcher can download whatever he fancies to his heart's content,
provided of course that such an activity
is not forbidden by the laws of his
country of residence,
or by the laws of the country -or countries-
whose proxy he uses and chains to his "target
campus" proxy.
Taking some
anonymity precautions would probably be a good idea for those that decide to follow this approach.
A slightly more dubious bunch of approaches
It might be politically incorrect to underline this once more, but anyone might
try to access those deep web databases "from behind", without breaking any law,
if he happens to live in a country (or intends to use some proxy servers
from a country) which has no copyright laws at all, i.e. does not adhere to the (infamous)
Berne, UCC, TRIPS and WCT copyright conventions we daily suffer in euramerica.
It is perhaps useful to recall here that
the following countries do not
adhere to any such copyright convention:
Eritrea (ER), Kiribati (KI), Nauru (NR), Palau (PW), Somalia (SO), San Marino (SM),
Turkmenistan (TM), Tuvalu (TV!). This holds even more true, of course, for all servers situated in
various "difficult to control" places
(Gaza strip,
Waziristan, etc.).
In such cases, and using such proxies, a combination of
guessing,
password breaking techniques,
and possibly
luring (i.e. social engineering),
stalking,
trolling,
combing and
klebing (i.e.: finding interesting places/databases/back_entrances through referrals)
can be used and deliver results that might be useful in order to get at those
databases "from behind" (provided you have taken the necessary
anonymity precautions,
ça va sans dire).
Another, even more dubious, approach
It might also be
quite politically incorrect to point out that anyone might
decide to access most commercial databases using a
fake credit card, provided this is
not explicitly forbidden in his country (or by the country of his proxy) and provided
he has taken all the necessary
anonymity precautions.
This text won't go into credit card faking and credit card number generators
(10)
(which nowadays
you can even easily find as on-line scripts). Suffice it to say that, given the poor security levels
offered by those CVV numbers, it's no wonder that the credit card companies themselves
complain that 95% of the fraud they suffer is due exclusively to internet transactions.
Few people would feel sorry for them: it is just an anti-competitive oligarchy of a half dozen
issuers, owned by
banks that would not hesitate a minute to sell your kids or your inner organs for money, if they reckoned
they could get away with it.
It must also be said that in order to follow this approach one actually
would require neither a credit
card generator nor a fake credit card at all:
a keen eye in any airport's duty free
queue will for instance quickly spot enough
number/
name/
valid_until combinations
to be able to download at ease all the databases of the
deep web until the cows come home.
Come to think of it, a friendly waiter or seller
working in any good restaurant or shop "à la mode"
could also prove quite helpful when dealing with such credit card matters.
And now a completely, 100% legal, approach
We were just joking: you don't really need to venture inside the "gray" legal areas listed above: often
your own library may give you free
access to your target (as long as libraries are allowed to continue to exist in our patent-obsessed societies).
For instance,
you can check
here
where to access in your own country, legally and for free, the whole content of a
JSTOR "proprietary" database
of papers, journals and essays.
Especially younger searchers seem to have forgotten, or never understood,
the mighty power offered by a "physical" library
inter alia
in order to access all kinds of web-content.
This is also valid
a fortiori for the content of any good "
mediatheque": why should a searcher
waste his precious time downloading from the web compressed music, films
and/or books
when he can at once and for free get hold of, and copy, whatever
full-fledged
music/book/film he might fancy?
An addition, by ~S~ Nemo
"Finding libraries"
If your local library does not subscribe to the databases you want, you can find another
library which subscribes to them, where you can go yourself,
or maybe you could ask someone who goes there, lives there,
or works there.
Let's take for example the LexisNexis database. As the access is usually restricted by
IP (the library's IP) and the library must link to a specific page in order for that authentication
to be done, let's find that specific link. In order to do this, we can use the following banal query:
inanchor:"on campus"
lexisnexis, where we search for pages linking (hopefully) to LexisNexis using the anchor
(the underlined clickable text linking to another site) "
on campus"
(
humans are so predictable eh :-)
Now we know that the linked URLs are: www.lexisnexis.com/universe & www.lexisnexis.com/cis (
because
they have neither cache nor description text).
Using
google (
inanchor:"on campus" lexisnexis),
we find another URL: web.lexis-nexis.com/universe.
Now it is only a question of finding a library
near you using Google:
link:web.lexis-nexis.com/universe
or
link:www.lexisnexis.com/universe
or
link:www.lexisnexis.com/cis; or
using Yahoo:
link:http://web.lexis-nexis.com/universe -ajsks
or
link:http://www.lexisnexis.com/universe
-ajsks or
link:http://www.lexisnexis.com/cis -ajsks.
The 'ajsks' (or any other nonsense string) is used to stop Yahoo from redirecting to yahoo site search.
Yahoo is even better than google for this kind of query, because you can do more than
a link search (contrary to Google); you can for
instance search just the libraries in France:
link:http://web.lexis-nexis.com/universe domain:fr
(or, of course, in any other country).
Playing with forms and hidden info
A significant part of the deep web is composed of
forms that provide access
to their underlying
structured databases, so all existing techniques for structured data mining
can be usefully investigated and at times applied in this context.
Extracting data behind web-forms is a science
per se, and there is a vast
literature on such matters
(11). Many deep web forms might lead to more "specialized"
second-tier forms,
and the obvious problem, for seekers, is to create (or find!) a client-side script or bot capable of handling
deep web forms with minimal human interaction.
Note that many javascripts might alter the behaviour
of a form on the fly, which is a major problem for automation purposes.
Keep in mind, however, that web servers are far from behaving perfectly.
Sometimes they will not respond and at times they will not send
the content you expect. So if you are striving for reliability, you
always need to verify that the HTTP transaction worked
AND that the HTML text returned is (more or less) what you were
expecting.
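As a minimal sketch of both steps (the URL, the form field and the expected title string below are pure assumptions, to be adapted to the actual form you are targeting), a GET-based form can often be driven, and its answer sanity-checked, straight from the command line; for POST-based forms curl's -d option does the same job:
# submit one form field, keep the HTTP status code, then verify the returned HTML
me@mybox:~$ curl -s -o result.html -w "%{http_code}\n" "http://www.example.org/search?query=haiku"
200
me@mybox:~$ grep -q "<title>Search results" result.html && echo "looks fine" || echo "unexpected content, retry"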
Some special "deep web search engines" can cut some mustard as well, for instance the already
listed
incywincy,
which has a useful
"search for forms" option.
To extract metadata a searcher could also use the
GNU libextractor,
which supports
a lot of file formats. Another possibility is
wvware (which is currently used by
Abiword as its MS Word importer), and which can access the
infamous *.doc version-control information. So this is
the tool for those interested in
finding, on the invisible web, "invisible" information still hidden within published documents
(different spelling or dates, slight reformatting or rewording, but often enough
even
previous different versions of the same text and
very interesting
corrections).
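A quick sketch of both tools in action (the file names are made up, and the exact output obviously depends on what metadata the author forgot to strip):
# libextractor's command-line front-end dumps whatever metadata it recognises
me@mybox:~$ extract downloaded_paper.pdf
# the wv package's wvSummary does the same for the embedded summary info of a Word file
me@mybox:~$ wvSummary interesting_report.doc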
~S~ Nemo gave another interesting example! Have a look at the
tricks
explained for the former HotBot, which can be adapted and used through Yahoo.
The final trick with the text snippets
This is based on an amazing property of the web: whatever is published in digital form
is bound to be copied and reappear somewhere else, yep, even
despite its "patents" :-)
Usually there's a direct correlation "celebrity" ==> "frequency" ==> ease of retrieval:
the more "famous" a text, the easier it will be to find.
E.g. "The Lord of the Rings":
"Gollum
squealed * squirmed * clutched at Frodo * they came to bind his eyes" -babe. Notice the
-babe
anti-spam
filter, since on the web the whole
text of "famous" books is used
for spamming purposes -interpolated with commercial garbage- by the beastly
SEOs.
It's all nice and dandy for any famous English text, but what about other languages and less known texts?
A search for -say- the French version of a Japanese haiku might take
a
little longer,
yet even this target will be
somewhere on the web,
of course.
So how do we apply this "snippet" search-approach to the deep web "proprietary problems"?
Well, first we seek our target through JSTOR, for instance The Musical Quarterly, which, as we can see at
http://www.jstor.org/journals/00274631.html,
has a JSTOR "fixed wall" archive limit to the year 2000.
Then we completely ignore JSTOR and we seek The Musical Quarterly by itself:
http://mq.oxfordjournals.org/archive/ and
then we search our target there.
Now we seek
an abstract or a snippet of our target:
http://mq.oxfordjournals.org/content/vol88/issue1/index.dtl
Now we can find the snippet "..." (
to be developed during workshop: if published
the useful link would disappear)
Incidentally, even a simple tool like
EtherApe
makes it easy to see how a target
might encompass various IPs.
For instance in this case, accessing
the JSTOR portal, contact was made with two different servers:
192.84.80.37 (Manchester University, UK) and 204.153.51.41 (Princeton University, States): this kind of info
comes in
handy when trying to find a campus proxy.
Here is another simple, image-related "guessing" example:
Visit the useful & relatively open "deep web" Art & images database at
http://www.paletaworld.org/
Try its
advanced search form. Let's for instance
try
Theme="War and battles" and country = "Greece" (leave "all countries" for the location of
the paintings)
Select -say- Konstantinos Volanakis' Warship, at the bottom of the result page.
And the 1833_1.jpg image will pop up in a javascript
window without URL (javascript:Popup(1833).
Now you could either check its real address using a
tool à la wireshark,
or simply guess it: since the name of the image is
1833_1.jpg we'll just need to find out the subfolder.
You can bet it will be something like "images" or "pics" or "dbimages",
or "dbpics". And indeed, lo and behold:
http://www.paletaworld.org/dbimages/1833_1.jpg.
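For the lazy, the guessing itself can be automated; a tiny sketch (the candidate folder names are just the usual suspects, nothing more) that probes each likely subfolder and prints the HTTP status code it gets back:
me@mybox:~$ for dir in images pics dbimages dbpics; do
>   curl -s -o /dev/null -w "%{http_code}  /$dir/1833_1.jpg\n" "http://www.paletaworld.org/$dir/1833_1.jpg"
> done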
(
to be developed during workshop with many other guessing examples)
The best approach, however...
When accessing scholarly journals the best approach, however, is to ignore as much as possible
the proprietary commercial schemes concocted by the
patent aficionados and access instead "good databases" (and repositories) like the ones
listed
above.
As an example, the
Directory of Open Access Journals (DOAJ)
alone, a collection of full text, quality controlled scientific
and scholarly journals that aims to cover
all subjects and languages had
-as of January 2008- 3073 journals in the directory, with 996 journals searchable at article level
and a total of 168592 full text open articles.
Estote parati & anonymity matters
The following "general" advice regard all kind of activities on the web, and is not
limited to, though being very relevant for, deep web database perusing (especially when dealing with
proprietary "pay per view" databases).
Estote parati
The browser, your sword
Do yourself a --huge-- favour and use a really powerful and quick browser like
Opera.
Besides its many anti-advertisement and quick note-taking bonuses, its incredible speed is
simply invaluable for time-pressed seekers.
You can of course rig firefox, hammering inside it a quantity of add-ons
and extensions, in order to get it working
almost as well as Opera does out of the box, yet firefox will still
be much slower than Opera.
For your text data mining purposes, consider also using ultra-quick
(and powerful)
CLI-browsers à la
elinks instead of the slower
GUI browsers:
you don't really need
all those useless images and awful ads when mining and gathering books and essays, do you?
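A quick illustration of the CLI-browser approach: elinks can simply dump the rendered, text-only version of any page, ready to be grepped, diffed or fed to your own scripts:
# dump the formatted text of a page to a file, images and ads stripped by design
me@mybox:~$ elinks -dump "http://www.searchlores.org/" > page.txt
me@mybox:~$ grep -i "deep web" page.txt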
The operating system, your shield
You MUST choose a good operating system like
GNU/Linux: for serious web-searching purposes
you cannot and should not use
toy operating systems like windows;
it would be like walking
blindfolded on a web of mines: much too slow, no anonymity whatsoever and quite
prone to dangerous (and potentially serious) viral/rootkit problems.
Use the powerful and clean
Ubuntu (debian) distribution,
or
try one of the various "wardriving oriented" GNU/Linux versions, e.g. Wifislax or Backtrack, which you'll then
either just boot live on your Ubuntu box (as an added anonymity layer, see
below) or simply
install
in extenso, instead of Ubuntu.
Anonymity matters
The first and foremost anonymity rule is to AVOID happily smearing your personal
data around: which means you shouldn't have your real data
AT ALL on your boxes.
Anonymity is VERY important
on the web. As a general rule, you need to protect your data: if you foolishly allow private
companies to collect them, any bogus attorney with half an excuse
will easily get hold of them.
A good approach is to create from the beginning a false -but credible- complete
identity, so no "John Smith", no "Captain Spock" and no
"Superterminator": open the phone book instead, and chose a Name, a Surname and an address from three different
pages. Create "his" email on -say- yahoo or gmail, but
also on a more esoteric and less known
free email provider, that
you'll
use whenever some form-clown will ask for a "non-yahoo" email-address.
You'll also need
a relatively "anonymous" laptop for your wardriving purposes, hence, if possible, bought cash (no credit cards' track)
in another country/town, a laptop that you will henceforth connect to the web ONLY for wardriving
purposes.
Your settings on your laptop (in fact, on all your boxes) should always look as "banal" (and hence as anonymous)
as possible, either matching your bogus identity (Laura@mypc if your chosen fake first name was "Laura") or just
combinations that might at least
raise some confusion: cron@provider, probingbot@shell, grandma@pension_laptop, etc.
Once on line, always watch your steps and never never never smear your real data around.
When using your "dedicated" laptop you should only wardrive, browse and download:
NO private posting/uploading personal stuff/emailing friends/messageboard interactions/IRC chatting...
NEVER.
Other assorted anonymity tips
Learn thoroughly how to use the mighty
kismet
and all other useful wardriving and tracing tools.
Keep an updated list of the quickest wi-fi connections you'll have found through wardriving: location
and speed.
Choose preferably
places with a bunch of
quick connections that
you can reach in a relatively unobtrusive manner, for instance sitting quietly
inside some brasserie or cafe.
Always choose WEP "protected" access points if you can (WEP "protection" is a joke that any kid can
crack in 5 minutes flat): such
"WEP protected" access points tend (in general) to be much quicker than any totally open and
unprotected access point.
Have a good list of quick proxies at hand. Some of the "unconventioned" nations
listed above offer
interesting everlasting proxies that seem rather
unbothered
by the euramerican copyright and patent lobbying mobs.
Regularly use a
MACchanger for your laptop in order to
systematically (but randomly) change its wifi MAC signature.
You're now all set for your relative anonymity, and you are ready to visit (and enter) any database,
using
whatever means might work, even the "dubious" approaches described above, if you deem it necessary.
Consider that even when observing all possible precautionary measures,
if someone with enough resources and power really wants to snatch you,
he probably will: when entering the gray areas of the web, being a tad paranoid is probably a good idea.
Just in case, change your access points often
and -of course- alter irregularly, but systematically,
your preferred wardriving locations and
your
timing &
browsing patterns.
Wget, scapy and other wonder tools
Searchers should master all the following tools.
The unwashed should begin playing with each one of them for at least a couple of days, gasping in awe.
Wget
GNU/Wget is a very powerful free utility for non-interactive download of files from the Web.
It supports http, https, and ftp protocols, as well as retrieval through http proxies. Cosmic power
at your fingertips.
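A minimal usage sketch (the URL is a placeholder; the polite --wait pause and the shallow recursion level are deliberate, so as not to hammer the target server):
# resumable, two-level-deep mirror of one directory, without wandering up to the parent
me@mybox:~$ wget --continue --recursive --level=2 --no-parent --wait=2 \
    "http://www.example.org/journals/archive/"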
cURL
curl is a command line tool for transferring files with URL syntax,
supporting FTP, FTPS, HTTP, HTTPS, SCP, SFTP, TFTP, TELNET, DICT, LDAP, LDAPS and FILE.
curl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload,
proxies, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, kerberos...),
file transfer resume, proxy tunneling and a busload of other useful tricks (custom headers,
replace/remove internally generated headers,
custom user-agent strings, custom referrer strings etc.).
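A small sketch combining a few of those features (proxy address, user-agent string and target URL are of course placeholders):
# fetch a page through an http proxy, following redirects, storing and re-sending cookies
me@mybox:~$ curl -L -x "http://proxy.example.org:3128" -A "Mozilla/5.0" \
    -c cookies.txt -b cookies.txt -o page.html "http://www.example.org/archive/2008/"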
Pavuk
Pavuk is a program used to mirror the contents of WWW documents or files.
It transfers documents from HTTP, FTP, Gopher and optionally
from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.
Etherape
Network traffic is displayed graphically. The more "talkative" a node is, the bigger its representation.
Most useful "quick checker" when you browse around.
-
Node and link color shows the most used protocol.
- User may select what level of the protocol stack to concentrate on.
- You may either look at traffic within your network, end to end IP, or even port to port TCP.
- Data can be captured "off the wire" from a live network connection, or read from a tcpdump capture file.
- Live data can be read from ethernet, FDDI, PPP and SLIP interfaces.
- The following frame and packet types are currently supported: ETH_II, 802.2, 803.3, IP, IPv6, ARP, X25L3, REVARP, ATALK, AARP, IPX, VINES, TRAIN, LOOP, VLAN, ICMP, IGMP, GGP, IPIP, TCP, EGP, PUP, UDP, IDP, TP, IPV6, ROUTING, RSVP, GRE, ESP, AH, ICMPV6, EON, VINES, EIGRP, OSPF, ENCAP, PIM, IPCOMP, VRRP; and most TCP and UDP services, like TELNET, FTP, HTTP, POP3, NNTP, NETBIOS, IRC, DOMAIN, SNMP, etc.
- Data display can be refined using a network filter.
- Display averaging and node persistence times are fully configurable.
- Name resolution is done using standard libc functions, thus supporting DNS, hosts file, etc.
- Clicking on a node/link opens a detail dialog showing protocol breakdown and other traffic statistics.
- Protocol summary dialog shows global traffic statistics by protocol.
Wireshark
Wireshark, the world's most powerful network protocol analyzer, a sort of tcpdump on steroids,
is a GNU licensed free software package that
outperforms tools costing thousands of euro and
has an incredible bounty of mighty features (but learn how to use its filters, or you'll sink inside your
captured data):
- Deep inspection of hundreds of protocols,
- Live capture and offline analysis
- Captured network data can be browsed via a GUI, or via the TTY-mode
TShark utility (a small TShark sketch follows this list)
- Most powerful display filters
- Rich VoIP analysis
- Read/write many different capture file formats:
tcpdump (libpcap),
Catapult DCT2000,
Cisco Secure IDS iplog,
Microsoft Network Monitor,
Network General Sniffer® (compressed and uncompressed), Sniffer® Pro, and NetXray®,
Network Instruments Observer,
Novell LANalyzer,
RADCOM WAN/LAN Analyzer,
Shomiti/Finisar Surveyor,
Tektronix K12xx,
Visual Networks Visual UpTime,
WildPackets EtherPeek/TokenPeek/AiroPeek,
and many others
- Capture files compressed with gzip can be decompressed on the fly
- Live data can be read from Ethernet, IEEE 802.11, PPP/HDLC, ATM,
Bluetooth, USB, Token Ring, Frame Relay, FDDI, and others (depending on your platform)
- Decryption support for many protocols, including
IPsec,
ISAKMP,
Kerberos,
SNMPv3,
SSL/TLS,
WEP,
and WPA/WPA2
- Coloring rules can be applied to the packet list for quick, intuitive
analysis
- Output can be exported to XML, PostScript, CSV, or plain text
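As promised above, a minimal TShark sketch (interface name and capture filter are assumptions to adapt): grab only web traffic on the wireless interface and save it for later inspection in the GUI:
# the capture filter keeps only tcp port 80 traffic; -w writes a pcap file readable by wireshark
me@mybox:~$ sudo tshark -i wlan0 -f "tcp port 80" -w web_traffic.pcap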
Scapy
Scapy (created by my friend Philippe Biondi) is a mighty interactive packet manipulation program. Cosmic power, again,
for anyone that
will invest some time in mastering it.
This little marvel is able to forge or decode packets of
a wide number of protocols, send them on the wire, capture them, match requests and replies, and much more.
It can easily handle most classical tasks like scanning, tracerouting, probing, unit tests, attacks or network
discovery (it can replace hping, 85% of nmap, arpspoof, arp-sk, arping, tcpdump, tethereal, p0f, etc.). It also
performs very well at a lot of other specific tasks that most other tools can't handle, like sending invalid frames,
injecting your own 802.11 frames, combining techniques (VLAN hopping+ARP cache poisoning, VOIP decoding on WEP
encrypted channel, ...), etc :-)
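A tiny interactive sketch of the kind of thing scapy makes trivial (the target below is a made-up documentation address, 192.0.2.10, to be replaced by a host you are allowed to probe):
me@mybox:~$ sudo scapy
>>> # send TCP SYNs to two ports and print a summary of the answers received
>>> ans, unans = sr(IP(dst="192.0.2.10")/TCP(dport=[80, 443], flags="S"), timeout=2)
>>> ans.summary()
>>> # a quick multi-hop traceroute, scapy style
>>> res, unans = traceroute(["www.example.org"], maxttl=15)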
Other tools
(Seekers' Suggestions)
mtr
Finally there are some wondrous CLI tools that -amazingly enough- many searchers don't use:
For instance the mighty useful, Dutch mtr ("my traceroute":
all-in-one traceroute+ping... and more!):
me@mybox:~$ sudo mtr www.searchlores.org
Use the "n" key to switch between DNS names and IPs.
[+fravia]
the burp suite
The burp suite is a java application (so, cross platform) that functions like a one-shot proxomitron swiss army knife,
allowing you to edit any and all HTTP requests and responses as you see fit.
It also has a load of other features (some useful for searching, some for other things)
that are definitely worth getting acquainted with.
[~S~ ritz]
(c) III Millennium: [fravia+], all rights
reserved, all wrongs reversed
Notes
1) Terminology nightmares
The "Deep web" is also known as "Unindexed web" or "Invisible web", these terms being used nowadays
in the literature as equivalents.
The "non deep", "indexed web" is on the other hand also known as "Shallow Web", "Surface Web" or "Static Web".
We could not resist adding further confusion to this searchscape, stating
in this paper that the so-called "deep web" is in reality nowadays rather shallow (hence the "shallow deep web" oxymoron),
if we calculate -as we should- depth as directly proportional to "unindexability".
The "deep web" (of old) is in fact more and
more indexed (and indexable). This is due, among other things, to the mighty spreading of open access repositories and
databases that
follow a publishing model ("knowledge should be free for all, at last")
that
harbingers a well-deserved doom for the "proprietary knowledge"
and "pay per view"
models of the now obsolete web of old:
deep, invisible, proprietary and ultimately rather useless
because badly indexed or
impossible to index.
2) What is a seeker?
Ode to the seeker
Like a skilled native, the able seeker has become part of the web. He knows the smell of his forest: the foul-smelling mud
of the popups, the slime of a rotting commercial javascript. He knows the sounds of the web: the gentle rustling
of the jpgs, the cries of the brightly colored mp3s that chase one another among the trees, singing as they go;
the dark snuffling of the m4as, the mechanical, monotone clanking of the huge, blind databases, the pathetic cry
of the common user: a plaintive cooing that slides from one useless page down to the next until
it dies away in a sad, little moan.
In fact, to all those who do not understand it, today's Internet looks more and more
like a closed, hostile and terribly boring commercial world.
Yet if you stop and hear attentively, you may be able to hear the seekers,
deep into the shadows, singing a lusty chorus of praise to this wonderful world of theirs --
a world that gives them everything they want.
The web is the habitat of the seeker, and in return for his knowledge and skill it satisfies all his needs.
The seeker does not even need any more to hoard on his hard disks whatever he has found: all the various images,
music, films, books and whatnot that he fetches from the web... he can just taste and leave there what he finds,
without even copying it, because he knows that nothing can disappear any more: once anything lands on the web,
it will always be there, available for the eternity to all those that possess its secret name...
The web-quicksand moves all the time, yet nothing can sink.
In order to fetch all kinds of delicious fruits, the seeker just needs to raise his sharp searchstrings.
In perfect harmony with the surrounding internet forest, he can fetch again and again, at will,
any target he fancies, wherever it may have been "hidden". The seeker moves unseen among sites
and backbones, using his anonymity skills, his powerful proxomitron shield and his mighty HOST file.
If need be, he can quickly hide among the zombies, mimicking their behavior and thus disappearing into the mass.
Moving silently along the cornucopial forest of his web, picking his fruits and digging his jewels,
the seeker avoids easily the many vicious traps that have been set to catch all the furry, sad little
animals that happily use MSIE (and outlook), that use only one-word google "searches", and that browse
and chat around all the time without proxies, bouncing against trackers and web-bugs and smearing all
their personal data around.
Moreover the seeker is armed: his sharp browser will quickly cut to pieces any slimy javascript or
rotting advertisement that the commercial beasts may have put on his way. His bots'
jaws will tear apart any database defense, his powerful scripts will send perfectly balanced
searchstrings far into the web-forest.
3) &
8) Most used search engines
Data extrapolated from nielsen,
hitwise and alia.
The most
used search engines
are at the moment (January 2008) on planetary scale google (58%), yahoo (20%),
MSNSearch (7%) and ASK (3%), with slight variants
for the usage in the States (where MSNSearch covers 15% of all searches, while AOL has a 5% usage).
Note that these search engines are the
most used, not the
best ones (just try out
exalead :-)
4) Bergman
Bergman, M.K., "
The Deep Web: surfacing hidden
value", Journal of Electronic
Publishing, Vol. 7, No. 1, 2001. It's almost ironic to find this document inside brightplanet, which is
part of the awful and next to useless
"
completeplanet" deep web search engine.
Of course the same document
is available
elsewhere as well.
5) Williams
Williams, Martha E.,
"The State of Databases Today: 2005".
In Jacqueline K. Mueckenheim (Ed.),
Gale Directory of Databases,
Vol. 1: Online Databases 2005, part 1 (pp. xv-xxv). Detroit, MI: Thomson Gale, 2005.
6) Lewandowski
See "
Exploring
the academic invisible web"
by Dirk Lewandowski and Philipp Mayr, 2007.
7) Is 97% of the deep web publicly available?
Some researchers even arrived (many years ago) at
astonishing
conclusions in this respect:
"
One of the more counter-intuitive results is that 97.4% of deep Web sites are publicly available without restriction;
a further 1.6% are mixed (limited results publicly available with
greater results requiring subscription and/or paid fees); only 1.1% of results are totally subscription or
fee limited (Michael K. Bergman, "white paper" in
The Journal of Electronic Publishing,
August 2001, Volume 7, Issue 1)"
Beats me how anyone can dare to state on such deep matters
exact percentages like "97.4%", "1.6%" and/or
"1.1%" without deeply blushing.
9) Proxies
We take for granted that searchers and readers already know how to find, choose, use and chain
together
proxies
in order to bypass censorship attempts
and to probe interesting servers and databases. The main point is that
a proxy does not simply pass your http request along to your target:
it generates a new request
for the remote
information.
The target database is hence probed by a proxy in -say- Tuvalu, subjected to Tuvalu's non-existing copyright laws,
while the searcher's own laptop
does not access the target database, thus somehow respecting
the dogmatic decrees of the commercial powers that rule us.
Still, chaining (and rotating) proxies is a fundamental precaution when accessing and probing databases
along the gray corridors
of the web. You can and should use ad hoc
software for this purpose, but you should also know how to chain
proxies
by hand, simply adding a
-_- in the address field of
your browser, between their
respective URLs:
http://www.worksurf.org/cgi-bin/nph-proxy.cgi/-_-/http://anonymouse.org/cgi-bin/anon-www.cgi/http://www.altavista.com,
(If you use numeric IP proxies, just chain them directly using a simple slash
/ between their URLs).
10) Credit cards faking
If you really want to delve into credit cards' security,
start from
Everything you ever wanted to know about CC's
and from
Anatomy of Credit Card Numbers.
11) Extracting forms
There is a huge literature.
See for instance "
Extracting Data Behind Web Forms"
by Stephen W. Liddle, David W. Embley, Del T. Scott & Sai Ho Yau, 2002.
(Statistical predictions about the data that must be submitted to forms. Dealing
with forms that do not
require user authentication)
Also check "
Light-weight Domain-based Form Assistant:
Querying Web Databases On the Fly" by Zhen Zhang, Bin He & Kevin Chen-Chuan Chang, 2005.
(Creating a "form assistant" and discussing the problem of binding constraints & mandatory templates)
(c) III Millennium: [fravia+], all rights reserved, reversed,
reviled & revealed