One day I was searching for information on the deep web, prompted by an article I had
read on another page, and one of the results that came up was a PDF on the subject.
The essay attempted to put numbers on the size of the deep web versus the visible
web, and it sparked off many thoughts. I then started to look at the URL at which
the essay lived. Backstepping to the root path gave an interesting-looking site,
which contained a pointer to a tool designed to search for data in deep
or hidden databases.
I have been following the [oslse project] for quite some time, and have also been
assembling a list of URLs for queries on search engines, so that I could
add more engines to my local search bot and perform wider and narrower searches
without having to visit the individual search pages.
So, I followed the pointer and came to the home site of the tool.
I read with much interest the claims for this tool, many of which I have seen
with other tools of this type, but a couple jumped out at me:
1. You could perform query types on engines which did not support
that particular query type themselves.
2. The sheer number and variety of engines it supported.
The next step was to read more and see what it could do.
While reading their details I decided to view their updates page, where all the
data files were provided, nicely labelled for you. This sounded too easy to be true.
My main thought was that if this tool could perform all these searches, it must
contain the information needed to perform them. It was this information I was
after, not the product itself; it would save me the time of doing the work myself.
One of the main themes that came up in the OSLSE was how to parse the results pages
and extract only the results and nothing else. This was another reason this tool held
my interest, as it seemed able to perform types of queries not actually supported
by the search engine.
After downloading the files, the first things to be checked were the data file
updates (quite old!). After looking at these, one of them seemed to be encrypted
in some way, but still seemed to retain its original layout, as repetitive patterns
and linefeeds were still present.
I have also seen a number of similar tools becoming available on the web. My thought
is that these will become more popular and widespread as more people become frustrated
with not being able to find the things they are looking for using conventional search
engines, and are prepared to pay a small fee to cut down on their searching time.
(I mean people who have not invested, and are not interested in investing, time
and effort in [learning] these most valuable skills.)
Note, as a matter of interest, that several military
and academic institutions recommend this tool and use it on training courses ;)
This does not mean we should take their recommendation without thinking about it and
seeing how the program works and whether it can be of help; it may prove worthless - how
will we know unless we check first?
Other people choose the approach of writing their own tools, which they can then
extend and customise to their own requirements.
The fact that the process would probably be repeated with other similar tools (in
my quest for more specialised search engine data) had a real impact on the approach
taken, as reversing every single one of them to get the data out would take a lot
of time.
I knew that getting the encrypted file back to plaintext would be quite easy, as parts
of it were visible in a hex view of the main file, so I knew what to look for, but
I decided on a different approach.
So I decided to put the task of looking inside the software to one side, and wondered
if there was a really quick and easy (almost Zen) approach to getting the data out of
this program, without altering or reversing it.
Whilst sitting in my comfortable chair, sipping my favourite drink and listening
to some appropriate music, a thought came to me: all I needed to be able to do was
see the queries it sends out, as that would give me the query strings. So how to
do this? I have a local proxy which logs all web requests to a nice log file;
normally, when evaluating software, I run the program through the proxy so it can
be checked for spyware or adware components (just in case). If this program was run
and its requests went through the proxy, the result would be a log file containing
all the URLs requested by the program - which would be the query strings for the
search engines. So all I needed to do was point it at the local proxy and then
enter a query string that would be easy to pick out ("AAAQUERYAAA").
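Just to illustrate the idea: with such a marker in place, a few lines of Python are
enough to pull the query strings out of the proxy log afterwards. This is only a
rough sketch of my own - the log file name, and the assumption that each requested
URL appears as a whitespace-separated field on its log line, are placeholders rather
than any particular proxy's real format.

    # Sketch only: scan a proxy log for the marker string and print every
    # query URL containing it. Log name and format are assumptions.
    MARKER = "AAAQUERYAAA"
    LOG_FILE = "proxy.log"   # hypothetical path - use your own proxy's log

    with open(LOG_FILE) as log:
        for line in log:
            if MARKER in line:
                # Pull out whichever field looks like a URL.
                for field in line.split():
                    if field.startswith("http://") or field.startswith("https://"):
                        print(field)
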
So the first step was to clean the log file of my local proxy ;)
Then to install the program and run it whilst not connected to the internet.
At this point I encountered a problem - the program would not run on my PC. It crashed
every time it was run, before doing anything, so it was removed from my PC.
Then I called a friend and said I had found this nice tool that might be of interest,
but that it did not like my PC or something on it. Soon another PC with a clean install
was sitting on the desk - he installed the software and agreed to the license
on HIS PC! (It ran fine on his PC! umm.)
The first time the program was run, it was done offline and set to point to the local
proxy. The proxy was set up to return a dummy good page for any 404 errors, which
contained a couple of valid but distant links. The proxy was configured to allow
requests to these pages and return valid pages from the cache.
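For anyone without such a proxy to hand, the sketch below shows the kind of setup
described. It is NOT the proxy I actually used (that was an existing off-the-shelf
local proxy); the port, log file name and stub links are placeholders of my own, and
headers are not relayed faithfully. It only illustrates the two behaviours that matter
here: log every requested URL, and hand back a fake "good" page containing a couple
of links when the real request cannot be served.

    # Sketch of a logging proxy that serves a stub "good" page when the real
    # request cannot be completed (offline, 404, ...). Placeholders throughout.
    import http.server
    import urllib.request

    LOG_FILE = "proxy.log"
    LISTEN_PORT = 8080        # point the program under test at localhost:8080

    STUB_PAGE = (b"<html><body>"
                 b"<a href='http://www.example.com/one'>link one</a> "
                 b"<a href='http://www.example.org/two'>link two</a>"
                 b"</body></html>")

    class LoggingProxy(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            # In a proxy-style request the full URL arrives in self.path.
            with open(LOG_FILE, "a") as log:
                log.write(self.path + "\n")
            try:
                with urllib.request.urlopen(self.path, timeout=10) as resp:
                    body, status = resp.read(), resp.status
            except Exception:
                # Unreachable or error: return the fake results page instead.
                body, status = STUB_PAGE, 200
            self.send_response(status)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        http.server.HTTPServer(("", LISTEN_PORT), LoggingProxy).serve_forever()
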
Now that the program was configured to point to the local proxy, a basic search was
done. When the search was complete, the proxy logs were checked and, lo and behold,
all the query strings were present and correct. It was also noticeable that the links
present on the 404 pages (well - the fake good results) had also been requested by
the program.
This made me think: so this program had sent the queries to the search engines,
and when the results were returned it had requested the links on the results page.
This I thought was very interesting. So it must be doing some processing on the
pages pointed to by the results, which suddenly made me think *OF COURSE* - this
is how it is able to perform queries of a type not supported by a particular
search engine. It must do the query using a supported one and then do the
extra work itself - a very nice idea. Suddenly the possibilities seemed endless
and my mind started to wander to ideas of my own, but back to the project.
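If that guess is right, the mechanism would look roughly like the sketch below: send
a query the engine does support, grab the links from the results page, then apply the
"unsupported" part (here an exact-phrase test) locally on the fetched pages. The engine
URL template and the phrase test are purely my own illustration, not the tool's actual
code.

    # Sketch of the guessed mechanism: supported keyword query first, then the
    # unsupported filtering (exact phrase) done locally on the fetched pages.
    import re
    import urllib.parse
    import urllib.request

    ENGINE_TEMPLATE = "http://searchengine.example/search?q=AAAQUERYAAA"  # placeholder

    def fetch(url):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except Exception:
            return ""

    def search_with_phrase(keywords, phrase):
        # 1. Ordinary keyword query that the engine does support.
        query_url = ENGINE_TEMPLATE.replace("AAAQUERYAAA",
                                            urllib.parse.quote_plus(keywords))
        results_page = fetch(query_url)
        # 2. Grab the links from the results page (crudely - every href).
        links = re.findall(r'href="(http[^"]+)"', results_page)
        # 3. Do the unsupported part ourselves: keep only pages that
        #    actually contain the exact phrase.
        return [link for link in links if phrase in fetch(link)]

    if __name__ == "__main__":
        for hit in search_with_phrase("deep web size", "invisible web"):
            print(hit)
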
Another thought hit me like a brick on the head: HOLD ON !! I had not formatted
my fake 404 page like the results page of a query, mainly because I had not
been expecting it to grab those pages as well. But given that it had grabbed
them - why had it? The links on my page in no way looked like a results page.
I had assumed that they parsed the results pages and had rules
for picking the links, so you did not end up grabbing loads of adverts. But this
simple fact (that it had grabbed my links too) proved that they must not, and
must handle that in another way - if they do at all!
Could it be that they have a set of rules for throwing away links, and grab
all the ones that pass - without bothering to check whether they are really
results or just links on the page?
At this point I posted a message to the messageboard. It was simple and just
gave a URL and asked if anyone knew anything about it, without shouting about
what it was or could do. I posted in this way so it could easily be overlooked
and discarded by people just wandering by, and wondered if anyone would pick up on
the same points if they followed similar lines of thought. A teaser was left
for those who wanted to see.
When no one sent a reply after some days, the thought was that it had been too
obscure and should have had a red neon sign on it. Then RH posted
a reply which was certain to grab attention - so at least one person had
seen the potential. This was the point at which Laurent saw the light and
joined in, which resulted in me writing the essay you are reading.
The process described in this essay is very simple, but often overlooked
or not considered in projects of this type. It provided me with all the
required information without violating the license of the software.
Notably, it is also an approach which can be used with other similar tools. If
the user supplies a search string which can be used as a marker - to be replaced
with their own query later - then the actual query URL string for each search
engine can be retrieved with the minimum of effort. These can then be inserted
into a local search bot, which replaces the marker with the user's query.
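A minimal sketch of that idea, assuming the query URLs have already been captured
with the marker in place (the engine templates below are placeholders, not the real
captured strings):

    # Sketch: each captured query URL, marker included, becomes a template;
    # a local search bot just swaps the marker for the URL-encoded user query.
    import urllib.parse

    MARKER = "AAAQUERYAAA"

    ENGINE_TEMPLATES = [                       # placeholders, not real engines
        "http://engine-one.example/search?query=AAAQUERYAAA&num=20",
        "http://engine-two.example/find?q=AAAQUERYAAA&type=all",
    ]

    def build_queries(user_query):
        encoded = urllib.parse.quote_plus(user_query)
        return [template.replace(MARKER, encoded) for template in ENGINE_TEMPLATES]

    if __name__ == "__main__":
        for url in build_queries("hidden databases"):
            print(url)   # feed these to your own fetching / parsing code
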
The above observations on how this tool works have also given rise to a number
of new lines of thought for writers of search bots, and I think some important
lessons can be learnt from the methods employed by this tool.
The main change in my thinking concerns the need for complex rules for parsing the
results page of each engine: this need seems to be lessened by the way the
tool works - this is explained more in part two.
The approach taken in this essay also appeals because it does not utilize
any specific files from the original software or any copyrighted material:
the search engines provide the query URLs, and we do not use any
of the extra data contained in the software - just the queries, which could
be gained directly from each search engine.
I have repeated the same process with a number of other such tools,
and have gained a large amount of information on some very specialised
search engines. This process was completed without touching any code
or disassemblers.
You are probably sitting there reading this and thinking 'when do we get the
meat?' Well, it is in the second part.
This investigation changed direction during the process, as many do, and managed
to get the information without having to dig any deeper. The approach also changed
from being aimed at a specific target to a more general one which could be reused
easily and quickly.
I must point out that during the writing of this essay, I DID NOT USE THIS
PROGRAM MYSELF OR AGREE TO THE LICENSE, nor did I reverse engineer the code or data
files within it.
All the information was gained through viewing log files on a local proxy server
(information which would be present on any proxy server between you and the search
engines).
I want to point out the importance of log files, especially on proxy servers, as they
are a very valuable resource and greatly help the understanding of similar processes
and programs. They are also a valuable searching resource, as they often contain URLs
which are internal links (not external). I have found many interesting databases using
such log files.
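As a small example of what I mean (again only a sketch of mine, assuming a
hypothetical log file with one or more full URLs per line), grouping the logged URLs
by host makes those internal paths stand out:

    # Sketch: collect the URLs from a proxy log and group their paths by host,
    # so internal links that never appear on public pages become visible.
    from collections import defaultdict
    from urllib.parse import urlparse

    LOG_FILE = "proxy.log"   # hypothetical path

    paths_by_host = defaultdict(set)
    with open(LOG_FILE) as log:
        for line in log:
            for field in line.split():
                if field.startswith("http://") or field.startswith("https://"):
                    parsed = urlparse(field)
                    paths_by_host[parsed.netloc].add(parsed.path or "/")

    for host, paths in sorted(paths_by_host.items()):
        print(host)
        for path in sorted(paths):
            print("   ", path)
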
At this point I thought about looking into the program itself - not to find the
URLs for the queries, as those had already been revealed by the logs;
my main interest was that nice big engine file.
Laurent has written an essay about the process of gaining the original data files
from this program. This essay is part two of the series and delves much deeper under
the skin of the software and the way it works.
I followed the same path as Laurent for the second part of my investigation,
but leave it to him to describe the process, as his essay is well written
and concise and raises some interesting points and thoughts.
You should consider these two approaches as companions, and as approaches
to the same problem which can be used to complement each other - thus allowing
a more thorough examination of the software under the microscope: firstly by
watching it work, and then by looking into it.
Onto Part Two - Delving Deeper
Thanks to Laurent for getting me typing this essay.
Hope you enjoyed reading.
Copyright (c) 2001, WayOutThere