When I originally saw wayoutthere's post on the messageboard,
I said to myself: this is just another meta searcher, probably very similar to WebFerret & co. So I didn't
check it further until RH caught my attention again. After all, why not give
it a try? Maybe I'd
find some interesting ideas that could be useful for the oslse project.
After I wrote a first draft, the least I could do was to ask wayoutthere whether he wanted to add some notes of his own.
After all, he is the originator of this essay. It turns out that he had more than a few comments to add: he wrote a complete
essay describing his own journey through the search engines world (taking a very interesting approach),
which I strongly advise you to read.
The purpose of this essay is to gather ideas and information that could be useful for the oslse project and, more
generally, for building our own bots. We're going to find out how Lexibot, today's target, manages the
query building and the results parsing for the (many!) different search engines it supports.
During my first visit to Lexibot's home page, I came across their technical support section.
Somewhere there you can download what they call "The most recently released version of the search engines supported by the LexiBot",
namely engine2.dat
(note that it was released in July 2000). Would they really be so nice as to offer their whole treasure for nothing? Let's download that file.
No luck, the file seems to be encrypted. But from its size (almost 500Kb)
we may suppose it contains some valuable information (that makes about 0.8Kb of info per engine, not a bad ratio).
Moreover, looking at the file content, we can see lots of repetitive patterns. This probably indicates a simple
substitution cipher. So, let's put some reversing effort into it.
After disassembling the file, it becomes obvious that, even though it claims a "Very small program size",
the software isn't exactly optimized (you can find a lot of redundant code and data). Anyway,
let's search for engine2.dat, the filename we are interested in. Seven occurrences are found:
3 in sub_40866D and 4 in sub_40A0E8. A quick check shows that the file is opened in read mode in
the first function and in write mode in the second. So we may assume the decryption takes place
somewhere in sub_40866D. I'll skip the boring details and jump directly to the decryption function:
Decode_String   proc near

Flag            = dword ptr -10h
Count           = dword ptr -0Ch
cur_pos         = dword ptr -8
offset          = dword ptr -4
string          = dword ptr  8

                push    ebp
                mov     ebp, esp
                sub     esp, 10h
                cmp     [ebp+string], 0
                jnz     short loc_0_418C25
                jmp     Return

loc_0_418C25:                           ; CODE XREF: Decode_String+Aj
                mov     [ebp+cur_pos], 0

BigLoop:                                ; CODE XREF: Decode_String+A1j
                mov     eax, [ebp+string]
                push    eax
                call    _strlen
                add     esp, 4
                cmp     [ebp+cur_pos], eax
                jnb     short Return
                mov     eax, [ebp+cur_pos]
                xor     edx, edx
                mov     ecx, 80
                div     ecx             ; DX:AX / CX => AX (remainder=>DX)
                mov     [ebp+offset], edx
                mov     [ebp+Count], 0
                mov     [ebp+Flag], 0

Decod_loop:                             ; CODE XREF: Decode_String+82j
                cmp     [ebp+Count], 95
                jnb     short done
                cmp     [ebp+Flag], 0
                jnz     short done
                mov     edx, [ebp+string]
                add     edx, [ebp+cur_pos]
                movsx   eax, byte ptr [edx]
                mov     ecx, [ebp+Count]
                imul    ecx, 320
                mov     edx, [ebp+offset]
                cmp     eax, CryptTable[ecx+edx*4]
                jnz     short loc_0_418C8D
                mov     [ebp+Flag], 1
                jmp     short loc_0_418C96

loc_0_418C8D:                           ; CODE XREF: Decode_String+6Ej
                mov     eax, [ebp+Count]
                add     eax, 1
                mov     [ebp+Count], eax

loc_0_418C96:                           ; CODE XREF: Decode_String+77j
                jmp     short Decod_loop

done:                                   ; CODE XREF: Decode_String+4Aj
                                        ; Decode_String+50j
                cmp     [ebp+Flag], 0
                jz      short loc_0_418CAC
                mov     ecx, [ebp+Count]
                add     ecx, 20h
                mov     edx, [ebp+string]
                add     edx, [ebp+cur_pos]
                mov     [edx], cl

loc_0_418CAC:                           ; CODE XREF: Decode_String+88j
                mov     eax, [ebp+cur_pos]
                add     eax, 1
                mov     [ebp+cur_pos], eax
                jmp     BigLoop

Return:                                 ; CODE XREF: Decode_String+Cj
                                        ; Decode_String+27j
                mov     esp, ebp
                pop     ebp
                retn
Decode_String   endp

This function is called several times, always from the same subroutine, which seems to carry out the parsing of the engine2.dat file. On entry, arg_0 points to a crypted string, which is decrypted in place on exit.
Translated back into C, this gives something like:

#include <string.h>     /* strlen */
#include <stdbool.h>    /* bool */

/* CryptTable: 95 rows of 80 dwords (see the note below on how to extract it
   from lexibot.exe); row n holds, for each of the 80 positions, the crypted
   value of the character n+32. */
extern const int CryptTable[95][80];

void decode_line(char* string)
{
    int cur_pos = 0;
    while (cur_pos < (int)strlen(string)) {             /* BigLoop */
        int offset = cur_pos % 80;                      /* position mod 80 selects the alphabet */
        int Count = 0;
        bool Flag = false;
        while ((Count < 95) && !Flag) {                 /* Decod_loop */
            if (string[cur_pos] == (char)CryptTable[Count][offset])
                Flag = true;
            else
                Count++;
        }
        if (Flag)
            string[cur_pos] = (char)(Count + 32);       /* back to printable ASCII (32..126) */
        cur_pos++;
    }
}

Here is an extract of the crypted file, followed by the same snippet decrypted by the above function:
crypted:

<#;-2J-nI@,)}I5rc
gEta+Q6*w-Fw9)z3\Ofv{x'OnI
?:lm2s.H
2<|0gimI6
gwZ
'hlc2a.H+E:;&2!Ow,9v?;60S_6C,vKlya<B7Sa}o|LDK+.VMywVBySqVJl`5
]Y|a/kjoIE_+\2rOwE-h.Qr+&O:I6Vtf1&&3zF;*)jx%OCwev.z>nTYc'R*Y1^O[tK/M2>Tf?V^OXiQhAR8^W
>h|{/F+I3-FB
DXSfA6$H;{M]LUFX|>#qn@{IAN!s"6
]`8axT&Dn9Ai9bQLqh
gwZ
]`8axT&Dn9oY_)-:Dn'
<EEAxo_SID^T8r[
gwZ
?n|0S0<q|9:Yv5W
;5|=d;[!OZ1[]!wx{E&3~_
;5|=d;[y}a_[jO5L!M$jKN(=@\D3zfv
;5|=d;[*q5.Rct`?+wN9Uuj`@
;5|=d;[ynWZYv5\kyE&3~
;5|=d;[P}u1[9)zh\Ofv/zA1
;5|=d;[3>a_=T1'ky^:jKunm(
;5|=d;[f|OZ=TA\2k{
;5|=d;[f}u*kd!UB+0z
;5|=d;[zn'"=T1'ky^:jKunm(
;5|=d;[z|uyRTA\2k{
;5|=d;[z}iZ@d!UB+0z
;5|=d;[D@Zw=TA\2k{
;5|=d;[n@@yGkx'eXneB%unm(
;5|=d;[n3Gy#Zxzh!x;'
;5|=d;[!OZ1[9)zh\Ofv/zA1
;5|=d;[!OZ1G]@QB+0z
;5|=d;[!OZ1WA0I)Xx$B'27anO
;5|=d;[!OZ1}j)5)\Lsv?;xb#8g
;5|=d;[!OZ1nko@9\uHA'([
;5|=d;[!OZ1#h/zaZMzA'([
;5|=d;[!OZ1.X1tOXF998Y
gwZ
?n|0CQ|'&'"#h/Oz
gwZ
O8E?j6ks6
M<R"<imP6
.hRDKTf'|G=Q
gwZ

plain:

[AltaVista - Web]
EngineName=AltaVista - Web
SubName=
Category=
END
WebPage=http://www.altavista.com/cgi-bin/query?pg=aq&what=web
Action=http://www.altavista.com/cgi-bin/query?pg=aq&q=%QUERY_STRING%&r=&kl=XX&d0=&d1=
MethodType=0
Boolean=AND,OR,(),NOT,NEAR,",*
AdditionalUrlList=
END
AdditionalTextList=
[next >>]
END
SiteUrlFilters=
http://www.amazon.com/
http://companies.altavista.com/
http://ad.doubleclick.net
http://careers.av.com
http://doc.altavista.com
http://jump.altavista.com
http://live.av.com
http://local.av.com
http://mail.altavista.com
http://microav.com
http://money.av.com
http://news.av.com
http://search.thetrip.com
http://sports.av.com
http://www.altavista.com
http://www.cmgi.com
http://www.doubleclick.net
http://www.intelihealth.com
http://www.shopping.com
http://www.teragram.com
http://www.wildweb.com
END
SiteTextFilters=
END
UserName=
Password=
Description=
END
Woooowww, how nice! The first thing that catches the eye is that nice Action URL, along with the supported Boolean
operators. But we don't see any immediate trace of a results page parsing rule, i.e. anything that could tell Lexibot
where, in the results page, to find valuable info like URL, title, summary, ranking, ... OK, let's keep that in mind for
later.
Decrypting the whole engine2.dat file gives the same details for almost 600 engines. Not bad for a little reversing
effort :-)
Note that if you want to decrypt the file yourself, you'll need not only the above decode_line() function but also the CryptTable itself. You can find it in lexibot.exe, from offset 0x0DB1B0 up to 0x0E286F (7600 dwords).
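If you want a ready-made starting point, here is a minimal standalone sketch (my own code, not Lexibot's) that dumps the table straight out of lexibot.exe at the offsets quoted above and then runs every line of engine2.dat through the same decoding loop. It assumes the dwords are little-endian, and the output name engine2.txt is simply my own choice:

// decrypt_engines.cpp -- illustration only, built on the offsets and on the
// decode_line() logic shown above.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // 1. Dump the 7600-dword CryptTable out of lexibot.exe.
    std::ifstream exe("lexibot.exe", std::ios::binary);
    if (!exe) { std::cerr << "lexibot.exe not found\n"; return 1; }
    std::vector<std::int32_t> table(95 * 80);
    exe.seekg(0x0DB1B0);
    exe.read(reinterpret_cast<char*>(table.data()), table.size() * sizeof(std::int32_t));

    // 2. Decrypt engine2.dat line by line, exactly like decode_line() does.
    std::ifstream in("engine2.dat");
    std::ofstream out("engine2.txt");
    std::string line;
    while (std::getline(in, line)) {
        for (std::size_t pos = 0; pos < line.size(); ++pos) {
            int offset = static_cast<int>(pos % 80);
            for (int count = 0; count < 95; ++count) {
                if (line[pos] == static_cast<char>(table[count * 80 + offset])) {
                    line[pos] = static_cast<char>(count + 32);
                    break;
                }
            }
        }
        out << line << '\n';
    }
    return 0;
}

Of course, the offsets are only valid for the exact lexibot.exe build examined here; with another version you would have to locate the table again.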
So now we have almost 600 search engine URLs along with the correct query syntax (and even some categorization, see the Practical application below).
But the next question is: how the hell do they extract information from the results pages?
After spending some time trying to figure out where they hide their parsing rules, it became obvious
that there are NO parsing rules involved. So you probably wonder how they manage to extract valuable info from the search engines?
Well, the answer is actually quite simple. The only thing Lexibot wants from the search engine is URLs to possibly matching pages. It
doesn't care about any summary, ranking or whatever else provided by the search engines.
Roughly, what Lexibot does is first build a table of every link it finds in the results page, i.e. the URL plus the text anchor (the string between <a> and </a>).
Then it looks for any 'more results' page by keeping the links that either match the AdditionalUrlList
values (against the URL) or match the AdditionalTextList values (against the text anchor).
Then, in order to keep only the links pointing to actual results (and throw away any useless link), it applies the
SiteUrlFilters and SiteTextFilters: every link that matches a filter rule is simply discarded.
And that's all! The parsing is done.
Finally, Lexibot can fetch the actual matching pages pointed to by the search engine and compute on its own
all the necessary information like title, summary, ranking, ...
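To make the whole link filtering mechanism concrete, here is a small sketch of it as I understand it. This is my own reconstruction, not Lexibot's actual code: the structure names and the plain substring matching are assumptions.

#include <algorithm>
#include <string>
#include <vector>

// One link harvested from the results page: the href plus the text anchor
// (the string between <a> and </a>).
struct Link {
    std::string url;
    std::string anchor;
};

// The four lists taken from an engine2 section.
struct EngineConfig {
    std::vector<std::string> additionalUrlList;   // AdditionalUrlList=
    std::vector<std::string> additionalTextList;  // AdditionalTextList=
    std::vector<std::string> siteUrlFilters;      // SiteUrlFilters=
    std::vector<std::string> siteTextFilters;     // SiteTextFilters=
};

static bool contains_any(const std::string& s, const std::vector<std::string>& patterns)
{
    return std::any_of(patterns.begin(), patterns.end(),
                       [&](const std::string& p) { return s.find(p) != std::string::npos; });
}

// Returns the links assumed to point to actual results; fills 'moreResults'
// with the links to follow for further result pages.
std::vector<Link> filter_links(const std::vector<Link>& all,
                               const EngineConfig& cfg,
                               std::vector<Link>& moreResults)
{
    std::vector<Link> results;
    for (const Link& l : all) {
        // 'more results' links: URL matches AdditionalUrlList or anchor matches AdditionalTextList.
        if (contains_any(l.url, cfg.additionalUrlList) ||
            contains_any(l.anchor, cfg.additionalTextList)) {
            moreResults.push_back(l);
            continue;
        }
        // Anything matching a site filter (ads, navigation, the engine itself) is dropped.
        if (contains_any(l.url, cfg.siteUrlFilters) ||
            contains_any(l.anchor, cfg.siteTextFilters)) {
            continue;
        }
        results.push_back(l);
    }
    return results;
}

Note that I test for 'more results' before applying the site filters; otherwise an entry like the &whence= links of the Namazu example below would be thrown away by a filter such as .../cgi-bin/. Whether Lexibot does it in that exact order I haven't checked.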
We now have a plain text file containing all the data necessary for Lexibot to
support all those engines. Unfortunately, it seems that this file isn't quite up to date
(it was last released in July 2000). So you may want to update it on your own. Or you may
even want to add some new engines to the list. Nothing easier! Just edit the plain text file
and it's done. But, you'll tell me, we have to encrypt the file before it can be used by Lexibot.
No sir! Simply name it engine2.une and it works! Indeed, before checking for the engine2.dat file,
Lexibot first checks for an engine2.une file (une probably means unencrypted, hehe) and, if it
exists, it opens that one.
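In pseudo-C++, the lookup order is simply this (the surrounding details are guesses, but the .une-before-.dat behaviour is what was observed):

#include <fstream>
#include <string>

// Open the engine definitions: prefer the unencrypted engine2.une if present,
// otherwise fall back to engine2.dat (which then needs decode_line() on every line).
std::ifstream open_engine_file(bool& needsDecryption)
{
    std::ifstream une("engine2.une");
    if (une.is_open()) {
        needsDecryption = false;
        return une;
    }
    needsDecryption = true;
    return std::ifstream("engine2.dat");
}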
OK, let's do something practical. We'll add Searchlores' Namazu engine to the list. For this, we will
add a new section to engine2.une.
The first lines are quite easy to figure out. They should look something like this:
[Namazu - Searchlores]
EngineName=Namazu - Searchlores
SubName=
Category=
END
WebPage=http://www.searchlores.org/
Action=http://www.searchlores.org/cgi-bin/search?query=%QUERY_STRING%&submit=Search%21&max=20&result=normal&sort=score
MethodType=0
Boolean=AND,OR,(),NOT,,",*

Now we have to tell Lexibot which links to consider as matching results, which links to follow for possible 'more results' pages, and which links to discard.
AdditionalUrlList=
&whence=
END
AdditionalTextList=
END
SiteUrlFilters=
http://www.searchlores.org/cgi-bin/
END
SiteTextFilters=
searchlores.org
how to search
END
UserName=
Password=
Description=
END

This tells Lexibot to consider every link containing &whence= as a link to more results. Every link pointing to /cgi-bin/ or having the strings 'searchlores.org' or 'how to search' as anchor text will be ignored, as such links don't actually point to matching pages.
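One last practical detail: the Action line above contains a %QUERY_STRING% placeholder. Here is a hedged sketch of how a bot of our own could turn it into the final request URL; the URL-encoding rules and the helper names are my assumptions, not necessarily what Lexibot does internally.

#include <cctype>
#include <cstdio>
#include <string>

// Minimal URL-encoding of the user query: keep alphanumerics, escape the rest.
static std::string url_encode(const std::string& q)
{
    std::string out;
    for (unsigned char c : q) {
        if (std::isalnum(c)) {
            out += static_cast<char>(c);
        } else {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}

// Replace the %QUERY_STRING% placeholder of an Action URL with the encoded query.
std::string build_query_url(const std::string& action, const std::string& query)
{
    const std::string placeholder = "%QUERY_STRING%";
    std::string url = action;
    std::string::size_type pos = url.find(placeholder);
    if (pos != std::string::npos)
        url.replace(pos, placeholder.size(), url_encode(query));
    return url;
}

With the Namazu Action line and the query bots AND searching, this gives ...search?query=bots%20AND%20searching&submit=Search%21&max=20&result=normal&sort=score (whether the engine prefers %20 or + for spaces is worth checking against the real search form).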
As for the Category field found in the engine2.une sections, it seems Lexibot doesn't use it.
Checking the other files around (or reading the documentation),
we find another file, 'engine.gdb', which defines each and every Engine Group recognized by Lexibot. So we'll add our own group.
Simply add the following lines at the beginning of the [Engine Groups] section:

My Own=
Namazu - Searchlores
END

This will create a new group, named 'My Own', which contains a single search engine, Namazu (of course, you could instead add 'Namazu - Searchlores' to an already existing group).
Also, while writing this essay, I wondered what to do with those 600 URLs. Browsing through the engine2.une file
isn't exactly easy. So I thought about formatting it into a nice HTML table which could be put online. But this wouldn't take
any advantage of the categorization information stored in engine.gdb and wouldn't be very useful. So I ended up with a little PHP script which parses both
engine2.une and the [Engine Groups] section of engine.gdb
to generate a nice dynamic HTML table presenting all the engines related to a user-selectable category.
I didn't try to optimize the script, nor the file format (so that it stays easy to update whenever a new engine2.dat file comes out). So
the script is actually slow. But you can easily download it and run it on your local machine :-). I don't plan to maintain this list of URLs, but if you
have any additions or corrections, I could update it from time to time. Or, even better, you may try to send them to the Lexibot people themselves :-)
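The script itself is in PHP, but the parsing idea is trivial to redo in any language. Here is a rough C++ sketch of reading engine2.une into a simple map; the list of multi-line keys and the overall structure are inferred from the sections shown above (and it assumes Unix line endings), so treat it as a sketch rather than a reference parser.

#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>

// Loose parser for engine2.une: stores the one-line Key=Value fields of each
// [Engine] section and skips over the multi-line fields that run until END.
std::map<std::string, std::map<std::string, std::string>> parse_engines(const char* path)
{
    static const std::set<std::string> multiline = {
        "Category", "AdditionalUrlList", "AdditionalTextList",
        "SiteUrlFilters", "SiteTextFilters", "Description"
    };
    std::map<std::string, std::map<std::string, std::string>> engines;
    std::ifstream in(path);
    std::string line, current;
    bool skipping = false;                                   // inside a field terminated by END
    while (std::getline(in, line)) {
        if (skipping) {
            if (line == "END") skipping = false;
            continue;
        }
        if (!line.empty() && line.front() == '[' && line.back() == ']') {
            current = line.substr(1, line.size() - 2);       // new [Engine] section
            continue;
        }
        std::string::size_type eq = line.find('=');
        if (current.empty() || eq == std::string::npos) continue;
        std::string key = line.substr(0, eq);
        if (multiline.count(key)) skipping = true;           // ignore the body of list fields here
        else engines[current][key] = line.substr(eq + 1);
    }
    return engines;
}

int main()
{
    auto engines = parse_engines("engine2.une");
    for (auto& e : engines)
        std::cout << e.first << " -> " << e.second["Action"] << '\n';
}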
Lexibot's basic idea (fetching each and every matching page) is quite interesting. Indeed, it allows operators that are not commonly supported by classical search engines, like NEAR, AFTER or BEFORE, to be fully supported by Lexibot. For this, a minimal subsuming query (i.e. with the proximity operators replaced by AND) is sent to the search engines. Then, once the possibly matching pages are fetched, the distance between words can be computed. This allows Lexibot to evaluate the proximity operators and discard the pages that don't fully match.
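As an illustration of that last step, here is a sketch of how a NEAR operator could be evaluated on a fetched page once the subsuming AND query has already selected it as a candidate. The tokenisation and the distance threshold are my own arbitrary choices, not Lexibot's actual values.

#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split a fetched page (already stripped of HTML) into lowercase words.
static std::vector<std::string> tokenize(const std::string& text)
{
    std::vector<std::string> words;
    std::string w;
    for (unsigned char c : text) {
        if (std::isalnum(c)) w += static_cast<char>(std::tolower(c));
        else if (!w.empty()) { words.push_back(w); w.clear(); }
    }
    if (!w.empty()) words.push_back(w);
    return words;
}

// NEAR(a, b): true if some occurrence of 'a' and some occurrence of 'b' are at
// most maxDistance words apart ('a' and 'b' are assumed to be lowercase).
bool near_match(const std::string& text, const std::string& a, const std::string& b,
                std::size_t maxDistance = 10)
{
    std::vector<std::string> words = tokenize(text);
    std::vector<std::size_t> posA, posB;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i] == a) posA.push_back(i);
        if (words[i] == b) posB.push_back(i);
    }
    for (std::size_t i : posA)
        for (std::size_t j : posB)
            if ((i > j ? i - j : j - i) <= maxDistance)
                return true;
    return false;        // the page matched the subsuming AND query, but not NEAR
}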
On the other hand, such an implementation wouldn't fit well into the oslse project, whose main objective is to offer an on-line meta-search engine. Being online, it should avoid time- and bandwidth-consuming tasks like fetching so many pages, most of the time only to compute information already available on the search engine's results page. However, we could implement this fetching feature in a second phase, in order to refine the search over a small, user-definable subset of the initial results.
This point of view seems to be shared by wayoutthere:
This tool would seem to be a bandwidth hog, and also there is another view to be held.
When you do a query, it gets all the pages on the link pages (up to 100).
So what is the big deal you ask?
When it was getting those pages, your details were placed in the logs for each and
every page grabbed. Pages that held no interest for you and which would have been
discarded whilst scanning through the results by eye have been grabbed and retrieved.
From my own experiences you always turn up a lot of dummy results, especially if the
search is on a subject with widely used keywords. Do you really want every single
page in the results to be grabbed? Your tracks placed on all those web servers
that you have never visited and may never visit again? You didn't visit them, but your bot
did, and it left tracks all over (unless you were careful).
Another side effect of this process is that if you use a proxy, and rely on
the cached pages to store information then this process can wipe out entries
which you intended to keep.
This also becomes a problem if any of the target pages are large files which
can make the search process a lot slower while they are retrieved. This is
more of a consideration for modem users, but they are still the most numerous around
the world.
I agree that the best approach would be to grab the results, rank them and
then on a selected subset, retrieve them and perform another level of processing
on them to get the results wanted. This can be done by using a local search
tool to search the files themselves once they are present on the local machine,
which allows almost any type of search imaginable and will be very configurable
as it is locally controlled.
Well, what else is there to say? Not much, actually. I would just point out once again that I wouldn't have written this essay if wayoutthere
hadn't posted that message on the messageboard. And he wouldn't have written his own essay (which I hope you have now read)
if I hadn't written this one and asked for his comments on it.
So, who said that messageboards aren't that useful? :-)
Thank you for reading this essay, I hope it was worth it.
(c) Laurent 2001