Manually searching reliable news sources is a time-consuming
task, and not one suited to combing through thousands of news
sources. The time such a search demands is unacceptable, to say
the least. Computers and networks are now sufficiently advanced
that harnessing their power is both inevitable and desirable.
Relying on `weblogs' that cater to your specific interest is
undesirable because of the perceived, if not actual, lack of
professionalism that is rampant in the weblog arena. These sites
are not always on top of news events, and are often riddled with
editorial mistakes. Even one of the largest weblog sources for
`geek news', Slashdot(2), regularly suffers from editorial
mistakes and serious bias.
Being dependent on third-party commercial entities for
appropriate search results can lead to skewed results and extreme
bias. In the past, it has not been uncommon for search engines to
sell their page rankings to the highest bidder(3). A cynical and
skeptical nature is to be appreciated when dealing with commercial
entities.
Gathering Content
When doing large, professional searching, it is customary to
use bots, or software search agents, to comb the World Wide Web
for valuable information. There are many tools available for
writing bots(4), and much literature on the subject.
Regular Expression Web Templates
Large web sites are not easy to maintain, hence the rise of
dynamic web sites and scripting tools such as PHP, Perl, ASP,
JSP/Servlets, Python, and so on. Because of the use of such
dynamic tools, most web sites fit a `template.' Logical tools
create logical web sites, even when the structure is masked by
extra-long and obscure URLs. A human can decipher and extract such
a structure, and hence map out the essence of a web site. This
work is manifest in "sending out finely tuned software agents, or
bots, that learn not only which pages to search, but also what
information to grab from those pages."(1)
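To illustrate what a single-page model looks like, the repeated
markup around a headline listing can often be captured with one
regular expression. The sketch below is only an illustration: the
URL, the anchor markup, and the group names are assumptions, not a
description of any real site.

    import re
    import urllib.request

    # Hypothetical example: assume an index page wraps each headline in an
    # anchor of the form <a class="headline" href="...">Title</a>.
    HEADLINE_RE = re.compile(
        r'<a\s+class="headline"\s+href="(?P<url>[^"]+)"\s*>(?P<title>[^<]+)</a>',
        re.IGNORECASE)

    def extract_headlines(page_url):
        """Fetch one page and return the (title, url) pairs the pattern matches."""
        with urllib.request.urlopen(page_url) as response:
            html = response.read().decode("utf-8", errors="replace")
        return [(match.group("title").strip(), match.group("url"))
                for match in HEADLINE_RE.finditer(html)]

    if __name__ == "__main__":
        # The address below is a placeholder, not a real news site.
        for title, url in extract_headlines("http://news.example.com/index.html"):
            print(title, "->", url)

A bot built this way is finely tuned to one page layout, which is
exactly the fragility the rest of this section addresses.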
This structure, once extracted by a human, is used only once, in
the form of a custom bot, and then lost, both because of the
ever-changing nature of the web and because there is no standard
way of communicating the structure of a web site to another
person, or bot. RDF(5) and other meta-tag standards are not useful
here, because their use is voluntary. RDF is a great idea for a
perfect world. We need pragmatic solutions for an imperfect world.
A standard form for communicating the structure of a web site
is needed, so that this structure can be fed to a bot and
information gathered efficiently, without the extreme duplication
of work that is so rampant today in the creation of custom search
agents. Regular expressions are the natural choice for modelling
a single page, but an appropriate form for the structure of a
whole website is also needed. This structural form must be
completely modelled in one file, be operating-system agnostic, and
must cater specifically to the HTTP protocol. There must also be a
table of metadata at the head of the file that indicates all the
data that can be culled from the web site in question. For
example, when modelling a news site, the table of metadata must
indicate that `Science Headlines' as well as `Economic Headlines'
are available. Thus, a robot that is able to digest this standard
form need only be told what data is relevant, not how to retrieve
and parse it.
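As a rough sketch of what such a standard form, and a generic
robot that digests it, might look like, consider the following.
The file format, section names, fields, and URLs are all invented
for illustration; this is a sketch of the idea, not a proposed
standard.

    import configparser
    import re
    import urllib.request

    # Hypothetical template file describing one site.  Every section name,
    # field, and URL here is an assumption made for the sake of the sketch.
    EXAMPLE_TEMPLATE = """
    [metadata]
    site = Example News
    available = Science Headlines, Economic Headlines

    [Science Headlines]
    url = http://news.example.com/science/
    pattern = <a href="(?P<url>[^"]+)" class="story">(?P<title>[^<]+)</a>

    [Economic Headlines]
    url = http://news.example.com/economy/
    pattern = <a href="(?P<url>[^"]+)" class="story">(?P<title>[^<]+)</a>
    """

    def fetch(url):
        """Retrieve one page over HTTP and return its text."""
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def cull(template_text, wanted):
        """Given a site template and the name of the relevant data
        (e.g. 'Science Headlines'), return the matching items.  The bot
        is told what to retrieve, never how."""
        template = configparser.ConfigParser()
        template.read_string(template_text)
        available = [s.strip()
                     for s in template["metadata"]["available"].split(",")]
        if wanted not in available:
            raise ValueError("site does not offer %r" % wanted)
        section = template[wanted]
        pattern = re.compile(section["pattern"])
        html = fetch(section["url"])
        return [match.groupdict() for match in pattern.finditer(html)]

    if __name__ == "__main__":
        for item in cull(EXAMPLE_TEMPLATE, "Science Headlines"):
            print(item["title"], "->", item["url"])

The point of the sketch is the division of labour: the template
file carries the how (URLs and patterns), while the robot need
only be told the what (`Science Headlines').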
Please note that this standard form for communicating the
structure of a web site is the general case; it can be applied
specifically to the culling of news headlines. News headlines were
chosen for discussion because of the Author's perception of the
Readers' interests.
Distributed Effort
By harnessing the many hands of the Internet, it will be
possible to keep abreast of changes to the structure of different
web sites, by utilizing the good will of World Wide Volunteers. A
central repository of Web Templates will be needed to house the
structures of web sites, and it will necessarily need to be
completely Free. Make no mistake, this technology must be
available to one and all, and I don't care if we all have to live
in a cardboard box to do it.
Text Classification
When News Content is made available by these generic robots,
text classification technology can be utilized to categorize that
content and to assign user preferences to it. The content could be
organized in such a manner as to extract certain patterns, with
the goal of finding valuable information. Envision a Library, or
if the immensity of that thought is too grand, then perhaps a News
Library. Proven algorithms could be used, for example Bayesian
classification, or perhaps cutting-edge technology such as Support
Vector Machines (SVMs). This is an extremely fruitful area of
research, and I recommend it to all who are interested in
understanding the nature of information, and hence, the nature of
the 'net.
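As a hint of how Bayesian classification might be applied to
harvested headlines, here is a minimal naive Bayes sketch. The
categories and training headlines are invented for illustration; a
real system would learn from a user's own judgements of relevance.

    import math
    from collections import Counter, defaultdict

    class NaiveBayes:
        """A minimal multinomial naive Bayes text classifier."""

        def __init__(self):
            self.word_counts = defaultdict(Counter)  # category -> word counts
            self.doc_counts = Counter()              # category -> document count
            self.vocabulary = set()

        def train(self, text, category):
            words = text.lower().split()
            self.word_counts[category].update(words)
            self.doc_counts[category] += 1
            self.vocabulary.update(words)

        def classify(self, text):
            words = text.lower().split()
            total_docs = sum(self.doc_counts.values())
            best, best_score = None, float("-inf")
            for category, counts in self.word_counts.items():
                # log prior plus log likelihood with add-one smoothing
                score = math.log(self.doc_counts[category] / total_docs)
                denom = sum(counts.values()) + len(self.vocabulary)
                for word in words:
                    score += math.log((counts[word] + 1) / denom)
                if score > best_score:
                    best, best_score = category, score
            return best

    if __name__ == "__main__":
        # Invented training headlines, purely for illustration.
        nb = NaiveBayes()
        nb.train("rover sends new images from mars", "science")
        nb.train("telescope spots distant galaxy", "science")
        nb.train("markets rally as interest rates fall", "economy")
        nb.train("central bank raises interest rates", "economy")
        print(nb.classify("probe returns images of distant planet"))  # science
        print(nb.classify("stocks slide as interest rates rise"))     # economy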
Conclusion
I would like to thank you, Gentle Reader, for staying the
course and reaching this place of rest. I have written this text
in the hope of sparking useful and enlightened discussion. I know
that I will not be disappointed. Your comments are welcome, and
anxiously awaited.
Sources
(1) Mining the 'Deep Web' With Specialized Drills, Lisa Guernsey,
http://partners.nytimes.com/2001/01/25/technology/25SEAR.html
or also at +Forseti's: http://qu00l.net/seeking-nyt.html
(2) Slashdot: News for Nerds. Stuff that matters.,
http://www.slashdot.org
(3) Pay For Placement?, Danny Sullivan,
http://searchenginewatch.com/resources/paid-listings.html
(4) Bot Writing, Bot Trapping & Bot Wars: How to search the web, fravia+,
http://www.searchlore.org/bots.htm or http://www.searchlores.org/bots.htm
(5) Resource Description Framework (RDF), W3C,
http://www.w3.org/RDF/
Bibliography
Information Retrieval on the Web (2000), Mei Kobayashi,
Koichi Takeda,
http://citeseer.nj.nec.com/kobayashi00information.html
(CiteSeer is an excellent source for computer science papers,
spanning text classification technology as well as the future of
bots - you won't be disappointed! Also consider some searches
there on `bayesian classifier', `bayesian networks', and `support
vector machines' for text classification algorithms - please note
that implementations are forthcoming, to be integrated with the
generic bot.)
2000 Search Engine Watch Awards: Best Specialty Search,
Danny Sullivan,
http://www.searchenginewatch.com/awards/index.html#specialty
(mentions www.moreover.com, and is informative)
Moreover: Business Intelligence and Dynamic Content,
http://www.moreover.com
(commercial implementation of the culling of news sources,
currently offering free searches of their database - could be much
greater were it Free, our Aim)
Autonomy: Automating the Digital Economy,
http://www.autonomy.com
(commercial implementation of basic text classification
algorithms, aimed at diverse content types to automate the
`understanding' of text - see their White Papers for an intro to
their tech - a Free implementation will be completed soon)