Build a Better Search Engine


Two ways to take the problem of redundant data out of a search engine. The first could capture perhaps 50% of search-engine traffic, complementing current services. The second simply improves ordinary search engines.

1. Arachnea

The problem with search engines: by the time you get the stuff into a database, it's probably out of date. Most of your archive is full of out-of-date links, and that annoys the punters searching it. OK, let's do it a different way, with a free service for self-promotion and a paid-for service for revenue.

(a) The free portion.

You offer everyone the chance to register with your site and get 12 free outstanding searches. These are normal search terms they type in, just as they would at a search engine. You store these. Users can change them at any time, and can pay for more.

You then accept URLs for spidering, and spider your own choice of content-rich sites in the normal manner.

But you do not allow surfers to search this directly. As you spider sites in any 24-hour period, you compare them against your private server-side database, running a check [i.e. a checksum] on each site to see whether it is new. Only new sites are then held for processing within that 24-hour period.
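A minimal sketch of that comparison step, assuming the private database is just a store of URL-to-checksum mappings from earlier passes (the names and the choice of SHA-256 are illustrative, not anything Arachnea would be tied to):

    import hashlib

    seen_checksums = {}   # url -> digest recorded on earlier spidering passes

    def is_new_content(url, page_bytes):
        # Unchanged sites are skipped; new or altered ones are held for today's run.
        digest = hashlib.sha256(page_bytes).hexdigest()
        if seen_checksums.get(url) == digest:
            return False
        seen_checksums[url] = digest
        return True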

You then run all the held outstanding search requests against this new content, and post all (or the first x) results on generated pages that can be accessed only by the registered user. These pages stay up for one day and contain only links to content that is new to Arachnea. Surfers can log a preferred time in any 24 hours for their personal Arachnea bookmark pages to be updated, and other options [e.g. language] can be given precedence in these personal bookmark pages.
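Here is a rough sketch of that matching step, assuming the stored searches are kept as plain strings per user and matched by simple substring lookup (a real engine would use a proper index; everything named here is illustrative):

    from collections import defaultdict

    def build_bookmark_lists(user_queries, new_pages, max_results=20):
        # user_queries: {user_id: [search strings]}; new_pages: [(url, page_text)]
        # Returns {user_id: [urls]}, the raw material for each 'My Arachnea bookmarks' page.
        bookmarks = defaultdict(list)
        for user_id, queries in user_queries.items():
            for url, text in new_pages:
                lowered = text.lower()
                if any(q.lower() in lowered for q in queries):
                    bookmarks[user_id].append(url)
                    if len(bookmarks[user_id]) >= max_results:
                        break
        return bookmarks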

So each surfer gets up to 10 pages of their own ['My Arachnea bookmarks'], each containing x bookmarks to new content they are interested in, refreshed while they are asleep, ready to wake up to.

The next day it happens again. It feels like an entirely new concept: a search engine with no latency.

So how do we make it pay?

(b) Paid-for service for generating revenue.

We invite corporate customers to submit phrases (say, their trademarks or copyrighted text), or images, data files, bit signatures, entire websites, and software to our secure database.

Each day, we check the new content against this material and report to them if there is a match. So if someone puts up a clone website, libels your company or product, or posts a warez copy of your software on their site, Arachnea will tell you about it as soon as they try to promote it.

It's a fast checker for stolen content, stolen software, online libel, and clone sites. We can check bit patterns in MP3 files and in software, match filenames, probe inside zip files, and generally offer a cheap alternative to hiring your own people to surf the web looking for this stuff.
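As a rough sketch of that daily check, assume each customer's material is reduced to a set of phrases and file checksums (the fingerprint format, SHA-256, and the function names are all illustrative assumptions):

    import hashlib, io, zipfile

    def check_against_fingerprints(page_text, files, fingerprints):
        # fingerprints: {'phrases': [trademarked strings], 'file_hashes': set of hex digests}
        # files: list of (filename, raw bytes) pulled from the newly spidered site.
        hits = []
        for phrase in fingerprints.get('phrases', []):
            if phrase.lower() in page_text.lower():
                hits.append(('phrase', phrase))
        hashes = fingerprints.get('file_hashes', set())
        for name, data in files:
            if hashlib.sha256(data).hexdigest() in hashes:
                hits.append(('file', name))
            if name.lower().endswith('.zip'):
                # Probe inside zip archives for the same checksums.
                with zipfile.ZipFile(io.BytesIO(data)) as zf:
                    for member in zf.namelist():
                        if hashlib.sha256(zf.read(member)).hexdigest() in hashes:
                            hits.append(('zipped file', name + ':' + member))
        return hits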

That's how we get revenue and provide a free site.

2. Zero-Latency for ordinary search sites

This is a simple way to alter how a search engine works, making a traditional search engine more up to date.

Distribute the data acquisition. Create a distributed server-side program that operates as a client for your system, spidering everything on, say, the AOL servers. AOL hosts this. It does all the work for you, watching for new content being uploaded and telling your database when material is removed, by monitoring the AOL servers' FTP operations.
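A bare-bones sketch of such a hosted agent follows. It assumes the agent polls the host's document root rather than hooking FTP directly, and that your central index exposes an update endpoint; the URL, paths, and polling interval are placeholders:

    import json, os, time, urllib.request

    INDEX_ENDPOINT = "https://search.example.com/update"   # hypothetical central index
    DOC_ROOT = "/var/www"                                   # hypothetical hosted content root

    def snapshot(root):
        # Map every file under the document root to its last-modified time.
        files = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                files[path] = os.path.getmtime(path)
        return files

    def notify(change, paths):
        # Tell the central database what appeared or disappeared on this pass.
        body = json.dumps({"change": change, "paths": sorted(paths)}).encode()
        req = urllib.request.Request(INDEX_ENDPOINT, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

    previous = snapshot(DOC_ROOT)
    while True:
        time.sleep(60)
        current = snapshot(DOC_ROOT)
        added = [p for p in current if p not in previous or current[p] > previous[p]]
        removed = [p for p in previous if p not in current]
        if added:
            notify("added_or_updated", added)
        if removed:
            notify("removed", removed)
        previous = current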

So what's in it for AOL? Well, if AOL lets you put your client software on their servers, then anyone surfing in from an aol.com dial-up is allowed to use your search engine. It's a two-way service: anyone who wants their server's content publicised, and their dial-up customers able to use your up-to-the-minute search engine, simply has to host your software.

So that's a new search-engine-style web technology, and a way of bringing the old one up to date.

