This chapter explains the relevant steps performed by the open-search agent. This viewpoint extends Fig. 1 by listing actions that each user-search request will produce:
1. Parse user keywords into search-keys.
2. open-search P2P: get values for the search-keys.
3. Merge and sort values. Build URL and meta-info lists.
4. Get summary and meta-info, and return results.
Steps 2-4 are pipelined and threaded. This means that the user is presented with results early (step 4) while the other processes (steps 2-4) remain running and update the information. List 1 elaborates on each step.
List 1.
Step 1 - A: Access an open-search server via HTTP using a web browser.
Step 1 - B (optional): Install and configure a local open-search agent/server.
Step 1 - B (optional): Customize search preferences.
Step 1 - C: Issue a search request to the agent.
Step 1 - D: AGENT: parse search request into keys.
Step 1 - D (optional / background): AGENT: [reverse] lookup similar search keys for this key and related pages.
Step 1 - F: AGENT: build search request for P2P engine.
Step 2: P2P: look up value(s) for the search keys. Values represent URL entries and static meta-info for the URL.
Step 3 - A: search-result-builder: get extra info for each entry, build summary, cache, etc.
Step 3 - B: search-result-builder: sort and filter.
Step 3 - C: search-result-builder: feed step 1-D with related pages/keys. Also feed the crawler with links.
Step 4: search-result-formatter: render results for the user, build cached-preview summaries, etc.
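The pipelining of steps 1-4 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names (FAKE_DHT, parse_keys, lookup_worker, merge_worker, search), the integer weights, and the use of Python threads and queues. It is not the open-search implementation, only one way the lookup and merge stages could run concurrently while intermediate result lists become available early.

```python
import queue
import threading

# Hypothetical stand-in for the P2P lookup: search key -> list of (url, weight).
FAKE_DHT = {
    "open": [("http://example.org/a", 9), ("http://example.org/b", 4)],
    "search": [("http://example.org/a", 5), ("http://example.org/c", 7)],
}

def parse_keys(user_keywords):
    """Step 1: split the user's input into lower-case search keys."""
    return [word.lower() for word in user_keywords.split()]

def lookup_worker(keys, out):
    """Step 2: look up each key and push its values downstream as they arrive."""
    for key in keys:
        out.put(FAKE_DHT.get(key, []))
    out.put(None)  # sentinel: no more values

def merge_worker(inp, snapshots):
    """Step 3: merge weights per URL, re-sort, and publish a snapshot each time."""
    scores = {}
    while True:
        values = inp.get()
        if values is None:
            break
        for url, weight in values:
            scores[url] = scores.get(url, 0) + weight
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        snapshots.append(ranked)

def search(user_keywords):
    """Run lookup and merge concurrently; step 4 would render each snapshot."""
    values = queue.Queue()
    snapshots = []
    threads = [
        threading.Thread(target=lookup_worker, args=(parse_keys(user_keywords), values)),
        threading.Thread(target=merge_worker, args=(values, snapshots)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return snapshots

snapshots = search("Open Search")
```

Each element of snapshots is a complete ranked list, so a formatter could show the first one to the user while later lookups are still in flight.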
A built-in webserver provides the main interface to open-search. This approach is both simple and flexible. It provides the basis for separating front-end design from back-end functionality. HTTP with an XML/HTML API is common enough to allow various add-ons (proxy, auth, filter, styling) on top of open-search. A standard CGI interface would benefit quick prototyping, but a customized open-search-httpd will both perform better and simplify the interface to the search-plugins [1]. The user interface itself has no functional purpose for open-search other than to provide the user and development(!) interface.
To address different needs and eventually outsource development, search algorithms are modularized as plugins. However, the plugin-loader and interface-core are a major part of open-search (there will also be libraries of commonly used functions for OS-plugins). There are different types of plugins, and plugins can also interact with each other (plugin-suite). From the top view: a plugin handles a user request (and can internally call other plugins to complete the job).
Search Plugin: act on user request and schedule lookups on the P2P network.
Result render Plugin: act on user request (optional AJAX?! reload/poll/update result-page; see the section called “Front-end interface and templates”).
Result merge Plugin: pre-process results from the P2P network (enqueue to render plugin).
Result search Plugin: act on special P2P network request.
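How one plugin can handle a user request and internally call other plugins can be sketched as follows. The registry class, the plugin signatures, and the dictionary-shaped requests are all hypothetical choices for this illustration. The text above does not specify the plugin interface.

```python
class PluginRegistry:
    """Hypothetical plugin-loader core: maps plugin names to callables."""

    def __init__(self):
        self._plugins = {}

    def register(self, name, handler):
        self._plugins[name] = handler

    def call(self, name, request):
        # Every plugin receives the registry so it can call other plugins.
        return self._plugins[name](self, request)

def search_plugin(registry, request):
    """Search plugin: acts on the user request and schedules lookups."""
    keys = request["query"].lower().split()
    # Stand-in for P2P lookups: one fake value per key.
    raw = [{"key": key, "url": "http://example.org/" + key} for key in keys]
    merged = registry.call("merge", {"values": raw})
    return registry.call("render", {"entries": merged})

def merge_plugin(registry, request):
    """Result merge plugin: pre-processes values before rendering."""
    return sorted(request["values"], key=lambda entry: entry["url"])

def render_plugin(registry, request):
    """Result render plugin: formats entries for the user."""
    return "\n".join(f"{e['key']}: {e['url']}" for e in request["entries"])

registry = PluginRegistry()
registry.register("search", search_plugin)
registry.register("merge", merge_plugin)
registry.register("render", render_plugin)

page = registry.call("search", {"query": "Peer Index"})
```

Passing the registry into every handler keeps the plugins decoupled: a plugin-suite only needs to agree on names and request shapes.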
The P2P network is opaque! For the upper layers it is a "black box" that maintains a hash table all by itself[2]! The plugins just look up data in there. Later versions can also update the data (ratings, comments, etc.) in the P2P network, which is basically the job of the crawler.
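The "black box" view can be expressed as a narrow interface that plugins program against. The class and method names below (KeyValueNetwork, lookup, InMemoryNetwork, store) are assumptions for this sketch, and the in-memory table merely stands in for whatever P2P layer is eventually chosen.

```python
from abc import ABC, abstractmethod

class KeyValueNetwork(ABC):
    """Hypothetical interface: all that upper layers see of the P2P network."""

    @abstractmethod
    def lookup(self, key):
        """Return the list of values stored under a search key."""

class InMemoryNetwork(KeyValueNetwork):
    """Stand-in backend for testing; a real backend would talk to peers."""

    def __init__(self):
        self._table = {}

    def store(self, key, value):
        # In this sketch only a crawler-like component would call store().
        self._table.setdefault(key, []).append(value)

    def lookup(self, key):
        # Return a copy so callers cannot mutate the backend's state.
        return list(self._table.get(key, []))

network = InMemoryNetwork()
network.store("search", "http://example.org/")
hits = network.lookup("search")
```

Because plugins depend only on lookup(), an off-the-shelf network could later be swapped for a customized one without touching the plugins.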
In the first version we are planning to use an off-the-shelf solution as the P2P network (such as Gnutella, OceanStore or Freenet). Later versions will incorporate a customized P2P network layer that addresses low latency, privacy/anonymity and other issues: multicasting, optimized packet size, incoming connections, ring maintenance, recursive and transitive lookups, etc.
There can be multiple P2P networks for different purposes, or a hybrid solution: for instance, while the search network is completely decentralized, there can be dedicated (centralized) caches, proxies or servers to efficiently parse PDFs, images, etc.
The database is made of several distributed hash tables: a simple 2D map to link hashes and a large P2P tank for meta information. Here is a first draft of the DHT key/value data format, per URL entry:
Title
URL
Search-Keys (DHT KEYS)
Search-Keys (TEXT FORMAT)
date/time modified/crawled/indexed
ratings (category tags and weight for each search-key)
mirror URLs (optional)
crawled by [host/person], indexed by [host/person], pgp-signature (optional - trusted search)
md5-sum of page (optional)
user-comments (optional)
Short-Excerpt/Summary (maybe not part of DHT info)
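The draft record above can be written down as a data structure to make the optional fields explicit. The field names, the types, and the choice of ISO 8601 strings for timestamps are assumptions of this sketch. The draft itself does not fix any of them.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class UrlEntry:
    """Hypothetical per-URL value record mirroring the draft field list."""
    title: str
    url: str
    key_hashes: list                     # Search-Keys (DHT keys)
    key_texts: list                      # Search-Keys (text format)
    modified: str = ""                   # date/time stamps; ISO 8601 is an assumption
    crawled: str = ""
    indexed: str = ""
    ratings: dict = field(default_factory=dict)       # search-key -> tags and weight
    mirror_urls: list = field(default_factory=list)   # optional
    crawled_by: str = ""
    indexed_by: str = ""
    pgp_signature: Optional[str] = None               # optional, trusted search
    md5_sum: Optional[str] = None                     # optional
    user_comments: list = field(default_factory=list) # optional
    summary: Optional[str] = None                     # maybe not part of DHT info

entry = UrlEntry(
    title="Example page",
    url="http://example.org/",
    key_hashes=["<hash-of-example>"],    # placeholder, not a real hash
    key_texts=["example"],
    ratings={"example": {"tags": ["demo"], "weight": 1}},
)
record = asdict(entry)  # plain dict, ready for serialization
```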
The first (development) version will use a fixed data set and plugins to import data from external sources. open-search intends to implement a binary-compatible protocol (config, cache, or home folders can be used on any OS). UTF-8 is the character encoding of choice.
Keep in mind: URLs are unique, while search keys are not! Some URLs have aliases or mirrors, some are dynamic, others static. open-search needs to transform a database of unique URLs with multiple search-keys into a database with unique hashed keys. Idea: find a disjoint set of search-tags that minimizes the key-space in a Karnaugh diagram for all URLs. URLs can be sorted into categories (used both for indexing/tagging and for crawler priority queueing). Examples of categories include php-script, secure-connection, mirrored-site, etc.
Serializing hash entries to text files or XML dumps is nice for import/export/transcoding and debugging.
We need to investigate how much better an SQL database will perform once the dataset grows. PostgreSQL is known for very good query optimization, internal hashing and data spatialization!
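The transformation from unique URLs with several search-keys into a table with unique hashed keys is essentially an inversion of the mapping. The sketch below shows only that inversion; it does not attempt the Karnaugh-style key-space minimization mentioned above. The function names and the use of SHA-1 over the lower-cased UTF-8 key are assumptions.

```python
import hashlib

def key_hash(search_key):
    """Hash a search key (assumption: SHA-1 of the lower-cased UTF-8 text)."""
    return hashlib.sha1(search_key.lower().encode("utf-8")).hexdigest()

def invert(url_to_keys):
    """Turn {url: [search keys]} into {hashed key: set of urls}."""
    table = {}
    for url, keys in url_to_keys.items():
        for key in keys:
            table.setdefault(key_hash(key), set()).add(url)
    return table

crawled = {
    "http://example.org/a": ["p2p", "search"],
    "http://example.org/b": ["P2P", "crawler"],
}
table = invert(crawled)
```

Both URLs end up under the same hashed key for "p2p", which shows why one key maps to many URLs even though every URL is unique.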
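An XML dump and re-import of hash entries could look like the following sketch, which uses Python's standard xml.etree.ElementTree module. The element and attribute names (hashtable, entry, key, url) are invented for this example and are not a defined open-search format.

```python
import xml.etree.ElementTree as ET

def dump_xml(table):
    """Serialize {key: [urls]} to an XML string with sorted, stable output."""
    root = ET.Element("hashtable")
    for key in sorted(table):
        entry = ET.SubElement(root, "entry", {"key": key})
        for url in sorted(table[key]):
            ET.SubElement(entry, "url").text = url
    return ET.tostring(root, encoding="unicode")

def load_xml(text):
    """Parse a dump produced by dump_xml back into {key: [urls]}."""
    root = ET.fromstring(text)
    return {
        entry.get("key"): [url.text for url in entry.findall("url")]
        for entry in root.findall("entry")
    }

table = {"k1": ["http://example.org/a", "http://example.org/b"], "k2": []}
dumped = dump_xml(table)
restored = load_xml(dumped)
```

Sorting keys and URLs before writing keeps dumps diff-friendly, which helps with the debugging use case.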
Each open-search agent has two internal queues/buffers: one for URLs to be downloaded for indexing, and one for the retrieved data.
An optional proxy process (which offers no privacy) can be used to feed user-browsed pages directly into the process-buffer. It can also parse links and learn/rate URLs from user paths (referrer), or feed crawled pages into a private/personal cache, i.e. the open-search client acts as a personal cache that prefetches pages during browsing.
The crawler is trivial, although some effort could be taken to address privacy at the crawler level. Crawling could happen time-delayed or semi-randomly, random crawls could be scheduled on neighbor hosts (!? security), or pages could even be dropped on heuristic limits. It is also important to pay attention to the IP protocol priority for the crawler (don't suck up user bandwidth, but beware of IP filters!). The open-search proxy seems to be the right place for pluggable session-level privacy.
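The two buffers can be sketched with thread-safe queues. The names AgentBuffers, download_queue, process_buffer and crawl_once are hypothetical, and the fetch function is a stub so that the example runs without network access.

```python
import queue

class AgentBuffers:
    """Hypothetical holder for the agent's two internal queues."""

    def __init__(self):
        self.download_queue = queue.Queue()  # URLs waiting to be downloaded
        self.process_buffer = queue.Queue()  # (url, data) pairs waiting for indexing

def crawl_once(buffers, fetch):
    """Move one URL from the download queue into the process buffer."""
    url = buffers.download_queue.get()
    buffers.process_buffer.put((url, fetch(url)))

def fake_fetch(url):
    # Stub: a real crawler would perform an HTTP request here.
    return "<html>" + url + "</html>"

buffers = AgentBuffers()
buffers.download_queue.put("http://example.org/")
crawl_once(buffers, fake_fetch)
item = buffers.process_buffer.get_nowait()
```

A proxy process could put browsed pages directly into process_buffer, bypassing the download queue.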
The open-search indexer-core checks the document type and URL/hash of entries from the index-buffer-stack and sorts them into channels, which are processed by [different] index-plugins. The return values of the plugins are standardized hashes to be fed to the P2P network storage.
The indexer can also schedule (enqueue) new URLs for indexing. A priority system can be used to flush or process the channels/queues (i.e. look up the last-crawled date first).
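The channel dispatch can be sketched as follows. The plugin functions, the PLUGINS table keyed by MIME type, and the record shape {"key": ..., "url": ...} are all assumptions standing in for the "standardized hashes" mentioned above. The tag stripping is deliberately crude.

```python
import re
from collections import defaultdict

def text_plugin(url, body):
    """Hypothetical plugin: every distinct word becomes a search key."""
    return [{"key": word, "url": url} for word in sorted(set(body.lower().split()))]

def html_plugin(url, body):
    """Hypothetical plugin: strip tags crudely, then reuse the text plugin."""
    return text_plugin(url, re.sub(r"<[^>]+>", " ", body))

# Assumption: channels are keyed by MIME type.
PLUGINS = {"text/plain": text_plugin, "text/html": html_plugin}

def sort_into_channels(index_buffer):
    """Group (url, doc_type, body) entries by document type."""
    channels = defaultdict(list)
    for url, doc_type, body in index_buffer:
        channels[doc_type].append((url, body))
    return channels

def run_indexer(index_buffer):
    """Process each channel with its plugin and collect the records."""
    records = []
    for doc_type, documents in sort_into_channels(index_buffer).items():
        plugin = PLUGINS.get(doc_type)
        if plugin is None:
            continue  # no plugin registered for this channel
        for url, body in documents:
            records.extend(plugin(url, body))
    return records

records = run_indexer([
    ("http://example.org/a", "text/html", "<p>Open search</p>"),
    ("http://example.org/b", "application/pdf", "binary data"),
])
```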
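One possible priority system is a heap ordered by last-crawled time, so that never-crawled and stale URLs are processed first. That ordering is an interpretation of the text above, not a confirmed design, and the CrawlSchedule class with its methods is hypothetical.

```python
import heapq

class CrawlSchedule:
    """Hypothetical priority queue: least recently crawled URL comes out first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps insertion order for equal timestamps

    def enqueue(self, url, last_crawled):
        # last_crawled is epoch seconds; 0 means the URL was never crawled.
        heapq.heappush(self._heap, (last_crawled, self._counter, url))
        self._counter += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

schedule = CrawlSchedule()
schedule.enqueue("http://example.org/a", 200)
schedule.enqueue("http://example.org/b", 0)
schedule.enqueue("http://example.org/c", 100)
order = [schedule.next_url() for _ in range(len(schedule))]
```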
The resulting database is a collection of URLs + meta information, linked to search-keys. Those keys need to be parsed from the page and weighted for the given URL.
As the agent is open-source, we cannot guarantee anonymity if it is built on top of the network. User anonymization needs to be implemented below or inside the P2P network. Anonymizing introduces various disadvantages, e.g. extra latency. open-search aims to be compatible with Tor as an optional anonymizer. Privacy issues are addressed at the application/session layer with the goal of obfuscating user habits and eliminating individual statistics, but this cannot assure anonymity.