Project name: System and method for classifying, publishing, searching and locating electronic documents

 

Inventors: Prof. Yosef Ben Asher, Dr. Shlomo Berkovsky

 

Summary of the invention:

The invention relates to an electronic documents management system and provides a system for classifying, publishing, searching and locating electronic documents, said system comprising:

i. means for classifying and publishing electronic documents via an ontological description consisting of at least one vector, each vector comprising at least one feature-value pair wherein each slot of said at least one vector corresponds to a feature of said at least one feature-value pair and a range of each of said slots corresponds to a set of all possible values of said feature;

ii. means for storing each of said electronic documents using the following steps: using a first hashing function to map said feature of each feature-value pair to a slot number, corresponding to a coordinate in said at least one vector; using a second hashing function to map said value of each feature-value pair to a numeric value of said slot corresponding to the range of said coordinate; creating a new ordered vector based on the results of said two hashing functions; mapping said new ordered vector to a node in a hypercube;

iii. means for storing each of said electronic documents in a hypercube-like graph structure wherein each vertex of said hypercube can be recursively constructed of another hypercube;

iv. means for specifying search criteria for one or more electronic documents via an ontological description by enumerating at least one feature-value pair; and

v. means for locating said one or more electronic documents according to the specified search criteria.

Multiple vectors can be organized in a hierarchical multi-layered structure. Electronic documents are stored in a hypercube-like graph structure wherein each vertex can be recursively constructed of another hypercube. Search criteria can be specified for one or more documents via an ontological description by enumerating at least one feature-value pair. The electronic document or documents are then located according to the specified search criteria.

In a preferred embodiment of the present invention, electronic documents are distributed across multiple computers connected by a peer-to-peer network. The ontological description employed to publish or locate electronic documents consists of a global, predefined part and optionally an unspecified part dynamically specified by the system's users. The terms used in the ontological description of the electronic documents can undergo a semantic standardization in order to increase the performance of the system.

The present invention further provides a method for classifying, publishing and locating electronic documents, the method comprising the steps of:

i. classifying and publishing electronic documents via an ontological description consisting of at least one vector, each vector comprising at least one feature-value pair wherein each slot of said vector corresponds to a feature of said at least one feature-value pair and a range of each of said slots corresponds to a set of all possible values of said feature;

ii. storing each of said electronic documents using the following steps: using a first hashing function to map said feature of each feature-value pair to a slot number, corresponding to a coordinate in said at least one vector; using a second hashing function to map said value of each feature-value pair to a numeric value of said slot corresponding to the range of said coordinate; creating a new ordered vector based on the results of said two hashing functions; mapping said new ordered vector to a node in a hypercube;

iii. storing each of said electronic documents in a hypercube-like graph structure wherein each vertex of said hypercube can be recursively constructed of another hypercube;

iv. specifying search criteria for one or more electronic documents via an ontological description by enumerating at least one feature-value pair; and

v. locating said one or more documents according to the specified search criteria.
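The two-hash storing step can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names (`stable_hash`, `to_vector`), the use of SHA-256, the parameters `dims` and `span`, and the rule for resolving slot collisions are all assumptions made for the example.

```python
import hashlib

def stable_hash(text: str) -> int:
    """Deterministic hash; Python's built-in hash() is salted per process."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def to_vector(pairs: dict, dims: int, span: int) -> tuple:
    """Map feature-value pairs to an ordered vector with `dims` slots.

    Each coordinate lies in range(span); slots that no feature maps to
    stay None (unspecified). If two features hash to the same slot, the
    first one in sorted order wins (a simplification assumed here).
    """
    vector = [None] * dims
    for feature, value in sorted(pairs.items()):
        # First hash: feature name -> slot number (coordinate index).
        slot = stable_hash("feature:" + feature.lower()) % dims
        # Second hash: value -> numeric value within the slot's range.
        coord = stable_hash("value:" + value.lower()) % span
        if vector[slot] is None:
            vector[slot] = coord
    return tuple(vector)

# Hypothetical ad and query; the resulting vectors address hypercube nodes.
ad = {"manufacturer": "Ford", "gearbox": "automatic", "color": "red"}
query = {"gearbox": "automatic"}
ad_vec = to_vector(ad, dims=8, span=5)
query_vec = to_vector(query, dims=8, span=5)
```

Because both hashes are deterministic, a publisher and a searcher who use the same feature name and value obtain the same slot and coordinate without consulting a shared form. A query that leaves slots unspecified addresses every node that agrees with it on the specified slots.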

   

Field of the Invention

The invention relates to a system and method for classifying, publishing, searching and locating electronic documents wherein each electronic document is described by one or more feature-value pairs. More specifically, the present invention relates to such a system and method applied in the distributed computing environment of a peer-to-peer network.

Background of the invention

Sharing large numbers of electronic documents among a large number of users, or providing those users with access to them, is a challenge common to many computer applications.

It is known in the art to use client-server technology to give users access to large numbers of electronic documents or files. The client-server model is typically based on a single server or a small number of servers, to which all users connect. While client-server systems work well under certain conditions, the model has some obvious limitations and drawbacks.

Client-server environments do not scale smoothly. If the number of users connected to the server or servers grows dramatically, or the traffic generated grows significantly, and the server capacity is not augmented, then the system slows down or may even break down. Augmenting the server capacity is an expensive operation that sometimes requires shutting down the system. Reliability of client-server systems is another frequent problem. If the server breaks down, then the system may be severely crippled or even shut down. Anonymity may also be an issue, as the identity and actions of each client are known to the server.

Peer-to-Peer (P2P) technology, developed in recent years, offers a solid alternative to the traditional client-server model of computing. In P2P systems, every user (peer) acts as both a client and a server at the same time, and provides a portion of the system capability. P2P systems usually lack designated centralized management and depend instead on the voluntary contribution of resources by the users. Thus, P2P technology allows a dynamic set of users to share resources efficiently without any centralized management. The shared resources include computing power (e.g., in distributed computation), data (e.g., in large-scale file sharing), bandwidth (e.g., data transfer from multiple sources) and others. As a result, the advantages of P2P technology over the client-server model include nearly unlimited scalability, high privacy and anonymity of the users, and low costs. Sharing and aggregation of the resources also make P2P systems robust and highly available.

The first generation of P2P systems was based on three classical architectures: mediated P2P architecture, pure P2P architecture, and hybrid architecture. Basically, all of them were designed for large-scale data sharing. Prior art applications, such as Napster, Freenet and Gnutella, allowed users to download data (mainly multimedia files) shared by other users. The performance of these systems suffered from severe problems. For example, in Napster a cluster of central servers maintained the indices of the files shared by the users, so the system depended on a central point. The flooding search algorithm of Gnutella limited the scalability of the system and did not allow proper functioning over a heterogeneous set of users. Freenet, despite being fully decentralized and employing efficient routing algorithms, could not guarantee reliable data location. This led to the development of content-addressable P2P systems.

A number of similar content-addressable P2P systems, such as CAN, Pastry, Chord and some others, are referred to as the second generation of P2P systems. They implement a highly scalable, self-organizing infrastructure for fault-tolerant routing over a data management mechanism based on distributed hashing (a distributed hash table, DHT). In these systems the users and the data objects are assigned unique identifiers (respectively, userids and keys) from a sparse space. An object is inserted through the put(key,userid) primitive and located through the get(key) primitive in a bounded number of routing hops.

The prior art routing algorithm of content-addressable systems is based on the Plaxton algorithm. The Plaxton algorithm was not designed for P2P systems, but rather for graphs with a static population. The main idea of the Plaxton algorithm is to correct a single digit of the address at each routing step. For example, user 1234 receives a message addressed to user 1278 (the first two digits of the address already match). The message is forwarded to user 1275 (since there the first three digits will match). To support this routing, each user maintains a data structure of logical neighbors whose userids match an i-digit prefix of its own userid, but differ in the (i+1)th digit. To maintain a connected system with N users, each user is connected to O(log N) neighbors. Since a single digit of the address is corrected each time the message is routed, the total length of the routing path is O(log N) hops.
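The digit-correcting step can be sketched in a few lines of Python. This is a simplified illustration, not the Plaxton algorithm as published: it assumes every user can see one shared list of userids, whereas in a real system each user consults only its own table of logical neighbors. The names `shared_prefix_len`, `next_hop` and `route`, and userid 5621, are hypothetical.

```python
from typing import List, Optional

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits on which two identifiers agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(current: str, target: str, known: List[str]) -> Optional[str]:
    """Pick a user whose id shares a strictly longer prefix with the target.

    Among the candidates, the one with the shortest such prefix is chosen,
    i.e. the one that corrects the fewest additional digits.
    """
    have = shared_prefix_len(current, target)
    candidates = [u for u in known if shared_prefix_len(u, target) > have]
    return min(candidates, key=lambda u: shared_prefix_len(u, target),
               default=None)

def route(source: str, target: str, known: List[str]) -> List[str]:
    """Follow next_hop until the target is reached or no progress is possible."""
    path = [source]
    while path[-1] != target:
        hop = next_hop(path[-1], target, known)
        if hop is None:
            break
        path.append(hop)
    return path

users = ["1234", "1275", "1278", "5621"]
path = route("1234", "1278", users)
```

With the userids from the example in the text, the message travels from 1234 to 1275 and then to 1278, matching one more digit of the target at each hop.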

In a P2P implementation of the Plaxton algorithm, the routing repeatedly forwards the message to a user whose userid is closer to the message key than that of the current user. Although each of the above DHT-based systems (Pastry, CAN, and Chord) employs a slightly different variant of Plaxton routing, they all outperform the routing algorithms used in the first generation of P2P systems. Their communication overhead is significantly lower because the messages are routed to the relevant users only.

However, DHT-based systems rely on the hashing primitives put(key,userid) and get(key). Thus, one of their major limitations is that they support exact-match searches only. For example, when two similar but not identical keys, key_1 and key_2, are inserted into the system, their hash values will usually be different. Therefore only a search specifying the exact terms used when the key was inserted will succeed in finding it, and an approximate search cannot be performed.
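The exact-match limitation can be shown with a toy, single-process stand-in for a DHT. The class `ToyDHT`, the helper `key_id`, the choice of SHA-1 and the sample keys are assumptions made for illustration; only the put(key,userid) and get(key) primitives come from the text.

```python
import hashlib

def key_id(key: str) -> int:
    """Hash a textual key to a numeric identifier (SHA-1 assumed here)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")

class ToyDHT:
    """Local dictionary standing in for the distributed table."""

    def __init__(self):
        self._table = {}

    def put(self, key: str, userid: str) -> None:
        # The object is stored under the hash of the whole key.
        self._table[key_id(key)] = userid

    def get(self, key: str):
        # Lookup succeeds only if the hashed key is identical.
        return self._table.get(key_id(key))

dht = ToyDHT()
dht.put("red ford focus 2004", "user-17")
found = dht.get("red ford focus 2004")   # same terms as at insertion
missed = dht.get("red ford focus")       # similar key, different hash
```

The second lookup returns nothing even though the key differs only by one missing term, because hashing destroys any notion of similarity between keys.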

Certain applications such as file sharing, yellow pages, classifieds, dating services, bulletin boards and others, allow users to publish electronic documents and also search for electronic documents. Frequently, electronic documents are classified and searched for according to predetermined criteria imposed by the service provider or application.

For example, a yellow-pages classifieds type application requires dynamic management (i.e., insertion and search) of general-purpose E-Commerce advertisements (ads). Its infrastructure needs to be capable of managing E-Commerce ads of both supply and demand types. Supply ads are ads in which users offer a product or a service in exchange for payment, whereas in demand ads users seek a product or a service provided by other users. The main functionality of the system is to identify matches between appropriate demand and supply ads. This is further referred to as publish-locate functionality.

In common E-Commerce systems, a user publishing or searching for an ad is usually required to fill in a predefined form describing the subject of the ad. For example, a user searching for a car ad can be asked to fill in a form containing basic fields such as manufacturer name, geographical location, and price range. More sophisticated systems can ask a user to fill in a more complicated form containing further fields such as the manufacturer name, model, range of production years, gearbox type, engine volume and number of previous owners.

This approach, which exploits predefined forms containing a set of features or attributes describing the objects, is further referred to as the ontology-based approach. An ontology is a formal explicit specification of a particular domain. It provides both human-understandable and machine-processable mechanisms, allowing enterprises and application systems to collaborate in a smart way. Thus, the set of attributes describing the objects from a particular domain is considered the ontology of that domain, and the attributes are the slots of the ontology.

HyperCuP proposed a flexible ontology-based hypercube topology for P2P data management. It used a global predefined ontology to classify users as providers of particular information associated with the ontology slots. This classification determined the position of the user in the hypercube and allowed any desired information to be located in a bounded number of steps through semantic routing. Thus, HyperCuP formed an alternative to distributed hashing, flooding, local routing tables and other fundamental search techniques in P2P. Additionally, HyperCuP proposed decentralized algorithms capable of constructing and maintaining a connected hypercube graph that remains stable under dynamic joins and departures of users.

In HyperCuP, the users are connected in a hypercube-like P2P structure, chosen for its logarithmic diameter, increased fault tolerance and a symmetry that guarantees equal load on the users. The hypercube dimension d and the range of possible values in each dimension (further referred to as the coordinates range) k determine the maximal number of users connected to the hypercube. A complete hypercube contains at most $N_{max} = k^d$ users, where every user is connected to $k-1$ logical neighbors in each dimension, resulting in $N_n = (k-1)d$ neighbors. This topology generates a symmetric structure where the load of the users in the system is similar, as each user holds roughly equal functionality.
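The capacity and neighbor counts above can be checked with a short Python sketch. Representing a node as a tuple of d coordinates, each in range(k), is an assumption of this example, as are the function names `capacity` and `neighbors`.

```python
from itertools import product

def capacity(k: int, d: int) -> int:
    """Maximal number of users in a complete hypercube: k to the power d."""
    return k ** d

def neighbors(node: tuple, k: int) -> list:
    """All nodes that differ from `node` in exactly one coordinate."""
    result = []
    for dim, coord in enumerate(node):
        for other in range(k):
            if other != coord:
                result.append(node[:dim] + (other,) + node[dim + 1:])
    return result

# Every node of a complete hypercube with k=3, d=2 has (k-1)*d = 4 neighbors.
k, d = 3, 2
all_nodes = list(product(range(k), repeat=d))
degrees = {len(neighbors(n, k)) for n in all_nodes}
```

For k=2 and d=3 the same functions give 8 nodes with 3 neighbors each, which is the configuration used in the broadcast example that follows in the text.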

Any edge in the hypercube, connecting a pair of adjacent users X and Y, is assigned a numeric value, referred to as its rank. When a node Y is a logical neighbor of X over a dimension i, the rank of the edge connecting X and Y is i. Thus, the edge ranks range from 0 to d-1. Any user T in the hypercube can act as the initiator of a search or broadcast operation, which is performed as follows: the message, together with the rank of the connecting edge, is sent to all the neighbors of the initiating user T. Upon receiving a message, a user forwards it only over the edges whose ranks are higher than the rank of the edge over which the message was received. This guarantees that each user in the graph will receive the message exactly once, and also that any connected user will be reached in O(d) routing hops.

The following example refers to a hypercube with dimension d=3 and coordinates range k=2. Eight users numbered from 1 to 8 are connected and form a complete hypercube. Each user is connected to exactly d=3 logical neighbors, and the ranks of the connecting edges are 0, 1, or 2. For example, with respect to user 8, user 1 is regarded as the 0-neighbor, user 3 is the 1-neighbor, and user 4 is the 2-neighbor. Let user 8 be the initiator of the broadcast operation. The messages are sent to its neighbors, i.e., users 1, 3 and 4. Upon receiving a message over the 0-rank edge, user 1 forwards it to its 1- and 2-neighbors, i.e., users 2 and 5. User 3 forwards it to its 2-neighbor, user 6, and this broadcast through "higher-rank forwards" continues until all the nodes in the hypercube are reached. No node receives the broadcast message more than once, and the longest path in this hypercube is d=3 hops long.
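The higher-rank forwarding rule can be simulated with a minimal Python sketch for the binary case k=2. Nodes are labeled here by the integers 0 to 2^d - 1, with the neighbor over dimension i obtained by flipping bit i; this labeling is an assumption of the example and does not correspond to the user numbers 1 to 8 used in the text.

```python
from collections import Counter, deque

def flip(node: int, dim: int) -> int:
    """Neighbor of `node` across dimension `dim` in a binary hypercube."""
    return node ^ (1 << dim)

def broadcast(initiator: int, d: int):
    """Simulate the broadcast; return per-node delivery counts and max hops."""
    received = Counter()
    max_hops = 0
    # The initiator sends the message over every edge, tagged with its rank.
    queue = deque((flip(initiator, rank), rank, 1) for rank in range(d))
    while queue:
        node, rank, hops = queue.popleft()
        received[node] += 1
        max_hops = max(max_hops, hops)
        # Forward only over edges whose rank is higher than the incoming one.
        for higher in range(rank + 1, d):
            queue.append((flip(node, higher), higher, hops + 1))
    return received, max_hops

received, max_hops = broadcast(initiator=0, d=3)
```

In this simulation every node other than the initiator is delivered the message once, and the longest delivery path has d hops, in line with the guarantees stated above.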

HyperCuP also proposes a dynamic P2P algorithm for hypercube construction and maintenance. The algorithm is based on the idea that a user can manage not just a single node, but several nodes in the hypercube graph. This is required in order to simulate the users that are missing from the topology of the next complete hypercube, which is maintained in any topology state. For example, node 4 can simulate three missing nodes of the hypercube. The simulations are noted by the dashed edges 1-4, 2-4 and 3-4, as node 4 acts as a logical neighbor of nodes 1, 2 and 3.

When a new user connects to the network, the user takes its place (according to the data provided) in the next complete hypercube, releases the user that previously managed the node and starts functioning as a real hypercube node. For example, if a new user that should be positioned at node 5 connects, it is routed to one of the logical neighbors of node 5, i.e., the users maintaining nodes 1 or 4. As the location of node 5 is simulated by the user maintaining node 4, the new user contacts the user of node 4, builds a real edge between them and takes over part of its functionality, i.e., builds a real edge with node 1 and starts simulating the neighbor of node 2.

When a user disconnects, one of the remaining neighbors takes the responsibility for the node, previously managed by the leaving user. Since the next complete hypercube is constantly maintained, previously discussed broadcast and search operations are not affected by sporadic joins and departures of users.

As already stated, users are classified as providers of a particular content. A single predefined ontology defining the domain semantics inherently organizes the users providing the same or similar contents in concept clusters through the above construction and maintenance algorithm. This facilitates querying the generated topology and efficiently routing the queries through the above routing algorithm. Note that when a query is built as a logical combination of the ontology slots, it is routed only to the users that can potentially answer it.

The following simple ontology of the cars domain can be used to construct a 3-dimensional HyperCuP (FIG. 1): dimension 0 distinguishes between manual(0), semi-automatic(1) and automatic(2) gearbox, dimension 1 stands for USA(0) or non-USA(1) produced cars, and dimension 2 for metallic(0) or non-metallic(1) color of the car. Clearly, a query for an automatic car produced in the USA is routed to nodes 1 and 5, as only those nodes store the requested class of cars.
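Resolving such a query to hypercube coordinates can be sketched as follows. The slot names ("gearbox", "origin", "color") and the function `matching_nodes` are assumptions of this example, and the coordinate tuples it returns are not the node numbers shown in FIG. 1.

```python
from itertools import product

# Each dimension of the hypercube is one ontology slot with its value range.
ONTOLOGY = [
    ("gearbox", ["manual", "semi-automatic", "automatic"]),  # dimension 0
    ("origin", ["USA", "non-USA"]),                          # dimension 1
    ("color", ["metallic", "non-metallic"]),                 # dimension 2
]

def matching_nodes(query: dict) -> list:
    """Coordinates of all nodes that can hold objects satisfying the query.

    A slot named in the query fixes that coordinate; an unnamed slot
    leaves the coordinate free, so every value in its range is included.
    """
    axes = []
    for feature, values in ONTOLOGY:
        if feature in query:
            axes.append([values.index(query[feature])])
        else:
            axes.append(range(len(values)))
    return list(product(*axes))

targets = matching_nodes({"gearbox": "automatic", "origin": "USA"})
```

The query for an automatic car produced in the USA fixes dimensions 0 and 1 and leaves the color dimension free, so exactly two nodes are addressed, one for each color value.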

However, the HyperCuP approach, which requires predefined ontologies, is applicable only to a limited set of domains and is inappropriate for a general-purpose E-Commerce system, which involves a dynamic and a priori unknown set of objects. A possible solution might be to allow users to add new types of ontological forms. However, this would flood the system with multiple (partially similar and overlapping) ontologies. A problematic case is a setting where a user publishing an object and a user looking for the same object use slightly different ontologies. Another solution might be to develop a single comprehensive ontology, comprising as many domain ontologies as possible. However, projecting this ontology on the hypercube would result in a huge, sparse and barely manageable structure. Moreover, because the single ontology is shared by all the users, it would be difficult to expand.

All these restrictions contradict the decentralized spirit of P2P networks and raise a need for developing a flexible mechanism for managing a dynamic set of ontologies. It would be desirable to have a system where the ontology could be only partially defined, and parts of it could be dynamically specified by the users. Looking at the previous example of a classifieds system, it would be desirable to enable a user to insert any new classification that he deems appropriate, for example, GPS system, DVD, etc., and on the other hand let users also search for any classification that they deem appropriate.
