Meta Data searches on the Gnutella Network (addendum)
-Sumeet
Thadani, Lime Peer Technologies
This document is meant to be an addendum to the white paper we wrote about Meta Information searches on the Gnutella network. When that paper was first published, people on the GDF suggested broadly two enhancements:
After debating about the issues internally, we decided to accept the first point but not the second. The reasons for these independent decisions are highlighted below:
To expand on point 6 above I will use the example of a real query that is based on the audio schema. The audio schema is available (among others) in appendix A. Just for clarity I will mention the names of the fields present in the audio schema – title, artist, album, track, genre, type, seconds, year, language, SHA1, bitrate, price, links, comments. All these fields are at the same level (just below the root in the DOM tree).
It is important to note that it is possible to encode queries pertaining to audio schema with a binary encoding, only because it has a flat structure. It would be much harder to represent queries pertaining to other (not flat) schemas using the binary encoding scheme.
This is the XML that goes out between the two nulls in the Query message. A typical query is (I will be using this sample query in the rest of the paper to illustrate the various points I am talking about):
<audio schema=”http://www.limewire.com/schemas/audio.xsd”>
<title>Point in
Blast</title>
<artist>Bowling
Grapes</artist>
<genre>Classic
Rock</genre>
</audio>
The above rich query contains 158 characters. An important feature is that the query only contains the fields on the basis of which the user is interested in searching. Note that since the user did not specify the language, there is no language tag in the query XML string.
Most alternative proposals on the GDF have suggested replacing pure XML with some sort of delimiter, between the values. So, the same rich query with the delimiter scheme would look like this (I am assuming the delimiter is the pipe ( | )):
http://www.limewire.com/schemas/audio.xsd
| Point in Blast | Bowling Grapes | | | Classic Rock | | | | | | | | | |
|
We need all these delimiters because all the fields appear at a particular position in the XML, and the delimiters are the only way of indexing the values to particular fields. The corresponding rich query with the binary encoding contains 97 characters.
Lets assume that the text query part of the message is a string about 10 bytes long.
If binary encoding were used, the size of the query would be (22+2+10+97 = )131 bytes.
If the pure XML encoding is used the size of the query will be (22+2+10+158 =) 192 bytes.
So by using a binary encoding, we use approximately 70% of the bandwidth we would be using if we were using pure XML.
We conclude that its possible to save bandwidth worth 30% of query and reply traffic, if we were to sacrifice all the good features that pure XML offers.
The LimeWire team suggests that Gnutella clients spend the extra 30% bandwidth on query and reply traffic and use an open, well defined and extensible standard.
Here are some possible alternatives to reduce the amount of data sent out on the wire:
a) We would increase the emphasis on using attributes rather than elements wherever possible. This will reduce the size of the XML considerably. For instance if we were using this approach, the query would look like this:
<audio schema=”http://www.limewire.com/schemas/audio.xsd”
title=”Point in
Blast”
artist=”Bowling
Grapes”
genre=”Classic
Rock”
/>
This format would only use 131 bytes instead of 158 in the example above. The advantage is that it saves bandwidth and is a valid XML, however, the problem here is that it s possible to do this trick only so long the data has a flat structure. It will not work well if the data has structure.
b) The second alternative is to reduce the size of the tag names, so the above query would look like this:
<aud schema=”http://www.limewire.com/schemas/audio.xsd”>
<ti>Point in
Blast</ti>
<ar>Bowling
Grapes</ar>
<ge>Classic
Rock</ge>
</aud>
With this approach the size of the XML part is 134 bytes. This is a good saving, however the problem with this approach is that the all clients would have to do some work to translate ti into title. This technique is a compromise between a full XML approach and a full binary encoding approach.
c) Yet another possible approach is to have a combination of approaches a) and b) above. So the XML of the query will look like this :
<aud schema=”http://www.limewire.com/schemas/audio.xsd”
ti=”Point in Blast”
ar=”Bowling
Grapes”
ge=”Classic
Rock”
/>
With this approach the size of the XML is only 119 bytes. But this approach will suffer from the drawbacks of both a) and b) schemes above.
It is obvious that for query replies, the percentage of extra bandwidth spent in using pure XML versus using binary encoding would roughly match the percentage of extra bandwidth used for queries. In other words, if query replies send out responses in pure XML they would use roughly 30% more bandwidth than if query replies used binary encoding.
A sample query was shown in the previous section. I also mentioned that it does not contain fields that are not part of the query. For instance the query did not contain entries for language and seconds and other fields the user is not interested in searching on.
The user annotates the various files in her library. The annotation about the files will be done as per a certain schema. When a rich query comes to a user, the user searches the various XML documents that she has, pertaining to the schema of the query and returns the documents that match with the rich query.
One response in a query reply looks like this (this response is from a query reply that was sent in response to the sample query above):
<?xml
version="1.0"?>
<audio
xsi:noNamespaceSchemaLocation="http://www.limewire.com/schemas/audio.xsd"
identifier="C:\shared\Bowling Grapes - Point in
Blast.mp3">
<title>Point in
Blast</title>
<artist>Bowling
Grapes</artist>
<track>11</track>
<genre>Classic
Rock</genre>
<SHA1>B48D73E83C95FC857EC748ED7328DCBA</SHA1>
<bitrate>192</bitrate>
<price>$12.50</price>
<comments>I like this
version of the song!!</comments>
</audio>
Notice that the query only specified three fields (artist, title and genre) but the response contains a lot more fields (title, artist, genre, track no, SHA1, bit-rate, price, comments); but the query reply does not contain values for all the fields in the schema, however the reply does contain values for all fields that were populated by the user.
Here are some open-ended questions for which we have no answers yet. Comments and suggestions are welcome.
Where in the Query and QueryReply should the XML
go?
After thinking about it we have come to the conclusion, there are three places where we can stick the meta-data.
The major benefit of this approach is that when all the meta-data is bundled together, and that allows us to avoid repeating common information across the various responses in a query reply. Our calculations indicate that we can save about 20% bandwidth usage by using this technique.
The SHA1 could be sent as part of every XML query and response. It could be an attribute of the root of the DOM document. So it would be at the same level as the schema identifier in the query and at the same level as the schema and file identifier level in the responses.
For clients that are XML enabled, the value of the SHA1 can easily be read, alternately the “SHA1=” string can be searched for within the double nulls, and clients can read the value.
How do various hosts get the schemas they need to
understand and respond to XML queries?
We propose bundling some
standard schemas with the client. The idea is get them off a website or some
other easily accessible place, as and when they are available.
We may be able to make entries in the hash-tables for the schemas. If a client does not know about a particular schema, it is unlikely that she will be able to respond to any queries of that schema. This schema could quickly degenerate to a broadcast scheme.
Another possibility is to not allow a user to do a text and rich search simultaneously. If the user decides to do a rich search, we can pick up certain fields from the rich query, and put those values in the text part of the query. The query routing will then automatically be done on the basis of these values. We will in effect be routing queries on the basis of the values of some rich fields’ values.
LimeWire believes that XML is the best way to go for rich searches. Internal testing shows that this scheme works well. The approach is based on strong standards and is stable and extendable. Competing alteratives use a little less bandwidth, but close out many options.
We are making efforts to reduce bandwidth usage by introducing schemes like, eliminating ping broadcasts, pong caching and query routing. We expect that the gains we make in those areas will be able to support some of the extra burden rich searches bring with it.
This is the audio schema.
<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.limewire.com/schemas/audio.xsd">
<element name="audio" type="audioType"/>
<complexType name="audioType">
<all>
<element name="title" type="string"/>
<element name="artist" type="string"/>
<element name="album" type="string"/>
<element name="track" type="short"/>
<element name="genre">
<simpleType base="string">
<enumeration value="Blues"/>
<enumeration value="Classic Rock"/>
<enumeration value="Country"/>
<enumeration value="Dance"/>
<enumeration value="Disco"/>
<enumeration value="Funk"/>
<enumeration value="Grunge"/>
<enumeration value="Hip-Hop"/>
<enumeration value="Jazz"/>
<enumeration value="Metal"/>
<enumeration value="New Age"/>
<enumeration value="Oldies"/>
<enumeration value="Other"/>
<enumeration value="Pop"/>
<enumeration value="R & B"/>
<enumeration value="Rap"/>
<enumeration value="Reggae"/>
<enumeration value="Rock"/>
<enumeration value="Techno"/>
<enumeration value="Industrial"/>
<enumeration value="Alternative"/>
<enumeration value="Ska"/>
<enumeration value="Death Metal"/>
<enumeration value="Pranks"/>
<enumeration value="Soundtrack"/>
<enumeration value="Euro-Techno"/>
<enumeration value="Ambient"/>
<enumeration value="Trip-Hop"/>
<enumeration value="Vocal"/>
<enumeration value="Jazz+Funk"/>
<enumeration value="Fusion"/>
<enumeration value="Trance"/>
<enumeration value="Classical"/>
<enumeration value="Instrumental"/>
<enumeration value="Acid"/>
<enumeration value="House"/>
<enumeration value="Game"/>
<enumeration value="Sound Clip"/>
<enumeration value="Gospel"/>
<enumeration value="Noise"/>
<enumeration value="AlternRock"/>
<enumeration value="Bass"/>
<enumeration value="Soul"/>
<enumeration value="Punk"/>
<enumeration value="Space"/>
<enumeration value="Meditative"/>
<enumeration value="Instrumental Pop"/>
<enumeration value="Instrumental Rock"/>
<enumeration value="Ethnic"/>
<enumeration value="Gothic"/>
<enumeration value="Darkwave"/>
<enumeration value="Techno-Industrial"/>
<enumeration value="Electronic"/>
<enumeration value="Pop-Folk"/>
<enumeration value="Eurodance"/>
<enumeration value="Dream"/>
<enumeration value="Southern Rock"/>
<enumeration value="Comedy"/>
<enumeration value="Cult"/>
<enumeration value="Gangsta"/>
<enumeration value="Top 40"/>
<enumeration value="Christian Rap"/>
<enumeration value="Pop/Funk"/>
<enumeration value="Jungle"/>
<enumeration value="Native American"/>
<enumeration value="Cabaret"/>
<enumeration value="New Wave"/>
<enumeration value="Psychadelic"/>
<enumeration value="Rave"/>
<enumeration value="Showtunes"/>
<enumeration value="Trailer"/>
<enumeration value="Lo-Fi"/>
<enumeration value="Tribal"/>
<enumeration value="Acid Punk"/>
<enumeration value="Acid Jazz"/>
<enumeration value="Polka"/>
<enumeration value="Retro"/>
<enumeration value="Musical"/>
<enumeration value="Rock & Roll"/>
<enumeration value="Hard Rock"/>
<enumeration value="Folk"/>
<enumeration value="Folk-Rock"/>
<enumeration value="National Folk"/>
<enumeration value="Swing"/>
<enumeration value="Fast Fusion"/>
<enumeration value="Bebob"/>
<enumeration value="Latin"/>
<enumeration value="Revival"/>
<enumeration value="Celtic"/>
<enumeration value="Bluegrass"/>
<enumeration value="Avantgarde"/>
<enumeration value="Gothic Rock"/>
<enumeration value="Progressive Rock"/>
<enumeration value="Psychedelic Rock"/>
<enumeration value="Symphonic Rock"/>
<enumeration value="Slow Rock"/>
<enumeration value="Big Band"/>
<enumeration value="Chorus"/>
<enumeration value="Easy Listening"/>
<enumeration value="Acoustic"/>
<enumeration value="Humour"/>
<enumeration value="Speech"/>
<enumeration value="Chanson"/>
<enumeration value="Opera"/>
<enumeration value="Chamber Music"/>
<enumeration value="Sonata"/>
<enumeration value="Symphony"/>
<enumeration value="Booty Bass"/>
<enumeration value="Primus"/>
<enumeration value="Porn Groove"/>
<enumeration value="Satire"/>
<enumeration value="Slow Jam"/>
<enumeration value="Club"/>
<enumeration value="Tango"/>
<enumeration value="Samba"/>
<enumeration value="Folklore"/>
<enumeration value="Ballad"/>
<enumeration value="Power Ballad"/>
<enumeration value="Rhythmic Soul"/>
<enumeration value="Freestyle"/>
<enumeration value="Duet"/>
<enumeration value="Punk Rock"/>
<enumeration value="Drum Solo"/>
<enumeration value="A capella"/>
<enumeration value="Euro-House"/>
<enumeration value="Dance Hall"/>
</simpleType>
</element>
<element name="type">
<simpleType base="string">
<enumeration value="Song"/>
<enumeration value="Speech"/>
<enumeration value="Audiobook"/>
<enumeration value="Other"/>
</simpleType>
</element>
<element name="seconds" type="int"/>
<element name="year" type="year"/>
<element name="language" type="language"/>
<element name="SHA1" type="int"/>
<element name="bitrate" type="short"/>
<element name="price" type="string"/>
<element name="link" type="uriReference"/>
<element name="comments">
<simpleType base="string">
<maxInclusive value="100"/>
</simpleType>
</element>
</all>
</complexType>
</schema>
The video schema
<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.limewire.com/schemas/xerces/video.xsd">
<element name="video" type="videoType"/>
<complexType name="videoType">
<element name="Title" type="string" minOccurs="0" maxOccurs="1"/>
<element name="Director" type="string" minOccurs="0" maxOccurs="1"/>
<element name="Producer" type="string" minOccurs="0" maxOccurs="2"/>
<element name="Studio" type="short" minOccurs="0" maxOccurs="1"/>
<element name="Stars" type="string" minOccurs="0" maxOccurs="3"/>
<element name="Type" minOccurs="0" maxOccurs="1">
<simpleType base="string">
<enumeration value="Music Video"/>
<enumeration value="Commercial"/>
<enumeration value="Trailer"/>
<enumeration value="Movie Clip"/>
<enumeration value="Video Clip"/>
<enumeration value="VHS Movie"/>
<enumeration value="DVD Movie"/>
<enumeration value="Other"/>
</simpleType>
</element>
<element name="Minutes" type="decimal" minOccurs="0" maxOccurs="1"/>
<element name="Size" minOccurs="0" maxOccurs="1">
<complexType>
<element name="Pixel Height" type="int"/>
<element name="Pixel Width" type="int"/>
</complexType>
</element>
<element name="Year" type="year" minOccurs="0" maxOccurs="1"/>
<element name="Langauge" type="language" minOccurs="0" maxOccurs="1"/>
<element name="Subtitles" type="language" minOccurs="0" maxOccurs="3"/>
<element name="SHA1" type="int" minOccurs="0" maxOccurs="1"/>
<element name="Rating" minOccurs="0" maxOccurs="1">
<simpleType base="string">
<enumeration value="G"/>
<enumeration value="PG"/>
<enumeration value="PG-13"/>
<enumeration value="R"/>
<enumeration value="NC-17"/>
<enumeration value="NR"/>
</simpleType>
</element>
<element name="Availability" type="string" minOccurs="0" maxOccurs="1"/>
<element name="Price" type="string" minOccurs="0" maxOccurs="1"/>
<element name="Shipping" type="string" minOccurs="0" maxOccurs="1"/>
<element name="Link" type="uriReference" minOccurs="0" maxOccurs="1"/>
<element name="Comments" minOccurs="0" maxOccurs="1">
<simpleType base="string">
<maxInclusive value="100"/>
</simpleType>
</element>
</complexType>
</schema>
The book schema:
<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.limewire.com/schemas/xerces/book.xsd">
<element name="book" type="bookType"/>
<complexType name="bookType">
<all>
<element name="Title" type="string"/>
<element name="Edition" type="short"/>
<element name="Author" type="string"/>
<element name="Publisher" type="string"/>
<element name="Genre">
<simpleType base="string">
<enumeration value="Computers, Information, and Reference"/>
<enumeration value="Philosophy and Psychology"/>
<enumeration value="Religion"/>
<enumeration value="Social Sciences"/>
<enumeration value="Language"/>
<enumeration value="Science"/>
<enumeration value="Technology"/>
<enumeration value="Arts and Recreation"/>
<enumeration value="History and Geography"/>
<enumeration value="General Fiction"/>
<enumeration value="Classics"/>
<enumeration value="Crime"/>
<enumeration value="Science Fiction"/>
<enumeration value="Poetry"/>
<enumeration value="Children and Teens"/>
<enumeration value="Other"/>
</simpleType>
</element>
<element name="Subject" type="string"/>
<element name="Pages" type="int"/>
<element name="Year" type="year"/>
<element name="Language" type="language"/>
<element name="SHA1" type="int"/>
<element name="ISBN">
<simpleType base="int">
<pattern value="\d{10}"/>
</simpleType>
</element>
<element name="Dimensions">
<complexType>
<element name="Length" type="decimal"/>
<element name="Width" type="decimal"/>
<element name="Height" type="decimal"/>
</complexType>
</element>
<element name="Back">
<simpleType base="string">
<enumeration value="Hardback"/>
<enumeration value="Paperback"/>
</simpleType>
</element>
<element name="Availability" type="string"/>
<element name="Price" type="string"/>
<element name="Shipping" type="string"/>
<element name="Link" type="uriReference"/>
<element name="Comments">
<simpleType base="string">
<maxInclusive value="100"/>
</simpleType>
</element>
</all>
</complexType>
</schema>