Meta Data searches on the Gnutella Network (addendum)

-Sumeet Thadani, Lime Peer Technologies

 

This document is meant to be an addendum to the white paper we wrote about Meta Information searches on the Gnutella network. When that paper was first published, people on the GDF suggested broadly two enhancements:

  1. Replace GML Templates with XML Schemas.
  2. There were concerns about the size of the XML strings being sent out on the wire, and some sort of binary encoding was suggested.

 

After debating about the issues internally, we decided to accept the first point but not the second. The reasons for these independent decisions are highlighted below:

 

Schemas to replace GML

  1. GML is proprietary, XML schemas are an open standard.
  2. The XML schema approach is extendable GML is not.
  3. There are standard tools available to handle schemas, making development and addition of features easier.
  4. Schemas are capable of handling integrity constraints and other powerful query mechanisms like ranges; as well as a more types like dates and numbers. 

 

Continue to use pure XML rather than binary encoding

  1. Using a binary encoding goes against the spirit on using an open standard. We will benefit from the developments to the XML standard.
  2. If we switch to a binary encoding then we lose the ability to send out structured queries.
  3. It would be difficult to extend clients to handle more schemas as and when they are introduced. The binary scheme is not scalable to more types of schemas.
  4. Changing schemas would be very difficult with a binary encoding scheme. If for some reason we want to add or remove fields from a schema, the binary encoding scheme would require code changes and new version releases from various vendors, to handle this change in schema.
  5. Binary encoding does not have support for ranges and types, which XML with schemas provides.
  6. The difference in bandwidth usage between using pure XML and binary encoding is not really that much. We illustrate with some sample calculations below.

 

To expand on point 6 above I will use the example of a real query that is based on the audio schema. The audio schema is available (among others) in appendix A. Just for clarity I will mention the names of the fields present in the audio schema – title, artist, album, track, genre, type, seconds, year, language, SHA1, bitrate, price, links, comments. All these fields are at the same level (just below the root in the DOM tree).

 

It is important to note that it is possible to encode queries pertaining to audio schema with a binary encoding, only because it has a flat structure. It would be much harder to represent queries pertaining to other (not flat) schemas using the binary encoding scheme.

 

This is the XML that goes out between the two nulls in the Query message. A typical query is (I will be using this sample query in the rest of the paper to illustrate the various points I am talking about):

 

<audio schema=”http://www.limewire.com/schemas/audio.xsd”>

<title>Point in Blast</title>

<artist>Bowling Grapes</artist>

<genre>Classic Rock</genre>

</audio>

 

The above rich query contains 158 characters. An important feature is that the query only contains the fields on the basis of which the user is interested in searching. Note that since the user did not specify the language, there is no language tag in the query XML string.

 

Most alternative proposals on the GDF have suggested replacing pure XML with some sort of delimiter, between the values. So, the same rich query with the delimiter scheme would look like this (I am assuming the delimiter is the pipe ( | )):

 

http://www.limewire.com/schemas/audio.xsd | Point in Blast | Bowling Grapes | | | Classic Rock | | | | | | | | | | |

     

We need all these delimiters because all the fields appear at a particular position in the XML, and the delimiters are the only way of indexing the values to particular fields. The corresponding rich query with the binary encoding contains 97 characters.

 

Lets assume that the text query part of the message is a string about 10 bytes long.

 

If binary encoding were used, the size of the query would be (22+2+10+97 = )131 bytes.

 

If the pure XML encoding is used the size of the query will be (22+2+10+158 =) 192 bytes.

 

So by using a binary encoding, we use approximately 70% of the bandwidth we would be using if we were using pure XML.

 

We conclude that its possible to save bandwidth worth 30% of query and reply traffic, if we were to sacrifice all the good features that pure XML offers.

 

The LimeWire team suggests that Gnutella clients spend the extra 30% bandwidth on query and reply traffic and use an open, well defined and extensible standard.

 

Here are some possible alternatives to reduce the amount of data sent out on the wire:

a)      We would increase the emphasis on using attributes rather than elements wherever possible. This will reduce the size of the XML considerably. For instance if we were using this approach, the query would look like this:

 

<audio schema=”http://www.limewire.com/schemas/audio.xsd

 title=”Point in Blast”

artist=”Bowling Grapes”

genre=”Classic Rock”

 />

This format would only use 131 bytes instead of 158 in the example above. The advantage is that it saves bandwidth and is a valid XML, however, the problem here is that it s possible to do this trick only so long the data has a flat structure. It will not work well if the data has structure.

 

b)     The second alternative is to reduce the size of the tag names, so the above query would look like this:

 

<aud schema=”http://www.limewire.com/schemas/audio.xsd”>

<ti>Point in Blast</ti>

<ar>Bowling Grapes</ar>

<ge>Classic Rock</ge>

</aud>

With this approach the size of the XML part is 134 bytes. This is a good saving, however the problem with this approach is that the all clients would have to do some work to translate ti into title. This technique is a compromise between a full XML approach and a full binary encoding approach.

 

c)      Yet another possible approach is to have a combination of approaches a) and b) above. So the XML of the query will look like this :

<aud schema=”http://www.limewire.com/schemas/audio.xsd

 ti=”Point in Blast”

ar=”Bowling Grapes”

ge=”Classic Rock”

 />

 

With this approach the size of the XML is only 119 bytes. But this approach will suffer from the drawbacks of both a) and b) schemes above.

 

It is obvious that for query replies, the percentage of extra bandwidth spent in using pure XML versus using binary encoding would roughly match the percentage of extra bandwidth used for queries. In other words, if query replies send out responses in pure XML they would use roughly 30% more bandwidth than if query replies used binary encoding.

 

Rich Query and Rich Reply structure

 

A sample query was shown in the previous section. I also mentioned that it does not contain fields that are not part of the query. For instance the query did not contain entries for language and seconds and other fields the user is not interested in searching on.

 

The user annotates the various files in her library. The annotation about the files will be done as per a certain schema. When a rich query comes to a user, the user searches the various XML documents that she has, pertaining to the schema of the query and returns the documents that match with the rich query.

 

One response in a query reply looks like this (this response is from a query reply that was sent in response to the sample query above):

 

<?xml version="1.0"?>

<audio xsi:noNamespaceSchemaLocation="http://www.limewire.com/schemas/audio.xsd" identifier="C:\shared\Bowling Grapes - Point in Blast.mp3">

<title>Point in Blast</title>

<artist>Bowling Grapes</artist>

<track>11</track>

<genre>Classic Rock</genre>

<SHA1>B48D73E83C95FC857EC748ED7328DCBA</SHA1>

<bitrate>192</bitrate>

<price>$12.50</price>

<comments>I like this version of the song!!</comments>

</audio>

 

Notice that the query only specified three fields (artist, title and genre) but the response contains a lot more fields (title, artist, genre, track no, SHA1, bit-rate, price, comments); but the query reply does not contain values for all the fields in the schema, however the reply does contain values for all fields that were populated by the user.

 

Other issues

 

Here are some open-ended questions for which we have no answers yet. Comments and suggestions are welcome.

 

Where in the Query and QueryReply should the XML go?

 

After thinking about it we have come to the conclusion, there are three places where we can stick the meta-data.

  1. In between the nulls in the query and the query-reply. This is the scheme we discussed in our last paper.
  2. We could put the meta-data in the QHD. The problem with this approach is that the bundle of XML in the QHD has to be indexed to the files in the responses of the query reply.

The major benefit of this approach is that when all the meta-data is bundled together, and that allows us to avoid repeating common information across the various responses in a query reply. Our calculations indicate that we can save about 20% bandwidth usage by using this technique.

  1. We could put a link to the XML data in the QHD. The client that receives the reply would then have to obtain the meta-data and associate it with the files in that query reply. Obtaining the meta-data will be done out of band (not through the Gnutella network – perhaps by making an HTTP request to the user who sent the reply). This alternative will put no extra burden of rich replies on the Gnutella network. The downside here is that in order to obtain all the meta-data the searching client will probably have to make many HTTP connections to all the clients that sent her replies. This may become unwieldy.

 

How do we deal with SHA1?

 

The SHA1 could be sent as part of every XML query and response. It could be an attribute of the root of the DOM document. So it would be at the same level as the schema identifier in the query and at the same level as the schema and file identifier level in the responses.

 

For clients that are XML enabled, the value of the SHA1 can easily be read, alternately the “SHA1=” string can be searched for within the double nulls, and clients can read the value.

 

How do various hosts get the schemas they need to understand and respond to XML queries?

 

We propose bundling some standard schemas with the client. The idea is get them off a website or some other easily accessible place, as and when they are available.

 

Is Query Routing for XML schemas possible?

 

We may be able to make entries in the hash-tables for the schemas. If a client does not know about a particular schema, it is unlikely that she will be able to respond to any queries of that schema. This schema could quickly degenerate to a broadcast scheme.

 

Another possibility is to not allow a user to do a text and rich search simultaneously. If the user decides to do a rich search, we can pick up certain fields from the rich query, and put those values in the text part of the query. The query routing will then automatically be done on the basis of these values. We will in effect be routing queries on the basis of the values of some rich fields’ values.

 

 

LimeWire believes that XML is the best way to go for rich searches. Internal testing shows that this scheme works well. The approach is based on strong standards and is stable and extendable. Competing alteratives use a little less bandwidth, but close out many options.

 

We are making efforts to reduce bandwidth usage by introducing schemes like, eliminating ping broadcasts, pong caching and query routing. We expect that the gains we make in those areas will be able to support some of the extra burden rich searches bring with it.

 

Appendix A

 

This is the audio schema.

 

<?xml version="1.0" encoding="UTF-8"?>

<schema targetNamespace="http://www.limewire.com/schemas/audio.xsd">

 

  <element name="audio" type="audioType"/>

 

  <complexType name="audioType">

    <all>

      <element name="title" type="string"/>

      <element name="artist" type="string"/>

      <element name="album" type="string"/>

      <element name="track" type="short"/>

      <element name="genre">

        <simpleType base="string">

          <enumeration value="Blues"/>

          <enumeration value="Classic Rock"/>

          <enumeration value="Country"/>

          <enumeration value="Dance"/>

          <enumeration value="Disco"/>

          <enumeration value="Funk"/>

          <enumeration value="Grunge"/>

          <enumeration value="Hip-Hop"/>

          <enumeration value="Jazz"/>

          <enumeration value="Metal"/>

          <enumeration value="New Age"/>

          <enumeration value="Oldies"/>

          <enumeration value="Other"/>

          <enumeration value="Pop"/>

          <enumeration value="R &amp; B"/>

          <enumeration value="Rap"/>

          <enumeration value="Reggae"/>

          <enumeration value="Rock"/>

          <enumeration value="Techno"/>

          <enumeration value="Industrial"/>

          <enumeration value="Alternative"/>

          <enumeration value="Ska"/>

          <enumeration value="Death Metal"/>

          <enumeration value="Pranks"/>

          <enumeration value="Soundtrack"/>

          <enumeration value="Euro-Techno"/>

          <enumeration value="Ambient"/>

          <enumeration value="Trip-Hop"/>

          <enumeration value="Vocal"/>

          <enumeration value="Jazz+Funk"/>

          <enumeration value="Fusion"/>

          <enumeration value="Trance"/>

          <enumeration value="Classical"/>

          <enumeration value="Instrumental"/>

          <enumeration value="Acid"/>

          <enumeration value="House"/>

          <enumeration value="Game"/>

          <enumeration value="Sound Clip"/>

          <enumeration value="Gospel"/>

          <enumeration value="Noise"/>

          <enumeration value="AlternRock"/>

          <enumeration value="Bass"/>

          <enumeration value="Soul"/>

          <enumeration value="Punk"/>

          <enumeration value="Space"/>

          <enumeration value="Meditative"/>

          <enumeration value="Instrumental Pop"/>

          <enumeration value="Instrumental Rock"/>

          <enumeration value="Ethnic"/>

          <enumeration value="Gothic"/>

          <enumeration value="Darkwave"/>

          <enumeration value="Techno-Industrial"/>

          <enumeration value="Electronic"/>

          <enumeration value="Pop-Folk"/>

          <enumeration value="Eurodance"/>

          <enumeration value="Dream"/>

          <enumeration value="Southern Rock"/>

          <enumeration value="Comedy"/>

          <enumeration value="Cult"/>

          <enumeration value="Gangsta"/>

          <enumeration value="Top 40"/>

          <enumeration value="Christian Rap"/>

          <enumeration value="Pop/Funk"/>

          <enumeration value="Jungle"/>

          <enumeration value="Native American"/>

          <enumeration value="Cabaret"/>

          <enumeration value="New Wave"/>

          <enumeration value="Psychadelic"/>

          <enumeration value="Rave"/>

          <enumeration value="Showtunes"/>

          <enumeration value="Trailer"/>

          <enumeration value="Lo-Fi"/>

          <enumeration value="Tribal"/>

          <enumeration value="Acid Punk"/>

          <enumeration value="Acid Jazz"/>

          <enumeration value="Polka"/>

          <enumeration value="Retro"/>

          <enumeration value="Musical"/>

          <enumeration value="Rock &amp; Roll"/>

          <enumeration value="Hard Rock"/>

          <enumeration value="Folk"/>

          <enumeration value="Folk-Rock"/>

          <enumeration value="National Folk"/>

          <enumeration value="Swing"/>

          <enumeration value="Fast Fusion"/>

          <enumeration value="Bebob"/>

          <enumeration value="Latin"/>

          <enumeration value="Revival"/>

          <enumeration value="Celtic"/>

          <enumeration value="Bluegrass"/>

          <enumeration value="Avantgarde"/>

          <enumeration value="Gothic Rock"/>

          <enumeration value="Progressive Rock"/>

          <enumeration value="Psychedelic Rock"/>

          <enumeration value="Symphonic Rock"/>

          <enumeration value="Slow Rock"/>

          <enumeration value="Big Band"/>

          <enumeration value="Chorus"/>

          <enumeration value="Easy Listening"/>

          <enumeration value="Acoustic"/>

          <enumeration value="Humour"/>

          <enumeration value="Speech"/>

          <enumeration value="Chanson"/>

          <enumeration value="Opera"/>

          <enumeration value="Chamber Music"/>

          <enumeration value="Sonata"/>

          <enumeration value="Symphony"/>

          <enumeration value="Booty Bass"/>

          <enumeration value="Primus"/>

          <enumeration value="Porn Groove"/>

          <enumeration value="Satire"/>

          <enumeration value="Slow Jam"/>

          <enumeration value="Club"/>

          <enumeration value="Tango"/>

          <enumeration value="Samba"/>

          <enumeration value="Folklore"/>

          <enumeration value="Ballad"/>

          <enumeration value="Power Ballad"/>

          <enumeration value="Rhythmic Soul"/>

          <enumeration value="Freestyle"/>

          <enumeration value="Duet"/>

          <enumeration value="Punk Rock"/>

          <enumeration value="Drum Solo"/>

          <enumeration value="A capella"/>

          <enumeration value="Euro-House"/>

          <enumeration value="Dance Hall"/>

        </simpleType>

      </element>

      <element name="type">

        <simpleType base="string">

                <enumeration value="Song"/>

                <enumeration value="Speech"/>

                <enumeration value="Audiobook"/>

                <enumeration value="Other"/>

        </simpleType>

      </element>

      <element name="seconds" type="int"/>

      <element name="year" type="year"/>

      <element name="language" type="language"/>

      <element name="SHA1" type="int"/>

      <element name="bitrate" type="short"/>

      <element name="price" type="string"/>

      <element name="link" type="uriReference"/>

      <element name="comments">

        <simpleType base="string">

          <maxInclusive value="100"/>

        </simpleType>

      </element>

    </all>

  </complexType>

 

</schema>

 

 

 

 

The video schema

 

<?xml version="1.0" encoding="UTF-8"?>

 

<schema targetNamespace="http://www.limewire.com/schemas/xerces/video.xsd">

 

  <element name="video" type="videoType"/>

 

  <complexType name="videoType">

    <element name="Title" type="string" minOccurs="0" maxOccurs="1"/>

    <element name="Director" type="string" minOccurs="0" maxOccurs="1"/>

    <element name="Producer" type="string" minOccurs="0" maxOccurs="2"/>

    <element name="Studio" type="short" minOccurs="0" maxOccurs="1"/>

    <element name="Stars" type="string" minOccurs="0" maxOccurs="3"/>

    <element name="Type" minOccurs="0" maxOccurs="1">

      <simpleType base="string">

        <enumeration value="Music Video"/>

        <enumeration value="Commercial"/>

        <enumeration value="Trailer"/>

        <enumeration value="Movie Clip"/>

        <enumeration value="Video Clip"/>

        <enumeration value="VHS Movie"/>

        <enumeration value="DVD Movie"/>

        <enumeration value="Other"/>

      </simpleType>

    </element>

    <element name="Minutes" type="decimal" minOccurs="0" maxOccurs="1"/>

    <element name="Size" minOccurs="0" maxOccurs="1">

      <complexType>

         <element name="Pixel Height" type="int"/>

         <element name="Pixel Width" type="int"/>

      </complexType>

    </element>

    <element name="Year" type="year" minOccurs="0" maxOccurs="1"/>

    <element name="Langauge" type="language" minOccurs="0" maxOccurs="1"/>

    <element name="Subtitles" type="language" minOccurs="0" maxOccurs="3"/>

    <element name="SHA1" type="int" minOccurs="0" maxOccurs="1"/>

    <element name="Rating" minOccurs="0" maxOccurs="1">

      <simpleType base="string">

        <enumeration value="G"/>

        <enumeration value="PG"/>

        <enumeration value="PG-13"/>

        <enumeration value="R"/>

        <enumeration value="NC-17"/>

        <enumeration value="NR"/>

      </simpleType>

    </element>

    <element name="Availability" type="string" minOccurs="0" maxOccurs="1"/>

    <element name="Price" type="string" minOccurs="0" maxOccurs="1"/>

    <element name="Shipping" type="string" minOccurs="0" maxOccurs="1"/>

    <element name="Link" type="uriReference" minOccurs="0" maxOccurs="1"/>

    <element name="Comments" minOccurs="0" maxOccurs="1">

      <simpleType base="string">

        <maxInclusive value="100"/>

      </simpleType>

    </element>

  </complexType>

 

</schema>

 

 

The book schema:

 

<?xml version="1.0" encoding="UTF-8"?>

 

<schema targetNamespace="http://www.limewire.com/schemas/xerces/book.xsd">

 

  <element name="book" type="bookType"/>

 

  <complexType name="bookType">

    <all>

      <element name="Title" type="string"/>

      <element name="Edition" type="short"/>

      <element name="Author" type="string"/>

      <element name="Publisher" type="string"/>

      <element name="Genre">

        <simpleType base="string">

          <enumeration value="Computers, Information, and Reference"/>

          <enumeration value="Philosophy and Psychology"/>

          <enumeration value="Religion"/>

          <enumeration value="Social Sciences"/>

          <enumeration value="Language"/>

          <enumeration value="Science"/>

          <enumeration value="Technology"/>

          <enumeration value="Arts and Recreation"/>

          <enumeration value="History and Geography"/>

          <enumeration value="General Fiction"/>

          <enumeration value="Classics"/>

          <enumeration value="Crime"/>

          <enumeration value="Science Fiction"/>

          <enumeration value="Poetry"/>

          <enumeration value="Children and Teens"/>

          <enumeration value="Other"/>

        </simpleType>

      </element>

      <element name="Subject" type="string"/>

      <element name="Pages" type="int"/>

      <element name="Year" type="year"/>

      <element name="Language" type="language"/>

      <element name="SHA1" type="int"/>

      <element name="ISBN">

        <simpleType base="int">

          <pattern value="\d{10}"/>

        </simpleType>

      </element>

      <element name="Dimensions">

        <complexType>

          <element name="Length" type="decimal"/>

          <element name="Width" type="decimal"/>

          <element name="Height" type="decimal"/>

        </complexType>

      </element>

      <element name="Back">

        <simpleType base="string">

          <enumeration value="Hardback"/>

          <enumeration value="Paperback"/>

        </simpleType>

      </element>

      <element name="Availability" type="string"/>

      <element name="Price" type="string"/>

      <element name="Shipping" type="string"/>

      <element name="Link" type="uriReference"/>

      <element name="Comments">

        <simpleType base="string">

          <maxInclusive value="100"/>

        </simpleType>

      </element>

    </all>

  </complexType>

 

</schema>