Friday, October 31, 2008

RESTful Query URLs

The last couple of days I've been working on writing a RESTful JSON document database. While a number of these already exist (CouchDB, FeatherDB, DovetailDB, Persevere, JSONStore, etc.), I decided to write my own because I wanted a bit more control over the URL scheme used by the REST interface, and I needed the ability to tweak the search functionality to achieve decent performance on some common but complicated queries. All in all it was an interesting diversion. The actual server clocked in at about 1000 SLOC, with much of that boilerplate because I wrote it in Java/JDBC rather than Groovy/GroovySQL.

The most interesting problem came in designing the query scheme for the REST interface. There seem to be a couple of different ways to implement it, with no real consensus as to which is the "right" way. As with most things, I suspect it depends on how you've implemented other pieces of the architecture, and even on personal preference. Below I describe three approaches I considered. The nice thing with REST is that there's nothing stopping you from implementing all of these approaches in your interface.

NB: I'm no REST expert, so what follows are my own observations rather than established best practices. I'd love for anyone who knows better to chime in.

POST query parameters/document
In this approach, you provide a search endpoint, say something unoriginal like '/search', and queries are POSTed to that URI. The query is either a set of form-encoded key-value pairs or a search document that uses a schema shared between the client and server.

This approach seems closer to RPC than REST to me, but it may be the best choice if your search functionality requires a more complex exchange of information than simple key-value pairs allow. The obvious downside is that there is no way to bookmark a query or to email/IM a query to someone else. It also can't take advantage of the caching built into the HTTP spec, since responses to POST are not cacheable unless the server explicitly marks them as such.
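As a rough sketch of what a client for this style might look like, the Python below builds (but does not send) a POST request whose body is a JSON search document. The host example.org, the helper name build_search_request, and the fields in the search document are all assumptions made for illustration; they are not taken from the server described above.

```python
import json
from urllib.request import Request


def build_search_request(base_url, query):
    """Build (without sending) a POST request carrying a JSON search document."""
    body = json.dumps(query, sort_keys=True).encode("utf-8")
    return Request(
        url=base_url + "/search",  # the '/search' endpoint from the text
        data=body,                 # the whole query lives in the request body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical search document; the schema is whatever client and server agree on.
req = build_search_request("http://example.org", {"author": "Reed", "tags": ["rest", "json"]})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Note that the URL is the same for every query, which is why there is nothing useful to bookmark or share.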

GET query string
Similar to above, you expose a URI endpoint, possibly something like /search, and queries are sent to that endpoint with the parameters encoded in the query string of the URL, e.g. http://www.google.com/search?q=REST+query+string

This approach improves on the bookmarkability of searches, since all of the parameters are in the URL. However, the use of the query string may interfere with caching, as described in Section 13.9 of the HTTP spec (RFC 2616). Overall, I think there is nothing inherently un-RESTful about this approach, especially if you provide more resource-oriented URIs than /search, e.g. /documents?author=Reed. In my head, I interpret the latter as "give me all of the document resources, but filter on the author Reed." Removing the query string will still give you a resource (or a collection of resources, in this case).
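To make the "filter on a collection" reading concrete, here is a small Python sketch that adds filters to a resource URL as a query string and strips them off again. The helper names with_filters and without_filters and the host example.org are made up for this example; only the /documents?author=Reed shape comes from the text.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_filters(resource_url, **filters):
    """Return resource_url with the given filters encoded as its query string."""
    parts = urlsplit(resource_url)
    query = urlencode(sorted(filters.items()))  # sorted for a stable, shareable URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def without_filters(url):
    """Drop the query string, leaving the unfiltered collection resource."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


url = with_filters("http://example.org/documents", author="Reed")
filters = parse_qs(urlsplit(url).query)  # what a server would see
collection = without_filters(url)        # still a valid resource
print(url, filters, collection)
```

Because the whole query is in the URL, the result of with_filters can be bookmarked or pasted into an email as-is.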

Where this approach falls down is when you start trying to represent hierarchical or taxonomic queries with the query string, e.g. http://lifeforms.org?k=kingdom&p=phylum&c=class&o=order&f=family&g=genus&s=species as described on the RestWiki.

Encoding query parameters into the URI structure
In this approach the query parameters are encoded directly into the URI structure, e.g. /documents/authors/Reed, rather than using the query string. Another example is described at Stack Overflow.

This approach solves both the bookmarkability and the caching issues of the previous approaches, but can introduce some ambiguity, especially if your resources aren't strictly hierarchical in nature. The biggest stumbling block for me was this: looking at the URI /documents/authors/Reed, it's not immediately clear what will be returned. For example, if I sent you the URI /documents you might infer that you would get a list or the contents of some documents. From the URI /documents?author=Reed, you might infer that the resource(s) returned would be documents authored by Reed. So what might you expect to get from the URI /documents/authors/Reed? Information about the author Reed or all documents authored by Reed?
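The ambiguity can be shown with a toy Python example in which the same path is handed to two equally plausible handlers. The sample data and the function names read_as_filter and read_as_author are invented for illustration, and neither reading is claimed to be the correct one.

```python
# Made-up sample data; only the first title and the author name "Reed" come from this post.
DOCUMENTS = [
    {"id": 1, "title": "RESTful Query URLs", "author": "Reed"},
    {"id": 2, "title": "Hypothetical follow-up", "author": "Reed"},
    {"id": 3, "title": "Hypothetical other post", "author": "Doe"},
]


def segments(path):
    """Split '/documents/authors/Reed' into ['documents', 'authors', 'Reed']."""
    return [part for part in path.split("/") if part]


def read_as_filter(path):
    """Reading 1: the path filters the documents collection by author."""
    _collection, _field, name = segments(path)
    return [doc for doc in DOCUMENTS if doc["author"] == name]


def read_as_author(path):
    """Reading 2: the path names the author resource itself."""
    _collection, _field, name = segments(path)
    count = sum(1 for doc in DOCUMENTS if doc["author"] == name)
    return {"name": name, "document_count": count}


path = "/documents/authors/Reed"
docs = read_as_filter(path)    # a list of documents
author = read_as_author(path)  # a single author record
print(docs)
print(author)
```

Both handlers are reasonable, and nothing in the URI itself tells a client which one the server has chosen.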

How important is this? I guess it's really up to you. A machine likely infers about as much from /documents/authors/Reed as it does from /documents?author=Reed.

5 comments:

Chrigel said...

Hi Josh

Did you have a look at Sling (http://incubator.apache.org/sling) regarding RESTful repository?

Re your approaches: I humbly think it really depends on your personal choice. I personally would try to stay as long as possible in a hierarchy in the url and query for attributes of a resource, e.g.

/animals/mammals?size=large

instead of

/animals?type=mammals&size=large

Please note that different URLs may identify the same resource, i.e. it is possible to reach the same resource via another hierarchical taxonomy, e.g.

/animals/mammals/elephants
vs.
/animals/landbound/elephants
(yeah, stupid example i know)

Josh Reed said...

Hi Chrigel,

Thanks for the comment. I have seen Sling and I really like the fact that it's built with OSGi and JCR. I may have to dump my JSON into it and see how it handles them.

Check back if you're interested because I'm going to blog a bit more about the data and types of queries I'm working with.

Cheers,
Josh

dkubb said...

I usually set up hierarchical URLs first to support the basic needs of my app, like /articles/{title}/comments for example. However I also use query strings to scope the results further, eg: /articles/{title}/comments?author=John+Doe.

I prefer to use query strings for user-defined queries. One of my goals is to have my apps searchable via HTML forms using GET, so that constrains my approach somewhat (IMHO in a good way).

The query string caching issue you mentioned in RFC 2616 only applies to cases where you're not setting the Expires and Cache-Control headers. Since you really should be doing that anyway if you care about caching the URL, it doesn't seem to be too much of a barrier. Besides those two headers, I think it's also important to set the Last-Modified and ETag headers and to support validation, that is, to return 304 Not Modified when the search results are the same as for the previous request.

Josh Reed said...

Thanks for the comment, dkubb. I think you're right about the mixed approach. I'm going to blog a bit more about using the 3rd approach and what I liked/didn't like about it.

As for the caching, you're right about the headers as long as the client isn't behind some proxies. Does Squid cache URLs with query strings by default?

Cheers,
Josh

Naat said...

Greetings! Very useful advice in this particular post! It's the little changes that produce the most important changes. Thanks a lot for sharing!