" /> TechnicaLee Speaking: March 2009 Archives


March 16, 2009

Evolving standards, consensus, and the energy to get there

We’re doing something mildly interesting in the recently re-chartered SPARQL working group. We’re spending our first couple of months defining what our deliverables will be for the rest of our 18-month chartered lifetime. The charter gives us some suggestions on things to consider (update, aggregates, XML serialization for queries, and more) and some constraints to keep in mind (backwards compatibility), but beyond that it’s up to the group.

So we’ve started by gathering potential features. We solicited features—which can be language extensions, syntactic shortcuts, protocol enhancements, result format modifications, integrations with other technologies like XQuery, OWL, or RIF, new query serializations, and more—both from within the Working Group and from the broader community. Within a week or so, we had identified nearly 40 features, and I expect a few more to come in yet.

The problem is: all of these features would be helpful. My take on developer-oriented technology standards such as SPARQL is that ultimately they serve the users of the users of the implementations. There’s a pyramid here, wherein a small number of SPARQL implementations will support a larger number of developers creating SPARQL-driven software, which in turn does useful (sometimes amazing) things for a much larger set of end users. So ideally, we’d focus on the features that benefit the largest swaths of those end users.

But of course that’s tough to calculate. So there’s another way we can look at things: the whole pyramid balances precariously on the shoulders of implementers, and, in fact, the specifications are themselves written to be formal guides to producing interoperable implementations. If implementers can’t understand an extension or willfully choose not to add it to their implementations, then there wasn’t much point in standardizing it in the first place. This suggests that implementer guidance should be a prime factor in choosing what our Working Group should focus on. And that’s far more doable since many of the Working Group participants are themselves SPARQL implementers.

Yet, implementers’ priorities are not always tied to what’s most useful for SPARQL users and SPARQL users’ users. (This can be for a wide variety of reasons, not the least of which is that feedback on what’s important for implementers’ users’ users often loses something in the multiple layers of communication that relay it to implementers.) So what about that middle category, SPARQL users/developers? These fine folks have the most direct experience with SPARQL’s capabilities, caveats, and shortcomings in solving different classes of problems as they apply to their users’ business/scientific/social/consumer problems. SPARQL users can and will surely contribute valuable experience along the lines of which extensions might make SPARQL easier to learn, easier to use, more powerful, and more productive when building solutions on the Semantic Web technology stack.

The difficulty here is that it’s often very, very hard for SPARQL developers to be selective in what features they’d like to see added to the landscape. SPARQL is their toolbox, and from their perspective (and understandably so), there’s little downside in stuffing as many tools as possible into SPARQL, just in case.

Things get more complicated. I (very) often joke (and will now write down for the first time) that if you get 10 Semantic Web advocates in a room, you’ll probably have 15 or 20 opinions as to what the Semantic Web is and what it’s for. When we zoom in on just the SPARQL corner of the Semantic Web world, things are no different. Some people are using SPARQL to query large knowledge bases. Some people are using SPARQL to answer ontologically-informed queries. Some people are using SPARQL to query an emerging Web of linked data. Some people are using SPARQL for business intelligence. Some people are using SPARQL in XML pipelines. Some people are using SPARQL as a de facto rules language. Some people are using SPARQL as a federated query language. And much more. No wonder, then, that the Working Group might have difficulty reaching consensus on a significantly whittled-down list of features to standardize.

Why not do it all? Or, at least, why not come up with some sort of priority list for all of the features and work our way down that one at a time? It’s tempting, given the high quality of the suggestions, but I’m pretty sure it’s not feasible. Different groups of features interact with each other in different ways, and it’s exactly these interactions that need to be formally written down in a specification. Furthermore, the W3C process requires that as we enter and exit the Candidate Recommendation stage we demonstrate multiple interoperable implementations of our specifications—this becomes extremely challenging to achieve when the language, protocol, etc. are constantly moving targets. Add to that the need to build test cases, gather substantive reviews from inside and outside the Working Group, and (where appropriate) work together with other Working Groups. Now consider that Working Group participants are (for the most part) giving no more than 20% of their time to the Working Group. Believe me, 18 months flies by.

So what do I think is reasonable? I think we’ll have done great work if we produce high quality specifications for maybe three, four, or five new SPARQL features/extensions. That’s it.

(I’m not against prioritizing some others on the chance that my time estimates are way off; that seems prudent to me. And I also recognize that we’ve got some completely orthogonal extensions that can easily be worked on in parallel with one another. So there’s some wiggle room. But I hold a pretty firm conviction that the vast majority of the features that have been suggested are going to end up on the proverbial cutting-room floor.)

Here’s what I (personally) think should go into our decisions of what features to standardize:

  • Implementation experience. It’s easy to get in trouble when a Working Group resorts to design-by-committee; I prefer features that already exist in multiple, independent implementations. (They need not be interoperable already, of course: that’s what standards work is for!)
  • Enabling value. I’m more interested in working on features that enable capabilities that don’t already exist within SPARQL, compared to those features which are largely about making things easier. I’m also interested in working on those extensions that help substantial communities of SPARQL users (and, as above, their users). But in some cases this criterion may be trumped by…
  • Ease of specification. Writing down a formal specification for a new feature takes time and effort, and we’ve only a limited amount of both with which to work. I’m inclined to give preference to those features which are easy to get right in a formal specification (perhaps because a draft specification or formal documentation already exists) compared to those that have many tricky details yet to be worked out.
  • Ease/likelihood of implementation. I think this is often overlooked. There are a wide range of SPARQL implementations out there, and—particularly given the emerging cloud of linked data that can easily be fronted by multiple SPARQL implementations—there are a large number of SPARQL users that regularly write queries against different implementations. The SPARQL Working Group can add features until we’re blue in the face, but if many implementations are unable or choose not to support the new features, then interoperability remains nothing but a pipe dream for users.

One potential compromise, of sorts, is to define a standard extensibility mechanism for SPARQL. SPARQL already has one extensibility point in the form of allowing implementations to support arbitrary filter functions. There are a variety of forms that more sophisticated extensibility points might take. At the most general, Eric Prud’hommeaux mentioned to me the possibility of an EXTENSION keyword that would take an identifying URI, arbitrary arguments, and perhaps even arbitrary syntax within curly braces. Less extreme than that might be a formal service description that allows implementations to explore and converge on non-standard functionality while providing a standard way for users and applications to discover what features a given SPARQL endpoint supports. The first SPARQL Working Group (the DAWG) seems to have been very successful in designing a language that provided ample scope for implementers to try out new extensions. I think if our new Working Group can keep that freedom while also providing some structure to encourage convergence on the syntax and semantics of SPARQL extensions, we’ll be in great shape for the future evolution of SPARQL.
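One way to picture how such an extensibility point might hang together is the following Python sketch. To be clear, this is purely speculative: it is not Open Anzo’s or any real engine’s API, and every name and URI in it is invented. It models two ideas from the paragraph above: an EXTENSION-style hook that dispatches on an identifying URI, and service-description-style discovery of which extensions an endpoint supports.

```python
# Speculative sketch only -- not any real SPARQL engine's API.
extension_registry = {}

def register_extension(uri, handler):
    """Install a handler for the extension identified by `uri`."""
    extension_registry[uri] = handler

def supported_extensions():
    """What a service description might advertise: the supported extension URIs."""
    return set(extension_registry)

def evaluate_extension(uri, args, body):
    """Dispatch an EXTENSION(...) occurrence in a query to its handler, if any."""
    handler = extension_registry.get(uri)
    if handler is None:
        # An endpoint that doesn't know the URI can fail cleanly and predictably.
        raise NotImplementedError(f"extension not supported: {uri}")
    return handler(args, body)

# A toy full-text-search extension; the URI and behavior are invented.
register_extension(
    "http://example.org/ext/fulltext",
    lambda args, body: f"fulltext({body!r}) binding {', '.join(args)}",
)
```

The appeal of dispatching on a URI is that unsupported extensions fail in a well-defined way, and the set of supported URIs is exactly what a service description would need to publish for clients to discover.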

There’s one final topic that I’ve alluded to but also want to explicitly mention: energy. We’ve got a lot of Working Group members with a variety of perspectives and a large number of potential work items around which we need to reach consensus. And then we need to reach consensus on the syntax and semantics of our work items, as well as the specification text used to describe them. We need editors and reviewers and test cases and test harnesses and W3C liaisons and community outreach and comment responders. All of this takes energy. The DAWG nearly ground to a premature halt as the standardization process dragged on year after year. We can’t let that happen this time around, so we need to keep the energy up. An enthusiastic Working Group, frequent contributions from the broader community, occasional face-to-face meetings, and noticeable indications of progress can all help to keep our energy from flagging. And, of course, sticking to our 18-month schedule is as important as anything.

What do you think? I’m eager to hear from anyone with suggestions for how the Working Group can best meet its objectives. Do you disagree with some of my underlying assumptions? How about my criteria for considering features? Do you see any extensibility/evolutionary mechanisms that you think would ease the future growth of SPARQL? Please let me know.

March 2, 2009

Named graphs in Open Anzo

Bob DuCharme, who has recently been exploring a variety of triple stores, has an insightful post up asking questions about the idea of named graphs in RDF stores. Since the Open Anzo repository is based around named graphs (as are all Cambridge Semantics’ products based on Open Anzo such as Anzo for Excel), I thought I’d take a stab at giving our answers to Bob’s questions:

1. If graph membership is implemented by using the fourth part of a quad to name the graph that the triple belongs to, then a triple can only belong directly to one graph, right?

This is correct. In Open Anzo, triples are really quads, in that every subject-predicate-object triple has a fourth component, a URI that designates the named graph of the triple. The named graph with URI u comprises all of the triples (quads) that have u as their fourth component.

Of course, this means that the same triple (subject-predicate-object) can exist in multiple named graphs. In such a case, each such triple is distinct from the others; it can be removed from one named graph independently of its presence in other named graphs.
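To make the quad model concrete, here’s a toy Python sketch (not the Open Anzo API; the class and the example URIs are invented) showing how the same subject-predicate-object triple can live in two named graphs and be removed from one independently of the other:

```python
class QuadStore:
    """Toy quad store: every triple carries a graph URI as its fourth component."""

    def __init__(self):
        self.quads = set()

    def add(self, s, p, o, graph):
        self.quads.add((s, p, o, graph))

    def remove(self, s, p, o, graph):
        # Removing the quad from one graph leaves identical triples
        # in other graphs untouched.
        self.quads.discard((s, p, o, graph))

    def graph(self, graph):
        """All triples whose fourth component is the given graph URI."""
        return {(s, p, o) for (s, p, o, g) in self.quads if g == graph}

store = QuadStore()
triple = ("ex:book1", "ex:author", "ex:alice")
store.add(*triple, "ex:graph1")
store.add(*triple, "ex:graph2")
store.remove(*triple, "ex:graph1")
# The triple is now gone from ex:graph1 but still present in ex:graph2.
```

Because the graph URI is part of the quad’s identity, the two copies of the triple are distinct rows in the store, which is exactly what makes the independent removal work.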

2. I say "belong directly" because I'm thinking that a graph can belong to another graph. If so, how would this be indicated? Is there some specific predicate to indicate that graph x belongs to graph y?

Open Anzo has no concept of nesting graphs or graph hierarchies. The URI of a named graph can be used as the subject or object of a triple just like any other URI, with a meaning specific to whatever predicate is being used. So two graphs can be related by means of ordinary triples, but there is no special support for any such constructs.

3. If we're going to use named graphs to track provenance, then it would make sense to assign each batch of data added to my triplestore to its own graph. Let's say that after a while I have thousands of graphs, and I want to write a SPARQL query whose scope is 432 of those graphs. Do I need 432 "FROM NAMED" clauses in my query? (Let's assume that I plan to query those same 432 multiple times.)

There are a few points here.

  1. For Open Anzo at least, it's up to the application developer how to group triples into named graphs. I don't think we've ever ourselves used the scheme you suggest (everything updated at once gets its own named graph), but you could if you wanted. Instead, named graphs tend to collect triples that represent a reasonably central object in the application's domain of discourse.
  2. Open Anzo does use named graphs for provenance. Named graphs are the basic unit for:
    • Versioning. When one or more triples in a named graph are updated, the entire graph is versioned. Open Anzo tracks the modification time and the user that instigated the change, and also provides an API for getting at previous revisions of a graph. (Graphs can also be explicitly created that do not keep track of revisions. Those still track the last updated on and last updated by bits of provenance.)
    • Access control. Control of who can read, write, remove, or change permissions on RDF data in Open Anzo is attached strictly at the named-graph level. This tends to work nicely with the general modeling approach that lets a named graph represent a conceptual entity.
    • Replication. Client applications can maintain local replicas of data from an Open Anzo server. Replication occurs at the level of a named graph.
  3. It's worth noting that Open Anzo adds a bit of infrastructure for handling this sort of provenance. Each named graph in an Open Anzo repository has an associated metadata graph. The system manages the triples in the metadata graph, which can include access control data, provenance data, version histories, associated ontological elements, and more. This lets all of the provenance information be treated as RDF without conflating it with user/application-created triples.
  4. Regarding the challenge of handling queries that need to span hundreds or thousands of named graphs: as Bob observed, this is a common situation when you base a store around named graphs. The Open Anzo approach to this problem is to introduce the idea of a named dataset. A named dataset is a URI-identified collection of graphs. (Technically, it's two collections of graphs, representing both the default and named graph elements of a SPARQL query.) Glitter, the Open Anzo SPARQL engine, extends SPARQL with a FROM DATASET <u> clause that scopes the query to the graphs contained in the referenced named dataset, u. Currently, named datasets explicitly enumerate their constituent graphs. There's no reason, however, that the same approach could not be used with other methods of identifying the dataset's graph contents, such as URI patterns or a query.
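A rough Python sketch of the named-dataset idea (all names and URIs here are invented; Glitter’s actual machinery is of course more involved): a dataset URI maps to a set of graph URIs, so a query can be scoped with a single identifier rather than hundreds of FROM NAMED clauses.

```python
# Quads: (subject, predicate, object, graph URI); toy data with invented URIs.
quads = {
    ("ex:s1", "ex:p", "ex:o", "ex:g1"),
    ("ex:s2", "ex:p", "ex:o", "ex:g2"),
    ("ex:s3", "ex:p", "ex:o", "ex:g3"),
}

# A named dataset explicitly enumerates its constituent graphs.
datasets = {
    "ex:myDataset": {"ex:g1", "ex:g3"},
}

def triples_in_dataset(dataset_uri):
    """Return every triple found in any graph belonging to the named dataset."""
    graphs = datasets[dataset_uri]
    return {(s, p, o) for (s, p, o, g) in quads if g in graphs}
```

The same lookup could just as easily be driven by a URI pattern or a query instead of an explicit enumeration; only the `datasets` mapping would change, not the query-scoping step.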

All in all, we find the named graph model to be extremely empowering when building applications based on RDF. It gives a certain degree of scaffolding that allows all sorts of engineering and user experience flexibility. At a high level, we approach named graphs in a similar fashion to how we approach ontologies. We find both constructs useful for dealing with large amounts of RDF in practical enterprise environments, for engineering various ways of partitioning and understanding the data throughout the software stack. In the end, the named graph model goes to the heart of a few of RDF's core value propositions: agility and expressivity of the data model and adaptability of software built upon it.