SPARQLing at SemTech

SemTech 2009 has come and gone, and it was great. I was concerned—as were others—that the state of the economy would depress the turnout and enthusiasm for the show, but it seems that any such effects were at least counterbalanced by a growing interest in semantic technologies. Early reports are that attendance was up about 20% from last year, and at sessions, coffee breaks, and the exhibit hall there seemed to always be more people than I expected. Good stuff.

Eric P. and I gave our SPARQL By Example tutorial to a crowd of about 50 people on Monday. From the feedback I’ve received, it seems that people found the session beneficial, and at least a couple of people remarked on the fact that Eric and I seemed to be having fun. If this whole semantic thing doesn’t work out, at least we can fall back on our ad-hoc comedy routines.

Anyways, I wanted to share a couple of links with everyone. I think they work nicely to supplement other SPARQL tutorials in helping teach SPARQL to newcomers and infrequent practitioners.

  1. SPARQL By Example slides. I’ve probably posted this link before, but the slides have now been updated with some new examples and with a series of exercises that help reinforce each piece of SPARQL that the reader encounters. Thanks to Eric P. for putting together all of the exercises and to Leigh Dodds for the excellent space exploration data set.
  2. SPARQL Cheat Sheet slides. This is a short set of about 10 slides intended to be a concise reference for people learning to write SPARQL queries. It includes things like common prefixes, the structure of queries, how to encode SPARQL into an HTTP URL, and more.
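
As a small illustration of the "encode SPARQL into an HTTP URL" item, here is a minimal Python sketch. It assumes the `query` and `default-graph-uri` parameter names from the SPARQL Protocol. The endpoint address `http://example.org/sparql` and the helper name `sparql_get_url` are made up for the example and do not come from the slides.

```python
from typing import Optional
from urllib.parse import urlencode

def sparql_get_url(endpoint: str, query: str,
                   default_graph_uri: Optional[str] = None) -> str:
    """Build a GET URL that carries a SPARQL query as a URL parameter."""
    params = [("query", query)]
    if default_graph_uri is not None:
        params.append(("default-graph-uri", default_graph_uri))
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return endpoint + "?" + urlencode(params)

# Hypothetical endpoint and a simple FOAF query, for illustration only.
QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE { ?person foaf:name ?name } LIMIT 10"""

url = sparql_get_url("http://example.org/sparql", QUERY)
print(url)
```

The resulting URL can be pasted into a browser or handed to any HTTP client. Very long queries may exceed URL length limits, in which case a POST request is the usual alternative.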

Enjoy, and, as always, I’d welcome any feedback, suggestions for improvements, or pointers to how/where you’re able to make use of these materials.

We’ll be releasing the first versions of our Anzo products in July. Between now and then I’m going to try to do some blogging showing various parts of the products. But before I begin that, I’ve been thinking a bunch recently about how to characterize our use of Semantic Web technologies, and I wanted to write a bit on that.

Our software views the world of enterprise data in a pretty straightforward way:

  1. Bring together as much data as possible.
  2. Do stuff with the data.
  3. Allow anyone to consume the data however (& whenever) they want.

This is a very simple take on what we do, but it gets to the heart of why we care about semantics: We love semantics because semantics is the “secret sauce” that makes possible each of these three aspects of what we do.

Here’s how:

Bring together as much data as possible

First of all, in most cases we don’t actually physically copy data around. That sort of warehouse approach is appropriate in some cases, but in general we prefer to leave data where it is and bring it together virtually. Our semantic middleware product, the Anzo Data Collaboration Server, provides unified read, write, and query interfaces to whatever data sources we’re able to connect to. We often refer to the unified view of heterogeneous enterprise data as a semantic fabric, but really it’s linked data for the enterprise.

Semantic Web technologies make this approach feasible. RDF is a data standard that is both expressive enough to represent any type of data that’s connected to the server and also flexible enough to handle new data sources incrementally. URIs provide a foundation for minting identifiers that don’t clash unexpectedly as new data sources are brought into the fold. Named graphs give us a simple abstraction upon which we can engineer practical concerns like security, audit trails, offline access, real-time updates, and caching. And, of course, GRDDL gives us a standard way to weave XML source data into the fabric.

Without Semantic Web technologies we’d need to worry about defining a master relational schema up front, or we’d have to constantly figure out how to structurally relate or merge XML documents. And when we’re talking about data that originates not only in one or two big relational databases but also in hundreds or thousands or hundreds of thousands of Excel spreadsheets, the old ways just don’t cut it at all. Semantic Web technologies, on the other hand, provide the agile data foundation we need to bring data together.

But bringing together as much data as possible is not an end in itself. What’s the point of doing this?

Do stuff with the data

This one’s intentionally vague, because there are lots of things that lots of different people want—and need—to do with data, and Anzo is a platform that accommodates many of those things. In general, though, Semantic Web standards again lay the groundwork for the types of things that we want to do with data:

  • Data access. SPARQL gives us a way to query information from multiple data sources at once.
  • Describing data. RDF Schema and OWL are extremely expressive ways to describe (the structure of) data, particularly compared to alternatives like relational DDL or XML Schema. We can (and do) use data descriptions to do things like build user interfaces, generate pick lists (controlled vocabularies), validate data entry, and more.
  • Transform data. There are all kinds of ways in which we need to derive new data from existing data. We might do this via inference (enabled by RDFS and OWL) or via rules (enabled by SPARQL CONSTRUCT queries, by RIF, or by SWRL) or simply via something like SPARQL/Update.
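
To make the "rules via SPARQL CONSTRUCT" idea concrete, here is a rough Python sketch. The `ex:parent` and `ex:grandparent` vocabulary is invented for the example, and the function only mimics in plain Python what a SPARQL engine would do with the query shown. It is not how Anzo evaluates rules.

```python
# The rule as a SPARQL CONSTRUCT query (illustrative vocabulary only).
RULE = """
PREFIX ex: <http://example.org/>
CONSTRUCT { ?a ex:grandparent ?c }
WHERE     { ?a ex:parent ?b . ?b ex:parent ?c }
"""

EX = "http://example.org/"

def apply_grandparent_rule(triples):
    """Return the triples that the CONSTRUCT rule above would derive."""
    # Collect every (?a, ?b) pair bound by the pattern "?a ex:parent ?b".
    parent_pairs = [(s, o) for (s, p, o) in triples if p == EX + "parent"]
    # Join the two patterns on the shared variable ?b.
    return {
        (a, EX + "grandparent", c)
        for (a, b) in parent_pairs
        for (b2, c) in parent_pairs
        if b == b2
    }

data = {
    (EX + "alice", EX + "parent", EX + "bob"),
    (EX + "bob", EX + "parent", EX + "carol"),
}
derived = apply_grandparent_rule(data)
print(derived)
```

The derived triples can then be added back to the source data, which is the sense in which a CONSTRUCT query can act as a simple rule.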

Without Semantic Web technologies, we’d probably end up using a proprietary approach for querying across data sources. We’d have to hardcode all of our user interface or else invent or adopt a non-standard way of describing our data beyond what a relational schema gives us. And then we might choose a hodgepodge of rules engines, SQL triggers, and application-specific APIs to handle transforming our data. And this might all work just fine, but we’d have to put in all the time, effort, and money to make all the pieces work together.

To me, that’s the beauty of the much-maligned Semantic Web layer cake. The fact that semantic technologies represent a coherent set of standards (i.e. a set of disparate technologies that have been designed to play nice together) means that I can benefit from all of the “glue” work that’s already been done by the standards community. I don’t need to invent ways to handle different identifier schemes across technologies or how to transform from one data model to another and back again: the standards stack has already done that.

Allow anyone to consume the data however (& whenever) they want

Once we’ve put in place the ability to bring data together and do stuff to that data, the remaining task is to get that information in front of anyone who needs it when they need it. We’ve put in a lot of effort to make bringing data into the fabric and acting on that data easy, and it would be a shame if every time someone needs to consume some information they need to put in a request and wait 6 months for IT to build the right queries, views, and forms for them.

To this end, Anzo on the Web takes the increasingly popular faceted-browsing paradigm and puts it in the hands of non-technical users. Anyone can visually choose the data that they need to see in a grid, a scatter plot, a pie chart, a timeline, etc. and the right view is created immediately. Anyone can choose what properties of the data should be available as facets to filter through the data set via whatever attributes he or she wants.

Once again it’s the flexibility of the Semantic Web technology stack that makes this possible for us. RDF makes it trivial for us to create, store, and discover customized lenses with arbitrary properties. RDF also lets us introspect on the data to present visual choices to users when configuring views and adding filters. SPARQL is a great vehicle for building the queries that back faceted browsing.
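
As a rough illustration of how SPARQL can back faceted browsing, the Python sketch below turns a set of facet selections into query text. The property URIs and function names are hypothetical, the values are assumed to be valid SPARQL terms already, and nothing here reflects how Anzo on the Web actually builds its queries.

```python
def facet_patterns(selections):
    """One triple pattern per selected facet; values must already be SPARQL terms."""
    return [f"  ?item <{prop}> {term} ." for prop, term in sorted(selections.items())]

def items_query(selections):
    """Query for the items that match every selected facet value."""
    # With nothing selected, fall back to anything that has at least one property.
    lines = facet_patterns(selections) or ["  ?item ?anyProperty ?anyValue ."]
    return "SELECT DISTINCT ?item WHERE {\n" + "\n".join(lines) + "\n}"

def facet_values_query(facet_prop, selections):
    """Query for the values still available for one facet, given the other selections."""
    lines = facet_patterns(selections) + [f"  ?item <{facet_prop}> ?value ."]
    return "SELECT DISTINCT ?value WHERE {\n" + "\n".join(lines) + "\n}"

# Hypothetical facets: the user has picked city = "Boston".
selected = {"http://example.org/city": '"Boston"'}
print(items_query(selected))
print(facet_values_query("http://example.org/department", selected))
```

A real tool would also need to escape user-supplied values and would probably want per-value counts, which requires aggregate support in the SPARQL engine.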

In summary

It bears repeating that as with most technology standards, the things that we accomplish with Semantic Web standards could be done with other technology choices. But using a coherent set of standards backed by a thriving community of both research and practice means that:

  1. We don’t have to invent all the glue that ties different technologies together
  2. Any new standards that evolve within this stack immediately give our software new capabilities (see #1)
  3. There’s a wide range of 3rd party software that will easily interoperate with Anzo (other RDF stores, OWL reasoners, etc.)
  4. We can focus on enabling solutions, rather than on the core technology bits. All of the above frees us up to do things like build an easy to use faceted browsing tool, build Anzo for Excel to collect and share spreadsheet data, build security and versioning and real-time updates, and much more.

Again, the semantics is really the secret sauce that makes much of what we do possible, but there’s a lot more innovation and engineering that turns that secret sauce into practical solutions. I’ll have some takes on what this looks like in practice in the coming weeks, and we’d love to show you in person if you’ll be in the Boston, MA area or if you’ll be at SemTech in San Jose, CA.

I’m looking forward to this year’s Semantic Technology Conference in San Jose the week of June 14-18. I saw lots of fantastic sessions at last year’s SemTech and met tons of great people, and I imagine that this year will be even better. My colleagues at Cambridge Semantics and I will be giving a few talks, running the gamut from tutorial to technology survey to project report to our vision of how to build practical semantic solutions:

  • SPARQL By Example tutorial. I’ll be giving this half-day tutorial on Monday afternoon. We’ll use actual SPARQL queries that can be run on the (public) Semantic Web today as a means to learning SPARQL from the ground up.
  • Making Sense of Spreadsheets in Merck Basic Research. Jaime Melendez of Merck and I will be giving this talk bright and early on Tuesday morning. We’ll be reporting on the results of a joint innovation project that we completed last year using our Anzo software to address several challenges facing Merck basic research.
  • Enterprise Scalable Semantic Solutions in Five Days. Mike Cataldo will be talking later Tuesday morning about how Anzo makes use of semantic technologies to help our customers build practical, production-ready solutions in a matter of days.
  • Faceted Browsing Tools. Jordi Albornoz will be talking on Tuesday afternoon about the power and simplicity of faceted browsing and semantic lens technologies. He’ll be comparing and contrasting Exhibit, Fresnel, and our own Anzo on the Web.

I know that people have been saying this for a few years now, but I keep seeing the Semantic Web taking significant steps forward both inside of and outside of corporate firewalls. I fully expect this year’s SemTech to reaffirm this point of view. If you’ll be in San Jose, come by some of our talks and see what I mean. We’ll also have a space in the exhibit hall, so you can come and say hi there as well. See you there!

I’m currently on a bit of a whirlwind trip to beautiful Lucerne to present a Semantic Web tutorial at the SIG meeting preceding the PRISM Forum meeting.

For the tutorial, I put together about 150 slides that act as a survey of the current landscape of Semantic Web technologies and tools. It’s aimed to give an audience some motivation for Semantic Web technologies, and to provide a tour through most Semantic Web technologies. It’s not a “how to” tutorial—it’s more of a “here’s what this Semantic Web thing is all about” tutorial.

Anyway, I thought the slides might be interesting to other people and/or helpful to other presenters. Since I cribbed a bunch of material from some other people, it’s only fair that other people be free to do the same with my slides.

I’m always eager for feedback and suggestions to improve the tutorial material.

I’ve worked off and on in the past with Mills Davis and Brand Niemann of the U.S. EPA in looking at ways that Semantic Web technologies can benefit the U.S. federal government. We’ve got another chance to make that case this week.

Currently, the folks behind recovery.gov are hosting a week of open dialogue on IT approaches for exposing data about the U.S. stimulus package in an open and transparent fashion.

Needless to say, there are many calls for XML and Web Services-based approaches. In my opinion, these are fine and are definitely better than not having the data available at all. But I also think this dialogue gives those of us who believe in the transformative power of Semantic Web technologies a chance to speak in their favor.

Mills and I have submitted three ideas to the dialogue. I’d love it if you took a look at them, and if you think they’re good ideas, please indicate your support by voting and leaving a comment. I’d also love to hear from anyone else who is participating in the dialogue!

We’re doing something mildly interesting in the recently re-chartered SPARQL working group. We’re spending our first couple of months defining what our deliverables will be for the rest of our 18-month chartered lifetime. The charter gives us some suggestions on things to consider (update, aggregates, XML serialization for queries, and more) and some constraints to keep in mind (backwards compatibility), but beyond that it’s up to the group.

So we’ve started by gathering potential features. We solicited features—which can be language extensions, syntactic shortcuts, protocol enhancements, result format modifications, integrations with other technologies like XQuery, OWL, or RIF, new query serializations, and more—both from within the Working Group and from the broader community. Within a week or so, we had identified nearly 40 features, and I expect a few more to come in yet.

The problem is: all of these features would be helpful. My take on developer-oriented-technology standards such as SPARQL is that ultimately they serve the users of the users of the implementations. There’s a pyramid here, wherein a small number of SPARQL implementations will support a larger number of developers creating SPARQL-driven software which in turn does useful (sometimes amazing) things for a much larger set of end users. So ideally, we’d focus on the features that benefit the largest swaths of those end users.

But of course that’s tough to calculate. So there’s another way we can look at things: the whole pyramid balances precariously on the shoulders of implementers, and, in fact, the specifications are themselves written to be formal guides to producing interoperable implementations. If implementers can’t understand an extension or willfully choose not to add it to their implementations, then there wasn’t much point in standardizing it in the first place. This suggests that implementer guidance should be a prime factor in choosing what our Working Group should focus on. And that’s far more doable since many of the Working Group participants are themselves SPARQL implementers.

Yet implementers’ priorities are not always tied to what’s most useful for SPARQL users and SPARQL users’ users. (This can be for a wide variety of reasons, not the least of which is that the feedback on what’s important for the implementers’ users’ users often loses something in the multiple layers of communication that end up relaying it to implementers.) So what about that middle category, SPARQL users/developers? These fine folks have the most direct experience with SPARQL’s capabilities, caveats, and inabilities to solve different classes of problems as they apply to solving their users’ business/scientific/social/consumer problems. SPARQL users can and will surely contribute valuable experience along the lines of what extensions might make SPARQL easier to learn, easier to use, more powerful, and more productive when building solutions on the Semantic Web technology stack.

The difficulty here is that it’s often very, very hard for SPARQL developers to be selective in what features they’d like to see added to the landscape. SPARQL is their toolbox, and from their perspective (and understandably so), there’s little downside in stuffing as many tools as possible into SPARQL, just in case.

Things get more complicated. I (very) often joke (and will now write down for the first time) that if you get 10 Semantic Web advocates in a room, you’ll probably have 15 or 20 opinions as to what the Semantic Web is and what it’s for. When we zoom in on just the SPARQL corner of the Semantic Web world, things are no different. Some people are using SPARQL to query large knowledge bases. Some people are using SPARQL to answer ontologically-informed queries. Some people are using SPARQL to query an emerging Web of linked data. Some people are using SPARQL for business intelligence. Some people are using SPARQL in XML pipelines. Some people are using SPARQL as a de facto rules language. Some people are using SPARQL as a federated query language. And much more. No wonder then, that the Working Group might have difficulties reaching consensus on a significantly whittled-down list of features to standardize.

Why not do it all? Or, at least, why not come up with some sort of priority list for all of the features and work our way down that one at a time? It’s tempting, given the high quality of the suggestions, but I’m pretty sure it’s not feasible. Different groups of features interact with each other in different ways, and it’s exactly these interactions that need to be formally written down in a specification. Furthermore, the W3C process requires that as we enter and exit the Candidate Recommendation stage we demonstrate multiple interoperable implementations of our specifications—this becomes extremely challenging to achieve when the language, protocol, etc. are constantly moving targets. Add to that the need to build test cases, gather substantive reviews from inside and outside the Working Group, and (where appropriate) work together with other Working Groups. Now consider that Working Group participants are (for the most part) giving no more than 20% of their time to the Working Group. Believe me, 18 months flies by.

So what do I think is reasonable? I think we’ll have done great work if we produce high quality specifications for maybe three, four, or five new SPARQL features/extensions. That’s it.

(I’m not against prioritizing some others on the chance that my time estimates are way off; that seems prudent to me. And I also recognize that we’ve got some completely orthogonal extensions that can easily be worked on in parallel with one another. So there’s some wiggle room. But I hold a pretty firm conviction that the vast majority of the features that have been suggested are going to end up on the proverbial cutting-room floor.)

Here’s what I (personally) think should go into our decisions of what features to standardize:

  • Implementation experience. It’s easy to get in trouble when a Working Group resorts to design-by-committee; I prefer features that already exist in multiple, independent implementations. (They need not be interoperable already, of course: that’s what standards work is for!)
  • Enabling value. I’m more interested in working on features that enable capabilities that don’t already exist within SPARQL, compared to those features which are largely about making things easier. I’m also interested in working on those extensions that help substantial communities of SPARQL users (and, as above, their users). But in some cases this criterion may be trumped by…
  • Ease of specification. Writing down a formal specification for a new feature takes time and effort, and we’ve only a limited amount of both with which to work. I’m inclined to give preference to those features which are easy to get right in a formal specification (perhaps because a draft specification or formal documentation already exists) compared to those that have many tricky details yet to be worked out.
  • Ease/likelihood of implementation. I think this is often overlooked. There are a wide range of SPARQL implementations out there, and—particularly given the emerging cloud of linked data that can easily be fronted by multiple SPARQL implementations—there are a large number of SPARQL users that regularly write queries against different implementations. The SPARQL Working Group can add features until we’re blue in the face, but if many implementations are unable or choose not to support the new features, then interoperability remains nothing but a pipe dream for users.

One potential compromise, of sorts, is to define a standard extensibility mechanism for SPARQL. SPARQL already has one extensibility point in the form of allowing implementations to support arbitrary filter functions. There are a variety of forms that more sophisticated extensibility points might take. At the most general, Eric Prud’hommeaux mentioned to me the possibility of an EXTENSION keyword that would take an identifying URI, arbitrary arguments, and perhaps even arbitrary syntax within curly braces. Less extreme than that might be a formal service description that allows implementations to explore and converge on non-standard functionality while providing a standard way for users and applications to discover what features a given SPARQL endpoint supports. The first SPARQL Working Group (the DAWG) seems to have been very successful in designing a language that provided ample scope for implementers to try out new extensions. I think if our new Working Group can keep that freedom while also providing some structure to encourage convergence on the syntax and semantics of SPARQL extensions, we’ll be in great shape for the future evolution of SPARQL.

There’s one final topic that I’ve alluded to but also wanted to explicitly mention: energy. We’ve got a lot of Working Group members with a variety of perspectives and a large number of potential work items around which we need to reach consensus. And then we need to reach consensus on the syntax and semantics of our work items, as well as the specification text used to describe them. We need editors and reviewers and test cases and test harnesses and W3C liaisons and community outreach and comment responders. All of this takes energy. The DAWG nearly ground to a premature halt as the standardization process dragged on for year after year. We can’t allow for that to happen this time around, so we need to keep the energy up. An enthusiastic Working Group, frequent contributions from the broader community, occasional face-to-face meetings, and noticeable progress indications can all help to keep our energy from flagging. And, of course, sticking to our 18-month schedule is as important as anything.

What do you think? I’m eager to hear from anyone with suggestions for how the Working Group can best meet its objectives. Do you disagree with some of my underlying assumptions? How about my criteria for considering features? Do you see any extensibility/evolutionary mechanisms that you think would ease the future growth of SPARQL? Please let me know.

Named graphs in Open Anzo

Bob DuCharme, who has recently been exploring a variety of triple stores, has an insightful post up asking questions about the idea of named graphs in RDF stores. Since the Open Anzo repository is based around named graphs (as are all Cambridge Semantics’ products based on Open Anzo such as Anzo for Excel), I thought I’d take a stab at giving our answers to Bob’s questions:

1. If graph membership is implemented by using the fourth part of a quad to name the graph that the triple belongs to, then a triple can only belong directly to one graph, right?

This is correct. In Open Anzo, triples are really quads, in that every subject-predicate-object triple has a fourth component, a URI that designates the named graph of the triple. The named graph with URI u comprises all of the triples (quads) that have u as their fourth component.

Of course, this means that the same triple (subject-predicate-object) can exist in multiple named graphs. In such a case, each such triple is distinct from the others; it can be removed from one named graph independently of its presence in other named graphs.
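
The quad model described above can be sketched in a few lines of Python. This is only an illustration of the idea, with made-up URIs and an in-memory set standing in for the store. It is not Open Anzo's storage layer or API.

```python
# Each element of the store is a quad: (subject, predicate, object, graph).
store = set()

def add(s, p, o, g):
    store.add((s, p, o, g))

def remove(s, p, o, g):
    # Removes the statement from graph g only; other graphs are untouched.
    store.discard((s, p, o, g))

def graph(g):
    """The named graph g: every triple whose fourth component is g."""
    return {(s, p, o) for (s, p, o, gg) in store if gg == g}

def graphs_containing(s, p, o):
    """All named graphs in which the triple (s, p, o) currently appears."""
    return {g for (s2, p2, o2, g) in store if (s2, p2, o2) == (s, p, o)}

# Hypothetical URIs, shortened for readability.
triple = ("ex:lee", "foaf:knows", "ex:bob")
add(*triple, "ex:graph1")
add(*triple, "ex:graph2")
remove(*triple, "ex:graph1")
print(graphs_containing(*triple))
```

After the removal, the triple is gone from the first graph but still present in the second, which is the independence described in the answer above.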

2. I say "belong directly" because I'm thinking that a graph can belong to another graph. If so, how would this be indicated? Is there some specific predicate to indicate that graph x belongs to graph y?

Open Anzo has no concept of nesting graphs or graph hierarchies. The URI of a named graph can be used as the subject or object of a triple just like any other URI, with a meaning specific to whatever predicate is being used. So two graphs can be related by means of ordinary triples, but there is no special support for any such constructs.

3. If we're going to use named graphs to track provenance, then it would make sense to assign each batch of data added to my triplestore to its own graph. Let's say that after a while I have thousands of graphs, and I want to write a SPARQL query whose scope is 432 of those graphs. Do I need 432 "FROM NAMED" clauses in my query? (Let's assume that I plan to query those same 432 multiple times.)

There are a couple of points here.

  1. First, for Open Anzo at least, it's up to the application developer how to group triples into named graphs. I don't think we've ever ourselves used the scheme you suggest (everything updated at once is a named graph), but you could if you wanted. Instead, named graphs tend to collect triples that represent a reasonably core object in the application's domain of discourse.
  2. Open Anzo does use named graphs for provenance. Named graphs are the basic unit for:
    • Versioning. When one or more triples in a named graph are updated, the entire graph is versioned. Open Anzo tracks the modification time and the user that instigated the change, and also provides an API for getting at previous revisions of a graph. (Graphs can also be explicitly created that do not keep track of revisions. Those still track the last updated on and last updated by bits of provenance.)
    • Access control. Control of who can read, write, remove, or change permissions on RDF data in Open Anzo is attached strictly at the named-graph level. This tends to work nicely with the general modeling approach that lets a named graph represent a conceptual entity.
    • Replication. Client applications can maintain local replicas of data from an Open Anzo server. Replication occurs at the level of a named graph.
  3. Second, it's worth noting that Open Anzo adds a bit of infrastructure for handling this sort of provenance. Each named graph in an Open Anzo repository has an associated metadata graph. The system manages the triples in the metadata graph, which can include access control data, provenance data, version histories, associated ontological elements, and more. This lets all of the provenance information be treated as RDF without conflating it with user/application-created triples.
  4. Third, regarding the challenge of handling queries that need to span hundreds or thousands of named graphs: As Bob observed, this is a common situation when you are basing a store around named graphs. The Open Anzo approach to this problem is to introduce the idea of a named dataset. A named dataset is a URI-identified collection of graphs. (Technically, it's two collections of graphs, representing both the default and named graph elements of a SPARQL query.) Glitter, the Open Anzo SPARQL engine, extends SPARQL with a FROM DATASET <u> clause that scopes the query to the graphs contained in the referenced named dataset, u. Currently, named datasets explicitly enumerate their constituent graphs. There's no reason, however, that the same approach could not be used along with other methods of identifying the dataset's graph contents, such as URI patterns or a query.
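
To show what a named dataset buys you, here is a hedged Python sketch that rewrites a `FROM DATASET <u>` clause into the equivalent standard `FROM` and `FROM NAMED` clauses. The registry contents, URIs, and function name are invented for the example. Glitter may well resolve named datasets quite differently internally, so treat this as a picture of the semantics rather than of the implementation.

```python
import re

# Hypothetical registry: dataset URI -> its default graphs and named graphs.
NAMED_DATASETS = {
    "http://example.org/datasets/q1-batches": {
        "default": ["http://example.org/graphs/summary"],
        "named": [
            "http://example.org/graphs/batch-001",
            "http://example.org/graphs/batch-002",
        ],
    }
}

FROM_DATASET = re.compile(r"FROM\s+DATASET\s+<([^>]+)>", re.IGNORECASE)

def expand_from_dataset(query, datasets=NAMED_DATASETS):
    """Replace each FROM DATASET <u> with standard FROM / FROM NAMED clauses."""
    def replace(match):
        dataset = datasets[match.group(1)]  # raises KeyError for unknown datasets
        clauses = [f"FROM <{g}>" for g in dataset["default"]]
        clauses += [f"FROM NAMED <{g}>" for g in dataset["named"]]
        return "\n".join(clauses)
    return FROM_DATASET.sub(replace, query)

query = (
    "SELECT ?g ?s\n"
    "FROM DATASET <http://example.org/datasets/q1-batches>\n"
    "WHERE { GRAPH ?g { ?s ?p ?o } }"
)
print(expand_from_dataset(query))
```

With 432 graphs in the dataset, the query author still writes a single clause, and the list of graphs is maintained in one place.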

All in all, we find the named graph model to be extremely empowering when building applications based on RDF. It gives a certain degree of scaffolding that allows all sorts of engineering and user experience flexibility. At a high level, we approach named graphs in a similar fashion to how we approach ontologies. We find both constructs useful for dealing with large amounts of RDF in practical enterprise environments, for engineering various ways of partitioning and understanding the data throughout the software stack. In the end, the named graph model goes to the heart of a few of RDF's core value propositions: agility and expressivity of the data model and adaptability of software built upon it.

We had overwhelming interest, and consequently plenty of questions, during our first SPARQL By Example Webcast (recorded archive available) back in December. We ended up going through some basic SPARQL queries against FOAF data and DBpedia data, leading up to OPTIONAL queries against Jamendo data at DBTune.org. This Thursday, Semantic Universe and I will be presenting a second part of this tutorial. We’ll look at other elements of SPARQL queries, including UNIONs, datasets, CONSTRUCT queries, ASK queries, DESCRIBE queries, negation, and several common extensions to SPARQL such as aggregates and free-text search. At least, covering all of that is the goal!

If you’re interested, you need to register in advance and then attend the Webcast at 1pm EST / 10am PST this Thursday, January 22. Hope to “see” many of you there.
