March 5, 2012

Enterprise Semantics Blog

We (Cambridge Semantics) have recently launched a new blog, Enterprise Semantics. The blog covers a mix of technical and business topics related to the use of semantic technologies inside large enterprises. I'm writing some posts on that blog, and I'll be continuing to put posts here as well. You can sign up to follow the blog in an RSS reader via its feed, or you can receive emails when there are new posts by subscribing on the blog itself, or you can just follow us @CamSemantics. (The feed is not currently syndicated by Planet RDF, so if you read my blog via Planet RDF and are interested in enterprise semantics, you should probably still sign up separately.)

Here's just a taste of some of the content we've published in the first two months of the blog:

What Happened to NoSQL for the Enterprise?

So what it comes down to is that for decades we’ve had one standard way to store and query important data, and today there are new choices.  As with any choice, there are tradeoffs, and for some applications NoSQL databases, including Semantic Web databases, can enable organizations to get more done in less time and with less hardware than relational databases.  The trick is to know when and how to deploy these new tools.

Big Data... or Right Data?

What matters most, Big Data or Right Data? One look at all the IT headlines these days would suggest that Big Data is the most important data issue today. After all, with lots of computing power and better database storage techniques it is now practical to analyze petabytes of data. However, is that really the most compelling need that end users have? I don’t think so. Instead, I would claim that the issue most end users have is getting together the right data to help them do their jobs better, not analyzing billions of individual transactions.

What the Semantic Web and Digital Cameras have in Common

Analog photography went through many phases of dramatic improvement, becoming a mass-market technology. But no matter how far it went, it was limited in its flexibility. Every picture was pretty much as you took it. Any modification required real experts with specialist equipment working in a darkroom. With the advent of digital photography we have achieved extreme flexibility. The picture you take is simply the starting point for the picture you want, and end users themselves can make the changes with easy-to-use tools.

Semantic Web technology represents the same dramatic shift from the traditional technologies.

Why Semantic Web Software Must Be Easy(er) to Use

In short, if Semantic Web software is hard to use, then many of the benefits of using these technologies in the first place are immediately lost. If, on the other hand, Semantic Web software is easy to use, then the benefits of Semantic Web technologies' flexibility are brought directly to the end user: the business user. The business manager can bring together new data sets for analysis today, rather than a week from now. An analyst can set up triggers and alerts to monitor key business indicators today, rather than waiting three months. A senior scientist can begin looking for correlations within ad hoc sets of data today, rather than next year.

It's All About the Data Model

There is a new data model called RDF—the data model of the Semantic Web—which combines the best of both worlds: the flexibility of a spreadsheet and the manageability and data integrity of a relational database. Based on standards set by the World Wide Web Consortium (W3C) to enable data combination on the Web, RDF defines each data cell by the entity it applies to (row) and the attribute it represents (column). Each cell is self-describing and not locked into a grid, in other words the data doesn't have to be "regular". Further, it has formal operations that can be performed on it, much like relational algebra, but clearly at a more atomic level.
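A tiny sketch of what this looks like in practice, in Turtle syntax (the names here are hypothetical examples, not from any real data set): each statement is a self-describing (entity, attribute, value) cell, and the two entities don't have to share the same set of attributes.

```turtle
@prefix ex: <http://example.org/> .

ex:alice ex:name  "Alice" ;
         ex:dept  "Research" ;
         ex:phone "x1234" .      # Alice has a phone extension on file

ex:bob   ex:name  "Bob" ;
         ex:dept  "Sales" ;
         ex:twitter "@bob" .     # Bob has a different extra attribute
```

No table schema had to anticipate either entity's shape; the data simply isn't required to be "regular".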

December 9, 2011

Linked Enterprise Data Patterns Workshop

I spent Tuesday and Wednesday this week at the W3C Linked Enterprise Data Patterns workshop at MIT (#LEDP). After all, we do linked data and we work with large enterprise customers, so it seemed like a natural fit. The workshop was an interesting two days hearing folks share their experiences using linked data (and sometimes not using linked data) in enterprise situations (and sometimes not in enterprise situations). The main consensus that emerged from the workshop was a desire for a set of profiles of conformance criteria for what constitutes interoperable linked data implementations. I'm personally pretty certain though that the consensus ends there: people continue to have very different views of what pieces of the Semantic Web technology stack (or related technologies like REST and Atom) are most important for a linked data deployment. Eric Prud'hommeaux tried to classify the linked data camps into those doing data integration and storage and query and those doing HTTPy resource linking, but I'm guessing the distinctions are even more nuanced than that.

Anyways, on Wednesday I gave a talk on the patterns we use to segment data within Anzo, as well as some of our other usages of Semantic Web technologies and where we see gaps in the standards world (frankly, more in adoption than in specification). I recorded a screencast of the talk—it's not the most polished, but if you weren't able to attend the workshop you might be interested in the talk. I've also posted the slides themselves online. Here's the video:


There were a couple of discussions in the middle of the talk that I had to cut out because they involved too much cross-talk taking place far away from the mic and were hard to understand. One was a discussion around the way that we (by default) break data into graphs and how it privileges RDF subjects over objects, and whether that affects access control decisions (our experience: no). Another discussion, around the 9-minute mark, was about the use of the same URI to identify a graph and the subject of data within that graph. A third discussion concerned ongoing efforts to extend VoID to provide additional descriptions of linked data endpoints.

September 27, 2011

Saving Months, Not Milliseconds: Do More Faster with the Semantic Web

When I suggested that we're often asking the wrong question about why we should use Semantic Web technologies, I promised that I'd write more about what it is about these technologies that lowers the barrier to entry enough to let us do (lots of) things that we otherwise wouldn't. In the meantime, some other people have done a great job of anticipating and echoing my own thoughts on the topic, so I'm going to summarize them here.

The bottom line is this: The Semantic Web lets you do things fast. And because you can do things fast, you can do lots more things than you could before. You can afford to do things that fail (fail fast); you can afford to do things that are unproven and speculative (exploratory analysis); you can afford to do things that are only relevant this week or today (on-demand or situational applications); and you can afford to do things that change rapidly. Of course, you can also do things that you would have done with other technology stacks, only you can have them up and running (& ready to be improved, refined, extended, and leveraged) in a fraction of the time that you otherwise would have spent.

The word "fast" can be a bit deceptive when talking about technology. We can all be a bit obsessed with what I call stopwatch time. Stopwatch time is speed measured in seconds (or less). It's raw performance: How much quicker does my laptop boot up with an SSD? How long does it take to load 100 million records into a database? How many queries per second does your SPARQL implementation do on the Berlin benchmark with and without a recent round of optimizations?

We always talk about stopwatch time. Stopwatch time is impressive. Stopwatch time is sexy. But stopwatch time is often far less important than calendar time.

Calendar time is measured in hours and days or in weeks and months and years. Calendar time is the actual time it takes to get an answer to a question. Not just the time it takes to push the "Go" button and let some software application do a calculation, but all of the time necessary to get to an answer: to install, configure, design, deploy, test, and use an application.

Calendar time is what matters. If my relational database application renders a sales forecast report in 500 milliseconds while my Semantic Web application takes 5 seconds, you might hear people say that the relational approach is 10 times faster than the Semantic Web approach. But if it took six months to design and build the relational solution versus two weeks for the Semantic Web solution, Semantic Sam will be adjusting his supply chain and improving his efficiencies long before Relational Randy has even seen his first report. The Semantic Web lets you do things fast, in calendar time.

Why is this? Ultimately, it's because of the inherent flexibility of the Semantic Web data model (RDF). This flexibility has been described in many different ways. RDF relies on an adaptive, resilient schema (from Mike Bergman); it enables cooperation without coordination (from David Wood via Kendall Clark); it can be incrementally evolved; changes to one part of a system don't require re-designs to the rest of the system. These are all dimensions of the same core flexibility of Semantic Web technologies, and it is this flexibility that lets you do things fast with the Semantic Web.

(There is a bit of nuance here: if stopwatch performance is below a minimum threshold of acceptability, then no one will use a solution in the first place. Semantic Web technologies have had a bit of a reputation for this in the past, but that hasn't been true for a long time now. I'll write more about that in a future post.)

September 12, 2011

Why Semantic Web Technologies: Common, Coherent, Standard

To paraphrase both Ecclesiastes and Michael Stonebraker & Joseph Hellerstein, there is nothing new under the sun.

It's as true with Semantic Web technologies as with anything else—tuples are straightforward, ontologies build on schema languages and description logics that have been around for ages, URIs have been baked into the Web for twenty years, etc. But while the technologies are not new, the circumstances are. In particular, the W3C set of Semantic Web technologies are particularly valuable for having been brought together as a common, coherent, set of standards.

  • Common. Semantic Web technologies are broadly applicable to many, many different use cases. People use them to publish pricing data online, to uncover market opportunities, to integrate data in the bowels of corporate IT, to open government data, to promote structured scientific discourse, to build open social networks, to reform supply chain inefficiencies, to search employee skill sets, and to accomplish about ten thousand other tasks. This makes a one-size-fits-all elevator pitch challenging, but it also means that there's a large audience of practitioners that are benefitting from these technologies and so are coming together to create standards, build tool sets, and implement solutions. These are not niche technologies with limited resources for ongoing development or at risk of being hijacked for a purpose at odds with your own.
  • Coherent. Semantic Web technologies are designed to work together. The infamous layer cake diagram may have many shortcomings, but it does demonstrate that these technologies fit together like jigsaw puzzle pieces. This means that I can build an application using the RDF data model, and then incrementally bring new functionality online by adopting other Semantic Web technologies. Without a coherent set of technologies, I'd have to either roll my own solutions for new functionality (expensive, error-prone) or try to overcome impedance mismatches in connecting together multiple unrelated technologies (expensive, error-prone).
  • Standard. Semantic Web technologies are developed in collaborative working groups under the auspices of the World Wide Web Consortium (W3C). The specifications are free (both as in beer and as in not constrained by intellectual property) and are backed by test suites and implementation reports that go a long way to encouraging interoperable tools.

The technologies are not novel and are not perfect. But they are common, coherent, and standard, and that sets them apart from much of what's come before and from many of the other options currently out there.

August 29, 2011

The Magic Crank

As a brief addendum to my previous post: I've been using this image for a few years now to illustrate what the Semantic Web is not. I call it the magic crank. I imagine that it sits in the corner of the office of some senior pharma executive, and every time their drug development pipeline gets a bit thin or patent protection for the big blockbuster drugs wears off, the executive pulls it out. She dusts off the crank and plugs in the latest databases full of data on genomics, protein interactions, efficacy and safety studies, etc. A few turns of the magic crank later, and she's rewarded with a little card that tells her exactly what drug to invest in next.

To me, the magic crank is the unrealized holy grail of the Semantic Web in the pharma industry. And it's an extremely powerful and valuable goal. But it's a bit dangerous as well: every time someone new to the Semantic Web learns that the magic crank is what the Semantic Web is all about, they end up trying to tackle large, unsolved problems. They end up asking "What can I do with Semantic Web technologies that I can't do otherwise?". Once you've latched onto the potential of the magic crank, it's very hard to ratchet your questions back down to the less-impressive-but-practical-and-still-very-valuable, "What can I do with Semantic Web technologies that I wouldn't do otherwise?".


Credit for the image goes to Trey Ideker of UCSD. I first saw the image in a presentation by Enoch Huang at CSHALS a few years ago.

August 22, 2011

Why Semantic Web Technologies: Are We Asking the Wrong Question?

I haven't written much lately. I've been busy building things. And while I've been building things, I've been learning things. I'd like to start writing and start sharing some of the things I've been learning.

I'd say that at least once a week, when talking to prospective customers, I get asked the following:

What can I do with Semantic Web technologies that I can't do otherwise?

It's a question that's asked in good faith: enterprise software buyers have heard tales of rapid data integration, automated data inference, business-rules engines, etc. time and time again. By now, any corporate IT department likely owns several software packages that purport to accomplish the same things that Semantic Web vendors are selling them. And so a potential buyer learns about Semantic Web technologies and searches for what's new:

What can I do with Semantic Web technologies that I can't do otherwise?

The real answer to this question is distressingly simple: not much. IT staff around the world are constantly doing data integration, data inference, data classification, data visualization, etc. using the traditional tools of the trade: Java, RDBMSes, XML…

But the real answer to the question misses the fact that this is the wrong question. We ought instead to ask:

What can I do with Semantic Web technologies that I wouldn't do otherwise?

Enterprise projects are proposed all the time, and all eventually reach a go/no-go decision point. Businesses regularly consider and reject valuable projects not because the projects require revolutionary new magic, but because they're simply too expensive for the benefit, or because they'd take too long to address the situation at hand. You don't need brand-new technology to make dramatic changes to your business.

The point of Semantic Web technology is not that it's revolutionary; it's not cold fusion, interstellar flight, or quantum computing. It's an evolutionary advantage: you could do these projects with traditional technologies, but they're just hard enough to be impractical, so IT shops don't. That's what's changing here. Once the technologies and tools are good enough to turn "no-go" into "go", you can start pulling together the data in your department's 3 key databases; you can start automating data exchange between your group and a key supply-chain partner; you can start letting your line-of-business managers define their own visualizations, reports, and alerts that change on a daily basis. And when you solve enough of these sorts of problems, you derive value that can fundamentally affect the way your company does business.

I'll write more in the future about what changes with Semantic Web technologies to let us cross this threshold. But for now, when you're looking for the next "killer application" for Semantic Web in the enterprise, you don't need to look for the impossible, just the not (previously) practical.

June 27, 2011

Anzo Connect: Semantic Web ETL in 5 Minutes

At last week's SemTech conference, my colleague Ben Szekely kicked off the business track of the lightning talks by debuting our new product, Anzo Connect. Ben showed how Anzo Connect can be used in just a few minutes (4.5 to be precise) to pull data from a relational database, map it to an ontology, integrate the data into an existing RDF store, and visualize the results in a Web-based dashboard.

We took this video of the lightning demo from the audience, but it gives some idea of what Anzo Connect is all about. If you're interested in learning more about Anzo Connect or any of our other software, please drop me a note.

(5 Minute ETL with Anzo Connect)

May 31, 2011

Evolution Towards Web 3.0: The Semantic Web

On April 21, 2011, I had the pleasure of speaking to Professor Stuart Madnick's "Evolution Towards Web 3.0" class at the MIT Sloan School of Management. The topic of the lecture was—unsurprisingly—the Semantic Web. I had a great time putting together the material and discussing it with the students, who seemed to be very engaged in the topic. It was a less technical audience than I often speak with, and so I tried to focus on some of the motivating trends, use cases, and challenges involved with Semantic Web technologies and the vision of the Semantic Web.

I've now placed the presentation online. It's broken down into three basic parts:

  • What about the development of the Web and enterprise IT motivates the Semantic Web?
  • How is it being used today?
  • What are some of the challenges facing the Semantic Web, both on the World Wide Web and within enterprises?

I found the last of the three sections particularly interesting, and I hope you do too.

The presentation includes speaker's notes that add significant commentary to the slides. You can view them by clicking on the "Speaker Notes" tab below the slides. Please let me know what you think: Evolution Towards Web 3.0: The Semantic Web.

March 24, 2011

Describing the Structure of RDF Terms

I'm wondering if there are existing vocabularies and best practices that deal with the following use case:

How do I write down metadata about the return type of a SPARQL function that returns a URI?

Since "returns a URI" can be a bit ambiguous in the face of things like xsd:anyURI typed literals, we can be a bit more precise:

How do I write down metadata about the return type of a SPARQL function that returns a term for which the isURI function returns true?

Functions like this have all sorts of uses. We use them all the time in conjunction with CONSTRUCT queries and the SPARQL 1.1 BIND clause to generate URIs for new resources.
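As a quick sketch of that usage (the fn:GenerateURI function is the hypothetical one from this post, and the ex: names are made up for illustration), BIND computes a fresh URI for each matched resource and the CONSTRUCT template emits triples about it:

```sparql
PREFIX ex: <http://example.org/>
PREFIX fn: <http://example.org/functions#>

CONSTRUCT { ?order ex:invoice ?inv }
WHERE {
  ?order a ex:Order .
  # mint a new URI for each order's invoice
  BIND (fn:GenerateURI(?order) AS ?inv)
}
```

Describing what ?inv is (a URI, not a literal) is exactly the metadata problem discussed below.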

So, when describing this function, how do I write down the return type of one of these URI-generating functions? I want to write something like:

fn:GenerateURI fn:returns ??

If I had a function that returned an integer, I'd expect to be able to write something like:

fn:Floor fn:returns xsd:integer

But in that case, I'm taking advantage of the fact that datatyped literals denote themselves. (Thanks to Andy Seaborne for pointing this out to me.) I can't say this:

fn:GenerateURI fn:returns xsd:anyURI

This seems to say that my function returns something that denotes a URI. (One such thing that denotes a URI is an xsd:anyURI literal.) But, again, that's not what I want to say here. I want to say that my function returns something that is syntactically a URI; that is, it returns something that is named by a URI. I considered something like:

fn:GenerateURI fn:returns rdfs:Resource

But rdfs:Resource is a class of everything, and as far as I can tell would mean that my function could return a URI, a literal, or a blank node.

So any suggestions for how to approach this sort of modeling of the return type (and parameter types) for SPARQL functions?

January 17, 2011

Cambridge Semantics is Hiring

At Cambridge Semantics, we're excited to be bringing a few new people onto our team. We're looking to hire:

  • A Web Engineer. If you're an expert in serious JavaScript, HTML, and CSS development, this is a great position for you. You'll be working to further Anzo on the Web, our Web-based self-service reporting and data collection tool that uses semantic technologies to put flexible, data-driven visualizations and analytics in the hands of non-technical business users.
  • A Customer Implementation Engineer. We're looking for a sharp, creative problem solver to join our professional services team and help our customers use Anzo for Excel, Anzo on the Web, and the rest of our Anzo semantic technologies. You'll work directly with our customers to solve a wide variety of business problems and also work closely with our entire Cambridge Semantics team, from engineering to sales to marketing.
  • A Quality Assurance Engineer. If you're experienced in designing and executing software test plans and are looking for an exciting opportunity to apply your talents to cutting-edge enterprise semantic software, then check out this position. You'll be working to design, execute, and automate test cases for all of our current and new Anzo products to help make the software as good as it can possibly be.

If you're interested in applying for any of these positions, please send your resume to If you know anyone who might be interested, please send them our way!

November 24, 2010


Bob DuCharme suggested that I share this explanation about the role of FROM, FROM NAMED, and GRAPH within a SPARQL query. So here it is…

A SPARQL query goes against an RDF dataset. An RDF dataset has two parts:

  • A single default graph -- a set of triples with no name attached to them
  • Zero or more named graphs -- each named graph is a pair of a name and a set of triples

The FROM and FROM NAMED clauses are used to specify the RDF dataset.

The statement "FROM u" instructs the SPARQL processor to take the graph that it knows as "u", take all the triples from it, and add them to the single default graph. If you then also have "FROM v", then you take the triples from the graph known as v and also add them to the default graph.

The statement "FROM NAMED x" instructs the SPARQL processor to take the graph that it knows as "x", take all the triples from it, pair it up with the name "x", and add that pair (x, triples from x) as a named graph in the RDF dataset.
Note that "known as" is purposefully not specified -- some implementations dereference the URI to get the triples that make up that graph; others just use a graph store that maps names to triples.

All the parts of the query that are outside a GRAPH clause are matched against the single default graph.

All the parts of the query that are inside a GRAPH clause are matched individually against the named graphs.
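Putting those rules together, a sketch of a query that exercises both parts of the dataset (the graph names and data here are hypothetical):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label ?g
FROM <http://example.org/graphA>        # triples merged into the default graph
FROM NAMED <http://example.org/graphB>  # paired up as a named graph
WHERE {
  ?s rdfs:label ?label .        # matched against the default graph
  GRAPH ?g { ?s ?p ?o }         # matched against each named graph in turn
}
```

Here ?g can only be bound to graphB's name, since that's the only named graph in this dataset.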

This is why it sometimes makes sense to specify the same graph for both FROM and FROM NAMED. A query that lists the same graph in both clauses (sketched here with a placeholder name x):

SELECT ...
FROM <x>
FROM NAMED <x>
WHERE { ... }

...puts the triples from x in the default graph and also includes x as a named graph, so that later in the query, triple patterns outside of a GRAPH clause can match parts of x and so can triple patterns inside a GRAPH clause.

There's a visual picture of this on slide 13 of my SPARQL Cheat Sheet slides.

July 7, 2010

Could SemTech Run On Excel? (SemTech Lightning Demo)

At SemTech a couple of weeks ago, I participated in the jam-packed lightning talk session, 90 minutes packed with 5-minute talks and moderated with great aplomb by Paul Miller. While most of the speakers presented pithy, informative, and witty slide decks, I opted to go a different route: I've long believed that some of the biggest value in Semantic Web technologies lies in their ability to dramatically change the timescales involved in traditional IT projects—to this end, I used my 5 minute slot to give a live demo of using our Anzo software suite to build a solution for running a conference such as SemTech using just Excel and a Web browser.

When I got back to Boston, I made a recording of the same lightning demo for posterity. Please enjoy it here and drop me a note if you have any questions or would like to learn more.

(Best viewed in full screen, 720p.)

July 1, 2010

Early SPARQL Reviews

The SPARQL Working Group is still working on all of our specifications. None are yet at Last Call, though we feel our designs are quite stable, and we're hoping to reach Last Call within a few months. Standard W3C process encourages interested community members to review Working Drafts as they're produced, but especially encourages reviews of Last Call drafts.

While we will of course do this (solicit as widespread a review of our Last Call drafts as possible), I'd like to put out a call for reviews of our current set of Working Drafts. If you can only do one review, you're probably best off waiting for Last Call; but if you have the inclination and time, it would be great to receive reviews of our current set of Working Drafts at our comments list. The Working Group has committed to responding formally to all comments received from here on out.

Here is our current set of documents, along with a few explicit areas/issues that the Working Group and editors would love to receive feedback about (of course, all reviews & all feedback is welcome):

SPARQL 1.1 Query

  • Feedback on MINUS and NOT EXISTS, the two new negation constructs in SPARQL 1.1 (section 8)
  • Feedback on the new functions in SPARQL 1.1 (15.4.14 through 15.4.21)
  • Feedback on the aggregates ("set functions") included in SPARQL 1.1 (section 10.2.1)
  • Feedback on property paths (currently in its own document)
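For reviewers wanting a quick orientation on the negation constructs, the two forms look roughly like this (the data and ex: prefix are hypothetical, and the two queries are shown side by side as a sketch); note that they can differ subtly, particularly when the inner pattern shares no variables with the outer one:

```sparql
PREFIX ex: <http://example.org/>

# people with no recorded email address, written both ways

SELECT ?person WHERE {
  ?person a ex:Person .
  FILTER NOT EXISTS { ?person ex:email ?email }
}

SELECT ?person WHERE {
  ?person a ex:Person .
  MINUS { ?person ex:email ?email }
}
```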

SPARQL 1.1 Update

  • Handling of RDF datasets in SPARQL Update (particularly the WITH, USING, and USING NAMED clauses)

SPARQL 1.1 Service Description

  • Discovery mechanism for service descriptions (section 2)
  • Modeling of graphs and RDF datasets (3.2.7 through 3.2.10 and 3.4.11 through 3.4.17)
  • Service description as related to entailment (3.2.5, 3.2.6 and 3.4.3 through 3.4.5)

SPARQL 1.1 Entailment Regimes

  • The mechanisms for restricting solutions in all regimes
  • Are the OWL Direct Semantics too general? E.g., they allow variables in complex class expressions

SPARQL 1.1 Federation Extensions

  • Should support for SERVICE be mandatory in SPARQL 1.1 Query implementations?
  • Should support for BINDINGS be mandatory in SPARQL 1.1 Query implementations?

SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs

  • Interpretation/translation of HTTP verbs into SPARQL Update statements
  • Handling of indirect graph identification (section 4.2 et al.)

September 8, 2009

Does anyone use SPARQL over SOAP?

The SPARQL Working Group would like to know if anyone uses SPARQL over SOAP. Please leave a comment if you do. (We know that several implementations support a SOAP implementation of the SPARQL protocol, but we don’t have much evidence that this part of such implementations is ever used.)


July 7, 2009


I promised Danny that I’d write this up, so here’s to making good on promises.

Open Anzo is a quad store. (I’ve written about this before.) All of the services Open Anzo offers—versioning, replication, real-time updates, access control, etc.—are oriented around named graphs. Time and time again we’ve found named graphs to be invaluable in building applications atop an RDF repository.

And while SPARQL took the first steps towards standardizing quads via the named graphs component of the RDF dataset, the CONSTRUCT query result form only returned triples.

For our purposes in Open Anzo, this severely limits the usefulness of CONSTRUCT. We can’t use it to pull out a subset of the server’s data, as any data returned has been stripped of its named graph component. The solution was pretty simple, and is a good example of practicing what I’ve been preaching recently: a key part of the standards process is for implementations to extend the standards.

In this case, we simply extended the CONSTRUCT templates of Glitter (Open Anzo's SPARQL engine) to support a GRAPH clause, in exactly the same way that SPARQL query patterns support GRAPH clauses. This means that any triple pattern within a CONSTRUCT template will now output either a triple (if it's outside any GRAPH clause) or a quad (if it's inside a GRAPH clause).

Key to making this happen is the fact that both the Open Anzo server and the three client APIs (Java, JavaScript, and .NET) support serializing and deserializing quads to/from the TriG RDF serialization format. TriG’s a very straightforward extension of Turtle, and I’d like to see it used more and more throughout Semantic Web circles.
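For readers who haven't seen it, TriG simply wraps blocks of Turtle triples in named-graph braces. A minimal sketch (names hypothetical):

```trig
@prefix ex: <http://example.org/> .

ex:graph1 {
  ex:alice ex:knows ex:bob .
}

ex:graph2 {
  ex:bob ex:knows ex:carol .
}
```

Each brace-delimited block is a named graph, so a single document can carry quads rather than bare triples.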

Anyway, here are a few simple examples of CONSTRUCTing quads in practice:

# fix up typo'ed predicates
CONSTRUCT {
  GRAPH ?g {
    ?s rdf:type ?o
  }
}
WHERE {
  GRAPH ?g {
    ?s rdf:typo ?o
  }
}

# copy triples into a new graph
CONSTRUCT {
  GRAPH ex:newGraph {
    ?s ?p ?o
  }
}
WHERE {
  ?s ?p ?o
}

# more complicated -- place constructed triples in
# a new “inferred” graph and indicate this fact in
# an Open Anzo metadata graph associated with the
# source graph
CONSTRUCT {
  GRAPH ex:inferredGraph {
    ?p ex:uncle ?uncle
  }
  GRAPH ?mdg {
    ?mdg anzo:hasInferredGraph true
  }
}
WHERE {
  GRAPH ?g {
    ?p ex:parent [ ex:brother ?uncle ] .
  }
  GRAPH ?mdg {
    ?mdg a anzo:metadatagraph ; anzo:namedGraph ?g
  }
}

Of course, combine this with some of the other SPARQL extensions that Glitter supports—subqueries, projected expressions, assignment, and aggregates being my favorites—and you’ve got a powerful way to transform and extract quad-based RDF data.

June 21, 2009

SPARQLing at SemTech

SemTech 2009 has come and gone, and it was great. I was concerned—as were others—that the state of the economy would depress the turnout and enthusiasm for the show, but it seems that any such effects were at least counterbalanced by a growing interest in semantic technologies. Early reports are that attendance was up about 20% from last year, and at sessions, coffee breaks, and the exhibit hall there seemed to always be more people than I expected. Good stuff.

Eric P. and I gave our SPARQL By Example tutorial to a crowd of about 50 people on Monday. From the feedback I’ve received, it seems that people found the session beneficial, and at least a couple of people remarked on the fact that Eric and I seemed to be having fun. If this whole semantic thing doesn’t work out, at least we can fall back on our ad-hoc comedy routines.

Anyways, I wanted to share a couple of links with everyone. I think they work nicely to supplement other SPARQL tutorials in helping teach SPARQL to newcomers and infrequent practitioners.

  1. SPARQL By Example slides. I’ve probably posted this link before, but the slides have now been updated with some new examples and with a series of exercises that help reinforce each piece of SPARQL that the reader encounters. Thanks to Eric P. for putting together all of the exercises and to Leigh Dodds for the excellent space exploration data set.
  2. SPARQL Cheat Sheet slides. This is a short set of about 10 slides intended to be a concise reference for people learning to write SPARQL queries. It includes things like common prefixes, the structure of queries, how to encode SPARQL into an HTTP URL, and more.

Enjoy, and, as always, I’d welcome any feedback, suggestions for improvements, or pointers to how/where you’re able to make use of these materials.

June 4, 2009

Why we love Semantic Web technologies

We’ll be releasing the first versions of our Anzo products in July. Between now and then I’m going to try to do some blogging showing various parts of the products. But before I begin that, I’ve been thinking a bunch recently about how to characterize our use of Semantic Web technologies, and I wanted to write a bit on that.

Our software views the world of enterprise data in a pretty straightforward way:

  1. Bring together as much data as possible.
  2. Do stuff with the data.
  3. Allow anyone to consume the data however (& whenever) they want.

This is a very simple take on what we do, but it gets to the heart of why we care about semantics: We love semantics because semantics is the “secret sauce” that makes possible each of these three aspects of what we do.

Here’s how:

Bring together as much data as possible

First of all, in most cases we don’t actually physically copy data around. That sort of warehouse approach is appropriate in some cases, but in general we prefer to leave data where it is and bring it together virtually. Our semantic middleware product, the Anzo Data Collaboration Server, provides unified read, write, and query interfaces to whatever data sources we’re able to connect to. We often refer to the unified view of heterogeneous enterprise data as a semantic fabric, but really it’s linked data for the enterprise.

Semantic Web technologies make this approach feasible. RDF is a data standard that is both expressive enough to represent any type of data that’s connected to the server and also flexible enough to handle new data sources incrementally. URIs provide a foundation for minting identifiers that don’t clash unexpectedly as new data sources are brought into the fold. Named graphs give us a simple abstraction upon which we can engineer practical concerns like security, audit trails, offline access, real-time updates, and caching. And, of course, GRDDL gives us a standard way to weave XML source data into the fabric.
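As a minimal sketch of what this looks like in data (TriG syntax; the URIs and property names here are hypothetical), two heterogeneous sources can land in the fabric as separate named graphs about the same resource, with no up-front schema coordination:

```trig
@prefix ex: <http://example.org/> .

# Triples connected from a relational source, in their own named graph
ex:crmGraph {
  ex:acme a ex:Customer ;
          ex:name "Acme Corp." .
}

# Triples later extracted from a spreadsheet, in a separate graph
ex:spreadsheetGraph {
  ex:acme ex:accountBalance 1500.00 .
}
```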

Without Semantic Web technologies we’d need to worry about defining a master relational schema up front, or we’d constantly have to figure out how to structurally relate or merge XML documents. And when we’re talking about data that originates not only in one or two big relational databases but also in hundreds or thousands or hundreds of thousands of Excel spreadsheets, the old ways just don’t cut it at all. Semantic Web technologies, on the other hand, provide the agile data foundation we need to bring data together.

But bringing together as much data as possible is not an end in itself. What’s the point of doing this?

Do stuff with the data

This one’s intentionally vague, because there are lots of things that lots of different people want—and need—to do with data, and Anzo is a platform that accommodates many of those things. In general, though, Semantic Web standards again lay the groundwork for the types of things that we want to do with data:

  • Data access. SPARQL gives us a way to query information from multiple data sources at once.
  • Describing data. RDF Schema and OWL are extremely expressive ways to describe (the structure of) data, particularly compared to alternatives like relational DDL or XML Schema. We can (and do) use data descriptions to do things like build user interfaces, generate pick lists (controlled vocabularies), validate data entry, and more.
  • Transforming data. There are all kinds of ways in which we need to derive new data from existing data. We might do this via inference (enabled by RDFS and OWL) or via rules (enabled by SPARQL CONSTRUCT queries, by RIF, or by SWRL) or simply via something like SPARQL/Update.
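As a minimal sketch of that last point (the vocabulary here is hypothetical), a CONSTRUCT query can act as a simple rule, deriving new triples from existing ones:

```sparql
# Sketch: a CONSTRUCT query used as a rule -- derive ex:colleague
# links between people who work for the same organization.
# (Hypothetical vocabulary.)
PREFIX ex: <http://example.org/>

CONSTRUCT { ?a ex:colleague ?b }
WHERE {
  ?a ex:worksFor ?org .
  ?b ex:worksFor ?org .
  FILTER (?a != ?b)
}
```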

Without Semantic Web technologies, we’d probably end up using a proprietary approach for querying across data sources. We’d have to hard-code all of our user interfaces or else invent or adopt a non-standard way of describing our data beyond what a relational schema gives us. And then we might choose a hodgepodge of rules engines, SQL triggers, and application-specific APIs to handle transforming our data. And this might all work just fine, but we’d have to put in all the time, effort, and money to make all the pieces work together.

To me, that’s the beauty of the much-maligned Semantic Web layer cake. The fact that semantic technologies represent a coherent set of standards (i.e. a set of disparate technologies that have been designed to play nice together) means that I can benefit from all of the “glue” work that’s already been done by the standards community. I don’t need to invent ways to handle different identifier schemes across technologies or how to transform from one data model to another and back again: the standards stack has already done that.

Allow anyone to consume the data however (& whenever) they want

Once we’ve put in place the ability to bring data together and do stuff to that data, the remaining task is to get that information in front of anyone who needs it when they need it. We’ve put in a lot of effort to make bringing data into the fabric and acting on that data easy, and it would be a shame if every time someone needs to consume some information they need to put in a request and wait 6 months for IT to build the right queries, views, and forms for them.

To this end, Anzo on the Web takes the increasingly popular faceted-browsing paradigm and puts it in the hands of non-technical users. Anyone can visually choose the data that they need to see in a grid, a scatter plot, a pie chart, a timeline, etc. and the right view is created immediately. Anyone can choose what properties of the data should be available as facets to filter through the data set via whatever attributes he or she wants.

Once again it’s the flexibility of the Semantic Web technology stacks that makes this possible for us. RDF makes it trivial for us to create, store, and discover customized lenses with arbitrary properties. RDF also lets us introspect on the data to present visual choices to users when configuring views and adding filters. SPARQL is a great vehicle for building the queries that back faceted browsing.

In summary

It bears repeating that as with most technology standards, the things that we accomplish with Semantic Web standards could be done with other technology choices. But using a coherent set of standards backed by a thriving community of both research and practice means that:

  1. We don’t have to invent all the glue that ties different technologies together
  2. Any new standards that evolve within this stack immediately give our software new capabilities (see #1)
  3. There’s a wide range of 3rd party software that will easily interoperate with Anzo (other RDF stores, OWL reasoners, etc.)
  4. We can focus on enabling solutions, rather than on the core technology bits. All of the above frees us up to do things like build an easy to use faceted browsing tool, build Anzo for Excel to collect and share spreadsheet data, build security and versioning and real-time updates, and much more.

Again, the semantics is really the secret sauce that makes much of what we do possible, but there’s a lot more innovation and engineering that turns that secret sauce into practical solutions. I’ll have some takes on what this looks like in practice in the coming weeks, and we’d love to show you in person if you’ll be in the Boston, MA area or if you’ll be at SemTech in San Jose, CA.

May 31, 2009

Cambridge Semantics @ SemTech 2009

I’m looking forward to this year’s Semantic Technology Conference in San Jose the week of June 14-18. I saw lots of fantastic sessions at last year’s SemTech and met tons of great people, and I imagine that this year will be even better. My colleagues at Cambridge Semantics and I will be giving a few talks, running the gamut from tutorial to technology survey to project report to our vision of how to build practical semantic solutions:

  • SPARQL By Example tutorial. I’ll be giving this half-day tutorial on Monday afternoon. We’ll use actual SPARQL queries that can be run on the (public) Semantic Web today as a means to learning SPARQL from the ground up.
  • Making Sense of Spreadsheets in Merck Basic Research. Jaime Melendez of Merck and I will be giving this talk bright and early on Tuesday morning. We’ll be reporting on the results of a joint innovation project that we completed last year using our Anzo software to address several challenges facing Merck basic research.
  • Enterprise Scalable Semantic Solutions in Five Days. Mike Cataldo will be talking later Tuesday morning about how Anzo makes use of semantic technologies to help our customers build practical, production-ready solutions in a matter of days.
  • Faceted Browsing Tools. Jordi Albornoz will be talking on Tuesday afternoon about the power and simplicity of faceted browsing and semantic lens technologies. He’ll be comparing and contrasting Exhibit, Fresnel, and our own Anzo on the Web.

I know that people have been saying this for a few years now, but I keep seeing the Semantic Web taking significant steps forward both inside of and outside of corporate firewalls. I fully expect this year’s SemTech to reaffirm this point of view. If you’ll be in San Jose, come by some of our talks and see what I mean. We’ll also have a space in the exhibit hall, so you can come and say hi there as well. See you there!

May 12, 2009

Semantic Web Landscape - 2009

I’m currently on a bit of a whirlwind trip to beautiful Lucerne to present a Semantic Web tutorial at the SIG meeting preceding the PRISM Forum meeting.

For the tutorial, I put together about 150 slides that act as a survey of the current landscape of Semantic Web technologies and tools. It’s aimed at giving an audience some motivation for Semantic Web technologies and at providing a tour through most of them. It’s not a “how to” tutorial—it’s more of a “here’s what this Semantic Web thing is all about” tutorial.

Anyway, I thought the slides might be interesting to other people and/or helpful to other presenters. Since I cribbed a bunch of material from some other people, it’s only fair that other people be free to do the same with my slides.

I’m always eager for feedback and suggestions to improve the tutorial material.

April 28, 2009

Encourage semantic technologies at

I’ve worked off and on in the past with Mills Davis and Brand Niemann of the U.S. EPA in looking at ways that Semantic Web technologies can benefit the U.S. federal government. We’ve got another chance to make that case this week.

Currently, the folks behind are hosting one week of open dialogue on IT approaches for exposing data about the U.S. stimulus package in an open and transparent fashion.

Needless to say, there are many calls for XML and Web Services-based approaches. In my opinion, these are fine and are definitely better than not having the data available at all. But I also think this dialogue gives those of us who believe in the transformative power of Semantic Web technologies a chance to speak in their favor.

Mills and I have submitted three ideas to the dialogue. I’d love it if you took a look at them, and if you think they’re good ideas, please indicate your support by voting and leaving a comment. I’d also love to hear from anyone else who is participating in the dialogue!

March 16, 2009

Evolving standards, consensus, and the energy to get there

We’re doing something mildly interesting in the recently re-chartered SPARQL working group. We’re spending our first couple of months defining what our deliverables will be for the rest of our 18-month chartered lifetime. The charter gives us some suggestions on things to consider (update, aggregates, XML serialization for queries, and more) and some constraints to keep in mind (backwards compatibility), but beyond that it’s up to the group.

So we’ve started by gathering potential features. We solicited features—which can be language extensions, syntactic shortcuts, protocol enhancements, result format modifications, integrations with other technologies like XQuery, OWL, or RIF, new query serializations, and more—both from within the Working Group and from the broader community. Within a week or so, we had identified nearly 40 features, and I expect a few more to come in yet.

The problem is: all of these features would be helpful. My take on developer-oriented-technology standards such as SPARQL is that ultimately they serve the users of the users of the implementations. There’s a pyramid here, wherein a small number of SPARQL implementations will support a larger number of developers creating SPARQL-driven software which in turn does useful (sometimes amazing) things for a much larger set of end users. So ideally, we’d focus on the features that benefit the largest swaths of those end users.

But of course that’s tough to calculate. So there’s another way we can look at things: the whole pyramid balances precariously on the shoulders of implementers, and, in fact, the specifications are themselves written to be formal guides to producing interoperable implementations. If implementers can’t understand an extension or willfully choose not to add it to their implementations, then there wasn’t much point in standardizing it in the first place. This suggests that implementer guidance should be a prime factor in choosing what our Working Group should focus on. And that’s far more doable since many of the Working Group participants are themselves SPARQL implementers.

Yet implementers’ priorities are not always tied to what’s most useful for SPARQL users and SPARQL users’ users. (This can be for a wide variety of reasons, not the least of which is that the feedback on what’s important for implementers’ users’ users often loses something in the multiple layers of communication that end up relaying it to implementers.) So what about that middle category, SPARQL users/developers? These fine folks have the most direct experience with SPARQL’s capabilities, caveats, and inabilities to solve different classes of problems as they apply to solving their users’ business/scientific/social/consumer problems. SPARQL users can and will surely contribute valuable experience along the lines of what extensions might make SPARQL easier to learn, easier to use, more powerful, and more productive when building solutions on the Semantic Web technology stack.

The difficulty here is that it’s often very, very hard for SPARQL developers to be selective in what features they’d like to see added to the landscape. SPARQL is their toolbox, and from their perspective (and understandably so), there’s little downside in stuffing as many tools as possible into SPARQL, just in case.

Things get more complicated. I (very) often joke (and will now write down for the first time) that if you get 10 Semantic Web advocates in a room, you’ll probably have 15 or 20 opinions as to what the Semantic Web is and what it’s for. When we zoom in on just the SPARQL corner of the Semantic Web world, things are no different. Some people are using SPARQL to query large knowledge bases. Some people are using SPARQL to answer ontologically-informed queries. Some people are using SPARQL to query an emerging Web of linked data. Some people are using SPARQL for business intelligence. Some people are using SPARQL in XML pipelines. Some people are using SPARQL as a de facto rules language. Some people are using SPARQL as a federated query language. And much more. No wonder then, that the Working Group might have difficulties reaching consensus on a significantly whittled-down list of features to standardize.

Why not do it all? Or, at least, why not come up with some sort of priority list for all of the features and work our way down that one at a time? It’s tempting, given the high quality of the suggestions, but I’m pretty sure it’s not feasible. Different groups of features interact with each other in different ways, and it’s exactly these interactions that need to be formally written down in a specification. Furthermore, the W3C process requires that as we enter and exit the Candidate Recommendation stage we demonstrate multiple interoperable implementations of our specifications—this becomes extremely challenging to achieve when the language, protocol, etc. are constantly moving targets. Add to that the need to build test cases, gather substantive reviews from inside and outside the Working Group, and (where appropriate) work together with other Working Groups. Now consider that Working Group participants are (for the most part) giving no more than 20% of their time to the Working Group. Believe me, 18 months flies by.

So what do I think is reasonable? I think we’ll have done great work if we produce high quality specifications for maybe three, four, or five new SPARQL features/extensions. That’s it.

(I’m not against prioritizing some others on the chance that my time estimates are way off; that seems prudent to me. And I also recognize that we’ve got some completely orthogonal extensions that can easily be worked on in parallel with one another. So there’s some wiggle room. But I hold a pretty firm conviction that the vast majority of the features that have been suggested are going to end up on the proverbial cutting-room floor.)

Here’s what I (personally) think should go into our decisions of what features to standardize:

  • Implementation experience. It’s easy to get in trouble when a Working Group resorts to design-by-committee; I prefer features that already exist in multiple, independent implementations. (They need not be interoperable already, of course: that’s what standards work is for!)
  • Enabling value. I’m more interested in working on features that enable capabilities that don’t already exist within SPARQL, compared to those features which are largely about making things easier. I’m also interested in working on those extensions that help substantial communities of SPARQL users (and, as above, their users). But in some cases this criterion may be trumped by…
  • Ease of specification. Writing down a formal specification for a new feature takes time and effort, and we’ve only a limited amount of both with which to work. I’m inclined to give preference to those features which are easy to get right in a formal specification (perhaps because a draft specification or formal documentation already exists) compared to those that have many tricky details yet to be worked out.
  • Ease/likelihood of implementation. I think this is often overlooked. There are a wide range of SPARQL implementations out there, and—particularly given the emerging cloud of linked data that can easily be fronted by multiple SPARQL implementations—there are a large number of SPARQL users that regularly write queries against different implementations. The SPARQL Working Group can add features until we’re blue in the face, but if many implementations are unable or choose not to support the new features, then interoperability remains nothing but a pipe dream for users.

One potential compromise, of sorts, is to define a standard extensibility mechanism for SPARQL. SPARQL already has one extensibility point in the form of allowing implementations to support arbitrary filter functions. There are a variety of forms that more sophisticated extensibility points might take. At the most general, Eric Prud’hommeaux mentioned to me the possibility of an EXTENSION keyword that would take an identifying URI, arbitrary arguments, and perhaps even arbitrary syntax within curly braces. Less extreme than that might be a formal service description that allows implementations to explore and converge on non-standard functionality while providing a standard way for users and applications to discover what features a given SPARQL endpoint supports. The first SPARQL Working Group (the DAWG) seems to have been very successful in designing a language that provided ample scope for implementers to try out new extensions. I think if our new Working Group can keep that freedom while also providing some structure to encourage convergence on the syntax and semantics of SPARQL extensions, we’ll be in great shape for the future evolution of SPARQL.
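To be clear, no EXTENSION keyword was ever standardized; purely as an illustration of the idea, it might have looked something like this (the keyword, extension URI, and argument syntax are all hypothetical):

```sparql
# Hypothetical syntax only -- not part of any SPARQL standard or
# implementation. The extension is identified by a URI, takes
# arbitrary arguments, and could carry extension-specific syntax
# within the braces.
PREFIX ex: <http://example.org/>

SELECT ?doc ?score
WHERE {
  ?doc a ex:Document .
  EXTENSION <http://example.org/ext/freeTextMatch> (?doc "semantic" ?score) { }
}
```

An endpoint that didn't recognize the extension URI could then fail fast with a well-defined error rather than a parse failure.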

There’s one final topic that I’ve alluded to but also wanted to explicitly mention: energy. We’ve got a lot of Working Group members with a variety of perspectives and a large number of potential work items around which we need to reach consensus. And then we need to reach consensus on the syntax and semantics of our work items, as well as the specification text used to describe them. We need editors and reviewers and test cases and test harnesses and W3C liaisons and community outreach and comment responders. All of this takes energy. The DAWG nearly ground to a premature halt as the standardization process dragged on for year after year. We can’t allow for that to happen this time around, so we need to keep the energy up. An enthusiastic Working Group, frequent contributions from the broader community, occasional face-to-face meetings, and noticeable progress indications can all help to keep our energy from flagging. And, of course, sticking to our 18-month schedule is as important as anything.

What do you think? I’m eager to hear from anyone with suggestions for how the Working Group can best meet its objectives. Do you disagree with some of my underlying assumptions? How about my criteria for considering features? Do you see any extensibility/evolutionary mechanisms that you think would ease the future growth of SPARQL? Please let me know.

March 2, 2009

Named graphs in Open Anzo

Bob DuCharme, who has recently been exploring a variety of triple stores, has an insightful post up asking questions about the idea of named graphs in RDF stores. Since the Open Anzo repository is based around named graphs (as are all Cambridge Semantics’ products based on Open Anzo such as Anzo for Excel), I thought I’d take a stab at giving our answers to Bob’s questions:

1. If graph membership is implemented by using the fourth part of a quad to name the graph that the triple belongs to, then a triple can only belong directly to one graph, right?

This is correct. In Open Anzo, triples are really quads, in that every subject-predicate-object triple has a fourth component, a URI that designates the named graph of the triple. The named graph with URI u comprises all of the triples (quads) that have u as their fourth component.

Of course, this means that the same triple (subject-predicate-object) can exist in multiple named graphs. In such a case, each such triple is distinct from the others; it can be removed from one named graph independently of its presence in other named graphs.
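A small sketch in TriG syntax (with hypothetical URIs) makes this concrete: the same subject-predicate-object triple appears below as two distinct quads, and removing it from ex:graphA would leave the copy in ex:graphB untouched.

```trig
@prefix ex: <http://example.org/> .

# The same triple stored as two distinct quads, one per named graph
ex:graphA { ex:alice ex:knows ex:bob . }
ex:graphB { ex:alice ex:knows ex:bob . }
```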

2. I say "belong directly" because I'm thinking that a graph can belong to another graph. If so, how would this be indicated? Is there some specific predicate to indicate that graph x belongs to graph y?

Open Anzo has no concept of nesting graphs or graph hierarchies. The URI of a named graph can be used as the subject or object of a triple just like any other URI, with a meaning specific to whatever predicate is being used. So two graphs can be related by means of ordinary triples, but there is no special support for any such constructs.

3. If we're going to use named graphs to track provenance, then it would make sense to assign each batch of data added to my triplestore to its own graph. Let's say that after a while I have thousands of graphs, and I want to write a SPARQL query whose scope is 432 of those graphs. Do I need 432 "FROM NAMED" clauses in my query? (Let's assume that I plan to query those same 432 multiple times.)

There are a couple of points here.

  1. First, for Open Anzo at least, it's up to the application developer how to group triples into named graphs. I don't think we've ever ourselves used the scheme you suggest (everything updated at once is a named graph), but you could if you wanted. Instead, named graphs tend to collect triples that represent a reasonably core object in the application's domain of discourse.
  2. Open Anzo does use named graphs for provenance. Named graphs are the basic unit for:
    • Versioning. When one or more triples in a named graph are updated, the entire graph is versioned. Open Anzo tracks the modification time and the user that instigated the change, and also provides an API for getting at previous revisions of a graph. (Graphs can also be explicitly created that do not keep track of revisions. Those still track the last updated on and last updated by bits of provenance.)
    • Access control. Control of who can read, write, remove, or change permissions on RDF data in Open Anzo is attached strictly at the named-graph level. This tends to work nicely with the general modeling approach that lets a named graph represent a conceptual entity.
    • Replication. Client applications can maintain local replicas of data from an Open Anzo server. Replication occurs at the level of a named graph.
  3. Second, it's worth noting that Open Anzo adds a bit of infrastructure for handling this sort of provenance. Each named graph in an Open Anzo repository has an associated metadata graph. The system manages the triples in the metadata graph, which can include access control data, provenance data, version histories, associated ontological elements, and more. This lets all of the provenance information be treated as RDF without conflating it with user/application-created triples.
  4. Third, regarding the challenge of handling queries that need to span hundreds or thousands of named graphs: As Bob observed, this is a common situation when you are basing a store around named graphs. The Open Anzo approach to this problem is to introduce the idea of a named dataset. A named dataset is a URI-identified collection of graphs. (Technically, it's two collections of graphs, representing both the default and named graph elements of a SPARQL query.) Glitter, the Open Anzo SPARQL engine, extends SPARQL with a FROM DATASET <u> clause that scopes the query to the graphs contained in the referenced named dataset, u. Currently, named datasets explicitly enumerate their constituent graphs. There's no reason, however, that the same approach could not be used along with other methods of identifying the dataset's graph contents, such as URI patterns or a query.
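Based on that description, a query scoped by a named dataset might look like the following sketch (the dataset URI and the ex: vocabulary are hypothetical; FROM DATASET is the Glitter extension described above, not standard SPARQL):

```sparql
# Sketch of Glitter's FROM DATASET extension: scope the query to the
# graphs enumerated by one named dataset, rather than writing
# hundreds of FROM NAMED clauses. (Hypothetical URIs.)
PREFIX ex: <http://example.org/>

SELECT ?project ?status
FROM DATASET <http://example.org/datasets/q1-projects>
WHERE {
  GRAPH ?g { ?project ex:status ?status }
}
```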

All in all, we find the named graph model to be extremely empowering when building applications based on RDF. It gives a certain degree of scaffolding that allows all sorts of engineering and user experience flexibility. At a high level, we approach named graphs in a similar fashion to how we approach ontologies. We find both constructs useful for dealing with large amounts of RDF in practical enterprise environments, for engineering various ways of partitioning and understanding the data throughout the software stack. In the end, the named graph model goes to the heart of a few of RDF's core value propositions: agility and expressivity of the data model and adaptability of software built upon it.

January 20, 2009

SPARQL By Example Webcast, Part II: Thursday, January 22

We had overwhelming interest and, consequently, lots of questions during our first SPARQL By Example Webcast (recorded archive available) that we did back in December. We ended up going through some basic SPARQL queries against FOAF data and DBPedia data, leading up to an introduction of OPTIONAL queries against Jamendo data. This Thursday, Semantic Universe and I will be presenting a second part of this tutorial. We’ll look at other elements of SPARQL queries, including UNIONs, datasets, CONSTRUCT queries, ASK queries, DESCRIBE queries, negation, and several common extensions to SPARQL such as aggregates and free-text search. At least, covering all of that is the goal!

If you’re interested, you need to register in advance and then attend the Webcast at 1pm EST / 10am PST this Thursday, January 22. Hope to “see” many of you there.

December 12, 2008

SPARQL By Example Webcast: Tuesday, December 16

I’ll be presenting an introduction to SPARQL Webcast this coming Tuesday at 1pm EST. The Webcast is the 4th in a free series hosted by the fine folks at Semantic Universe. In this one hour session, I’ll be using real queries that work against real data on the Web to teach the nuts and bolts of SPARQL, the Semantic Web query language. I’ve given this particular tutorial a few times before, and it’s a fun one to give—and I’ve gotten positive feedback in the past. Semantic Universe archives the sessions online, as well, so if you can’t make it on Tuesday, please check back later.

What: SPARQL By Example Webcast
When: Tuesday, December 16, 2008, at 1pm EST
Who: Anyone interested in an introduction to SPARQL driven completely by real examples
How: Register and then attend via Webcast

October 22, 2008

Videos: Anzo for Excel in action

Ever since we first showed Anzo for Excel at SemTech in May, we've had a blast talking with tons of people and discussing how a semantics-based approach to Excel might address many longstanding spreadsheet challenges facing organizations big and small alike. The technology continues to improve dramatically from week to week, and we're looking forward to a first general release of a productized Anzo for Excel at the end of this year.

We've recently put together a short (5 min.) video showing many of the capabilities of Anzo for Excel in the context of an ad-hoc project planning scenario:

Anzo for Excel: Spreadsheets for enterprise data management

As an added bonus, Jordi also put together a short video showing how Anzo for Excel along with our Anzo on the Web can allow you to expose, share, and publish spreadsheet data on the Web with just a few clicks, while simultaneously taking advantage of powerful faceted browsing capabilities (a la Exhibit) and custom visualizations:

Anzo for Excel: EPA Fuel Demonstration

We're actively engaging with partners and customers in exploring and building out use cases, demos, and projects leveraging Anzo for Excel. Drop me a line if you're interested in seeing/learning more.

October 1, 2008

Semantic Web Industry Report

David Provost has announced the completion and release of his report, On the Cusp: A Global Review of the Semantic Web Industry. Cambridge Semantics was pleased to be a part of the report, which we think is a valuable contribution to the growing industry. David features profiles of 17 Semantic Web product-oriented vendors and deploying companies. David's conclusions include:

The Semantic Web industry is alive, well, and it’s increasingly competitive as a commercial technology. At this point, there are too many success stories and too much money being invested to dismiss the technology as non-viable.  The Semantic Web is presently building a track record, which means the big wins and unanticipated uses are yet to come. In the meantime, adoption is occurring, and the early news is very good indeed.


July 25, 2008

SPARQL @ 6 months

I've been keeping an eye on the things people are saying about SPARQL and how SPARQL is being used ever since we published the W3C SPARQL Recommendations back in January. Of course, SPARQL has been around for much longer than half a year (as noted, for example, by Deepak Singh who "hadn’t even realized that [SPARQL] was in draft status"). Here's a summary of some of the more interesting or noteworthy ones. Please share any of your favorites in the comments!

Upon announcing SPARQL

Positive Impressions

Negative Impressions

  • Unhappy With SPARQL - unsatisfied with the verbose syntax, the lack of arbitrary selectable expressions, and the ASK query form
  • Have you heard of SPARQL? - worries that SPARQL queries are so targeted as to miss accidental discovery of interesting information and that this might be a new technology burden for small Web site developers
  • The problems of SPARQL - enumerates three problems touching on the RDF data model's complexity, SPARQL's relative anonymity (compared to SQL), and a perceived lack of expressivity

Explanatory/tutorial Writings

  • Understanding SPARQL - A tutorial by Andrew Matthews on IBM developerWorks that teaches SPARQL "through the example of a team tracking and journaling system for a virtual company."
  • Introduzione al Web Semantico - OK, this Italian tutorial by Simone Onofri has only a few slides on SPARQL, but slide 48 (introducing SPARQL) is so beautiful that I wanted to include it anyway.
  • Why SPARQL? - From yours truly.



Of course, the past six months have also seen plenty of new Linked Data deployments (often with accompanying SPARQL endpoints) as well as a bevy of new implementations, and enhancements and upgrades to existing implementations. And SPARQL continues to be used as the data-access bedrock of Semantic Web applications. All in all, the future looks bright--some might even say it SPARQLs.

June 5, 2008

The Open Anzo Command Line Interface

We're continuing to work feverishly at Cambridge Semantics, and one of the main focal points of our efforts is the upcoming (later this year) release of Open Anzo 3.0. In February I wrote a bit about the core client APIs that we've stabilized for this release. Today, I wanted to share a huge development-productivity aid that uses the client implementation: a feature-rich command-line client.

Joe Betz, who added and announced the new command line interface a few weeks ago, also wrote an excellent guide to getting set up and using the client. I heartily recommend the guide, but to whet your appetite, here's an example interaction with the CLI client. (This interaction takes place after installing the client and configuring its default settings, as described in the guide. It also assumes a running Anzo server, per the "Quick Start" section of the guide.)

May 15, 2008

Lee @ SemTech next week

I'll be heading to SemTech this weekend and am looking forward to meeting a lot of new people and seeing a lot of familiar, friendly faces. I'm particularly excited about the presentation that I'll be giving on Wednesday morning. In conjunction with Brand Niemann of the U.S. EPA, I'll be demonstrating some of the work that Cambridge Semantics has been doing to work with spreadsheets as a first-class source of semantic data. Our team has done a fantastic job building a user experience that's tightly integrated into Excel, and in doing so has provided a very easy way to free information from the confines of the spreadsheet.

I'm going to show a few different scenarios that involve linking data between different spreadsheets, reusing spreadsheet data on the Web, keeping live data updated in real-time, and more. Much of the presentation and demonstration is in the context of the U.S. Census Bureau's Statistical Abstract, and I'll also be showing how the same software can be applied to conference data from SemTech itself.

If you're planning to be at SemTech next week, please drop me a note so that I can say hi. And if you're there, please come see my presentation:

Title: Getting to Web Semantics for Spreadsheets in the U.S. Government
Day: Wednesday, May 21, 2008
Time: 08:30 AM - 09:30 AM

March 26, 2008

Now available online - Scientific American: "The Semantic Web in Action"

I blogged previously about my experience co-authoring an article on the Semantic Web for Scientific American. Since then, Scientific American has granted me permission to publish the text of the article on my Web site. So please feel free to enjoy the article and share it with others: "The Semantic Web In Action"

A few notes:

  • The default view of the article breaks it into multiple pages to make it more easily digestible and bookmarkable. There is a link at the top and bottom to a single-page version suitable for printing and reading offline. Or if you just happen to prefer reading it like that.
  • The article text is followed by the text of the article's sidebars. There are links back and forth between the main text and the relevant sidebars. Most of the sidebars in the article included artwork which I do not have permission to reproduce online at this time.
  • At the end of the article I've gathered links to the various companies, projects, and technologies referenced in the article. (The terms of the reproduction rights from Scientific American prohibit adding links within the main content of the article.)

Please let me know what you think. Also, if you have any trouble reading or printing the article, let me know as well. (I whipped together some JavaScript to do the pagination while maintaining the browser's back button and internal anchors and things like that, so there may be some bugs. I'll write more about the JavaScript some other time.)

March 18, 2008

Gathering SPARQL Extensions

I realized that I hadn't blogged a pointer to the compilation of SPARQL extensions that I've created on the ESW wiki. Quoting myself:

Over the DAWG's lifetime (and since publication of the SPARQL Recommendations in January), there have been many important features that have been discussed but did not get included in the SPARQL specifications. I -- and many others -- hope that many of these topics will be addressed by a future working group, though there are no concrete plans for such a group at this time.

In the interest of cataloging these extensions and encouraging SPARQL developers to seek interoperable implementations of SPARQL extensions, I've created:

That page links to individual pages for (currently) 13 categories of SPARQL extensions. Each of those pages, in turn, discusses the relevant type of SPARQL extension and attempts to provide links to research, discussion, and implementations of the extension.

I also plan to use this list to help encourage user- and implementor-driven discussion of these extensions over the coming months. Again, the goal is to allow SPARQL users to make known what features are most important to them and also to allow implementations to seek common syntaxes and semantics for SPARQL extensions. (All of this, in the end, should help a future working group charter a new version of SPARQL and produce a specification that allows for interoperable SPARQL v2 implementations.)

It's a wiki. Please add references that are not there, new topics, or discussions of existing topics. (I've tried to reuse existing ESW Wiki pages for some topics that already had discussion.)

Where I say "this list" above, I mean the mailing list. Please subscribe if you're interested in discussing any or all of these potential SPARQL extensions.

March 12, 2008

Semantic Web tutorial

Last week, Eric Prud'hommeaux and I presented a tutorial on Semantic Web technologies at the Conference on Semantics in Healthcare & Life Sciences (C-SHALS). It was a four-hour session covering an intro to RDF, SPARQL, GRDDL, RDFa, RDFS, and OWL, mostly in the context of health care (patients' clinical examination records) and life sciences (pyramidal neurons in Alzheimer's Disease, as per the W3C HCLS interest group's knowledgebase use case). We reprised the GRDDL and RDFa sections in a whirlwind 15-20 minute talk at yesterday's Cambridge Semantic Web gathering.

Enjoy the slides. I'd welcome any suggestions so that the slides can be enhanced and reused (by myself and others) in the future.

March 8, 2008

Modeling Statistics in RDF - A Survey and Discussion

At the Semantic Technologies Conference in San Jose in May, Brand Niemann of the U.S. EPA and I are presenting Getting to Web Semantics for Spreadsheets in the U.S. Government. In particular, Brand and I are working to exploit the semantics implicit in the nearly 1,500 spreadsheets that are in the U.S. Census Bureau's annual Statistical Abstract of the United States. The rest of this post discusses various strategies for modeling this sort of statistical data in RDF. (For more information on the background of this work, please see my presentation from the February 5, 2008, SICoP Special Conference.)

The data for the Statistical Abstract is effectively time-based statistics. There are a variety of ways that this information can be modeled as semantic data. The approaches differ in simplicity/complexity, semantic expressivity, and verbosity. At least as interestingly, they vary in precisely what they are modeling: statistical data or a particular domain of discourse. The goal of this effort is to examine the potential approaches to modeling this information in terms of ease of reuse, ease of query, ability to integrate with information from all 1,500 spreadsheets (and other sources), and the ability to enhance the model incrementally with richer semantics. There are surely other approaches to modeling this information as well: I'd love to hear any ideas or suggestions for other approaches to consider.



D2R Server for Eurostat

The D2R server guys host an RDF copy of the Eurostat collection of European economic, demographic, political, and geographic data. From the start, they make the simplifying assumption that:

Most statistical data are time series, therefore only the latest available value is provided here.

In other words, they do not try to capture historic statistics at all. The disclaimer also notes that what is modeled in RDF is a small subset of the available data tables.

Executing a SELECT DISTINCT ?p { ?s ?p ?o } to learn more about this dataset tells us:


I make a few observations from this:

  • Most of these are predicates that correspond to a statistical category. I'm curious what the types of the subjects are. The query here is (the filter is added to limit the question to resources that use the Eurostat predicates):
     SELECT DISTINCT ?t WHERE {
       ?s rdf:type ?t .
       ?s ?p ?o .
       FILTER(regex(str(?p), 'eurostat'))
     }
    The result is two types: regions and countries. Simple enough.
  • I'm also curious as to the types of the objects. Let's see if there are any resources (URIs) as objects. We do the ?s ?p ?o query from before but add in FILTER(isURI(?o)). The result shows that, aside from rdf:type and owl:sameAs (which we expected), only the predicate db:eurostat/parentcountry points to other resources. Doing a query on this predicate, we see that it relates regions (e.g. db:regions/Lorraine) to countries (e.g. db:countries/France).
  • I'd expect that, especially in the absence of time-based data, they don't have object structures with blank nodes. Changing the previous filter to use isBlank confirms that this is true.
  • So what are the types of the other data? Strings? Numbers? Let's find out. Poking around with various values for XXX in the filter FILTER(isLiteral(?o) && datatype(?o) = XXX) we see that some data uses xsd:strings while other data uses xsd:double. Poking around at the remaining predicates, we discover that they use xsd:long for non-decimal numbers.
  • What are they using owl:sameAs for? Executing SELECT ?s ?o { ?s owl:sameAs ?o } shows what I suspected: they're equating URIs that they've minted under a Eurostat namespace to DBpedia URIs (to broaden the linked data Web). Let's see if they use owl:sameAs for anything else. We add FILTER(!regex(str(?o), 'dbpedia')) and the query now returns no results.
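For readers who want to play with these filters without an endpoint handy, here's a minimal Python stand-in for the isURI-style exploration above (the triples are invented for illustration, not the actual Eurostat data):

```python
# A tiny in-memory stand-in for the exploratory SPARQL filters used above.
# The triples are illustrative, not the actual Eurostat data.
URI, LITERAL = "uri", "literal"

triples = [
    ((URI, "db:regions/Lorraine"), "db:eurostat/parentcountry", (URI, "db:countries/France")),
    ((URI, "db:regions/Lorraine"), "db:eurostat/population",    (LITERAL, 2346000)),
    ((URI, "db:countries/France"), "owl:sameAs",                (URI, "dbpedia:France")),
]

def is_uri(term):
    # Stand-in for SPARQL's isURI() test, using tagged terms.
    return term[0] == URI

# Which predicates point at resources? (the FILTER(isURI(?o)) query)
resource_preds = {p for (s, p, o) in triples if is_uri(o)}
print(sorted(resource_preds))
```

The same tagging trick extends to isBlank and isLiteral checks by adding more term kinds.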

The 2000 U.S. Census

Joshua Tauberer converted the 2000 U.S. Census Data into 1 billion RDF triples. He provides a well-documented Perl script that can convert various subsets of the census data into N3. One mode that this script can be run in is to output the schema from SAS table layout files. Joshua's about page provides an overview of the data. In particular, I note that he is working with tables that are multiple levels deep (e.g. population by sex and then by age).

The most useful part of the writeup, though, is the section specifically about modeling the census data in RDF. In general, Joshua models nested levels of statistical tables (representing multiple facets of the data) as a chain of predicates (with the interim nodes as blank nodes). If a particular criterion is further subdivided, then the aggregate total at that level is linked with rdf:value. Otherwise, the value is given as the object itself. Note that the subjects are not real-world entities ("the U.S.") but instead are data tables ("the U.S. census tables"). The entities themselves are related to the data tables via a details predicate. The excerpt below combines both types of information (the entity itself followed by the data tables above the entity):

 @prefix rdf: <> .
 @prefix dc: <> .
 @prefix dcterms: <> .
 @prefix : <,2005:rdf/census/details/100pct> .
 @prefix politico: <> .
 @prefix census: <> .

   a politico:country ;
   dc:title "United States" ;
   census:households 115904641 ;
   census:waterarea "664706489036 m^2" ;
   census:population 281421906 ;
   census:details <> ;
   dcterms:hasPart <>, <>, ...

 <>  :totalPopulation 281421906 ;     # P001001
   :totalPopulation [
      dc:title "URBAN AND RURAL (P002001)";
      rdf:value 281421906 ;   # P002001
      :urban [
         rdf:value 222360539 ;  # P002002
         :insideUrbanizedAreas 192323824 ;   # P002003
         :insideUrbanClusters 30036715      # P002004
      ] ;
      :rural 59061367    # P002005
   ] ;
   :totalPopulation [
      dc:title "RACE (P003001)";
      rdf:value 281421906 ;   # P003001
      :populationOfOneRace [
         rdf:value 274595678 ;    # P003002
         :whiteAlone 211460626 ;     # P003003
         :blackOrAfricanAmericanAlone 34658190 ;     # P003004
         :americanIndianAndAlaskaNativeAlone 2475956   # P003005
      ]
   ] .

This modeling is inconsistent (as Joshua himself admits in the description). Note, for instance, how :totalPopulation > :urban has an rdf:value link to the aggregate US urban population. When you go one level deeper, though, :totalPopulation > :urban > :insideUrbanizedAreas has an object which is itself the value of that statistic.

As I see it, this inconsistency could be avoided in two ways:

  1. Always insist that a statistic hangs off of a resource (URI or blank node) via the rdf:value predicate.
  2. Allow a criterion/classification predicate to point both to a literal (aggregate) value and also to further subdivisions. This would allow the above example to have a triple which was :totalPopulation > :urban > 222360539 in addition to the further nested :totalPopulation > :urban > :insideUrbanizedAreas > 192323824.

The second approach seems simpler to me (fewer triples). It can be queried with an isLiteral filter restriction. The first approach might make for slightly simpler queries, as it would always just query for rdf:value. (The queries would be about the same size, but the rdf:value approach is a bit clearer to read than the isLiteral filter approach.)
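To make the second approach concrete, here's a quick Python sketch (with invented triples, not Joshua's actual data) of the double-use predicate and the isLiteral-style query that picks out the aggregate:

```python
# The second approach sketched in Python: one predicate (:urban) points both
# to an aggregate literal and to a blank node holding its subdivisions.
# Triples are illustrative, not the actual census conversion.
triples = [
    ("_:tbl",   ":totalPopulation",      281421906),
    ("_:tbl",   ":urban",                222360539),   # aggregate, as a literal
    ("_:tbl",   ":urban",                "_:urban"),   # same predicate, to subdivisions
    ("_:urban", ":insideUrbanizedAreas", 192323824),
    ("_:urban", ":insideUrbanClusters",  30036715),
    ("_:tbl",   ":rural",                59061367),
]

def is_literal(o):
    # Stand-in for SPARQL's isLiteral(): numbers are the literals here.
    return isinstance(o, int)

# The isLiteral-filter query: fetch only the numeric value of :urban.
urban = [o for (s, p, o) in triples if s == "_:tbl" and p == ":urban" and is_literal(o)]
print(urban)
```

Dropping the filter would also return the blank node, which is exactly the ambiguity the filter resolves.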

As an aside, this statement from Joshua is a telling factor on the value of what we are doing with the U.S. Statistical Abstract data:

(If you followed Region > households > nonFamilyHouseholds you would get the number of households, not people, that are nonFamilyHouseHolds. To know what a "non-family household" is, you would have to consult the PDFs published by the Census.)

Riese: RDFizing and Interlinking the EuroStat Data Set Effort

Riese is another effort to convert the EuroStat data to RDF. It seeks to expand on the coverage of the D2R effort. Project discussion is available on an ESW wiki page, but the main details of the effort are on the project's about page. Currently, riese only provides five million out of the three billion triples that it seeks to provide.

The under the hood section of the about page links to the riese schema. (Note: this is a simple RDF schema; no OWL in sight.) The schema models statistics as items that link to times, datasets, dimensions, geo information, and a value (using rdf:value).

Every statistical data item is a riese:item. riese:items are qualified with riese:dimensions, one of which is, in particular, dimension:Time.

The "ask" page gives two sample queries over the EuroStat RDF data, but those only deal in the datasets. RDF can be retrieved for the various Riese tables and data items by appending /content.rdf to the items' URIs and doing an HTTP GET. Here's an example of some of the RDF for a particular data item (this is not strictly legal Turtle, but you'll get the point):

@prefix : <> .
@prefix riese: <> .
@prefix dim: <> .
@prefix dim-schema: <> .

:bp010 a riese:dataset ;
  # all dc:title's repeated as rdfs:label
  dc:title "Current account - monthly: Total" ;
  riese:data_start "2002m10" ; # proprietary format?
  riese:data_end   "2007m09" ;
  riese:structure  "geo\time" ; # not sure of this format
  riese:datasetOf :bp010/2007m03_ea .

:bp010/2007m03_ea a riese:Item ;
  dc:title "Table: bp010, dimensions: ea, time: 2007m03" ;
  rdf:value "7093" ; # not typed
  riese:dimension dim:geo/ea ;
  riese:dimension dim:time/2007m03 ;
  riese:dataset :bp010 .

dim:geo/ea a dim-schema:Geo .
  dc:title "Euro area (EA11-2000, EA12-2006, EA13-2007, EA15)" .

dim:time/2007m03 a dim-schema:Time .
  dc:title "" . # oops

dim-schema:Geo rdfs:subClassOf riese:Dimension ; dc:title "Geo" .
dim-schema:Time rdfs:subClassOf riese:Dimension ; dc:title "Time" .

(A lot of this is available in dic.nt (39 MB).)


In summary, these three examples show three distinct approaches for modeling statistics:

  1. Simple, point-in-time statistics. Predicates that fully describe each statistic relate a (geographic, in this case) entity to the statistic's value. There's no way to represent time (or other dimensions) in this model other than to create a new predicate for every combination of dimensions (e.g. country:bolivia stat:1990population18-30male 123456). Queries are flat and rely on knowledge of the predicates, or on metadata about them (e.g. rdfs:label). There's no easy way to generate tables of related values. Observation: this approach effectively builds a model of the real world, ignoring statistical artifacts such as time, tables, and subtables.
  2. Complex, point-in-time statistics. An initial predicate relates a (geographic, in this case) entity to both an aggregate value for the statistic, as well as to (via blank nodes) other predicates that represent dimensions. Aggregate values are available off of any point in the predicate chain. Applications need to be aware of the hierarchical predicate structure of the statistics for queries, but can reuse (and therefore link) some predicates amongst different statistics. Nested tables can easily be constructed from this model. Observation: this approach effectively builds a model of the statistical domain in question (demographics, geography, economics, etc. as broken into statistical tables).
  3. Complex statistics over time. Each statistic (each number) is represented as an item with a value. Dimensions (including time) are also described as resources with values, titles, etc. In this approach, the entire model is described by a small number of predicates. Applications can flexibly query for different combinations of time and other dimensions, though they still must know the identifying information for the dimensions in which they are interested. Applications can fairly easily construct nested tables from this model. Observation: this approach effectively uses a model of statistics (in general) which in turn is used to express statistics about the domains in question.

Statistical Abstract data

Simple with time

One of the simplest data tables in the Statistical Abstract gives statistics for airline on-time arrivals and departures. A sample of how this table is laid out is:

Airport                        On-time Arrivals     On-time Departures
                               2006 Q1   2006 Q2    2006 Q1   2006 Q2
Total major airports             77.0      76.7       79.0      78.5
Atlanta, Hartsfield              73.9      75.5       76.0      74.3
Boston, Logan International      75.6      66.8       80.5      74.8

Overall, this is fairly simple. Every airport, for each time period, has an on-time arrival percentage and an on-time departure percentage. If we simplified it even further by removing the use of multiple times, then it's just a simple grid spreadsheet (relating airports to arrival % and departure %). This does have the interesting (?) twist that the aggregate data (total major airports) is not simply a sum of the constituent data items (since we're dealing in percentages).
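A quick sanity check with made-up flight counts shows why: the overall figure is a flight-count-weighted average of the per-airport percentages, not their sum (or simple mean):

```python
# Why "total major airports" isn't a sum of its rows: on-time percentages
# aggregate as a flight-count-weighted average. The counts are invented here.
airports = {              # airport -> (on-time arrivals, total arrivals)
    "ATL": (28_000, 37_889),
    "BOS": (9_000, 11_905),
}

per_airport = [100 * ok / n for ok, n in airports.values()]

ontime = sum(ok for ok, n in airports.values())
total = sum(n for ok, n in airports.values())
overall = 100 * ontime / total   # weighted average, not (73.9 + 75.6) / 2

print([round(p, 1) for p in per_airport], round(overall, 1))
```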

Simple point-in-time approach

If we ignore time (and choose 2006 Q1 as our point in time), then this data models as:

 ex:ATL ex:ontime-arrivals 73.9 ; ex:ontime-departures 76.0 .
 ex:BOS ex:ontime-arrivals 75.6 ; ex:ontime-departures 80.5
 ex:us-major-airports ex:ontime-arrivals 77.0 ; ex:ontime-departures 79.0

This is simple, but ignores time. It also doesn't give any hint that ex:us-major-airports is a total/aggregate of the other data. We could encode time in the predicates themselves (ex:ontime-arrivals-2006-q1), but I think everyone would agree that that's a bad idea. We could also let each time range be a blank node off the subjects, but that assumes all subjects have data conforming to the same time increments. Any such approach starts to get close to the complex point-in-time approach, so let's look at that.

Complex point-in-time approach

If we ignore time and view the "total major airports" as unrelated to the individual airports, then we have no "nested tables" and this approach degenerates to the simple point-in-time approach, effectively:

 ex:ATL a ex:Airport ;
   dcterms:isPartOf ex:us-major-airports ;
   stat:details [
     ex:on-time-arrivals 73.9 ;
     ex:on-time-departures 76.0
   ] .
 ex:BOS a ex:Airport ;
   dcterms:isPartOf ex:us-major-airports ;
   stat:details [
     ex:on-time-arrivals 75.6 ;
     ex:on-time-departures 80.5
   ] .
 ex:us-major-airports
   dcterms:hasPart ex:ATL, ex:BOS ;
   stat:details [
     ex:on-time-arrivals 77.0 ;
     ex:on-time-departures 79.0 ;
   ] .    

We could treat time as a special-case that conditionalizes the statistics (stat:details) for any particular subject, such as:

 ex:ATL a ex:Airport ;
   dcterms:isPartOf ex:us-major-airports ;
   stat:details [
     stat:start "2006-01-01"^^xsd:date ;
     stat:end   "2006-02-28"^^xsd:date ;
     stat:details [
       ex:on-time-arrivals 73.9 ;
       ex:on-time-departures 76.0
      ]
   ] .

If we ignore time but view the "total major airports" statistics as an aggregate of the individual airports (which are subtables, then), we get this RDF structure:

 ex:us-major-airports
   ex:on-time-arrivals 77.0 ;
   ex:on-time-departures 79.0 ;
   ex:ATL [
     ex:on-time-arrivals 73.9 ;
     ex:on-time-departures 76.0
   ] ;
   ex:BOS [
     ex:on-time-arrivals 75.6 ;
     ex:on-time-departures 80.5
   ] .
This is interesting because it treats the individual airports as subtables of the dataset. I don't think it's really a great way to model the data, however.

Complex Statistics Over Time

 ex:ontime-flights a stat:Dataset ;
   dc:title "On-time Flight Arrivals and Departures at Major U.S. Airports: 2006" ;
   stat:date_start "2006-01-01"^^xsd:date ;
   stat:date_end "2006-12-31"^^xsd:date ;
   stat:structure "... something that explains how to display the stats ? ..." ;
   stat:datasetOf ex:atl-arr-2006q1, ex:atl-dep-2006q1, ... .
 ex:atl-arr-2006q1 a stat:Item ;
   rdf:value 73.9 ;
   stat:dataset ex:ontime-flights ;
   stat:dimension ex:Q12006 ;
   stat:dimension ex:arrivals ;
   stat:dimension ex:ATL .
 ex:atl-dep-2006q1 a stat:Item ;
   rdf:value 76.0 ;
   stat:dataset ex:ontime-flights ;
   stat:dimension ex:Q12006 ;
   stat:dimension ex:departures ;
   stat:dimension ex:ATL .
 ... more data items ...
 ex:Q12006 a stat:TimePeriod ;
   dc:title "2006 Q1" ;
   stat:date_start "2006-01-01"^^xsd:date ;
   stat:date_end "2006-03-31"^^xsd:date .
 ex:arrivals a stat:ScheduledFlightTime ;
   dc:title "Arrival time" .
 ex:departures a stat:ScheduledFlightTime ;
   dc:title "Departure time" .
 ex:ATL a stat:Airport ;
   dc:title "Atlanta, Hartsfield" .
 ... more dimension values ...
 stat:TimePeriod rdfs:subClassOf stat:Dimension ; dc:title "time period" .
 stat:ScheduledFlightTime rdfs:subClassOf stat:Dimension ; dc:title "arrival or departure" .
 stat:Airport rdfs:subClassOf stat:Dimension ; dc:title "airport" .

First, this seems to be the most verbose. It also seems to give the greatest flexibility in terms of modeling time and querying the resulting data. One related alternative to this approach would replace dimension objects with dimension predicates, as in:

 ex:atl-arr-2006q1 a stat:Item ;
   rdf:value 73.9 ;
   stat:dataset ex:ontime-flights ;
   stat:date_start "2006-01-01"^^xsd:date ;
   stat:date_end "2006-03-31"^^xsd:date ;
   stat:airport ex:ATL ;
   stat:scheduled-flight-time ex:arrivals .
 stat:airport rdfs:subPropertyOf stat:dimension ; dc:title "airport" .

This may be a bit less verbose, but loses the ability to have multivalued dimensions such as stat:TimePeriod in the first example.
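To illustrate the flexibility claim, here's a tiny Python stand-in for the riese-style model (the item names and dimension values mirror the illustrative ex: snippet above): a query is just "find the items whose dimensions include this set."

```python
# Riese-style lookup sketched in Python: each stat:Item carries a value
# plus a set of dimension resources, and queries match on dimension subsets.
# The data mirrors the illustrative ex: snippet above.
items = {
    "ex:atl-arr-2006q1": (73.9, {"ex:Q12006", "ex:arrivals", "ex:ATL"}),
    "ex:atl-dep-2006q1": (76.0, {"ex:Q12006", "ex:departures", "ex:ATL"}),
}

def value_for(*dims):
    # Return the value of every item whose dimensions include all of `dims`.
    wanted = set(dims)
    return [value for value, d in items.values() if wanted <= d]

print(value_for("ex:ATL", "ex:arrivals"))  # ATL arrivals across all time periods
print(value_for("ex:ATL", "ex:Q12006"))    # every ATL statistic for 2006 Q1
```

Because dimensions are data rather than predicate structure, the same lookup works for any slice (by time, by airport, by statistic) without new predicates.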


The riese approach seems the best combination of flexibility and usability. It should allow us to recreate the data-table structures with a reasonable degree of fidelity in another environment (e.g. on the Web), as well as to construct a basic semantic repository by attaching definitions to the various statistical entities, facets, and properties. All that said, the proof is in the pudding, and I'm quite open to other suggestions.

February 27, 2008

Anzo.*: Building Semantic Applications in Heterogeneous Environments

At Cambridge Semantics we're busy working on what will become version 3 of Open Anzo. As I've written about before, our interest in Semantic Web technologies lies in the powerful applications that can be built by taking advantage of RDF's data model. To this end, we've continually sought RDF programming models that contain features necessary to building these applications:

  • Named graphs (quads) support, for modularizing applications' data
  • Replication, for offline applications and snappy user experience
  • Notification, for real-time collaborative updates
  • Role-based access control, to facilitate a multi-user environment
  • Versioning, to maintain an auditable history of data changes

To promote a consistent development experience between the various environments that we support--Java development, Web development, Windows development--we've worked to define a core set of abstract, client-side APIs (documentation is currently sound but not complete) for building semantic applications that can take advantage of these enterprise features. Currently, we have three concrete instantiations of this API:, Anzo.js, and Anzo.NET. Version 3 of Anzo includes many other architectural improvements intended to help us realize Anzo's status as an open-source semantic middleware platform, and we're not done yet. We do our best to keep the latest version of the code in subversion stable, however, so feel free to check it out. The mailing list is a great place to ask questions. As we get closer to a formal release of Anzo 3, we'll have more code samples, tutorials, and demos to share, so stay tuned...

January 25, 2008


I'm quite pleased to have played a part in helping SPARQL become a W3C Recommendation. As we were putting together the press release that accompanied the publication of the SPARQL recommendations, Ian Jacobs, Ivan Herman, Tim Berners-Lee, and I put together some comments (in bullet point form) explaining some of the benefits of SPARQL. They do a good job of capturing a lot of what I find appealing about SPARQL, and I wanted to share them with other people. I don't think these are the best examples of SPARQL's value or the most eloquently expressed, but I do think they capture a lot of the essence of SPARQL. (While some of the text is attributable to me, parts are attributable to Ian, Ivan, and Tim.)

  • SPARQL is to the Semantic Web (and, really, the Web in general) what SQL is to relational databases. (This is effectively Tim's quotation from the press release.)
  • If we view the Semantic Web as a global collection of databases, SPARQL can make the collection look like one big database. SPARQL enables us to reap the benefits of federation. Examples:
    • Federating information from multiple Web sites (mashups)
    • Federating information from multiple enterprise databases (e.g. manufacturing and customer orders and shipping systems)
    • Federating information between internal and external systems (e.g. for outsourcing, public Web databases (e.g. NCBI), supply-chain partners)
  • There are many distinct database technologies in use, and it's of course impossible to dictate a single database technology at the scale of the Web. RDF (the Semantic Web data model), though, serves as a standard lingua franca (least common denominator) in which data from disparate database systems can be represented. SPARQL, then, is the query language for that data. As such, SPARQL hides the details of a server's particular data management and structure. This reduces costs and increases the robustness of software that issues queries.
  • SPARQL saves development time and cost by allowing client applications to work with only the data they're interested in. (This is as opposed to bringing it all down and spending time and money writing software to extract the relevant bits of information.)
    • Example: Find US cities' population, area, and mass transit (bus) fare, in order to determine if there is a relationship between population density and public transportation costs.
    • Without SPARQL, you might tackle this by writing a first query to pull information from cities' pages on Wikipedia, a second query to retrieve mass transit data from another source, and then code to extract the population and area and bus fare data for each city.
    • With SPARQL, this application can be accomplished by writing a single SPARQL query that federates the appropriate data sources. The application developer need only write a single query and no additional code.
  • SPARQL builds on other standards including RDF, XML, HTTP, and WSDL. This allows reuse of existing software tooling and promotes good interoperability with other software systems. Examples:
    • SPARQL results are expressed in XML: XSLT can be used to generate friendly query result displays for the Web
    • It's easy to issue SPARQL queries, given the abundance of HTTP library support in Perl, Python, PHP, Ruby, etc.
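As a sketch of both points, here's how a SPARQL protocol GET request can be formed and a SPARQL XML results document consumed with just the Python standard library (the endpoint URL is hypothetical, and the results document is canned here rather than fetched):

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Forming a SPARQL protocol request: the query rides in the 'query' parameter.
# The endpoint URL is hypothetical.
endpoint = "http://example.org/sparql"
query = "SELECT ?p WHERE { ?s ?p ?o } LIMIT 1"
request_url = endpoint + "?" + urllib.parse.urlencode({"query": query})

# Consuming the standard SPARQL XML results format (a canned document here;
# in practice you'd urllib.request.urlopen(request_url) and parse the body).
results_xml = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="p"/></head>
  <results>
    <result>
      <binding name="p"><uri>http://purl.org/dc/elements/1.1/title</uri></binding>
    </result>
  </results>
</sparql>"""

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}
tree = ET.fromstring(results_xml)
bindings = [b.find("sr:uri", NS).text
            for b in tree.iter("{http://www.w3.org/2005/sparql-results#}binding")]
print(bindings)
```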

Finally, I scribbled down some of my own thoughts on how SPARQL takes the appealing principles of a Service Oriented Architecture (SOA) one step further:

  • With SOA, the idea is to move away from tightly-coupled client-server applications in which all of the client code needs to be written specifically for the server code and vice versa. SOA says that if instead we just agree on service interfaces (contracts) then we can develop and maintain services and clients that adhere to these interfaces separately (and therefore more cheaply, scalably, and robustly).
  • SPARQL takes some of this one step further. For SOA to work, services (people publishing data) still have to define a service, a set of operations that they'll use to let others get at their information. And someone writing a client application against such a service needs to adhere to the operations in the service. If a service has 5 operations that return various bits of related data and a client application wants some data from a few services but doesn't want most of it, the developer still must invoke all 5 services and then write the logic to extract and join the data relevant for her application. This makes for marginally complex software development (and complex == costly, of course).
  • With SPARQL, a service-provider/data-publisher simply provides one service: SPARQL. Since it's a query language accessible over a standard protocol (HTTP), SPARQL can be considered a 'universal service'. Instead of the data publisher choosing a limited number of operations to support a priori and client applications being forced to conform to these operations, the client application can ask precisely the questions it wants to retrieve precisely the information it needs. Instead of 5 service invocations + extra logic to extract and join data, the client developer need only author a single SPARQL query. This makes for a simpler application (and, of course, less costly).

As an example, consider an online book merchant. Suppose I want to create a Web site that finds books by my favorite author that are selling for less than $15, including shipping. The merchant supplies three relevant services:

  1. Search. Includes search by author. Returns book identifiers.
  2. Book lookup. Takes a book identifier and returns the title, price, abstract, shipping weight, etc.
  3. Shipping lookup. Takes total order weight, shipping method, and zip code, and returns a shipping cost.

To create my Web site without SPARQL, I'd need to:

  1. Invoke the search service. (Query 1)
  2. Write code to extract the result identifiers and, for each one, invoke the book lookup service. (Code 1, Query 2 (issued multiple times))
  3. Write code to extract the price and, for each book, invoke the shipping lookup service with that book's weight. (Code 2, Query 3 (issued multiple times))
  4. Write code to add each book's price and shipping cost and check if it's less than $15. (Code 3)
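Sketched in JavaScript, the orchestration above might look like the following. The three stub functions and their data are entirely hypothetical stand-ins for the merchant's services:

```javascript
// Hypothetical stubs standing in for the merchant's three services.
function searchByAuthor(author) {
  // Query 1: returns book identifiers.
  return ["b1", "b2"];
}
function bookLookup(id) {
  // Query 2 (issued once per identifier): returns one book's details.
  const books = {
    b1: { title: "Book One", price: 9.0, weight: 1.0 },
    b2: { title: "Book Two", price: 14.0, weight: 2.0 }
  };
  return books[id];
}
function shippingLookup(weight) {
  // Query 3 (issued once per book): returns a shipping cost.
  return weight * 2.0;
}

// Code 1-3: extract, join, and filter the results by hand.
function booksUnder(author, limit) {
  return searchByAuthor(author)
    .map(id => ({ id, ...bookLookup(id) }))
    .filter(b => b.price + shippingLookup(b.weight) < limit)
    .map(b => b.title);
}
```

Every step of extraction, joining, and filtering here is application code that has to be written, tested, and maintained.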

Now, suppose the book merchant exposed this same data via a SPARQL endpoint. The new approach is:

  1. Use the SPARQL protocol to ask a SPARQL query with all the relevant parameters (Query 1 (issued once))

For the record, the query might look something like:

SELECT ?book ?title
FROM :inventory
WHERE {
  ?book a :book ;
        :author ?author ;
        :title ?title ;
        :price ?price ;
        :weight ?weight .
  ?author :name "My favorite Author" .
  FILTER(?price + :shipping(?weight) < 15)
}

(This example also illustrates another feature of SPARQL: queries can be extended with new FILTER functions defined by the SPARQL endpoint. In this case, the endpoint defines a function, :shipping, that returns the shipping cost for a given order weight.)
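For completeness, here's a minimal JavaScript sketch of how a client might dispatch such a query via the SPARQL protocol's HTTP GET binding. The endpoint URL is hypothetical; the protocol itself specifies that the query text travels URL-encoded in the "query" parameter:

```javascript
// Build a SPARQL protocol GET URL for a (hypothetical) endpoint.
// Per the SPARQL protocol, the query text goes in the "query"
// parameter, URL-encoded.
function sparqlQueryUrl(endpoint, query) {
  return endpoint + "?query=" + encodeURIComponent(query);
}

const query = 'SELECT ?book ?title WHERE { ?book :title ?title }';
const url = sparqlQueryUrl("http://books.example/sparql", query);
// Any HTTP library (XMLHttpRequest in the browser, for instance)
// can then GET this URL and receive the SPARQL results XML.
```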

December 20, 2007

Scientific American: "The Semantic Web in Action"

I'm pleased to write that the December 2007 issue of Scientific American contains an article titled "The Semantic Web in Action", coauthored by Ivan Herman, Tonya Hongsermeier, Eric Neumann, Susie Stephens, and myself.

We were invited to write the article as a follow-up to the original 2001 Scientific American Semantic Web article by Tim Berners-Lee, Jim Hendler, and Ora Lassila. We wanted to share some practical examples of problems currently being solved with Semantic Web technologies, particularly in health care and life sciences. The article presents two detailed case studies. The first is the work of a team at Cincinnati Children's Hospital Medical Center who use RDF in conjunction with PageRank-esque algorithms to prioritize potential drug targets for cardiovascular diseases. The second case focuses on the University of Texas Health Science Center's SAPPHIRE system. SAPPHIRE integrates information from various health care providers to allow public health officials to better assess potential emerging public health risks and disease epidemics. The article also talks about the potential for Semantic Web technologies and the work of companies such as Agfa and Partners to help health care providers deal with the rate of knowledge acquisition and change in their clinical decision support (CDS) systems.

Aside from these case studies, the article takes somewhat of a whirlwind tour across the current landscape of Semantic Web applications. Along the way, RDF, OWL, SPARQL, GRDDL, and FOAF all get mentions. Science Commons and DBpedia are briefly touched on, and the article acknowledges a variety of companies that are engaged in Semantic Web application research, prototyping, or deployment: British Telecom, Boeing, Chevron, MITRE, Ordnance Survey, Vodafone, Harper's Magazine, Joost, IBM, Hewlett-Packard, Nokia, Oracle, Adobe, Aduna, Altova, @semantics, Talis, OpenLink, TopQuadrant, Software AG, Eli Lilly, Pfizer, Garlik. And there were loads that couldn't be included in the end due to space restrictions, all of which is a testament to the continued growth in adoption of these technologies.

Unfortunately, the article is not currently available for free online. An electronic version is available (along with the rest of the December 2007 issue) from Scientific American's Web site for US$7.95, and the issue should also be available at newsstands in the US for a bit longer. I'm not sure when/if the article is available on newsstands across the rest of the world. I've been working with the copyright editors at Scientific American in an attempt to procure the rights to publish the article on my own Web site (and/or possibly on the W3C's site), but they haven't yet responded to my application.

In any case, it was a fantastic experience working with my colleagues to bring some information on the progress of the Semantic Web to the readers of Scientific American. I've gotten some great feedback from family, friends, and colleagues who have read the article. Several people in the Semantic Web community have let me know that they've found the article to be useful material for helping introduce people to the ideas and applications behind Semantic Web technologies. So please check out the article if you're so inclined, and I'd love to hear what you think. I'll also be sure to update this space if I'm able to secure the rights to publish the full text of the article here.

26-Mar-2008 Update: I've since received permission to publish the article. Enjoy!

October 24, 2007

Announcing: Open Anzo 2.5 released

As promised, the Open Anzo project has released version 2.5 of the Anzo enterprise RDF store. Version 2.5 is a stable release with a collection of bug fixes and new features since the fork from Boca. The release notes enumerate the additions, improvements, and changes, but here are some of the more significant ones:

  • Add Oracle database support
  • Add GROUP BY clause and COUNT(*) to Glitter SPARQL engine (more on this in a separate post, but along the lines of what exists in ARQ, Virtuoso, and RAP)
  • Query performance improvements against both named graphs and metadata graphs
  • Extensive Javadocs for all public classes, interfaces, methods, and member variables

Things you can do:

  • Download and install Open Anzo: release 2.5, nightly snapshots, or the source from SVN
  • Learn from the Open Anzo wiki
  • View open tickets showing some of what's coming
  • Join the Open Anzo development community
  • Peruse the Anzo 2.5 Javadocs

October 14, 2007

Introducing: Cambridge Semantics and the Open Anzo project

It's been a while since I last posted here to muse about the differences between "the Semantic Web" and "Semantic Web technologies". Since then, I've been quite pleased to see the Linking Open Data project continue to soar, including an extremely successful BoF and panel at WWW 2007 in Banff. New data sources continue to be linked in to the Semantic Web, including data from Wikicompany, flickr, and GovTrack. The project maintains a list and a picture of the growing Web of linked open data.

Meanwhile, I have not been idle in my work to advance Semantic Web technologies inside enterprises. In July, I left IBM and co-founded Cambridge Semantics, Inc. Building upon the work that began with the open-source IBM Semantic Layered Research Platform, Cambridge Semantics is dedicated to building feature-rich semantic middleware that can power a vast breadth of semantic applications that realize the potential of the full stack of Semantic Web technologies.

One of the first things that we've done at Cambridge Semantics is set up the Open Anzo project. Anzo is an open-source fork of Boca, an enterprise RDF store. Anzo starts with the same rich feature set of Boca, including named graphs, replication, notification, access controls, and full revision histories. To this, Anzo (so far) adds a number of bug fixes and support for running on top of an Oracle RDBMS. There's a new release of Anzo coming quite soon, and we're quite excited about some of the current and future development going on for Anzo. To learn more, feel free to join the Open Anzo discussion group, check out the wiki, or download the source or a nightly build. We're also actively looking for like-minded folk to work with us to enhance and improve Anzo and to expand the scope of the project. Let me know if you might be interested in sponsoring, using, or contributing to Anzo.

I'll have a lot more to share about our team, our vision, and our software in the coming weeks and months. It's an exciting time, both for me personally, but more so for the promise of the Semantic Web and Semantic Web technologies. I'm glad to be blogging once more, and look forward to having more to say.

April 22, 2007

QotD: Word Choice

Danny picked up an interesting take on the foes of the Semantic Web from Morten Frederiksen. I was surfing that way today and noticed this gem in the latest comment from Keith Alexander:

Perhaps the word that causes the trouble isn’t Semantic, but The?

I believe in an ultimate goal similar to that of Tim Berners-Lee and also that of the Linking Open Data SWEO community project. But I also see tremendous value in the adoption of Semantic Web technologies within enterprise applications and in limited, narrowly-scoped corners of the Internet and intranets. To me, it's clear that these goals are not incompatible with each other. But I do find myself constantly juggling the appropriate use of the phrases the Semantic Web and Semantic Web technologies depending on my audience. There's a lot of significance and (dare I say?) semantics in that innocent-looking three-letter word...

February 7, 2007

Updates to sparql.js

I'm not sure if anyone is using Elias and my sparql.js JavaScript library for issuing SPARQL queries. (Probably not, given its Firefox-and-friends-only orientation and the standard cross-site XMLHttpRequest security restrictions.) Since I first blogged about the library last year, we've made a few changes. Most notably, we've removed the dependency on the Yahoo! connection manager (or on any other third-party libraries, for that matter). Additionally, we've added a setRequestHeader method which passes the given headers and values along to the underlying HTTP request object. We use this functionality, for example, to provide user credentials (via HTTP Basic Auth) when SPARQLing against a Boca server.

The update should be transparent to any current uses of the library. Please let me know if you try it out and experience any problems.

January 19, 2007

Announcing: Boca 1.8 - new database support

While I've been writing dense treatises on Semantic Web development, Matt's been hard at work on the latest release of Boca. Matt's announcement of Boca 1.8 carries all the details as well as a look at what Boca 2.0 will bring. Amidst the usual slew of bug fixes, usability improvements, and performance fixes, the major addition to Boca is support for three new databases beyond DB2. Boca now also runs on MySQL, PostgreSQL, and HSQLDB. Cool stuff.

In other Semantic Layered Research Platform news, we're working towards pushing out stable releases (with documentation and installation packaging) of two more of our components: Queso (Atom-driven Web interface to Boca) and DDR (binary data repository with metadata-extractor infrastructure to store metadata within Boca). We're hoping to get these out by the middle of February, so stay tuned.

January 18, 2007

Using RDF on the Web: A Vision

(This is the second part of two posts about using RDF on the Web. The first post was a survey of approaches for creating RDF-data-driven Web applications.) All existing implementations referred to in this post are discussed in more detail and linked to in part one.

Here's what I would like to see, along with some thoughts on what is or is not implemented. It's by no means a complete solution and there are plenty of unanswered questions. I'd also never claim that it's the right solution for all or most applications. But I think it has a certain elegance and power that would make developing certain types of Web applications straightforward, quick, and enjoyable. Whenever I refer to "the application" or "the app", I'm talking about a browser-based Web application implemented in JavaScript.

  • To begin with, I imagine servers around the Web storing domain-specific RDF data. This could be actual, materialized RDF data or virtual RDF views of underlying data in other formats. This first piece of the vision is, of course, widely implemented (e.g. Jena, Sesame, Boca, Oracle, Virtuoso, etc.)

  • The application fetches RDF from such a server. This may be done in a variety of ways:

    • An HTTP GET request for a particular RDF/XML or Turtle document
    • An HTTP GET request for a particular named graph within a quad store (a la Boca or Sesame)
    • A SPARQL CONSTRUCT query extracting and transforming the pieces of the domain-specific data that are most relevant to the application
    • A SPARQL DESCRIBE query requesting RDF about a particular resource (URI)

    In my mind, the CONSTRUCT approach is the most appealing method here: it allows the application to massage data which it may be receiving from multiple data sources into a single domain-specific RDF model that can be as close as possible to the application's own view of the world. In other words, reading the RDF via a query effectively allows the application to define its own API.

    Once again, the software for this step already exists via traditional Web servers and SPARQL protocol endpoints.

  • Second, the application must parse the RDF into a client-side model. Precisely how this is done depends on the form taken by the RDF received from the server:

    • The server returns RDF/XML. In this case, the client can use Jim Ley's parser to end up with a list of triples representing the RDF graph. The software to do this is already implemented.
    • The server returns Turtle. In this case, the client can use Masahide Kanzaki's parser to end up with a list of triples representing the RDF graph. The software to do this is already implemented.
    • The server returns RDF/JSON. In this case, the client can use Douglas Crockford's JSON parsing library (effectively a regular-expression security check followed by a call to eval(...)). While the software is implemented here, the RDF/JSON standard which I've cavalierly tossed about so far does not yet exist. Here, I'm imagining a specification which defines RDF/JSON based on the common JavaScript data structure used by the above two parsers. (A bit of work probably still needs to be done if this were to become a full RDF/JSON specification, as I do not believe the current format used by the two parsers can distinguish blank node subjects from subjects with URIs.)

    In any case, we now have on the client a simple RDF graph of data specific to the domain of our application. Yet as I've said before, we'd like to make application development easier by moving away from triples at this point into data structures which more closely represent the concepts being manipulated by the application.

  • The next step, then, is to map the RDF model into an application-friendly JavaScript object model. If I understand ActiveRDF correctly (and in all fairness I've only had the chance to play with it a very limited amount), it will examine either the ontological statements or instance data within an RDF model and will generate a Ruby class hierarchy accordingly. The introduction to ActiveRDF explains the dirty-but-well-appreciated trick that is used: "Just use the part of the URI behind the last ”/” or ”#” and Active RDF will figure out what property you mean on its own." Of course, sometimes there will be ambiguities, clashes, or properties written to which did not already exist (with full URIs) in the instance data received; in these cases, manual intervention will be necessary. But I'd suggest that in many, many cases, applying this sort of best-effort heuristic to a domain-specific RDF model (especially one which the application has selected especially via a CONSTRUCT query) will result in extremely natural object hierarchies.

    None of this piece is implemented at all. I'd imagine that it would not be too difficult, following the model set forth by the ActiveRDF folks.

    Late-breaking news: Niklas Lindström, developer of the Python RDF ORM system Oort followed up on my last post and said (among other interesting things):

    I use an approach of "removing dimensions": namespaces, I18N (optionally), RDF-specific distinctions (collections vs. multiple properties) and other forms of graph traversing.

    Sounds like there would be some more simplification processes that could be adapted from Oort in addition to those adapted from ActiveRDF.

  • The main logic of the Web application (and the work of the application developer) goes here. The developer receives a domain model and can render it and attach logic to it in any way he or she sees fit. Often this will be via a traditional model-view-controller approach: this approach is facilitated by toolkits such as dojo or even via a system such as nike templates (nee microtemplates). Thus, the software to enable this meat-and-potatoes part of application development already exists.

    In the course of the user interacting with the application, certain data values change, new data values are added, and/or some data items are deleted. The application controller handles these mutations via the domain-specific object structures, without regards to any RDF model.

  • When it comes time to commit the changes (this could happen as changes occur or once the user saves/commits his or her work), standard JavaScript (i.e. a reusable library, rather than application-specific code) recognizes what has changed and maps (inverts) the objects back to the RDF model (as before, represented as arrays of triples). This inversion is probably performed by the same library that automatically generated the object structure from the RDF model in the first place. As with that piece of this puzzle, this library does not yet exist.

    Reversing the RDF ORM mapping is clearly challenging, especially when new data is added which has not been previously seen by the library. In some cases--perhaps even in most?--the application will need to provide hints to the library to help the inversion. I imagine that the system probably needs to keep an untouched deep copy of the original domain objects to allow it to find new, removed, and dirty data at this point. (An alternative would be requiring adds, deletes, and mutations to be performed via methods, but this constrains the natural use of the domain objects.)

  • Next, we determine the RDF difference between our original model and our updated model. The canonical work on RDF deltas is a design note by Tim Berners-Lee and Dan Connolly. Basically, though, an RDF diff amounts simply to a collection of triples to remove and a collection of triples to add to a graph. No (JavaScript) code yet exists to calculate RDF graph diffs, though the algorithms are widely implemented in other environments including cwm, rdf-utils, and SemVersion. We also work often with RDF diffs in Boca (when the Boca client replicates changes to a Boca server). I'd hope that this implementation experience would translate easily to a JavaScript implementation.

  • Finally, we serialize the RDF diffs and send them back to the data source. This requires two components that are not yet well-defined:

    • A serialization format for the RDF diffs. Tim and Dan's note uses the ability to quote graphs within N3 combined with a handful of predicates (diff:replacement, diff:deletion, and diff:insertion). I can also imagine a simple extension of (whatever ends up being) the RDF/JSON format to specify the triples to remove and add:
          {
            'add' : [ RDF/JSON triple structures go here ],
            'remove' : [ RDF/JSON triple structures go here ]
          }
    • An endpoint or protocol which accepts this RDF diff serialization. Once we've expressed the changes to our source data, of course, we need somewhere to send them. Preferably, there would be a standard protocol (à la the SPARQL Protocol) for sending these changes to a server. To my knowledge, endpoints that accept RDF diffs to update RDF data are not currently implemented. (Late-breaking addition: on my first post, Chris and Richard both pointed me to Mark Baker's work on RDF forms. While I'm not very familiar with any existing uses of this work, it looks like it might be an interesting way to describe the capabilities of an RDF update endpoint.)

    As an alternative for this step, the entire client-side RDF model could be serialized (to RDF/XML or to N-Triples or to RDF/JSON) and HTTP PUT back to an origin server. This strategy seems to make the most sense in a document-oriented system; to my knowledge this is also not currently implemented.
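The diff computation described above can be sketched over the simple triple-array representation used throughout this post. This is a minimal sketch: triples are compared by value, and blank nodes are ignored (a real implementation would need to address them, as discussed below):

```javascript
// Compute an RDF diff between two triple arrays using the simple
// {subject, predicate, object} structure. A minimal sketch: triples
// are compared by value, and blank nodes are not handled.
function tripleKey(t) {
  return JSON.stringify([t.subject, t.predicate, t.object]);
}
function rdfDiff(oldTriples, newTriples) {
  const oldKeys = new Set(oldTriples.map(tripleKey));
  const newKeys = new Set(newTriples.map(tripleKey));
  return {
    // Triples to add: present in the new model but not the old.
    add: newTriples.filter(t => !oldKeys.has(tripleKey(t))),
    // Triples to remove: present in the old model but not the new.
    remove: oldTriples.filter(t => !newKeys.has(tripleKey(t)))
  };
}
```

The resulting object mirrors the 'add'/'remove' shape of the diff serialization sketched above, ready to be sent to an update endpoint.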

That's my vision, as raw and underdeveloped as it may be. There are a large number of extensions, challenges and related work that I have not yet mentioned, but which will need to be addressed when creating or working with this type of Web application. Some discussion of these is also in order.

Handling Multiple Sources of Data

To use the above Web-application-development environment to create Web 2.0-style mash-ups, most of the steps would need to be performed once per data source being integrated. This adds to the system a provenance requirement, whereby the libraries could offer the application a unified view of the domain-specific data while still maintaining links between individual data elements and their source graphs/servers/endpoints to facilitate update. When the RDF diffs are computed, they would need to be sent back to the proper origins. Also, the sample JavaScript structures that I've mentioned as a base for RDF/JSON and the RDF/JSON diff serialization would likely need to be augmented with a URI identifying the source graph of each triple. (That is, we'd end up working with a quad system, though we'd probably be able to ignore that in the object hierarchy that the application deals with.) In many cases, though, an application that reads from many data sources will write only to a single source; it does not seem particularly onerous for the application to specify a default "write-back" endpoint.
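As a sketch of that provenance tracking, each triple can carry a source graph URI (effectively making it a quad), and changed triples can be grouped by origin before the diffs are sent back. All names here are illustrative:

```javascript
// Tag each triple with the graph/endpoint it came from, turning
// triples into (implicit) quads. Names are illustrative.
function tagWithSource(triples, sourceGraph) {
  return triples.map(t => ({ ...t, source: sourceGraph }));
}

// Group changed triples by origin so that each diff can be routed
// back to the proper write-back endpoint.
function groupBySource(changedTriples) {
  const bySource = new Map();
  for (const t of changedTriples) {
    if (!bySource.has(t.source)) bySource.set(t.source, []);
    bySource.get(t.source).push(t);
  }
  return bySource;
}
```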

Inverting SPARQL CONSTRUCT Queries

An appealing part of the above system (to me, at least) is the use of CONSTRUCT queries to map origin data to a common RDF model before merging it on the client and then mapping it into a domain-specific JavaScript object structure. Such transformations, however, would make it quite difficult--if not impossible--to automatically send the proper updates back to the origin servers. We'd need a way of inverting the CONSTRUCT query which generated the triples the application has (indirectly) worked with, and while I have not given it much thought, I suspect that such an inversion is rarely feasible.


Updating Data via SPARQL

The DAWG has postponed any work on updating graphs for the initial version of SPARQL, but Max Völkel and Richard Cyganiak have started a bit of discussion on what update in SPARQL might look like (though Richard has apparently soured on the idea a bit since then). At first blush, using SPARQL to update data seems like a natural counterpart to using SPARQL to retrieve the data. However, in the vision I describe above, the application would likely need to craft a corresponding SPARQL update query for each SPARQL CONSTRUCT query that is used to retrieve the data in the first place. This would place a larger burden on the application developer, so it should probably be avoided.

Related Work

I wanted to acknowledge that in several ways this whole pattern is closely related to but (in some mindset, at least) the inverse of a paradigm that Danny Ayers has floated in the past. Danny has suggested using SPARQL CONSTRUCT queries to transition from domain-specific models to domain-independent models (for example, a reporting model). Data from various sources (and disparate domains) can be merged at the domain-independent level and then (perhaps via XSLT) used to generate Web pages summarizing and analyzing the data in question. In my thoughts above, we're also using the CONSTRUCT queries to generate an agreed-upon model, but in this case we're seeking an extremely domain-specific model to make it easier for the Web-application developer to deal with RDF data (and related data from multiple sources).

Danny also wrote some related material to www-archive. It's not the same vision, but parts of it sound familiar.

Other Caveats

Updating data has security implications, of course. I haven't even begun to think about them.

Blank nodes complicate almost everything; this may be sacrilege in some circles, but in most cases I'm willing to pretend that blank nodes don't exist for my data-integration needs. Incorporating blank nodes makes the RDF/JSON structures (slightly) more complicated; it raises the question of smushing together nodes when joining various models; and it significantly complicates the process of specifying which triples to remove when serializing the RDF diffs. I'd guess that it's all doable using functional and inverse-functional properties and/or with told bnodes, but it probably requires more help from the application developer.

I have some worries about concurrency issues for update. Again, I haven't thought about that much and I know that the Queso guys have already tackled some of those problems (as have many, many other people I'm sure), so I'm willing to assert that these issues could be overcome.

In many rich-client applications, data is retrieved incrementally in response to user-initiated actions. I don't think that this presents a problem for the above scheme, but we'd need to ensure that newly arriving data could be seamlessly incorporated not only into the RDF models but also into the object hierarchies that the application works with.

Bill de hÓra raised some questions about the feasibility of roundtripping RDF data with HTML forms a while back. There's some interesting conversation in the comments there which ties into what I've written here. That said, I don't think the problems he illustrates apply here--there's power above and beyond HTML forms in putting an extra JavaScript-based layer of code between the data entry interface (whether it be an HTML form or a more specialized Web UI) and the data update endpoint(s).

OK, that's more than enough for now. These are ideas clearly still in progress, and none of them is particularly new. That said, the environment as I envision it doesn't exist, and I suppose I'm claiming that if it did exist it would demonstrate some of the utility of Semantic Web technologies via the ease of development of data- and integration-driven Web applications. As always, I'd enjoy feedback on these thoughts and also any pointers to work I might not know about.

January 16, 2007

Using RDF on the Web: A Survey

(This is part one of two posts exploring building read-write Web applications using RDF. Part two will follow, shortly. Update: Part two is now available, also.)

The Web permeates our world today. Far more than static Web sites, the Web has come to be dominated by Web applications--useful software that runs inside a Web browser and on a server. And the latest trend in Web applications, Web 2.0, encourages--among other things--highly interactive Web sites with rich user interfaces featuring content from various sources around the Web integrated within the browser.

Many of us who have drunk deeply from the Semantic Web Kool-Aid are excited about the potential of RDF, SPARQL, and OWL to provide flexible data modeling, easier data integration, and networked data access and query. It's no coincidence that people often refer to the Semantic Web as a web of data. And so it seems to me that RDF and friends should be well-equipped to make the task of generating new and more powerful Web mash-ups simple, elegant, and enjoyable. Yet while there are a great number of projects using Semantic Web technologies to create Web applications, there doesn't seem to have emerged any end-to-end solution for creating browser-based read-write applications using RDF that focuses on data integration and ease of development.

Following a discussion on this topic at work the other day, I decided to do a brief survey of what approaches do already exist for creating RDF-based Web applications. I want to give a brief overview of several options, assess how they fit together, and then outline a vision for some missing pieces that I feel might greatly empower Web developers working with Semantic Web technologies.

First, a bit on what I'm looking for. I want to be able to quickly develop data-driven Web applications that read from and write back to RDF data sources. I'd like to exploit standard protocols and interfaces as much as possible, and limit the amount of domain-specific code that needs to be written. I'd like the infrastructure to make it as easy as possible for the application developer to retrieve data, integrate the data, and work with it in a convenient and familiar format. That is, in the end, I'm probably looking for a system that allows the developer to work with a model of simple, domain-specific JavaScript object hierarchies.

In any case, here's the survey. I've tried to include most of the systems I know of which involve RDF data on the Web, even those which are not necessarily appropriate for creating generalized RDF-based Web apps. I'll follow-up with a vision of what could be in my next post.

Semantic Mediawiki

This is an example of a terrific project which is not what I'm looking for here. Semantic Mediawiki provides wiki markup that captures the knowledge contained within a wiki as RDF which can then be exported or queried. While an installation of Semantic Mediawiki will allow me to read and write RDF data via the Web, I am constrained within the wiki framework; further, the interface to reading and writing the RDF is markup-based rather than programmatic.

The Semantic Bank API

The SIMILE project provides an HTTP POST API for publishing and persisting RDF data found on local Web pages to a server-side bank (i.e. storage). They also provide a JavaScript library (BSD license) which wraps this API. While this API supports writing a particular type of RDF data to a store, it does not deal with reading arbitrary RDF from across the Web. The API also seems to require uploaded data to be serialized as RDF/XML before being sent to a Semantic Bank. This does not seem to be what I'm looking for to create RDF-based Web applications.

The Tabulator RDF parser and API

MIT student David Sheets created a JavaScript RDF/XML parser (W3C license). It is fully compliant with the RDF/XML specification, and as such is a great choice for any Web application which needs to gather and parse arbitrary RDF models expressed in RDF/XML. The Tabulator RDF parser populates an RDFStore object. By default, it populates an RDFIndexedFormula store, which inherits from the simpler RDFFormula store. These are rather sophisticated stores which perform (some) bnode and inverse-functional-property smushing and maintain multiple triple indexes keyed on subjects, predicates, and objects.

Clearly, this is an excellent API for developers wishing to work with the full RDF model; naturally, it is the appropriate choice for an application like the Tabulator which at its core is an application that eats, breathes, and dreams RDF data. As such, however, the model is very generic and there is no (obvious, simple) way to translate it into a domain-specific, non-RDF model to drive domain-specific Web applications. Also, the parser and store libraries are read-only: there is no capability to serialize models back to RDF/XML (or any other format) and no capability to store changes back to the source of the data.

(Thanks to Dave Brondsema for an excellent example of using the Tabulator RDF parser which clarified where the existing implementations of the RDFStore interface can be found.)

Jim Ley's JavaScript RDF parser

Jim Ley created perhaps the first JavaScript library for parsing and working with RDF data from JavaScript within a Web browser. Jim's parser (BSD license) handles most RDF/XML serializations and returns a simple JavaScript object which wraps an array of triples and provides methods to find triples by matching subjects, predicates, and objects (any or all of which can be wildcards). Each triple is a simple JavaScript object with the following structure:

  {
    subject: ...,
    predicate: ...,
    object: ...,
    type: ...,
    lang: ...,
    datatype: ...
  }

The type attribute can be either literal or resource, and blank nodes are represented as resources of the form genid:NNNN. This structure is a simple and straightforward representation of the RDF model. It could be relatively easily mapped into an object graph, and from there into a domain-specific object structure. The simplicity of the triple structure makes it a reasonable choice for a potential RDF/JSON serialization. More on this later.
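A minimal sketch of how a flat triple list in this shape could be folded toward a domain-specific object. The toObject helper and the foaf-style predicate names are my own illustration, not part of Jim's library:

```javascript
// Triples in the shape Jim Ley's parser returns (abbreviated; lang and
// datatype omitted). The subject and predicates here are made up.
var triples = [
  { subject: "genid:1001", predicate: "foaf:name", object: "Lee", type: "literal" },
  { subject: "genid:1001", predicate: "foaf:mbox", object: "mailto:lee@example.org", type: "resource" }
];

// Hypothetical helper: gather all (predicate, object) pairs for one subject
// into a single property map -- a first step toward a domain-specific object.
function toObject(triples, subject) {
  var obj = {};
  for (var i = 0; i < triples.length; i++) {
    if (triples[i].subject === subject) {
      obj[triples[i].predicate] = triples[i].object;
    }
  }
  return obj;
}

var lee = toObject(triples, "genid:1001");
// lee["foaf:name"] is "Lee"
```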

Jim's parser also provides a simple method to serialize the JavaScript RDF model to N-Triples, though that's the closest it comes to providing support for updating source data with a changed RDF graph.

Masahide Kanzaki's Javascript Turtle parser

In early 2006, Masahide Kanzaki wrote a JavaScript library for parsing RDF models expressed in Turtle. This parser is licensed under the terms of the GPL 2.0 and can parse into two different formats. One of these formats is a simple list of triples, (intentionally) identical to the object structure generated by Jim Ley's RDF/XML parser. The other format is a JSON representation of the Turtle document itself. This format is appealing because a nested Turtle snippet such as:

@prefix : <> .

:lee :address [ :city "Cambridge" ; :state "MA" ] .

translates to this JavaScript object:

  {
    "@prefix": "<>",
    "address": {
      "city": "Cambridge",
      "state": "MA"
    }
  }
While this format loses the URI of the root resource (:lee in this example), it provides a nicely nested object structure which could be manipulated easily with JavaScript such as:

  var lee = turtle.parse_to_json(jsonStr);
  var myState = lee.address.state; // this is easy and domain-specific - yay!

Of course, things get more complicated with non-empty namespace prefixes: the properties become names like ex:name, which can't be accessed using the obj.prop syntax and instead need the obj["ex:name"] syntax. This method of parsing also does not handle Turtle files with more than a single root resource well. And an application that used this method and wanted to get at full URIs (rather than the namespace-prefix artifacts of the Turtle syntax) would have to parse and resolve the namespace prefixes itself. Still, this begins to give ideas on how we'd most like to work with our RDF data in the end within our Web app.
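An application could, for instance, expand the prefixed property names itself, given a prefix map it has assembled from the Turtle @prefix declarations. This expandKeys helper is a hypothetical sketch, not part of Kanzaki's library:

```javascript
// Hypothetical helper: rewrite "ex:name"-style keys to full URIs using a
// prefix map built (by the application) from the @prefix declarations.
function expandKeys(obj, prefixes) {
  var out = {};
  for (var key in obj) {
    var m = key.match(/^(\w*):(.+)$/);
    if (m && prefixes[m[1]] !== undefined) {
      out[prefixes[m[1]] + m[2]] = obj[key];
    } else {
      out[key] = obj[key]; // no prefix, or unknown prefix: copy through
    }
  }
  return out;
}

var expanded = expandKeys({ "ex:name": "Lee" }, { ex: "http://example.org/" });
// expanded["http://example.org/name"] is "Lee"
```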

Masahide Kanzaki also provides a companion library which serializes an array of triples back to Turtle. As with Jim Ley's parser, this may be a first step in writing changes to the RDF back to the data's original store; such an approach requires an endpoint which accepts PUT or POSTed RDF data (in either N-Triples or Turtle syntax).

SPARQL + SPARQL/JSON + sparql.js

The DAWG published a Working Group Note specifying how the results of a SPARQL SELECT or ASK query can be serialized within JSON. Elias and I have also written a JavaScript library (MIT license) to issue SPARQL queries against a remote server and receive the results as JSON. By default, the JavaScript objects produced from the library match exactly the SPARQL results in JSON specification:

  {
    "head": { "vars": [ "book" , "title" ] } ,
    "results": {
      "distinct": false , "ordered": false ,
      "bindings": [
        {
          "book": { "type": "uri" , "value": "" } ,
          "title": { "type": "literal" , "value": "Harry Potter and the Half-Blood Prince" }
        }
      ]
    }
  }

The library also provides a number of convenience methods which issue SPARQL queries and return the results in less verbose structures: selectValues returns an array of literal values for queries selecting a single variable; selectSingleValue returns a single literal value for queries selecting a single variable which expect to receive a single row; and selectValueArrays returns a hash relating each of the query's variables to an array of values for that variable. I've used these convenience methods in the SPARQL calendar and SPARQL antibodies demos and found them quite easy to use for SPARQL queries returning small amounts of data.
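To illustrate, here is roughly what a selectValues-style convenience method amounts to when applied to the standard SPARQL JSON results structure. This is a sketch of the idea, not sparql.js's actual implementation:

```javascript
// Extract the plain values bound to one variable from the standard
// SPARQL JSON results structure (head/results/bindings).
function selectValues(results, varName) {
  var bindings = results.results.bindings;
  var values = [];
  for (var i = 0; i < bindings.length; i++) {
    if (bindings[i][varName]) {
      values.push(bindings[i][varName].value);
    }
  }
  return values;
}

var json = {
  head: { vars: ["title"] },
  results: { bindings: [
    { title: { type: "literal", value: "Harry Potter and the Half-Blood Prince" } }
  ] }
};
var titles = selectValues(json, "title");
// titles is ["Harry Potter and the Half-Blood Prince"]
```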

Note, however, that this method does not actually work with RDF on the client side. Because it is designed for SELECT (or ASK) queries, the Web application developer ends up working with lists of values in the application (more generally, a table or result set structure). Richard Cyganiak has suggested serializing entire RDF graphs using this method by using the query SELECT ?s ?p ?o WHERE { ?s ?p ?o } and treating the three-column result set as an RDF/JSON serialization. This is a clever idea, but results in a somewhat unwieldy JavaScript object representing a list of triples: if a list of triples is my goal, I'd rather use the Jim Ley simple object format. But in general, I'd rather have my RDF in a form where I can easily traverse the graph's relationships without worrying about subjects, predicates, and objects.
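A converter for Richard Cyganiak's suggestion might look like this; the code is hypothetical, mapping the s/p/o result columns onto Jim Ley-style triple objects:

```javascript
// Hypothetical converter: turn the result set of
// SELECT ?s ?p ?o WHERE { ?s ?p ?o } into Jim Ley-style triple objects.
function bindingsToTriples(results) {
  var triples = [];
  var bindings = results.results.bindings;
  for (var i = 0; i < bindings.length; i++) {
    var b = bindings[i];
    triples.push({
      subject: b.s.value,
      predicate: b.p.value,
      object: b.o.value,
      // SPARQL results say "uri"/"literal"; Jim Ley's format says
      // "resource"/"literal".
      type: b.o.type === "literal" ? "literal" : "resource"
    });
  }
  return triples;
}

var triples = bindingsToTriples({
  results: { bindings: [
    { s: { type: "uri", value: "http://example.org/lee" },
      p: { type: "uri", value: "http://example.org/name" },
      o: { type: "literal", value: "Lee" } }
  ] }
});
// triples[0].object is "Lee"
```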

Additionally, the SPARQL SELECT query approach is a read-only approach. There is no current way to modify values returned from a SPARQL query and send the modified values (along with the query) back to an endpoint to change the underlying RDF graph(s).


ARC + JSONC + JSONI + JSONP

Benjamin Nowack implemented the SPARQL JSON results format in ARC (W3C license), and then went a bit further. He proposes three additions/modifications to the standard SPARQL JSON results format which save bandwidth, produce more directly usable structures, and allow a SPARQL endpoint to be instructed to return JavaScript above and beyond the results object itself.

  • JSONC: Benjamin suggests an additional jsonc parameter to a SPARQL endpoint; the value of this parameter instructs the server to flatten certain variables in the result set. The result structure contains only the string value of the flattened variables, rather than a full structure containing type, language, and datatype information.
  • JSONI: JSONI is another parameter to the SPARQL endpoint which instructs the server to return certain selected variables nested within others. Effectively, this allows certain variables within the result set to be indexed based on the values of other variables. This results in more naturally nested structures which can be more closely aligned with domain-specific models and hence more directly useful by JavaScript application developers.
  • JSONP: JSONP is one solution to the problem of cross-domain XMLHttpRequest security restrictions. The jsonp parameter to a SPARQL server specifies a function name in which the resulting JSON object will be wrapped in the returned value. This allows the SPARQL endpoint to be used via a <script src="..."></script> invocation, which avoids the cross-domain limitation.
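The mechanics look something like this. This is a sketch: the endpoint URL is made up, and the parameter names ("query", "jsonp") vary from server to server:

```javascript
// Build a query URL that names a callback function. Parameter names are
// illustrative; different endpoints spell them differently.
function jsonpUrl(endpoint, query, callbackName) {
  return endpoint +
    "?query=" + encodeURIComponent(query) +
    "&jsonp=" + encodeURIComponent(callbackName);
}

// Load the URL via a script element, sidestepping the cross-domain
// XMLHttpRequest restriction. The server responds with JavaScript like:
//   handleResults({ "head": { ... }, "results": { ... } });
function loadViaScriptTag(url) {
  var script = document.createElement("script");
  script.src = url;
  document.getElementsByTagName("head")[0].appendChild(script);
}
```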

The first two methods here are similar to what sparql.js provides on the client side for transforming the SPARQL JSON results format. By implementing them on the server, JSONC and JSONI can save significant bandwidth when returning large result sets. However, in most cases bandwidth concerns can be alleviated by sending gzip'ed content, and performing the transforms on the client allows for a much wider range of possible transformations (and no burden on SPARQL endpoints to support various transformations for interoperability). As far as I know, ARC is currently the only SPARQL endpoint that implements JSONC and JSONI.

JSONP is a reasonable solution in some cases to solving the cross-domain XMLHttpRequest problem. I believe that other SPARQL endpoints (Joseki, for instance) implement a similar option via an HTTP parameter named callback. Unfortunately, this method often breaks down with moderate-length SPARQL queries: these queries can generate HTTP query strings which are longer than either the browser (which parses the script element) or the server is willing to handle.


Queso

Queso is the Web application framework component of the IBM Semantic Layered Research Platform. It uses the Atom Publishing Protocol to allow a browser-based Web application to read and write RDF data from a server. RDF data is generated about all Atom entries and collections that are PUT or POSTed to the server using the Atom OWL ontology. In addition, the content of Atom entries can contain RDF as either RDF/XML or as XHTML marked up with RDFa; the Queso server extracts the RDF from this content and makes it available to SPARQL querying and to other (non-Web) applications.

By using the Atom Publishing Protocol, an application working against a Queso server can both read and write RDF data from that Queso server. While Queso does contain JavaScript libraries to parse the Atom XML format into usable JavaScript objects, libraries do not yet exist to extract RDF data from the content of the Atom entries. Nor do libraries exist yet that can take RDF represented in JavaScript (perhaps in the Jim Ley fashion) and serialize it to RDF/XML in the content of an Atom entry. Current work with Queso has focused on rendering RDFa snippets via standard HTML DOM manipulation, but has not yet dealt with the actual RDF data itself. In this way, Queso is an interesting application paradigm for working with RDF data on the Web, but it does not yet provide a way to work easily with domain-specific data within a browser-based development environment.

(Before Ben, Elias, and Wing come after me with flaming torches, I should add that Queso is still very much evolving: we hope that the lessons we learn from this survey and discussion about a vision of RDF-based Web apps (in my next post) will help guide us as Queso continues to mature.)

RPC / RESTful API / the traditional approach

I debated whether to put this on here and decided the survey was incomplete without it. This is the paradigm that is probably most widely used and is extremely familiar. A server component interacts with one or more RDF stores and returns domain-specific structures (usually serialized as XML or JSON) to the JavaScript client in response to domain-specific API calls. This is the approach taken by an ActiveRDF application, for instance. There are plenty of examples of this style of Web application paradigm: one which we've been discussing recently is the Boca Admin client, a Web app that Rouben is working on to help administer Boca servers.

This is a straightforward, well-understood approach to creating well-defined, scalable, and service-oriented Web applications. Yet it falls short in my evaluation in this survey because it requires a server and client to agree on a domain-specific model. This means that my client-side code cannot integrate data from multiple endpoints across the Web unless those endpoints also agree on the domain model (or unless I write client code to parse and interpret the models returned by every endpoint I'm interested in). Of course, this method also requires the maintenance of both server-side and client-side application code, two sets of code with often radically different development needs.

This is still often a preferred approach to creating Web applications. But it's not really what I'm thinking of when I contemplate the power of driving Web apps with RDF data, and so I'm not going to discuss it further here.

That's what I've got in my survey right now. I welcome any suggestions for things that I'm missing. In my next post, I'm going to outline a vision of what I see a developer-friendly RDF-based Web application environment looking like. I'll also discuss what pieces are already implemented (mainly using systems discussed in this survey) and which are not yet implemented. There'll also be many open questions raised, I'm sure. (Update: Part two is now available, also.)

(I didn't examine which of these approaches provide support for simple inferencing of the owl:sameAs and rdfs:subPropertyOf flavor, though that would be useful to know.)

January 9, 2007

Who loves RDF/XML?

I wrote the following as a comment on Seth's latest post about RDF/XML syntax, but the blog engine asked me to add two unspecified numbers, and I had a great deal of difficulty doing that correctly. So instead, it will live here, and I'd love to learn answers to this question from Seth or anyone else who might have any answers. Quoting myself:

Hi Seth,

This is a completely serious question: Who are these people who are insisting on RDF/XML as the/a core of the semantic web? Where can I meet them? Or have I met them and not realized it? Or are they mostly straw-men, as part of me suspects?

Inquiring minds -- and SWEO members -- want to know.


December 27, 2006

Playing Fetch with the DAWG

The summary: I was looking for an easy way to search through minutes of the DAWG, given that some but not all of the minutes are reproduced in plain text within a mailing list message. All minutes are (in one way or another) URL accessible, however, so I set up Apache Nutch to crawl, index, and search the minutes. I learned stuff along the way, and that's what the rest of this post shares.

One of the first things I'm doing as I'm getting up to speed in my new role as DAWG chair is finding the issues the DAWG has not yet resolved and determining whether we're on target to address the issues. One of the issues raised a few months ago was the syntactical order of the LIMIT and OFFSET keywords within queries. I had remembered that the group had reached a decision about this issue, but did not remember the details. I wanted to find the minutes which recorded the decision.

I could have searched the mailing list for limit and offset and probably found what I needed by perusing the search results. But not all minutes make it into mailing list messages as something other than links or attachments, and I didn't want to wade through general discussion. I'd rather be able to search the minutes explicitly. So here's what I did:

(I work in a Windows XP environment with a standard Cygwin installation.)

  1. Updated the DAWG homepage, adding links to minutes of the past few months' teleconferences.
  2. Dug up a script I'd written last year to pull links from a Web page where the text of the link matches a certain pattern. Invoked this script with the pattern '\d+\s+?\w{3}' against the URL to pull out all the links to minutes from the Web page. This heuristic approach works well, but it would feel far more elegant to have the markup authoritatively tell me which links were links to minutes. Via RDFa, perhaps. I redirected the list of links produced by this script to a text file, dawg-minutes/root-urls/minutes.
  3. Downloaded the latest version of Apache Nutch and unzipped it, adding a symlink from nutch-install-dir/bin/nutch such that nutch ended up in my path.
  4. Followed instructions #2 and #3 from the Nutch user manual. This involves supplying a name for the user agent with which Nutch crawls the Web and also specifying a URL filter that decides which pages to crawl (or which pages not to crawl). To be on the safe side, I added two filter lines to nutch-install-dir/conf/crawl-urlfilter.txt.
  5. The next step was to crawl the list of links I had already generated. I didn't want to follow any other links from these URLs, so this was a pretty simple invocation of Nutch. I did get trapped for a bit by the fact that earlier versions of Nutch required the command-line argument to be a text file with the list of URLs while the current version requires the argument to be the directory containing lists of links. I ended up invoking nutch as:
      cd dawg-minutes ; nutch crawl root-urls -dir nutch/ -depth 1
    This fetched, crawled, and indexed the set of DAWG minutes (but no other links thanks to the -depth 1) and stored the resulting data structures within the nutch subdirectory.
  6. At this point, I had (still unresolved) trouble getting the command-line search tool to work:
      nutch org.apache.nutch.searcher.NutchBean apache
    Regardless of the working directory from which I executed this, I always received Total hits: 0. This problem led me to discover Luke, the Lucene Index Toolbox, which confirmed for me that my indexes had been properly created and populated.
  7. I pressed ahead with getting Nutch's Web interface set up. I already had an installation of Apache Tomcat 5.5, so no installation was needed there. Instead, I copied the file nutch-install-dir/nutch-version.war to nutch.war at the root of my Tomcat webapps directory.
  8. I started Tomcat from the dawg-minutes/nutch directory (where Nutch had put all of its indexes and other data structures), and launched a Web browser to http://localhost:5000/nutch. (The default Tomcat install runs on port 8080, I believe; I have too many programs clamoring for my port 8080.)
  9. The Nutch search interface appeared, but again any searches that I performed led to no hits being returned!
  10. Some Web searching led me to a mailing-list message which suggested investigating the searcher.dir property in webapps/nutch/WEB-INF/classes/nutch-site.xml. I added this property with a value of c:/documents and settings/.../dawg-minutes/nutch and restarted Tomcat.
  11. All's well that ends well.

So I ran into a few speed bumps, but in the end I've got a relatively lightweight system for indexing and searching DAWG minutes. Hooray!

Searching the DAWG minutes with Apache Nutch

December 14, 2006

ODO: Semantic Web libraries in Perl

My colleague Stephen Evanchik has announced the release of ODO, part of the IBM Semantic Layered Research Platform:

ODO is an acronym for "Ontologies, Databases, and Optimizations," which are the three items I was most interested in experimenting with at the time. They were also the three categories of functionality I couldn't find in the existing Perl RDF libraries. ODO is still evolving and I have some more features to push out but right now it supports:

  • Nodes, statements and graph backed by memory
  • RDFS and OWL-Lite to Perl code generators
  • Queries using RDQL with SPARQL on its way
  • RDF/XML and NTriple parsers

The second point on that list is our Perl analog of the Jastor project, which generates Java code for RDF data access from OWL ontologies.

November 29, 2006

Open-sourcing the Semantic Layered Research Platform: A Roadmap

Yesterday, I promised a bit more detail on what components we'll be open-sourcing under the SLRP banner over the coming weeks and months. We've added much of this information to the SLRP homepage, but I wanted to include it here as well.

The Semantic Layered Research Platform (SLRP, pronounced slurp) is the collective name for the family of software components produced by the IBM Advanced Technology group to utilize semantics throughout the application stack. We'll be releasing these components to the open-source community over the next few months as we polish the initial versions of them and prepare supporting materials (examples, how-tos, documentation, ...). This post is a summary of the components that we'll be releasing, with a brief description of each. The list is arranged in a rough approximation of the order in which we think we'll be able to release them, but the order is very much subject to change.

  1. Boca¹. Boca is the foundation of many of our components. It is an enterprise-featured RDF store that provides support for multiple users, distributed clients, offline work, real-time notification, named-graph modularization, versioning, access controls, and transactions with preconditions. Matt's written more about Boca here. Along with Boca come two subsystems which may also be interesting on their own:

    • Glitter. Glitter is a SPARQL engine independent of any particular backend. It allows interfaces to backend data sources to plug in to the core engine and generate solutions for portions of SPARQL queries with varying granularity. The core engine orchestrates query rewriting, optimization, and execution, and composes solutions generated by the backend. A Boca-specific backend allows SPARQL queries to be compiled to Boca's temporal database schema.
    • Sleuth. Sleuth provides full-text search capabilities for text literals within Boca. Text literals are indexed with Apache Lucene, and the index also stores information about the named graph, subject, and predicate to which the literal is attached.
  2. DDR. The Distributed Data Repository (DDR) is the binary counterpart to Boca. It's a write-once, read-many store for binary data. Content within DDR receives an LSID, and a registry of metadata extractors ("scrapers") allows metadata to be pulled from the content and stored into a companion Boca server. DDR contains an LSID resolver that returns the stored binary content for the LSID getData() call and returns the Boca named graph containing the metadata in response to the LSID getMetadata() call.

  3. Queso. Queso is a semantic web-application framework. It stores content (HTML, CSS, JavaScript, etc.), user data, and application data within Boca, and provides mechanisms for deploying modular applications and services that (modulo access control) can remix and reuse service endpoints and semantic data. Ben, Elias, and Wing have already written more about Queso.

  4. ODO. ODO is a family of Perl 5 libraries for parsing, manipulating, persisting, and serializing RDF data. ODO also contains Plastor, a Perl analog of Jastor, which generates Perl classes from an OWL ontology.

  5. Telar. Telar is a family of Java libraries that provide services for creating applications driven by RDF. Some Telar libraries focus on the user interface, supplying bindings between RDF data and SWT widgets, the Eclipse Graphical Editing Framework (GEF), and Eclipse RCP perspectives, editors, and views. Other Telar libraries focus on data management; these libraries provide functionality to manipulate RDF datasets (collections of named graphs) and to perform tasks such as resolving human-readable labels for resources within an RDF graph.

  6. Salsa. Salsa is a Boca application that brings together semantic technologies and spreadsheets. Salsa serializes spreadsheets to a central Boca server, and also uses a transform language to map cells and cell ranges to their RDF semantics. Salsa is an experiment in separating data from layout within spreadsheets, and also in adapting the familiar spreadsheet user-interface paradigm for RDF data.

  7. Taco. Taco is a framework for measuring performance of RDF stores. It includes utilities that build various kinds of RDF graph structures and the ability to add measurable operations, which can include queries, statement adds, and statement removes, for those structures. The performance log is also stored in RDF with a defined ontology and can be easily queried with a report generator.

¹ As you might be able to tell, several of our projects were named at lunch time, as we gazed longingly across the street.

November 28, 2006

Semantic Web Technologies in the Enterprise


Over the past two years, my good friend and coworker Elias Torres has been blogging at an alarming rate about a myriad of technology topics either directly or indirectly related to the Semantic Web: SPARQL, Atom, Javascript, JSON, and blogging, to name a few. Under his encouragement, I began blogging some of my experiences with SPARQL, and two of our other comrades, Ben Szekely and Wing Yung, have also started blogging about semantic technologies. And in September, Elias, Wing and Ben blogged a bit about Queso, our project which combines Semantic Web and Web 2.0 techniques to provide an Atom-powered framework for rapid development and deployment of RDF-backed mash-ups, mash-ins, and other web applications.

But amidst all of this blogging, we've all been working hard at our day jobs at IBM, and we've finally reached a time when we can talk more specifically about the software and infrastructure that we've been creating in recent years and how we feel it fits in with other Semantic Web work. We'll be releasing our components as part of the IBM Semantic Layered Research Platform open-source project on SourceForge over the next few months, and we'll be blogging examples, instructions, and future plans as we go. In fact, we've already started with our initial release of the Boca RDF store, which Wing and Matt have blogged about recently. I'll be posting a roadmap/summary of the other components that we'll be releasing in the coming weeks later today or tomorrow, but first I wanted to talk about our overall vision.


The family of W3C-endorsed Semantic Web technologies (RDF, RDFS, OWL, and SPARQL being the big four) has developed under the watchful eyes of people and organizations with a variety of goals. It's been pushed by content providers (Adobe) and by Web-software organizations (Mozilla), by logicians and by the artificial-intelligence community. More recently, Semantic Web technologies have been embraced for life sciences and government data. And of course, much effort has been put towards the vision of a machine-readable World Wide Web—the Semantic Web as envisioned by Tim Berners-Lee (and as presented to the public in the 2001 Scientific American article by Berners-Lee, Jim Hendler, and Ora Lassila).

Our adtech group at IBM first took note of semantic technologies from the confluence of two trends. First, several years ago we found our work transitioning from the realm of DHTML-based client runtimes on the desktop to an annotation system targeted at life-sciences organizations. As we used XML, SOAP, and DB2 to develop the first version of the annotation system along with IBM Life Sciences, we started becoming familiar with the sheer volume of structured and unstructured, explicit and tacit data that abounds throughout the life sciences industry. Second, it was around the same time that Dennis Quan—a former member of our adtech team—was completing his doctoral degree as he designed, developed, and evangelized Haystack, a user-oriented information manager built on RDF.

Our work continued, and over the next few years we became involved with new communities both inside and outside of IBM. Via a collaboration with Dr. Tom Deisboeck, we became involved with the Center for the Development of a Virtual Tumor (CViT) and developed plans for a semantics-backed workbench which cancer modelers from different laboratories and around the world could use to drive their research and integrate their work with that of other scientists. We met Tim Clark from the MIND Center for Interdisciplinary Informatics and June Kinoshita and Elizabeth Wu of the Alzheimer Research Forum as they were beginning work on what would become the Semantic Web Applications in Neuromedicine (SWAN) project. We helped organize what has become a series of internal IBM semantic-technology summits, such that we've had the opportunity to work with other IBM research teams, including those responsible for the IBM Integrated Ontology Development Toolkit (IODT).

All of which (and more) combines to bring us to where we stand today.

Our Vision

While we support the broad vision of a Semantic World Wide Web, we feel that there are great benefits to be derived from adapting semantic technologies for applications within an enterprise. In particular, we believe that RDF has several very appealing properties that position it as a data format of choice to provide a flexible information bus across heterogeneous applications and throughout the infrastructure layers of an application stack.

Name everything with URIs

When we model the world with RDF, everything that we model gets a URI. And the attributes of everything that we model get URIs. And all the relationships between the things that we model get URIs. And the datatypes of all the simple values in our models get URIs.

URIs enable selective and purposeful reuse of concepts. When I'm creating tables in MySQL and name a table album, my table will share a name with thousands of other database tables in other databases. If I have software that operates against that album table, there's no way for me to safely reuse it against album tables from other databases. Perhaps my albums are strictly organized by month, whereas another person's album table might contain photos spanning many years. Indeed, some of those other album tables might hold data on music albums rather than photo albums. But when I assert facts about a resource identified by a URI, there's no chance of semantic ambiguity. Anyone sharing that URI is referencing the same concept that I am, and my software can take advantage of that. The structured, universal nature of URIs guarantees that two occurrences of the same identifier carry the same semantics.

While URIs can be shared and reused, they need not be. Anyone can mint their own URI for a concept, meaning that identifier creation can be delegated to small expert groups or to individual organizations or lines of business. Later, when tools or applications encounter multiple URIs that may actually reference the same concept, techniques such as owl:sameAs or inverse-functional properties can allow the URIs to be used interchangeably.
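As a toy illustration of that interchangeability (the URIs are made up, and real stores perform considerably smarter smushing than this):

```javascript
// Rewrite every triple so that URIs known (via owl:sameAs) to refer to the
// same concept are replaced by a single canonical URI.
function smush(triples, sameAsMap) {
  function canon(term) { return sameAsMap[term] || term; }
  var out = [];
  for (var i = 0; i < triples.length; i++) {
    var t = triples[i];
    out.push({ subject: canon(t.subject), predicate: t.predicate, object: canon(t.object) });
  }
  return out;
}

var smushed = smush(
  [{ subject: "http://example.com/lee", predicate: "ex:worksAt", object: "http://example.com/ibm" }],
  { "http://example.com/lee": "http://example.org/people/lee" }
);
// smushed[0].subject is "http://example.org/people/lee"
```

After smushing, software can treat facts asserted against either URI as facts about the same resource.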

Identifying everything with URIs has other benefits as well. Many URIs (especially HTTP or LSID URIs) can be resolved and dereferenced to discover (usually in a human-readable format) their defined meaning. The length of URIs is also an advantage, albeit a controversial one. URIs are much longer than traditional identifiers, and as such they often are or contain strings which are meaningful to humans. While software should and does treat URIs as opaque identifiers, there are very real benefits to working with systems which use readable identifiers. At some level or another, all systems are engineered, debugged, and maintained by people, and much of the work of such people is made significantly simpler when they can work with code, data, and bug reports which mention a human-readable URI rather than companyId:80402.

RDF: A flexible and expressive data model

By representing all data as schema-less triples comprising a data graph, RDF provides a model which is expressive, flexible, and robust. The graph model frees the data modeler and/or application developer from defining a schema a priori, though RDF Schema and OWL do provide the vocabulary and semantics to specify the shape of data when that is desirable. And even when the shape of data has been prescribed, new, unexpected and heterogeneous data can be seamlessly incorporated into the overall data graph without disturbing the existing, schema-conformant data.

Furthermore, a graph-based data model allows a more accurate representation of many entities, attributes, and relations (as opposed to the relational model or (to a lesser extent) XML's tree/infoset model). Real objects, concepts, and processes often have a ragged shape which defies rigid data modeling (witness the pervasiveness of nullable columns in relational models) but is embraced by RDF's graph model. And while all common data models force certain artificial conventions (e.g. join tables) onto the entities being modeled, a directed, labeled graph can better handle common constructs such as hierarchical structures or homogeneous, multi-valued attributes.
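A toy example of that flexibility (the ex:-style names are made up): heterogeneous, ragged, and multi-valued facts coexist in one triple list with no schema declared anywhere.

```javascript
// A graph is just a growing list of triples; nothing about the "shape" of
// any resource has to be declared before its facts are added.
var graph = [];
function add(s, p, o) { graph.push({ subject: s, predicate: p, object: o }); }

add("ex:lee",  "ex:name",  "Lee");
add("ex:lee",  "ex:email", "lee@example.org");  // multi-valued attribute
add("ex:lee",  "ex:email", "lwf@example.org");
add("ex:boca", "ex:kind",  "RDF store");        // differently shaped resource, same graph
```

Adding ex:boca (with entirely different properties) required no change to any schema and did not disturb the existing data about ex:lee.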

The Semantic Web layer cake

Thankfully, RDF does not exist in a vacuum. Instead, it is and has always been a part of a larger whole, building upon lower-level infrastructure and laying the groundwork for more sophisticated capabilities including query, inference, proof, and trust. OWL powers consistency checking of rules and of instance data. Rules languages (e.g. SWRL or N3) enable reasoning over existing RDF data and inference of new data. And SPARQL provides a query language which combines the flexibility of RDF's data model with the ability to query data sources scattered across the network with a single query. The upper layers of abstraction in this stack are largely still in development, but the beauty of this modular approach is that we can begin benefiting from the lower layers while areas such as trust and proof are still evolving.

[Image: the Semantic Web layer cake of technologies]

A standard lingua franca for machines

RDF is an open W3C specification. So is OWL. And so is SPARQL. The Rules Interchange Format (RIF) working group is standardizing rule interoperability within the semantic-technologies arena. So what do we have here? We have a flexible data model which can adjust effortlessly to unexpected data sources and new shapes of data and which has a high degree of expressivity to model real-world objects. The data model can be queried in a network-aware manner. It can be organized in ontologies, reasoned over, and it can have new data inferred within it. And all of this is occurring via a set of open, standard technologies. At IBM, we look at this confluence of factors and come to one conclusion:

RDF is uniquely positioned as a machine-readable lingua franca for representing and integrating relational, XML, proprietary, and most other data formats.

One particularly appealing aspect of this thesis is that it does not require that all data be natively stored as RDF. Instead, SPARQL can be used in a heterogeneous environment to query both native RDF data and data which is virtualized as RDF data at query time. Indeed, we've exploited that paradigm before in our demonstrations of using SPARQL to create Web 2.0-style mash-ups.

The Semantic Web in the enterprise

These benefits of semantic technologies could (and will) do great things when applied to the Web as we know it. They can enable more precise and efficient structured search; they can lower the silo walls that separate one Web site's data from the next; they can drive powerful software agents that perform sophisticated managing of our calendars, our business contacts, and our purchases. For these dreams to really come into their own, though, requires significant cross-organization adoption to exploit the network effect on which semantic web technologies thrive. I have little doubt that the world is moving in this direction (and in some areas much more quickly than in others), but we are not there yet.

In the meantime, we believe that there is tremendous value to be derived from adopting semantic technologies within the enterprise. Over the past few years, then, we've sought to bring the benefits enumerated above into the software components that comprise a traditional enterprise application stack. The introduction of semantic technologies throughout the application stack is an evolutionary step that enhances existing and new applications by allowing the storage, query, middleware, and client-application layers of the stack to more easily handle ragged data, unanticipated data, and more accurate abstractions of the domain of discourse. And because the components are all semantically aware, the result is a flexible information bus to which new components can be attached without being tied down by the structure or serialization of content.

Anywhere and everywhere

Once we've created an application stack which is permeated with semantic technologies, we can develop applications against it in many, many different settings. Some of this is work that we've been engaged in at IBM, and much of it is work that has engrossed other members of the community. But all of it can benefit when semantics are pushed throughout the infrastructure. Some of the possibilities for where we can surface the benefits of semantic technologies include:

  • On the Web. We can quickly create read-only Web integrations using SPARQL within a Web browser. Or we can use Queso to create more full-featured Web applications by leveraging unified RDF storage and modularity along with the simplicity of the Atom Publishing Protocol. Or we can leverage community knowledge in the form of semantic wikis.

  • Within rich client applications. Full-featured desktop applications can be created which take advantage of semantic technologies. To do this we must develop new programming models based on RDF and friends, and use those models to create rich environments capable of highly dynamic and adaptive user experiences. We've done work in this direction with a family of RCP-friendly data-management and user-interface libraries named Telar, and many of the ideas are shared by the Semedia group's work with DBin.

  • Amidst traditional desktop data. Whether in the form of Haystack, Gnowsis, or perhaps some other deliverables from the Nepomuk consortium, there are tremendous amounts of semi-structured and unstructured data lurking in the cobwebbed corners of our desktops. Emails, calendar appointments, contacts, documents, photographs... all can benefit from semantic awareness and from attaching to a larger semantic information bus that integrates the desktop with content from other arenas.

Where do we go from here?

We're going to be working hard inside IBM, with our customers and business partners, and within the semantic web community to realize all these benefits I've written about. We'd like to take advantage of the myriad of other open-source semantic components that already exist and are being built, and we'd like others to take advantage of our software. We believe the time is right for a semantic-technologies development community to work together to create interoperable software components. We also think that the time is ripe to educate industry leaders and decision makers as to the value proposition of semantic web technologies (and at the same time, to dispel many of the mistaken impressions about these technologies). To this end, Wing and I have joined the fledgling W3C Semantic Web Education and Outreach interest group (SWEO), where we look forward to working with other semantic-technology advocates to raise awareness of the benefits and availability of semantic web technologies and solutions.

October 6, 2006

The SPARQL FAQ FAQ (printing, updates, JavaScript, and more)

Thanks to everyone who's provided valuable feedback since I put the first version of the SPARQL FAQ online last week. I just wanted to take a brief opportunity to address some of the comments/questions I've received since then.

Can I receive updates when the SPARQL FAQ changes?

As I've said before, I'm still working on new questions and updated answers for the FAQ, and I intend to continue updating it regularly as SPARQL matures and is used more widely. Elias asked if there was a feed that he could use to keep up with the latest. Well there wasn't, but now there is. Feed readers/aggregators should be able to autodiscover the feed from the SPARQL FAQ page itself, but in case that fails, here's an Atom 1.0 feed of new questions/answers to the FAQ. As many aggregators won't currently re-show items with an updated atom:updated timestamp, I'll still post here on my blog when significant updates have been made to existing answers.

How can I print the SPARQL FAQ? How can I view all the answers at once?

The FAQ includes Expand All and Collapse All links that can be used to accomplish this. Depending on the CSS support of your browser, these links should either be in the top-right corner of the web page or else in the top-left corner. They're small, so they're somewhat easy to miss. So to print the FAQ, just click the Expand All link and then use your browser's print command to print the web page as normal.

How was the FAQ constructed? Can I reuse the infrastructure?

A side goal of creating this SPARQL FAQ (aside from answering common questions about SPARQL) was to try to create a more usable FAQ. After several discussions with coworkers (especially Sean and Elias), I came up with a few properties that I wanted to have in the FAQ:

  • The FAQ should have an easy-to-navigate table-of-contents view, but I didn't want to have to maintain that table-of-contents independently of the questions and answers themselves.
  • The FAQ should be on a single page and should be viewable all at once, to facilitate printing. (See above.)
  • Both the FAQ itself and each individual question/answer within the FAQ should be linkable/bookmarkable.
  • The FAQ should be completely accessible to browsers without JavaScript, though it's OK if the user experience is not as comprehensive without JavaScript.

To achieve these properties, I chose the following technical approach:

  1. The FAQs are maintained in a simple XHTML format. On its own, the XHTML simply contains the category headers, questions, and answers, without the permalinks, table of contents, or expanding/collapsing behavior.
  2. CSS (currently inline, but there's no good reason for that) is used to style the FAQ appropriately. (Or inappropriately if you will, I'm not much of a designer.)
  3. JavaScript acts in response to the page's onload event to enhance the user experience. This JavaScript does the following:
    • Iterates through the categories and questions, adding numbers to them and collapsing (hiding) the answers
    • Adds the permalinks in based on the ids of the questions, making it easy to copy a link for emailing, linking, bookmarking, etc.
    • Adds the Expand All and Collapse All links.
    • Adds event handlers to handle clicks to show and hide questions.
    The JavaScript code also examines the URL used to get to the page. If the URL contains a fragment (#some-question) then the JavaScript expands that question. (The browser will handle navigating to that part of the document.)
  4. An XSLT acts on the XHTML to generate the Atom feed of the FAQ (see above). Currently I run this XSLT manually to generate the Atom document whenever the SPARQL FAQ changes.
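For the curious, the numbering-and-permalinks portion of step 3 boils down to a single walk over the headings in document order. Here's a rough Python analogue of that logic (this is not the actual JavaScript from the FAQ; the ids and question text below are hypothetical):

```python
# Sketch of the table-of-contents generation described above: given the
# categories and questions (here as plain tuples rather than parsed XHTML),
# number them in document order and derive permalinks from their ids.
def build_toc(items):
    """items: list of ('category' | 'question', id, text) in document order."""
    toc, cat_n, q_n = [], 0, 0
    for kind, elem_id, text in items:
        if kind == "category":
            cat_n, q_n = cat_n + 1, 0
            toc.append(f"{cat_n}. {text} (#{elem_id})")
        else:
            q_n += 1
            toc.append(f"  {cat_n}.{q_n}. {text} (#{elem_id})")
    return toc

# Hypothetical FAQ fragment:
faq = [
    ("category", "general", "General questions"),
    ("question", "what-is-sparql", "What is SPARQL?"),
    ("question", "why-sparql", "Why use SPARQL?"),
]
for line in build_toc(faq):
    print(line)
```

Because the numbers and links are derived on the fly from the ids already present in the XHTML, the table of contents never needs to be maintained by hand, which was the first property listed above.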

The JavaScript to enable this was designed to be reusable and can be downloaded here. (It has very basic instructions for how to use it inside. If there's much demand, I'll be glad to write up a brief tutorial to creating a FAQ using this library.) If you're interested in the XSLT which generates the Atom feed, just drop me a note.

September 29, 2006

Announcing: the SPARQL FAQ

I joined my first W3C working group, the DAWG, about 16 months ago. Since then, I've had the opportunity to see all sorts of questions about SPARQL in all sorts of places: I've fielded questions from many IBMers whom I represent on the working group; I've seen question after question posed on the public DAWG comments mailing list; I've answered SPARQL questions posed on #swig; I've read blog posts and comments about SPARQL; and in recent months I've followed the action on the nascent public-sparql-dev mailing list.

As SPARQL has started to receive more attention and slowly build momentum before and during these 16 months, the questions I've seen (both technical and more general) have become more and more familiar and more and more frequent. With this in mind, I set out a couple of weeks ago to assemble a SPARQL FAQ. It's still a work in progress (in fact I have a new batch of questions that I'm already working on writing up), but the first version is ready for your viewing, learning, and referencing pleasure. Enjoy the SPARQL FAQ, and please let me know if you have any feedback, suggestions for new questions, or suggestions for answers.

August 17, 2006

Life sciences on the web with SPARQL

I've been meaning to write up my experiences from WWW2006 in Edinburgh since getting back at the end of May, but the arrival of heaps of summer interns and the projects that accompany them (including Queso, a semantic-web-powered web-application framework) seems to have defeated that desire.

At the least, though, I wanted to mention the SPARQL/RDF life-sciences web mashup that I demoed at the Advancements in Semantic Web session of the W3C track on Friday. (And in doing so, follow my own example.) In this demo, we use RDF representations and SPARQL queries to integrate protein data from the NCBI with antibodies information from the Alzheimer Research Forum Antibody Database. The presentation that I gave in Edinburgh has some more information on how the demo is put together and what Elias and I learned from our work on it.

How to use the demo

  1. Navigate to the demo at
  2. Enter a search term to find related proteins. For the purposes of trying out the demo, enter p53 and click Find Proteins.1
  3. Up to twelve proteins found in the search are rendered on the display, along with the protein's species, description, and NCBI number. Click on a protein to search for antibodies that target that protein. For the purposes of trying out the demo, click on NP_000537.2
  4. Any antibodies found are displayed in a column on the right-half of the page. The information displayed includes the distributor of the antibody, the distributor's catalog number, the immunogen used to generate the antibody, the specificity of the antibody, and the uses for which the antibody is appropriate.

This demo makes use of some of the early work being done by Alan Ruttenberg in conjunction with the BioRDF subgroup of the W3C's Semantic Web Health Care and Life Sciences Interest Group.

Behind the scenes

1 Two SPARQL queries are used to do this initial search. First, we use a service written by Ben Szekely which performs an NCBI Entrez search and returns the LSIDs of the resulting objects within a simple RDF graph. For each of these LSIDs, we make use of a second one of Ben's services which allows us to resolve the metadata for an LSID via a simple HTTP GET. We use the URLs to this service as the graphs for a second SPARQL query which retrieves the details of the proteins. We take the results of this second SPARQL query as JSON and bind them to a microtemplate to render the protein information.

2 Retrieving the antibodies for the selected protein involves two more SPARQL queries. First, we query against a map created by Alan Ruttenberg in order to find AlzForum antibody IDs that correspond to the target protein. We need the results of this query to generate HTTP URLs which search the AlzForum antibody database for the proper antibodies. (If we had a full RDF representation of the antibody database, this query would be unnecessary.) These search URLs are wrapped in a service we created that scrapes the HTML from the antibody search results Web page and generates RDF (how I yearn for RDFa adoption); we then use these wrapped URLs as the graphs for a second SPARQL query. This query joins the NCBI data with Alan's mapping and the antibody details to retrieve the information that is rendered for each antibody of the target protein.
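The common pattern in both footnotes is that the results of one SPARQL query become the named graphs of the next. A minimal sketch of that query-chaining step, in Python (the ex: vocabulary, variable names, and URLs here are invented for illustration, not the demo's actual queries):

```python
# Stage 1 of the chain yields a list of graph URLs (e.g. LSID metadata
# resolved over HTTP GET); stage 2 wraps each one in a FROM NAMED clause.
def build_second_query(graph_urls):
    """Build the follow-up query whose dataset is the first stage's results."""
    from_named = "\n".join(f"FROM NAMED <{u}>" for u in graph_urls)
    return (
        "PREFIX ex: <http://example.org/ns#>\n"   # hypothetical vocabulary
        "SELECT ?protein ?label\n"
        + from_named + "\n"
        + "WHERE { GRAPH ?g { ?protein ex:label ?label } }"
    )

# Pretend these URLs came back from the first query:
q = build_second_query(["http://example.org/lsid/1", "http://example.org/lsid/2"])
print(q)
```

The actual demo sends the resulting query to a SPARQL endpoint and binds the JSON results to a microtemplate, but the dataset construction above is the step that stitches the two queries together.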

July 7, 2006

I'm a SPARQL Junkie

My coworker Wing and I wanted to send out evites to all of our immediate coworkers and the various interns that are working with us this summer. I told Wing that if he would write up the text of the evite that I would gather the email addresses. At IBM, we have a corporate directory called BluePages and I was trying to avoid manually searching for each person and looking up and copying their (internet) email addresses.

Over the years, IBMers have developed a slew of APIs to access the information in BluePages programmatically, but as I'm unfamiliar with most of them, I turned to Elias for help. Elias said:

Why don't you use SPARQL?

In the ensuing conversation, I learned that Elias had spent some time last week setting up SquirrelRDF to map SPARQL queries to BluePages, as suggested on #swig. He whipped open a browser window with the corporate LDAP schema and a terminal window with the (RDF) configuration file mapping LDAP attributes to RDF predicates.

A few minutes later, we had achieved our goal:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ibm: <>
SELECT ?mbox
WHERE {
   {
           _:elias foaf:name "Elias Torres" ; ibm:department ?dept.
           _:person ibm:department ?dept  ; foaf:mbox ?mbox .
   } UNION {
           _:wing foaf:name "Wing C. Yung" ; ibm:department ?dept.
           _:person ibm:department ?dept  ; foaf:mbox ?mbox .
   } UNION {
           _:alex foaf:name "ALEX H. CHAO" ; ibm:department ?dept ; ibm:city _:location .
           _:person ibm:department ?dept  ; foaf:mbox ?mbox ; ibm:city _:location .
   }
} ORDER BY ?mbox

(More info: Our lab in Cambridge is composed organizationally of two different departments, and some of our interns report to yet a third department. The third department also contains people not in Cambridge, so we used seed people from each department, grabbed the information that identifies their department (and location), and found all other people matching the same criteria.) We used this SquirrelRDF config file:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix lmap: <> .
@prefix ibm: <> .
<> a lmap:Map ;
        lmap:server <ldap://localhost/ou=bluepages,> ;
        lmap:mapsProp [ lmap:property foaf:name ; lmap:attribute "cn" ; ] ;
        lmap:mapsProp [ lmap:property ibm:department ; lmap:attribute "dept" ; ] ;
        lmap:mapsProp [ lmap:property foaf:mbox ; lmap:attribute "mail" ; ] ;
        lmap:mapsProp [ lmap:property ibm:city ; lmap:attribute "workLoc" ; ] .
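Conceptually, each lmap:mapsProp entry just renames an LDAP attribute to an RDF predicate so that SPARQL patterns can be answered from directory lookups. A toy Python rendition of that idea (this is not how SquirrelRDF is actually implemented; the entry data below is made up):

```python
# The attribute-to-predicate mapping mirrors the config file above.
PROP_MAP = {
    "cn": "foaf:name",
    "dept": "ibm:department",
    "mail": "foaf:mbox",
    "workLoc": "ibm:city",
}

def ldap_entry_to_triples(subject, entry):
    """Expose an LDAP entry as RDF-style triples; unmapped attributes are ignored."""
    return [(subject, PROP_MAP[attr], value)
            for attr, value in entry.items() if attr in PROP_MAP]

# A hypothetical directory entry ("uid" has no mapping, so it is dropped):
entry = {"cn": "Elias Torres", "mail": "mailto:elias@example.com", "uid": "x1"}
print(ldap_entry_to_triples("_:elias", entry))
```

Once the directory looks like triples, the UNION query above needs no special knowledge that its data actually lives in LDAP, which is the whole trick.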

(This post co-authored by Elias, Wing, and myself.)

July 4, 2006

I see the Semantic Web everywhere

Several weeks ago, Elias mused about one way in which semantic web technologies could improve his day-to-day life. Even though I've been working with semantic web technologies myself for a couple of years now, it's only recently that I've found that I'm seeing the Semantic Web all around me. In the past month, I've had more conversations about the Semantic Web with (technical and non-technical) friends of mine, and more and more potential benefits of the Semantic Web seem to crystallize around me constantly.

Yesterday, I took advantage of my day off to watch Lynn do her job (with great aplomb and skill, I must say) over at Roxbury District Court. As I sat for a few hours in the courthouse watching arraignments, bail arguments, default removals, and probation restrictions, I couldn't help but see the massive quantities of data flying around the room in the form of reams and reams of printed and handwritten paper materials. Instead of blonde, brunette, redhead I was seeing criminal complaints, police reports, and suspects' records moving rapidly from clerk to court officer to defense attorney to copy machine to district attorney to probation officer and beyond.

The system functions, but it functions with massive amounts of duplication of effort, misplaced data, and needless inefficiencies. Any attempts at analysis of past precedents requires expensive, painstaking research into the paper files that record all the stages of our justice system. The creation and installation of an electronic system for these records would be invaluable. And while such a system would have gigantic benefits with technical foundations ranging from relational to XML to proprietary, semantic web technologies would really make it shine.

  • Mountains of data. The amount of data generated from such mundane activities as scheduling court dates for a single criminal charge is staggering (but routine!).
  • Semi-structured data. The data is a mixture of well-structured form fields (the crime charged, location info, bail amounts, court dates, etc.) and unmined free text (e.g. the text of a complaint).
  • Ragged, open-world data. The data on a particular suspect is an open-world amalgamation of past charges, convictions, and current open cases from (possibly) multiple districts. A particular charge includes data generated by the district attorney's office, the court, one or more defense attorneys, the legislature, the department of correction, and more, and is often incomplete at any given moment in time. Furthermore, different charges mandate differently shaped data, as do different special bail conditions, sentences, and probation restrictions.
  • Organizational data interchange. Of course, the entire legal system is not populated by luddites. Parts of the system exist on top of electronic silos with legacy applications providing access to the data. To realize the full potential of an agile and efficient electronic system, however, data interchange between the organizations that take part in the legal system is paramount.

Yes, all of this can be accomplished with technologies other than RDF and friends. But add in the ability to search and analyze precedents and to define rules and policies (e.g. for sentencing guidelines or indigency determination), and the complete story told by RDF, RDF-S, OWL, SPARQL, and RIF is compelling.

There are social, inertial, and monetary reasons why this sort of systemic revolution is unlikely to happen anytime soon in the (American) legal system. But as the technologies continue to evolve and standardize and the infrastructure continues to mature we'll discover more and more arenas that will benefit from the promise of semantic web technologies. And eventually the confluence of technological capabilities, infrastructure availability, and the awareness of decision makers will reach a point where we can do far more than just talk about bringing new industries into the semantic web fold.

June 8, 2006

Exploring the SPARQL Clipboard Demo

Elias pointed out to me Benjamin Nowack's excellent implementation of an RDF clipboard. To use the demo:

  1. Right-click on any of the icons to the left of one of the bloggers and choose Copy from the context menu.
  2. Right-click on an empty icon under either Latest blog post or Resource Description and choose Paste from the context menu.
  3. Oooh and aaah.

This demo is a fantastic example of what we can accomplish with data that is represented in a lingua franca and that is accessible via a query language. If you add in the ability for this data to be distributed across the web, you end up with an almost ridiculously flexible infrastructure that empowers web authors and developers to integrate data in exciting and unforeseen ways with a very low barrier to entry. Throw in a lightweight serialization format and an emerging template-driven presentation technology and the possibilities shine brighter as the time from idea conception to functioning web app becomes less and less. In short, the semantic web can drive remarkable new Web 2.0+ solutions.

(This is a message that Leigh Dodds pitched at XTech and that I presented at WWW2006. I'll have more to say about it in another entry, coming soon.)

In exploring the workings of the clipboard demo, I copied Elias's entry and pasted him into a text editor. Here's what I saw:

{
    "resID": "_:b0c76b337e4d1fefa75c3477341d717c4_id2246341", 
    "endpoint": "/clipboard/sparql"
}

Elias is represented as a particular SPARQL endpoint and a resource (in this case, a blank node label, probably a told bnode?). The target of a paste operation has the privilege of determining what SPARQL query to use to best display Elias given identifying information for him. On a whim, I changed the above JSON object to:

{
    "resID": "", 
    "endpoint": "/clipboard/sparql"
}

That resource ID is the URI that I penned for myself some time ago. Before attempting to paste this clipboard fragment into the demo SPARQL clipboard, I first used the form at the top of the page to add the triples from my FOAF file to the store being used by the demo. Having done that, I then copied my modified JSON string and pasted it into the Resource Description section of the demo and voila—a simple rendition of my FOAF information appeared. (I haven't investigated the SPARQL queries done for latest blog post, but I believe it goes against Planet RDF, and I don't have any recent posts there to be found in a query.)
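To sketch what a paste target might do with such a fragment (hedged: I haven't reverse-engineered Benjamin's actual queries, and the resID below is an invented URI rather than the demo's bnode label), the essential move is turning the resID and endpoint into a query request:

```python
import json
import urllib.parse

# A copied clipboard fragment, with a hypothetical URI as the resource ID.
# (A bnode label like the demo's "_:b0c..." could not be wrapped in <...>
# directly; it would need to be matched some other way.)
fragment = '{"resID": "http://example.org/people/me", "endpoint": "/clipboard/sparql"}'

def paste_request(fragment_json):
    """Build a (relative) SPARQL protocol URL asking for everything about resID."""
    data = json.loads(fragment_json)
    query = "SELECT ?p ?o WHERE { <%s> ?p ?o }" % data["resID"]
    return data["endpoint"] + "?query=" + urllib.parse.quote(query)

url = paste_request(fragment)
print(url)
```

The point of the design is that the *target* of the paste chooses the query; the fragment carries only enough identity (resource plus endpoint) for any consumer to ask its own questions.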

The SPARQL Clipboard Demo. All the qualities of a good technology demo: impressive, novel, explainable, hackable, and applicable to real problems. Good stuff indeed.

April 26, 2006

SPARQL Calendar Demo: Retrieving Calendar Events

This is the seventh in a series of entries about the SPARQL calendar demo. If you haven't already, you can read the previous entry.

After one or more people have been discovered, the calendar demo allows us to select one or more people and click the refresh link in the Calendars section of the righthand panel in order to retrieve calendar events for the selected person(s). Clicking refresh with the Alice demo user selected runs this SPARQL query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX dc: <http://purl.org/dc/elements/1.1/> 
PREFIX ical: <http://www.w3.org/2002/12/cal/ical#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT DISTINCT ?title ?start ?end ?name ?location ?g 
WHERE {
  {
       <> rdfs:seeAlso ?g .
       ?g rdf:type ical:Vcalendar .
       OPTIONAL {
        <> foaf:name ?name 
      } .
       OPTIONAL {
        <> rdfs:label ?name 
      } .
  } .
  GRAPH ?g_wrapped {
    _:ev ical:summary ?title ; 
            ical:dtstart ?start ; 
            ical:dtend ?end ; 
            ical:location ?location .
    FILTER (regex(str(?start), '2006-04.*') ||
            regex(str(?end), '2006-04.*')).
  }
  FILTER regex(str(?g_wrapped), str(?g)) .
}

In English—and amidst some devilish hackery—this query says:

Fetch Alice's name. Also, for every event in the graph containing her calendar entries, fetch the event's title, start time, end time, and location.

At a first glance, we see that this query finds a person's calendar events using the same breadcrumbs protocol we've discussed previously. But as with the other SPARQL queries we've looked at so far, there are several other points worthy of note and discussion in this query:

  • How does this query deal with multiple people? With this query, we are retrieving individual people's calendar events. Because one person's calendar, in this context, is independent of another's, we can fetch multiple people's calendars simultaneously by using the SPARQL UNION keyword to retrieve different rows (events) for different people.
  • Why does this query retrieve the name of the person, when we've already seen a query that retrieves names? Much of what is done in the SPARQL calendar demo could be enhanced by saving client-side state. When Elias and I developed the demo, we made a point of avoiding state as often as we could, in favor of using SPARQL queries to do as much heavy-lifting as possible. Since we wish to display a person's name alongside their calendar in the righthand panel, we retrieve that information along with the calendar events in this query.
  • What's the purpose of the filter expressions on ?start and ?end? There's no need to retrieve any calendar events which we can't render on the current calendar control on the main part of the webpage. To limit the events returned, we require that either the start date or the end date of each event match a regular expression which specifies the current month shown on the main calendar. (We know that the dates will conform to this format given that the RDF Calendar note specifies that the range of these properties is the XML Schema dateTime type.) This filter expression uses the SPARQL regex filter function to perform the matching.
  • What's the difference between ?g and ?g_wrapped? How are they related? Very little published calendar data on the web is represented in RDF; much is published in iCalendar format. Thanks to DanC's python wizardry and Elias's hosting, though, we have resolvable URLs that resolve to iCalendar data represented as RDF. So, we can include such a URL as a named graph in the RDF dataset of our query, and then match events within that calendar by sticking a pattern inside the SPARQL GRAPH keyword. But, it's unrealistic and semantically incorrect for someone to publish these triples
      <http://example/me> a foaf:Person .
      <http://example/me> rdfs:seeAlso <> .
      <> rdf:type ical:Vcalendar .
    rather than
      <http://example/me> a foaf:Person .
      <http://example/me> rdfs:seeAlso <http://example/myCalendar.ics> .
      <http://example/myCalendar.ics> rdf:type ical:Vcalendar .

    If in our query, then, we used the same variable for both the object of rdfs:seeAlso and also for the graph name in the GRAPH clause, we wouldn't get back any rows, because the former's value is <http://example/myCalendar.ics> while the latter's value is <>. So rather than sharing a variable, we use two different variables and use the built-in regex function as a poor man's (and unreliable) String.subStringOf substitute.

    This is an ugly hack. Anyone reading this should be outraged. I was, but sometimes deadlines beckon. This is an ugly hack not least of all because the all of those ics2rdf URLs are invalid URL syntax (the reserved characters in the ical query parameter should be URL escaped). A possible solution would be to extend our calendar convention to require that a person's FOAF data contain a triple along the lines of: <http://example/myCalendar.ics> ex:asRDF <>. That's ugly, also, but perhaps a bit better. When we look at the queries that deal with people's interests and make use of data from, we'll see that this is a more general problem: when we access non-RDF data as RDF, how do we semantically associate the (different) URLs of (different) representations of the same data in a manner which is relatable within a SPARQL query? Elias wrote about this dilemma and solicited opinions ranging from handling this at an application level (but how do we do that with the current SPARQL Protocol over the web?) to extending the expressibility of SPARQL FROM NAMED clauses.

It's not always pretty when one takes a peek behind the curtain. As I've said before, these aren't new issues, but as far as I can tell they're still unsolved issues, and as such deserve whatever attention we can give to them.

April 15, 2006

Google Calendar + SPARQL = Baseball??

Elias was showing me the demo he had whipped up earlier today that shows one way of leveraging SPARQL as a query interface to Google calendar data. As he wrote, when he showed me the demo he asked me to think about some interesting queries we could do with it. Here's the first one I came up with:

  1. Navigate to the demo
  2. On the Search tab, enter mets and click Go
  3. In the results table, select 2006 Mets Schedule and click Add to Calendars
  4. On the Calendars tab, ensure that only the 2006 Mets Schedule calendar is selected
  5. On the Query tab, enter this query:
    PREFIX ical: <http://www.w3.org/2002/12/cal/ical#>
    SELECT ?when ?matchup ?broadcast 
    WHERE {
      GRAPH ?g {
        _:game ical:dtstart ?when ; 
                     ical:summary ?matchup; 
                     ical:description ?broadcast.
         FILTER (?broadcast != "")
      }
    }
  6. Click Get Results

The calendar in question is set up to use the description field to list information on Mets games that will be broadcast on national TV this season. Thus, the filter ensures that the query returns only those Mets games which can be viewed nationwide this year, particularly useful for displaced fans like myself!1 Pretty cool, eh?

For the record, I do feel a minor case of the willies that I'm using generic calendar predicates to retrieve semantic data on baseball-game schedules and broadcasting. But, hey, we've got to start somewhere. Let's go Mets, and let's go SPARQL!

1 (Of course, in reality, I subscribe to MLB.TV and miss nary a minute of the season, heavens forbid!)

April 14, 2006

SPARQL Calendar Demo: Growing Our RDF Dataset

This is the sixth in a series of entries about the SPARQL calendar demo. If you haven't already, you can read the previous entry.

When the discover more people link is clicked, the calendar demo uses this SPARQL query to expand its dataset before rerunning the search for people:

PREFIX foaf: <http://xmlns.com/foaf/0.1/> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX ical: <http://www.w3.org/2002/12/cal/ical#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
SELECT DISTINCT ?url ?named_url 
WHERE {
  {
     _:x rdfs:seeAlso ?named_url .
     ?named_url rdf:type ical:Vcalendar 
  } UNION {
    ?someone foaf:knows ?known .
    {
      ?known rdfs:seeAlso ?url 
    } .
  }
}

This query asks:

Give me all URIs in the current dataset for the FOAF files of people known by people in the dataset; also give me all sources of additional-information that are typed as calendars.

In doing so, it takes advantage of the two breadcrumbs protocols I wrote about previously. What can we learn from this query?

  • How does the UNION combine two query patterns that don't share variables? The SQL relational union operator requires that the two resultsets being unioned together share the same columns. SPARQL, on the other hand—in the RDF spirit of ragged datasets—allows for differently shaped sets of bindings to be unioned together; variables not appearing in one operand of the union are simply unbound in the solutions contributed by that operand. Thus, this is effectively a way to combine two simple queries in one. We end up with a resultset which has some rows that only contain bindings for ?url and some that only bind ?named_url.
  • What's up with the calendar gunk? A key feature of the calendar demo is connecting people with their calendar events in RDF (and hence in SPARQL). Doing this required choosing an approach to the event discovery problem. There's no widely accepted predicate that links a foaf:Person to a resource of rdf:type ical:Vcalendar. Dan Connolly, for example, uses foaf:currentProject to link his own foaf:Person with the events in his calendar. With RDF calendar work still in development, many people store their calendar data in iCal format, devoid of any links with their RDF FOAF data.

    Because we were also demonstrating the ability to wrap .ics files as RDF, we decided to adopt a convention that treated calendars as documents. Following the lead of the breadcrumbs protocol for discovering new FOAF files, we chose to use rdfs:seeAlso to relate a foaf:Person to a URI at which events can be found that belong to that person's calendar. To add a bit more semantics to this convention, we also required that the URI object of rdfs:seeAlso be explicitly typed as an ical:Vcalendar. When we find such a URI, we add it to our dataset as a named graph.

  • Why do we distinguish between ?url and ?named_url? The convention described above relies on semantics implicit in where calendar-event triples are found. That is, we associate events with a person based upon the document (named graph) in which we find the calendar-event triples. To query across this link successfully, then, we need to be able to model this link in our queries using the SPARQL GRAPH keyword, and to do this requires that we include the calendar graphs as named graphs in our dataset. In short, the calendar graphs are included as named graphs because our conventions impose semantics on the source of calendar-events triples.
While this convention that we adopted for finding calendars has been successful in the context of this demo, it has several drawbacks. First, it is not widely used (or used at all!), which requires that people massage their data into this format before taking advantage of the calendar demo with their own data.

Second, I feel that it is semantically dodgy. I'd much prefer the cleaner semantics of a chain of triples such as:

lee:LDF ex:calendar lee:LDFcalendar .
lee:LDFcalendar ical:component _:c  .
_:c ical:Vevent _:ev1 .
_:ev1 ...

Of course, as with any other semantic data, these triples can be published in multiple documents and distributed throughout the (semantic or world wide) web. As long as the dataset contains all of these triples, we could query calendar events without relying on the source-document semantics imposed by our use of the GRAPH keyword. Does anything exist that could take the place of the ex:calendar predicate? Are people using this construct anywhere? Of course, this approach would make it difficult, if not impossible, to point to an iCal file from within your FOAF data.

Finally, I think that our convention may be conflating the graph URL with the contents at the graph URL. Our convention effectively says, "See this URI, u, for more information. Oh yeah, u is a calendar, by the way." But that's not really the case. The URI is a graph that contains a calendar for the person in question, so we'd really want something like u foaf:primaryTopic ical:VCalendar. I'm not convinced that that's particularly accurate, either. Again, none of this is new, but I'm a firm believer that it never hurts to reiterate how easy it is to model the world incorrectly. These things are important to get right, and therefore important to write about when you think you've gotten them wrong!
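As a coda, the ragged UNION behavior discussed above can be sketched in a few lines of plain JavaScript. This is a hypothetical miniature of an engine's solution sequences, not code from the demo; variable names mirror the query above:

```javascript
// A SPARQL UNION simply concatenates two solution sequences. Solutions are
// objects mapping variable names to values; variables absent from one
// operand stay unbound (undefined) in the rows that operand contributes.
function union(leftSolutions, rightSolutions) {
  return leftSolutions.concat(rightSolutions);
}

// Rows from the calendar branch bind only ?named_url ...
var calendarRows = [ { named_url: "http://example.org/alice.ics" } ];
// ... while rows from the foaf:knows branch bind only ?url.
var foafRows = [ { url: "http://example.org/bob-foaf.rdf" } ];

var results = union(calendarRows, foafRows);
// results[0] binds only ?named_url; results[1] binds only ?url
```

Unlike SQL, no column alignment is required before the concatenation, which is exactly what lets the demo fold two unrelated discovery questions into one query.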

April 12, 2006

SPARQL Calendar Demo: Using SPARQL to Find, Identify, and Name People

This is the fifth in a series of entries about the SPARQL calendar demo. If you haven't already, you can read the previous entry.

This entry is the first of a few entries that will examine the specific SPARQL queries used in the calendar demo. While SPARQL bears surface resemblances to SQL, querying an RDF graph is a distinct approach from querying a relational data store, and there are several idioms and subtleties that are unique to the SPARQL language. (None of these ideas are new, of course! But as SPARQL has just moved to Candidate Recommendation status, I thought it might be useful to throw some real SPARQL queries out into the wild.)

This query is issued against the current dataset every time a new URI is added to the dataset (either manually or via the discover more people link):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ical: <http://www.w3.org/2002/12/cal/ical#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?who ?name ?id ?cal
WHERE {
  ?who rdf:type foaf:Person .
  OPTIONAL { ?who foaf:name ?name }
  OPTIONAL { ?who rdfs:label ?name }
  OPTIONAL {
    { ?who foaf:mbox ?id }
    UNION
    { ?who foaf:mbox_sha1sum ?id }
  }
  OPTIONAL {
    ?who rdfs:seeAlso ?cal .
    ?cal rdf:type ical:Vcalendar
  }
} ORDER BY ?name

In English, this query asks:

Show me all people along with their names (if found), unique IDs (if found), and calendar URLs (if found) in my current RDF dataset.
There are a few interesting observations that we can take away from this query:

  • Why all the OPTIONALs? We want to build as exhaustive a list of people as we can given our current dataset. When people reference their friends in their FOAF files, the amount of information that they include about them ranges from a URI only, to an IFP only, to a full suite of URI, name, and IFP information. Because we do not know the shape of the data we are querying, we take advantage of the SPARQL OPTIONAL keyword, which lets us include triple patterns that are allowed not to match the data being queried. That is, OPTIONAL ensures that if a person has a name but not an id (an IFP), we'll receive the name, and vice versa; the query will return all the information it can find without failing due to ragged data.
  • Why are there two different OPTIONAL blocks that can bind the ?name variable? This idiom takes advantage of the fact that the OPTIONAL keyword is left-associative to express an ordered preference between predicates within our SPARQL query1. That is:
      OPTIONAL { ?who foaf:name ?name }
      OPTIONAL { ?who rdfs:label ?name }
    can be read as (given that ?who is already bound by the first (non-optional) triple pattern in the query):
    Bind ?name to the object of either the foaf:name or rdfs:label predicates; but if both such bindings exist, we prefer the object of foaf:name.
    It's a very useful idiom for sure, especially in the absence of a rules-enabled datastore that could map one predicate to another in the absence of a triple with a more-desirable predicate.
  • Why don't we use the same trick for finding bindings to ?id? This SPARQL query uses the ?id variable to bind to the values of inverse-functional properties (?ifp would likely have been a better name for the variable). Each such property uniquely identifies a person, and the calendar demo uses them to smush together seemingly distinct foaf:Person URIs or bnodes that actually refer to the same person. Because of this, we want to learn about as many IDs as we can and therefore we use the SPARQL UNION keyword to disjunctively include all possible bindings for ?id. (Of course, we wrap the UNION in an OPTIONAL because we want the query pattern to match a person even if no IFPs are found for that person.)
  • What's that oddness with the calendar gunk in the query? And why is that in this query? OK, you got me there. This bit of functionality doesn't belong here, and in fact is duplicated in the SPARQL query which mines the current RDF dataset to discover new default and named graphs to add to the dataset. I'll discuss that query next time, and explain what this bit of SPARQL is saying. Until then, happy SPARQLing...
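The IFP-based smushing mentioned above can be sketched as follows. The data shapes and the smush function are hypothetical illustrations, not the demo's implementation:

```javascript
// Nodes sharing a value for an inverse-functional property (IFP) such as
// foaf:mbox_sha1sum denote the same person, so we pick one canonical node
// per IFP value and map every co-referring node onto it.
function smush(people) {
  var canonicalById = {};  // IFP value -> first node seen with that value
  var merged = {};         // every node -> its canonical node
  people.forEach(function (p) {
    var canon = canonicalById[p.id] || p.node;
    canonicalById[p.id] = canon;
    merged[p.node] = canon;
  });
  return merged;
}

var merged = smush([
  { node: "_:b1",                    id: "sha1:abc" },
  { node: "http://example.org/#lee", id: "sha1:abc" },  // same IFP value as _:b1
  { node: "_:b2",                    id: "sha1:def" }
]);
// merged["http://example.org/#lee"] === "_:b1"
```

This is why the query gathers every available ?id binding via UNION rather than preferring one predicate over another: each extra IFP value is another chance to discover that two seemingly distinct nodes are the same person.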

1 The nitty gritty: SPARQL defines A OPTIONAL B OPTIONAL C as (A OPTIONAL B) OPTIONAL C. In the case in question, A is our required triple pattern which binds ?who to the resource or bnode representing a foaf:Person. As per the definition of OPTIONAL then, the parenthesized portion of (A OPTIONAL B) OPTIONAL C will match successfully no matter what (since we're assuming A has already matched a foaf:Person), but will include bindings for B (that is, bindings of ?name to the object of foaf:name) if they exist. In either case, we then examine C. If B matched, then C can only match if it shares the same binding for ?name, so any other value as the object of rdfs:label gets ignored. If B failed to match then ?name remains unbound, and any object of rdfs:label will be bound to ?name. Voila—we have the desired behavior of expressing an ordered preference.
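The left-associative evaluation described in the footnote can be mimicked procedurally. This is an illustrative sketch only; the input property names (foafName, rdfsLabel) are invented stand-ins for the triples a real engine would match:

```javascript
// Mimics (A OPTIONAL B) OPTIONAL C for the name-preference idiom:
// try the preferred property first; a later OPTIONAL can only bind ?name
// if ?name is still unbound or the new value agrees with the existing one.
function preferName(person) {
  var solution = { who: person.who };            // A: required pattern matched
  if (person.foafName !== undefined) {           // OPTIONAL B: foaf:name
    solution.name = person.foafName;
  }
  if (person.rdfsLabel !== undefined &&          // OPTIONAL C: rdfs:label joins
      (solution.name === undefined || solution.name === person.rdfsLabel)) {
    solution.name = person.rdfsLabel;
  }
  return solution;
}

preferName({ who: "_:a", foafName: "Alice", rdfsLabel: "Ms. A" }).name; // "Alice"
preferName({ who: "_:b", rdfsLabel: "Bob" }).name;                      // "Bob"
```

When both properties are present the foaf:name binding wins, and rdfs:label only fills the gap when foaf:name is absent—the ordered preference the idiom is after.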

April 5, 2006

SPARQL Calendar Demo: A SPARQL JavaScript Library

This is the fourth in a series of entries about the SPARQL calendar demo. If you haven't already, you can read the previous entry.

A key component of the calendar demo is our SPARQL JavaScript library. Leigh Dodds blogged about his SPARQL AJAX client a few months back. As one of our motivations for the calendar demo was to explore the JSON serialization of SPARQL query results, though, we whipped up our own library for SPARQL queries. This library features:

  • ...issuing SPARQL SELECT or ASK queries using the SPARQL Protocol for RDF extended with a parameter named output. Joseki as deployed on SPARQLer currently supports:
    • No output specified; results are returned in the SPARQL Query Results XML Format with a MIME type of application/sparql-results+xml.
    • output=xml or output=sparql; results are returned in the SPARQL Query Results XML Format with a MIME type of text/plain.
    • output=json; results are returned via the JSON serialization with a MIME type of text/javascript.
    • output=any-other-value; results are returned in RDF/XML with MIME type text/plain as a graph using the DAWG's result-set vocabulary for test cases.
  • ...automatically validating and parsing JSON return values into JavaScript objects.
  • ...providing several query wrapper methods and accompanying result transformations to enable direct access to single-valued query results, vectors of query results, and boolean results (for ASK queries). This mechanism could be easily extended to support parsing the XML result format.
  • ...allowing either HTTP GET or HTTP POST to be used when sending queries.
  • ...providing distinct service and query objects such that dataset graphs, prefixes, and other settings can be set service-wide or on a per-query basis.
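For illustration, the kind of request the library issues can be sketched as follows. The buildQueryUrl helper is hypothetical (not part of the library's API), and the endpoint URL is just an example:

```javascript
// Sketch of an extended SPARQL Protocol GET request: the query text is
// URL-encoded into the "query" parameter, and the desired serialization
// is requested via the non-standard "output" parameter described above.
function buildQueryUrl(endpoint, query, output) {
  var url = endpoint + "?query=" + encodeURIComponent(query);
  if (output) {
    url += "&output=" + encodeURIComponent(output);  // e.g. "json"
  }
  return url;
}

var url = buildQueryUrl("http://sparql.org/sparql", "ASK { ?s ?p ?o }", "json");
// → "http://sparql.org/sparql?query=ASK%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D&output=json"
```

Omitting the output argument yields a plain protocol request, which falls back to the XML results format per the first bullet above.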

The library currently has an unmotivated dependency on the Yahoo! connection manager, but this dependency could (and likely will) be easily removed.
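To give a feel for the result transformations mentioned above, here is a sketch of flattening one column of the SPARQL JSON results format into a plain array. The toValues function is a hypothetical stand-in for what a selectValues-style wrapper does internally:

```javascript
// The SPARQL JSON results format nests each binding as a {type, value}
// object under the variable's name; a "values" transformation flattens
// one variable's column into a plain array of lexical values.
function toValues(json, variable) {
  var v = variable || json.head.vars[0];  // default to the first SELECT variable
  return json.results.bindings.map(function (binding) {
    return binding[v] ? binding[v].value : undefined;  // OPTIONALs may be unbound
  });
}

var json = {
  head: { vars: ["mbox"] },
  results: { bindings: [
    { mbox: { type: "uri", value: "mailto:alice@example.org" } },
    { mbox: { type: "uri", value: "mailto:bob@example.org" } }
  ] }
};
toValues(json); // ["mailto:alice@example.org", "mailto:bob@example.org"]
```

The single-value wrapper is then just the first element of this array, and the boolean wrapper reads the format's top-level boolean field instead of results.bindings.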

Finally, some example usages of the library's API:

var sparqler = new SPARQL.Service("");

// graphs and prefixes defined here
// are inherited by all future queries
sparqler.setPrefix("foaf", "http://xmlns.com/foaf/0.1/");
sparqler.setPrefix("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
sparqler.setOutput("json"); // "json" is the default output format

var query = sparqler.createQuery();

// these settings are for this query only
query.setPrefix("ldf", "");

// query wrappers:

// passes standard JSON results object to success callback
query.query(
  "SELECT ?who ?mbox WHERE { ldf:LDF foaf:knows ?who . ?who foaf:mbox ?mbox }",
  {failure: onFailure, success: function(json) { for (var x in json.head.vars) { ... } ... }}
);

// passes boolean value to success callback
query.ask(
  "ASK WHERE { ?person foaf:knows [ foaf:name \"Dan Connolly\" ] }",
  {failure: onFailure, success: function(bool) { if (bool) ... }}
);

// passes a single vector (array) of values
// representing a single column of results
// to success callback
var addresses = query.selectValues(
  "SELECT ?mbox WHERE { _:someone foaf:mbox ?mbox }",
  {failure: onFailure, success: function(values) { for (var i = 0; i < values.length; i++) { ... values[i] ... } } }
);

// passes a single value representing a single
// row of a single column (variable) to success callback
var myAddress = query.selectSingleValue(
  "SELECT ?mbox WHERE { ldf:LDF foaf:mbox ?mbox }",
  {failure: onFailure, success: function(value) { alert("value is: " + value); } }
);

// shortcuts for all of the above
// (w/o ability to set any query-specific graphs or prefixes)

Feel free to download and use the library as you see fit. I'll post here when there are any substantive updates to it. In the next entry, I'll start delving into the specific SPARQL queries that drive the calendar demo.

March 30, 2006

SPARQL Calendar Demo: Step-by-step Example

This is the third in a series of entries about the SPARQL calendar demo. If you haven't already, you can read the previous entry.

On #swig yesterday, bengee noted that he was unable to get the calendar demo to work. We didn't have a chance to delve into the particulars, but I'm pretty confident that the reason was the lack of a good answer to one of these two questions:

  1. What in the world do I do with this thing?
  2. How can I get this to work with my data?

The answer to the second question is that there are a number of conventions that data needs to observe for it to display properly in the calendar demo. I'll discuss those conventions as I go through the rest of this series, but most of them are also presented briefly on the SparqlCalendarDemoUsage ESW wiki page. So without further ado, here is a walkthrough of using the SPARQL calendar demo:

  1. Navigate to the demo
  2. Add one or more URLs that resolve to FOAF files. For this walkthrough, click the add Alice link to add the Alice (demo) persona's FOAF file to the dataset. A list of people appears: in this case, we see Alice, along with URIs for Bob and myself.
  3. (optional) Click discover more people. This follows the FOAF breadcrumbs protocol to build the dataset and find more people, and more information on already known people. In this case, we discover the names of Bob and myself, and also find DanC and Elias.
  4. Make sure that April (2006) is displayed on the calendar. If attempting this in March, click the big right arrow next to the calendar to advance to April.
  5. Check the checkboxes next to Alice and Bob.
  6. Click refresh in the Calendars section of the righthand column to display all of Alice's and Bob's calendar events. You should see several events for each of them.
  7. With Alice and Bob still checked, click on refresh in the Shared interests section of the righthand column to display the foaf:interests that Alice and Bob have in common. In this case, both Alice and Bob are interested in theater and in jazz.
  8. Check the checkbox next to Theater.
  9. Click the what can we do together? link. In this case, this shows that there's a performance of Arcadia going on in Boston, Massachusetts (mouseover events to see their location) from April 26–29. Because both Alice and Bob will be in Boston on April 28 and because both share an interest in theater, this event is shown as an activity that both people might enjoy attending together.

The mocked-up data should work at least through the end of April. I'll update this space as I update the demo data to keep the demo personas functional.

March 29, 2006

SPARQL Calendar Demo: Following the Breadcrumbs

This is the second in a series of entries about the SPARQL calendar demo. If you haven't already, you can read the previous entry.

The calendar demo assumes a paradigm whereby the larger the RDF dataset being queried, the more questions can be asked and the more complete the answers are likely to be. New graphs can be added to the dataset in three ways:

  1. The Add button in the top right
  2. The discover more people link
  3. The SPARQL query that populates the People panel

While the first of these three actions grows the dataset directly (the URI is added as a default graph), the second two are examples of simple breadcrumbs protocols. From the tabulator 'about' page:

A breadcrumbs protocol is a convention by which information is left to allow another to follow. When the information provider follows the convention on what breadcrumbs to leave, and an information seeker follows the convention on what links to follow, then the protocol ensures that the seeker will be able to solve certain sorts of problems.
In fact, the discover more people link uses exactly the FOAF-link breadcrumbs-protocol example given in the tabulator 'about' page; for all triples p rdfs:seeAlso u, if p has rdf:type foaf:Person then we add u as a default graph in our dataset.

The other breadcrumbs protocol used was developed for this demo and is invoked as part of the SPARQL query that populates the People panel in the righthand column of the demo. In the absence of any widely-accepted manner of associating FOAF data with calendar data, we adopted this protocol as a convention for locating calendars; for all triples p rdfs:seeAlso u, if p has rdf:type foaf:Person and u has rdf:type ical:Vcalendar, then we add u as a named graph in our dataset for future queries involving calendars.
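A sketch of both conventions over a simple list of triples may help. The data structures are hypothetical (not the demo's implementation), and the sorting of discovered URIs into default versus named graphs reflects one reading of the two protocols:

```javascript
var FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person";
var SEE_ALSO    = "http://www.w3.org/2000/01/rdf-schema#seeAlso";
var RDF_TYPE    = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
var VCALENDAR   = "http://www.w3.org/2002/12/cal/ical#Vcalendar";

// For every triple p rdfs:seeAlso u where p is a foaf:Person, follow the
// breadcrumb: if u is additionally typed ical:Vcalendar it becomes a named
// graph (the calendar convention); otherwise it joins the default graph.
function followBreadcrumbs(triples) {
  function has(s, p, o) {
    return triples.some(function (t) { return t.s === s && t.p === p && t.o === o; });
  }
  var defaultGraphs = [], namedGraphs = [];
  triples.forEach(function (t) {
    if (t.p === SEE_ALSO && has(t.s, RDF_TYPE, FOAF_PERSON)) {
      if (has(t.o, RDF_TYPE, VCALENDAR)) {
        namedGraphs.push(t.o);
      } else {
        defaultGraphs.push(t.o);
      }
    }
  });
  return { defaultGraphs: defaultGraphs, namedGraphs: namedGraphs };
}

var graphs = followBreadcrumbs([
  { s: "_:lee", p: RDF_TYPE, o: FOAF_PERSON },
  { s: "_:lee", p: SEE_ALSO, o: "http://example.org/lee-foaf.rdf" },
  { s: "_:lee", p: SEE_ALSO, o: "http://example.org/lee.ics" },
  { s: "http://example.org/lee.ics", p: RDF_TYPE, o: VCALENDAR }
]);
```

Each newly discovered graph is then fetched and merged into the dataset, and the discovery step can be run again over the enlarged dataset—which is precisely why more data begets more answers.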

In future entries, I'll examine this and the other SPARQL queries in the calendar demo in detail. I'll also look in more detail at this convention, its drawbacks, and possible alternatives. For now, the first lesson that we learned creating this demo was a simple one: finding more data → finding more answers.

March 28, 2006

SPARQL Calendar Demo: Overview

Elias and I gave a lightning presentation of our SPARQL calendar demo at the W3C Tech Plenary SWIG meeting in Mandelieu. Albeit short and glossing over many of the important sticking points that we encountered while creating it, the demo was well received. Since then, we've spent some time polishing up some rough edges (e.g. at the conference the spotty Internet connection required us to rig the demo to run off local copies of FOAF and calendar data accessed via a locally running Joseki service) and have posted the calendar demo for people to play with and discuss. The demo can be accessed live online and an archive of the HTML, JavaScript, and CSS that comprise the demo can be freely downloaded and used.

The two relevant ESW wiki pages contain some details about the demo, but I'd like to spend my next few blog entries presenting some of the lessons that I learned from writing the calendar demo. Almost none of these lessons are new, but as the DAWG prepares to move SPARQL towards CR status (and with the birth of the new public-sparql-dev mailing list), I feel it's worthwhile to present them here. I welcome any comments, opinions, or discussions.

For now, just an overview.

Recently, TimBL and co.'s tabulator has received a great deal of attention as a possible incarnation of a semantic web browser. In my view, the tabulator presents the power of the interconnected semantic web by allowing a user to browse the data arbitrarily and then to tabulate any particular slice through the data. In one breath I could use the tabulator to display my students' contact information (a people-centric view), and in the next I could pivot within the same data and display a course-enrollment tabulation (a course-centric view).

For semantic-web technologies to succeed and thrive—particularly within enterprises—I feel that the capability to build query- and report-based applications is as important as (and complementary to) browser-based applications. Even in the presence of a web of domain-neutral data, we must be able to formulate and evaluate domain-specific, often complex queries. SPARQL, naturally, is the query language of choice for these queries. Of course, we can still benefit (greatly!) from the semantic relations and connectivity of semantic-web data, especially as it enables a new ease of integration (at the query level) of disparate data sources.

The SPARQL calendar demo, then, is an exercise in writing non-trivial SPARQL queries to integrate disparate data sources in answering domain-centric questions. We also took advantage of our work with Kendall on a draft specification of a JSON serialization of SPARQL SELECT and ASK query results, as we crafted the demo as a largely stateless, AJAX-driven web application. Currently, the calendar demo can answer the following requests:

  1. What people exist in a given RDF dataset?
  2. Show me all the calendar events of (some of) these people.
  3. What are the shared interests of (some of) these people?
  4. Show me events fitting (some of) these shared interests that we can all attend together.

More on the nitty-gritty of the calendar demo to follow...