TechnicaLee Speaking: June 2009 Archives

We’ll be releasing the first versions of our Anzo products in July. Between now and then I’m going to try to do some blogging showing various parts of the products. But before I begin that, I’ve been thinking a bunch recently about how to characterize our use of Semantic Web technologies, and I wanted to write a bit on that.

Our software views the world of enterprise data in a pretty straightforward way:

Bring together as much data as possible.
Do stuff with the data.
Allow anyone to consume the data however (& whenever) they want.

This is a very simple take on what we do, but it gets to the heart of why we care about semantics: We love semantics because semantics is the “secret sauce” that makes possible each of these three aspects of what we do.

Here’s how:

Bring together as much data as possible

First of all, in most cases we don’t actually physically copy data around. That sort of warehouse approach is appropriate in some cases, but in general we prefer to leave data where it is and bring it together virtually. Our semantic middleware product, the Anzo Data Collaboration Server, provides unified read, write, and query interfaces to whatever data sources we’re able to connect to. We often refer to the unified view of heterogeneous enterprise data as a semantic fabric, but really it’s linked data for the enterprise.

Semantic Web technologies make this approach feasible. RDF is a data standard that is both expressive enough to represent any type of data that’s connected to the server and also flexible enough to handle new data sources incrementally. URIs provide a foundation for minting identifiers that don’t clash unexpectedly as new data sources are brought into the fold. Named graphs give us a simple abstraction upon which we can engineer practical concerns like security, audit trails, offline access, real-time updates, and caching. And, of course, GRDDL gives us a standard way to weave XML source data into the fabric.

Without Semantic Web technologies we’d need to worry about defining a master relational schema up front, or we’d have to constantly figuring out how to structurally relate or merge XML documents. And when we’re talking about data that originates not only in one or two big relational databases but also in hundreds or thousands or hundreds of thousands of Excel spreadsheets, the old ways just don’t cut it at all. Semantic Web technologies, on the other hand, provide the agile data foundation we need to bring data together.

But bringing together as much data as possible is not an end in itself. What’s the point of doing this?

Do stuff with the data

This one’s intentionally vague, because there are lots of things that lots of different people want—and need—to do with data, and Anzo is a platform that accommodates many of those things. In general, though, Semantic Web standards again lay the groundwork for the types of things that we want to do with data:

Data access. SPARQL gives us a way to query information from multiple data sources at once.
Describing data. RDF Schema and OWL are extremely expressive ways to describe (the structure of) data, particularly compared to alternatives like relational DDL or XML Schema. We can (and do) use data descriptions to do things like build user interfaces, generate pick lists (controlled vocabularies), validate data entry, and more.
Transform data. There are all kinds of ways in which we need to derive new data from existing data. We might do this via inference (enabled by RDFS and OWL) or via rules (enabled by SPARQL CONSTRUCT queries, by RIF, or by SWRL) or simply via something like SPARQL/Update.

Without Semantic Web technologies, we’d probably end up using a proprietary approach for querying across data sources. We’d have to hardcode all of our user interface or else invent or adopt a non-standard way of describing our data beyond what a relational schema gives us. And then we might choose a hodgepodge of rules engines, SQL triggers, and application-specific APIs to handle transforming our data. And this might all work just fine, but we’d have to put in all the time, effort, and money to make all the pieces work together.

To me, that’s the beauty of the much-maligned Semantic Web layer cake. The fact that semantic technologies represent a coherent set of standards (i.e. a set of disparate technologies that have been designed to play nice together) means that I can benefit from all of the “glue” work that’s already been done by the standards community. I don’t need to invent ways to handle different identifier schemes across technologies or how to transform from one data model to another and back again: the standards stack has already done that.

Allow anyone to consume the data however (& whenever) they want

Once we’ve put in place the ability to bring data together and do stuff to that data, the remaining task is to get that information in front of anyone who needs it when they need it. We’ve put in a lot of effort to make bringing data into the fabric and acting on that data easy, and it would be a shame if every time someone needs to consume some information they need to put in a request and wait 6 months for IT to build the right queries, views, and forms for them.

To this end, Anzo on the Web takes the increasingly popular faceted-browsing paradigm and puts it in the hands of non-technical users. Anyone can visually choose the data that they need to see in a grid, a scatter plot, a pie chart, a timeline, etc. and the right view is created immediately. Anyone can choose what properties of the data should be available as facets to filter through the data set via whatever attributes he or she wants.

Once again it’s the flexibility of the Semantic Web technology stacks that makes this possible for us. RDF makes it trivial for us to create, store, and discover customized lenses with arbitrary properties. RDF also lets us introspect on the data to present visual choices to users when configuring views and adding filters. SPARQL is a great vehicle for building the queries that back faceted browsing.

In summary

It bears repeating that as with most technology standards, the things that we accomplish with Semantic Web standards could be done with other technology choices. But using a coherent set of standards backed by a thriving community of both research and practice means that:

We don’t have to invent all the glue that ties different technologies together
Any new standards that evolve within this stack immediately give our software new capabilities (see #1)
There’s a wide range of 3rd party software that will easily interoperate with Anzo (other RDF stores, OWL reasoners, etc.)
We can focus on enabling solutions, rather than on the core technology bits. All of the above frees us up to do things like build an easy to use faceted browsing tool, build Anzo for Excel to collect and share spreadsheet data, build security and versioning and real-time updates, and much more.

Again, the semantics is really the secret sauce that makes much of what we do possible, but there’s a lot more innovation and engineering that turns that secret sauce into practical solutions. I’ll have some takes on what this looks like in practice in the coming weeks, and we’d love to show you in person if you’ll be in the Boston, MA area or if you’ll be at SemTech in San Jose, CA.

Posted by Lee Feigenbaum at 3:54 PM | Permalink

TechnicaLee Speaking

Software designs, implementations, solutions, and musings by Lee Feigenbaum

June 21, 2009

SPARQLing at SemTech

June 4, 2009

Why we love Semantic Web technologies

Bring together as much data as possible

Do stuff with the data

Allow anyone to consume the data however (& whenever) they want

In summary