
March 26, 2008

Now available online - Scientific American: "The Semantic Web in Action"

I blogged previously about my experience co-authoring an article on the Semantic Web for Scientific American. Since then, Scientific American has granted me permission to publish the text of the article on my Web site. So please feel free to enjoy the article and share it with others: "The Semantic Web In Action"

A few notes:

  • The default view of the article breaks it into multiple pages to make it more easily digestible and bookmarkable. There is a link at the top and bottom to a single-page version suitable for printing and reading offline, or for anyone who simply prefers reading it that way.
  • The article text is followed by the text of the article's sidebars. There are links back and forth between the main text and the relevant sidebars. Most of the sidebars in the article included artwork which I do not have permission to reproduce online at this time.
  • At the end of the article I've gathered links to the various companies, projects, and technologies referenced in the article. (The terms of the reproduction rights from Scientific American prohibit adding links within the main content of the article.)

Please let me know what you think. Also, if you have any trouble reading or printing the article, let me know as well. (I whipped together some JavaScript to do the pagination while maintaining the browser's back button and internal anchors and things like that, so there may be some bugs. I'll write more about the JavaScript some other time.)

March 18, 2008

Gathering SPARQL Extensions

I realized that I hadn't blogged a pointer to the compilation of SPARQL extensions that I've created on the ESW wiki. Quoting myself:

Over the DAWG's lifetime (and since publication of the SPARQL Recommendations in January), there have been many important features that have been discussed but did not get included in the SPARQL specifications. I -- and many others -- hope that many of these topics will be addressed by a future working group, though there are no concrete plans for such a group at this time.

In the interest of cataloging these extensions and encouraging SPARQL developers to seek interoperable implementations of SPARQL extensions, I've created:


   http://esw.w3.org/topic/SPARQL/Extensions


That page links to individual pages for (currently) 13 categories of SPARQL extensions. Each of those pages, in turn, discusses the relevant type of SPARQL extension and attempts to provide links to research, discussion, and implementations of the extension.


I also plan to use this list to help encourage user- and implementor-driven discussion of these extensions over the coming months. Again, the goal is to allow SPARQL users to make known what features are most important to them and also to allow implementations to seek common syntaxes and semantics for SPARQL extensions. (All of this, in the end, should help a future working group charter a new version of SPARQL and produce a specification that allows for interoperable SPARQL v2 implementations.)

It's a wiki. Please add references that are not there, new topics, or discussions of existing topics. (I've tried to reuse existing ESW Wiki pages for some topics that already had discussion.)

Where I say "this list" above, I mean public-sparql-dev@w3.org. Please subscribe if you're interested in discussing any or all of these potential SPARQL extensions.

March 12, 2008

Semantic Web tutorial

Last week, Eric Prud'hommeaux and I presented a tutorial on Semantic Web technologies at the Conference on Semantics in Healthcare & Life Sciences (C-SHALS). It was a four-hour session covering an intro to RDF, SPARQL, GRDDL, RDFa, RDFS, and OWL, mostly in the context of health care (patients' clinical examination records) and life sciences (pyramidal neurons in Alzheimer's Disease, as per the W3C HCLS interest group's knowledgebase use case). We reprised the GRDDL and RDFa sections in a whirlwind 15-20 minute talk at yesterday's Cambridge Semantic Web gathering.

Enjoy the slides. I'd welcome any suggestions so that the slides can be enhanced and reused (by myself and others) in the future.

March 08, 2008

Modeling Statistics in RDF - A Survey and Discussion

At the Semantic Technologies Conference in San Jose in May, Brand Niemann of the U.S. EPA and I are presenting Getting to Web Semantics for Spreadsheets in the U.S. Government. In particular, Brand and I are working to exploit the semantics implicit in the nearly 1,500 spreadsheets that are in the U.S. Census Bureau's annual Statistical Abstract of the United States. (The rest of this post discusses various strategies for modeling this sort of statistical data in RDF; for more information on the background of this work, please see my presentation from the February 5, 2008, SICoP Special Conference.)

The data for the Statistical Abstract is effectively time-based statistics. There are a variety of ways that this information can be modeled as semantic data. The approaches differ in simplicity/complexity, semantic expressivity, and verbosity. At least as interestingly, they vary in precisely what they are modeling: statistical data or a particular domain of discourse. The goal of this effort is to examine the potential approaches to modeling this information in terms of ease of reuse, ease of query, ability to integrate with information from all 1,500 spreadsheets (and other sources), and the ability to enhance the model incrementally with richer semantics. There are surely other approaches to modeling this information as well: I'd love to hear any ideas or suggestions for other approaches to consider.

D2R Server for Eurostat

The D2R server guys host an RDF copy of the Eurostat collection of European economic, demographic, political, and geographic data. From the start, they make the simplifying assumption that:

Most statistical data are time series, therefore only the latest availabe value is provided here.

In other words, they do not try to capture historic statistics at all. The disclaimer also notes that what is modeled in RDF is a small subset of the available data tables.

Executing a SELECT DISTINCT ?p { ?s ?p ?o } to learn more about this dataset tells us:

   db:eurostat/population_total
   db:eurostat/electricity_consumption_GWh
   db:eurostat/killed_in_road_accidents
   db:eurostat/RnD_exp_mio_euro
   db:eurostat/parentcountry
   db:eurostat/population_male
   rdfs:label
   db:eurostat/RnD_personel_percent_of_act_pop
   db:eurostat/total_average_population
   db:eurostat/population_female
   db:eurostat/unemployment_rate_total
   db:eurostat/avg_annual_population_growth
   db:eurostat/total_area_km2
   db:eurostat/name_encoded
   db:eurostat/disposable_income
   db:eurostat/injured_in_road_accidents
   db:eurostat/electricity_production_capacity_MWh
   db:eurostat/hospital_beds_per100000hab
   db:eurostat/name
   db:eurostat/landuse_total
   db:eurostat/GDP
   db:eurostat/geocode
   owl:sameAs
   rdf:type
   db:eurostat/level_of_internetaccess_households
   db:eurostat/death_rate
   db:eurostat/fertility_rate_total
   db:eurostat/level_of_internet_access
   db:eurostat/marriages
   db:eurostat/ecommerce_via_internet
   db:eurostat/pupils_and_students
   db:eurostat/inflation_rate
   db:eurostat/employment_rate_total
   db:eurostat/average_exit_age_from_laborforce
   db:eurostat/comparative_price_levels
   db:eurostat/GDP_current_prices
   db:eurostat/GDP_per_capita_PPP
   db:eurostat/monthly_labour_costs

I make a few observations from this:

  • Most of these are predicates that correspond to a statistical category. I'm curious what the types of the subjects are. The query here is (the filter is added to limit the question to resources that use the Eurostat predicates):
     SELECT DISTINCT ?t WHERE {  ?s rdf:type ?t .  ?s ?p ?o .
      FILTER(regex(str(?p), 'eurostat') )
     }
    
    The result is two types: regions and countries. Simple enough.
  • I'm also curious as to the types of the objects. Let's see if there are any resources (URIs) as objects. We do the ?s ?p ?o query from before but add in FILTER(isURI(?o)). The result shows that, aside from rdf:type and owl:sameAs (which we expected), only the predicate db:eurostat/parentcountry points to other resources. Doing a query on this predicate, we see that it relates regions (e.g. db:regions/Lorraine) to countries (e.g. db:countries/France).
  • I'd expect that, especially in the absence of time-based data, they don't have object structures with blank nodes. Changing the previous filter to use isBlank confirms that this is true.
  • So what are the types of the other data? Strings? Numbers? Let's find out. Poking around with various values for XXX in the filter FILTER(isLiteral(?o) && datatype(?o) = XXX) we see that some data uses xsd:strings while other data uses xsd:double. Poking around at the remaining predicates, we discover that they use xsd:long for non-decimal numbers.
  • What are they using owl:sameAs for? Executing SELECT ?s ?o { ?s owl:sameAs ?o } shows what I suspected: they're equating URIs that they've minted under a Eurostat namespace (http://www4.wiwiss.fu-berlin.de/eurostat/resource/) to DBPedia URIs (to broaden the linked data Web). Let's see if they use owl:sameAs for anything else. We add FILTER(!regex(str(?o), 'dbpedia')) and the query now returns no results.
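
The exploratory queries above follow a pattern that can be mimicked over any set of triples. As a rough sketch in plain Python rather than SPARQL, the functions below play the role of the DISTINCT predicate query, the isURI filter, and the datatype probing. Everything here is an assumption for illustration: the `URI` helper class, the class names used as rdf:type objects, and the sample values are made up and are not data from the D2R server.

```python
from typing import NamedTuple


class URI(str):
    """Marks a term as a resource rather than a literal (hypothetical helper)."""


class Triple(NamedTuple):
    s: object
    p: object
    o: object


EUROSTAT = "db:eurostat/"

# Made-up sample triples; the numbers are placeholders, not real Eurostat values.
triples = [
    Triple(URI("db:regions/Lorraine"), URI("rdf:type"), URI("db:eurostat/regions")),
    Triple(URI("db:regions/Lorraine"), URI(EUROSTAT + "parentcountry"), URI("db:countries/France")),
    Triple(URI("db:countries/France"), URI("rdf:type"), URI("db:eurostat/countries")),
    Triple(URI("db:countries/France"), URI(EUROSTAT + "population_total"), 1000),
    Triple(URI("db:countries/France"), URI(EUROSTAT + "death_rate"), 1.5),
    Triple(URI("db:countries/France"), URI(EUROSTAT + "name"), "France"),
]


def distinct_predicates(data):
    """Rough analogue of SELECT DISTINCT ?p { ?s ?p ?o }."""
    return sorted({t.p for t in data})


def predicates_with_uri_objects(data):
    """Rough analogue of adding FILTER(isURI(?o)) to the query above."""
    return sorted({t.p for t in data if isinstance(t.o, URI)})


def literal_types_by_predicate(data):
    """Rough analogue of probing datatype(?o): predicate -> set of literal type names."""
    result = {}
    for t in data:
        if not isinstance(t.o, URI):
            result.setdefault(t.p, set()).add(type(t.o).__name__)
    return result
```

Against a real endpoint the same questions would of course be asked in SPARQL; the point of the sketch is only that each step narrows the previous query with one more filter.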

The 2000 U.S. Census

Joshua Tauberer converted the 2000 U.S. Census Data into 1 billion RDF triples. He provides a well-documented Perl script that can convert various subsets of the census data into N3. One mode that this script can be run in is to output the schema from SAS table layout files. Joshua's about page provides an overview of the data. In particular, I note that he is working with tables that are multiple levels deep (e.g. population by sex and then by age).

The most useful part of the documentation, though, is the writeup specifically about modeling the census data in RDF. In general, Joshua models nested levels of statistical tables (representing multiple facets of the data) as a chain of predicates (with the interim nodes as blank nodes). If a particular criterion is further subdivided, then the aggregate total at that level is linked with rdf:value. Otherwise, the value is given as the object itself. Note that the subjects are not real-world entities ("the U.S.") but instead are data tables ("the U.S. census tables"). The entities themselves are related to the data tables via a details predicate. The excerpt below combines both types of information (the entity itself followed by the data tables about the entity):

 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
 @prefix dc: <http://purl.org/dc/elements/1.1/> .
 @prefix dcterms: <http://purl.org/dc/terms/> .
 @prefix : <tag:govshare.info,2005:rdf/census/details/100pct> .
 @prefix politico: <http://www.rdfabout.com/rdf/schema/politico/> .
 @prefix census: <http://www.rdfabout.com/rdf/schema/census/> .

 <http://www.rdfabout.com/rdf/usgov/geo/us>
   a politico:country ;
   dc:title "United States" ;
   census:households 115904641 ;
   census:waterarea "664706489036 m^2" ;
   census:population 281421906 ;
   census:details <http://www.rdfabout.com/rdf/usgov/geo/us/censustables> ;
   dcterms:hasPart <http://www.rdfabout.com/rdf/usgov/geo/us/al>, <http://www.rdfabout.com/rdf/usgov/geo/us/az>, ...
 .

 <http://www.rdfabout.com/rdf/usgov/geo/us/censustables>  :totalPopulation 281421906 ;     # P001001
   :totalPopulation [
      dc:title "URBAN AND RURAL (P002001)";
      rdf:value 281421906 ;   # P002001
      :urban [
         rdf:value 222360539 ;  # P002002
         :insideUrbanizedAreas 192323824 ;   # P002003
         :insideUrbanClusters 30036715 ;     # P002004
      ] ;
      :rural 59061367 ;   # P002005
   ] ;
   :totalPopulation [
     dc:title "RACE (P003001)";
     rdf:value 281421906 ;   # P003001
     :populationOfOneRace [
       rdf:value 274595678 ;    # P003002
       :whiteAlone 211460626 ;     # P003003
       :blackOrAfricanAmericanAlone 34658190 ;     # P003004
       :americanIndianAndAlaskaNativeAlone 2475956 ;   # P003005
     ]
 ...

This is an inconsistent modeling (which Joshua admits himself in the description). Note for instance how :totalPopulation > :urban has a rdf:value link to the aggregate US urban population. When you go one level deeper though, :totalPopulation > :urban > :insideUrbanizedAreas has an object which is itself the value of that statistic.

As I see it, this inconsistency could be avoided in two ways:

  1. Always insist that a statistic hangs off of a resource (URI or blank node) via the rdf:value predicate.
  2. Allow a criterion/classification predicate to point both to a literal (aggregate) value, and also to further subdivisions. This would allow the above example to have a triple which was :totalPopulation > :urban > 222360539 in addition to the further nested :totalPopulation > :urban > :insideUrbanizedAreas > 192323824.

The second approach seems simpler to me (fewer triples). It can be queried with an isLiteral filter restriction. The first approach might make for a slightly simpler query, as it would always just query for rdf:value. (The queries would be about the same size, but the rdf:value approach is a bit clearer to read than the isLiteral filter approach.)
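
To make the comparison concrete, here is a minimal sketch of both options using nested Python structures in place of RDF graphs. The figures come from the census excerpt above; the dictionary layout and the two helper functions are assumptions made purely for illustration, with the `isinstance` check standing in for an isLiteral filter.

```python
RDF_VALUE = "rdf:value"

# Approach 1: every statistic hangs off a node via rdf:value.
approach_1 = {
    ":urban": {
        RDF_VALUE: 222360539,
        ":insideUrbanizedAreas": {RDF_VALUE: 192323824},
        ":insideUrbanClusters": {RDF_VALUE: 30036715},
    },
    ":rural": {RDF_VALUE: 59061367},
}

# Approach 2: a predicate may have several objects, namely a literal
# aggregate and/or a node holding the further subdivisions.
approach_2 = {
    ":urban": [
        222360539,
        {":insideUrbanizedAreas": [192323824], ":insideUrbanClusters": [30036715]},
    ],
    ":rural": [59061367],
}


def value_via_rdf_value(node, path):
    """Approach 1: walk the predicate chain, then always read rdf:value."""
    for predicate in path:
        node = node[predicate]
    return node[RDF_VALUE]


def value_via_literal_filter(node, path):
    """Approach 2: walk the chain through node objects, then pick the literal."""
    *parents, last = path
    for predicate in parents:
        node = next(o for o in node[predicate] if isinstance(o, dict))
    return next(o for o in node[last] if not isinstance(o, dict))
```

Both lookups take the same predicate path; the difference is only whether the last hop reads rdf:value or filters for a literal object.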

As an aside, this statement from Joshua is a telling factor on the value of what we are doing with the U.S. Statistical Abstract data:

(If you followed Region > households > nonFamilyHouseholds you would get the number of households, not people, that are nonFamilyHouseHolds. To know what a "non-family household" is, you would have to consult the PDFs published by the Census.)

Riese: RDFizing and Interlinking the EuroStat Data Set Effort

Riese is another effort to convert the EuroStat data to RDF. It seeks to expand on the coverage of the D2R effort. Project discussion is available on an ESW wiki page, but the main details of the effort are on the project's about page. Currently, riese only provides five million out of the three billion triples that it seeks to provide.

The under the hood section of the about page links to the riese schema. (Note: this is a simple RDF schema; no OWL in sight.) The schema models statistics as items that link to times, datasets, dimensions, geo information, and a value (using rdf:value).

Every statistical data item is a riese:item. riese:items are qualified with riese:dimensions, one of which is, in particular, dimension:Time.

The "ask" page gives two sample queries over the EuroStat RDF data, but those only deal in the datasets. RDF can be retrieved for the various Riese tables and data items by appending /content.rdf to the items' URIs and doing an HTTP GET. Here's an example of some of the RDF for a particular data item (this is not strictly legal Turtle, but you'll get the point):

@prefix : <http://riese.joanneum.at/data/> .
@prefix riese: <http://riese.joanneum.at/schema/core#> .
@prefix dim: <http://riese.joanneum.at/dimension/> .
@prefix dim-schema: <http://riese.joanneum.at/schema/dimension/> .

:bp010 a riese:dataset ;
  # all dc:title's repeated as rdfs:label
  dc:title "Current account - monthly: Total" ;
  riese:data_start "2002m10" ; # proprietary format?
  riese:data_end   "2007m09" ;
  riese:structure  "geo\time" ; # not sure of this format
  riese:datasetOf :bp010/2007m03_ea .

:bp010/2007m03_ea a riese:Item ;
  dc:title "Table: bp010, dimensions: ea, time: 2007m03" ;
  rdf:value "7093" ; # not typed
  riese:dimension dim:geo/ea ;
  riese:dimension dim:time/2007m03 ;
  riese:dataset :bp010 .

dim:geo/ea a dim-schema:Geo ;
  dc:title "Euro area (EA11-2000, EA12-2006, EA13-2007, EA15)" .

dim:time/2007m03 a dim-schema:Time ;
  dc:title "" . # oops

dim-schema:Geo rdfs:subClassOf riese:Dimension ; dc:title "Geo" .
dim-schema:Time rdfs:subClassOf riese:Dimension ; dc:title "Time" .

(A lot of this is available in dic.nt (39 MB).)

Summary

In summary, these three examples show three distinct approaches for modeling statistics:

  1. Simple, point-in-time statistics. Predicates that fully describe each statistic relate a (geographic, in this case) entity to the statistic's value. There's no way to represent time (or other dimensions) in this model other than to create a new predicate for every combination of dimensions (e.g. country:bolivia stat:1990population18-30male 123456). Queries are flat and rely on knowledge of or metadata (e.g. rdfs:label) about the predicates. There is no easy way to generate tables of related values. Observation: this approach effectively builds a model of the real world, ignoring statistical artifacts such as time, tables, and subtables.
  2. Complex, point-in-time statistics. An initial predicate relates a (geographic, in this case) entity to both an aggregate value for the statistic, as well as to (via blank nodes) other predicates that represent dimensions. Aggregate values are available off of any point in the predicate chain. Applications need to be aware of the hierarchical predicate structure of the statistics for queries, but can reuse (and therefore link) some predicates amongst different statistics. Nested tables can easily be constructed from this model. Observation: this approach effectively builds a model of the statistical domain in question (demographics, geography, economics, etc. as broken into statistical tables).
  3. Complex statistics over time. Each statistic (each number) is represented as an item with a value. Dimensions (including time) are also described as resources with values, titles, etc. In this approach, the entire model is described by a small number of predicates. Applications can flexibly query for different combinations of time and other dimensions, though they still must know the identifying information for the dimensions in which they are interested. Applications can fairly easily construct nested tables from this model. Observation: this approach effectively uses a model of statistics (in general) which in turn is used to express statistics about the domains in question.

Statistical Abstract data

Simple with time

One of the simplest data tables in the Statistical Abstract gives statistics for airline on-time arrivals and departures. A sample of how this table is laid out is:

                              On-time Arrivals     On-time Departures
Airport                       2006 Q1   2006 Q2    2006 Q1   2006 Q2
Total major airports          77.0      76.7       79.0      78.5
Atlanta, Hartsfield           73.9      75.5       76.0      74.3
Boston, Logan International   75.6      66.8       80.5      74.8

Overall, this is fairly simple. Every airport, for each time period, has an on-time arrival percentage and an on-time departure percentage. If we simplified it even further by removing the use of multiple times, then it's just a simple grid spreadsheet (relating airports to arrival % and departure %). This does have the interesting (?) twist that the aggregate data (total major airports) is not simply a sum of the constituent data items (since we're dealing in percentages).
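
The point about percentages can be seen with a little arithmetic. In the sketch below, the two airports and their flight counts are entirely hypothetical (the Statistical Abstract table shown here does not give flight counts); it only illustrates why an aggregate percentage generally equals neither the sum nor the plain average of the per-airport percentages.

```python
def on_time_percentage(on_time_flights, total_flights):
    """Percentage of flights that were on time."""
    return 100.0 * on_time_flights / total_flights


# Hypothetical (on-time flights, total flights) per airport.
flights = {
    "airport-a": (900, 1000),
    "airport-b": (100, 200),
}

per_airport = {name: on_time_percentage(*counts) for name, counts in flights.items()}

# Plain average of the two percentages.
simple_mean = sum(per_airport.values()) / len(per_airport)

# Aggregate percentage, computed from the pooled flight counts.
aggregate = on_time_percentage(
    sum(on_time for on_time, _ in flights.values()),
    sum(total for _, total in flights.values()),
)
```

Here the plain average is 70.0 while the pooled figure is roughly 83.3, because the busier airport carries more weight. A model that wants to recompute the total row would therefore need the underlying counts, not just the percentages.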

Simple point-in-time approach

If we ignore time (and choose 2006 Q1 as our point in time), then this data models as:

 ex:ATL ex:ontime-arrivals 73.9 ; ex:ontime-departures 76.0 .
 ex:BOS ex:ontime-arrivals 75.6 ; ex:ontime-departures 80.5 .
 ex:us-major-airports ex:ontime-arrivals 77.0 ; ex:ontime-departures 79.0 .

This is simple, but ignores time. It also doesn't give any hint that ex:us-major-airports is a total/aggregate of the other data. We could encode time in the predicates themselves (ex:ontime-arrivals-2006-q1), but I think everyone would agree that that's a bad idea. We could also let each time range be a blank node off the subjects, but that assumes all subjects have data conforming to the same time increments. Any such approach starts to get close to the complex point-in-time approach, so let's look at that.

Complex point-in-time approach

If we ignore time and view the "total major airports" as unrelated to the individual airports, then we have no "nested tables" and this approach degenerates to the simple point-in-time approach, effectively:

 ex:ATL a ex:Airport ;
   dcterms:isPartOf ex:us-major-airports ;
   stat:details [
     ex:on-time-arrivals 73.9 ;
     ex:on-time-departures 76.0
   ] .
 ex:BOS a ex:Airport ;
   dcterms:isPartOf ex:us-major-airports ;
   stat:details [
     ex:on-time-arrivals 75.6 ;
     ex:on-time-departures 80.5
   ] .
 ex:us-major-airports
   dcterms:hasPart ex:ATL, ex:BOS ;
   stat:details [
     ex:on-time-arrivals 77.0 ;
     ex:on-time-departures 79.0 ;
   ] .    

We could treat time as a special-case that conditionalizes the statistics (stat:details) for any particular subject, such as:

 ex:ATL a ex:Airport ;
   dcterms:isPartOf ex:us-major-airports ;
   stat:details [
     stat:start "2006-01-01"^^xsd:date ;
     stat:end   "2006-03-31"^^xsd:date ;
     stat:details [
       ex:on-time-arrivals 73.9 ;
       ex:on-time-departures 76.0
     ]
   ] .

If we ignore time but view the "total major airports" statistics as an aggregate of the individual airports (which are subtables, then), we get this RDF structure:

 ex:us-major-airports
   ex:on-time-arrivals 77.0 ;
   ex:on-time-departures 79.0 ;
   ex:ATL [
     ex:on-time-arrivals 73.9 ;
     ex:on-time-departures 76.0
   ] ;
   ex:BOS [
     ex:on-time-arrivals 75.6 ;
     ex:on-time-departures 80.5
   ] .

This is interesting because it treats the individual airports as subtables of the dataset. I don't think it's really a great way to model the data, however.

Complex Statistics Over Time

 ex:ontime-flights a stat:Dataset ;
   dc:title "On-time Flight Arrivals and Departures at Major U.S. Airports: 2006" ;
   stat:date_start "2006-01-01"^^xsd:date ;
   stat:date_end "2006-12-31"^^xsd:date ;
   stat:structure "... something that explains how to display the stats ? ..." ;
   stat:datasetOf ex:atl-arr-2006q1, ex:atl-dep-2006q1, ... .
 
 ex:atl-arr-2006q1 a stat:Item ;
   rdf:value 73.9 ;
   stat:dataset ex:ontime-flights ;
   stat:dimension ex:Q12006 ;
   stat:dimension ex:arrivals ;
   stat:dimension ex:ATL .
 
 ex:atl-dep-2006q1 a stat:Item ;
   rdf:value 76.0 ;
   stat:dataset ex:ontime-flights ;
   stat:dimension ex:Q12006 ;
   stat:dimension ex:departures ;
   stat:dimension ex:ATL .
 
 ... more data items ...
 
 ex:Q12006 a stat:TimePeriod ;
   dc:title "2006 Q1" ;
   stat:date_start "2006-01-01"^^xsd:date ;
   stat:date_end "2006-03-31"^^xsd:date .
 
 ex:arrivals a stat:ScheduledFlightTime ;
   dc:title "Arrival time" .
 
 ex:departures a stat:ScheduledFlightTime ;
   dc:title "Departure time" .
 
 ex:ATL a stat:Airport ;
   dc:title "Atlanta, Hartsfield" .
 
 ... more dimension values ...
 
 stat:TimePeriod rdfs:subClassOf stat:Dimension ; dc:title "time period" .
 stat:ScheduledFlightTime rdfs:subClassOf stat:Dimension ; dc:title "arrival or departure" .
 stat:Airport rdfs:subClassOf stat:Dimension ; dc:title "airport" .

First, this seems to be the most verbose. It also seems to give the greatest flexibility in terms of modeling time and querying the resulting data. One related alternative to this approach would replace dimension objects with dimension predicates, as in:

 ex:atl-arr-2006q1 a stat:Item ;
   rdf:value 73.9 ;
   stat:dataset ex:ontime-flights ;
   stat:date_start "2006-01-01"^^xsd:date ;
   stat:date_end "2006-03-31"^^xsd:date ;
   stat:airport ex:ATL ;
   stat:scheduled-flight-time ex:arrivals .
 
 stat:airport rdfs:subPropertyOf stat:dimension ; dc:title "airport " .

This may be a bit less verbose, but loses the ability to have multivalued dimensions such as stat:TimePeriod in the first example.
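
To illustrate the kind of flexible querying the item-and-dimensions model allows, here is a minimal in-memory sketch in Python. The values come from the airline table above; the `Item` class, the `select` and `pivot` helpers, and the identifier ex:Q22006 are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One statistic: a value plus the set of dimension values that qualify it."""
    value: float
    dimensions: frozenset


items = [
    Item(73.9, frozenset({"ex:ATL", "ex:arrivals", "ex:Q12006"})),
    Item(76.0, frozenset({"ex:ATL", "ex:departures", "ex:Q12006"})),
    Item(75.5, frozenset({"ex:ATL", "ex:arrivals", "ex:Q22006"})),
    Item(74.3, frozenset({"ex:ATL", "ex:departures", "ex:Q22006"})),
    Item(75.6, frozenset({"ex:BOS", "ex:arrivals", "ex:Q12006"})),
    Item(80.5, frozenset({"ex:BOS", "ex:departures", "ex:Q12006"})),
]


def select(data, *wanted):
    """Return the items qualified by every one of the wanted dimension values."""
    wanted_set = frozenset(wanted)
    return [item for item in data if wanted_set <= item.dimensions]


def pivot(data, rows, cols, *fixed):
    """Build a nested table {row: {col: value}} with the remaining dimensions fixed."""
    table = {}
    for row in rows:
        for col in cols:
            matches = select(data, row, col, *fixed)
            if matches:
                table.setdefault(row, {})[col] = matches[0].value
    return table


arrivals_by_airport_and_quarter = pivot(
    items, ["ex:ATL", "ex:BOS"], ["ex:Q12006", "ex:Q22006"], "ex:arrivals"
)
```

Any dimension can serve as a row, a column, or a fixed filter, which is what makes it fairly easy to rebuild the original table layout (or a different one) from the same items.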

Conclusion

The riese approach seems the best combination of flexibility and usability. It should allow us to recreate the data-table structures with a reasonable degree of fidelity in another environment (e.g. on the Web), as well as to construct a basic semantic repository by attaching definitions to the various statistical entities, facets, and properties. All that said, the proof is in the pudding, and I'm quite open to other suggestions.

February 27, 2008

Anzo.*: Building Semantic Applications in Heterogeneous Environments

At Cambridge Semantics we're busy working on what will become version 3 of Open Anzo. As I've written about before, our interest in Semantic Web technologies lies in the powerful applications that can be built by taking advantage of RDF's data model. To this end, we've continually sought RDF programming models that contain features necessary to building these applications:

  • Named graphs (quads) support, for modularizing applications' data
  • Replication, for offline applications and snappy user experience
  • Notification, for real-time collaborative updates
  • Role-based access control, to facilitate a multi-user environment
  • Versioning, to maintain an auditable history of data changes

To promote a consistent development experience between the various environments that we support--Java development, Web development, Windows development--we've worked to define a core set of abstract, client-side APIs (documentation is currently sound but not complete) for building semantic applications that can take advantage of these enterprise features. Currently, we have three concrete instantiations of this API: Anzo.java, Anzo.js, and Anzo.NET. Version 3 of Anzo includes many other architectural improvements intended to help us realize Anzo's status as an open-source semantic middleware platform, and we're not done yet. We do our best to keep the latest version of the code in subversion stable, however, so feel free to check it out. The mailing list is a great place to ask questions. As we get closer to a formal release of Anzo 3, we'll have more code samples, tutorials, and demos to share, so stay tuned...

January 25, 2008

Why SPARQL?

I'm quite pleased to have played a part in helping SPARQL become a W3C Recommendation. As we were putting together the press release that accompanied the publication of the SPARQL recommendations, Ian Jacobs, Ivan Herman, Tim Berners-Lee, and I put together some comments (in bullet point form) explaining some of the benefits of SPARQL. They do a good job of capturing a lot of what I find appealing about SPARQL, and I wanted to share them with other people. I don't think these are the best examples of SPARQL's value or the most eloquently expressed, but I do think they capture a lot of the essence of SPARQL. (While some of the text is attributable to me, parts are attributable to Ian, Ivan, and Tim.)


  • SPARQL is to the Semantic Web (and, really, the Web in general) what SQL is to relational databases. (This is effectively Tim's quotation from the press release.)
  • If we view the Semantic Web as a global collection of databases, SPARQL can make the collection look like one big database. SPARQL enables us to reap the benefits of federation. Examples:
    • Federating information from multiple Web sites (mashups)
    • Federating information from multiple enterprise databases (e.g. manufacturing and customer orders and shipping systems)
    • Federating information between internal and external systems (e.g. for outsourcing, public Web databases (e.g. NCBI), supply-chain partners)
  • There are many distinct database technologies in use, and it's of course impossible to dictate a single database technology at the scale of the Web. RDF (the Semantic Web data model), though, serves as a standard lingua franca (least common denominator) in which data from disparate database systems can be represented. SPARQL, then, is the query language for that data. As such, SPARQL hides the details of a server's particular data management and structure. This reduces costs and increases the robustness of software that issues queries.
  • SPARQL saves development time and cost by allowing client applications to work with only the data they're interested in. (This is as opposed to bringing it all down and spending time and money writing software to extract the relevant bits of information.)
    • Example: Find US cities' population, area, and mass transit (bus) fare, in order to determine if there is a relationship between population density and public transportation costs.
    • Without SPARQL, you might tackle this by writing a first query to pull information from cities' pages on Wikipedia, a second query to retrieve mass transit data from another source, and then code to extract the population and area and bus fare data for each city.
    • With SPARQL, this application can be accomplished by writing a single SPARQL query that federates the appropriate data sources. The application developer need only write a single query and no additional code.
  • SPARQL builds on other standards including RDF, XML, HTTP, and WSDL. This allows reuse of existing software tooling and promotes good interoperability with other software systems. Examples:
    • SPARQL results are expressed in XML: XSLT can be used to generate friendly query result displays for the Web
    • It's easy to issue SPARQL queries, given the abundance of HTTP library support in Perl, Python, php, Ruby, etc.
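
As a small illustration of that last point, here is a sketch of preparing a SPARQL query as an HTTP GET request with nothing but Python's standard library. The endpoint URL is a placeholder assumption, not a real service; the `query` parameter name and the `application/sparql-results+xml` media type are the ones defined by the SPARQL protocol and results format.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint, assumed for illustration only.
ENDPOINT = "http://example.com/sparql"


def build_sparql_request(endpoint, query):
    """Prepare (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


request = build_sparql_request(ENDPOINT, "SELECT DISTINCT ?p { ?s ?p ?o }")
```

Passing the prepared request to `urllib.request.urlopen` would send it and return the XML results, which could then be handed to an XML parser or an XSLT processor.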

Finally, I scribbled down some of my own thoughts on how SPARQL takes the appealing principles of a Service Oriented Architecture (SOA) one step further:

  • With SOA, the idea is to move away from tightly-coupled client-server applications in which all of the client code needs to be written specifically for the server code and vice versa. SOA says that if instead we just agree on service interfaces (contracts) then we can develop and maintain services and clients that adhere to these interfaces separately (and therefore more cheaply, scalably, and robustly).
  • SPARQL takes some of this one step further. For SOA to work, services (people publishing data) still have to define a service, a set of operations that they'll use to let others get at their information. And someone writing a client application against such a service needs to adhere to the operations in the service. If a service has 5 operations that return various bits of related data and a client application wants some data from a few of those operations but doesn't want most of it, the developer still must invoke all 5 operations and then write the logic to extract and join the data relevant for her application. This makes for marginally complex software development (and complex == costly, of course).
  • With SPARQL, a service-provider/data-publisher simply provides one service: SPARQL. Since it's a query language accessible over a standard protocol (HTTP), SPARQL can be considered a 'universal service'. Instead of the data publisher choosing a limited number of operations to support a priori and client applications being forced to conform to these operations, the client application can ask precisely the questions it wants to retrieve precisely the information it needs. Instead of 5 service invocations + extra logic to extract and join data, the client developer need only author a single SPARQL query. This makes for a simpler application (and, of course, less costly).

As an example, consider an online book merchant. Suppose I want to create a Web site that finds books by my favorite author that are selling for less than $15, including shipping. The merchant supplies three relevant services:

  1. Search. Includes search by author. Returns book identifiers.
  2. Book lookup. Takes a book identifier and returns the title, price, abstract, shipping weight, etc.
  3. Shipping lookup. Takes total order weight, shipping method, and zip code, and returns a shipping cost.

To create my Web site without SPARQL, I'd need to:

  1. Invoke the search service. (Query 1)
  2. Write code to extract the result identifiers and, for each one, invoke the book lookup service. (Code 1, Query 2 (issued multiple times))
  3. Write code to extract the price and, for each book, invoke the shipping lookup service with that book's weight. (Code 2, Query 3 (issued multiple times))
  4. Write code to add each book's price and shipping cost and check if it's less than $15. (Code 3)
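The four steps above can be sketched in JavaScript roughly as follows. Everything here is an assumption made for illustration: the three service functions are in-memory stand-ins for the merchant's services, the sample books are invented, and the shipping lookup is simplified to take only a weight.

```javascript
// Hypothetical in-memory stand-ins for the merchant's three services.
const books = {
  b1: { title: "Cheap Paperback", author: "My favorite Author", price: 9.0, weight: 1 },
  b2: { title: "Heavy Hardcover", author: "My favorite Author", price: 12.0, weight: 5 },
  b3: { title: "Someone Else's Book", author: "Another Author", price: 5.0, weight: 1 }
};

// Service 1: search by author, returns book identifiers.
function searchByAuthor(author) {
  return Object.keys(books).filter(id => books[id].author === author);
}

// Service 2: book lookup by identifier.
function lookupBook(id) {
  const { title, price, weight } = books[id];
  return { id, title, price, weight };
}

// Service 3: shipping lookup (simplified: an invented rate based on weight only).
function lookupShipping(weight) {
  return 2 + weight;
}

function cheapBooksBy(author, limit) {
  const ids = searchByAuthor(author);                          // Query 1
  const details = ids.map(lookupBook);                         // Code 1, Query 2 (once per book)
  return details
    .map(b => ({ ...b, shipping: lookupShipping(b.weight) }))  // Code 2, Query 3 (once per book)
    .filter(b => b.price + b.shipping < limit)                 // Code 3
    .map(b => b.title);
}
```

Even in this toy form, the client owns all of the extraction and joining logic, which is the cost the SPARQL approach tries to remove.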

Now, suppose the book merchant exposed this same data via a SPARQL endpoint. The new approach is:

  1. Use the SPARQL protocol to ask a SPARQL query with all the relevant parameters (Query 1 (issued once))

For the record, the query might look something like:

PREFIX : <http://example.com/service/sparql/>
SELECT ?book ?title
  FROM :inventory
 WHERE {
  ?book 
    a :book ; :author ?author ; 
    :title ?title ; :price ?price ;
    :weight ?weight .
  ?author :name "My favorite Author" .
  FILTER(?price + :shipping(?weight) < 15) .
}

(This example also illustrates another feature of SPARQL: SPARQL is extensible via the use of new FILTER functions that can allow a query to invoke operations (in this case, a function (:shipping) that gives shipping cost for a particular order weight) defined by the SPARQL endpoint.)
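For a sense of what "asking" the query over HTTP involves, here is a minimal sketch of building a SPARQL protocol GET request URL, in which the query text travels URL-encoded in the query parameter. The endpoint URL and the helper name sparqlQueryUrl are assumptions for illustration only.

```javascript
// Build a SPARQL protocol GET URL: the query is sent, URL-encoded,
// in the "query" parameter. The endpoint used below is hypothetical.
function sparqlQueryUrl(endpoint, query) {
  const separator = endpoint.includes("?") ? "&" : "?";
  return endpoint + separator + "query=" + encodeURIComponent(query);
}

const query = [
  "PREFIX : <http://example.com/service/sparql/>",
  "SELECT ?book ?title",
  "  FROM :inventory",
  " WHERE {",
  "  ?book a :book ; :author ?author ;",
  "    :title ?title ; :price ?price ;",
  "    :weight ?weight .",
  '  ?author :name "My favorite Author" .',
  "  FILTER(?price + :shipping(?weight) < 15) .",
  "}"
].join("\n");

const url = sparqlQueryUrl("http://example.com/service/sparql", query);
```

The resulting URL can be fetched with any HTTP client; what comes back depends on the result formats the endpoint supports.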

December 20, 2007

Scientific American: "The Semantic Web in Action"

I'm pleased to write that the December 2007 issue of Scientific American contains an article titled "The Semantic Web in Action", coauthored by Ivan Herman, Tonya Hongsermeier, Eric Neumann, Susie Stephens, and myself.

We were invited to write the article as a follow-up to the original 2001 Scientific American Semantic Web article by Tim Berners-Lee, Jim Hendler, and Ora Lassila. We wanted to share some practical examples of problems currently being solved with Semantic Web technologies, particularly in health care and life sciences. The article presents two detailed case studies. The first is the work of a team at Cincinnati Children's Hospital Medical Center who use RDF in conjunction with PageRank-esque algorithms to prioritize potential drug targets for cardiovascular diseases. The second case focuses on the University of Texas Health Science Center's SAPPHIRE system. SAPPHIRE integrates information from various health care providers to allow public health officials to better assess potential emerging public health risks and disease epidemics. The article also talks about the potential for Semantic Web technologies and the work of companies such as Agfa and Partners to help health care providers deal with the rate of knowledge acquisition and change in their clinical decision support (CDS) systems.

Aside from these case studies, the article takes somewhat of a whirlwind tour across the current landscape of Semantic Web applications. Along the way, RDF, OWL, SPARQL, GRDDL, and FOAF all get mentions. Science Commons and DBpedia are briefly touched on, and the article acknowledges a variety of companies that are engaged in Semantic Web application research, prototyping, or deployment: British Telecom, Boeing, Chevron, MITRE, Ordnance Survey, Vodafone, Harper's Magazine, Joost, IBM, Hewlett-Packard, Nokia, Oracle, Adobe, Aduna, Altova, @semantics, Talis, OpenLink, TopQuadrant, Software AG, Eli Lilly, Pfizer, Garlik. And there were loads that couldn't be included in the end due to space restrictions, all of which is a testament to the continued growth in adoption of these technologies.

Unfortunately, the article is not currently available for free online. An electronic version is available (along with the rest of the December 2007 issue) from Scientific American's Web site for US$7.95, and the issue should also be available at newsstands in the US for a bit longer. I'm not sure when/if the article is available on newsstands across the rest of the world. I've been working with the copyright editors at Scientific American in an attempt to procure the rights to publish the article on my own Web site (and/or possibly on the W3C's site), but they haven't yet responded to my application.

In any case, it was a fantastic experience working with my colleagues to bring some information on the progress of the Semantic Web to the readers of Scientific American. I've gotten some great feedback from family, friends, and colleagues who have read the article. Several people in the Semantic Web community have let me know that they've found the article to be useful material for helping introduce people to the ideas and applications behind Semantic Web technologies. So please check out the article if you're so inclined, and I'd love to hear what you think. I'll also be sure to update this space if I'm able to secure the rights to publish the full text of the article here.

26-Mar-2008 Update: I've since received permission to publish the article. Enjoy!

October 24, 2007

Announcing: Open Anzo 2.5 released

As promised, the Open Anzo project has released version 2.5 of the Anzo enterprise RDF store. Version 2.5 is a stable release with a collection of bug fixes and new features since the fork from Boca. The release notes enumerate the additions, improvements, and changes, but here are some of the more significant ones:

  • Add Oracle database support
  • Add GROUP BY clause and COUNT(*) to Glitter SPARQL engine (more on this in a separate post, but along the lines of what exists in ARQ, Virtuoso, and RAP)
  • Query performance improvements against both named graphs and metadata graphs
  • Extensive Javadocs for all public classes, interfaces, methods, and member variables

Things you can do:

  • Download and install Open Anzo: release 2.5, nightly snapshots, or the source from SVN
  • Learn from the Open Anzo wiki
  • View open tickets showing some of what's coming
  • Join the Open Anzo development community
  • Peruse the Anzo 2.5 Javadocs

October 14, 2007

Introducing: Cambridge Semantics and the Open Anzo project

It's been a while since I last posted here to muse about the differences between "the Semantic Web" and "Semantic Web technologies". Since then, I've been quite pleased to see the Linking Open Data project continue to soar, including an extremely successful BoF and panel at WWW 2007 in Banff. New data sources continue to be linked in to the Semantic Web, including data from Wikicompany, flickr, and GovTrack. The project maintains a list and a picture of the growing Web of linked open data.

Meanwhile, I have not been idle in my work to advance Semantic Web technologies inside enterprises. In July, I left IBM and co-founded Cambridge Semantics, Inc. Building upon the work that began with the open-source IBM Semantic Layered Research Platform, Cambridge Semantics is dedicated to building feature-rich semantic middleware that can power a vast breadth of semantic applications that realize the potential of the full stack of Semantic Web technologies.

One of the first things that we've done at Cambridge Semantics is set up the Open Anzo project. Anzo is an open-source fork of Boca, an enterprise RDF store. Anzo starts with the same rich feature set of Boca, including named graphs, replication, notification, access controls, and full revision histories. To this, Anzo (so far) adds a number of bug fixes and support for running on top of an Oracle RDBMS. There's a new release of Anzo coming quite soon, and we're quite excited about some of the current and future development going on for Anzo. To learn more, feel free to join the Open Anzo discussion group, check out the wiki, or download the source or a nightly build. We're also actively looking for like-minded folk to work with us to enhance and improve Anzo and to expand the scope of the project. Let me know if you might be interested in sponsoring, using, or contributing to Anzo.

I'll have a lot more to share about our team, our vision, and our software in the coming weeks and months. It's an exciting time for me personally, but even more so for the promise of the Semantic Web and Semantic Web technologies. I'm glad to be blogging once more, and look forward to having more to say.

April 22, 2007

QotD: Word Choice

Danny picked up an interesting take on the foes of the Semantic Web from Morten Frederiksen. I was surfing that way today and noticed this gem in the latest comment from Keith Alexander:

Perhaps the word that causes the trouble isn’t Semantic, but The?

I believe in an ultimate goal similar to that of Tim Berners-Lee and also that of the Linking Open Data SWEO community project. But I also see tremendous value in the adoption of Semantic Web technologies within enterprise applications and in limited, narrowly-scoped corners of the Internet and intranets. To me, it's clear that these goals are not incompatible with each other. But I do find myself constantly juggling the appropriate use of the phrases the Semantic Web and Semantic Web technologies depending on my audience. There's a lot of significance and (dare I say?) semantics in that innocent-looking three-letter word...

February 07, 2007

Updates to sparql.js

I'm not sure if anyone is using Elias's and my sparql.js JavaScript library for issuing SPARQL queries. (Probably not, given its Firefox-and-friends-only orientation and the standard cross-site XMLHttpRequest security restrictions.) Since I first blogged about the library last year, we've made a few changes to the library. Most notably, we've removed the dependency on the Yahoo! connection manager (or on any other third-party libraries, for that matter). Additionally, we've added a setRequestHeader method which passes the given headers and values along to the underlying HTTP request object. We use this functionality, for example, to provide user credentials (via HTTP Basic Auth) when SPARQLing against a Boca server.
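As a rough illustration of the Basic Auth case, the header value is just the word Basic followed by the base64 encoding of user:password. The helper below is a sketch with a hypothetical name (basicAuthHeader); it is not part of sparql.js, and it assumes an environment that provides btoa.

```javascript
// Build the value for an HTTP Basic Auth "Authorization" header.
// btoa only handles Latin-1 input, so non-ASCII credentials would need
// additional encoding that this sketch does not attempt.
function basicAuthHeader(user, password) {
  return "Basic " + btoa(user + ":" + password);
}
```

The resulting string is the kind of value one would hand to setRequestHeader along with the header name Authorization.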

The update should be transparent to any current uses of the library. Please let me know if you try it out and experience any problems.

January 19, 2007

Announcing: Boca 1.8 - new database support

While I've been writing dense treatises on Semantic Web development, Matt's been hard at work on the latest release of Boca. Matt's announcement of Boca 1.8 carries all the details as well as a look at what Boca 2.0 will bring. Amidst the usual slew of bug fixes, usability improvements, and performance fixes, the major addition to Boca is support for three new databases beyond DB2. Boca now also runs on MySQL, PostgreSQL, and HSQLDB. Cool stuff.

In other Semantic Layered Research Platform news, we're working towards pushing out stable releases (with documentation and installation packaging) of two more of our components: Queso (Atom-driven Web interface to Boca) and DDR (binary data repository with metadata-extractor infrastructure to store metadata within Boca). We're hoping to get these out by the middle of February, so stay tuned.

January 18, 2007

Using RDF on the Web: A Vision

(This is the second part of two posts about using RDF on the Web. The first post was a survey of approaches for creating RDF-data-driven Web applications.) All existing implementations referred to in this post are discussed in more detail and linked to in part one.

Here's what I would like to see, along with some thoughts on what is or is not implemented. It's by no means a complete solution and there are plenty of unanswered questions. I'd also never claim that it's the right solution for all or most applications. But I think it has a certain elegance and power that would make developing certain types of Web applications straightforward, quick, and enjoyable. Whenever I refer to "the application" or "the app", I'm talking about a browser-based Web application implemented in JavaScript.

  • To begin with, I imagine servers around the Web storing domain-specific RDF data. This could be actual, materialized RDF data or virtual RDF views of underlying data in other formats. This first piece of the vision is, of course, widely implemented (e.g. Jena, Sesame, Boca, Oracle, Virtuoso, etc.)

  • The application fetches RDF from such a server. This may be done in a variety of ways:

    • An HTTP GET request for a particular RDF/XML or Turtle document
    • An HTTP GET request for a particular named graph within a quad store (a la Boca or Sesame)
    • A SPARQL CONSTRUCT query extracting and transforming the pieces of the domain-specific data that are most relevant to the application
    • A SPARQL DESCRIBE query requesting RDF about a particular resource (URI)

    In my mind, the CONSTRUCT approach is the most appealing method here: it allows the application to massage data which it may be receiving from multiple data sources into a single domain-specific RDF model that can be as close as possible to the application's own view of the world. In other words, reading the RDF via a query effectively allows the application to define its own API.

    Once again, the software for this step already exists via traditional Web servers and SPARQL protocol endpoints.

  • Second, the application must parse the RDF into a client-side model. Precisely how this is done depends on the form taken by the RDF received from the server:

    • The server returns RDF/XML. In this case, the client can use Jim Ley's parser to end up with a list of triples representing the RDF graph. The software to do this is already implemented.
    • The server returns Turtle. In this case, the client can use Masahide Kanzaki's parser to end up with a list of triples representing the RDF graph. The software to do this is already implemented.
    • The server returns RDF/JSON. In this case, the client can use Douglas Crockford's JSON parsing library (effectively a regular expression security check followed by a call to eval(...)). While the software is implemented here, the RDF/JSON standard which I've cavalierly tossed about so far does not yet exist. Here, I'm imagining a specification which defines RDF/JSON based on the common JavaScript data structure used by the above two parsers. (A bit of work probably still needs to be done if this were to become a full RDF/JSON specification, as I do not believe the current format used by the two parsers can distinguish blank node subjects from subjects with URIs.)

    In any case, we now have on the client a simple RDF graph of data specific to the domain of our application. Yet as I've said before, we'd like to make application development easier by moving away from triples at this point into data structures which more closely represent the concepts being manipulated by the application.

  • The next step, then, is to map the RDF model into an application-friendly JavaScript object model. If I understand ActiveRDF correctly (and in all fairness I've only had the chance to play with it a very limited amount), it will examine either the ontological statements or instance data within an RDF model and will generate a Ruby class hierarchy accordingly. The introduction to ActiveRDF explains the dirty-but-well-appreciated trick that is used: "Just use the part of the URI behind the last '/' or '#' and Active RDF will figure out what property you mean on its own." Of course, sometimes there will be ambiguities, clashes, or properties written to which did not already exist (with full URIs) in the instance data received; in these cases, manual intervention will be necessary. But I'd suggest that in many, many cases, applying this sort of best-effort heuristic to a domain-specific RDF model (particularly one which the application has selected specifically via a CONSTRUCT query) will result in extremely natural object hierarchies.

    None of this piece is implemented at all. I'd imagine that it would not be too difficult, following the model set forth by the ActiveRDF folks.

    Late-breaking news: Niklas Lindström, developer of the Python RDF ORM system Oort followed up on my last post and said (among other interesting things):

    I use an approach of "removing dimensions": namespaces, I18N (optionally), RDF-specific distinctions (collections vs. multiple properties) and other forms of graph traversing.

    Sounds like there would be some more simplification processes that could be adapted from Oort in addition to those adapted from ActiveRDF.

  • The main logic of the Web application (and the work of the application developer) goes here. The developer receives a domain model and can render it and attach logic to it in any way he or she sees fit. Often this will be via a traditional model-view-controller approach: this approach is facilitated by toolkits such as dojo or even via a system such as nike templates (nee microtemplates). Thus, the software to enable this meat-and-potatoes part of application development already exists.

    In the course of the user interacting with the application, certain data values change, new data values are added, and/or some data items are deleted. The application controller handles these mutations via the domain-specific object structures, without regard to any RDF model.

  • When it comes time to commit the changes (this could happen as changes occur or once the user saves/commits his or her work), standard JavaScript (i.e. a reusable library, rather than application-specific code) recognizes what has changed and maps (inverts) the objects back to the RDF model (as before, represented as arrays of triples). This inversion is probably performed by the same library that automatically generated the object structure from the RDF model in the first place. As with that piece of this puzzle, this library does not yet exist.

    Reversing the RDF ORM mapping is clearly challenging, especially when new data is added which has not been previously seen by the library. In some cases--perhaps even in most?--the application will need to provide hints to the library to help the inversion. I imagine that the system probably needs to keep an untouched deep copy of the original domain objects to allow it to find new, removed, and dirty data at this point. (An alternative would be requiring adds, deletes, and mutations to be performed via methods, but this constrains the natural use of the domain objects.)

  • Next, we determine the RDF difference between our original model and our updated model. The canonical work on RDF deltas is a design note by Tim Berners-Lee and Dan Connolly. Basically, though, an RDF diff amounts simply to a collection of triples to remove and a collection of triples to add to a graph. No (JavaScript) code yet exists to calculate RDF graph diffs, though the algorithms are widely implemented in other environments including cwm, rdf-utils, and SemVersion. We also work often with RDF diffs in Boca (when the Boca client replicates changes to a Boca server). I'd hope that this implementation experience would translate easily to a JavaScript implementation.

  • Finally, we serialize the RDF diffs and send them back to the data source. This requires two components that are not yet well-defined:

    • A serialization format for the RDF diffs. Tim and Dan's note uses the ability to quote graphs within N3 combined with a handful of predicates (diff:replacement, diff:deletion, and diff:insertion). I can also imagine a simple extension of (whatever ends up being) the RDF/JSON format to specify the triples to remove and add:
        {
          'add' : [ RDF/JSON triple structures go here ],
          'remove' : [ RDF/JSON triple structures go here ]
        }
      
    • An endpoint or protocol which accepts this RDF diff serialization. Once we've expressed the changes to our source data, of course, we need somewhere to send them. Preferably, there would be a standard protocol (à la the SPARQL Protocol) for sending these changes to a server. To my knowledge, endpoints that accept RDF diffs to update RDF data are not currently implemented. (Late-breaking addition: on my first post, Chris and Richard both pointed me to Mark Baker's work on RDF forms. While I'm not very familiar with any existing uses of this work, it looks like it might be an interesting way to describe the capabilities of an RDF update endpoint.)

    As an alternative for this step, the entire client-side RDF model could be serialized (to RDF/XML or to N-Triples or to RDF/JSON) and HTTP PUT back to an origin server. This strategy seems to make the most sense in a document-oriented system; to my knowledge this is also not currently implemented.
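To make the round trip described in the steps above a little more concrete, here is a deliberately tiny sketch: triples are mapped to a plain object using the last-segment-of-the-URI trick, the object is edited with ordinary property assignments, and the result is inverted back to triples and diffed against an untouched copy. All names (NS, toObject, toTriples, diffTriples) are hypothetical, the data is invented, and the sketch assumes a single subject with single-valued literal properties and no blank nodes.

```javascript
const NS = "http://example.org/";   // hypothetical namespace
const sourceTriples = [
  { subject: NS + "lee", predicate: NS + "city", object: "Cambridge", type: "literal" },
  { subject: NS + "lee", predicate: NS + "state", object: "MA", type: "literal" }
];

// Read side: name each property after the part of the predicate URI
// that follows the last "/" or "#".
function localName(uri) {
  return uri.substring(Math.max(uri.lastIndexOf("/"), uri.lastIndexOf("#")) + 1);
}
function toObject(subject, triples) {
  const obj = {};
  for (const t of triples) {
    if (t.subject === subject) obj[localName(t.predicate)] = t.object;
  }
  return obj;
}

// Write side: rebuild triples from the object. The namespace is the "hint"
// the application supplies so that new properties get full predicate URIs.
function toTriples(subject, obj, namespace) {
  return Object.keys(obj).map(key => ({
    subject: subject,
    predicate: namespace + key,
    object: String(obj[key]),
    type: "literal"
  }));
}

// An RDF diff: a collection of triples to remove and a collection to add.
function diffTriples(before, after) {
  const key = t => JSON.stringify([t.subject, t.predicate, t.object, t.type]);
  const beforeKeys = new Set(before.map(key));
  const afterKeys = new Set(after.map(key));
  return {
    add: after.filter(t => !beforeKeys.has(key(t))),
    remove: before.filter(t => !afterKeys.has(key(t)))
  };
}

const lee = toObject(NS + "lee", sourceTriples);
const original = JSON.parse(JSON.stringify(lee));  // untouched deep copy
lee.city = "Boston";   // changed value
lee.zip = "02110";     // property not present in the original data
delete lee.state;      // removed value

const diff = diffTriples(
  toTriples(NS + "lee", original, NS),
  toTriples(NS + "lee", lee, NS)
);
const payload = JSON.stringify(diff);  // an 'add'/'remove' structure ready to send
```

A real library would have to cope with multi-valued properties, nested resources, datatypes, and language tags, which is where the hints from the application come in.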

That's my vision, as raw and underdeveloped as it may be. There are a large number of extensions, challenges and related work that I have not yet mentioned, but which will need to be addressed when creating or working with this type of Web application. Some discussion of these is also in order.

Handling Multiple Sources of Data

To use the above Web-application-development environment to create Web 2.0-style mash-ups, most of the steps would need to be performed once per data source being integrated. This adds to the system a provenance requirement, whereby the libraries could offer the application a unified view of the domain-specific data while still maintaining links between individual data elements and their source graphs/servers/endpoints to facilitate update. When the RDF diffs are computed, they would need to be sent back to the proper origins. Also, the sample JavaScript structures that I've mentioned as a base for RDF/JSON and the RDF/JSON diff serialization would likely need to be augmented with a URI identifying the source graph of each triple. (That is, we'd end up working with a quad system, though we'd probably be able to ignore that in the object hierarchy that the application deals with.) In many cases, though, an application that reads from many data sources will write only to a single source; it does not seem particularly onerous for the application to specify a default "write-back" endpoint.
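A minimal sketch of the routing step, assuming each triple carries a source property holding the URI of the graph or endpoint it came from (that property name, like the function name groupDiffBySource, is an assumption of this sketch and not part of any existing format):

```javascript
// Split one diff into per-origin diffs. Triples without a `source`
// fall back to the application's default "write-back" endpoint.
function groupDiffBySource(diff, defaultSource) {
  const bySource = {};
  const bucket = source =>
    bySource[source] || (bySource[source] = { add: [], remove: [] });
  for (const t of diff.remove) bucket(t.source || defaultSource).remove.push(t);
  for (const t of diff.add) bucket(t.source || defaultSource).add.push(t);
  return bySource;
}
```

Each entry in the result could then be serialized and sent to its own origin.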

Inverting SPARQL CONSTRUCT Queries

An appealing part of the above system (to me, at least) is the use of CONSTRUCT queries to map origin data to a common RDF model before merging it on the client and then mapping it into a domain-specific JavaScript object structure. Such transformations, however, would make it quite difficult--if not impossible--to automatically send the proper updates back to the origin servers. We'd need a way of inverting the CONSTRUCT query which generated the triples the application has (indirectly) worked with, and while I have not given it much thought, I imagine that that is quite difficult, if not impossible.

SPARQL UPDATE

The DAWG has postponed any work on updating graphs for the initial version of SPARQL, but Max Völkel and Richard Cyganiak have started a bit of discussion on what update in SPARQL might look like (though Richard has apparently soured on the idea a bit since then). At first blush, using SPARQL to update data seems like a natural counterpart to using SPARQL to retrieve the data. However, in the vision I describe above, the application would likely need to craft a corresponding SPARQL UPDATE query for each SPARQL CONSTRUCT query that is used to retrieve the data in the first place. This would be a larger burden on the application developer, so should probably be avoided.

Related Work

I wanted to acknowledge that in several ways this whole pattern is closely related to but (in some mindset, at least) the inverse of a paradigm that Danny Ayers has floated in the past. Danny has suggested using SPARQL CONSTRUCT queries to transition from domain-specific models to domain-independent models (for example, a reporting model). Data from various sources (and disparate domains) can be merged at the domain-independent level and then (perhaps via XSLT) used to generate Web pages summarizing and analyzing the data in question. In my thoughts above, we're also using the CONSTRUCT queries to generate an agreed-upon model, but in this case we're seeking an extremely domain-specific model to make it easier for the Web-application developer to deal with RDF data (and related data from multiple sources).

Danny also wrote some related material to www-archive. It's not the same vision, but parts of it sound familiar.

Other Caveats

Updating data has security implications, of course. I haven't even begun to think about them.

Blank nodes complicate almost everything; this may be sacrilege in some circles, but in most cases I'm willing to pretend that blank nodes don't exist for my data-integration needs. Incorporating blank nodes makes the RDF/JSON structures (slightly) more complicated; it raises the question of smushing together nodes when joining various models; and it significantly complicates the process of specifying which triples to remove when serializing the RDF diffs. I'd guess that it's all doable using functional and inverse-functional properties and/or with told bnodes, but it probably requires more help from the application developer.
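As a rough idea of what smushing on an inverse-functional property might look like, the sketch below merges nodes that share a value for one such property (foaf:mbox is a common example). The function name smush is hypothetical, and the sketch handles a single property in a single pass and does not remove the duplicate triples that merging can produce.

```javascript
// Merge nodes that share a value for the given inverse-functional property
// by rewriting them to the first subject seen with that value.
function smush(triples, inverseFunctionalProperty) {
  const canonical = new Map();  // property value -> first subject seen with it
  const rename = new Map();     // subject -> canonical subject
  for (const t of triples) {
    if (t.predicate === inverseFunctionalProperty) {
      if (!canonical.has(t.object)) canonical.set(t.object, t.subject);
      rename.set(t.subject, canonical.get(t.object));
    }
  }
  const fix = id => (rename.has(id) ? rename.get(id) : id);
  return triples.map(t => ({
    ...t,
    subject: fix(t.subject),
    object: t.type === "resource" ? fix(t.object) : t.object
  }));
}
```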

I have some worries about concurrency issues for update. Again, I haven't thought about that much and I know that the Queso guys have already tackled some of those problems (as have many, many other people I'm sure), so I'm willing to assert that these issues could be overcome.

In many rich-client applications, data is retrieved incrementally in response to user-initiated actions. I don't think that this presents a problem for the above scheme, but we'd need to ensure that newly arriving data could be seamlessly incorporated not only into the RDF models but also into the object hierarchies that the application works with.

Bill de hÓra raised some questions about the feasibility of roundtripping RDF data with HTML forms a while back. There's some interesting conversation in the comments there which ties into what I've written here. That said, I don't think the problems he illustrates apply here--there's power above and beyond HTML forms in putting an extra JavaScript-based layer of code between the data entry interface (whether it be an HTML form or a more specialized Web UI) and the data update endpoint(s).


OK, that's more than enough for now. These are still ideas clearly in progress, and none of the ideas are particularly new. That said, the environment as I envision it doesn't exist, and I suppose I'm claiming that if it did exist it would demonstrate some utility of Semantic Web technologies via ease of development of data- and integration-driven Web applications. As always, I'd enjoy feedback on these thoughts and also any pointers to work I might not know about.

January 16, 2007

Using RDF on the Web: A Survey

(This is part one of two posts exploring building read-write Web applications using RDF. Part two will follow, shortly. Update: Part two is now available, also.)

The Web permeates our world today. Far more than static Web sites, the Web has come to be dominated by Web applications--useful software that runs inside a Web browser and on a server. And the latest trend in Web applications, Web 2.0, encourages--among other things--highly interactive Web sites with rich user interfaces featuring content from various sources around the Web integrated within the browser.

Many of us who have drunk deeply from the Semantic Web Kool-Aid are excited about the potential of RDF, SPARQL, and OWL to provide flexible data modeling, easier data integration, and networked data access and query. It's no coincidence that people often refer to the Semantic Web as a web of data. And so it seems to me that RDF and friends should be well-equipped to make the task of generating new and more powerful Web mash-ups simple, elegant, and enjoyable. Yet while there are a great number of projects using Semantic Web technologies to create Web applications, there doesn't seem to have emerged any end-to-end solution for creating browser-based read-write applications using RDF which focus on data integration and ease of development.

Following a discussion on this topic at work the other day, I decided to do a brief survey of what approaches do already exist for creating RDF-based Web applications. I want to give a brief overview of several options, assess how they fit together, and then outline a vision for some missing pieces that I feel might greatly empower Web developers working with Semantic Web technologies.

First, a bit on what I'm looking for. I want to be able to quickly develop data-driven Web applications that read from and write back to RDF data sources. I'd like to exploit standard protocols and interfaces as much as possible, and limit the amount of domain-specific code that needs to be written. I'd like the infrastructure to make it as easy as possible for the application developer to retrieve data, integrate the data, and work with it in a convenient and familiar format. That is, in the end, I'm probably looking for a system that allows the developer to work with a model of simple, domain-specific JavaScript object hierarchies.

In any case, here's the survey. I've tried to include most of the systems I know of which involve RDF data on the Web, even those which are not necessarily appropriate for creating generalized RDF-based Web apps. I'll follow up with a vision of what could be in my next post.

Semantic Mediawiki

This is an example of a terrific project which is not what I'm looking for here. Semantic Mediawiki provides wiki markup that captures the knowledge contained within a wiki as RDF which can then be exported or queried. While an installation of Semantic Mediawiki will allow me to read and write RDF data via the Web, I am constrained within the wiki framework; further, the interface to reading and writing the RDF is markup-based rather than programmatic.

The Semantic Bank API

The SIMILE project provides an HTTP POST API for publishing and persisting RDF data found on local Web pages to a server-side bank (i.e. storage). They also provide a JavaScript library (BSD license) which wraps this API. While this API supports writing a particular type of RDF data to a store, it does not deal with reading arbitrary RDF from across the Web. The API also seems to require uploaded data to be serialized as RDF/XML before being sent to a Semantic Bank. This does not seem to be what I'm looking for to create RDF-based Web applications.

The Tabulator RDF parser and API

MIT student David Sheets created a JavaScript RDF/XML parser (W3C license). It is fully compliant with the RDF/XML specification, and as such is a great idea for any Web application which needs to gather and parse arbitrary RDF models expressed in RDF/XML. The Tabulator RDF parser populates an RDFStore object. By default, it populates an RDFIndexedFormula store, which inherits from the simpler RDFFormula store. These are rather sophisticated stores which perform (some) bnode and inverse-functional-property smushing and maintain multiple triple indexes keyed on subjects, predicates, and objects.

Clearly, this is an excellent API for developers wishing to work with the full RDF model; naturally, it is the appropriate choice for an application like the Tabulator which at its core is an application that eats, breathes, and dreams RDF data. As such, however, the model is very generic and there is no (obvious, simple) way to translate it into a domain-specific, non-RDF model to drive domain-specific Web applications. Also, the parser and store libraries are read-only: there is no capability to serialize models back to RDF/XML (or any other format) and no capability to store changes back to the source of the data.
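To illustrate the general idea of keeping one index per triple position, here is a small sketch. It is not the Tabulator's code or API; the class name IndexedTripleStore and its methods are invented for this example, and it does no smushing at all.

```javascript
// A toy triple store that maintains one index per triple position,
// so lookups by subject, predicate, or object avoid scanning every triple.
class IndexedTripleStore {
  constructor() {
    this.triples = [];
    this.indexes = { subject: new Map(), predicate: new Map(), object: new Map() };
  }

  add(triple) {
    this.triples.push(triple);
    for (const position of Object.keys(this.indexes)) {
      const index = this.indexes[position];
      const key = triple[position];
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(triple);
    }
  }

  // position is "subject", "predicate", or "object".
  find(position, value) {
    return this.indexes[position].get(value) || [];
  }
}
```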

(Thanks to Dave Brondsema for an excellent example of using the Tabulator RDF parser which clarified where the existing implementations of the RDFStore interface can be found.)

Jim Ley's JavaScript RDF parser

Jim Ley created perhaps the first JavaScript library for parsing and working with RDF data from JavaScript within a Web browser. Jim's parser (BSD license) handles most RDF/XML serializations and returns a simple JavaScript object which wraps an array of triples and provides methods to find triples by matching subjects, predicates, and objects (any or all of which can be wildcards). Each triple is a simple JavaScript object with the following structure:

{
  subject: ...,
  predicate: ...,
  object: ...,
  type: ...,
  lang: ...,
  datatype: ...
}

The type attribute can be either literal or resource, and blank nodes are represented as resources of the form genid:NNNN. This structure is a simple and straightforward representation of the RDF model. It could be relatively easily mapped into an object graph, and from there into a domain-specific object structure. The simplicity of the triple structure makes it a reasonable choice for a potential RDF/JSON serialization. More on this later.
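To show how little code this structure demands, here is a sketch of wildcard matching over an array of such triples. This is not Jim's actual API; the function name matchTriples and the convention of using null as the wildcard are assumptions of the sketch.

```javascript
// Return the triples whose subject, predicate, and object match the given
// values; passing null (or undefined) for a position matches anything.
function matchTriples(triples, subject, predicate, object) {
  const fits = (pattern, value) => pattern == null || pattern === value;
  return triples.filter(t =>
    fits(subject, t.subject) &&
    fits(predicate, t.predicate) &&
    fits(object, t.object));
}
```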

Jim's parser also provides a simple method to serialize the JavaScript RDF model to N-Triples, though that's the closest it comes to providing support for updating source data with a changed RDF graph.
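For readers who have not seen N-Triples, a minimal sketch of such a serialization over the triple structure above might look like the following. This is not Jim's implementation: toNTriples and its helpers are hypothetical, the mapping from genid:NNNN to a _:genidNNNN blank node label is an assumption, and non-ASCII characters are not escaped.

```javascript
// Escape the characters that must not appear raw inside an N-Triples string.
function escapeLiteral(s) {
  return s.replace(/\\/g, "\\\\").replace(/"/g, '\\"')
          .replace(/\n/g, "\\n").replace(/\r/g, "\\r").replace(/\t/g, "\\t");
}

// Resources of the form genid:NNNN are written as blank nodes (an assumption).
function resourceTerm(value) {
  var m = /^genid:(.+)$/.exec(value);
  return m ? "_:genid" + m[1] : "<" + value + ">";
}

// Hypothetical serializer: one line per triple, terminated by " .".
function toNTriples(triples) {
  return triples.map(function (t) {
    var object;
    if (t.type === "literal") {
      object = '"' + escapeLiteral(t.object) + '"';
      if (t.lang) object += "@" + t.lang;
      else if (t.datatype) object += "^^<" + t.datatype + ">";
    } else {
      object = resourceTerm(t.object);
    }
    return resourceTerm(t.subject) + " <" + t.predicate + "> " + object + " .";
  }).join("\n");
}

// Sample data only.
var ntriples = toNTriples([
  { subject: "http://example.org/lee", predicate: "http://example.org/name",
    object: "Lee", type: "literal", lang: "en", datatype: "" },
  { subject: "http://example.org/lee", predicate: "http://example.org/address",
    object: "genid:1", type: "resource", lang: "", datatype: "" }
]);
```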

Masahide Kanzaki's JavaScript Turtle parser

In early 2006, Masahide Kanzaki wrote a JavaScript library for parsing RDF models expressed in Turtle. This parser is licensed under the terms of the GPL 2.0 and can parse into two different formats. One of these formats is a simple list of triples, (intentionally) identical to the object structure generated by Jim Ley's RDF/XML parser. The other format is a JSON representation of the Turtle document itself. This format is appealing because a nested Turtle snippet such as:

@prefix : <http://example.org/> .

:lee :address [ :city "Cambridge" ; :state "MA" ] .

translates to this JavaScript object:

{
  "@prefix": "<http://example.org/>",
  "address": {
    "city": "Cambridge",
    "state": "MA"
  }
}

While this format loses the URI of the root resource (http://example.org/lee), it provides a nicely nested object structure which could be manipulated easily with JavaScript such as:

  var lee = turtle.parse_to_json(turtleStr); // turtleStr holds the Turtle document text
  var myState = lee.address.state; // this is easy and domain-specific - yay!

Of course, things get more complicated with non-empty namespace prefixes (the properties become names like ex:name which can't be accessed using the obj.prop syntax and instead need to use the obj["ex:name"] syntax). This method of parsing also does not handle Turtle files with more than a single root resource well. And an application that used this method and wanted to get at full URIs (rather than the namespace prefix artifacts of the Turtle syntax) would have to parse and resolve the namespace prefixes itself. Still, this begins to give ideas on how we'd most like to work with our RDF data in the end within our Web app.
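To illustrate the prefix problem, here is a minimal sketch of bracket access and of resolving a prefixed name to a full URI. The prefix map, the person object, and the expandName helper are all hypothetical; I am not assuming anything about how the parser itself exposes non-empty prefixes.

```javascript
// Hypothetical prefix map; an application would have to build this itself.
var prefixes = {
  "ex": "http://example.org/",
  "foaf": "http://xmlns.com/foaf/0.1/"
};

// Expand "prefix:local" to a full URI; names with unknown prefixes pass through.
function expandName(name, prefixMap) {
  var colon = name.indexOf(":");
  if (colon === -1) return name;
  var prefix = name.slice(0, colon);
  if (!Object.prototype.hasOwnProperty.call(prefixMap, prefix)) return name;
  return prefixMap[prefix] + name.slice(colon + 1);
}

// Sample object with prefixed property names.
var person = { "ex:name": "Lee", "foaf:nick": "lee" };

var leeName = person["ex:name"];                // person.ex:name is a syntax error
var nameUri = expandName("ex:name", prefixes);  // "http://example.org/name"
```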

Masahide Kanzaki also provides a companion library which serializes an array of triples back to Turtle. As with Jim Ley's parser, this may be a first step in writing changes to the RDF back to the data's original store; such an approach requires an endpoint which accepts PUT or POSTed RDF data (in either N-Triples or Turtle syntax).

SPARQL + SPARQL/JSON + sparql.js

The DAWG published a Working Group Note specifying how the results of a SPARQL SELECT or ASK query can be serialized within JSON. Elias and I have also written a JavaScript library (MIT license) to issue SPARQL queries against a remote server and receive the results as JSON. By default, the JavaScript objects produced from the library match exactly the SPARQL results in JSON specification:

{
  "head": { "vars": [ "book" , "title" ]
  } ,
  "results": { "distinct": false , "ordered": false ,
    "bindings": [
      {
        "book": { "type": "uri" , "value": "http://example.org/book/book6" } ,
        "title": { "type": "literal" , "value": "Harry Potter and the Half-Blood Prince" }
      } ,
      ...

The library also provides a number of convenience methods which issue SPARQL queries and return the results in less verbose structures: selectValues returns an array of literal values for queries selecting a single variable; selectSingleValue returns a single literal value for queries selecting a single variable which expect to receive a single row; and selectValueArrays returns a hash relating each of the query's variables to an array of values for that variable. I've used these convenience methods in the SPARQL calendar and SPARQL antibodies demos and found them quite easy to use for SPARQL queries returning small amounts of data.
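The kind of flattening these convenience methods perform can be sketched as follows. This is not the sparql.js source: toValueArrays and firstColumn are hypothetical names, and the code works on an already-parsed results object rather than issuing a query. Recording null for unbound variables is my own choice for the example.

```javascript
// Sample data only, shaped like the SPARQL results in JSON format.
var sampleResults = {
  head: { vars: ["book", "title"] },
  results: {
    bindings: [
      { book: { type: "uri", value: "http://example.org/book/book6" },
        title: { type: "literal", value: "Harry Potter and the Half-Blood Prince" } },
      { book: { type: "uri", value: "http://example.org/book/book7" } }
    ]
  }
};

// Map each variable name to an array of plain values, one entry per row.
function toValueArrays(json) {
  var out = {};
  json.head.vars.forEach(function (v) { out[v] = []; });
  json.results.bindings.forEach(function (row) {
    json.head.vars.forEach(function (v) {
      // A variable may be unbound in a row; null keeps the rows aligned.
      var bound = Object.prototype.hasOwnProperty.call(row, v);
      out[v].push(bound ? row[v].value : null);
    });
  });
  return out;
}

// Plain values of the first selected variable only.
function firstColumn(json) {
  return toValueArrays(json)[json.head.vars[0]];
}

var columns = toValueArrays(sampleResults);
var books = firstColumn(sampleResults);
```

Note that flattening discards the type, language, and datatype of each value, which is exactly why it is convenient and also why it only suits queries whose shape the application already knows.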

Note, however, that this method does not actually work with RDF on the client side. Because it is designed for SELECT (or ASK) queries, the Web application developer ends up working with lists of values in the application (more generally, a table or result set structure). Richard Cyganiak has suggested serializing entire RDF graphs using this method by using the query SELECT ?s ?p ?o WHERE { ?s ?p ?o } and treating the three-column result set as an RDF/JSON serialization. This is a clever idea, but results in a somewhat unwieldy JavaScript object representing a list of triples: if a list of triples is my goal, I'd rather use the Jim Ley simple object format. But in general, I'd rather have my RDF in a form where I can easily traverse the graph's relationships without worrying about subjects, predicates, and objects.
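Converting such a three-column result set into the simpler triple objects is mechanical. Here is a minimal sketch, assuming the variables are named s, p, and o as in the query above; bindingsToTriples and resourceName are hypothetical names, and prefixing blank node labels with genid: is only meant to imitate the convention described earlier.

```javascript
// Blank nodes become resources named genid:<label> (an assumed convention).
function resourceName(term) {
  return term.type === "bnode" ? "genid:" + term.value : term.value;
}

// Turn the bindings of SELECT ?s ?p ?o into simple triple objects.
function bindingsToTriples(json) {
  return json.results.bindings.map(function (row) {
    var o = row.o;
    var isLiteral = o.type === "literal" || o.type === "typed-literal";
    return {
      subject: resourceName(row.s),
      predicate: row.p.value,
      object: isLiteral ? o.value : resourceName(o),
      type: isLiteral ? "literal" : "resource",
      lang: o["xml:lang"] || "",
      datatype: o.datatype || ""
    };
  });
}

// Sample data only.
var spoResults = {
  head: { vars: ["s", "p", "o"] },
  results: {
    bindings: [
      { s: { type: "uri", value: "http://example.org/lee" },
        p: { type: "uri", value: "http://example.org/address" },
        o: { type: "bnode", value: "b0" } },
      { s: { type: "bnode", value: "b0" },
        p: { type: "uri", value: "http://example.org/city" },
        o: { type: "literal", value: "Cambridge", "xml:lang": "en" } }
    ]
  }
};

var spoTriples = bindingsToTriples(spoResults);
```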

Additionally, the SPARQL SELECT query approach is a read-only approach. There is no current way to modify values returned from a SPARQL query and send the modified values (along with the query) back to an endpoint to change the underlying RDF graph(s).

JSONC, JSONI, and JSONP

Benjamin Nowack implemented the SPARQL JSON results format in ARC (W3C license), and then went a bit further. He proposes three additions/modifications to the standard SPARQL JSON results which result in saved bandwidth, more directly usable structures, and the ability to instruct a SPARQL endpoint to return JavaScript above and beyond the results object itself.

  • JSONC: Benjamin suggests an additional jsonc parameter to a SPARQL endpoint; the value of this parameter instructs the server to flatten certain variables in the result set. The result structure contains only the string value of the flattened variables, rather than a full structure containing type, language, and datatype information.
  • JSONI: JSONI is another parameter to the SPARQL endpoint which instructs the server to return certain selected variables nested within others. Effectively, this allows certain variables within the result set to be indexed based on the values of other variables. This results in more naturally nested structures which can be more closely aligned with domain-specific models and hence more directly useful by JavaScript application developers.
  • JSONP: JSONP is one solution to the problem of cross-domain XMLHttpRequest security restrictions. The jsonp parameter to a SPARQL server specifies the name of a function in which the resulting JSON object will be wrapped in the returned value. This allows the SPARQL endpoint to be used via a <script src="..."></script> invocation which avoids the cross-domain limitation.

The first two methods here are similar to what sparql.js provides on the client side for transforming the SPARQL JSON results format. By implementing them on the server, JSONC and JSONI can save significant bandwidth when returning large result sets. However, in most cases bandwidth concerns can be alleviated by sending gzip'ed content, and performing the transforms on the client allows for a much wider range of possible transformations (and no burden on SPARQL endpoints to support various transformations for interoperability). As far as I know, ARC is currently the only SPARQL endpoint that implements JSONC and JSONI.
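As an illustration of doing this kind of transform on the client, here is a minimal sketch that nests rows under the value of one chosen variable and flattens the rest to plain strings. It is only in the spirit of JSONI and JSONC: indexBy is a hypothetical helper, and I am not claiming that its output matches the structures ARC produces.

```javascript
// Group result rows by the value of keyVar; other variables are flattened
// to their plain string values.
function indexBy(json, keyVar) {
  var has = function (obj, key) {
    return Object.prototype.hasOwnProperty.call(obj, key);
  };
  var index = {};
  json.results.bindings.forEach(function (row) {
    if (!has(row, keyVar)) return; // skip rows where the key is unbound
    var key = row[keyVar].value;
    var rest = {};
    for (var v in row) {
      if (has(row, v) && v !== keyVar) rest[v] = row[v].value;
    }
    if (!has(index, key)) index[key] = [];
    index[key].push(rest);
  });
  return index;
}

// Sample data only: two rows for one person, one row for another.
var people = {
  head: { vars: ["person", "nick"] },
  results: {
    bindings: [
      { person: { type: "uri", value: "http://example.org/lee" },
        nick: { type: "literal", value: "lee" } },
      { person: { type: "uri", value: "http://example.org/lee" },
        nick: { type: "literal", value: "lf" } },
      { person: { type: "uri", value: "http://example.org/elias" },
        nick: { type: "literal", value: "eli" } }
    ]
  }
};

var byPerson = indexBy(people, "person");
var leeNicks = byPerson["http://example.org/lee"]; // two entries, one per row
```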

JSONP is a reasonable solution in some cases to solving the cross-domain XMLHttpRequest problem. I believe that other SPARQL endpoints (Joseki, for instance) implement a similar option via an HTTP parameter named callback. Unfortunately, this method often breaks down with moderate-length SPARQL queries: these queries can generate HTTP query strings which are longer than either the browser (which parses the script element) or the server is willing to handle.
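A minimal sketch of the client side of this technique, including a guard for the length problem, is below. The function names are hypothetical, the callback parameter name is the one I believe Joseki uses (an endpoint may use a different name or need an extra parameter to select JSON output), and the 2000-character limit is an arbitrary, conservative figure chosen for illustration rather than a documented limit of any browser or server.

```javascript
// Build the script URL for a JSONP-style SPARQL request, refusing to
// produce URLs longer than maxLength characters.
function buildJsonpUrl(endpoint, query, callbackName, maxLength) {
  var separator = endpoint.indexOf("?") === -1 ? "?" : "&";
  var url = endpoint + separator +
    "query=" + encodeURIComponent(query) +
    "&callback=" + encodeURIComponent(callbackName);
  if (url.length > maxLength) {
    throw new Error("JSONP URL is " + url.length +
                    " characters; the limit is " + maxLength);
  }
  return url;
}

// Browser only: the response is a script that calls the named global function.
function requestJsonp(url) {
  var script = document.createElement("script");
  script.src = url;
  document.getElementsByTagName("head")[0].appendChild(script);
}

// Hypothetical endpoint URL and callback name.
var jsonpUrl = buildJsonpUrl("http://example.org/sparql",
                             "SELECT ?s WHERE { ?s ?p ?o }",
                             "handleResults", 2000);
```

In a page, the application would define a global handleResults function and then call requestJsonp(jsonpUrl). Percent-encoding inflates a query noticeably (every space, brace, and question mark becomes three characters), which is part of why moderate-length queries hit the limit sooner than one might expect.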

Queso

Queso is the Web application framework component of the IBM Semantic Layered Research Platform. It uses the Atom Publishing Protocol to allow a browser-based Web application to read and write RDF data from a server. RDF data is generated about all Atom entries and collections that are PUT or POSTed to the server using the Atom OWL ontology. In addition, the content of Atom entries can contain RDF as either RDF/XML or as XHTML marked up with RDFa; the Queso server extracts the RDF from this content and makes it available to SPARQL querying and to other (non-Web) applications.

By using the Atom Publishing Protocol, an application working against a Queso server can both read and write RDF data from that Queso server. While Queso does contain JavaScript libraries to parse the Atom XML format into usable JavaScript objects, libraries do not yet exist to extract RDF data from the content of the Atom entries. Nor do libraries exist yet that can take RDF represented in JavaScript (perhaps in the Jim Ley fashion) and serialize it to RDF/XML in the content of an Atom entry. Current work with Queso has focused on rendering RDFa snippets via standard HTML DOM manipulation, but has not yet worked with the actual RDF data itself. In this way, Queso is an interesting application paradigm for working with RDF data on the Web, but it does not yet provide a way to work easily with domain-specific data within a browser-based development environment.

(Before Ben, Elias, and Wing come after me with flaming torches, I should add that Queso is still very much evolving: we hope that the lessons we learn from this survey and discussion about a vision of RDF-based Web apps (in my next post) will help guide us as Queso continues to mature.)

RPC / RESTful API / the traditional approach

I debated whether to put this on here and decided the survey was incomplete without it. This is the paradigm that is probably most widely used and is extremely familiar. A server component interacts with one or more RDF stores and returns domain-specific structures (usually serialized as XML or JSON) to the JavaScript client in response to domain-specific API calls. This is the approach taken by an ActiveRDF application, for instance. There are plenty of examples of this style of Web application paradigm: one which we've been discussing recently is the Boca Admin client, a Web app that Rouben is working on to help administer Boca servers.

This is a straightforward, well-understood approach to creating well-defined, scalable, and service-oriented Web applications. Yet it falls short in my evaluation in this survey because it requires a server and client to agree on a domain-specific model. This means that my client-side code cannot integrate data from multiple endpoints across the Web unless those endpoints also agree on the domain model (or unless I write client code to parse and interpret the models returned by every endpoint I'm interested in). Of course, this method also requires the maintenance of both server-side and client-side application code, two sets of code with often radically different development needs.

This is still often a preferred approach to creating Web applications. But it's not really what I'm thinking of when I contemplate the power of driving Web apps with RDF data, and so I'm not going to discuss it further here.


That's what I've got in my survey right now. I welcome any suggestions for things that I'm missing. In my next post, I'm going to outline a vision of what I see a developer-friendly RDF-based Web application environment looking like. I'll also discuss what pieces are already implemented (mainly using systems discussed in this survey) and which are not yet implemented. There'll also be many open questions raised, I'm sure. (Update: Part two is now available, also.)


(I didn't examine which of these approaches provide support for simple inferencing of the owl:sameAs and rdfs:subPropertyOf flavor, though that would be useful to know.)