Web 2.0 and the Semantic Web are two common ideas that shape the future of the Web.
It is not yet clear which one will prevail, but most likely we will get a
platform containing the best ideas from both. Many experts claim that Web 2.0 is
just a “marketing” name for the Semantic Web, although some differences may still
exist. The main principle outlined by both paradigms is the ability to extract
and query information across an informational space that includes Web sites,
documents, databases, Web services, libraries and repositories. The Semantic Web has
introduced a new computing paradigm based on the notion of unambiguous
metadata descriptions that can describe not only things you can find on the Web
but also things that reside in enterprise data stores and even physical
objects. These metadata descriptions were standardized by the World Wide Web Consortium (W3C) as
the Resource Description Framework (RDF) as early as 1999.

The SPARQL Protocol and RDF Query
Language (SPARQL, pronounced “sparkle”) is a query language designed to meet the
requirements and design objectives described in the “RDF Data Access Use
Cases”. It provides facilities to:

  • extract information in the form of URIs, blank nodes,
    plain and typed literals;
  • extract RDF subgraphs; and
  • construct new RDF graphs based on information in the queried graphs.

As a data access language, it is
suitable for both local and remote use. Local use is straightforward; for
remote use, the more formally specified SPARQL Protocol for RDF has been
designed. This protocol is
an interface for conveying SPARQL queries from clients to query processors, and
several bindings such as HTTP and SOAP have been introduced to achieve
connectivity.

In this document I will explain how
SPARQL can be used for querying information and how the SPARQL Protocol works
for remote queries. The reader is expected to be familiar with RDF concepts.

Evolution of objectives

Although several standards covered how RDF data is defined and stored,
for a long time no work had been done to create standards for
querying or accessing RDF data. Likewise, there was no formal, publicly
standardized data access protocol for interacting with remote or local RDF storage
servers. As a result, many developers in commercial and open source projects created
their own query languages for accessing RDF data, over 20 at last count. A full list of
the different query language implementations can be seen at http://www.w3.org/2001/11/13-RDF-Query-Rules/.
These languages lack both a common syntax and common semantics. In fact,
the existing query languages cover a significant semantic range: from
declarative, SQL-like languages, to path languages, to rule or production-like
systems. SPARQL was designed to fill this gap.

SPARQL provides Web 2.0 users with a query language in
much the same fashion as SQL provides relational database users with a query
language.

The following requirements were taken into
consideration when SPARQL was designed:

  • Graph pattern matching ability – the query language must include the capability to restrict
    matches on a queried graph by providing a graph pattern, which consists of
    one or more RDF triple patterns, to be satisfied in a query;
  • Variable binding results – It must be possible for queries to return zero or more
    bindings of variables. Each set of bindings is one way that the query can
    be satisfied by the queried graph;
  • Subgraph results – It must be possible for query results to be returned
    as a subgraph of the original queried graph;
  • Supportable local queries – The query language must be suitable for use in
    accessing local RDF data – that is, from the same machine or same system
    process;
  • Result limits – It must be possible to specify an upper bound on the
    number of query results returned;
  • Streaming results – It must be possible, when returning multiple unordered
    results, for the client to request that results be streamed. When the
    client requests streaming results, all the data in one result must be
    available to the client before all the data for the next result.
  • WSDL support – The protocol – including its interfaces, their
    operations, results, and types – must be described using WSDL. This is
    essential for remote queries.

The SPARQL requirements have now stabilized, and the
SPARQL query language is a Candidate Recommendation, which means that it
is close to becoming a standard (W3C Recommendation).

How to write SPARQL queries

An RDF graph is a
set of triples; each triple consists of a subject, a predicate and an object. These triples can come from a variety of
sources. The SPARQL query language is based on matching graph patterns. The
simplest graph pattern is the triple pattern, which is like an RDF triple, but
with the possibility of a variable instead of an RDF term (an IRI, a literal
or a blank node) in the subject, predicate or object position.
Combining triple patterns gives a basic graph pattern, where an exact match to
a graph is needed to fulfill the pattern.

The example below
shows a SPARQL query to find the author of a book from the information in the
given RDF graph. Let’s take the following RDF information (example1.rdf):

<http://example.org/book/book1>
<http://purl.org/dc/elements/1.1/author> "Peter Mikhalenko" .

The query consists of two parts,
the SELECT clause and the WHERE clause. The SELECT clause identifies the variables
to appear in the query results, and the WHERE clause has one triple pattern (example1.sparql.txt):

Listing A

SELECT ?author
WHERE
{
  <http://example.org/book/book1>
    <http://purl.org/dc/elements/1.1/author> ?author .
}

This is what we will
get from this simplest query:

author
------------------
"Peter Mikhalenko"
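
To make the matching step more concrete, here is a minimal sketch in Python of how a single triple pattern could be evaluated against a graph. It is only an illustration, not how a real SPARQL engine works: the function names match_triple and select are hypothetical, and RDF terms are represented as plain strings.

```python
# Toy evaluation of one triple pattern. Terms are plain strings and
# anything that starts with "?" is treated as a variable.

def match_triple(pattern, triple):
    """Return a dict of variable bindings if the triple matches, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # The same variable must take the same value everywhere.
            if binding.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return binding


def select(variables, pattern, graph):
    """Match the pattern against every triple and keep the chosen variables."""
    rows = []
    for triple in graph:
        binding = match_triple(pattern, triple)
        if binding is not None:
            rows.append({v: binding[v] for v in variables})
    return rows


# The data from example1.rdf and the pattern from Listing A.
graph = [
    ("<http://example.org/book/book1>",
     "<http://purl.org/dc/elements/1.1/author>",
     '"Peter Mikhalenko"'),
]
pattern = ("<http://example.org/book/book1>",
           "<http://purl.org/dc/elements/1.1/author>",
           "?author")

rows = select(["?author"], pattern, graph)
print(rows)
```

Each row is one solution, that is, one way of binding the variables so that the pattern becomes a triple of the graph.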

The terms delimited
by “<>” are IRI references (Internationalized Resource Identifiers,
described by RFC 3987).
IRIs are a generalization of
URIs and are fully compatible with URIs and URLs. SPARQL provides two abbreviation mechanisms
for IRIs: prefixed names and relative IRIs.

  • Prefixed names: The PREFIX keyword associates a prefix label with
    an IRI. A prefixed name is a prefix label and a local part, separated by a
    colon “:”. It is mapped to an IRI by concatenating the local part to the
    IRI corresponding to the prefix.
  • Relative IRIs: The BASE keyword defines the base IRI used to
    resolve relative IRIs.
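
Both abbreviation mechanisms can be imitated in a few lines of Python. This is a rough sketch: the helper names expand_prefixed_name and resolve_relative are hypothetical, and the standard urljoin function is used here as a stand-in for IRI resolution, which is an assumption that holds for simple cases like this one.

```python
from urllib.parse import urljoin


def expand_prefixed_name(name, prefixes):
    """Map 'label:local' to an IRI by appending the local part to the prefix IRI."""
    label, _, local = name.partition(":")
    return prefixes[label] + local


def resolve_relative(iri, base):
    """Resolve a relative IRI against the base IRI."""
    return urljoin(base, iri)


# Prefix declarations as they could appear in a query (the empty label is allowed).
prefixes = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "": "http://example.org/book/",
}

author = expand_prefixed_name("dc:author", prefixes)
book1 = expand_prefixed_name(":book1", prefixes)
book1_relative = resolve_relative("book1", "http://example.org/book/")

print(author)
print(book1)
print(book1_relative)
```

All three forms end up as ordinary absolute IRIs, which is why queries written with either abbreviation are equivalent to the fully spelled-out query.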

The general syntax
for literals is a string (enclosed in quotes, either double quotes "" or single quotes '' ), with either an optional language tag
(introduced by @)
or an optional datatype IRI or prefixed name (introduced by ^^).
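
The shape of a literal can be illustrated with a small Python sketch. The regular expression below is a deliberate simplification (it ignores escape sequences and long strings), and parse_literal is a hypothetical helper, not part of any SPARQL library.

```python
import re

# Quoted lexical form, then either @language-tag or ^^datatype (both optional).
LITERAL = re.compile(
    r'''^(["'])(.*)\1(?:@([A-Za-z]+(?:-[A-Za-z0-9]+)*)|\^\^(\S+))?$'''
)


def parse_literal(text):
    """Split a literal into (lexical form, language tag, datatype)."""
    m = LITERAL.match(text)
    if m is None:
        raise ValueError(f"not a literal: {text!r}")
    _quote, lexical, lang, datatype = m.groups()
    return lexical, lang, datatype


print(parse_literal('"chat"@fr'))
print(parse_literal("'42'^^xsd:integer"))
print(parse_literal('"Peter Mikhalenko"'))
```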

Variables in SPARQL
queries have global scope: a given variable name denotes the same variable
everywhere in the query. Variables are indicated by “?”; the
“?” does not form part of the variable name. “$” is an
alternative to “?”. In a query, $var and ?var are the same variable.
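
A tiny Python sketch shows what this means for a program that processes queries; variable_name is a hypothetical helper introduced only for this illustration.

```python
def variable_name(token):
    """Strip the leading marker; '?' and '$' are interchangeable."""
    if len(token) < 2 or token[0] not in "?$":
        raise ValueError(f"not a variable: {token!r}")
    return token[1:]


# Both spellings denote the same variable.
print(variable_name("?var") == variable_name("$var"))
```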

Putting all of the above together,
let’s have a look at three examples (example2.sparql.txt,
example3.sparql.txt, example4.sparql.txt) which express the same query.

The same piece of RDF data (example1.rdf)
can be represented in a so-called Turtle format,
which allows URIs to be
abbreviated with prefixes (example1.rdf.turtle.txt):

Listing B

@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix :     <http://example.org/book/> .
:book1  dc:author  "Peter Mikhalenko" .

The term “binding” is used as a
descriptive term for a pair (variable, RDF term). However, not every
variable needs to be bound in every row of the result table. Optional parts of
the graph pattern may be specified syntactically with the OPTIONAL keyword
applied to a graph pattern.

Let’s take a piece
of data (example2.rdf.turtle.txt):

Listing C

@prefix foaf:       <http://xmlns.com/foaf/0.1/> .
@prefix rdf:        <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

_:a  rdf:type        foaf:Person .
_:a  foaf:name       "Peter" .
_:a  foaf:mbox       <mailto:test@peter.com> .
_:a  foaf:mbox       <mailto:peter@gmail.com> .

_:b  rdf:type        foaf:Person .
_:b  foaf:name       "Mary" .

The query with OPTIONAL pattern
will look like this (example5.sparql.txt):

Listing D

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE  { ?x foaf:name  ?name .
         OPTIONAL { ?x  foaf:mbox  ?mbox }
       }

And the result will
be the following:

name     mbox
-------  ------------------------
"Peter"  <mailto:test@peter.com>
"Peter"  <mailto:peter@gmail.com>
"Mary"

There is no value of
mbox in the solution where the name is "Mary"; it is unbound. This query finds the names
of people in the data. If there is a triple with predicate mbox and the same subject, the solution will contain
the object of that triple as well. In the example, only a single triple pattern
is given in the optional part of the query but, in general, it can be any
graph pattern. The whole optional graph pattern must match
for it to add to the query solution.
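
The behaviour of OPTIONAL resembles a left outer join. The following Python sketch imitates the query from Listing D on the data from Listing C; it is hard-wired to this one query, the function names are hypothetical, and the prefixed names are kept as plain strings for brevity.

```python
# Listing C as (subject, predicate, object) tuples.
graph = [
    ("_:a", "rdf:type", "foaf:Person"),
    ("_:a", "foaf:name", '"Peter"'),
    ("_:a", "foaf:mbox", "<mailto:test@peter.com>"),
    ("_:a", "foaf:mbox", "<mailto:peter@gmail.com>"),
    ("_:b", "rdf:type", "foaf:Person"),
    ("_:b", "foaf:name", '"Mary"'),
]


def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def names_with_optional_mbox():
    rows = []
    for subject, predicate, name in graph:
        if predicate != "foaf:name":
            continue  # required part: ?x foaf:name ?name
        mboxes = objects(subject, "foaf:mbox")
        if mboxes:
            # The optional part matched: one solution per mailbox.
            rows.extend({"name": name, "mbox": mbox} for mbox in mboxes)
        else:
            # The optional part did not match: keep the row, mbox unbound.
            rows.append({"name": name})
    return rows


rows = names_with_optional_mbox()
for row in rows:
    print(row["name"], row.get("mbox", ""))
```

Without OPTIONAL the row for "Mary" would simply disappear, because the second pattern would then be required.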

Results can also be returned in
XML using the SPARQL Variable Binding Results XML Format;
we will examine it later, when the SPARQL Protocol is considered.

The result of a
query is the set of all pattern solutions that match the query pattern, giving
all the ways a query can match the graph being queried. Each result is one
solution to the query, and there may be zero, one or multiple results for a
query. Say, for example, we have the following data (example3.rdf.turtle.txt):

Listing E

@prefix foaf:  <http://xmlns.com/foaf/0.1/> .

_:a  foaf:name   "John Hijacker" .
_:a  foaf:mbox   <mailto:jh@example.com> .
_:b  foaf:name   "DmitryPovarenko" .
_:b  foaf:mbox   <mailto:dmitry@example.org> .

Then the query (example6.sparql.txt):

Listing F

PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE
  { ?x foaf:name ?name .
    ?x foaf:mbox ?mbox }

Will give the
following result:

name               mbox
----               ----
"John Hijacker"    <mailto:jh@example.com>
"DmitryPovarenko"  <mailto:dmitry@example.org>

The results
enumerate the RDF terms to which the selected variables can be bound in the query pattern.
There are also a number of syntactic forms that abbreviate some common
sequences of triples; for details, it is better to turn to the original SPARQL
specification.
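
When a basic graph pattern contains several triple patterns, a shared variable such as ?x ties them together. The Python sketch below evaluates the two patterns of Listing F against the data of Listing E by extending bindings pattern by pattern. The names match and evaluate are hypothetical, terms are plain strings, and a real query processor would be considerably smarter about evaluation order.

```python
def match(pattern, triple, binding):
    """Extend the binding so that the pattern equals the triple, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if result.setdefault(p, t) != t:
                return None  # variable already bound to a different term
        elif p != t:
            return None
    return result


def evaluate(patterns, graph):
    """Return every binding that satisfies all triple patterns at once."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for binding in solutions
            for triple in graph
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return solutions


# Listing E, with prefixed names kept as plain strings.
graph = [
    ("_:a", "foaf:name", '"John Hijacker"'),
    ("_:a", "foaf:mbox", "<mailto:jh@example.com>"),
    ("_:b", "foaf:name", '"DmitryPovarenko"'),
    ("_:b", "foaf:mbox", "<mailto:dmitry@example.org>"),
]
patterns = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]

rows = [(s["?name"], s["?mbox"]) for s in evaluate(patterns, graph)]
for name, mbox in rows:
    print(name, mbox)
```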

An RDF Literal is
written in SPARQL as a string containing the lexical form of the literal,
followed by an optional language tag or an optional datatype. There are
convenience forms for numeric literals of type xsd:integer,
xsd:decimal and
xsd:double, and also for xsd:boolean. The data below contains a number of RDF
literals (example4.rdf.turtle.txt). The pattern in the following query has a solution :x
because 42 is syntax for "42"^^<http://www.w3.org/2001/XMLSchema#integer>:

SELECT ?v WHERE { ?v ?p 42 }

The following query
has a solution with variable v bound to :y:

SELECT ?v WHERE
{ ?v ?p "abc"^^<http://example.org/datatype#specialDatatype> }
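
The convenience forms can be thought of as a purely syntactic expansion. The Python sketch below shows the idea; expand_shorthand is a hypothetical helper, and its regular expressions only approximate the numeric grammar of the SPARQL specification.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"


def expand_shorthand(token):
    """Rewrite a bare number or boolean as a full typed literal."""
    if token in ("true", "false"):
        datatype = "boolean"
    elif re.fullmatch(r"[+-]?[0-9]+", token):
        datatype = "integer"
    elif re.fullmatch(r"[+-]?[0-9]*\.[0-9]+", token):
        datatype = "decimal"
    elif re.fullmatch(r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)[eE][+-]?[0-9]+", token):
        datatype = "double"
    else:
        raise ValueError(f"no convenience form for {token!r}")
    return f'"{token}"^^<{XSD}{datatype}>'


print(expand_shorthand("42"))
print(expand_shorthand("1.5"))
print(expand_shorthand("1.5e3"))
print(expand_shorthand("true"))
```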

Graph pattern
matching creates bindings of variables. It is possible to further restrict
solutions by constraining the allowable bindings of variables to RDF Terms.
Value constraints take the form of boolean-valued
expressions; the language also allows application-specific constraints on the
values in a solution. Let’s take the following data (example5.rdf.turtle.txt) and a query (example7.sparql.txt). The result of the query will be the
following dataset:

title               price
------------------  -----
"The Semantic Web"  23

By having a
constraint on the price variable, only book2 matches the query because there is a restriction on the
allowable values of price. Constraints can be given in an optional
graph pattern as this example shows (the same data, example5.rdf.turtle.txt) and a query (example8.sparql.txt). The result will be the following:

title               price
------------------  -----
"SPARQL Tutorial"
"The Semantic Web"  23

No price appears for
the book with title “SPARQL Tutorial” because the optional graph
pattern did not lead to a solution involving the variable price.
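
The difference between a constraint in the main pattern and a constraint inside an optional pattern can be sketched in Python. The data files are not reproduced in this article, so the prices below and the threshold of 30 are assumptions chosen only to be consistent with the two result tables above; the function names are hypothetical as well.

```python
# Assumed data: the price of "SPARQL Tutorial" is a guess, any value >= 30 works.
books = [
    {"title": "SPARQL Tutorial", "price": 42},
    {"title": "The Semantic Web", "price": 23},
]


def filter_required(rows, limit):
    """Constraint in the main pattern: failing solutions are dropped entirely."""
    return [row for row in rows if row["price"] < limit]


def filter_optional(rows, limit):
    """Constraint inside OPTIONAL: the solution stays, only the price is unbound."""
    return [
        row if row["price"] < limit else {"title": row["title"]}
        for row in rows
    ]


print(filter_required(books, 30))
print(filter_optional(books, 30))
```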

SPARQL provides a
means of combining graph patterns so that one of several alternative graph
patterns may match. If more than one of the alternatives matches, all the
possible pattern solutions are found. To express pattern alternatives in a query, you
can use the UNION keyword. For example, for this data (example6.rdf.turtle.txt) the query (example9.sparql.txt) will give the following result:

title
--------------------------------
"SPARQL Protocol Tutorial"
"SPARQL"
"SPARQL (updated)"
"SPARQL Query Language Tutorial"

This query finds
the titles of the books in the data, whether the title is recorded using properties from
version 1.0 or version 1.1 of Dublin Core (a standardized set of document properties).
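
In terms of solutions, UNION simply collects the solutions of every alternative. The Python sketch below illustrates this; the triples are an assumption made up to reproduce the four titles in the table above (the actual data file is not shown here), and the helper names are hypothetical.

```python
DC10 = "http://purl.org/dc/elements/1.0/"
DC11 = "http://purl.org/dc/elements/1.1/"

# Assumed data, consistent with the result table above.
graph = [
    ("_:a", DC10 + "title", "SPARQL Query Language Tutorial"),
    ("_:b", DC11 + "title", "SPARQL Protocol Tutorial"),
    ("_:c", DC10 + "title", "SPARQL"),
    ("_:c", DC11 + "title", "SPARQL (updated)"),
]


def titles(predicate):
    """Solutions of the single pattern { ?book <predicate> ?title }."""
    return [{"title": o} for s, p, o in graph if p == predicate]


def union(*alternatives):
    """All solutions of all alternatives, without removing duplicates."""
    return [row for solutions in alternatives for row in solutions]


rows = union(titles(DC10 + "title"), titles(DC11 + "title"))
for row in rows:
    print(row["title"])
```

The book _:c contributes two rows because it matches both alternatives.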

Query patterns
generate an unordered collection of solutions. These solutions are then treated
as a sequence, initially in no specific order; any sequence modifiers are then
applied to create another sequence. The solution sequence can be modified by
adding the DISTINCT keyword, which ensures that every combination
of variable bindings (i.e. each solution) in the sequence is unique. For
example, with data example7.rdf.turtle.txt and query example10.sparql.txt the result will be the
following:

name
-------
"Alice"

The ORDER BY clause takes a solution sequence and applies ordering
conditions. An ordering condition can be a variable or a function call. The
direction of ordering is ascending by default. It can be explicitly set to
ascending or descending by enclosing the condition in ASC() or DESC() respectively. If multiple conditions are
given, then they are applied in turn until one gives the indication of the ordering.

The LIMIT form puts an upper bound on the
number of solutions returned. If the number of actual solutions is greater than
the limit, then at most the limit number of solutions will be returned. OFFSET causes the solutions generated to start after the
specified number of solutions.
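
The sequence modifiers can be pictured as ordinary list operations. Here is a minimal Python sketch, assuming that solutions are dictionaries and that ordering conditions are plain variables; the function names and the sample data are hypothetical.

```python
def order_by(solutions, *conditions):
    """conditions are (variable, descending) pairs; earlier ones take priority."""
    # Sorting by the last condition first works because Python's sort is stable.
    for variable, descending in reversed(conditions):
        solutions = sorted(solutions, key=lambda s: s[variable], reverse=descending)
    return solutions


def distinct(solutions):
    """Drop repeated solutions while keeping the existing order."""
    seen, unique = set(), []
    for solution in solutions:
        key = tuple(sorted(solution.items()))
        if key not in seen:
            seen.add(key)
            unique.append(solution)
    return unique


def offset_limit(solutions, offset=0, limit=None):
    """Skip `offset` solutions, then return at most `limit` of them."""
    end = None if limit is None else offset + limit
    return solutions[offset:end]


solutions = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Alice", "age": 30},
    {"name": "Carol", "age": 25},
]

# Roughly: SELECT DISTINCT ... ORDER BY ?age DESC(?name) LIMIT 1 OFFSET 1
ordered = order_by(solutions, ("age", False), ("name", True))
unique = distinct(ordered)
page = offset_limit(unique, offset=1, limit=1)
print(page)
```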

The SELECT form of
results returns the variables directly. The syntax SELECT * is an abbreviation that selects all of the variables.
For example, for data example8.rdf.turtle.txt and query example11.sparql.txt the result will be the following:

nameX    nameY    nickY
-------  -------  -----
"Alice"  "Bob"
"Alice"  "Clare"  "CT"

Results can be
thought of as a table or result set, with one row per query solution. Some
cells may be empty because a variable is not bound in that particular solution.
Result sets can be accessed through a local API but can also be serialized into
either XML or an RDF graph. In XML format, the same result set looks
like this (example11.result.sparql.xml):

Listing G

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="nameX"/>
    <variable name="nameY"/>
    <variable name="nickY"/>
  </head>
  <results>
    <result>
      <binding name="nameX">
        <literal>Alice</literal>
      </binding>
      <binding name="nameY">
        <literal>Bob</literal>
      </binding>
    </result>
    <result>
      <binding name="nameX">
        <literal>Alice</literal>
      </binding>
      <binding name="nameY">
        <literal>Clare</literal>
      </binding>
      <binding name="nickY">
        <literal>CT</literal>
      </binding>
    </result>
  </results>
</sparql>
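
A client can turn such a document back into rows with any XML parser. The following Python sketch uses the standard xml.etree.ElementTree module; parse_results is a hypothetical helper, and it keeps only the text of each bound value, ignoring whether it was a URI, a literal or a blank node.

```python
import xml.etree.ElementTree as ET

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

RESULT_XML = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="nameX"/>
    <variable name="nameY"/>
    <variable name="nickY"/>
  </head>
  <results>
    <result>
      <binding name="nameX"><literal>Alice</literal></binding>
      <binding name="nameY"><literal>Bob</literal></binding>
    </result>
    <result>
      <binding name="nameX"><literal>Alice</literal></binding>
      <binding name="nameY"><literal>Clare</literal></binding>
      <binding name="nickY"><literal>CT</literal></binding>
    </result>
  </results>
</sparql>"""


def parse_results(xml_text):
    """Return (variable names, list of {variable: value} rows)."""
    root = ET.fromstring(xml_text)
    variables = [v.get("name") for v in root.findall("sr:head/sr:variable", NS)]
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        row = {}
        for binding in result.findall("sr:binding", NS):
            # The single child element is <uri>, <literal> or <bnode>.
            row[binding.get("name")] = binding[0].text
        rows.append(row)
    return variables, rows


variables, rows = parse_results(RESULT_XML)
print(variables)
for row in rows:
    # Unbound variables are simply absent from the row.
    print([row.get(name, "") for name in variables])
```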

SPARQL Protocol

The SPARQL Protocol is
described in two ways: first, as an abstract interface independent of any
concrete realization, implementation, or binding to another protocol; second,
as HTTP and SOAP bindings of this interface.

The SPARQL Protocol
is described abstractly with WSDL 2.0 in terms of a Web service that implements
its interface, types, faults, and operations, as well as by HTTP and SOAP
bindings. The current SPARQL Protocol description is hosted at the following
address and can be used by any Web service processor or other application: http://www.w3.org/TR/rdf-sparql-protocol/sparql-protocol-query.wsdl.

Let’s take a simple
query (example12.sparql.txt) and have a look at how it works over an HTTP
connection. This is the HTTP GET request that a SPARQL client
sends to a SPARQL Web service located, say, at
http://sparql.service.com/sparql:

Listing H

GET /sparql/?query=PREFIX+dc:+%3Chttp://purl.org/dc/elements/1.1/%3E%0ASELECT+?book+?who%0AWHERE+%7B+?book+dc:creator+?who+%7D HTTP/1.1
Host: sparql.service.com
User-agent: sparql-client/0.1

In the GET request
there is a URL-encoded SPARQL query: spaces are replaced by the ‘+’ symbol, newlines are replaced by %0A (the hexadecimal
code of the line feed character), and characters such as ‘<’, ‘>’, ‘{’ and ‘}’ are percent-encoded in the same way. For a successfully handled query, the HTTP server will
return the following:

Listing I
HTTP/1.1 200 OK
Date: Fri, 06 May 2005 20:55:12 GMT
Server: Apache/1.3.29 (Unix) PHP/4.3.4 DAV/1.0.3
Connection: close
Content-Type: application/sparql-results+xml

<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">

 <head>
   <variable name="book"/>
   <variable name="who"/>
 </head>
 <results distinct="false" ordered="false">
   <result>
     <binding name="book"><uri>http://www.example/book/book5</uri></binding>
     <binding name="who"><bnode>r29392923r2922</bnode></binding>
   </result>

   <result>
     <binding name="book"><uri>http://www.example/book/book6</uri></binding>
     <binding name="who"><bnode>r8484882r49593</bnode></binding>
   </result>
 </results>
</sparql>
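
The request side of this exchange is easy to reproduce. The Python sketch below builds the GET URL with the standard urllib.parse module, without sending anything over the network. The endpoint is the hypothetical address used in this article, build_get_url is a made-up helper, and urlencode percent-encodes more characters (such as ‘:’, ‘/’ and ‘?’) than the hand-written request above; both spellings decode to the same query.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint taken from the example in this article.
ENDPOINT = "http://sparql.service.com/sparql/"

QUERY = (
    "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
    "SELECT ?book ?who\n"
    "WHERE { ?book dc:creator ?who }"
)


def build_get_url(endpoint, query):
    """Attach the URL-encoded query as the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_get_url(ENDPOINT, QUERY)
print(url)

# Decoding the parameter gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

A real client would then send this URL in a GET request and parse the application/sparql-results+xml body of the response.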

A query can also be sent over
SOAP. The file example13.sparql.soap.txt
contains an example of a SOAP query sent in an HTTP POST request, and example13.sparql.result.soap.txt
contains the corresponding SOAP response.

An evolving protocol

This article is just an
introduction to the SPARQL query language and its protocol bindings. SPARQL has
already evolved into a rich query language suitable for Web 2.0
and Semantic Web platforms, and it is impossible to cover all aspects of the
language and protocol here. For further details, please have a look at the SPARQL specifications.
There are a number of issues that SPARQL does not address yet; most notably,
SPARQL is read-only and cannot modify an RDF dataset. SPARQL actually consists
of three separate specifications: the query language
specification, the SPARQL data access protocol, and the XML format
of query results.