When I read Bob DuCharme's blog on using custom javascript functions in SPARQL queries I knew I needed to try it out. What I really wanted was to be able to call functions from some npm libraries. Turns out I wasn't able to figure out how to do that. It might not be a simple thing to do because Apache Jena doesn't bundle a node runtime... also I didn't even look at how Apache Jena (ARQ specifically) is evaluating this custom javascript. Still, the ability to call vanilla javascript in a SPARQL query is nice.
But I still wanted to call someone else's library functions so I moved onto another option: SPARQL Value Functions. At least that is what Apache Jena calls this custom fuction pluggability.
Here is what I wanted to do...
Take some non-RDF data with string date representations that were messy and canonicalize them (to xsd:dateTime since SPARQL respects that format).
Here is the csv example we'll work with:
classification_one | classification_two | classification_three | company_system_of_record | company_id | company_name | company_inception |
---|---|---|---|---|---|---|
Energy, chemicals and utilities | CompanyX_DeptY_SystemA | 1 | ABT Inc. | Last day of August 2004 | ||
Energy, chemicals and utilities | Chemicals | CompanyX_DeptY_SystemM | 2 | |||
Energy, chemicals and utilities | Electricity - nuclear | CompanyX_DeptY_SystemZ | 3 | OCT 1989 | ||
Energy, chemicals and utilities | Electricity - renewables | CompanyX_DeptY_SystemZ | 4 | DEC 2011 | ||
Energy, chemicals and utilities | Electricity - renewables | Biomass | CompanyX_DeptY_SystemM | 5 | Fred's Renewables LLC | |
Energy, chemicals and utilities | Electricity - renewables | Hydro | CompanyX_DeptY_SystemP | 6 | WAT Inc. | 9 January 2000 |
Energy, chemicals and utilities | thermal | CompanyX_DeptY_SystemQ | 7 | 5 Mar 1992 | ||
Energy, chemicals and utilities | thermal | coal | CompanyX_DeptY_SystemQ | 8 | 19 Jun 2020 | |
Financial institutions | CompanyX_DeptY_SystemM | 9 | Pet E. Cash Co. | |||
Financial institutions | Asset / investment management and funds | CompanyX_DeptY_SystemM | 10 |
(I started with a .csv I got here and added some columns to it.)
You'll notice there is some taxonomy information in there as well as some company information. Also notice how each source system uses different date representations. I didn't want to just do a regex for each style. I was looking for an 80/20 rule approach and a general way to add functionality to SPARQL queries.
Here are the triples my construct query produced:
@prefix datething: <java:datething.> .
@prefix ex: <http://www.example.com/> .
@prefix fx: <http://sparql.xyz/facade-x/ns/> .
@prefix ns: <http://sparql.xyz/facade-x/ns/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix xyz: <http://sparql.xyz/facade-x/data/> .
[ rdf:type ex:Company ;
ex:categorizedBy ex:thermal , <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> ;
ex:hasActualStart "1992-03-05T00:00:00.000Z"^^xsd:dateTime ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "7" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemQ>
]
] .
ex:thermal rdf:type skos:Concept ;
skos:broader <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> .
ex:Hydro rdf:type skos:Concept ;
skos:broader <http://www.example.com/Electricity%20-%20renewables> .
<http://www.example.com/Electricity%20-%20renewables>
rdf:type skos:Concept ;
skos:broader <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> .
[ rdf:type ex:Company ;
ex:categorizedBy <http://www.example.com/Financial%20institutions> ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "9" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemM>
] ;
ex:name "Pet E. Cash Co."
] .
<http://www.example.com/Electricity%20-%20nuclear>
rdf:type skos:Concept ;
skos:broader <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> .
ex:Chemicals rdf:type skos:Concept ;
skos:broader <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> .
[ rdf:type ex:Company ;
ex:categorizedBy <http://www.example.com/Asset%20%2F%20investment%20management%20and%20funds> , <http://www.example.com/Financial%20institutions> ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "10" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemM>
]
] .
<http://www.example.com/Asset%20%2F%20investment%20management%20and%20funds>
rdf:type skos:Concept ;
skos:broader <http://www.example.com/Financial%20institutions> .
<http://www.example.com/Energy%2C%20chemicals%20and%20utilities>
rdf:type skos:Concept .
ex:Biomass rdf:type skos:Concept ;
skos:broader <http://www.example.com/Electricity%20-%20renewables> .
[ rdf:type ex:Company ;
ex:categorizedBy ex:Hydro , <http://www.example.com/Electricity%20-%20renewables> , <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> ;
ex:hasActualStart "2000-01-09T00:00:00.000Z"^^xsd:dateTime ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "6" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemP>
] ;
ex:name "WAT Inc."
] .
[ rdf:type ex:Company ;
ex:categorizedBy <http://www.example.com/Electricity%20-%20nuclear> , <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> ;
ex:hasActualStart "1989-10-01T00:00:00.000Z"^^xsd:dateTime ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "3" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemZ>
]
] .
[ rdf:type ex:Company ;
ex:categorizedBy <http://www.example.com/Electricity%20-%20renewables> , <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> ;
ex:hasActualStart "2011-12-01T00:00:00.000Z"^^xsd:dateTime ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "4" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemZ>
]
] .
ex:coal rdf:type skos:Concept ;
skos:broader ex:thermal .
[ rdf:type ex:Company ;
ex:categorizedBy ex:coal , ex:thermal , <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> ;
ex:hasActualStart "2020-06-19T00:00:00.000Z"^^xsd:dateTime ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "8" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemQ>
]
] .
[ rdf:type ex:Company ;
ex:categorizedBy ex:Chemicals , <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "2" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemM>
]
] .
[ rdf:type ex:Company ;
ex:categorizedBy <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> ;
ex:hasActualStart "2004-08-31T00:00:00.000Z"^^xsd:dateTime ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "1" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemA>
] ;
ex:name "ABT Inc."
] .
[ rdf:type ex:Company ;
ex:categorizedBy ex:Biomass , <http://www.example.com/Electricity%20-%20renewables> , <http://www.example.com/Energy%2C%20chemicals%20and%20utilities> ;
ex:hasID [ rdf:type ex:Identifier ;
ex:identifiedBy "5" ;
ex:inSystem <http://www.example.com/system/CompanyX_DeptY_SystemM>
] ;
ex:name "Fred's Renewables LLC"
] .
<http://www.example.com/Financial%20institutions>
rdf:type skos:Concept .
(NOTE: I wouldn't recommend using blank nodes for the companies like I did here but I am not showing off domain modeling in RDF in this post.)
Here is the query that produced those triples (in a bash command for ease of replication):
curl --silent 'http://localhost:3000/sparql.anything' \
--data-urlencode 'query=
PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX ns: <http://sparql.xyz/facade-x/ns/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fx: <http://sparql.xyz/facade-x/ns/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix ex: <http://www.example.com/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX datething: <java:datething.>
construct {?concept_one a skos:Concept .
?concept_two a skos:Concept .
?concept_three a skos:Concept .
?concept_three skos:broader ?concept_two .
?concept_two skos:broader ?concept_one .
?company a ex:Company .
?company ex:hasActualStart ?when .
?company ex:name ?company_name .
?company ex:hasID ?company_identifier .
?company ex:categorizedBy ?concept_one .
?company ex:categorizedBy ?concept_two .
?company ex:categorizedBy ?concept_three .
?company_identifier a ex:Identifier .
?company_identifier ex:identifiedBy ?company_id .
?company_identifier ex:inSystem ?source_system .
}
WHERE {
service <x-sparql-anything:csv.headers=true,location=/app/aa.csv> {
fx:properties fx:csv.null-string "" .
?root a ns:root ;
?slotp ?row .
optional {?row xyz:classification_one ?one_string
bind(uri(concat(str(ex:),encode_for_uri(?one_string))) as ?concept_one)}
optional {?row xyz:classification_two ?two_string
bind(uri(concat(str(ex:),encode_for_uri(?two_string))) as ?concept_two)}
optional {?row xyz:classification_three ?three_string
bind(uri(concat(str(ex:),encode_for_uri(?three_string))) as ?concept_three)}
{
?row xyz:company_id ?company_id .
?row xyz:company_system_of_record ?source_system_string .
bind(uri(concat(str(ex:),"system/",encode_for_uri(?source_system_string))) as ?source_system)
optional{ ?row xyz:company_name ?company_name }
bind(bnode() as ?company)
bind(bnode() as ?company_identifier)
}
optional { ?row xyz:company_inception ?when_string
bind(strdt(datething:parse(?when_string),xsd:dateTime) as ?when)
}
}
}'
These 2 lines do the custom function invocation.
PREFIX datething: <java:datething.>
bind(strdt(datething:parse(?when_string),xsd:dateTime) as ?when)
The function datething:parse
takes a variable (bound to a string) as an argument and returns an xsd:dateTime string representation then we cast it to an xsd:dateTime
with strdt
.
In order to allow that to happen you have to put a .jar on Apache Jena Fuseki's classpath.
I am running SPARQL Anything (which is Fuseki but with some added functionality to allow it to treat non-RDF data as RDF).
Here are the steps to invoke this messy string date parsing function:
-
Build the .jar file for the
datething:parse
functionality by following the instructions here under the "how" section. -
git clone
SPARQL Anything. -
Copy the .jar you produced in (1) to directory you created in (2).
-
Run SPARQL Anything in a docker container (listening on port 3000) by following these instructions. (At this point Fuseki's classpath should have that .jar file on it.)
-
Download
aa.csv
(the .csv example) from this git repo and copy it into the SPARQL Anything directory. -
Run the bash command above (with the SPARQL query) and you should have your constructed triples!
So with this datething:parse
functionality you can triplify, say, a spreadsheet that has a variety of date representations in it and likely get most of the dates canonicalized so you can sort, filter, etc. on them in a SPARQL query.
(Here is my post on using SPARQL Anything to extract triples from a public google sheet.)
This .jar should also work with CSV2RDF as it also uses Apache Jena and the SPARQL construct approach to triplification.
Happy triplifying!