A while back cURL's creator, Daniel Stenberg, tweeted some stats on cURL's git repository.
I have a pretty good idea of how he answered the questions behind those stats. I bet he used tools like sed, awk, and grep. If I had to answer those questions I too might reach for those CLI utilities in a throwaway shell pipeline. But I wondered what it would be like to answer those questions semantic web style.
In order to answer questions semantic web style you first have to find or make some thoughtful RDF. I say "thoughtful" because it is possible, though mostly not desirable, to use the semantic web stack (RDF/SPARQL/OWL/SHACL, etc.) without doing much domain modeling.
In our case we can easily get some structured data to start with. Here is a git commit I just made:
commit e3c3d94d1748316dce192e76dca0d2baa82e9c27
Author: Justin <[email protected]>
Date: Tue Aug 23 07:24:29 2022 -0400
some creatures
added four creatures
diff --git a/some.txt b/some.txt
new file mode 100644
index 0000000..882149f
--- /dev/null
+++ b/some.txt
@@ -0,0 +1,4 @@
+some pig.
+some rat.
+some goose.
+some spider.
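(That listing is just git's standard view of a commit; something like the following should reproduce it:)
git show e3c3d94d1748316dce192e76dca0d2baa82e9c27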
Notice how compact that representation is.
The meat of that text is the unified output format of the diff tool.
If you work with git much you probably recognize what most of that is.
But the semantic web isn't just about letting you work with data you already know how to decipher.
To participate in the semantic web we need to unpack this compact application-centric representation into a data-centric representation so that others don't need to do the deciphering.
In the semantic web we want data to wear its meaning on its sleeve.
That compact representation is fine for the diff and patch tools, but it doesn't really check any of those boxes.
Ok, I ran my conversion tool on that commit and it transformed that representation into a thoughtful RDF graph.
Let's take a look (using RDFox's graph viz):
You'll notice that commits have parts: hunks.
Those hunks, when applied, produce contiguous lines.
Those contiguous lines:
- occur in a text file with a name
- are identified by a line number
- have a magnitude with a unit of measure (line count)
- and have the literal contained text
Note that I've used Wikidata entities because Wikidata is a nice hub in the semantic web. Here is a Wikidata subgraph with labels that are relevant for the RDF I've produced:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix wd: <http://www.wikidata.org/entity/> .
wd:Q113509427 rdfs:label "hunk"@en .
wd:Q86920 rdfs:label "text file"@en .
wd:Q20058545 rdfs:label "commit"@en .
wd:Q6553274 rdfs:label "line number"@en .
wd:Q113515824 rdfs:label "contiguous lines"@en .
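You can fetch that subgraph yourself from the public endpoint at https://query.wikidata.org with something like:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?entity ?label
WHERE
{ VALUES ?entity { wd:Q20058545 wd:Q113509427 wd:Q113515824
                   wd:Q86920 wd:Q6553274 }
  ?entity rdfs:label ?label
  FILTER ( lang(?label) = "en" )
}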
By the way, don't let those Q numbers scare you.
I don't memorize them (well I do have wd:Q2 memorized since it is pretty special).
I use autocompletion in my text editor, and Wikidata has it here too: you just type wd:, press control-enter, and then type what you want.
Also, Wikidata has some good reasons for using opaque IRIs.
Here is the RDF graph (the same one as in the image above) in turtle serialization:
@prefix : <http://example.com/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.com/hunk/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=e3c3d94;hunk=5713fa90-73bc-4520-818f-d59aaf20efb0>
rdf:type wd:Q113509427 ;
gist:produces [ rdf:type wd:Q113515824 ;
:containedTextContainer [ rdf:_1 "some pig." ;
rdf:_2 "some rat." ;
rdf:_3 "some goose." ;
rdf:_4 "some spider."
] ;
gist:hasMagnitude [ gist:hasUnitOfMeasure gist:_each ;
gist:numericValue 4
] ;
gist:identifiedBy [ rdf:type wd:Q6553274 ;
gist:numericValue 1
] ;
gist:occursIn [ rdf:type wd:Q86920 ;
gist:name "some.txt"
]
] .
<http://example.com/commit/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=e3c3d94>
rdf:type wd:Q20058545 ;
dcterms:creator [ gist:name "Justin" ;
schema:email "[email protected]"
] ;
dcterms:subject [ dcterms:description "added four creatures" ;
dcterms:title "some creatures"
] ;
gist:atDateTime "2022-08-23T07:24:29-04:00"^^xsd:dateTime ;
gist:hasPart <http://example.com/hunk/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=e3c3d94;hunk=5713fa90-73bc-4520-818f-d59aaf20efb0> ;
gist:isIdentifiedBy [ gist:uniqueText "e3c3d94" ] .
Here is another commit:
commit 0e40522a20754fc82d751000eae710b5ad09e2f3
Author: Justin <[email protected]>
Date: Tue Aug 23 07:25:26 2022 -0400
capitalization
caps on pig and spider.
diff --git a/some.txt b/some.txt
index 882149f..c4ef5c5 100644
--- a/some.txt
+++ b/some.txt
@@ -1,4 +1,4 @@
-some pig.
+some Pig.
some rat.
some goose.
-some spider.
+some Spider.
And the RDF:
@prefix : <http://example.com/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.com/hunk/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=0e40522;hunk=f2f312bb-49fb-48a1-a62e-323c0519b333>
rdf:type wd:Q113509427 ;
gist:affects [ rdf:type wd:Q113515824 ;
:containedTextContainer [ rdf:_1 "some spider." ] ;
gist:hasMagnitude [ gist:hasUnitOfMeasure gist:_each ;
gist:numericValue 1
] ;
gist:identifiedBy [ rdf:type wd:Q6553274 ;
gist:numericValue 4
] ;
gist:occursIn [ rdf:type wd:Q86920 ;
gist:name "some.txt"
]
] ;
gist:produces [ rdf:type wd:Q113515824 ;
:containedTextContainer [ rdf:_1 "some Spider." ] ;
gist:hasMagnitude [ gist:hasUnitOfMeasure gist:_each ;
gist:numericValue 1
] ;
gist:identifiedBy [ rdf:type wd:Q6553274 ;
gist:numericValue 4
] ;
gist:occursIn [ rdf:type wd:Q86920 ;
gist:name "some.txt"
]
] .
<http://example.com/commit/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=0e40522>
rdf:type wd:Q20058545 ;
dcterms:creator [ gist:name "Justin" ;
schema:email "[email protected]"
] ;
dcterms:subject [ dcterms:description "caps on pig and spider." ;
dcterms:title "capitalization"
] ;
gist:atDateTime "2022-08-23T07:25:26-04:00"^^xsd:dateTime ;
gist:hasPart <http://example.com/hunk/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=0e40522;hunk=eab30641-eab6-4b44-866d-9dc89fb4bc05> , <http://example.com/hunk/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=0e40522;hunk=f2f312bb-49fb-48a1-a62e-323c0519b333> ;
gist:isIdentifiedBy [ gist:uniqueText "0e40522" ] .
<http://example.com/hunk/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=0e40522;hunk=eab30641-eab6-4b44-866d-9dc89fb4bc05>
rdf:type wd:Q113509427 ;
gist:affects [ rdf:type wd:Q113515824 ;
:containedTextContainer [ rdf:_1 "some pig." ] ;
gist:hasMagnitude [ gist:hasUnitOfMeasure gist:_each ;
gist:numericValue 1
] ;
gist:identifiedBy [ rdf:type wd:Q6553274 ;
gist:numericValue 1
] ;
gist:occursIn [ rdf:type wd:Q86920 ;
gist:name "some.txt"
]
] ;
gist:produces [ rdf:type wd:Q113515824 ;
:containedTextContainer [ rdf:_1 "some Pig." ] ;
gist:hasMagnitude [ gist:hasUnitOfMeasure gist:_each ;
gist:numericValue 1
] ;
gist:identifiedBy [ rdf:type wd:Q6553274 ;
gist:numericValue 1
] ;
gist:occursIn [ rdf:type wd:Q86920 ;
gist:name "some.txt"
]
] .
You'll notice that this commit does a little more. The hunk produces contiguous lines as before. The hunk also affects contiguous lines. That is because this commit does not add a new file; it changes an existing file by replacing some contiguous lines with some other contiguous lines.
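That distinction is directly queryable. In the graphs above, a file-creating hunk has gist:produces but no gist:affects, so a sketch like this finds the files a commit introduced:
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?hunk ?new_file
WHERE
{ ?hunk a wd:Q113509427 .
  ?hunk gist:produces/gist:occursIn/gist:name ?new_file
  FILTER NOT EXISTS { ?hunk gist:affects ?old }
}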
At this point maybe you're wondering why the data isn't more "direct." The RDF seems to spread things out and use generic predicates (produces, occurs in, etc.). That is intentional.
My conversion utility does use some intermediate "direct" data:
{
"abbreviated_commit_hash": "0e40522",
"author_name": "Justin",
"author_email": "[email protected]",
"author_date": "2022-08-23T07:25:26-04:00",
"subject": "capitalization",
"origin": "https://originhere.com/repo.git",
"body": "caps on pig and spider."
}
[
{
"old_source_start_line": "4",
"new_source_start_line": "4",
"old_source_line_count": "1",
"new_source_line_count": "1",
"old_filename": "some.txt",
"new_filename": "some.txt",
"old_content": [
"some spider."
],
"new_content": [
"some Spider."
],
"commit-id": "0e40522",
"origin": "https://originhere.com/repo.git"
},
{
"old_source_start_line": "1",
"new_source_start_line": "1",
"old_source_line_count": "1",
"new_source_line_count": "1",
"old_filename": "some.txt",
"new_filename": "some.txt",
"old_content": [
"some pig."
],
"new_content": [
"some Pig."
],
"commit-id": "0e40522",
"origin": "https://originhere.com/repo.git"
}
]
But that data does not snap together with other data the way RDF does. It has no formal semantics. The meaning of the data has not been unpacked. It is more like an ad hoc projection of the data, and not something I would want to pass around between applications.
There are some nice things about using RDF to express the content of a git repository. This is not a comprehensive list but rather just stuff that I thought of while doing this project:
(1)
You can start anywhere with queries.
If you want to find all things with names you just:
prefix gist: <https://ontologies.semanticarts.com/gist/>
select * where {
  ?s gist:name ?name .
}
If you want to find all files with names:
prefix gist: <https://ontologies.semanticarts.com/gist/>
prefix wd: <http://www.wikidata.org/entity/>
select * where {
  ?s a wd:Q86920 .
  ?s gist:name ?name .
}
You don't need to know structurally where these "fields" live.
(2)
You define things in terms of more primitive things.
For example, if you look on Wikidata you'll see that commit is defined in terms of changeset and version control. Hunk is defined in terms of diff unified format and line.
Eventually definitions bottom out in really primitive things that aren't defined in terms of anything else.
One of the reasons this is helpful is that you can query against the more primitive things and get back results containing more composite things (built up from the more primitive things).
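For instance, a federated query can let Wikidata's subclass hierarchy do that work. The sketch below (assuming your triplestore supports SERVICE) finds everything in the local graph whose type is, transitively, a subclass of a chosen class; wdt:P279 is Wikidata's subclass of property, and wd:Q20058545 (commit) just stands in for whichever more primitive thing you pick:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?thing ?class
WHERE
{ ?thing rdf:type ?class
  SERVICE <https://query.wikidata.org/sparql>
  { ?class wdt:P279* wd:Q20058545 }
}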
(3)
You are encouraged (if you use a thoughtful upper ontology such as Gist) to unpack meaning.
I think of the semantic web as something like the exploded parts diagram for the web's data.
Yes, it takes up more space than a render of the fully assembled thing, but all the components you might want to talk about are addressable and their relationships to other components are evident.
One example of how not unpacking makes question answering harder is the way Wikidata packs postal code ranges into a single string with an en dash (–).
If you query Wikidata to see which region has postal code "10498" allocated to it, you won't find any results.
Instead you'll have to write a query that finds the matching postal code value (some of them are really a range of postal codes designated with an en dash), extracts the start and stop symbols (numbers in this case), enumerates the range, and does a membership check or something similar.
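To make that concrete, here is roughly the kind of range-unpacking query you end up writing against the Wikidata endpoint (wdt:P281 is Wikidata's postal code property; the BINDs fail, and leave their variables unbound, for values that aren't en-dash ranges):
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?region ?postal_code
WHERE
{ ?region wdt:P281 ?postal_code .
  BIND(xsd:integer(STRBEFORE(?postal_code, "–")) AS ?lo)
  BIND(xsd:integer(STRAFTER(?postal_code, "–")) AS ?hi)
  FILTER ( ?postal_code = "10498" ||
           ( BOUND(?lo) && BOUND(?hi) &&
             ?lo <= 10498 && 10498 <= ?hi ) )
}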
If you require users to unpack all your representations before they use them then maybe they'll lose interest and move on to something else.
A thoughtful ontology will help you carve the world at its joints, putting points of articulation between things, by having a thoughtful set of generic predicates. You might not be using a thoughtful ontology if you can connect any two arbitrary things with a single edge.
The unified output format for diff works well for the git and patch programs but not for humans asking questions.
Sure, unpacked representations mean more data (triples), but the alternatives (application-centric data, LPGs/RDF-Star, etc.) are like bodge wires: acceptable for your final act, but not something you'd want to build upon.
(4)
RDF allows for incremental enrichment.
As a followup to this project I think it would be interesting to transform CWEs (Common Weakness Enumeration) and CVEs (Common Vulnerabilities and Exposures) into RDF and connect them to the git repositories where the vulnerability producing code is.
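As a taste of what that enrichment could look like, a couple of triples would connect a vulnerability to the commit that introduced it; the CVE identifier and the :introducedBy predicate below are made up for illustration:
@prefix : <http://example.com/> .

:CVE-2022-00000 :introducedBy
    <http://example.com/commit/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=0e40522> .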
(5)
More people can ask questions of the data.
SPARQL is a declarative query language. The ease of using SPARQL has a bit to do with the thoughtfulness of the domain modeling.
Below I pose several questions to the data and I obtain answers with SPARQL.
The cURL git repo has about 29k commits, going back to 1999.
My conversion tool turned it into just under 8 million triples in 70 minutes. I haven't focused on execution efficiency yet; I wanted to run queries against the data to get a feel for the utility of this approach before I refine the tool.
Let's answer some questions about the development of cURL.
How many deleted lines per person?
PREFIX : <http://example.com/>
PREFIX schema: <https://schema.org/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?creator_email (SUM(?num_deleted_lines) AS ?total_deleted_lines)
WHERE
{ ?commit gist:hasPart ?hunk .
?commit dcterms:creator/schema:email ?creator_email .
?hunk gist:affects/gist:hasMagnitude/gist:numericValue ?start_line_count .
?hunk gist:produces/gist:hasMagnitude/gist:numericValue ?end_line_count .
FILTER ( ?end_line_count < ?start_line_count )
BIND(( ?start_line_count - ?end_line_count ) AS ?num_deleted_lines)
}
GROUP BY ?creator_email
ORDER BY DESC(?total_deleted_lines)
Result:
creator_email | total_deleted_lines |
---|---|
[email protected] | 316996 |
[email protected] | 35028 |
[email protected] | 20653 |
[email protected] | 16097 |
[email protected] | 6392 |
[email protected] | 4540 |
[email protected] | 3986 |
[email protected] | 3367 |
[email protected] | 3293 |
[email protected] | 2823 |
[email protected] | 2083 |
[email protected] | 1923 |
[email protected] | 1876 |
... |
How many deleted files per person?
PREFIX : <http://example.com/>
PREFIX schema: <https://schema.org/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?creator_email (COUNT(?start) AS ?file_count)
WHERE
{ VALUES ?end { "/dev/null" }
?commit gist:hasPart ?hunk .
?commit dcterms:creator/schema:email ?creator_email .
?hunk gist:affects/gist:occursIn/gist:name ?start .
?hunk gist:produces/gist:occursIn/gist:name ?end
}
GROUP BY ?creator_email
ORDER BY DESC(?file_count)
Result:
creator_email | file_count |
---|---|
[email protected] | 936 |
[email protected] | 39 |
[email protected] | 30 |
[email protected] | 14 |
[email protected] | 13 |
[email protected] | 13 |
[email protected] | 12 |
[email protected] | 10 |
[email protected] | 7 |
... |
What files did a particular person delete and when?
PREFIX schema: <https://schema.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
SELECT ?creator_email ?start_file ?at
WHERE
{ VALUES ?end_file { "/dev/null" }
VALUES ?creator_email { "[email protected]" }
?commit gist:hasPart ?hunk .
?commit dcterms:creator/schema:email ?creator_email .
?commit gist:atDateTime ?at .
?hunk gist:affects/gist:occursIn/gist:name ?start_file .
?hunk gist:produces/gist:occursIn/gist:name ?end_file
}
Result:
creator_email | start_file | at |
---|---|---|
[email protected] | install-sh | 2015-04-29T22:51:04+02:00 |
[email protected] | mkinstalldirs | 2015-04-29T22:51:04+02:00 |
[email protected] | missing | 2015-04-30T10:06:09+02:00 |
Which commits affected lib/http.c during 2019?
PREFIX schema: <https://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
SELECT DISTINCT ?commit_id ?creator_email ?creator_name ?at
WHERE
{ VALUES ?start_file {"lib/http.c"}
?commit gist:hasPart ?hunk .
?commit gist:isIdentifiedBy/gist:uniqueText ?commit_id .
?commit gist:atDateTime ?at .
?hunk gist:affects/gist:occursIn/gist:name ?start_file
OPTIONAL
{ ?commit dcterms:creator/schema:email ?creator_email }
OPTIONAL
{ ?commit dcterms:creator/gist:name ?creator_name }
FILTER ( ( ?at >= "2019-01-01T00:00:00"^^xsd:dateTime ) &&
( ?at < "2020-01-01T00:00:00"^^xsd:dateTime ) )
}
ORDER BY ?at
Result:
commit_id | creator_email | creator_name | at |
---|---|---|---|
ebe658c1e | [email protected] | Daniel Stenberg | 2019-01-04T23:34:50+01:00 |
05b100aee | [email protected] | Daniel Stenberg | 2019-02-08T09:33:42+01:00 |
942eb09e8 | [email protected] | Daniel Stenberg | 2019-02-18T08:14:52+01:00 |
62a2534e4 | [email protected] | Daniel Stenberg | 2019-02-25T11:17:53+01:00 |
f1d915ea4 | [email protected] | Daniel Stenberg | 2019-02-27T22:30:32+01:00 |
65eb65fde | [email protected] | Daniel Stenberg | 2019-02-28T11:36:26+01:00 |
5345b04a4 | [email protected] | Daniel Stenberg | 2019-03-03T11:17:52+01:00 |
dd8a19f8a | [email protected] | Marc Schlatter | 2019-03-11T17:15:34+01:00 |
2f44e94ef | [email protected] | Daniel Stenberg | 2019-04-05T16:38:36+02:00 |
... |
Which persons have authored commits with the same email but different names?
PREFIX schema: <https://schema.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
SELECT ?creator_email ?creator_name
WHERE
{ { SELECT ?creator_email (COUNT(DISTINCT ?creator_name) AS ?creator_name_count)
WHERE
{ ?commit dcterms:creator/schema:email ?creator_email .
?commit dcterms:creator/gist:name ?creator_name
}
GROUP BY ?creator_email
HAVING ( COUNT(DISTINCT ?creator_name) > 1 )
}
{ SELECT DISTINCT ?creator_email ?creator_name
WHERE
{ ?commit dcterms:creator/schema:email ?creator_email .
?commit dcterms:creator/gist:name ?creator_name
}
}
}
ORDER BY DESC(?creator_name_count) ?creator_email
Result:
creator_email | creator_name |
---|---|
 | Ian Blanes |
 | Thomas Vegas |
 | z2_ on hackerone |
[email protected] | Evgeny Grin |
[email protected] | Evgeny Grin (Karlson2k) |
[email protected] | Karlson2k |
[email protected] | YAMADA Yasuharu |
[email protected] | Yamada Yasuharu |
[email protected] | Yasuharu Yamada |
[email protected] | Philip H |
[email protected] | pheiduck on githuh |
[email protected] | neutric |
[email protected] | neutric on github |
[email protected] | Niall |
[email protected] | Niall O'Reilly |
... |
Which persons have authored commits with the same name and different email?
PREFIX schema: <https://schema.org/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
SELECT ?creator_email ?creator_name
WHERE
{ { SELECT ?creator_name (COUNT(DISTINCT ?creator_email) AS ?creator_email_count)
WHERE
{ ?commit dcterms:creator/schema:email ?creator_email .
?commit dcterms:creator/gist:name ?creator_name
}
GROUP BY ?creator_name
HAVING ( COUNT(DISTINCT ?creator_email) > 1 )
}
{ SELECT DISTINCT ?creator_email ?creator_name
WHERE
{ ?commit dcterms:creator/schema:email ?creator_email .
?commit dcterms:creator/gist:name ?creator_name
}
}
}
ORDER BY DESC(?creator_email_count) ?creator_name
Result:
creator_email | creator_name |
---|---|
[email protected] | Marcel Raad |
[email protected] | Marcel Raad |
[email protected] | Marcel Raad |
[email protected] | Marcel Raad |
[email protected] | Patrick Monnerat |
[email protected] | Patrick Monnerat |
[email protected] | Patrick Monnerat |
[email protected] | Patrick Monnerat |
[email protected] | Alessandro Ghedini |
[email protected] | Alessandro Ghedini |
[email protected] | Alessandro Ghedini |
... |
Which persons have authored commits in libcurl's lib/ directory (this includes deleting something in there)?
PREFIX schema: <https://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
SELECT DISTINCT ?creator_email ?creator_name # 727
# SELECT DISTINCT ?creator_name # 678
# SELECT DISTINCT ?creator_email # 699
WHERE
{ ?commit gist:hasPart ?hunk
OPTIONAL
{ ?hunk gist:affects/gist:occursIn/gist:name ?start_file }
OPTIONAL
{ ?hunk gist:produces/gist:occursIn/gist:name ?end_file }
OPTIONAL
{ ?commit dcterms:creator/schema:email ?creator_email }
OPTIONAL
{ ?commit dcterms:creator/gist:name ?creator_name }
FILTER ( regex(?start_file, "^lib/") || regex(?end_file, "^lib/") )
}
Result:
creator_email | creator_name |
---|---|
[email protected] | Daniel Stenberg |
[email protected] | Sterling Hughes |
[email protected] | sm |
[email protected] | Joern Hartroth |
[email protected] | Jean-Philippe Barette-LaPierre |
... |
And depending on how you count the people, the query finds between 678 and 727 people who authored commits in libcurl's lib/ directory. That was Daniel's first question. He got 629 with his method, but that was a few months ago and I don't know exactly what his method of counting was. He may not have counted deleting a file in that directory like I did.
To answer his next three questions I'd need to record each commit's parent commit (I don't yet; it's one of my many TODOs) and simulate the application of hunks in SPARQL, or add the output of git blame to the RDF. Daniel likely used the output of git blame. I'll think about adding it to the RDF.
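Recording parent commits, at least, would only take one more triple per commit; a sketch, with a made-up :hasParentCommit predicate (I'd want to find or mint a more thoughtful one):
@prefix : <http://example.com/> .

<http://example.com/commit/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=0e40522>
    :hasParentCommit <http://example.com/commit/origin=https%3A%2F%2Foriginhere.com%2Frepo.git;commit=e3c3d94> .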
In another blog post I might describe how the conversion utility works. It is written in Clojure and uses SPARQL Anything (which is built on Apache Jena). I expect to push it to GitHub soon.
It is fun to imagine having all the git repos in Github as RDF graphs in a massive triplestore and asking questions with SPARQL.
In my example queries I didn't make use of the fact that each source code line is in the RDF. Most triplestores have full-text search capabilities, so I'll write some queries that make use of that too. In general I haven't been overly impressed with the search built into GitLab and Bitbucket (I haven't used GitHub's search much), so I wonder if keeping an RDF representation with full-text search would be a useful approach. I'd love to see SPARQL endpoints for searching hosted git platforms!
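Even without a vendor-specific text index, plain SPARQL can already grep the lines. A sketch against the graphs above (the rdf:_1, rdf:_2, ... container slots are matched by their IRI prefix):
PREFIX : <http://example.com/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
SELECT ?hunk ?line
WHERE
{ ?hunk gist:produces/:containedTextContainer ?container .
  ?container ?slot ?line
  FILTER ( STRSTARTS(STR(?slot), "http://www.w3.org/1999/02/22-rdf-syntax-ns#_") )
  FILTER ( CONTAINS(LCASE(?line), "spider") )
}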
I think this technique could be applied to other application-centric file formats. SPARQL Anything gets you part of the way there for several file formats but I'd like to hear if you have other ideas.
Please feel free to start a discussion above if you have ideas!