Merge join #428
Conversation
I now have a mostly functional merge join implementation. It took a bit of work to get it running in qEndpoint, especially since I needed to migrate the code from RDF4J 4.x to 5.x. A big change is that the checked exception support in CloseableIteration has been removed.

To try out merge join you first have to check out the branch in the RDF4J PR here: eclipse-rdf4j/rdf4j#4822 and build it. Then you can build the branch of this PR to try it out in qEndpoint.

I've assumed that all data is sorted by subject, since the default index in HDT is SPO. I've also assumed that we can do a string comparison on the values.
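For context, this is the shape of the technique: a minimal sort-merge join sketch over two inputs that are both sorted by the join key and compared as strings, matching the two assumptions above. The names are illustrative, not the actual qEndpoint classes.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeJoinSketch {
	// Joins two lists of (key, value) pairs, both sorted by key, into
	// (key, leftValue, rightValue) rows. Duplicate keys are handled by
	// buffering the right-hand group that shares the current key.
	static List<String[]> mergeJoin(List<String[]> left, List<String[]> right) {
		List<String[]> out = new ArrayList<>();
		int i = 0, j = 0;
		while (i < left.size() && j < right.size()) {
			int cmp = left.get(i)[0].compareTo(right.get(j)[0]); // string comparison, as assumed above
			if (cmp < 0) {
				i++;
			} else if (cmp > 0) {
				j++;
			} else {
				String key = left.get(i)[0];
				int groupEnd = j;
				while (groupEnd < right.size() && right.get(groupEnd)[0].equals(key)) {
					groupEnd++;
				}
				// Emit the cross product of the matching left rows and the right group.
				while (i < left.size() && left.get(i)[0].equals(key)) {
					for (int k = j; k < groupEnd; k++) {
						out.add(new String[] { key, left.get(i)[1], right.get(k)[1] });
					}
					i++;
				}
				j = groupEnd;
			}
		}
		return out;
	}
}
```

The real implementation would stream from the store's sorted iterators rather than materialized lists, but the cursor movement is the same.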
qendpoint-store/src/main/java/com/the_qa_company/qendpoint/store/experimental/QEPSailStore.java
Hello, the implementation in the package com/the_qa_company/qendpoint/store/experimental isn't actually the one we're using; it is a prototype for a future store. The one we are currently using is inside the store package: EndpointStore is the sail class and EndpointStoreConnection the connection. Inside the store we use custom Values, which all implement HDTValue; the main difference is that you can get a numerical ID for the element inside the HDT file.

Another big issue is that the iterators resolved from an HDT file aren't "sorted" by string, but by the ID inside the HDT. For the subjects and objects this is to compress the elements shared between these two sections, and for the predicate section it is because the HDT string sort is done on the UTF-8 bytes and not on the Java String (see this issue: rdfhdt/hdt-java#177).
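A small self-contained illustration of that ordering mismatch (my example, not from the linked issue): Java's String.compareTo compares UTF-16 code units, while a byte-wise UTF-8 sort follows code-point order, and the two disagree for supplementary characters.

```java
public class Utf8VsUtf16Order {
	public static void main(String[] args) {
		// U+10000, a supplementary character encoded as the surrogate pair D800 DC00
		String supplementary = new String(Character.toChars(0x10000));
		// U+FF61 (halfwidth ideographic full stop), a single BMP code unit
		String bmp = "\uFF61";

		// UTF-16 code-unit order: 0xD800 < 0xFF61, so the supplementary char sorts first.
		System.out.println(supplementary.compareTo(bmp) < 0); // true

		// Code-point order (what a byte-wise UTF-8 sort yields): 0x10000 > 0xFF61.
		System.out.println(supplementary.codePointAt(0) < bmp.codePointAt(0)); // false
	}
}
```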
I'll take a look and add the new methods to the non-experimental store. If predicates have a different comparator to the subject/object/context, that could be a bit problematic. During the SPARQL evaluation we use a binding set where we only retain the variable name and value. Even if we did retain whether the value was originally a subject, predicate or object, we would still run into problems when the variable is a predicate in one statement and a subject in another.
For the predicate vs subject/object part, I think you can ignore it or consider the two streams as not easily "mergeable". I'm not a triple store user, but I don't think this case comes up a lot.
qendpoint-store/src/main/java/com/the_qa_company/qendpoint/store/EndpointTripleSource.java
@ate47 is there any chance that we could have a PSOC index? It would be great for the fairly common pattern |
@hmottestad: @ate47 has been working on this since Friday, see #429. We are on it... we just need some more time... (and it is a holiday today in France...)
Force-pushed from 0b2a0a9 to d90cb0f
```diff
 public QueryCloseableIterator search(CharSequence subject, CharSequence predicate, CharSequence object)
 		throws QEPCoreException {
 	QEPDatasetContext ctx = createContext();
-	return search(ctx, subject, predicate, object).attach(ctx);
+	return (QueryCloseableIterator) search(ctx, subject, predicate, object).attach(ctx);
 }
```
This may not be the correct fix, but it was the best I could manage when migrating to RDF4J 5.0.0.
```java
public class QEPDataset implements Closeable, Serializable {

	@Serial
	private static final long serialVersionUID = 7525689572432598258L;
```
There are some requirements for serializable classes in RDF4J that were not explicitly tested for. From what I remember when migrating the code to RDF4J 5.0.0, this class is used transitively from within a query plan: it is used in QEPComponent, which is used in QEPCoreBNode, which is used in StatementPattern through Var.
```diff
 @SafeVarargs
 public static QueryCloseableIterator of(QueryCloseableIterator it,
-		AutoCloseableGeneric<QEPCoreException>... closeables) {
+		AutoCloseableGeneric<? extends RuntimeException>... closeables) {
 	Objects.requireNonNull(it, "it can't be null!");
 	if (closeables.length == 0) {
 		return it;
```
Again, I'm not sure whether this is the correct approach to fixing the issues when migrating to RDF4J 5.0.0.
```diff
@@ -472,6 +474,7 @@ public void load(InputStream input, ControlInfo ci, ProgressListener listener) t
 
 	@Override
 	public void mapFromFile(CountInputStream input, File f, ProgressListener listener) throws IOException {
+		log.info("Mapping BitmapTriples from {}", f.getName());
```
I found this useful during application startup, to see that the application is not hanging but in fact loading a large file.
```diff
 /**
  * @author mario.arias
  */
-public class BitmapTriplesIterator implements SuppliableIteratorTripleID {
+public class BitmapTriplesIterator implements SuppliableIteratorTripleID, IndexReportingIterator {
```
I've added a new interface to RDF4J which allows the query explanation to include the name of the index used in StatementPattern.
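For readers following along, the interface is presumably something along these lines; I'm sketching the method name, so treat it as an assumption rather than the exact RDF4J signature.

```java
// Assumed shape of the new RDF4J interface; the real accessor name may differ.
public interface IndexReportingIterator {
	// Human-readable name of the index backing this iterator (e.g. "SPO"),
	// surfaced in the query explanation of the enclosing StatementPattern.
	String getIndexName();
}
```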
```diff
@@ -400,8 +400,7 @@ public LinkedSail<? extends NotifyingSail> compileNode(Value node) throws SailCo
  */
 public Value searchOne(Resource subject, IRI predicate) throws SailCompilerException {
 	Value out;
-	try (CloseableIteration<? extends Statement, SailCompilerException> it = connection.getStatements(subject,
-			predicate, null)) {
+	try (CloseableIteration<? extends Statement> it = connection.getStatements(subject, predicate, null)) {
```
In RDF4J 5.0.0 the CloseableIteration interface and all other iterators/iterations no longer support generic declaration of exceptions. You can still throw an exception, but it can no longer be declared, so it needs to be a RuntimeException.
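In practice the migration pattern looks roughly like this (a sketch; QEPCoreException is assumed here to be unchecked, which matches the `? extends RuntimeException` bound in the earlier diff):

```java
// RDF4J 4.x: the exception type was part of the iteration's generic signature.
// CloseableIteration<? extends Statement, QEPCoreException> it = ...

// RDF4J 5.x: only the element type remains, so checked exceptions from lower
// layers must be wrapped in an unchecked exception before crossing an
// iterator boundary.
try {
	riskyLowLevelRead(); // hypothetical method that throws IOException
} catch (java.io.IOException e) {
	throw new QEPCoreException(e); // assumed to extend RuntimeException
}
```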
```diff
@@ -13,7 +13,7 @@ static int compare(HDTValue v1, HDTValue v2) {
 		return c;
 	}
 
-	return Long.compare(v1.getHDTPosition(), v2.getHDTPosition());
+	return Long.compare(v1.getHDTId(), v2.getHDTId());
```
The HDTPosition is checked above, so I assume that the ID should be compared if the positions are equal.
```java
if (o instanceof IRI) {
	return toString().equals(o.toString());
} else {
	return false;
}
```
A small performance improvement.
```diff
 	private volatile boolean isMerging = false;
 
-	public boolean isMergeTriggered = false;
+	public volatile boolean isMergeTriggered = false;
```
These fields needed to be volatile so that values written by one thread are visible when read from another.
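A minimal standalone illustration of the visibility problem (not qEndpoint code): without volatile, the JIT is free to hoist the field read out of the loop, and the reader can spin forever.

```java
public class VolatileVisibilityDemo {
	// Remove 'volatile' here and the reader thread may never terminate.
	static volatile boolean mergeTriggered = false;

	public static void main(String[] args) throws InterruptedException {
		Thread reader = new Thread(() -> {
			while (!mergeTriggered) {
				Thread.onSpinWait(); // same spin-wait pattern used in the test further down
			}
			System.out.println("observed the update");
		});
		reader.start();
		Thread.sleep(100);
		mergeTriggered = true; // volatile write: guaranteed visible to the reader
		reader.join();
	}
}
```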
```diff
@@ -760,7 +760,7 @@ public void markDeletedTempTriples() throws IOException {
 	try (InputStream inputStream = new FileInputStream(endpointFiles.getTempTriples())) {
 		RDFParser rdfParser = Rio.createParser(RDFFormat.NTRIPLES);
 		rdfParser.getParserConfig().set(BasicParserSettings.VERIFY_URI_SYNTAX, false);
-		try (GraphQueryResult res = QueryResults.parseGraphBackground(inputStream, null, rdfParser, null)) {
+		try (GraphQueryResult res = QueryResults.parseGraphBackground(inputStream, null, rdfParser)) {
```
The last argument to this RDF4J method has been removed in RDF4J 5.0.0.
```diff
@@ -128,15 +128,17 @@ protected CloseableIteration<? extends BindingSet, QueryEvaluationException> eva
 	new DisjunctiveConstraintOptimizer().optimize(tupleExpr, dataset, bindings);
 	new SameTermFilterOptimizer().optimize(tupleExpr, dataset, bindings);
 	new QueryModelNormalizerOptimizer().optimize(tupleExpr, dataset, bindings);
-	new QueryJoinOptimizer(evaluationStatistics).optimize(tupleExpr, dataset, bindings);
+	new QueryJoinOptimizer(evaluationStatistics, tripleSource).optimize(tupleExpr, dataset, bindings);
```
Including the tripleSource here is what allows the QueryJoinOptimizer to check the possible statement orders from the underlying store.
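As a sketch of the idea (my illustration, not the RDF4J optimizer code, using the getSupportedOrders signature shown further down in this diff): the optimizer can ask the store which orders both sides of a join can be iterated in, and pick a merge join when a shared order exists.

```java
// Hypothetical check for two statement patterns joined on their subject
// variable. StatementOrder.S is assumed to denote subject order; leftPred
// and rightPred are hypothetical bound predicates.
static boolean canMergeJoinOnSubject(TripleSource tripleSource, IRI leftPred, IRI rightPred) {
	Set<StatementOrder> left = tripleSource.getSupportedOrders(null, leftPred, null);
	Set<StatementOrder> right = tripleSource.getSupportedOrders(null, rightPred, null);
	return left.contains(StatementOrder.S) && right.contains(StatementOrder.S);
}
```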
```java
QueryEvaluationStep precompile = strategy.precompile(tupleExpr);

return precompile.evaluate(bindings);
```
Here is also a small performance fix that uses the precompile stage in RDF4J. I think precompilation was introduced in RDF4J 4.0.0.
```java
CloseableIteration<? extends Statement> repositoryResult1 = this.endpointStoreConnection.getConnA_read()
		.getStatements(newSubj, newPred, newObj, false, contexts);
CloseableIteration<? extends Statement> repositoryResult2 = this.endpointStoreConnection.getConnB_read()
		.getStatements(newSubj, newPred, newObj, false, contexts);
repositoryResult = new CombinedNativeStoreResult(repositoryResult1, repositoryResult2);
```
If there is a failure when getting repositoryResult2, then repositoryResult1 would not be closed. I've fixed similar issues in RDF4J and just realised that the same could happen here. I haven't tried to fix the code, but wanted to point it out now that I've seen it.
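A possible shape of the fix, based on the snippet above (untested sketch):

```java
CloseableIteration<? extends Statement> repositoryResult1 = this.endpointStoreConnection.getConnA_read()
		.getStatements(newSubj, newPred, newObj, false, contexts);
CloseableIteration<? extends Statement> repositoryResult2;
try {
	repositoryResult2 = this.endpointStoreConnection.getConnB_read()
			.getStatements(newSubj, newPred, newObj, false, contexts);
} catch (Throwable t) {
	repositoryResult1.close(); // don't leak the first result if the second fails
	throw t;
}
repositoryResult = new CombinedNativeStoreResult(repositoryResult1, repositoryResult2);
```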
```java
}

@Override
public Set<StatementOrder> getSupportedOrders(Resource subj, IRI pred, Value obj, Resource... contexts) {
```
Most of the code here was copied from the getStatements method.
```diff
@@ -29,7 +24,7 @@ public class HDTConverter {
 	public static final String HDT_URI = "http://hdt.org/";
 	private final EndpointStore endpoint;
 	private final HDT hdt;
-	private final ValueFactory valueFactory = new MemValueFactory();
+	private final ValueFactory valueFactory = SimpleValueFactory.getInstance();
```
Performance improvement, since we don't need the deduplication aspects of the MemValueFactory.
```java
Var var1 = new Var(var.getName(), converter.idToHDTValue(id, position), var.isAnonymous(),
		var.isConstant());
var.replaceWith(var1);
```
In order to reduce bugs in RDF4J we now require that Var objects are not reused.
```java
//@formatter:off

//package com.the_qa_company.qendpoint;
//
//import com.the_qa_company.qendpoint.core.options.HDTOptions;
//import com.the_qa_company.qendpoint.core.options.HDTOptionsKeys;
```
I've commented this class out since it requires the Wikidata HDT files.
```java
//	public EndpointSPARQL11QueryComplianceTest(String displayName, String testURI, String name, String queryFileURL,
//			String resultFileURL, Dataset dataset, boolean ordered, boolean laxCardinality)
//			throws ParserException, NotFoundException, IOException {
//		super(displayName, testURI, name, queryFileURL, resultFileURL, null, ordered, laxCardinality);
//		setUpHDT(dataset);
//		List<String> testToIgnore = new ArrayList<>();
//		// @todo these tests are failing and should not, they are skipped so
//		// that we can be sure that we see when
//		// currently passing tests are not failing. Many of these tests are not
//		// so problematic since we do not support
//		// named graphs anyway
//		testToIgnore.add("constructwhere02 - CONSTRUCT WHERE");
//		testToIgnore.add("constructwhere03 - CONSTRUCT WHERE");
//		testToIgnore.add("constructwhere04 - CONSTRUCT WHERE");
//		testToIgnore.add("Exists within graph pattern");
//		testToIgnore.add("(pp07) Path with one graph");
//		testToIgnore.add("(pp35) Named Graph 2");
//		testToIgnore.add("sq01 - Subquery within graph pattern");
//		testToIgnore.add("sq02 - Subquery within graph pattern, graph variable is bound");
//		testToIgnore.add("sq03 - Subquery within graph pattern, graph variable is not bound");
//		testToIgnore.add("sq04 - Subquery within graph pattern, default graph does not apply");
//		testToIgnore.add("sq05 - Subquery within graph pattern, from named applies");
//		testToIgnore.add("sq06 - Subquery with graph pattern, from named applies");
//		testToIgnore.add("sq07 - Subquery with from ");
//		testToIgnore.add("sq11 - Subquery limit per resource");
//		testToIgnore.add("sq13 - Subqueries don't inject bindings");
//		testToIgnore.add("sq14 - limit by resource");
//
//		this.setIgnoredTests(testToIgnore);
//	}
```
The test class in RDF4J has changed quite a bit in RDF4J 5.0.0, and there is no longer a constructor that provides the dataset. I've simply commented out the code since I wasn't sure how to fix it on the RDF4J side.
It would be good to understand how it works before merging.
I took some time to figure out the RDF4J side of things, and I've created a method that gets called with all the test arguments so that we can listen in and read the dataset variable to set up the HDT store.

I just merged it into RDF4J, so it should be available from the snapshot repo within 24 hours.
```java
while (store.isMergeTriggered || store.isMerging()) {
	Thread.onSpinWait();
}
```
This is where we read the two variables I had to make volatile.
```diff
@@ -150,7 +152,7 @@ public static Statement getFakeStatement(ValueFactory vf, int id) {
 
 	private static void writeBigIndex(File file) throws IOException {
 		ValueFactory vf = new MemValueFactory();
-		try (FileOutputStream out = new FileOutputStream(file)) {
+		try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
```
Performance improvement for tests.
```xml
<repositories>
	<repository>
		<id>oss.sonatype.org-snapshot</id>
		<url>https://oss.sonatype.org/content/repositories/snapshots</url>
		<releases>
			<enabled>false</enabled>
		</releases>
		<snapshots>
			<enabled>true</enabled>
		</snapshots>
	</repository>
</repositories>
```
I have not published a new milestone build of RDF4J 5.0.0 yet, so we need to use the snapshots repo in the meantime.
Fine for me, except for the small parts I commented on.
qendpoint-core/pom.xml
```xml
	<groupId>org.eclipse.rdf4j</groupId>
	<artifactId>rdf4j-common-iterator</artifactId>
	<version>${rdf4j.version}</version>
</dependency>
```
Our goal was to keep the core, which is a clone of the rdfhdt/hdt-java repository, as close as possible to the original library. Is it mandatory to add RDF4J to it?
I've managed to move the code to the qEndpoint store model, so we don't need to reference RDF4J from the core model anymore.
```java
/**
 * This flag can be set to false in order to disable the use of merge join.
 * This can be useful for comparing performance.
 */
private static boolean ENABLE_MERGE_JOIN = true;
```
I think it would be better to use the config of the endpoint with a key. You can access the options using:

```java
HDTOptions spec = endpoint.getHDTSpec();
enableMergeJoin = spec.getBoolean("qendpoint.mergejoin", false);
```
Thanks. I've changed it to use the HDT spec, and I've made true the default.
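Presumably the final lookup ends up along these lines (a sketch based on the snippet above, with the default flipped to true):

```java
HDTOptions spec = endpoint.getHDTSpec();
boolean enableMergeJoin = spec.getBoolean("qendpoint.mergejoin", true); // enabled by default
```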
Force-pushed from 9af4b87 to 74f46d4
@ate47 could you run the tests? I'm seeing a test failure locally, but I'm not sure why it's failing.
Issue resolved (if any): #

Description of this pull request:

Please check all the lines before posting the pull request:
- I have formatted my code (`mvn formatter:format` on the backend, `npm run format` on the frontend) before posting my pull request; use `mvn formatter:validate` to validate the formatting on the backend, `npm run validate` on the frontend.