RDF (.nq) to HDT: Strategy for converting / querying very large .hdt files on rather low-end hardware #171
-
Hi again :-)

Context / constraints: the '.nq' to '.hdt' conversion, as of 2018, requires a significant amount of memory ... 120GB of RAM per 1 billion N-Triples. So 16GB of RAM is insufficient for a 0.5TB '.nq' file :-(

Hypothetical solution

Step 1: convert the '.nq' file into an '.hdt' file
Step 1 details + question: convert the N-Quads into HDT. Current strategy is to:
Q1: Would you do anything different to convert into HDT? (A conversion sketch follows the questions below.)

Step 2 question: bring the '.hdt' into a Jena model
Q2: Would bringing the '.hdt' into a Jena model involve data copying like the TDB(2) store does? It seems not, but I would prefer a confirmation.

Step 3 question: offer a SPARQL endpoint
Q3: Would running a SPARQL query on the endpoint bring the whole RDF graph into memory when run against an HDT-backed Jena model?
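For reference, a minimal sketch of the conversion step with the hdt-java API, assuming the quads have already been flattened to N-Triples (HDT stores triples only; see the note in the first reply) and that `chunk-000.nt` is a hypothetical chunk small enough for the available RAM:

```java
import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class NtToHdt {
    public static void main(String[] args) throws Exception {
        // One-shot conversion; this build step is where the large
        // in-memory dictionary/triples construction happens.
        try (HDT hdt = HDTManager.generateHDT(
                "chunk-000.nt",            // hypothetical input chunk
                "http://example.org/base", // base URI for relative IRIs
                RDFNotation.NTRIPLES,
                new HDTSpecification(),    // default HDT options
                null)) {                   // no progress listener
            hdt.saveToHDT("chunk-000.hdt", null);
        }
    }
}
```

With only 16GB of RAM, the obvious variant is to split the input and run this once per chunk, which is what the second reply below suggests.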
-
NOTE: HDT does not support quads, neither the Java nor the C++ version. If this is a no-go, then currently do not use HDT. See #3 and https://github.com/JulianRei/hdtq-java (this is not code to use in production currently).
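Since HDT only stores triples, the graph component has to be dropped before conversion. A sketch of one way to do that with Jena RIOT's streaming parser, so the 0.5TB file is never held in memory (file names are placeholders):

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWrapper;
import org.apache.jena.riot.system.StreamRDFWriter;
import org.apache.jena.sparql.core.Quad;

public class NqToNt {
    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("data.nt")) {
            StreamRDF writer = StreamRDFWriter.getWriterStream(out, Lang.NTRIPLES);
            // Forward each quad as a plain triple, discarding the graph term.
            // Caveat: a triple asserted in several graphs is emitted once per
            // graph; HDT generation will deduplicate, but the .nt file won't.
            StreamRDF dropGraph = new StreamRDFWrapper(writer) {
                @Override
                public void quad(Quad q) {
                    triple(q.asTriple());
                }
            };
            RDFParser.source("data.nq").lang(Lang.NQUADS).parse(dropGraph);
        }
    }
}
```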
-
Q1: we're working on it, but you don't have any other option than this one yet.
Q2: it plugs the HDT into a Jena endpoint without doing any copy.
Q3: the HDT is mapped, not loaded, so you don't need as much memory as with a Jena store; and because HDT is compressed data, that holds even if you load the HDT in memory.
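To make Q2/Q3 concrete, a minimal sketch with hdt-java plus the hdt-jena bindings (the file name is hypothetical): the HDT is memory-mapped and wrapped as a Jena Graph without copying any triples, and SPARQL runs directly against it.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class QueryHdt {
    public static void main(String[] args) throws Exception {
        // Map (not load) the HDT file: the OS pages data in on demand.
        HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", null);
        // Expose it to Jena as a Graph/Model; no data is copied.
        Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));
        String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
            ResultSetFormatter.out(qe.execSelect());
        }
        hdt.close();
    }
}
```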
-
Q1: currently not, though we are working on better compression. With 16GB of RAM you can compress roughly a 16GB .nt file in one chunk (more or less); see the merge sketch after these answers.
Q2: no. Note that the Jena implementation is a bit buggy and might not optimize queries particularly well. You can also check out https://github.com/the-qa-company/qEndpoint.
Q3: no, the HDT is only mapped; as a minimum requirement you need about 3% of the data size in memory, so for a 100GB HDT file, about 3GB of memory.
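If the chunked route from Q1 is taken, the per-chunk HDT files still need to be merged into one. Assuming your hdt-java version ships the hdtCat merge (`HDTManager.catHDT`), a sketch with hypothetical file names:

```java
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class MergeChunks {
    public static void main(String[] args) throws Exception {
        // Merge two chunk HDTs into one; repeat pairwise for more chunks.
        try (HDT merged = HDTManager.catHDT(
                "/tmp/hdtcat-work",     // scratch directory for the merge
                "chunk-000.hdt",
                "chunk-001.hdt",
                new HDTSpecification(), // default HDT options
                null)) {                // no progress listener
            merged.saveToHDT("merged.hdt", null);
        }
    }
}
```

The merge works on the compressed form, so it avoids re-running the memory-hungry dictionary build over the whole dataset.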