RDF (.nq) to HDT: Strategy for converting / querying very large .hdt files on rather low-end hardware #171
-
Hi again :-)

Context / constraints: the '.nq' to '.hdt' conversion, as of 2018, requires a significant amount of memory ... 120GB of RAM per 1 billion N-Triples. So 16GB of RAM is insufficient for a 0.5TB '.nq' file :-(

Hypothetical solution

Step 1: convert the '.nq' file into an '.hdt' file
Step 1 details + question: convert the N-Quads into HDT. Current strategy is to:
Q1: Would you do anything different to convert into HDT? (A conversion sketch follows the questions below.)

Step 2 question: bring the '.hdt' into a Jena model
Q2: Would bringing the '.hdt' into a Jena model involve data copying like the TDB(2) store does? It seems not, but I would prefer a confirmation.

Step 3 question: offer a SPARQL endpoint
Q3: Would running a SPARQL query on the endpoint bring the whole RDF graph into memory when run against an HDT-backed Jena model?
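For reference, a minimal sketch of the conversion step with the hdt-java API, assuming the quads have already been flattened to N-Triples (HDT stores triples only; see the note in the first reply) and that `chunk-000.nt` is a hypothetical chunk small enough for the available RAM:

```java
import org.rdfhdt.hdt.enums.RDFNotation;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class NtToHdt {
    public static void main(String[] args) throws Exception {
        // One-shot conversion; this build step is where the large
        // in-memory dictionary/triples construction happens.
        try (HDT hdt = HDTManager.generateHDT(
                "chunk-000.nt",            // hypothetical input chunk
                "http://example.org/base", // base URI for relative IRIs
                RDFNotation.NTRIPLES,
                new HDTSpecification(),    // default HDT options
                null)) {                   // no progress listener
            hdt.saveToHDT("chunk-000.hdt", null);
        }
    }
}
```

With only 16GB of RAM, the obvious variant is to split the input and run this once per chunk, which is what the second reply below suggests.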
-
NOTE: HDT does not support quads, neither the Java nor the C++ version. If this is a no-go, then currently do not use HDT. See #3 and https://github.com/JulianRei/hdtq-java (this is not code to use in production currently).
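Since HDT only stores triples, the graph component has to be dropped before conversion. A sketch of one way to do that with Jena RIOT's streaming parser, so the 0.5TB file is never held in memory (file names are placeholders):

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWrapper;
import org.apache.jena.riot.system.StreamRDFWriter;
import org.apache.jena.sparql.core.Quad;

public class NqToNt {
    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("data.nt")) {
            StreamRDF writer = StreamRDFWriter.getWriterStream(out, Lang.NTRIPLES);
            // Forward each quad as a plain triple, discarding the graph term.
            // Caveat: a triple asserted in several graphs is emitted once per
            // graph; HDT generation will deduplicate, but the .nt file won't.
            StreamRDF dropGraph = new StreamRDFWrapper(writer) {
                @Override
                public void quad(Quad q) {
                    triple(q.asTriple());
                }
            };
            RDFParser.source("data.nq").lang(Lang.NQUADS).parse(dropGraph);
        }
    }
}
```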
-
Q1: we're working on it, but you don't have any other option than this one yet.
Q2: it plugs the HDT into a Jena endpoint without doing any copy.
Q3: the HDT is mapped, not loaded, so you don't need as much memory as with a Jena store; and because HDT is compressed data, that holds even if you load the HDT in memory.
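To make Q2/Q3 concrete, a minimal sketch with hdt-java plus the hdt-jena bindings (the file name is hypothetical): the HDT is memory-mapped and wrapped as a Jena Graph without copying any triples, and SPARQL runs directly against it.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class QueryHdt {
    public static void main(String[] args) throws Exception {
        // Map (not load) the HDT file: the OS pages data in on demand.
        HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", null);
        // Expose it to Jena as a Graph/Model; no data is copied.
        Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));
        String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
            ResultSetFormatter.out(qe.execSelect());
        }
        hdt.close();
    }
}
```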
-
Q1: currently not, though we are working on better compression. With 16GB of RAM you can compress roughly a 16GB .nt file in one chunk (more or less); see the merge sketch after these answers.
Q2: no. Note that the Jena implementation is a bit buggy and might not optimize queries particularly well. You can also check out https://github.com/the-qa-company/qEndpoint.
Q3: no, the HDT is only mapped; as a minimum requirement you need about 3% of the data size in memory, so for a 100GB HDT file, about 3GB of memory.
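If the chunked route from Q1 is taken, the per-chunk HDT files still need to be merged into one. Assuming your hdt-java version ships the hdtCat merge (`HDTManager.catHDT`), a sketch with hypothetical file names:

```java
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.options.HDTSpecification;

public class MergeChunks {
    public static void main(String[] args) throws Exception {
        // Merge two chunk HDTs into one; repeat pairwise for more chunks.
        try (HDT merged = HDTManager.catHDT(
                "/tmp/hdtcat-work",     // scratch directory for the merge
                "chunk-000.hdt",
                "chunk-001.hdt",
                new HDTSpecification(), // default HDT options
                null)) {                // no progress listener
            merged.saveToHDT("merged.hdt", null);
        }
    }
}
```

The merge works on the compressed form, so it avoids re-running the memory-hungry dictionary build over the whole dataset.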