Large number of threads / limit number of threads ? #7

tfrancart · 2019-03-06T09:52:21Z

Hello

On a query involving a large number of entities (tens of thousands) and a join between two sources in the federation (each of these entities has a property linking it to an entity in the other source), I am seeing a lot of errors like the following, and the query does not terminate:

[269,472s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,472s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,472s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,473s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,473s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,473s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,473s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,473s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,474s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,474s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,474s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,474s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,474s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,474s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,475s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,475s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,475s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,475s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,476s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,476s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,476s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,476s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,476s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
[269,477s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.

My hypothesis is that FedX needs to create a lot of threads and that thread creation fails. How can I control the number of threads being created to avoid such errors?


aschwarte10 · 2019-03-07T07:43:41Z

@tfrancart Thanks for providing all the feedback, this is really helpful in understanding how FedX behaves in actual use cases.

I just double checked the code: for executing joins (and also unions) in parallel I am using a thread pool executor with a defined number of threads. The number of available slots can be configured using the FedX config option Config.getConfig().getJoinWorkerThreads() (defaulting to 20). The thread pool is also backed by a LinkedBlockingQueue (which basically maintains the runnables waiting for their execution in order).
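For readers unfamiliar with this setup, here is a minimal sketch of a fixed-size thread pool backed by a LinkedBlockingQueue, using only the standard java.util.concurrent API. It is not FedX's actual code: the class name BoundedPoolSketch and the method runTasks are hypothetical, and the pool size is passed in by hand rather than read from the FedX config. The point it illustrates is that such a pool never creates more than the configured number of worker threads, no matter how many tasks are submitted.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration, not taken from the FedX code base.
class BoundedPoolSketch {

    /**
     * Runs taskCount trivial tasks on a pool limited to workerThreads threads.
     * Returns { largest pool size observed, number of tasks that ran }.
     */
    static long[] runTasks(int workerThreads, int taskCount) {
        // core == max, so the pool never grows beyond workerThreads.
        // The unbounded LinkedBlockingQueue keeps waiting tasks in FIFO order.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workerThreads, workerThreads,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
        AtomicInteger ran = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(() -> ran.incrementAndGet());
        }
        pool.shutdown(); // already queued tasks are still executed
        try {
            if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { pool.getLargestPoolSize(), ran.get() };
    }

    public static void main(String[] args) {
        long[] result = runTasks(20, 10_000);
        System.out.println("largest pool size: " + result[0] + ", tasks run: " + result[1]);
    }
}
```

With a pool like this, extra work queues up instead of spawning threads, which is why thread-creation failures would have to come from somewhere other than the pool itself.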

It is unclear to me how the above error can happen.

The above error messages look like system call errors. Can you explain how you were able to log them? Were they printed to stderr? Are there any further details (e.g. a stack trace showing where the thread creation is attempted)?
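One way to gather such details from inside the JVM is sketched below, using only standard JDK calls (ManagementFactory.getThreadMXBean() and Thread.getAllStackTraces()). The class ThreadCountProbe and its methods are hypothetical names for illustration, and how it would be wired into a Tomcat application is left open. It reports the live thread count and groups threads by name prefix, which usually hints at which component created them.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper, not part of FedX.
class ThreadCountProbe {

    /** Number of live threads as reported by the JVM's thread MXBean. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /**
     * Counts live threads grouped by name with trailing digits stripped,
     * so that "pool-1-thread-1" and "pool-1-thread-2" share one entry.
     */
    static Map<String, Integer> countByNamePrefix() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            String prefix = t.getName().replaceAll("\\d+$", "");
            counts.merge(prefix, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("live threads: " + liveThreads());
        countByNamePrefix().forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

Calling this periodically while the query runs would show whether the thread count really grows, and under which thread names.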

Also, I did a quick Google search on the error: the first results indicate that an EAGAIN error occurs if the process has run out of memory. Could you re-run your tests giving your JVM more max memory (e.g. using -Xmx6G) or even more? I am not sure why the process runs into this kind of error rather than an out-of-memory error.

To understand your data a bit better, does much of it follow this pattern?

endpoint 1:
source1:x :isRelated source2:otherEntity

endpoint 2:
source2:otherEntity rdfs:label "Other Entity "

And a query like

SELECT * WHERE {
 ?x :isRelated ?y .
 ?y rdfs:label ?label
}

Is my understanding of your scenario correct?

tfrancart · 2019-03-07T08:11:16Z

Thanks, this is very helpful. A memory leak is possible, although I did not see any OutOfMemoryError in the logs. I will do further tests and see if I can give more memory. These errors are printed in the log file of my Tomcat server. This is not a blocking issue for us at the moment, since we may prevent our users from using the query that generated this error (still under discussion).

As a general comment: how can I monitor the possible exceptions happening during query execution?

Your understanding of the data structure is correct. "Endpoint 1" is a large graph with a lot of entities; these entities "refer to" / "are indexed on" URIs from the LOD: Geonames places or SKOS Concepts from some thesaurus. The data from these URIs is dynamically fetched and stored in Endpoint 2. So Endpoint 2 actually contains very few entities (a few dozen) compared to Endpoint 1, since many of the entities from Endpoint 1 refer to the same entities from Endpoint 2. Endpoint 2 acts as a central controlled-vocabulary store for all the other members in the federation.

So the query could be divided into 2 steps:
- search for all the criteria to select entities in Endpoint 1 (this can be a fairly complex graph pattern), and read the value of their related entities in Endpoint 2 ("?x :isRelated ?y" in your example).
- from the list of related entities in Endpoint 2, read the additional properties ("?y rdfs:label ?label" in your example; in reality this is latitude, longitude and labels).

I am pretty sure this kind of scenario could be optimised if FedX used some statistics or additional information on the content of each repository (I did try CostFed, which uses this approach, but it is too unstable and unmaintained).

In the coming months we might need some help with FedX, and possibly some optimisations based on this scenario and others, that would go beyond the support you are providing us here (and thanks again for that!). Are you offering paid/commercial support? If yes, can we get in touch directly? My email is thomas[dot]francart [at] sparna[dot]fr. If not, do you know anyone who could provide this kind of support and develop new features on FedX?

Thanks


aschwarte10 · 2019-03-08T14:49:45Z

Did you already have the chance to investigate the memory settings?

Thanks for the detailed explanation; I will also try to reproduce the scenario. Regarding your question, I will contact you directly via email.

aschwarte10 · 2019-03-21T15:35:02Z

@tfrancart just an additional update: as of #13 I added a hash join operator to FedX (note: currently only the implementation exists; it is not yet active). This operator may help in cases like this one, where a large intermediate result is the input to a join.
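To illustrate the general idea of a hash join (not the operator added in #13, whose code is not shown in this thread), here is a minimal in-memory sketch. The class HashJoinSketch, the method hashJoin, and the representation of bindings as Map<String, String> are all assumptions made for illustration. The smaller side is indexed once by the join variable, and each row of the large intermediate result is then matched with a single lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a generic hash join, not the FedX operator.
class HashJoinSketch {

    /**
     * Joins two lists of variable bindings on a single variable joinVar.
     * Simplification: other shared variables are not checked for compatibility.
     */
    static List<Map<String, String>> hashJoin(List<Map<String, String>> left,
                                              List<Map<String, String>> right,
                                              String joinVar) {
        // Build phase: index the (small) right side by its join value.
        Map<String, List<Map<String, String>>> index = new HashMap<>();
        for (Map<String, String> r : right) {
            String key = r.get(joinVar);
            if (key != null) {
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
            }
        }
        // Probe phase: one hash lookup per row of the (large) left side.
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> l : left) {
            List<Map<String, String>> matches = index.get(l.get(joinVar));
            if (matches == null) {
                continue; // no partner on the right side
            }
            for (Map<String, String> r : matches) {
                Map<String, String> merged = new HashMap<>(l);
                merged.putAll(r);
                out.add(merged);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> left = List.of(
                Map.of("x", "source1:a", "y", "source2:otherEntity"),
                Map.of("x", "source1:b", "y", "source2:otherEntity"),
                Map.of("x", "source1:c", "y", "source2:unknown"));
        List<Map<String, String>> right = List.of(
                Map.of("y", "source2:otherEntity", "label", "Other Entity"));
        hashJoin(left, right, "y").forEach(System.out::println);
    }
}
```

In the scenario discussed here, the left side would correspond to the tens of thousands of ?x/?y bindings from endpoint 1 and the right side to the few dozen labelled entities from endpoint 2.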
