Ideation on making Pthread more scalable #4645
Comments
Your data set is too small for all 64 caches.
Hi @brada4, I understand your point, but if the dataset is so small that it requires only a few cores (for example, 8), why should we lock all the resources and leave the rest under-utilized? We could allow multiple calls to execute at the same time. Also, you meant core and not cache, right? If you did mean cache, can you please elaborate on your point?
There are some badly modelled areas: if input + temp + output fits in one cache, a single core is optimal; for anything above that it switches to all core threads, and there is an observable glitch for some size range after that, until huge data gets a linear speedup. More CPUs means a more pessimal range. Better heuristics welcome.
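(To make that heuristic concrete, a size-based cutoff of roughly the following shape is what is being described. This is a hypothetical sketch only; the cache-size constant and function name are invented for illustration and do not appear in OpenBLAS.)

```c
/* Hypothetical illustration of a size-based thread-count heuristic.
 * Neither the constant nor the function exists in OpenBLAS. */
#include <stddef.h>

#define PER_CORE_CACHE_BYTES (1u << 20)   /* assumed cache size per core */

static int pick_num_threads(size_t working_set_bytes, int max_threads)
{
    /* If input + temp + output fit in one core's cache, one thread wins. */
    if (working_set_bytes <= PER_CORE_CACHE_BYTES)
        return 1;

    /* Otherwise grow the thread count with the working set instead of
     * jumping straight to max_threads, which is where the mid-size
     * "glitch" comes from. */
    size_t threads = working_set_bytes / PER_CORE_CACHE_BYTES;
    if (threads > (size_t)max_threads)
        threads = (size_t)max_threads;
    return (int)threads;
}
```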
Hi @brada4, I agree with your point, but I have a different concern. Let's imagine this scenario: I have 10 BLAS calls, and suppose OpenBLAS decided that nthreads = 8 is enough for each of them based on their size (it does happen when
I got your idea; you are talking about a scoreboard, not a lock.
I've been trying to address this too, but at the current stage OpenBLAS simply uses a "sensible" maximum number of threads (based on what is accepted as the maximum workload for a single thread before switching to multithreading) instead of throwing all cores at any problem. I have (obviously) not addressed this level3_lock issue in my experiments so far, and I get the impression that you are more experienced with pthread locking algorithms than I am anyway. Given that most of the core level3 code has not changed since the early days, one probably needs to look out for state variables that are safe only as long as a single BLAS call is active - but those will probably show up readily enough, if they exist.
@martin-frbg |
Hello,
I'm currently working on optimizing the scalability of the OpenBLAS pthread flow. Presently, I've observed that even when a BLAS call requires only 8 threads for execution on a 64-core machine, it still locks all available resources using level3_lock in level3_thread.c. These resources are only released after the execution completes, resulting in poor CPU utilization (approximately 12.5%).

My goal is to maximize CPU resource utilization, ideally reaching close to 100%. To achieve this, I have a theoretical concept in mind and would greatly appreciate the community's suggestions and insights.
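(For reference, the serialization pattern being described boils down to something like the sketch below. This is only a simplified illustration of the behaviour, not the actual OpenBLAS source, and the wrapper name blas_level3_call is invented for this sketch.)

```c
/* Simplified illustration of the behaviour described above, not the actual
 * OpenBLAS source. A single process-wide mutex means a second level-3 call
 * cannot start until the first one has finished, even if the first call
 * keeps only 8 of the 64 cores busy. */
#include <pthread.h>

static pthread_mutex_t level3_lock = PTHREAD_MUTEX_INITIALIZER;

static void blas_level3_call(void (*kernel)(void *), void *args)
{
    pthread_mutex_lock(&level3_lock);   /* whole machine reserved here    */
    kernel(args);                       /* may only use a few cores       */
    pthread_mutex_unlock(&level3_lock); /* released only after completion */
}
```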
The Idea:
Instead of using a mutex lock in level3_thread.c, I propose employing a locking mechanism with a conditional wait. This would allow more BLAS calls to proceed until all CPUs are fully utilized. Upon completion of a BLAS operation, the corresponding CPUs can be released, signalling the waiting threads to check for resource availability again. Resource allocation and deallocation can be managed through a thread-safe mechanism.

I'm seeking feedback on the feasibility and effectiveness of this approach. Are there any potential oversights or inaccuracies in my understanding? I'm open to any insights or suggestions for further improvement.
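(One way to express the "conditional wait" idea is a small CPU scoreboard guarded by a mutex and a condition variable. The following is a minimal sketch under assumed names, cpu_pool_acquire, cpu_pool_release and free_cpus, which are not part of OpenBLAS.)

```c
/* Minimal sketch of a CPU scoreboard with a conditional wait.
 * All names here are hypothetical; this is not OpenBLAS code. */
#include <pthread.h>

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pool_cond = PTHREAD_COND_INITIALIZER;
static int free_cpus = 64;          /* total cores available to BLAS */

/* Block until 'want' CPUs are free, then claim them. */
static void cpu_pool_acquire(int want)
{
    pthread_mutex_lock(&pool_lock);
    while (free_cpus < want)
        pthread_cond_wait(&pool_cond, &pool_lock);
    free_cpus -= want;
    pthread_mutex_unlock(&pool_lock);
}

/* Return CPUs to the pool and wake callers waiting for capacity. */
static void cpu_pool_release(int count)
{
    pthread_mutex_lock(&pool_lock);
    free_cpus += count;
    pthread_cond_broadcast(&pool_cond);
    pthread_mutex_unlock(&pool_lock);
}
```

(Under this sketch, each BLAS call would request its nthreads from the pool instead of taking the global level3_lock, run on the claimed cores, and return them afterwards, so concurrent calls overlap as long as capacity remains. As noted earlier in the thread, this only helps if the level3 code paths do not share state that assumes a single active call.)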