AutoProcessor - Load and start bundles in parallel #236

Open
wants to merge 1 commit into
base: master

@pskowronek pskowronek commented Sep 17, 2023

A PoC of loading and starting bundles in parallel in AutoProcessor.

This was tried in muCommander (https://github.com/mucommander/mucommander), which uses AutoProcessor to load its bundles. For muCommander it made bundle loading 2-3 times faster (from 2-3 s down to ~400-800 ms on my 2012 MacBook) - see mucommander/mucommander@ab376b9 (PR in muCommander: https://github.com/mucommacommits/ab376b984257e79a563a0fa49cc128af9c501e9f).

Could you please review and assess whether the concept could be incorporated into Felix, whether it is safe, etc.?
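
For readers who have not opened the diff, the general idea can be sketched roughly as follows. This is not the code from the PR: `Startable` is a hypothetical stand-in for `Bundle.start()`, and `ParallelStartSketch` and the pool sizing are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: a stand-in for the idea, not the PR's implementation. */
final class ParallelStartSketch {

    /** Hypothetical stand-in for org.osgi.framework.Bundle#start(). */
    interface Startable {
        void start() throws Exception;
    }

    /**
     * Submits every start call to a fixed-size pool and waits for all of them.
     * Failures are collected rather than rethrown, so one failing item does not
     * stop the others from being attempted.
     */
    static List<Throwable> startAll(List<? extends Startable> items, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Throwable> failures = new ArrayList<>();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Startable item : items) {
                // Callable form (returns a value) so checked exceptions can propagate.
                pending.add(pool.submit(() -> {
                    item.start();
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (ExecutionException e) {
                    failures.add(e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failures.add(e);
                    break;
                }
            }
        } finally {
            pool.shutdown();
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Startable> items = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            items.add(() -> Thread.sleep(50)); // simulated activator work
        }
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println("failures: " + startAll(items, threads).size());
    }
}
```

Note that nothing in this sketch orders the start calls relative to each other, which is exactly the property the rest of this thread is concerned with.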

@pskowronek
Author

Any chance for feedback on this one?

@paulrutter
Contributor

@pskowronek interesting concept.
We've experimented with parallel startup of bundles as well, but got stuck with non-reproducible edge cases in which OSGi services would not start. Some kind of deadlock scenario.

Did you encounter this while creating this PR?

@pskowronek
Author

No, I didn't encounter this. We've been using this patch in the https://github.com/mucommander/mucommander project for around 6 months, and so far no deadlocks have been reported.

@paulrutter
Contributor

OK, that's good. I might try it in our product as well and see what happens. I think we hit the issue in about 1 out of 10 startups, and I have a script that I can test it with.
I will report back with my findings if I get to it.
It would be an interesting change, as it reduces the startup time of the application.

@paulrutter
Contributor

Without having reviewed the full PR, I think it would make sense to at least make this behavior optional and thus configurable, to not break any existing applications.
If users would opt-in to parallel startup, it's at least a conscious decision.

But let me first try it out in my application to see if I run into any issues.
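
An opt-in switch could look roughly like the sketch below. The property name `example.auto.start.parallel` is invented for this example and is not an existing Felix setting; the assumption is that the launcher's configuration is available as a `Map<String, String>`.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative only: choosing sequential or parallel start from configuration. */
final class StartModeConfig {

    // Hypothetical property name, invented for this sketch; not a real Felix property.
    static final String PARALLEL_PROP = "example.auto.start.parallel";

    /** Returns 1 (sequential, the default) unless the property is set to "true". */
    static int threadsFor(Map<String, String> config, int availableCores) {
        boolean parallel = Boolean.parseBoolean(config.get(PARALLEL_PROP));
        return parallel ? Math.max(1, availableCores) : 1;
    }

    /** A pool with a single thread runs submitted tasks one by one, in submission order. */
    static ExecutorService executorFor(Map<String, String> config) {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(threadsFor(config, cores));
    }
}
```

Defaulting to a single thread keeps the current one-at-a-time behavior for anyone who does not set the property, so parallel startup stays a conscious decision.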

@pskowronek
Author

> Without having reviewed the full PR, I think it would make sense to at least make this behavior optional and thus configurable, to not break any existing applications. If users would opt-in to parallel startup, it's at least a conscious decision.

Yes, it should be configurable.

> But let me first try it out in my application to see if I run into any issues.

Superb! Thanks.

@stbischof
Contributor

stbischof commented Oct 1, 2024

Could you please run this against the OSGi TCKs?

Spec:
Defining the order in which bundles are started and stopped is useful for the following:
• Safe mode - The management agent can implement a safe mode. In this mode, only fully trusted bundles are started. Safe mode might be necessary when a bundle causes a failure at startup that disrupts normal operation and prevents correction of the problem.
• Splash screen - If the total startup time is long, it might be desirable to show a splash screen during initialization. This improves the user's perception of the boot time of the device. The startup ordering can ensure that the right bundle is started first.
• Handling erratic bundles - Problems can occur because bundles require services to be available when they are activated (this is a programming error). By controlling the start order, the management agent can prevent these problems.
• High priority bundles - Certain tasks such as metering need to run as quickly as possible and cannot have a long startup delay. These bundles can be started first.
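
One way parallel start could still honor the ordering described above is to parallelize only within a start level and to finish each level before moving to the next. The sketch below assumes that ordering only has to be preserved between levels, which would need to be checked against the spec and the TCK. `Item` is a hypothetical stand-in for a bundle with its start level.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative only: parallel within a start level, sequential across levels. */
final class StartLevelSketch {

    /** Hypothetical stand-in for a bundle: its start level and its start action. */
    static final class Item {
        final int startLevel;
        final Runnable start;

        Item(int startLevel, Runnable start) {
            this.startLevel = startLevel;
            this.start = start;
        }
    }

    static void startByLevel(List<Item> items, int threads) {
        // TreeMap iterates in ascending key order, i.e. lowest start level first.
        Map<Integer, List<Callable<Void>>> byLevel = new TreeMap<>();
        for (Item item : items) {
            Callable<Void> task = () -> {
                item.start.run();
                return null;
            };
            byLevel.computeIfAbsent(item.startLevel, k -> new ArrayList<>()).add(task);
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (List<Callable<Void>> level : byLevel.values()) {
                // invokeAll blocks until every task of this level has completed,
                // so the next level cannot begin early. Task failures are ignored
                // here; real code would inspect the returned futures.
                pool.invokeAll(level);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape, a splash-screen or high-priority bundle placed at a lower start level would still be fully started before anything at a higher level begins.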

@paulrutter
Contributor

Good point @stbischof.

@stbischof
Contributor

stbischof commented Oct 1, 2024

This all happens in the main project, i.e. the command-line launcher, so that may make it easier. But the start level chapters of the spec should still be respected. Did you read them carefully?

The bnd launcher also starts bundles in parallel. Maybe @pkriens can tell us why.

@paulrutter
Contributor

paulrutter commented Oct 4, 2024

@pskowronek I checked how we ran parallel startup earlier, and it was based on the code shared in https://felix.apache.org/documentation/subprojects/apache-felix-dependency-manager/reference/thread-model.html.

How does this PR differ from the dependency manager approach outlined above?

I assume this PR just starts the OSGi bundles in parallel, where the above starts the services in these bundles in parallel, correct?

paulrutter added a commit to blueconic/felix-dev that referenced this pull request Oct 4, 2024
@paulrutter
Contributor

paulrutter commented Oct 4, 2024

@pskowronek I tried it out and compared startup times against the "regular" build.
In the grand scheme of things, it doesn't seem to make a big difference (our application is probably a lot larger than yours; startup times of several minutes are not unusual). Loading and starting the bundles in parallel didn't seem to affect our startup times much, but for smaller applications it could make a (positive) difference.

I'm also running our script to detect startup issues and will let you know the outcome.

@paulrutter
Contributor

paulrutter commented Oct 4, 2024

I have the results; startup did fail at least once with the following exception:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.felix.dm.impl.index.ServiceRegistryCacheManager
        at org.apache.felix.dm.impl.Activator.start(Activator.java:46)
        at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:849)
        at org.apache.felix.framework.Felix.activateBundle(Felix.java:2429)
        at org.apache.felix.framework.Felix.startBundle(Felix.java:2335)
        at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1566)
        at org.apache.felix.framework.FrameworkStartLevelImpl.run(FrameworkStartLevelImpl.java:297)
        at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.NullPointerException [in thread "pool-3-thread-1"]
        at org.apache.felix.dm.impl.index.ServiceRegistryCacheManager.init(ServiceRegistryCacheManager.java:140)
        at org.apache.felix.dm.impl.index.ServiceRegistryCacheManager.<clinit>(ServiceRegistryCacheManager.java:63)
        at org.apache.felix.dm.DependencyManager.createContext(DependencyManager.java:314)
        at org.apache.felix.dm.DependencyManager.<init>(DependencyManager.java:88)
        at org.apache.felix.dm.DependencyActivatorBase.start(DependencyActivatorBase.java:77)
        at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:849)
        at org.apache.felix.framework.Felix.activateBundle(Felix.java:2429)
        at org.apache.felix.framework.Felix.startBundle(Felix.java:2335)
        at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:1006)
        at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:992)
        at org.apache.felix.main.AutoProcessor.lambda$processAutoProperties$2(AutoProcessor.java:376)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        ... 1 more
Auto-properties start: reference:file:lib/abc-status-bundle-95.0-SNAPSHOT.jar (org.osgi.framework.BundleException: Activator start error in bundle com.abc.status [15]. - java.lang.NoClassDefFoundError: Could not initialize class org.apache.felix.dm.impl.index.ServiceRegistryCacheManager)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.felix.dm.impl.index.ServiceRegistryCacheManager

Seems like there can be a race condition with the dependency manager, see the stacktrace.
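
As background for reading the trace: when a class's static initializer throws, the first thread to touch the class gets an `ExceptionInInitializerError`, and every later use of that class in the same class loader gets `NoClassDefFoundError: Could not initialize class ...`, even if the missing state has become available in the meantime. That matches the trace above, where the NPE happened in `pool-3-thread-1` and the `NoClassDefFoundError` surfaced on another thread. A minimal demonstration, with all names invented for the example:

```java
/** Illustrative only: how one failed static initializer poisons a class for good. */
final class StaticInitRaceDemo {

    /** Stands in for state that another thread has not made available yet. */
    static Object context = null;

    static final class Cache {
        // Runs once, on first use of Cache; throws an NPE while context is still null.
        static final int HASH = context.hashCode();

        static void touch() {
            // Calling this forces class initialization.
        }
    }

    /** Returns "ok" or the simple name of whatever was thrown. */
    static String tryTouch() {
        try {
            Cache.touch();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryTouch()); // prints "ExceptionInInitializerError"
        context = new Object();         // the state is available now, but too late
        System.out.println(tryTouch()); // prints "NoClassDefFoundError"
    }
}
```

So a single unlucky timing during the first initialization is enough to break every later bundle that uses the same class, which may explain why the failure looks bigger than the underlying race.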

@pskowronek
Author

> @pskowronek I tried it out and compared startup times against the "regular" build. In the grand scheme of things, it doesn't seem to make a big difference (our application is probably a lot larger than yours; startup times of several minutes are not unusual). Loading and starting the bundles in parallel didn't seem to affect our startup times much, but for smaller applications it could make a (positive) difference.
>
> I'm also running our script to detect startup issues and will let you know the outcome.

Can you tell me how many bundles the test app has, and which processor you were testing on?

@paulrutter
Contributor

paulrutter commented Oct 4, 2024

I don't have the exact count at hand, but on the order of 30-40 bundles, I think.
I used your branch to test it, see blueconic@7ac245c.

@pskowronek
Author

OK, what type of processor was the test run on? The number of (HT) cores might matter. In muCommander we have around 180 bundles, and since it is a desktop app, the CPUs it runs on have at least 4 cores (8 with HT).

@pderop
Contributor

pderop commented Oct 4, 2024

@paulrutter,

> I assume this PR just starts the OSGi bundles in parallel, where the above starts the services in these bundles in parallel, correct?

Hi Paul,

Yes, you are correct: when DM is used in concurrent mode, services are all started and registered concurrently, so concurrency is achieved at the service level.

Now, about the exception you reported:

Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.NullPointerException [in thread "pool-3-thread-1"]
        at org.apache.felix.dm.impl.index.ServiceRegistryCacheManager.init(ServiceRegistryCacheManager.java:140)
        at org.apache.felix.dm.impl.index.ServiceRegistryCacheManager.<clinit>(ServiceRegistryCacheManager.java:63)
        at org.apache.felix.dm.DependencyManager.createContext(DependencyManager.java:314)
        at org.apache.felix.dm.DependencyManager.<init>(DependencyManager.java:88)
        at org.apache.felix.dm.DependencyActivatorBase.start(DependencyActivatorBase.java:77)
        at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:849)
        at org.apache.felix.framework.Felix.activateBundle(Felix.java:2429)
        at org.apache.felix.framework.Felix.startBundle(Felix.java:2335)
        at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:1006)
        at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:992)
        at org.apache.felix.main.AutoProcessor.lambda$processAutoProperties$2(AutoProcessor.java:376)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        ... 1 more

After many attempts, I can confirm that I also reproduced the problem.
I'm not using DM in parallel mode, and I have recompiled the framework from this PR branch.
I'm using 117 bundles on a Mac M1 with 8 CPUs. The issue is hard to reproduce.

So, from the stack trace, we have an NPE at ServiceRegistryCacheManager.java:140, meaning that on the previous line bundle.getBundleContext() returned null.

I cannot comment on this for the moment.

@stbischof
Contributor

@pskowronek

Is it correct that all your bundles use only BundleActivators and no Declarative Services?

@pskowronek
Author

Yes

@paulrutter
Contributor

paulrutter commented Nov 16, 2024

@pderop

I don't know what causes this, but maybe a while loop that waits for the bundleContext to become non-null, with a Thread.sleep(100) between checks, would resolve this intermittent timing issue?
Then again, there might be more places where similar timing issues could occur.
What do you think would be the best approach? The reason I'm interested in this PR is that we experienced the exact same issue sporadically when experimenting with parallel startup via the DM (hence the script I had lying around).

Our startup times improved greatly by starting services in parallel, but the intermittent NPE caused us not to use it in production.
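
To make the polling idea concrete, here is a rough sketch of a bounded wait. It is a stop-gap that treats the symptom only and assumes the value does eventually become non-null. `BoundedWait` and `awaitNonNull` are names invented for this example.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Illustrative only: poll a supplier until it yields a non-null value or time runs out. */
final class BoundedWait {

    /**
     * Returns the first non-null value produced by source, or null if the
     * timeout elapses or the waiting thread is interrupted.
     */
    static <T> T awaitNonNull(Supplier<T> source, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            T value = source.get();
            if (value != null) {
                return value;
            }
            if (System.nanoTime() - deadline >= 0) {
                return null; // give up instead of looping forever
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }
}
```

Applied to the case above, the call might look like `BoundedWait.awaitNonNull(bundle::getBundleContext, 5000, 100)`. The timeout matters: `getBundleContext()` is documented to return null whenever the bundle has no valid context, so an unbounded loop could hang startup on a bundle that never starts.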
