Mesos Elasticsearch framework eating up all the offers. #575
This is one (of many) aspects of Mesos that I don't like. In the Offers tab in the UI, you would intuitively think that this represents the in-use offers. It doesn't. It represents the offers to frameworks that have not yet been accepted or declined, i.e. they are in limbo.

You have a concoction of infrastructure and applications, so I'm unable to give you any specific help. But I can point you in the right direction.

I can't comment on the executor start speed. That is more likely to be an infra/setup problem. But without evidence, I can't be much more help here. Keep digging.

The offer hogging should definitely not happen. The ES framework should only ever accept a certain number of offers and then start staging. During the staging, it should not accept any more offers, because as far as the framework is concerned, all tasks are now isolated. I have never seen this before, so I'm reasonably confident that it's not the fault of the framework. Again, keep digging and try to provide evidence.

And I'm sure that this workaround isn't any help here. The ES framework simply declines the offers when it receives them. It never attempts to revive offers. You will see that Mesos repeats the offers every 2 seconds or so, and ES should simply decline them each time. Check the logs.

Sorry I can't be of more help at this point.
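For context, here is a minimal sketch of the decline-on-receipt behaviour described above, assuming the standard org.apache.mesos Java API. The class and the wantOffer() helper are illustrative placeholders, not the framework's actual code:

```java
import java.util.List;

import org.apache.mesos.Protos.Offer;
import org.apache.mesos.SchedulerDriver;

// Sketch: decline every offer the framework cannot use. Mesos re-offers the
// same resources periodically, so this callback is invoked again and again
// and must decline each time.
public class DecliningSchedulerSketch {

    public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
        for (Offer offer : offers) {                 // offers in the list are processed serially
            if (!wantOffer(offer)) {                 // hypothetical acceptance check
                driver.declineOffer(offer.getId());  // hand the resources back to the master
            }
        }
    }

    private boolean wantOffer(Offer offer) {
        // Placeholder: the real framework checks cluster size, resources, ports, etc.
        return false;
    }
}
```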
Hi Phil, I agree with most of the points, but this is the scenario we faced.
The question is: why is the master offering all the resources to the Elasticsearch scheduler? As per point 3, whenever an Elasticsearch node is up and running, the decline-offer logs are seen frequently and quite a number of times. I may be wrong, but could it be that the isAtLeastOneESNodeRunning() method in OfferStrategy is taking too long to respond? I checked the Unirest documentation: the default socket timeout is 60 seconds and the connection timeout is 10 seconds. So every offer decline takes at least 10 seconds, maybe more than 10. If the node is up, then the response time is at most two seconds. What do you think?
Hi @rsingh2411, thanks for the information. So obviously there should not be unused offers in the offers tab in the UI. You may be right about the isAtLeastOneNode... idea, but I have no way of checking because it seems that the issues are caused by your setup. If you correct your setup and all the nodes are correctly routable, then I suspect these issues will disappear.
Hi Phil, one last naive question: is the resourceOffers method in ElasticsearchScheduler a serial one? I mean, Mesos sends batches of offers simultaneously, say batch1, batch2, batch3. Do all three batches execute in parallel or serially?
I'm not completely sure about this. I thought that Mesos bundled all offers from different slaves into one list of offers. So we iterate over the list of offers (serially), but I didn't think that there would be more than one group of offers.
I think I found the cause of this issue. I also encountered something similar, with a running Elasticsearch cluster receiving resource offers and declining them, yet ending up with an accepted offer. The Elasticsearch logging indicates that it is declining the offer:
but the Mesos logs show the following:
I found the following Mesos issue where a declineOffer results in an acceptOffer, which was fixed in Mesos 0.26.0: However, this elasticsearch project is still using Mesos version 0.25.0: So an upgrade to at least Mesos 0.26.0 should resolve this issue.
Hi @SlevinBE, that line you linked refers to the API used by the code, not the version that it is compatible with. AFAIK you should be able to use it on any Mesos version (greater than 0.25, possibly even below). It depends on whether they change the API or not. Give it a try and let me know. Thanks.
Oh indeed, you're right, it's only the API being used in this project and not the implementation.
I think I found it now. We use the 'mesos/elasticsearch-scheduler:1.0.1' Docker image to deploy Elasticsearch on DC/OS, which is based on the following common Dockerfile (correct me if I'm wrong): However, at the indicated line it installs Mesos 0.25.0 on the image, which still has the previously mentioned bug in it. This code will be called by the Mesos Java API via JNI, if I'm right.
Hi All,
We are using the Mesos Elasticsearch framework to bring up Elasticsearch Docker containers, with Flocker alongside Docker. We have observed that sometimes the executors come up quite quickly and everything runs fine. However, when there is network latency or an issue with OpenStack (OpenStack is used as the infrastructure), the executors take a long time to come up, or they don't come up at all and the task remains in the staging state.
The issue is that during this time all offers are hogged by the scheduler and are not processed. When I suspend the framework from Marathon, all the offers are released. Is this the intended behavior? We have other frameworks and tasks running in parallel and they are being starved. Why isn't the decline offer working?
Is it because the scheduler thread cannot process the offers until the task execution completes (driver.launchTasks is called from the resourceOffers method in ElasticsearchScheduler.java)?
Is there any workaround? I have seen that in Marathon they have implemented a filter in the declineOffer calls together with revive-offer calls (see the sketch after the reference below); is the same fix required here?
Refer d2iq-archive/marathon#1931
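For reference, a minimal sketch of that Marathon-style workaround, assuming the standard org.apache.mesos Java API; the class, helper names, and the refuse-seconds value are assumptions for illustration, not the ES framework's actual code:

```java
import org.apache.mesos.Protos.Filters;
import org.apache.mesos.Protos.OfferID;
import org.apache.mesos.SchedulerDriver;

// Sketch: decline offers with a long refuse_seconds filter so the master stops
// re-offering the same resources every few seconds, and explicitly revive
// offers once the framework needs resources again.
public class OfferFilterSketch {

    private static final double REFUSE_SECONDS = 120.0; // assumed value, tune as needed

    static void declineForAWhile(SchedulerDriver driver, OfferID offerId) {
        Filters filters = Filters.newBuilder()
                .setRefuseSeconds(REFUSE_SECONDS)
                .build();
        driver.declineOffer(offerId, filters); // master withholds these resources for a while
    }

    static void needResourcesAgain(SchedulerDriver driver) {
        // Clears any previously set filters so the master starts offering resources again.
        driver.reviveOffers();
    }
}
```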