ERROR: LXC container name not set! #1857

Open
iglov opened this issue Mar 30, 2023 · 23 comments

@iglov

iglov commented Mar 30, 2023

OS: Debian 11 (And debian 10)
Kernel: 5.10.0-15-amd64
Env: resource-agents 1:4.7.0-1~bpo10+1, pacemaker 2.0.5-2, corosync 3.1.2-2, lxc 1:4.0.6-2

Just trying to add a new resource:

lxc-start -n front-2.fr
pcs resource create front-2.fr ocf:heartbeat:lxc config=/mnt/cluster_volumes/lxc2/front-2.fr/config container=front-2.fr

After ~5 min I wanted to remove it:
pcs resource remove front-2.fr --force
I got an error and the cluster started to migrate:
Mar 29 23:28:51 cse2.fr lxc(front-2.fr)[2103391]: ERROR: LXC container name not set!

As far as I can see in /usr/lib/ocf/resource.d/heartbeat/lxc, the error is raised when the agent can't get the OCF_RESKEY_container variable.
This bug only shows up on clusters that have been running for a long time without a reboot. For example, after fencing I can add/remove LXC resources and everything will be fine for a while.
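
For context, the check in the agent is roughly the following (paraphrased, not an exact copy of the script):

# Paraphrased from LXC_validate() in /usr/lib/ocf/resource.d/heartbeat/lxc:
# the agent exits with a configuration error when the executor did not
# pass the container parameter in its environment.
if [ -z "$OCF_RESKEY_container" ]; then
    ocf_exit_reason "LXC container name not set!"
    exit $OCF_ERR_CONFIGURED
fi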

The question is: why? And how to debug it?

@oalbrigt
Contributor

This might be due to the probe-action.

You can try changing LXC_validate to ocf_is_probe || LXC_validate.
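
For reference, a minimal sketch of that change near the end of the agent script, assuming the usual layout where validation runs before the action dispatch (the exact surrounding code may differ):

# Before: validation always runs, so even a probe on a node that
# does not have the resource parameters set fails hard.
LXC_validate

# After: skip validation for probes (monitor calls with interval=0),
# which are expected to run on nodes where the resource may not be set up.
ocf_is_probe || LXC_validate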

@oalbrigt
Contributor

Seems like the agent already takes care of probe-actions, so I'll have to investigate further what might cause it.

@iglov
Author

iglov commented Mar 30, 2023

Hey @oalbrigt, thanks for the reply!

to ocf_is_probe || LXC_validate.

Yep, of course I can try, but what's the point if, as we can see, the OCF_RESKEY_container variable doesn't exist or the agent just doesn't know anything about it? So even if I try it, it won't stop the container here, for the same reason:

LXC_stop() {

@oalbrigt
Contributor

@kgaillot Do you know what might cause OCF_RESKEY_ variables not being set when doing pcs resource remove --force?

@kgaillot
Contributor

kgaillot commented Apr 3, 2023

@kgaillot Do you know what might cause OCF_RESKEY_ variables not being set when doing pcs resource remove --force?

No, that's odd. Was the command tried without --force first? It shouldn't normally be necessary, so if it was, that might point to an issue.

@iglov
Author

iglov commented Apr 3, 2023

Hey @kgaillot, thanks for the reply!
Nope, without --force the result is the same.

@kgaillot
Contributor

kgaillot commented Apr 3, 2023

@iglov @oalbrigt , can one of you try dumping the environment to a file from within the stop command? Are no OCF variables set, or is it just that one missing?

@iglov
Author

iglov commented Apr 3, 2023

Well, I can try if you tell me how to do that and if I find a cluster in the same state.

@kgaillot
Contributor

kgaillot commented Apr 3, 2023

Something like env > /run/lxc.env in the agent's stop action

@iglov
Author

iglov commented Apr 3, 2023

Oh, you mean I should place env > /run/lxc.env somewhere in /usr/lib/ocf/resource.d/heartbeat/lxc, inside LXC_stop() { ... }? But that won't work because: 1. It dies before LXC_stop(), in LXC_validate(); 2. After fencing the node will reboot and /run gets unmounted. So I think it would be better to put env > /root/lxc.env in LXC_validate().
If that's all correct, I will try it when I find a cluster with this bug.
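
For clarity, the temporary debug change being discussed would look roughly like this (the dump location is a choice made here, not something the agent prescribes):

# Near the top of LXC_validate() in /usr/lib/ocf/resource.d/heartbeat/lxc,
# as a temporary debugging line only.
# /root is used instead of /run so the dump survives a fencing reboot.
env > /root/lxc.env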

@kgaillot
Contributor

kgaillot commented Apr 3, 2023

That sounds right

@iglov
Author

iglov commented Feb 6, 2024

Hey guys! I got it. I tried to stop container nsa-1.ny with pcs resource remove nsa-1.ny --force and got some debug output:

OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=5d3831d43d924a08a3dad6f49613e661
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
PCMK_quorum_type=corosync
SHLVL=1
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:36160
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env

And this is how it should look:

OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_crm_feature_set=3.1.0
HA_LOGFACILITY=daemon
PCMK_debug=0
HA_debug=0
PWD=/var/lib/pacemaker/cores
HA_logfacility=daemon
OCF_EXIT_REASON_PREFIX=ocf-exit-reason:
OCF_RESOURCE_PROVIDER=heartbeat
PCMK_service=pacemaker-execd
PCMK_mcp=true
OCF_RA_VERSION_MAJOR=1
VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all
HA_cluster_type=corosync
INVOCATION_ID=b062591edd5142bd952b5ecc4f86b493
OCF_RESKEY_CRM_meta_interval=30000
OCF_RESOURCE_INSTANCE=nsa-1.ny
HA_quorum_type=corosync
OCF_RA_VERSION_MINOR=0
HA_mcp=true
OCF_RESKEY_config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config
PCMK_quorum_type=corosync
OCF_RESKEY_CRM_meta_name=monitor
SHLVL=1
OCF_RESKEY_container=nsa-1.ny
OCF_RESKEY_CRM_meta_on_node=mfs4.ny.local
PCMK_watchdog=false
OCF_RESKEY_CRM_meta_timeout=20000
OCF_RESOURCE_TYPE=lxc
PCMK_logfacility=daemon
LC_ALL=C
JOURNAL_STREAM=9:44603
OCF_RESKEY_CRM_meta_on_node_uuid=2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb
PCMK_cluster_type=corosync
_=/usr/bin/env

As you can see, some variables like OCF_RESKEY_container and OCF_RESKEY_config are missing.

Any ideas? ^_^

@oalbrigt
Contributor

oalbrigt commented Feb 6, 2024

That's strange. Did you create it without specifying container=<container name> and using -f to force it? What does your pcs resource config output say?

@iglov
Author

iglov commented Feb 6, 2024

Yes, it's very, VERY strange. I create resources with pcs resource create test ocf:heartbeat:lxc container=test config=/mnt/cluster_volumes/lxc1/test/config (you can see it at the top of this issue), BUT it does not matter, because as I said earlier:

This bug only shows up on clusters that have been running for a long time without a reboot. For example, after fencing I can add/remove LXC resources and everything will be fine for a while.

As you can see, almost a year passed before the bug appeared. This means I can create the resource with ANY method and it WILL work correctly until... something goes wrong.
With pcs resource config everything is good:

  Resource: nsa-1.ny (class=ocf provider=heartbeat type=lxc)
   Attributes: config=/mnt/cluster_volumes/lxc2/nsa-1.ny/config container=nsa-1.ny
   Operations: monitor interval=30s timeout=20s (nsa-1.ny-monitor-interval-30s)
               start interval=0s timeout=60s (nsa-1.ny-start-interval-0s)
               stop interval=0s timeout=60s (nsa-1.ny-stop-interval-0s)

Soo-o-o-o, I have no idea how to debug it further :(

@oalbrigt
Contributor

oalbrigt commented Feb 6, 2024

Can you add the output from rpm -qa | grep pacemaker, so I can have our Pacemaker devs see if this is a known issue?

@iglov
Author

iglov commented Feb 6, 2024

Yep, sure, but I have it on Debian:

# dpkg -l | grep pacemaker
ii  pacemaker                            2.0.1-5                      amd64        cluster resource manager
ii  pacemaker-cli-utils                  2.0.1-5                      amd64        cluster resource manager command line utilities
ii  pacemaker-common                     2.0.1-5                      all          cluster resource manager common files
ii  pacemaker-resource-agents            2.0.1-5                      all          cluster resource manager general resource agents

# dpkg -l | grep corosync
ii  corosync                             3.0.1-2+deb10u1              amd64        cluster engine daemon and utilities
ii  corosync-qdevice                     3.0.0-4+deb10u1              amd64        cluster engine quorum device daemon
ii  libcorosync-common4:amd64            3.0.1-2+deb10u1              amd64        cluster engine common library

# dpkg -l | grep resource-agents
ii  pacemaker-resource-agents            2.0.1-5                      all          cluster resource manager general resource agents
ii  resource-agents                      1:4.7.0-1~bpo10+1            amd64        Cluster Resource Agents

# dpkg -l | grep lxc
ii  liblxc1                              1:3.1.0+really3.0.3-8        amd64        Linux Containers userspace tools (library)
ii  lxc                                  1:3.1.0+really3.0.3-8        amd64        Linux Containers userspace tools
ii  lxc-templates                        3.0.4-0+deb10u1              amd64        Linux Containers userspace tools (templates)
ii  lxcfs                                3.0.3-2                      amd64        FUSE based filesystem for LXC

@kgaillot
Contributor

kgaillot commented Feb 6, 2024

@iglov That is extremely odd. If you still have the logs from when that occurred, can you open a bug at bugs.clusterlabs.org and attach the output of crm_report -S --from="YYYY-M-D H:M:S" --to="YYYY-M-D H:M:S" from each node, covering the half hour or so around when the failed stop happened?

@iglov
Author

iglov commented Feb 6, 2024

I would like to, but I can't, because there is a lot of business-sensitive information like hostnames, common logs, process lists, even DRBD passwords :(

@kgaillot
Contributor

kgaillot commented Feb 7, 2024

I would like to, but I can't, because there is a lot of business-sensitive information like hostnames, common logs, process lists, even DRBD passwords :(

It would be helpful to at least get the scheduler input that led to the problem. At the time the problem occurred, one of the nodes was the designated controller (DC). It will have a log message like "Calculated transition ... saving inputs in ...". The last message before the problem occurred is the interesting one, and the file name is the input. You can uncompress it and edit out any sensitive information, then email it to [email protected].

Alternatively you can investigate the file yourself. I'd start with checking the resource configuration and make sure the resource parameters are set correctly there. If they're not, someone or something likely modified the configuration. If they are, the next thing I'd try is crm_simulate -Sx $FILENAME -G graph.xml. The command output should show a stop scheduled on the old node and a start scheduled on the new node (if not, you probably have the wrong input). The graph.xml file should have <rsc_op> entries for the stop and start with all the parameters that will be passed to the agent.
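
Put together, the investigation might look something like this on the node that was DC at the time (the log and pengine paths are typical defaults, and the pe-input number is a hypothetical placeholder):

# Find the last transition the scheduler saved before the failed stop.
grep "saving inputs in" /var/log/pacemaker/pacemaker.log | tail -n 5

# Replay that input and save the transition graph.
crm_simulate -Sx /var/lib/pacemaker/pengine/pe-input-NNN.bz2 -G graph.xml

# The stop <rsc_op> in graph.xml should carry the resource parameters
# (config=..., container=...) inside its <attributes> element.
grep -A 3 'operation="stop"' graph.xml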

@iglov
Author

iglov commented Feb 7, 2024

Hey @kgaillot! Thanks for the explanations and your time!
Well, I have something like this there:

# 0-5 synapses about stonith

<synapse id="6">
  <action_set>
    <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs/>
</synapse>
<synapse id="7">
  <action_set>
    <rsc_op id="33" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs>
    <trigger>
      <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
    </trigger>
  </inputs>
</synapse>
<synapse id="8">
  <action_set>
    <rsc_op id="31" operation="delete" operation_key="nsa-1.ny_delete_0" on_node="mfs3.ny.local.priv" on_node_uuid="1">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs3.ny.local.priv" CRM_meta_on_node_uuid="1" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </rsc_op>
  </action_set>
  <inputs>
    <trigger>
      <rsc_op id="214" operation="stop" operation_key="nsa-1.ny_stop_0" on_node="mfs4.ny.local.priv" on_node_uuid="2"/>
    </trigger>
  </inputs>
</synapse>
<synapse id="9">
  <action_set>
    <crm_event id="26" operation="clear_failcount" operation_key="nsa-1.ny_clear_failcount_0" on_node="mfs4.ny.local.priv" on_node_uuid="2">
      <primitive id="nsa-1.ny" class="ocf" provider="heartbeat" type="lxc"/>
      <attributes CRM_meta_on_node="mfs4.ny.local.priv" CRM_meta_on_node_uuid="2" CRM_meta_op_no_wait="true" CRM_meta_timeout="20000" crm_feature_set="3.1.0"/>
    </crm_event>
  </action_set>
  <inputs/>
</synapse>

Looks good, doesn't it? I don't see anything wrong here. But if you still want, I can try to send you these pe-input files.

@kgaillot
Contributor

kgaillot commented Feb 8, 2024

No, something's wrong. The resource parameters should be listed in <attributes> after the meta-attributes (like config="/mnt/cluster_volumes/lxc2/nsa-1.ny/config" container="nsa-1.ny"). Check the corresponding pe-input to see if those are properly listed under the relevant <primitive>.
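
One quick way to check that in the saved input (the path and file number are hypothetical placeholders; use the pe-input identified in the DC's log):

# The pe-input files are bzip2-compressed XML; inspect the resource
# definition to see whether its instance attributes are still present.
bzcat /var/lib/pacemaker/pengine/pe-input-NNN.bz2 \
    | grep -A 6 '<primitive id="nsa-1.ny"'
# A healthy definition shows nvpair entries for "config" and "container"
# inside the primitive's instance_attributes block.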

@iglov
Author

iglov commented Feb 8, 2024

Yep, sorry, you're right, my bad. I tried to find resource nsa-1.ny in pe-input-250 (the last one before the fuckup) and that primitive is not there at all. But it is in pe-input-249. Poof, it just disappeared...
