This repository has been archived by the owner on Dec 8, 2022. It is now read-only.

Detecting Greengrass Core Disconnection at IoT Thing side #3094

Closed
Prahar08modi opened this issue Apr 9, 2021 · 10 comments

Comments

@Prahar08modi

Describe the bug
My overall system consists of a Greengrass Core (Raspberry Pi) and multiple other devices (ESP-WROOM-32) which run Amazon FreeRTOS and send sensor readings every 10s. I also want to cover the case where the power supply of the Greengrass Core is cut by mistake: the Thing code must detect this and try to reconnect to the GGC. However, the Thing code is not able to detect this case.

System information

  • Hardware : ESP-WROOM-32
  • Operating System [Linux]
  • Version of FreeRTOS : 202011.00-4-ga83a71b33
  • Project [Custom Application]

Expected behavior
According to AWS IoT Greengrass documentation, all local communication uses QoS0 (no acknowledgement is provided) as shown below in a screenshot.

Screenshot from 2021-04-09 10-51-10

I have configured keep-alive timeout as 30s.

According to my understanding, I'm sending sensor readings every 10s as QoS0 messages, so the last-message time of the MQTT connection is reset every 10s. In this case, the Thing code is not able to detect the keep-alive timeout. After around 5 minutes, however, the code detects a NETWORK ERROR and learns that the QoS0 messages are not actually being published. A screenshot of this is attached below.

Screenshot from 2021-04-09 10-45-47

I tested the other alternative with idle code which just waits for the keep-alive timeout. In this case, when nothing is getting published, code is able to detect keep-alive timeout event exactly after 30s of GGC disconnection.

The expected behavior should be to detect keep-alive timeout event after 30s while sending sensor readings every 10s.
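
The behaviour described above can be illustrated with a toy model. This is only a sketch under the assumption stated in the post, namely that the client's keep-alive timer is measured from the last outgoing packet; `KeepAliveModel` and `firstPingTime` are hypothetical names and are not part of the shim or of coreMQTT.

```c
#include <stdint.h>

/* Hypothetical model of a client-side keep-alive timer (not library code).
 * Assumption: the timer measures time since the last packet was SENT. */
typedef struct {
    uint32_t keepAliveSec;    /* configured keep-alive interval */
    uint32_t lastSendTimeSec; /* time of the last outgoing packet */
} KeepAliveModel;

/* Returns the first time (in seconds) at which a PINGREQ would become due,
 * or -1 if none becomes due up to and including endSec.
 * publishPeriodSec == 0 models an idle client that never publishes. */
int32_t firstPingTime(uint32_t keepAliveSec,
                      uint32_t publishPeriodSec,
                      uint32_t endSec)
{
    KeepAliveModel m = { keepAliveSec, 0 };

    for (uint32_t now = 1; now <= endSec; now++) {
        if (publishPeriodSec != 0 && now % publishPeriodSec == 0) {
            /* A QoS0 PUBLISH handed to the transport counts as activity,
             * even if the peer is no longer there to receive it. */
            m.lastSendTimeSec = now;
            continue;
        }
        if (now - m.lastSendTimeSec >= m.keepAliveSec) {
            return (int32_t)now; /* idle long enough: PINGREQ is due */
        }
    }
    return -1;
}
```

With a 30s keep-alive, `firstPingTime(30, 0, 300)` models the idle test and `firstPingTime(30, 10, 300)` models publishing every 10s. In the second case no PINGREQ ever becomes due, so in this model there is no PINGRESP whose absence could be noticed.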

Thank you!

@abhidixi11
Contributor

abhidixi11 commented Apr 11, 2021

Hello @Prahar08modi
Thanks for reaching out, and apologies for the slightly delayed response.
As I understand from your description, you are sending sensor data every 10 seconds and you regularly see the connection disconnect after 5 minutes.
You should not need to send a ping message since you are regularly sending PUBLISH messages.

In this case, when nothing is getting published, code is able to detect keep-alive timeout event exactly after 30s of GGC disconnection.

I'm assuming you are using process loop when connection is idle. What do you mean by GGC disconnection? If Ping is sent, GGC should not disconnect. If Ping is not sent, since your keep-alive interval is 30 seconds, GGC should wait till 45 seconds before it disconnects.
Can you please confirm above ?

I also want to cover the case where the power supply of the Greengrass Core is cut by mistake: the Thing code must detect this and try to reconnect to the GGC.

In this case, the application has to re-establish the connection. When Publish returns failure, that's the indication that the network connection has gone down. The application should then close the MQTT connection, disconnect the TCP/TLS connection (i.e. close the socket etc.), and then re-establish both the TLS and MQTT connections. The library does not have any running thread to detect that the connection has gone down. It will only detect it when either send (e.g. sending PUBLISH) or receive (processLoop) fails.
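
The teardown and reconnect order described above could be sketched as a small state machine. This is an illustration only: `ConnState`, `ConnEvent` and `connNextState` are hypothetical names, not coreMQTT or shim APIs, and the application would perform the actual MQTT and socket calls when it enters each state.

```c
/* Hypothetical connection states for the application (not library types). */
typedef enum {
    CONN_UP,                   /* publishing and running the process loop */
    CONN_TEARDOWN,             /* close MQTT session, then close the socket */
    CONN_TRANSPORT_CONNECTING, /* re-establish TCP/TLS */
    CONN_MQTT_CONNECTING       /* send MQTT CONNECT over the new transport */
} ConnState;

/* Hypothetical events reported by the application after each operation. */
typedef enum {
    EV_IO_OK,      /* publish or process loop succeeded */
    EV_IO_FAILED,  /* publish or process loop returned an error */
    EV_STEP_OK,    /* the current reconnect step succeeded */
    EV_STEP_FAILED /* the current reconnect step failed */
} ConnEvent;

/* Returns the state to enter after 'event' was observed in 'state'. */
ConnState connNextState(ConnState state, ConnEvent event)
{
    switch (state) {
    case CONN_UP:
        /* A failed send or receive is the only signal that the link is down. */
        return (event == EV_IO_FAILED) ? CONN_TEARDOWN : CONN_UP;
    case CONN_TEARDOWN:
        /* Teardown is best effort; always move on to reconnecting. */
        return CONN_TRANSPORT_CONNECTING;
    case CONN_TRANSPORT_CONNECTING:
        /* Stay here and retry until TCP/TLS comes up. */
        return (event == EV_STEP_OK) ? CONN_MQTT_CONNECTING
                                     : CONN_TRANSPORT_CONNECTING;
    case CONN_MQTT_CONNECTING:
        /* If MQTT CONNECT does not succeed, drop the socket and start over. */
        return (event == EV_STEP_OK) ? CONN_UP : CONN_TEARDOWN;
    }
    return state;
}
```

Keeping the transitions in one pure function makes the ordering (MQTT first, then socket on the way down; socket first, then MQTT on the way up) easy to check without a broker. A real application would probably add a delay or backoff between retries.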

@Prahar08modi
Author

Prahar08modi commented Apr 15, 2021

Hello @abhidixi11

Firstly I'm sorry for the late reply.

I'm assuming you are using process loop when connection is idle.

Yeah. The assumption is correct.

What do you mean by GGC disconnection?

By GGC disconnection (in this context), I mean the power supply of the GGC being cut or removed.

Can you please confirm above ?

I think there is some confusion, so let me clear it up. The GGC does not disconnect on its own; I deliberately remove the power supply of the GGC.

In this case, the application has to re-establish the connection

Yeah, I've implemented the logic of re-connection by the same flow you mentioned and it works perfectly well. But there is a problem of detecting MQTT disconnection in the case mentioned above.

It will only detect it when either send (e.g. sending PUBLISH) or receive (processLoop) fails.

Exactly, that's what is expected. It detects the processLoop failure immediately after the keep-alive timeout, but it does not immediately detect the failure due to NETWORK ERROR when a PUBLISH is sent. Instead, it detects it after around 5 minutes, which is too long for my application.

Thanks for the help!

@Prahar08modi
Author

Is there any further update on this?

@Prahar08modi
Author

Hello @abhidixi11

I found a similar issue, #2155. I think the fix provided there was for an older version of Amazon FreeRTOS, as the version I am using (202011.00-4-ga83a71b33) uses the coreMQTT library's MQTT_Publish function to send a PUBLISH packet on the network via the MQTT LTS PUBLISH API.

@abhidixi11
Contributor

Hello @Prahar08modi
Apologies for the late response; for some reason your reply was missed.
The problem you are facing is at the transport layer. Since the transport send does not fail, the publish does not fail, and therefore the connection failure is not detected. Send is not failing because the send queue is queuing the packet and it keeps retrying since the device did not receive close when you turned power off for GGC.
How frequently are you running process loop? When you run process loop are you checking the status to make sure it has succeeded? I'm trying to determine if process loop will detect receive failure.
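
Since a queued send can keep succeeding, one possible application-level workaround is to judge liveness by what is received rather than by what is sent. The sketch below is an assumption and not something the thread or the library prescribes: `Liveness`, `livenessOnReceive` and `livenessPoll` are hypothetical helpers, and the application would send the actual ping itself (for example with coreMQTT's `MQTT_Ping`, if that library is in use) when asked to.

```c
#include <stdbool.h>
#include <stdint.h>

/* What the application should do after polling the tracker. */
typedef enum { LIVENESS_OK, LIVENESS_SEND_PING, LIVENESS_DEAD } LivenessAction;

/* Hypothetical receive-side liveness tracker (not a library type). */
typedef struct {
    uint32_t lastRxMs;    /* time the last packet was received */
    uint32_t pingSentMs;  /* time the outstanding ping was requested */
    bool pingOutstanding; /* true while waiting for any reply */
} Liveness;

/* Call whenever any packet (PINGRESP, PUBACK, PUBLISH, ...) is received. */
void livenessOnReceive(Liveness *l, uint32_t nowMs)
{
    l->lastRxMs = nowMs;
    l->pingOutstanding = false;
}

/* Call periodically, regardless of how often the application publishes.
 * Unsigned subtraction keeps the comparisons valid across tick wraparound. */
LivenessAction livenessPoll(Liveness *l, uint32_t nowMs,
                            uint32_t rxIdleLimitMs, uint32_t pingTimeoutMs)
{
    if (l->pingOutstanding) {
        /* Nothing came back within the timeout: treat the link as dead. */
        return (nowMs - l->pingSentMs >= pingTimeoutMs) ? LIVENESS_DEAD
                                                        : LIVENESS_OK;
    }
    if (nowMs - l->lastRxMs >= rxIdleLimitMs) {
        /* Nothing received for too long, even if sends "succeeded". */
        l->pingOutstanding = true;
        l->pingSentMs = nowMs;
        return LIVENESS_SEND_PING;
    }
    return LIVENESS_OK;
}
```

With a 30s receive-idle limit and a 5s ping timeout, a dead peer would be flagged roughly 35s after the last received packet, independent of the 10s publish cadence. The values are illustrative and would need tuning against the broker's own keep-alive handling.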

@muneebahmed10
Contributor

Hi @Prahar08modi,

What version MQTT library are you using? From the following line in your error log:

(MQTT connection 0x3ffdd370) Failed to send PUBLISH packet on the network.

It looks like it's coming from this line in the MQTT compatibility layer with the old API. I'd recommend using the latest coreMQTT API or the recently released coreMQTT Agent library if working with multiple threads.

Second, can you clarify the issue you are facing? Is the issue:

  • After power cycling the GGC, IotMqtt_Publish() returns success for 5 minutes despite not sending anything? Or
  • IotMqtt_Publish() returns NETWORK_ERROR for 5 minutes, and then the connection is disconnected only after the keep alive job fails?

If it's the former, this is an issue at the transport layer as @abhidixi11 described. For the latter, you can just close the network connection yourself as soon as the publish returns an error. Additionally, if using the old MQTT library, you might find this response helpful in explaining why the network connection isn't closed as soon as a receive failure is detected, and instead waits for the keep alive job to fail. This could be related to the issue you are facing.

@Prahar08modi
Author

Prahar08modi commented May 2, 2021

Hello @abhidixi11 and @muneebahmed10,

Firstly, Thank you for the support.

I have a confusion regarding the MQTT v2.x.x Library and coreMQTT library. I can't understand the interlinking of these libraries.
IotMqtt_Publish uses MQTT_Publish API of CoreMQTT while for handling Keep Alive Timeout, it creates a Taskpool Job which calls _IotMqtt_ProcessKeepAlive from iot_mqtt_operation.c and not MQTT_ProcessLoop from coreMQTT library.

Send is not failing because the send queue is queuing the packet and it keeps retrying since the device did not receive close when you turned power off for GGC

Does it still queue the PUBLISH messages in the send queue even if they are QoS0?

How frequently are you running process loop? When you run process loop are you checking the status to make sure it has succeeded? I'm trying to determine if process loop will detect receive failure.

Actually, for managing Keep Alive Timeout instead of MQTT_ProcessLoop, _IotMqtt_ProcessKeepAlive is called as mentioned above.

What version MQTT library are you using?

Old MQTT Shim for MQTT V2.x.x APIs is being used.

After power cycling the GGC, IotMqtt_Publish() returns success for 5 minutes despite not sending anything?

This is the case: I am not able to detect immediately that the underlying network connection no longer exists (i.e., the GGC's power is disconnected), and I still get success results from the IotMqtt_Publish API for about 5 minutes. After 5 minutes, IotMqtt_Publish returns NETWORK_ERROR, and the connection is then disconnected when the keep-alive timeout is detected.

I'd recommend using the latest coreMQTT API or the recently released coreMQTT Agent library if working with multiple threads.

Yeah I'm using multiple threads. Maybe I can give it a try...

@abhidixi11
Contributor

abhidixi11 commented May 2, 2021

Hello @Prahar08modi ,

Quick clarification :

I have a confusion regarding the MQTT v2.x.x Library and coreMQTT library. I can't understand the interlinking of these libraries.

It looks like you are using the shim layer, which is a compatibility layer between the old API and the new API. If you are developing a new application, I would recommend directly using coreMQTT library. We are also working on adding support for the coreMQTT-Agent library in this repository, but you can check out the sample usage of the coreMQTT-Agent library, which supports connection sharing, in the demos repository. It shows how various AWS services can be used.

Any particular reason you are using shim layer ?

Thanks!

@Prahar08modi
Author

Hello @abhidixi11 ,

Any particular reason you are using shim layer ?

Actually, I had previous experience with iot_mqtt_agent and had the above confusion regarding the new coreMQTT library. So, to reduce development time, I decided to go with the MQTT shim for now and implement coreMQTT afterwards.

If you are developing a new application, I would recommend directly using coreMQTT library.

Okay. I'll try the new coreMQTT Agent library after understanding the basics of how packets are transmitted and received, how keep-alive is handled, and so on. We can close this issue. If I have any questions regarding the coreMQTT Agent library, I'll create another thread.

Thanks for guiding and helping me out...

@sukhmanm
Contributor

sukhmanm commented May 5, 2021

Hi @Prahar08modi,

Closing this issue as suggested. Please don't hesitate to open another if you run into any problems with coreMQTT, or require further clarification on this topic. Thank you.

@sukhmanm sukhmanm closed this as completed May 5, 2021