UPDATED: NodeMCU MQTT client doesn't detect broker outage and overflows message queue

Update 2017-04-07

I did more research around this problem and found following interesting facts:

  • I tried to simulate mosquitto outage using iptables on the same machine running mosquitto
  • used rule: iptables -A INPUT -s 192.168.1.53/32 -j DROP, (mentioned IP is the IP of NodeMCU running the testing script) it was the second rule in the input chain, the first was allowing communication from loopback
  • if I applied this rule, I found that both NodeMCU FWs (2.0.0.0 master, 1.5.4.1 final) will detect broker outage, delay depends on "keepalive" parametter of the mqtt client (see mqtt module documentation) and is approximately equal to this delay (I did not measured it, just estimation) - tested with keepalive = 60 and 5
  • as the outage is detected, message "MQTT: connection failed with reason -5" is shown, so fallback function for failed connection specified in the "m:connect" function is fired, publishing stopped
  • with NodeMCU 1.5.4.1 final also "m:on("offline",..)" function is fired and "MQTT: offline" message shown just before the previous one
  • if I remove the firewall rule, both reconnects and start to publish messages again
  • in case that the mosquitto broker is stopped (sudo service mosquitto stop) instead of using the firewall to block the traffic, following happend
  • NodeMCU 1.5.4.1 immediatelly showed "MQTT: Offline" message and stopped publishing - expected behaviour
  • NodeMCU 2.0.0.0 continued publishing for about 2minutes (most probably for 2mins 20seconds as mentioned in my original article bellow), heap decreasing, then "MQTT: connection failed with reason -5" message shown, then publishing stopped, "MQTT: offline" message never shown
  • when mosquitto started again, both almost instantly reconnected and started publishing again
  • also noticed that 1.5.4.1 is using less memory than 2.0.0.0 when running the same test script and the memory is freed almost immediatelly as mosquitto outage is detected

So it seems that I will stuck with 1.5.4.1 for my ESP mqtt stuff because the mqtt client seems to mork correcly in cases of broker outages using this firmware, because with 2.0.0.0 the ESP is crashing and rebooting in case of broker connection problems, because of memory overflow.

With examples mentioned in this article ESP never crash even using 2.0.0.0 FW - thats just because the testing .lua script is short enough and there is a lot of memory available to cover that 2minutes and 20seconds of "accumulating not published messages in memory". But if you use more complex .lua script as I did to control the boiler, there is not enough memory left to cover these 2minutes 20seconds and as there is no more memory left the node crash and reboot.

Here is a little bit modified original testscipt I used:

--WIFI_SSID = "SSID"
--WIFI_PASS = "PWD"
MQTT_BrokerIP = "addres.of.broker"
MQTT_BrokerPort = 8883
MQTT_ClientID = "bddsfdsf"
MQTT_Client_user = "test"
MQTT_Client_password = "test"

--wifi.setmode(wifi.STATION)
--wifi.sta.config(WIFI_SSID, WIFI_PASS)
wifiStatusPrev = wifi.STA_IDLE
print("WiFi: Connecting...")
wifi.sta.connect()

--wifi reconnect logic timer
tmr.alarm(0, 500, 1, function()
    wifiStatus = wifi.sta.status()
    if wifiStatus ~= wifiStatusPrev then
        print("WiFi: status change "..wifiStatusPrev.."->"..wifiStatus)
    end
    if (wifiStatusPrev ~= 5 and wifiStatus == 5) then 
        print("WiFi: connected")
        print(wifi.sta.getip())
    elseif (wifiStatus ~= 1 and wifiStatus ~= 5) then
      print("WiFi: Reconnecting...")
      wifi.sta.connect()
    end
    wifiStatusPrev = wifiStatus
end)

m = mqtt.Client(MQTT_ClientID, 5, MQTT_Client_user, MQTT_Client_password)

m:on("message", function(client, topic, data) 
    print("MQTT: "..topic .. ":" ) 
    if data ~= nil then
        print(data)
    end
end)

m:on("offline", function(client)
    publishMqtt:stop()
    print("MQTT: offline")
end)

--MQTT reconnect logic timer
reconnMqtt = tmr.create()

reconnMqtt:register(10, tmr.ALARM_SEMI, function (t)
  reconnMqtt:interval(500);
  print("MQTT: trying to connect to "..MQTT_BrokerIP..":"..MQTT_BrokerPort)
  m:close()
  m:connect(MQTT_BrokerIP, MQTT_BrokerPort, 1, 1, function(client) 
    print("MQTT: connected")
    publishMqtt:start()
    print("PUB: started")
 end, function(client, reason) 
    publishMqtt:stop()
    print("MQTT: connection failed with reason "..reason)
--    reconnMqtt:start()
 end)
end)

--MQTT local timestamp publishing timer
publishMqtt = tmr.create()
publishMqtt:register(500, tmr.ALARM_AUTO, function (t)
  print("Heap: " .. node.heap())
  if not m:publish("ESPtest002/timestamp", tmr.time().."."..((tmr.now()/1000)%1000), 0, 0)
  then
    print("PUB: timestamp publish failed")
  end
end)

reconnMqtt:start()

Notice I commented out "reconnMqtt:start()" under "print("MQTT: connection failed with reason "..reason)", so the timer is NOT started again in case of the connect fail function os fired. In the fact ot is started just once as you launch the script. With 1.5.4.1 and autoreconnect turned on in "m:connect(MQTT_BrokerIP, MQTT_BrokerPort, 1, 1, function(client)" (the second "1" after port) it works like a charm in both cases - broker stopped / cutted out by firewall.

What I want to try next is how it will behave in case that wifi signal is lost and restored again.

Original testcase

Used following code to test this bug (lets call it test.lua):

--WIFI_SSID = "SSID"
--WIFI_PASS = "PWD"
MQTT_BrokerIP = "server.address"
MQTT_BrokerPort = 8883
MQTT_ClientID = "clID"
MQTT_Client_user = "user"
MQTT_Client_password = "pwd"

--wifi.setmode(wifi.STATION)
--wifi.sta.config(WIFI_SSID, WIFI_PASS)
wifiStatusPrev = wifi.STA_IDLE
print("WiFi: Connecting...")
wifi.sta.connect()

--wifi reconnect logic timer
tmr.alarm(0, 500, 1, function()
    wifiStatus = wifi.sta.status()
    if wifiStatus ~= wifiStatusPrev then
        print("WiFi: status change "..wifiStatusPrev.."->"..wifiStatus)
    end
    if (wifiStatusPrev ~= 5 and wifiStatus == 5) then 
        print("WiFi: connected")
        print(wifi.sta.getip())
    elseif (wifiStatus ~= 1 and wifiStatus ~= 5) then
      print("WiFi: Reconnecting...")
      wifi.sta.connect()
    end
    wifiStatusPrev = wifiStatus
end)

m = mqtt.Client(MQTT_ClientID, 10, MQTT_Client_user, MQTT_Client_password)

m:on("message", function(client, topic, data) 
    print("MQTT: "..topic .. ":" ) 
    if data ~= nil then
        print(data)
    end
end)

m:on("offline", function(client)
    publishMqtt:stop()
    print("MQTT: offline")
end)

--MQTT reconnect logic timer
reconnMqtt = tmr.create()

reconnMqtt:register(10, tmr.ALARM_SEMI, function (t)
  reconnMqtt:interval(500);
  print("MQTT: trying to connect to "..MQTT_BrokerIP..":"..MQTT_BrokerPort)
  m:close()
  m:connect(MQTT_BrokerIP, MQTT_BrokerPort, 1, 1, function(client) 
    print("MQTT: connected")
    publishMqtt:start()
    print("PUB: started")
 end, function(client, reason) 
    publishMqtt:stop()
    print("MQTT: connection failed with reason "..reason)
    reconnMqtt:start()
 end)
end)

--MQTT local timestamp publishing timer
publishMqtt = tmr.create()
publishMqtt:register(500, tmr.ALARM_AUTO, function (t)
  print("Heap: " .. node.heap())
  if not m:publish("ESPtest002/timestamp", tmr.time().."."..((tmr.now()/1000)%1000), 0, 0)
  then
    print("PUB: timestamp publish failed")
  end
end)

reconnMqtt:start()

I have separate cfg.lua which configures WiFi SSID, password, tls CA

-- wifi
wifi.setmode(wifi.STATION)

-- band
wifi.setphymode(wifi.PHYMODE_G)

-- wifi network ssid / pwd + save
station_cfg={}
station_cfg.ssid="your-ssid"
station_cfg.pwd="your-wifi-password"
station_cfg.save=true
station_cfg.auto=true
wifi.sta.config(station_cfg)

-- disable autoconnect
wifi.sta.autoconnect(0)

net.cert.verify([[-----BEGIN CERTIFICATE-----
here
is
the
place
for
certificate
data
-----END CERTIFICATE-----
]])

-- restart to apply
node.restart()

Test procedure

  • power on nodeMCU and load testing scripts
  • launch cfg.lua
  • launch test.lua, it starts to send data to broker and showing heap and debug info in console
  • stop broker, message "MQTT: offline" shall be shown and it should stop publish messages, heap should not be decresing

Results

Test01

FW used:

NodeMCU custom build by frightanic.com
    branch: master
    commit: b96e31477ca1e207aa1c0cdc334539b1f7d3a7f0
    SSL: true
    modules: adc,bit,cron,enduser_setup,file,gpio,i2c,mqtt,net,node,pcm,pwm,rtctime,sntp,tmr,uart,wifi,tls
 build  built on: 2017-03-07 18:05
 powered by Lua 5.1.4 on SDK 2.0.0(656edbf)

Result

  • while running with broker online heap settled to 22936
  • broker stopped
  • heap 35424 but decreasing about 50 with every publish, no message about mqtt offline
  • after about 2minutes 20seconds heap at 17752, message "MQTT: connection failed with reason -5" and "MQTT: trying to connect to mqtt.dread.cz:8883", publish stopped, node not crashed
  • broker was started
  • almost immediatelly "MQTT: connected", "PUB: started", heap back to 22944 and publishing

Test02

FW used:

NodeMCU custom build by frightanic.com
    branch: dev
    commit: b09cac058a7b5fae0be85b79bfe8de1a777e2b32
    SSL: true
    modules: adc,bit,cron,enduser_setup,file,gpio,i2c,mqtt,net,node,pcm,pwm,rtctime,sntp,tmr,uart,wifi,tls
 build  built on: 2017-03-23 11:53
 powered by Lua 5.1.4 on SDK 2.0.0(656edbf)

Result are the same as in Test01, heap values differ slightly, but not significatly (22920, 35568, 17776)

Test 03

FW used:

NodeMCU custom build by frightanic.com
    branch: 1.5.4.1-final
    commit: 1885a30bd99aec338479aaed77c992dfd97fa8e2
    SSL: true
    modules: file,gpio,mqtt,net,node,tmr,uart,wifi,tls
 build  built on: 2017-03-23 12:29
 powered by Lua 5.1.4 on SDK 1.5.4.1(39cb9a32)

Result = expcted behaviour

  • while running with broker online heap settled to 29184
  • broker stopped
  • message "MQTT: offline" immediatelly appeared in terminal, publishig stopped
  • broker started
  • measseges "MQTT: connected" and "PUB: started" immediatelly appeared in terminal, publishig started again

Test 04

In test 03 I used less modules than in previous tests by mistake, lets try the SDK 2.0.0 DEV FW once again with same modules as in test 03 (=only with mobules necessary for the test) to make sure that the bug is not caused by other modules

FW used:

NodeMCU custom build by frightanic.com
    branch: dev
    commit: b09cac058a7b5fae0be85b79bfe8de1a777e2b32
    SSL: true
    modules: file,gpio,mqtt,net,node,tmr,uart,wifi,tls
 build  built on: 2017-03-23 13:26
 powered by Lua 5.1.4 on SDK 2.0.0(656edbf)

Result: same as in test 01 / test 02 - not causet by other modules.

Conclusion

While using FWs based on SDK 1.5.4.1, function

m:on("offline", function(client)
    publishMqtt:stop()
    print("MQTT: offline")
end)

works correctly and you are able to immediatelly detect that the connection to the broker was lost.

While using FWs based on SDE 2.0.0 the mentioned fuction is not triggered, after 2 minutes and 20seconds (why such time???) "callback function for when the connection could not be established" defined in mqtt.client:connect is triggered with reason -5 (There is no broker listening at the specified IP Address and Port)

Also it could be possible to detect broker failure while publishing messages - there is possible to define callback function when calling mqtt.client:publish() and this function is triggered on PUBACK (ack of the message) from the server. Then we can check that every published message is ACKed by broker - if not we know the connection or the broker is down, we can stop publishing messages (to prevent heap overflow) and try to reconnect - will try this later for sure (don't want to use old 1.5.4.1 FW just because of such bug).

Links

Like this? - buy me a BEER using Bitcoin: 18zHcqpTydXAYuaYPh17H9nr3MybcqZJT1 - THANK YOU