Neuron unstable when accessing /sys

knebb

Hi,

I use one of the DIs of my neuron an I am counting the impulses. As I am collecting the data through snmp I installed snmpd. But as access to /sys/... is only allowed for root (and snmpd not running as root) I created a wrapper-script:

#!/bin/bash
[ -x /tmp/neuron ] || mkdir -p /tmp/neuron
for i in 1 2 3 4 ; do
	/bin/cat /sys/devices/platform/unipi_plc/io_group1/di_1_0$i/counter > /tmp/neuron/1_0$i
done
for i in 1 2 3 4 5 6 7 8 ; do
	/bin/cat /sys/devices/platform/unipi_plc/io_group2/di_2_0$i/counter > /tmp/neuron/2_0$i
done

Which gets started by crontab.

Now the issue:

When I run the script every minute, my Neuron gets completely unstable. It reboots every now and then, when I ask for uptime it is only rarely more then 10 minutes!
I do not see any reasons for a reboot in the logs, but it reboots. You can recon the reboots every time my Cacti (through snmp) does not get CPU data:

When I change the crontab entry from minutely to every five minutes it works.
From

* *	* * *	root    /root/get_di_counter.sh

to

*/5 *	* * *	root    /root/get_di_counter.sh

With the later setting my Neuron is stable. But I am a little bit uncomfortable as it should not be unstable when it is set to every minute, though.

So any idea why it is getting so unstable?

Oh, and another observation:
When set to /5 the uptime is not in terms of minutes, but the Neuron still reboots once or twice a day. So how do I get it stable?

Thanks!

TomasKnot

I'll look into it, there's possibly an undetected issue which causes the system to restart.

TomasKnot

I've been running the command in a busy loop and so far have managed to do so without restart for a while now, using the following modification (on S103, but I will test an M module next):

#!/bin/bash
j="0"
while [ $j -lt 9999999 ]
do
[ -x /tmp/neuron ] || mkdir -p /tmp/neuron
for i in 1 2 3 4 ; do
        /bin/cat /sys/devices/platform/unipi_plc/io_group1/di_1_0$i/counter > /tmp/neuron/1_0$i
done
for i in 1 2 3 4 ; do
        /bin/cat /sys/devices/platform/unipi_plc/io_group1/di_1_0$i/counter > /tmp/neuron/2_0$i
done
j=$[$j+1]
done

Do you have any other third party applications running, or are you using some other software simultaneously (Mervis, EVOK) that's running and which could potentially cause the crash?

knebb

@TomasKnot
Well, EVOK is running as I use the image you guys provided. But there has been no access to the web page for the last couple of days. And expecially not while the system was rebooting frequently.

Again, on crontab it rebooted every now and then (between 10 minutes and an hour or so.

Since the script is running now every 5 minutes the system reboots "only" twice per 12 hours.
Only additional software running is the snmpd.

Is there something I can test to narrow the root cause?

TomasKnot

Hi @kneb,

Apologies for the delay. I would be grateful if you could try to run the modified version I posted above instead. It's possible it's an issue specific to your model, due to e.g. hardware mapping in the kernel.

I'll do some more testing today and tomorrow, but so far I haven't managed to uncover the cause.

knebb

@TomasKnot
I will test this tomorrow. Today I changed my script so it is not accessing the /sys any more and instead just adds +1 every run. Set to minutely. If the system still reboots it might not be related to /sys access. Up to now it did not reboot (4hrs now stable).

When I am done with this test I will start your script- will be tomorrow.

TomasKnot

Right, well, it might be something else then. I'm quite keen to see what could be wrong though.

knebb

@TomasKnot
No reboot the last 24hrs. Looks like it is really related to accessing /sys.

I will use your modified version now and monitor at least the next 24hrs.

knebb

@TomasKnot said in Neuron unstable when accessing /sys:

Right, well, it might be something else then. I'm quite keen to see what could be wrong though.

Now I ran your script (but added a short counter to get some output).

After iteration 34300 the Neuron rebooted. Until then the script ran with approx 15-20% of CPU.

So it it definitely accessing /sys

Can I do some further steps to narrow the reason for the issue?

knebb

The next run it rebooted after 20100.

TomasKnot

I'll try it myself again, but it's odd that the crash would happen at iteration 34300/20100, in that before it apparently crashed much earlier. Are you sure the system reboots, or could it be just a network issue?

If you are connected to the device via SSH does it display any messages when it crashes? For a software sysfs crash I would expect to see a kernel panic output. In fact kernel panic should not cause a restart anyhow, though it certainly would be a bad thing. You can find more detailed output by running the "dmesg" command if a kernel panic does happen.

But as a separate thing - is it possible the crash is caused by electrical shocks/interference, i.e. are you using the device in isolation or connected to target devices? Those would cause the device to restart. I am reluctant to assume that SYSFS is the cause without a kernel panic readout, as a kernel panic does not cause a crash (a particularly bad one could cause the device to lock up, but not restart). We have not had any so far in our released images, which is the second reason why I am reluctant.

knebb

@TomasKnot
Hi,

first: yes it reboots for sure! I can not login for a minute or so and when back again the "uptime" states only 1 minute or so.

dmesg only shows me the progress of the last boot, but not what happened before.

Connected through ssh. No screen messages. No kernel panic to see. The scripts runs and suddenly does not print any output any more. Until the ssh connection appears to be broken.

In kernel.log nothing to see:

May 16 10:44:32 zentrale kernel: [    8.611983] smsc95xx 1-1.1:1.0 eth0: hardware isn't capable of remote wakeup
May 16 10:44:34 zentrale kernel: [   10.039003] smsc95xx 1-1.1:1.0 eth0: link up, 100Mbps, full-duplex, lpa 0xC1E1
May 16 10:44:39 zentrale kernel: [   15.037087] random: crng init done
May 16 11:01:34 zentrale kernel: [    0.000000] Booting Linux on physical CPU 0x0
May 16 11:01:34 zentrale kernel: [    0.000000] Linux version 4.9.41-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611) ) #1023 SMP Tue Aug 8 16:00:15 BST 2017
May 16 11:01:34 zentrale kernel: [    0.000000] CPU: ARMv7 Processor [410fd034] revision 4 (ARMv7), cr=10c5383d
May 16 11:01:34 zentrale kernel: [    0.000000] CPU: div instructions available: patching division code
May 16 11:01:34 zentrale kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache

Regarding a possible circuit/ electrical issue: Could indeed be possible. There is just one device attached (which I would like to count for the moment). But then it should happen as well when not accessing the /sys fs, shouldn't it? But when /sys is not accessed (or less frequent) the reboots do not happen at all (less frequently).

Any further ideas?

TomasKnot

@knebb

I'm running a larger testing script on 2 devices with 4 threads in parallel, so this may take a little while. Unfortunately it's difficult to know what goes wrong without the device kernel log. I mentioned the electrical side as we did have customers who had issues wth it, but usually it was more on the order of factories and such.

If you can wait a little while longer I'll see where my testing takes me, I'll post here again in a few hours.

knebb

@TomasKnot
I found some kernel related issues... see kernel.log I attached.
Unfortunately it appears I have permission issues uploading. You can download the file here.

knebb

Another test:

When I add a "sleep 1s" after every iteration (in while) the system rebooted this time after run 200 (instead of 20100 or so). So it seems to be related to some timing and not the number of accesses.

TomasKnot

@knebb

I'll have a look at it. It does look like there might be a timing/resource starvation issue somewhere, based on the kernel log as well (not a kernel panic, but scheduled thread fails to run in allotted time).

I have gone over all resource allocations again, so at the very least we can rule out a memory leak.

TomasKnot

It looks like the issue is with the invalidation thread stalling out if consecutive reads are done before it can be performed. I've switched it to use mutexes instead of spinlocks, which seems to solve the issue.

I seem to recall I have already sent you a modified binary - would you be willing to accept one again? I would send it via a private message as before.

knebb

@TomasKnot

Yes, you already send one. It is fine.

Looking forward to have a stable system soon. Luckily it is not a hardware fault.

Thanks for great support!

knebb

I ran the script and up to now it is at 43400- so far nearly 50% more than before. No crash or reboot up to now.

I will start my monitoring system and see if it will stay stable.

THANKS a lot!

TomasKnot

Apologies for the trouble, we did not encounter this particular issue before.

I hope your project goes well!