Neuron unstable when accessing /sys

  • Right, well, it might be something else then. I'm quite keen to see what could be wrong though.

  • @TomasKnot
    No reboot in the last 24 hrs. It looks like it really is related to accessing /sys.

    I will use your modified version now and monitor at least the next 24hrs.

  • @TomasKnot said in Neuron unstable when accessing /sys:

    Right, well, it might be something else then. I'm quite keen to see what could be wrong though.

    Now I ran your script (but added a short counter to get some output).

    After iteration 34300 the Neuron rebooted. Until then the script ran with approx 15-20% of CPU.

    So it is definitely caused by accessing /sys.

    Can I take any further steps to narrow down the cause of the issue?
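
    For anyone wanting to reproduce this, here is a minimal sketch of such a counting loop. The original script is not shown in this thread, so the function name `stress_read` and its parameters are hypothetical, and the sysfs attribute to poll has to be supplied by the caller.

```python
import time
from pathlib import Path


def stress_read(path, iterations, delay=0.0, report_every=100):
    """Read `path` repeatedly and return the number of completed reads.

    Hypothetical helper: the path, counts and delay are illustrative only.
    """
    target = Path(path)
    done = 0
    for i in range(1, iterations + 1):
        target.read_text()  # one full open/read/close cycle per iteration
        done = i
        if i % report_every == 0:
            # Flush so the last counter value is visible if the box goes down.
            print(f"iteration {i}", flush=True)
        if delay:
            time.sleep(delay)  # optional pause between reads
    return done
```

    Calling it with the attribute being monitored and a large iteration count gives the kind of running counter described above. The last number printed before the SSH session hangs shows roughly how far the loop got.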

  • The next run it rebooted after 20100.

  • I'll try it myself again, but it's odd that the crash would happen at iteration 34300/20100, given that it apparently crashed much earlier before. Are you sure the system reboots, or could it be just a network issue?

    If you are connected to the device via SSH, does it display any messages when it crashes? For a software sysfs crash I would expect to see kernel panic output. In fact, a kernel panic should not cause a restart anyhow, though it certainly would be a bad thing. You can find more detailed output by running the "dmesg" command if a kernel panic does happen.

    But as a separate thing - is it possible the crash is caused by electrical shocks/interference, i.e. are you using the device in isolation or connected to target devices? Those would cause the device to restart. I am reluctant to assume that SYSFS is the cause without a kernel panic readout, as a kernel panic does not cause a crash (a particularly bad one could cause the device to lock up, but not restart). We have not had any so far in our released images, which is the second reason why I am reluctant.

  • @TomasKnot

    First: yes, it reboots for sure! I cannot log in for a minute or so, and when it is back again "uptime" reports only 1 minute or so.

    dmesg only shows me the progress of the last boot, but not what happened before.

    Connected through SSH. No screen messages. No kernel panic to see. The script runs and suddenly stops printing any output, until the SSH connection appears to be broken.

    In kernel.log nothing to see:

    May 16 10:44:32 zentrale kernel: [    8.611983] smsc95xx 1-1.1:1.0 eth0: hardware isn't capable of remote wakeup
    May 16 10:44:34 zentrale kernel: [   10.039003] smsc95xx 1-1.1:1.0 eth0: link up, 100Mbps, full-duplex, lpa 0xC1E1
    May 16 10:44:39 zentrale kernel: [   15.037087] random: crng init done
    May 16 11:01:34 zentrale kernel: [    0.000000] Booting Linux on physical CPU 0x0
    May 16 11:01:34 zentrale kernel: [    0.000000] Linux version 4.9.41-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611) ) #1023 SMP Tue Aug 8 16:00:15 BST 2017
    May 16 11:01:34 zentrale kernel: [    0.000000] CPU: ARMv7 Processor [410fd034] revision 4 (ARMv7), cr=10c5383d
    May 16 11:01:34 zentrale kernel: [    0.000000] CPU: div instructions available: patching division code
    May 16 11:01:34 zentrale kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache

    Regarding a possible circuit/electrical issue: that could indeed be possible. There is just one device attached (which I would like to count for the moment). But then it should happen as well when not accessing the /sys fs, shouldn't it? Yet when /sys is not accessed (or accessed less frequently), the reboots do not happen at all (or happen less frequently).

    Any further ideas?
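
    For reference, reboot times in a log like the excerpt above can be located by searching for the first line of each boot. This is only a sketch that assumes the syslog line format shown in the excerpt. `find_boots` is a made-up helper name, not an existing tool.

```python
import re

# A kernel timestamp of [    0.000000] followed by "Booting Linux" is the
# first line a fresh boot writes in the excerpt above.
BOOT_MARKER = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s.*\[\s*0\.000000\] Booting Linux"
)


def find_boots(log_lines):
    """Return (line_index, syslog_timestamp) for every boot marker found."""
    boots = []
    for index, line in enumerate(log_lines):
        match = BOOT_MARKER.match(line.strip())
        if match:
            boots.append((index, match.group("stamp")))
    return boots
```

    The lines just before each returned index are the last entries written before the reboot, which is where a panic or warning would show up if one had been logged.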

  • @knebb

    I'm running a larger testing script on 2 devices with 4 threads in parallel, so this may take a little while. Unfortunately it's difficult to know what goes wrong without the device kernel log. I mentioned the electrical side as we did have customers who had issues with it, but usually it was more on the order of factories and such.

    If you can wait a little while longer I'll see where my testing takes me. I'll post here again in a few hours.

  • @TomasKnot
    I found some kernel-related issues... see the kernel.log I attached.
    Unfortunately it appears I have permission issues uploading. You can download the file here.

  • Another test:

    When I added a "sleep 1s" after every iteration (in the while loop), the system rebooted after run 200 this time (instead of 20100 or so). So it seems to be related to some timing and not the number of accesses.

  • @knebb

    I'll have a look at it. It does look like there might be a timing/resource starvation issue somewhere, based on the kernel log as well (not a kernel panic, but a scheduled thread failing to run in its allotted time).

    I have gone over all resource allocations again, so at the very least we can rule out a memory leak.

  • It looks like the issue is with the invalidation thread stalling out if consecutive reads are done before it can be performed. I've switched it to use mutexes instead of spinlocks, which seems to solve the issue.

    I seem to recall I have already sent you a modified binary - would you be willing to accept one again? I would send it via a private message as before.
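
    The driver change itself is not shown in this thread. As a rough userspace analogy only, the sketch below shows readers and an invalidation job sharing one blocking lock (a mutex), so a waiting thread sleeps until the lock is free rather than spinning. `CachedRegister` and its methods are hypothetical names and do not come from the Neuron driver.

```python
import threading


class CachedRegister:
    """A cached value that readers consume and a background job invalidates.

    Userspace illustration only; this is not kernel code.
    """

    def __init__(self, fetch):
        self._fetch = fetch            # callable that produces a fresh value
        self._lock = threading.Lock()  # blocking lock: waiters sleep, not spin
        self._value = None
        self._valid = False

    def read(self):
        """Return the cached value, refreshing it first if it was invalidated."""
        with self._lock:
            if not self._valid:
                self._value = self._fetch()
                self._valid = True
            return self._value

    def invalidate(self):
        """Mark the cache stale; the next read() fetches a fresh value."""
        with self._lock:
            self._valid = False
```

    Because both paths take the same lock, the invalidation job always gets its turn between reads, however quickly the reads arrive.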

  • @TomasKnot

    Yes, you already sent one. It is fine.

    Looking forward to having a stable system soon. Luckily it is not a hardware fault.

    Thanks for the great support!

  • I ran the script and it is now at 43400, so far nearly 50% more than before. No crash or reboot up to now.

    I will start my monitoring system and see if it will stay stable.

    THANKS a lot!

  • Apologies for the trouble, we did not encounter this particular issue before.

    I hope your project goes well!

  • @TomasKnot

    Thanks again! It is currently set to poll once a minute and uptime is at 15 hrs.

    Looks like it is really stable now.

    Thanks again for the great support!

  • @knebb
    If you need faster response times on the SYSFS I can make that change specifically for you, but the limiting factor will be snmp anyhow. Currently SYSFS is set to refresh at a rate of 50Hz. Rates up to 1000Hz are possible in theory, at a cost of higher CPU use.
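
    To put those rates in perspective, the arithmetic can be sketched as below. Both function names are made up for illustration, and the 60-second polling interval is taken from the once-a-minute setting mentioned earlier in the thread.

```python
def refresh_period_ms(rate_hz):
    """Time between two SYSFS refreshes, in milliseconds."""
    return 1000.0 / rate_hz


def refreshes_per_poll(rate_hz, poll_interval_s):
    """How many refreshes happen between two polls of the monitoring system."""
    return rate_hz * poll_interval_s


print(refresh_period_ms(50))       # 20.0 ms between refreshes at 50 Hz
print(refresh_period_ms(1000))     # 1.0 ms between refreshes at 1000 Hz
print(refreshes_per_poll(50, 60))  # 3000 refreshes per one-minute poll
```

    At 50 Hz a value is refreshed roughly every 20 ms, so a poller that runs once a minute already sees data far fresher than it can make use of.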

  • Ah, well. No, I am absolutely fine with this.

    My Cacti monitors the system every 5 minutes, so there is no need for anything faster. I am fine with a minute.

    Thanks again!

    Oh, and it is working stably. It has now been running for nearly 2 days without a reboot.