Nagios (Nagwin) fails to start after many days of normal operation

phillipscp

Hi guys,

 

I have been experiencing the same problem for a couple of months now. Nagios will run fine for a few days, or even a week or two, and then suddenly refuse to start.

The only fix so far is to back up the full etc folder, uninstall and reinstall Nagwin, then copy the etc folder back. That always fixes the problem but is not very convenient.

 

Please assist based on the information below.

 

------

When the problem occurs, a nagios.exe.stackdump file appears with the following contents:

 

Exception: STATUS_ACCESS_VIOLATION at eip=6111354A
eax=7F736F6F ebx=00000000 ecx=2F736F69 edx=74706F2F esi=010F02C8 edi=010F02B4
ebp=1991A288 esp=1991A284 program=C:\Program Files (x86)\ICW\bin\nagios.exe, pid 8300, thread unknown (0x1FC8)
cs=0023 ds=002B es=002B fs=0053 gs=002B ss=002B
Stack trace:
Frame     Function  Args
1991A288  6111354A  (74706F2F, 2F736F69, 74B011D8, 000000E0)
1991A318  61111FA4  (010F02B0, 00000007, 00000004, 0042B040)
1991CD58  0042B7FC  (00000000, 00000000, 00000000, 00000000)
1991CD98  610E7455  (0112E220, 1991CDD4, 610E73A0, 0112E220)
End of stack trace

-----

 

itefix

Can you give us some information about the Nagios version, the number of hosts and services, and the monitoring agents? Do you see any indication of trouble in the log files in /var/log?

phillipscp

Hi tk,

 

I am monitoring 69 hosts and 365 services. 29 of the hosts are running NSClient++.

The full install was done from Nagwin_1.3.0_Installer.exe, so I'm running Nagios Core 3.3.1.

 

nagios-stderr.log has a lot of entries like the following, but it has done that since day 1 of a new install:

No such signal: SIGIOT at /var/opt/pnp4nagios/libexec/process_perfdata.pl line 1183.

 
The last lines of nagios-stdout.log are:

Warning: Possibly root user failed dropping privileges with initgroups()
Failed to drop privileges.  Aborting.

 
 
Is any of that useful to you?
 
Regards,
Chris

itefix

Those messages are harmless. You can comment out line 1183 in /var/opt/pnp4nagios/libexec/process_perfdata.pl to get rid of the SIGIOT messages. Increasing the debugging level may help to pinpoint the problem (etc/nagios/nagios.cfg):

 
debug_level=-1
debug_verbosity=2
max_debug_file_size=2000000

 

When the problem occurs again, you can send a copy of the debug file var/nagios/nagios.debug.

 


phillipscp

Hi there,

After I put those settings in nagios.cfg I was able to start the Nagios service, but it effectively hung. There were no updates in the Nagios web interface. Stopping the service also took extremely long and I had to terminate nagios.exe manually.

I removed the debug settings and Nagios is running again.

Curiously, the nagios.debug file was created but it is empty (0 bytes).

Anything else to try for debugging?

Chris

itefix

OK, thanks. Just revert the values back to their defaults. I wonder if Nagios hit some kind of resource limit over time. Nagwin runs pnp4nagios in standard mode (running the performance collector Perl script each time a host/service check result is received) and there are more than 400 instances (60+ hosts and 360+ services) to process.

It may be worthwhile to run performance collection in bulk mode. Assuming that checks are made at 5-minute intervals and bulk updates run every 15 seconds, we can reduce the number of performance collection calls for services from about 72 to 4 per minute (360 calls versus 20 calls every five minutes). That may help.
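As a rough check of that estimate, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration, and it assumes one collector call per check result in standard mode and one call per processing interval for each perfdata file in bulk mode.

```python
def calls_per_minute_standard(checks: int, check_interval_s: int) -> float:
    """Standard mode: one collector call per check result (assumption)."""
    return checks * 60 / check_interval_s


def calls_per_minute_bulk(processing_interval_s: int, files: int = 1) -> float:
    """Bulk mode: one collector call per processing interval, per perfdata file."""
    return files * 60 / processing_interval_s


# Figures from the post: 360 services checked every 5 minutes (300 s),
# bulk processing every 15 seconds.
standard = calls_per_minute_standard(360, 300)
bulk_services = calls_per_minute_bulk(15)
# Counting the host perfdata file as well doubles the bulk figure.
bulk_both = calls_per_minute_bulk(15, files=2)

print(f"standard: {standard:.0f}/min, bulk (services): {bulk_services:.0f}/min, "
      f"bulk (services + hosts): {bulk_both:.0f}/min")
```

Even with both the service and the host perfdata files processed every 15 seconds, the number of script launches per minute stays in the single digits.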

Steps to activate pnp4nagios in bulk mode:

---- etc/nagios/nagios.cfg ----

Add the following lines:

service_perfdata_file=/var/opt/pnp4nagios/var/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file

host_perfdata_file=/var/opt/pnp4nagios/var/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file

 

----- etc/nagios/nagwin/commands.cfg -----

Add the following lines:

define command{
       command_name    process-service-perfdata-file
       command_line    /var/opt/pnp4nagios/libexec/process_perfdata.pl --bulk=/var/opt/pnp4nagios/var/service-perfdata
}

define command{
       command_name    process-host-perfdata-file
       command_line    /var/opt/pnp4nagios/libexec/process_perfdata.pl --bulk=/var/opt/pnp4nagios/var/host-perfdata
}
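
To sanity-check that bulk mode is writing data, it can help to look at what one line of the perfdata file contains. The sketch below is only an illustration in Python: it assumes that Nagios writes each `\t` in the template as a tab character and that every field has the form NAME::VALUE, as in the templates above. The sample line, host name, and values are hypothetical.

```python
from typing import Dict


def parse_perfdata_line(line: str) -> Dict[str, str]:
    """Split one tab-separated perfdata line into NAME -> VALUE pairs."""
    fields: Dict[str, str] = {}
    for item in line.rstrip("\r\n").split("\t"):
        name, sep, value = item.partition("::")
        if sep:  # skip anything that is not of the form NAME::VALUE
            fields[name] = value
    return fields


# Hypothetical sample line, shortened to a few of the template fields.
sample = (
    "DATATYPE::SERVICEPERFDATA\tTIMET::1338444000\tHOSTNAME::host01\t"
    "SERVICEDESC::CPU Load\tSERVICEPERFDATA::load=12%;80;90"
)

record = parse_perfdata_line(sample)
print(record["HOSTNAME"], record["SERVICEDESC"], record["SERVICEPERFDATA"])
```

If the perfdata files never grow, or their lines do not split into fields like this, the template or file path in nagios.cfg is the first thing to recheck.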