pnp4nagios (Service performance Data timeout)

regNov
Offline
Last seen: 8 years 11 months ago
Joined: 05.05.2011 - 10:32

Hi,

For a few days now I have been getting this error message in my log:

Warning: Service performance data command '/usr/bin/perl /var/opt/pnp4nagios/libexec/process_perfdata.pl' for service 'Disk C' on host '******' timed out after 5 seconds

It appears for most of my monitored machines, but only the "check_pdm" checks for Disk C and D trigger it.

I had a look at process_perfdata.pl, but everything seems fine to me.



my %conf = (
    TIMEOUT            => 15,
    CFG_DIR            => "/var/opt/pnp4nagios/etc/",
    USE_RRDs           => 1,
    RRDPATH            => "/var/opt/pnp4nagios/var/perfdata",
    RRDTOOL            => "/usr/bin/rrdtool",
    RRD_STORAGE_TYPE   => "SINGLE",
    RRD_HEARTBEAT      => 8640,
    RRA_STEP           => 60,
    RRA_CFG            => "/var/opt/pnp4nagios/etc/rra.cfg",
    STATS_DIR          => "/var/opt/pnp4nagios/var/stats",
    LOG_FILE           => "/var/opt/pnp4nagios/var/perfdata.log",
    LOG_FILE_MAX_SIZE  => "10485760",               #Truncate after 10MB
    LOG_LEVEL          => 0,
    XML_ENC            => "UTF-8",
    XML_UPDATE_DELAY   => 0,                        # Write XML only if file is older than XML_UPDATE_DELAY seconds
    RRD_DAEMON_OPTS    => "",
    GEARMAN_HOST       => "localhost:4730",         # Gearman server to connect to (host:port)
    PREFORK            => 2,                        # How many gearman worker childs to start
    REQUESTS_PER_CHILD => 20000,                   # Restart after a given count of requests
    ENCRYPTION         => 1,                       # Decrypt mod_gearman packets
    KEY                => 'should_be_changed',
    KEY_FILE           => '/var/opt/pnp4nagios/etc/secret.key',
    UOM2TYPE           => { 'c' => 'DERIVE', 'd' => 'DERIVE' },
);

my %const = (
    XML_STRUCTURE_VERSION => "4",
    VERSION               => "0.6.13",
);

#
# Dont change anything below these lines ...
#


itefix
Offline
Last seen: 6 hours 31 min ago
Joined: 01.05.2008 - 21:33

Can you confirm that the performance data command for services in etc\nagios\nagwin\commands.cfg is defined as follows?

define command {
       command_name    process-service-perfdata
       command_line    /usr/bin/perl /var/opt/pnp4nagios/libexec/process_perfdata.pl
}

How many hosts do you have in your setup? Could this be related to a performance bottleneck?

regNov

Hi,

Yes, it is the same. I have 20 hosts in my setup, and 3 of them produce this error every minute. A fourth one occasionally comes up with:

Warning: Host performance data command '/usr/bin/perl /var/opt/pnp4nagios/libexec/process_perfdata.pl -d HOSTPERFDATA' for host '*******' timed out after 5 seconds

Hmm, I don't think it's a performance problem, because the log message appears even when memory usage on those clients is low.

Is it possible to simply switch off PNP? Everything else works fine, including the disk checks on those clients.

itefix

You can switch off processing of performance data by setting process_performance_data to 0 in nagios.cfg (the excerpt below shows it enabled):

# PROCESS PERFORMANCE DATA OPTION
# This determines whether or not Nagios will process performance
# data returned from service and host checks.  If this option is
# enabled, host performance data will be processed using the
# host_perfdata_command (defined below) and service performance
# data will be processed using the service_perfdata_command (also
# defined below).  Read the HTML docs for more information on
# performance data.
# Values: 1 = process performance data, 0 = do not process performance data

process_performance_data=1
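
If switching performance data off globally is too coarse, Nagios object definitions also accept a per-host or per-service process_perf_data directive. A minimal sketch, assuming the affected service is the "Disk C" check from the log message; the host name, the generic-service template and the check command are placeholders, not values from this setup:

```
define service {
    use                  generic-service       ; placeholder template name
    host_name            example-host          ; placeholder host name
    service_description  Disk C
    check_command        check_nrpe!check_pdm  ; placeholder command and argument
    process_perf_data    0                     ; skip perfdata for this service only
}
```

The check itself keeps running; only its performance data is no longer handed to the perfdata command, so there will be no graph for that service.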

regNov

Hey again,

Disabling performance data solved my problem, and my event log was free of that warning. But it now turns out that I need this data for the PNP graphs, I guess. I'm not sure about that, because Nagwin was still able to show PNP graphs with some data.

Nevertheless, I re-enabled performance data and my event log exploded again :( What can I do? I want to use pnp4nagios...

itefix

Running Pnp4nagios in bulk mode can help. See a related thread for more information.
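
For readers without access to the linked thread, bulk mode as described in the PNP4Nagios documentation looks roughly like the sketch below. The paths assume the /var/opt/pnp4nagios layout used in this thread, and the 15 second processing interval is only an example value. Note that the per-check service_perfdata_command (synchronous mode) is commented out here; if it stays defined, Nagios will presumably keep calling process_perfdata.pl once per check in addition to the bulk run.

```
# nagios.cfg
process_performance_data=1

# Synchronous mode off: no per-check perfdata command
#service_perfdata_command=process-service-perfdata

service_perfdata_file=/var/opt/pnp4nagios/var/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file

# commands.cfg
define command {
       command_name    process-service-perfdata-file
       command_line    /usr/bin/perl /var/opt/pnp4nagios/libexec/process_perfdata.pl --bulk=/var/opt/pnp4nagios/var/service-perfdata
}
```

Host performance data can be set up the same way with the corresponding host_perfdata_file options.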

regNov

Hey,

It's been a while... sorry for the late reply.

I added those lines to nagios.cfg and commands.cfg to run pnp4nagios in bulk mode, but it didn't affect the error log messages. They still occur every minute and flood the error log.

itefix

You mentioned that only three of 20 hosts are affected by this problem. Try to find out what makes those three different from the others. Apparently, perf-logging for those hosts (maybe for some specific services?) takes too long.

regNov

Well, there are more... Today's latest log file lists 13 of 24 servers, and the error messages come in every minute:

[03-12-2012 11:29:19] Warning: Service performance data command '/usr/bin/perl /var/opt/pnp4nagios/libexec/process_perfdata.pl' for service 'Disk C' on host 'MBK-DEV' timed out after 15 seconds


[03-12-2012 11:29:00] Warning: Service performance data file processing command '/var/opt/pnp4nagios/libexec/process_perfdata.pl --bulk=/var/opt/pnp4nagios/var/service-perfdata' timed out after 15 seconds


There are some specific services in nrpe.cfg, but those don't generate the error messages. Only the standard services like "Disk C" or "Memory physical" are getting timeouts. I am going to adjust the timeout from 15s to 45s. What else can I do?

regNov

It seems I can't simply change the timeout from 15s to 45s. If I do, Nagwin won't start again, and if I set it to 30s, nothing changes and the error messages still say "timed out after 15 seconds"...

itefix

Which plugin do you use for the CPU and disk space checks? Try deactivating the suspected checks to see if the problem is related to some specific plugins. Where did you change the timeout value?

regNov

I am using nrpe 4.0 and the check_nrpe plugin to run the checks on my hosts. I changed the timeout value in C:\Programme\ICW\var\opt\pnp4nagios\libexec\process_perfdata.pl:

my %conf = (
    TIMEOUT            => 45,
    CFG_DIR            => "/var/opt/pnp4nagios/etc/",
    USE_RRDs           => 1,
    RRDPATH            => "/var/opt/pnp4nagios/var/perfdata",
    RRDTOOL            => "/usr/bin/rrdtool",
    RRD_STORAGE_TYPE   => "SINGLE",
    RRD_HEARTBEAT      => 8640,
    RRA_STEP           => 60,

I'm going to deactivate those checks for a while now.

I also found some new error messages. Maybe they are related to the pnp4nagios problem?

[03-12-2012 13:50:49] Warning: The check of host 'EKZ_IMOS' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...

[03-12-2012 13:50:49] Warning: The check of host 'EKZWEBAT' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...

[03-12-2012 13:50:49] Warning: The check of host 'EKZWEB' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...

[03-12-2012 13:50:49] Warning: The check of host 'EKZUPD' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...

itefix

This indicates that host checks are failing. Check that the command used for host checks (check_winping?) operates correctly. You can download the latest version, which includes ping binaries, to avoid problems on localized Windows versions.

regNov

OK, I updated the check_winping plugin. The orphaned-check error doesn't seem to come up any more, but I'm still trying to stop the error messages from pnp4nagios.

Is it possible to raise the timeout value to more than 15 seconds? And where else can I change this value, apart from process_perfdata.pl? Can I simply switch off the error messages and keep pnp4nagios running?

itefix

You can comment out all alarm() calls in the pnp4nagios performance collector script (process_perfdata.pl). I suspect they don't work as intended in your case. The side effect is that the script has no timeout at all.
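
Independently of the script's own alarm(), the "timed out after N seconds" warning is written by Nagios itself, which stops waiting for perfdata commands after the number of seconds set by perfdata_timeout in nagios.cfg. That would explain why editing TIMEOUT in process_perfdata.pl does not change the number in the log line. A sketch, assuming the limit should be raised; 30 is an arbitrary example value, not a recommendation:

```
# nagios.cfg
# Maximum run time in seconds for host/service perfdata commands
# and for perfdata file processing commands.
perfdata_timeout=30
```

Nagios has to be restarted for the change to take effect. A higher limit only hides the warning; if the perfdata script really needs that long, the underlying slowness is still there.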

regNov

Alright, I followed your advice and commented out the following lines in process_perfdata.pl. The warnings no longer come up in the event log.

Thanks for your help!

#
# Subs
#
# Main function to switch to the right mode.
sub main {
    my $job = shift;
    my $t0 = [gettimeofday];
    my $t1;
    my $rt;
    my $lines = 0;
    # Gearman Worker
    if (defined $opt_gm) {
        print_log( "Gearman Worker Job start", 1 );
        %NAGIOS = parse_env($job->arg);
        $lines = process_perfdata();
        $t1 = [gettimeofday];
        $rt = tv_interval $t0, $t1;
        $stats{runtime} += $rt;
        $stats{rows}++;
        if( ( int $stats{timet} / 60 ) < ( int time / 60 )){
            store_internals();
            init_stats();
        }
        print_log( "Gearman job end (runtime ${rt}s) ...", 1 );
        return 1;
    #} elsif ( $opt_b && !$opt_n ) {
    #    # Bulk mode
    #    alarm($opt_t);
    #    print_log( "process_perfdata.pl-$const{VERSION} starting in BULK Mode called by Nagios", 1 );
    #    $lines = process_perfdata_file();
    #} elsif ( $opt_b && $opt_n ) {
    #    # Bulk mode with npcd
    #    alarm($opt_t);
    #    print_log( "process_perfdata.pl-$const{VERSION} starting in BULK Mode called by NPCD", 1 );
    #    $lines = process_perfdata_file();
    #} else {
    #   # Synchronous mode
    #    $opt_t = 5 if $opt_t > 5; # maximum timeout
    #    alarm($opt_t);
    #    print_log( "process_perfdata.pl-$const{VERSION} starting in SYNC Mode", 1 );
    #    %NAGIOS = parse_env();
    #    $lines = process_perfdata();
    #}
    $rt = tv_interval $t0, $t1;
    $stats{runtime} = $rt;
    $stats{rows} = $lines;
    store_internals();
    print_log( "PNP exiting (runtime ${rt}s) ...", 1 );
    exit 0;
}

itefix

You only need to comment out the alarm() calls. If you comment out the whole branches, there will be no perfdata collection:

    } elsif ( $opt_b && !$opt_n ) {
        # Bulk mode
        # alarm($opt_t);
        print_log( "process_perfdata.pl-$const{VERSION} starting in BULK Mode called by Nagios", 1 );
        $lines = process_perfdata_file();
    } elsif ( $opt_b && $opt_n ) {
        # Bulk mode with npcd
        # alarm($opt_t);
        print_log( "process_perfdata.pl-$const{VERSION} starting in BULK Mode called by NPCD", 1 );
        $lines = process_perfdata_file();
    } else {
        # Synchronous mode
        $opt_t = 5 if $opt_t > 5; # maximum timeout
        # alarm($opt_t);
        print_log( "process_perfdata.pl-$const{VERSION} starting in SYNC Mode", 1 );
        %NAGIOS = parse_env();
        $lines = process_perfdata();
    }

regNov

After I removed the comment marks, the pnp4nagios warning came up again in the event log...

regNov

Since I run pnp4nagios in bulk mode, I guess I can disable the synchronous mode. With synchronous mode commented out, there are no warnings in my event log. Is it necessary to keep sync mode enabled when running in bulk mode?

# Synchronous mode
   #     $opt_t = 5 if $opt_t > 5; # maximum timeout
    #    alarm($opt_t);
   #     print_log( "process_perfdata.pl-$const{VERSION} starting in SYNC Mode", 1 );
   #    %NAGIOS = parse_env();
   #     $lines = process_perfdata();

With this configuration there is no data for the pnp4nagios graphs.

   # Synchronous mode
        $opt_t = 10 if $opt_t > 10; # maximum timeout
    #    alarm($opt_t);
        print_log( "process_perfdata.pl-$const{VERSION} starting in SYNC Mode", 1 );
        %NAGIOS = parse_env();
        $lines = process_perfdata();
    }

I changed the maximum timeout from 5 to 10; maybe that works.

regNov

Hmm, I think I ran into trouble after changing the line

$opt_t = 10 if $opt_t > 10; # maximum timeout

No more data is collected for the pnp4nagios graphs, even if I set the value back to 5, and even if I remove the comment marks from all alarm($opt_t) calls.

Could you maybe upload the original process_perfdata.pl somewhere? I don't know what else to do, apart from reinstalling Nagwin...