More advanced scripting to handle remediation -- Part 2

In the last article, I shared a simple script to automate restarting a service based on memory.

Let's take that a step further and run it as a monitoring service.

First, let's write a script that will run as a service. A service is basically just a program that runs in an infinite loop, which is easy to do in Bash with while [ 1 ]; do.

Let's say you keep your scripts in /root/scripts/. The service script looks like this:

/root/scripts/memory_remediation_service.sh

#!/bin/bash
# Default time between checks: 1 minute. Override with --sleep=SECONDS.
sleep_time=60
script_to_run=""
# Keep the switches in an array so values containing spaces
# (e.g. --search="node server") are passed through as one argument.
switches=()
# Parse the arguments once, before the loop. Parsing (and shifting) inside
# the loop would leave nothing to parse on the second pass.
for i in "$@"
do
    case $i in
    --search=*|--service=*|--warn=*|--crit=*)
      switches+=("$i")
    ;;
    --script=*)
      script_to_run=${i#*=}
    ;;
    --sleep=*)
      sleep_time=${i#*=}
    ;;
    esac
done
if [ -z "$script_to_run" ]; then
  echo "Missing the script to run. Cannot start service."
  exit 1
fi
while [ 1 ]; do
  "$script_to_run" "${switches[@]}"
  retval=$?
  # Exit code 4 means the remediation script rejected its arguments.
  if [ "$retval" -eq 4 ]; then
    echo "Invalid switches, or switches missing.  Quitting."
    exit 2
  fi
  sleep "$sleep_time"
done
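
One detail worth checking when forwarding switches such as --search="node server" is word splitting: a value containing a space only survives as a single argument if the switches are stored in an array and expanded with quotes. The sketch below is a standalone illustration, not part of the service itself; count_args is a hypothetical helper that simply reports how many arguments it received.

```shell
#!/bin/bash
# Hypothetical helper: prints the number of arguments it was called with.
count_args() {
  echo "$#"
}

# Flat string: the unquoted expansion splits "node server" into two words.
flat="--search=node server --warn=60"
as_string=$(count_args $flat)

# Array: each element stays one argument, spaces included.
switches=("--search=node server" "--warn=60")
as_array=$(count_args "${switches[@]}")

echo "string form: $as_string arguments"
echo "array form:  $as_array arguments"
```

With the flat string, the remediation script would see --search=node and a stray server argument, so it would match every process containing "node" rather than just "node server".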

I had to make one minor modification to the original script: it now exits with status 4 if it isn't passed the values it needs. That lets the service script quit on a configuration error instead of looping forever.
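
To make the exit-code contract concrete, here is a small sketch of how a caller might interpret the statuses the remediation script returns (0, 1, 2, and the new 4). Both describe_status and fake_remediation are hypothetical names used only for illustration; the stub stands in for a real run so the example has no side effects.

```shell
#!/bin/bash
# Hypothetical helper: translate the remediation script's exit status into a label.
describe_status() {
  case "$1" in
    0) echo "ok" ;;
    1) echo "warning" ;;
    2) echo "critical" ;;
    4) echo "bad-arguments" ;;
    *) echo "unknown" ;;
  esac
}

# Hypothetical stub: simulates a run that fails argument validation.
fake_remediation() {
  return 4
}

status=0
fake_remediation || status=$?
echo "status $status means: $(describe_status "$status")"
```

Only status 4 should stop the service; 1 and 2 are normal results and the loop keeps running after them.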

Here's the new script that the service script will call every 60 seconds (or at whatever interval you pass with --sleep):

/root/scripts/remediation.sh:

#!/bin/bash
function help(){  
  echo "Usage:
      $0 --search='STRING' [--service='STRING'] [--warn=INTEGER] [--crit=INTEGER] [-h | --help]

  --search  | String to search for utilizing 'ps auxf'
  --service | Service to restart if critical threshold is met/exceeded
  --warn    | If this integer is met, but is lower than critical, it will return 1, to tell you it has passed the warning threshold
  --crit    | If this integer is met or exceeded, it will restart the service (if defined), and return 2 to let you know
  -h|--help | print this help"
}
# Parse the switches. No shift is needed: the for loop already walks "$@".
for i in "$@"
do
    case $i in
    --search=*)
      search=${i#*=}
    ;;
    --service=*)
      service=${i#*=}
    ;;
    --warn=*)
      warn=${i#*=}
    ;;
    --crit=*)
      crit=${i#*=}
    ;;
    --help|-h*)
      help
      exit 0
    ;;
  esac
done
if [ "$warn" = "" ]; then  
  warn=50
fi  
if [ "$crit" = "" ]; then  
  crit=75
fi  
if [ "$search" = "" ]; then  
  echo "Invalid search string, or string not set."
  help
  exit 4
fi  
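pct=0
# Collect the %MEM column (whole percents only) for every process matching the search string.
memory_pct_used=$(ps auxf | grep -- "$search" | grep -v "grep" | awk '{print $4}' | cut -d'.' -f1)
# Add up the per-process percentages.
for i in $memory_pct_used; do
  pct=$((pct + i))
done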

if [ "$pct" -ge "$warn" ]; then
  if [ "$pct" -ge "$crit" ]; then
    if [ "$service" = "" ]; then
      echo "Memory for '$search' has passed the critical threshold. Current memory is $pct%"
      exit 2
    else
      service $service stop; service $service start
      echo "Memory for '$search' has passed the critical threshold. Memory was at $pct%. I have restarted the service."
      exit 2
    fi
  else
    echo "Memory for '$search' has passed the warning threshold, but is below the critical threshold. Current memory is $pct%"
    exit 1
  fi
else  
  echo "Memory is within allowable range. Current memory is $pct%"
  exit 0
fi  
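
One thing to be aware of: the script drops the decimal part of each process's %MEM before adding, so several processes at 0.9% each add up to 0. If that matters for your workload, a possible alternative is to sum first and truncate afterwards. The sketch below assumes `ps aux`-style input on stdin, and sum_mem_pct is a hypothetical helper name, not something from the original script.

```shell
#!/bin/bash
# Hypothetical helper: reads `ps aux`-style lines on stdin and prints the
# total of column 4 (%MEM) for lines containing the search string,
# truncated to a whole number only after summing.
sum_mem_pct() {
  awk -v pat="$1" '
    index($0, pat) && !index($0, "grep") { total += $4 }
    END { printf "%d\n", total }
  '
}

# Canned sample instead of live `ps` output, so the result is reproducible.
sample='root  101  0.5  0.9  1000  900 ?  S  10:00  0:01 node server
root  102  0.5  0.9  1000  900 ?  S  10:00  0:01 node server
root  103  0.0  0.1  1000  100 ?  S  10:00  0:00 sshd'

total=$(printf '%s\n' "$sample" | sum_mem_pct "node server")
echo "node server is using about ${total}% of memory"
```

Against live data this would be called as `ps aux | sum_mem_pct "node server"`. As with the grep version, the helper's own process may appear in the listing, though its memory share is normally negligible.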

Next, we need a service definition for your distro's init system. I've written an Upstart job and a systemd unit for this below.

Ubuntu (Upstart)

/etc/init/remediation.conf

start on startup  
description "Simple remediation solution"  
chdir /root/scripts/  
setuid root  
# Let's have it warn if "node" is using more than 60%, and restart the service if it reaches more than 75% memory
exec ./memory_remediation_service.sh --search="node server" --service="my_node_service" --warn=60 --crit=75 --script=/root/scripts/remediation.sh

CentOS/RHEL >= 7

/usr/lib/systemd/system/remediation.service

[Unit]
Description=Simple remediation solution  
After=syslog.target

[Service]
Type=simple  
User=root  
Group=root  
# systemd only honors quotes that wrap a whole argument, so quote the entire --search switch
ExecStart=/root/scripts/memory_remediation_service.sh "--search=node server" --service=my_node_service --warn=60 --crit=75 --script=/root/scripts/remediation.sh

# Give a reasonable amount of time for the server to start up/shut down
TimeoutSec=300

[Install]
WantedBy=multi-user.target  

This script can be modified easily to watch other services, and as I said in the last article, you can also repurpose this to monitor other metrics, instead of memory.
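
As a rough sketch of what repurposing could look like, here is the same warn/crit logic applied to disk usage instead of memory. This is only an illustration under stated assumptions: classify and disk_pct are hypothetical helpers, the input is canned `df -P`-style output rather than a live call, and the 60/75 thresholds are borrowed from the service examples above.

```shell
#!/bin/bash
# Hypothetical helper: compare a whole-number value against the thresholds
# and return 0 (ok), 1 (warning), or 2 (critical), like remediation.sh does.
classify() {
  local value=$1 warn=$2 crit=$3
  if [ "$value" -ge "$crit" ]; then
    return 2
  elif [ "$value" -ge "$warn" ]; then
    return 1
  fi
  return 0
}

# Hypothetical helper: pull the capacity column for one mount point out of
# `df -P`-style output on stdin, without the trailing percent sign.
disk_pct() {
  awk -v mount="$1" '$6 == mount { sub(/%/, "", $5); print $5 }'
}

# Canned `df -P` output so the example is reproducible.
sample='Filesystem 1024-blocks    Used Available Capacity Mounted on
/dev/sda1     10000000 8200000   1800000      82% /
/dev/sdb1     50000000 5000000  45000000      10% /data'

root_pct=$(printf '%s\n' "$sample" | disk_pct /)
status=0
classify "$root_pct" 60 75 || status=$?
echo "/ is at ${root_pct}% (status $status)"
```

On a real system the sample would be replaced with `df -P | disk_pct /`, and the critical branch would run whatever cleanup or restart makes sense for that metric.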

We now have a service-based solution. In the next article, I will cover how to do all this across multiple servers to centralize remediation.