More advanced scripting to handle remediation -- Part 1

Remediation. The bane of DevOps existence. In an ideal world, it would not be needed. Seldom do we live in an ideal world.

With this in mind, I've spent the past few months automating remediation, as well as comparing turnkey solutions like Neptune.io and StackStorm.com.

While each has its highs and lows, they both have one thing in common: for a feature-rich solution that is easy to set up, you're going to pay. This in and of itself is not a bad thing.

For those of us who don't want to pay, some simple scripting can get similar results at no extra cost.

Let's start with this test case:

  • You have a Linux box that keeps running out of memory. For the sake of simplicity, let's say the culprit is a node.js process.
  • You know that it will crash, but the service will report that it's online when you run a service status check.
  • CPU appears fine; it's just the memory that is tapped out.

A simple solution would be something like this:

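#!/bin/bash
# Integer part of the %MEM column (field 4 of ps output) for the node process.
# head -n1 keeps a single value if more than one process matches.
memory_pct_used=$(ps auxf|grep "node server.js"|grep "grep" -v|awk '{print $4}'|cut -d'.' -f1|head -n1)
# Treat "no matching process" as 0 so the test never sees an empty string.
if [ "${memory_pct_used:-0}" -gt 75 ]; then
  service my_node_service restart
  echo "More than 75% of the memory was in use by the node service. I have restarted my_node_service"
  exit 1
else
  echo "All is right in the world"
  exit 0
fi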

This script is fairly limited in what it can do: the process name, the service name, and the threshold are all hard-coded. Let's change it up to be reusable.

#!/bin/bash
function help(){  
  echo "Usage:
      $0 --search='STRING' [--service='STRING'] [--warn=INTEGER] [--crit=INTEGER] [-h | --help]

  --search  | String to search for utilizing 'ps auxf'
  --service | Service to restart if critical threshold is met/exceeded
  --warn    | If this integer is met, but is lower than critical, it will return 1, to tell you it has passed the warning threshold
  --crit    | If this integer is met or exceeded, it will restart the service (if defined), and return 2 to let you know
  -h --help | print this help"
}
# Parse --name=value options; ${i#*=} strips everything up to the first '='.
for i in "$@"
do
  case $i in
    --search=*)
      search=${i#*=}
    ;;
    --service=*)
      service=${i#*=}
    ;;
    --warn=*)
      warn=${i#*=}
    ;;
    --crit=*)
      crit=${i#*=}
    ;;
    --help|-h)
      help
      exit 0
    ;;
  esac
done
if [ "$warn" = "" ]; then  
  warn=50
fi  
if [ "$crit" = "" ]; then  
  crit=75
fi  
if [ "$search" = "" ]; then  
  echo "Invalid search string, or string not set."
  help
  exit 1
fi  
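pct=0
# Integer part of %MEM (field 4) for every process whose ps line matches the search string
memory_pct_used=$(ps auxf|grep "$search"|grep "grep" -v|awk '{print $4}'|cut -d'.' -f1)
# Add up the values across all matching processes
for i in $memory_pct_used; do
  pct=$((pct + i))
done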

if [ "$pct" -ge "$warn" ]; then
  if [ "$pct" -ge "$crit" ]; then
    if [ "$service" = "" ]; then
      echo "Memory for '$search' has passed the critical threshold. Current memory is $pct%"
      exit 2
    else
      service "$service" stop; service "$service" start
      echo "Memory for '$search' has passed the critical threshold. Memory was at $pct%. I have restarted the service."
      exit 2
    fi
  else
    echo "Memory for '$search' has passed the warning threshold, but is below the critical threshold. Current memory is $pct%"
    exit 1
  fi
else  
  echo "Memory is within allowable range. Current memory is $pct%"
  exit 0
fi  
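
One thing the script does not do is check that --warn and --crit are actually integers. In bash, a non-numeric value makes the numeric test print an error and evaluate as false, so the script would fall through to the "within allowable range" branch. A minimal sketch of a guard, assuming two hypothetical helpers (is_uint and valid_thresholds are my names, not part of the script above):

```shell
#!/bin/bash
# Hypothetical helpers, shown only as a sketch of input validation.

# Succeeds if $1 is a non-empty string made up of digits only.
is_uint() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Succeeds if both thresholds are integers and warn does not exceed crit.
valid_thresholds() {
  is_uint "$1" && is_uint "$2" && [ "$1" -le "$2" ]
}

# Possible use right after the defaults are applied (commented out here):
# valid_thresholds "$warn" "$crit" || { help; exit 1; }
```

Whether warn should be allowed to equal crit is a judgment call; this sketch allows it.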

Copy the above, save it to a .sh file, make it executable with chmod +x, then run it with no flags to receive usage instructions.

Here's one example I used:

➜  ~ ./memory_remediation.sh --search="amavisd-new" --warn=15 --crit=25

And here's the example output:

Memory is within allowable range. Current memory is 11%  
# I ran the next 2 lines to see what value was returned
➜  ~  retval=$?
➜  ~  echo $retval
0  
# 0 = good
# 1 = warn
# 2 = critical
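
The 0/1/2 return values line up with the exit-code convention used by Nagios-style checks, so a wrapper can act on them directly. Here is a rough sketch of turning the return value into a label; status_label is a hypothetical helper name, and note that the script also returns 1 when the search string is missing, so 1 is not exclusively a warning:

```shell
#!/bin/bash
# Hypothetical helper: translate the script's exit status into a label.
status_label() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "CRITICAL" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Possible use (commented out so the snippet has no side effects):
# ./memory_remediation.sh --search="amavisd-new" --warn=15 --crit=25
# echo "Result: $(status_label $?)"
```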

Now, to make sure it alerts properly, let's force it to return a critical:

➜  ~  ./memory_remediation.sh --search="amavisd-new" --warn=3 --crit=9
Memory for 'amavisd-new' has passed the critical threshold. Current memory is 10%  
➜  ~  retval=$?
➜  ~  echo $retval
2  

Finally, have it return a warning:

➜  ~  ./memory_remediation.sh --search="amavisd-new" --warn=3 --crit=25
Memory for 'amavisd-new' has passed the warning threshold, but is below the critical threshold. Current memory is 10%  
➜  ~  retval=$?
➜  ~  echo $retval
1  

Using the same basic script, you can easily reconfigure it to key off CPU, disk I/O, and swap space being used.
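
For CPU, the change can be as small as reading a different column: in 'ps aux' output, %CPU is field 3 and %MEM is field 4. As a sketch only, here is a hypothetical helper (sum_ps_field is my name for it) that sums whichever field you ask for, reading ps-style lines from stdin. Disk I/O and swap are not columns of 'ps aux', so those would need a different data source.

```shell
#!/bin/bash
# Hypothetical helper: sum one numeric field of `ps aux`-style lines read from
# stdin, counting only lines that contain the search string.
# Field 3 is %CPU and field 4 is %MEM. The result is truncated to an integer.
sum_ps_field() {
  local search=$1 field=$2
  awk -v s="$search" -v f="$field" '
    index($0, s) { total += $f }
    END { printf "%d\n", total }
  '
}

# Possible use (commented out so the snippet has no side effects).
# Capturing ps output first keeps the awk process itself out of the listing:
# ps_out=$(ps aux)
# cpu=$(sum_ps_field "node server.js" 3 <<< "$ps_out")
```

Unlike the cut-based pipeline, this adds the fractional values first and truncates once at the end, so totals can come out slightly higher when many small processes match.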

In the next article, I will show you how to convert this into a daemon that will constantly monitor a given service.