Do you get notified if an app server, web server or process scheduler goes down? If you do, good for you! If you don’t receive notifications, here is a script that will send them to you.
If you are an admin who doesn’t have access to monitoring tools (and there are lots of you), this post will go over a script I wrote. The script builds a System Status page and sends email notifications. You can run the script on any machine and it will send an email if anything has gone down. A status.html page is also generated so that users can check the status of their system.
This page and script are something I put together over a few days as a side project. We wanted a page that would give us the status of each component in an environment. This isn’t the most robust script (I’ll explain the limitations in the post), but I wanted to share it so other people could use it. If you rely on end users to tell you if an environment is down, this script can help you out.
All of the code is hosted on GitHub, so go grab it here.
Install Prerequisites
The script is written in Ruby and uses tnsping for the database check, the Mechanize gem for interacting with PeopleSoft, Markdown for formatting, the Redcarpet gem for generating HTML documents, and the Mail gem for emailing status updates. So we’ll need to install all those parts. It sounds like a lot, but it’s pretty simple.
We’ll walk through the installation process. I’m writing the instructions for Windows, but the Linux steps will be similar.
Oracle Client
The script uses tnsping to check the database status, so we need the Oracle Client installed on the machine. If you don’t have the Oracle Client software, you can download it here.
You also need a tnsnames.ora file with entries for all the databases you want to check. You can place the tnsnames.ora file anywhere on the server. The status.bat script sets the TNS_ADMIN environment variable to point to the directory that holds your tnsnames.ora file.
Ruby Dev Kit
We’ll install Ruby and the Ruby Dev Kit (2.2.4 is what I’m using) on the machine where our scripts will run. Download the Ruby installer from here:
http://rubyinstaller.org/downloads/
I installed Ruby to e:\ruby22-x64 and selected the option to “add executables to the PATH variable”.
Next, download the Ruby DevKit from the same site. The DevKit includes tools to build Gems from source code, and we need those extra tools. I installed the Ruby DevKit to e:\ruby22-x64-devkit.
Open a new command prompt as an Administrator.
e:
cd ruby22-x64-devkit
ruby dk.rb init
notepad config.yml
- Add - e:/ruby22-x64 to the end of the file (notice the dash and forward slash)
- Save and close the config.yml file
ruby dk.rb install
Follow the instructions here if you have issues with the DevKit installation.
Gems
Ruby has a powerful package manager called RubyGems. The gem command is part of Ruby, so we can use it to install the extra packages we need for our status page.
Open a new command prompt as an Administrator and we’ll install the Gems.
where ruby
Make sure this command returns the e:\ruby22-x64 folder first.
If it’s not listed first, change the PATH environment variable so e:\ruby22-x64\bin\ is first.
gem install mechanize
gem install redcarpet
gem install mail
That’s it for the Gems.
Scripts
There are two scripts in the project:
psavailability.rb
status.bat
The first script, psavailability.rb, is the main script. This is where all the processing happens. The second script, status.bat, is a wrapper to set environment variables like ORACLE_HOME, TNS_ADMIN and PATH.
psavailability.rb
Let’s dive into the psavailability.rb script since that is the main part. As I mentioned before, the script is written in Ruby. I’ll be the first person to tell you that I’m not an expert in Ruby. If there are places where the code could be improved, let me know.
Status Check Flow
I chose to do all my status checking through the application. This is where Mechanize comes into play. I replicate the actions of a user who opens the login page, logs in, and navigates to the Process Monitor page. The main reason for this method was simplicity: I can do all my checks with one library.
+--------------+ +--------------+ +-------------+
| Web Server | +-----> | App Server | +------> | Scheduler |
| Check | | Check | | Check |
+--------------+ +--------------+ +-------------+
The main disadvantage is: if the web server is down, the app server and process scheduler checks will fail. That may not be accurate from a technical perspective, but from a user perspective it would be true. If you can’t log in, you can’t do anything!
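To make that cascade concrete, here is a minimal sketch of the idea, not the code from the project. The run_checks helper and the lambdas are hypothetical stand-ins for the real Mechanize calls; the point is only that once one check fails, everything downstream is reported as Down without being tried.

```ruby
# Run checks in order; once one fails, mark everything after it as Down too.
# `checks` is a hash of name => callable that returns true when healthy.
# (Hypothetical helper for illustration; the real script inlines these steps.)
def run_checks(checks)
  blocked = false
  checks.each_with_object({}) do |(name, check), results|
    healthy = false
    unless blocked
      begin
        healthy = check.call ? true : false
      rescue StandardError
        healthy = false
      end
    end
    results[name] = healthy ? 'Running' : 'Down'
    blocked = true unless healthy
  end
end

# Stand-in lambdas; the real checks would load pages with Mechanize.
statuses = run_checks({
  web:       -> { true },
  app:       -> { raise 'login failed' },
  scheduler: -> { true }
})
statuses.each { |name, status| puts "#{name}: #{status}" }
```

In this run the app check raises, so the scheduler lambda is never called and both app and scheduler come back as Down.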
Setup
I’ve moved any variables that will vary from my install to the top. They are:
# ---------------------------
# Change these variables
# ---------------------------
smtpServer = '<smtp server>'
statusUser = '<PeopleSoft Username>'
statusUserPwd = '<PeopleSoft Password>'
homepageTitleCheck = '<Homepage Title>'
fromEmailAddress = '<From email address>'
toEmailAddress = '<To email address>'
deployPath = '<e:\\path\\to\\PORTAL.war\\>'
# ---------------------------
The script assumes that you will use the same account to access all environments. I created a new account called STATUS and gave it limited permissions. STATUS can log in and open the Process Monitor Server List page. This way I can track logins from the status script, and we give the service account the least amount of security needed.
Another assumption in the script is that your Homepage Title will have similar text. In our case, we use titles like HR 9.2 Demo, FS 9.2 Test, or ELM 9.2 QA for our environments. I check for 9.2 in the Homepage Title to know if the login was successful.
Initialization
The next section contains some initialization steps. I set the User-Agent to IE 9, but you can change that if you want.
Mail.defaults do
  delivery_method :smtp, address: smtpServer, port: 25
end
affectedEnvironments = Array.new
notify = false
agent.user_agent_alias = 'Windows IE 9'
Then, I create the Markdown table headers for our table. I found it much easier to create the table in Markdown and then convert the table to HTML at the end.
table = "| Environment | Database | Web Status | App Status | Scheduler | Batch Server | Update Time | Batch Status |\n"
table = table + "| ----------- | -------- | ---------- | ---------- | --------- | ------------ | ----------- | ------------ |\n"
Last, I read in the URLs.txt file to get the list of environments and URLs to use. The project on GitHub has a sample URLs.txt file to follow.
# Get the list of environments
# the URLs.txt file is a CSV file with the format "DBNAME,baseURL,processMonitorURI"
agent = Mechanize.new
# Options are passed as keywords; a positional hash raises ArgumentError on Ruby 3.x
URLs = CSV.read('URLs.txt', col_sep: ',')
URLs.shift # Remove Header Row
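To show the file format in action, here is a small sketch that parses a made-up URLs.txt. The database names, hostnames, and the parse_environments helper are assumptions for illustration; use the sample file from the project for the real layout. Passing col_sep as a keyword argument works on both older and current Ruby versions.

```ruby
require 'csv'

# Hypothetical sample contents in the "DBNAME,baseURL,processMonitorURI" format.
SAMPLE_URLS = [
  'DBNAME,baseURL,processMonitorURI',
  'HRDMO,http://hrdmo.example.com/psp/hrdmo/,EMPLOYEE/HRMS/c/PROCESSMONITOR.PROCESSMONITOR.GBL?Page=PMN_SRVRLIST',
  'FSTST,http://fstst.example.com/psp/fstst/,EMPLOYEE/ERP/c/PROCESSMONITOR.PROCESSMONITOR.GBL?Page=PMN_SRVRLIST'
].join("\n")

# Parse the CSV text, drop the header row, and skip blank lines.
def parse_environments(text)
  rows = CSV.parse(text, col_sep: ',')
  rows.shift # remove header row
  rows.reject { |row| row.compact.empty? }
end

parse_environments(SAMPLE_URLS).each do |environment, login_url, prcs_uri|
  puts "#{environment}: #{login_url}#{prcs_uri}"
end
```

Each row destructures into the same three block variables that the main loop uses.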
Checking Status
URLs.each { |environment, loginURL, prcsURI|
web_status = 'Running'
app_status = 'Running'
database = 'Running'
We’ll loop through each environment for the status checks. In our case, we have 25 “environments” (prod and non-prod) that we check. I say “environments” because production has 2 web servers and I check each one.
begin
  t = `tnsping #{environment}`
  if t.lines.last.include? "OK"
    database = 'Running'
  else
    database = 'Down'
  end
rescue
  database = 'Down'
end
To test the database, we run a tnsping command. If the response contains “OK” in the last line, the ping was a success. If you are thinking to yourself, “that’s not the best way to test the database”, I agree. But this is the quickest way to get an Up or Down response. (See the Future Improvements section at the end.)
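If you want to test the parsing separately from the command itself, one option is to pull the “OK” check into its own method. This is a sketch, not the project’s code: tnsping_ok? and database_status are hypothetical names, and the check is slightly stricter than the original (it ignores trailing blank lines and requires the last line to start with “OK”).

```ruby
# Returns true when the last non-blank line of tnsping output starts with "OK".
def tnsping_ok?(output)
  last_line = output.to_s.lines.map(&:strip).reject(&:empty?).last
  !last_line.nil? && last_line.start_with?('OK')
end

# Runs tnsping for one TNS alias and maps the result to the status strings
# used in the table. Assumes tnsping is on the PATH (status.bat sets that up).
def database_status(tns_alias)
  output = `tnsping #{tns_alias} 2>&1`
  tnsping_ok?(output) ? 'Running' : 'Down'
rescue SystemCallError
  'Down'
end

puts tnsping_ok?("Attempting to contact (DESCRIPTION = ...)\nOK (10 msec)\n")
puts tnsping_ok?("TNS-03505: Failed to resolve name\n")
```

Keeping the parsing in a pure method means you can feed it captured output from a failing environment without needing the Oracle Client on your workstation.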
# Check web server by opening login page
begin
  signon_page = agent.get(loginURL + '?cmd=login')
rescue
  web_status = 'Down'
end
Next, we attempt to load the login page. If the page responds, we know our web server is up.
begin
  signin_form = signon_page.form('login')
  signin_form.userid = statusUser
  signin_form.pwd = statusUserPwd
  homepage = agent.submit(signin_form)
  # We updated PeopleTools > Portal > General Settings to include '9.2' in the title (e.g., "HR 9.2 Test").
  # If we see '9.2' in the title, we know the login was successful
  if homepage.title.include? homepageTitleCheck
    app_status = 'Running'
  else
    app_status = 'Down'
  end
rescue
  app_status = 'Down'
end
To check the app server status, the script logs into the application. We grab the form named login and pass in the PeopleSoft user and password. The page returned from the login attempt is stored in homepage. In our case, every environment has “9.2” in the homepage title. If “9.2” is in the title, I know we have logged in and the app server is up.
The field that holds the homepage title is psprdmdefn.descr254.
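The title comparison can also be written as a small predicate that handles a missing title explicitly instead of relying on the rescue. This is only a sketch; login_succeeded? is a hypothetical helper, and the titles are the examples mentioned above.

```ruby
# True when the page title contains the expected marker (for example '9.2').
# A nil title (no page, or a page without a <title>) counts as a failed login.
def login_succeeded?(page_title, marker)
  return false if page_title.nil? || marker.to_s.empty?
  page_title.include?(marker)
end

['HR 9.2 Demo', 'Oracle PeopleSoft Sign-in', nil].each do |title|
  status = login_succeeded?(title, '9.2') ? 'Running' : 'Down'
  puts "#{title.inspect} => #{status}"
end
```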
begin
  # Build URL for Process Monitor and access the component page directly
  procMonURL = loginURL + prcsURI
  procMonURL.sub! '/psp/', '/psc/'
  server_list = agent.get(procMonURL)
  scheduler_status = ''
  scheduler_status = ['', '', '', 'Down'].join(' | ')
  schedulers = server_list.search(".PSLEVEL1GRID").collect do |html|
    # Iterate through the Server List grid (but skip the first row - the header)
    html.search("tr").collect.drop(1).each do |row|
      server = row.search("td[1]/div/span/a").text.strip
      hostname = row.search("td[2]/div/span").text.strip
      last_update = row.search("td[3]/div/span").text.strip
      status = row.search("td[9]/div/span").text.strip
      scheduler_status = [server, hostname, last_update, status].join(' | ')
    end
  end
rescue
  scheduler_status = ['', '', '', 'Down'].join(' | ')
end
For the process scheduler, we take the Process Monitor URI and append it to the login URL. In the URL, we pass in ?Page=PMN_SRVRLIST to access the Server List page. We also swap the /psp/ servlet for the /psc/ servlet. That makes the screen scraping easier since we remove the header frame.
On the Server List page, we grab the grid for the batch servers, drop the header row, and capture the status for each server in the list.
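One thing to notice in the loop above: scheduler_status is reassigned for every row, so with more than one scheduler only the last row ends up in the table. Here is one possible way to keep them all on a single line. It is a sketch under assumptions: scheduler_cells is a hypothetical helper, the rows are already-scraped arrays of [server, hostname, last_update, status], and the hostnames are made up.

```ruby
# Collapse several scheduler rows into the four table cells, joining the
# values of each column with ", ". Falls back to the "Down" placeholder
# when no rows were scraped.
def scheduler_cells(rows, separator = ', ')
  return ['', '', '', 'Down'].join(' | ') if rows.empty?
  rows.transpose.map { |column| column.join(separator) }.join(' | ')
end

rows = [
  ['PSNT',  'winhost01', '10:00:05', 'Running'],
  ['PSUNX', 'linhost01', '10:00:07', 'Down']
]
puts scheduler_cells(rows)
```

The include?("Down") test for affected environments still works with the combined string. The exact-match styling of <td>Down</td> cells would not catch a combined cell such as "Running, Down", so that part would need its own adjustment.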
begin
  logoutURL = loginURL + '?cmd=logout'
  agent.get(logoutURL)
rescue
end
Don’t forget to log out!
table = table + "| #{environment} | #{database} | #{web_status} | #{app_status} | #{scheduler_status} |\n"
# If a component is down, add the environment to the affectedEnvironments list
if web_status.include?("Down") || app_status.include?("Down") || scheduler_status.include?("Down")
  affectedEnvironments.push(environment)
end
}
Last, we append the status for each component into a string and add it to our Markdown table. If any components for the environment are down, we add that environment to affectedEnvironments.
Formatting Output
# Format Markdown table into an HTML table
options = {
  filter_html: true,
  link_attributes: { rel: 'nofollow', target: "_blank" },
  space_after_headers: true
}
renderer = Redcarpet::Render::HTML.new(options)
markdown = Redcarpet::Markdown.new(renderer, tables: true)
tableHTML = markdown.render(table)
This section takes our Markdown table and creates an HTML table.
# Add a style to the "Down" fields
# (gsub returns the string even when nothing matches; gsub! would return nil)
if affectedEnvironments.empty?
  tableStyleHTML = tableHTML
else
  tableStyleHTML = tableHTML.gsub('<td>Down</td>', '<td class="down">Down</td>')
end
The HTML table has no styles, so it looks plain. I want to highlight any component that is “Down”. We find any <td> with “Down” as the value and add the class .down to it.
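A detail worth knowing when doing this kind of replacement: String#gsub! returns nil when nothing was replaced, while String#gsub always returns a string. The sketch below uses the non-bang form, so the result is safe to write to a file either way. highlight_down is a hypothetical helper name and the HTML snippets are made up.

```ruby
# Add the "down" class to every table cell whose entire content is "Down".
# Uses gsub (not gsub!) so the return value is always a String.
def highlight_down(table_html)
  table_html.gsub('<td>Down</td>', '<td class="down">Down</td>')
end

all_up   = '<tr><td>HRDMO</td><td>Running</td></tr>'
one_down = '<tr><td>FSTST</td><td>Down</td></tr>'

puts highlight_down(all_up)
puts highlight_down(one_down)
```

With this version the empty-list check becomes optional, since a table with no “Down” cells simply comes back unchanged.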
File.write('table.html', tableStyleHTML)
# Combine the header, table, and footer HTML files into one status HTML file
statusPage = `copy /a header.html+table.html+footer.html status.html`
deployFile = `xcopy status.html #{deployPath} /y`
At this point, we write our HTML table to the file table.html. Next, we combine the prebuilt header.html and footer.html files with the updated table.html.
Last, we copy the file to the web location where the status.html file can be viewed.
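The copy and xcopy commands tie this step to Windows. If you want the same result without shelling out, a plain-Ruby version could look like the sketch below. build_status_page is a hypothetical helper, and it assumes the three HTML fragments live in one directory and that the deploy path is an existing directory.

```ruby
require 'fileutils'

# Concatenate header.html + table.html + footer.html into status.html inside
# `dir`, then optionally copy the result into `deploy_path`.
# Returns the path of the generated status.html.
def build_status_page(dir, deploy_path = nil)
  parts = %w[header.html table.html footer.html].map do |name|
    File.read(File.join(dir, name))
  end
  status_file = File.join(dir, 'status.html')
  File.write(status_file, parts.join)
  FileUtils.cp(status_file, deploy_path) if deploy_path
  status_file
end
```

Calling build_status_page('.', deployPath) from the script directory would stand in for the two backtick commands.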
We have a page with all the links to our environments. I added an <iframe> at the bottom of the links page to show the status.html. Anyone who wants to check on an environment can see the status on the links page.
Notify
Now for the fun part – sending a notification. We scheduled the script to run every 10 minutes. But if an environment is down for maintenance, or it’s taking us a while to get it back up, I don’t want to get emails every time the script runs. I want the email to go out once each time an environment goes down.
# If the environment is newly down, send the email
# If the environment was already down (exists in 'down.txt'), don't resend the email
if affectedEnvironments.empty?
  # if no environments are down, delete the 'down.txt' file
  if File.exist?('down.txt')
    delete = `del down.txt`
  end
else
  if File.exist?('down.txt')
    downFile = File.read("down.txt")
    affectedEnvironments.each do |env|
      if !(downFile.include?(env))
        # If both conditions (component down, environment not stored in 'down.txt'), send an email
        notify = true
      end
    end
  else # if the file 'down.txt' doesn't exist, the component is newly down
    notify = true
  end
  # Write down environments to file for next status check (will overwrite the existing file)
  File.open("down.txt", "w") do |f|
    f.puts(affectedEnvironments)
  end
end
If an environment is down, the script writes that environment name to a text file, down.txt. Next time the script runs, it compares the environments marked down (in the affectedEnvironments array) to the file contents. If the environment exists in down.txt, we skip notification. If a new environment is down, we send the email.
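The same decision can be expressed with array subtraction, which also makes it easy to test. Note one difference from the code above: the original checks whether the environment name appears anywhere in the file contents (a substring match), so a name like HR would be treated as already reported if HRDEV is in the file. The sketch below compares whole lines instead. The helper names and environment names are hypothetical.

```ruby
# Read the previously reported environments, one name per line.
def read_down_file(path)
  return [] unless File.exist?(path)
  File.readlines(path).map(&:strip).reject(&:empty?)
end

# Environments that are down now but were not in the previous run.
def newly_down(affected, previously_down)
  affected - previously_down
end

# Send an email only when at least one environment is newly down.
def notify?(affected, previously_down)
  !newly_down(affected, previously_down).empty?
end

previous = %w[HRDEV]
puts notify?(%w[HRDEV], previous)        # already reported
puts notify?(%w[HRDEV FSTST], previous)  # FSTST is new
```

In the script, previous would come from read_down_file('down.txt') before the file is overwritten with the current list.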
if notify
  mail = Mail.deliver do
    from fromEmailAddress
    to toEmailAddress
    subject 'PeopleSoft System Status: ' + affectedEnvironments.join(", ") + ' Down'
    # use the markdown table as the text version
    text_part do
      # call the body method; "body = table" would only assign a local variable
      body table
    end
    # use the status.html file as the HTML version
    html_part do
      content_type 'text/html; charset=UTF-8'
      body File.read('status.html')
    end
  end
end # end Notify
Last, create the email and send it. I add status.html as the HTML content of the email, and the Markdown table as the plain text version.
status.bat
The status.bat script is a wrapper to set environment variables and invoke psavailability.rb. The status.bat script is what our Scheduled Task calls.
This is the list of environment variables I set:
set ORACLE_HOME=e:\oracle\product\12.1.0\client_1
set TNS_ADMIN=%ORACLE_HOME%\network\admin
set PATH=%ORACLE_HOME%\bin;%PATH%
set PATH=e:\ruby22-x64\bin;%PATH%
After the environment variables are set, we invoke psavailability.rb:
e:
cd \
cd psoft\status
ruby psavailability.rb
Limitations
There are some (many?) limitations in the script. That is because I wrote the script for our current situation (Windows/Oracle) and this was something we built as a side project. Here is a list of the limitations that I know of:
- Windows-only
- Oracle-only
- Does not support HTTPS for websites
- Does not support TLS for SMTP
- Requires the homepage title to be consistent
- Doesn’t check the app server or batch server directly
- If an app server is down, and you have Jolt failover set, you may get a false “Running” status.
The Windows and Oracle limitations wouldn’t be hard to fix. If you want to make any changes, I’d be happy to integrate them into the main project.
Future Improvements
Here is a list of future improvements that I’ve thought of:
- Slack integration with slack-ruby-client.
- Better checks (or second test if web check fails) for database and batch. This could be done via database connections to run SQL statements.
- Better checks (or second test if web check fails) for app server domains.
- Add Integration Broker checks (messages in Error or Timeout) via SQL.
- Add HTTPS support to test Load Balancer URLs.
- Add TLS support to SMTP
The project is hosted on GitHub. I’ll merge any pull requests that people want to send.
As I mentioned earlier, this script started out as a way to get notifications. This was a “scratch your itch” type of project. But, if you want to use the script or improve it, you can get all the code over on GitHub.
Thanks for sharing this! I am not familiar with Ruby so I was wondering if you could tell me the best way to enable logging? I have it getting about halfway through my list of servers and then it shows everything as down.
Hi Dale
Ruby doesn’t have the same tracing functionality as PeopleCode (which is actually quite nice). But there are a few tools we can use to get Ruby trace and debug logs. Try this first:
irb psavailability.rb > trace.log
This will create a trace.log file where you can see the values used in the code. One issue with this: the main loop, URLs.each { |environment, loginURL, prcsURI|, is only printed once. But above that line you can find the array with all the URLs from the text file.
Another option is ruby -rtracer psavailability.rb > tracer.log. This will give you LOTS of data (I think too much), but it will show the full execution of the script in the file tracer.log.
One more option is to modify the script to enable tracing at specific lines. To add in tracing:
- require 'tracer' at the top of the file (with the other require statements).
- Tracer.on before the code you want to trace.
- Tracer.off after the code you want to trace.
When we built the script we had this issue too. My initial guess is that the URLs aren’t quite right for a set of environments. That’s what happened to us – we copy/pasted the URLs and missed some Node changes 😉 If you want to paste a line or two of your URLs.txt file (where it works and then doesn’t work) we might be able to find something.
Thanks for the suggestions. So far no luck. You are right that -rtracer is too much information! What I have figured out is that it always fails on the 11th one. I can rearrange the URLs in the list and it will complete previously failed ones, but 11+ will fail every time.
Hi,
Very Nice stuff. Can I set this up for PT 8.54? This is the server status link for PT 8.54
https://apps.oii.oceaneering.com/psp/HRDMP/EMPLOYEE/HRMS/c/PROCESSMONITOR.PROCESSMONITOR.GBL?FolderPath=PORTAL_ROOT_OBJECT.PT_PEOPLETOOLS.PT_PROCESS_SCHEDULER.PT_PROCESSMONITOR_GBL&IsFolder=false&IgnoreParamTempl=FolderPath%2cIsFolder
Yes, this should work with 8.54. If your database name is HRDMP, then the line in URLs.txt would be:
I was able to get some nice logs with the following added to the psavailability.rb file.
require 'logger'
log = Logger.new('logger.txt')
log.level = Logger::INFO
agent.log = log
Here is the relevant line in the URLs.txt file:
I found that I was getting 400 errors. I haven’t found out if they are valid yet, but this is better than what I was finding. I am not sure if it’s the /EMPL/s/ in the redirect that is to blame or not.
I, [2016-03-15T16:31:19.256368 #8916] INFO -- : Net::HTTP::Get: /psp/strq/?cmd=login
I, [2016-03-15T16:31:19.288766 #8916] INFO -- : status: Net::HTTPOK 1.1 200 OK
I, [2016-03-15T16:31:19.292745 #8916] INFO -- : form encoding: utf-8
I, [2016-03-15T16:31:19.293747 #8916] INFO -- : Net::HTTP::Post: /psp/strq/?&cmd=login&languageCd=ENG
I, [2016-03-15T16:31:19.400730 #8916] INFO -- : status: Net::HTTPFound 1.1 302 Moved Temporarily
I, [2016-03-15T16:31:19.402681 #8916] INFO -- : follow redirect to: https://www.cnd.nd.gov/psc/strq/EMPLOYEE/EMPL/s/WEBLIB_PTBR.ISCRIPT1.FieldFormula.IScript_StartPage
I, [2016-03-15T16:31:19.403681 #8916] INFO -- : Net::HTTP::Get: /psc/strq/EMPLOYEE/EMPL/s/WEBLIB_PTBR.ISCRIPT1.FieldFormula.IScript_StartPage
I, [2016-03-15T16:31:19.415724 #8916] INFO -- : status: Net::HTTPBadRequest 1.1 400 Bad Request
I, [2016-03-15T16:31:19.417032 #8916] INFO -- : Net::HTTP::Get: /psc/strq/EMPLOYEE/EMPL/c/PROCESSMONITOR.PROCESSMONITOR.GBL?Page=PMN_SRVRLIST
I, [2016-03-15T16:31:19.455260 #8916] INFO -- : status: Net::HTTPBadRequest 1.1 400 Bad Request
I, [2016-03-15T16:31:19.457210 #8916] INFO -- : Net::HTTP::Get: /psp/strq/?cmd=logout
Dale, the logger module looks really nice. A couple of questions:
EMPL. Not sure what application you’re hitting, but the default is usually HRMS for HR, ERP for Finance, etc. I noticed your node was different.
https://www.cnd.nd.gov/psc/strq/EMPLOYEE/EMPL/s/WEBLIB_PTBR.ISCRIPT1.FieldFormula.IScript_StartPage
Dan,
No, all of my environments are https. I added agent.verify_mode = OpenSSL::SSL::VERIFY_NONE to the script.
EMPL is Portal (Interaction Hub).
Pasting that URL in after logging in takes me to this URL which seems to indicate that the URL is valid.
https://www.cnd.nd.gov/psp/strq/EMPLOYEE/EMPL/h/?tab=NDS_EMPLOYEE_HUB
Ah, forgot about the Portal.
Try this – let’s write the homepage variable to file so we can see what HTML the page returns when the script attempts to log in. I added in log.info(homepage) for app servers that are down.
I’m curious what the server returns for HTML when the script thinks it’s down.
Looks like it’s not seeing anything at the time of the errors.
I, [2016-03-17T10:14:27.879620 #6216] INFO -- : nil
Well, that’s not useful!
Is this still happening with the 11th environment, or this specific environment?
All logs after that are nil also.
So we had the issue where all environments were showing down after a certain one. If your default local node has the same username and password in all environments, we didn’t have an issue. Our demo and production environments do. I resolved my issue by clearing cookies.
agent.get(logoutURL)
agent.cookie_jar.clear!
Eric, thanks for posting that change!
Eric – I added that line to the GitHub repo. Thanks again for sharing the change.
Dan,
Thank you for the script. I’m currently using the script for our environments with a few minor mods. It has proven to be very helpful.
Mod #1
Our environments have two process schedulers per instance, one on Linux and one on Windows, and I wanted to show them all on one line.
Mod #2
In the footer I included a date time stamp so you had an idea of when the status.html file was last generated.
I schedule the script to run on our Windows Test Process Scheduler box and I copy the status.html to another Windows server running a simple web server via IIS. So now our team can just click on a link and get this information and of course I get the emails when something comes down.
Thank you again!
Eric
Hi Eric,
Can you share how you added the second line of process schedulers within the same script?
We have the same scenario with two process schedulers NT and UNIX running but the report only shows one of them.
Hamza
Hamza – On the GitHub repo, I added an issue for multiple schedulers. I don’t have a fix yet, but there are a few options for supporting multiple schedulers that you could look at.
https://github.com/psadmin-io/ps-availability/issues/1
Hi Dan, Thanks for sharing this. I am new to ruby. So far I was able to get the status of domain, web and app server. I tried to get the details of scheduler and batch server which I couldn’t. Could you please point me in the right direction on how we can acquire scheduler and batch server details.
In my code for these details I am searching for an html id ‘SERVERLIST’. But so far all the details appear blank under these two columns.
It will be very helpful if you can provide how you were able to get the above details. Please let me know if you need anything.
The script is absolutely great and works well. However, when I schedule the script using Task Scheduler on a Windows VM, Task Scheduler shows that the task completed successfully, but it has not generated status.html. Task Scheduler works fine for any other Windows operation. Not sure how to fix or debug the issue.
Thanks Amar
Check to make sure that the working directory is correct when you start the script. Are you running the job as a service account or under your account?
Dan,
I am new to Ruby and am trying to implement this in my environment. I’m running into a couple of issues.
Q1) I installed the latest Ruby with DevKit, x64 (3.1). As I already installed the merged Ruby package with the DevKit, do I still need to run the commands below and edit config.yml? I ran them and got these errors:
3. ruby dk.rb init
ruby: No such file or directory -- dk.rb (LoadError)
4. notepad config.yml
(no file found) Where is this file?
7. ruby dk.rb install
ruby: No such file or directory -- dk.rb (LoadError)
I tried to run status.ps1 after skipping the above errors and got the error below. Please let me know how to resolve it.
D:/Ruby31-x64/lib/ruby/gems/3.1.0/gems/csv-3.2.6/lib/csv.rb:1828:in `read': wrong number of arguments (given 2, expected 1) (ArgumentError)
from ./psavailability.rb:73:in ‘