Automatization of Techlab's Request Management Process

Warning

This is an old post. Techlab was an interesting project that, unfortunately, no longer exists.

Background on Techlab, from the website

Techlab is a CERN IT project providing testing/benchmarking platforms for multiple architectures and accelerators. We gather users requests, analyze market trends and procure cutting edge hardware. We then set it up, test it and put it up for experimental use by members of the HEP community.

Techlab systems are operated as evolving, test systems, on a best effort basis, and must not be used for production work or with sensitive data. Hosts are reinstalled on a regular basis and user data is permanently deleted at the end of each test slot.

Under some conditions and subject to availability, it may be possible to loan a Techlab system for longer periods, for example to pilot a build system on new platforms.

Core principles of Techlab:

Everything can be published - no NDA
Hardware is off-the-shelves - No development boards
Hardware is periodically updated based community feedback and industry trends
Multiple vendors and platforms
No production service or guaranteed availability - This is a lab!

Techlab’s request management process has seen several improvements in terms of bookkeeping and accountability over its lifetime, from handling requests via email and manually configuring the nodes, to the use of autoconfiguration tools. Despite these improvements, the request management process still relies heavily on manual intervention in several points of the system.

Here we define the next evolution of Techlab’s request management and describe the system developed to streamline this process, mitigating manual intervention and automating it as much as possible. In Section 1, we describe the current request process and its inconveniences, while in Section 2, we propose an improved way to handle the requests and specify its ideal requirements. In Section 3, we describe the design decisions taken and the Application Programming Interface (API) of the new request management system developed. Finally, in Section 4, we analyze the improvements.

1. The current request process: workflow and infrastructure

Users’ requests are handled through CERN’s Service Portal, a service management and IT support solution with ServiceNow software at its backbone.

With few exceptions, Techlab machines are enrolled on CERN’s cloud infrastructure, capitalizing on the configuration management system in place. Lifecycle management tools such as Foreman and automation frameworks such as Puppet greatly simplify the configuration and administration of Techlab’s nodes. However, up to this point, CERN only supports x86 architectures. Thus, Techlab’s ARM and PowerPC nodes have to be configured manually.

One of the most relevant concepts of CERN’s configuration system is the hostgroup, Foreman’s notion of a cluster of computers. Puppet’s configuration centerpieces are the manifest files and the Hiera (Puppet’s built-in key/value data lookup system) files. Manifests describe, in Puppet’s domain-specific language, the common configuration to be applied to the nodes in a hostgroup, while Hiera files provide key-value data in yaml format that is fed to the manifest at configuration time. This separation allows, for instance, the specification of values such as node users in a node-specific Hiera file, which is digested by the manifest to configure each node.

A request’s lifetime frequently evolves as follows:

The user fills in the request form on the Service Portal, specifying their project, the systems they would like to access, the loan period, and the users requiring access to the hosts. Once the request is submitted, the service managers are notified and one of them takes ownership of it, discussing the options with the user if needed. After reaching an agreement on the terms of the loan, e.g. the loan period or the nodes involved, the request handler fills in the corresponding fields in the request’s management form and proceeds to configure the system manually. The request is kept open until the expiration date, at which point the request handler might offer the requesting user an extension, depending on resource availability and utilization.

We can identify several points of improvement for this workflow:

It is time inefficient. While this system provides better accountability than managing requests privately (e.g., by email), request expiration still has to be checked manually on ServiceNow. This becomes too time consuming as the number of user requests grows.
It is still predominantly manual. Puppet simplifies the configuration and installation of the nodes, but the system administrator still has to describe the configuration explicitly on a per-node basis.
It is redundant. The requests and the configuration files both contain the same data, such as users, loan period and request number.

2. Proposal and requirements

We propose a new, automated request-handling system that capitalizes on existing APIs and data redundancies between the different steps of the existing solution. Thus, we define the following functional and non-functional requirements.

Functional Requirements:

Given a user request, our system should generate the node specific files needed for Puppet to configure the nodes involved in the request.
When a request expires, the system should be able to clean up the configuration files and remove all access to the machines.
Users should be notified, by email, of the expiration of their requests in advance, preferably automatically.
The notification sent to the users should contain a link to the request in case they want to apply for an extension.

Non-Functional Requirements:

The system should collect data from the requests taking advantage of CERN’s ServiceNow Python API.
Emails reminding the users of the expiration date should be sent 15 and 5 days in advance of the loan’s expiration date.
The system should be implemented based on a strict and constrained formatting of the configuration files, increasing its simplicity and reducing uncertainty.
The system should be modular, composed of a minimal Python API at its core, complemented with executable scripts implementing the request management and the sanitization of files and requests.
The automated process should use the executable scripts.
The node-specific Hiera files should contain all the relevant information of its associated requests.
The ServiceNow interface should implement caching to reduce queries to ServiceNow, due to the time consuming nature of the operation.
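The reminder requirement above reduces to a small date check. The following is a minimal sketch, not the system's actual code: the names REMINDER_DAYS, days_until and reminder_due are hypothetical, and it assumes the job runs exactly once per day.

```python
from datetime import date

# Reminder offsets taken from the requirements above (15 and 5 days).
REMINDER_DAYS = (15, 5)


def days_until(end_date: date, today: date) -> int:
    """Number of whole days between today and the loan's end date."""
    return (end_date - today).days


def reminder_due(end_date: date, today: date) -> bool:
    """True when a reminder email should go out today for this loan."""
    return days_until(end_date, today) in REMINDER_DAYS
```

Matching on exact offsets keeps the check stateless, at the cost of skipping a reminder if the daily run is missed; a more robust variant would record which reminders have already been sent.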

3. The new request management system

We can divide the final request management system into three parts: the request management API, a minimal Python module that contains one class to interface with the configuration system and another to interface with the requests on ServiceNow; the helpers, functions that implement the most common operations when handling requests; and the executables, which consist of a simple solution for synchronizing the requests and the configuration system, and another one for format enforcement and data sanitization.

In Section 3.1 we define the different calls implemented in the system API. Then, in Section 3.2, we detail the system design and the algorithms involved in the process. Finally, in Section 3.3 we describe a simple way to automate the whole process.

3.1. System API

In this section, we outline all the programming interfaces we expose to the user, dividing them into three groups: request management interfaces, helpers and executables.

Request management interfaces

The request_management Python module includes the bare minimum operations to both interface with ServiceNow and perform input/output (I/O) operations, in our constrained format, involving the configuration files. In this module, we implement the concept of a compute node, in the Node class; and a class that acts as an interface to ServiceNow, SnowSession.

from request_management import Node, SnowSession

# Build a node object from a request data dictionary
node = Node.from_request(requestObject)
# Load a node object from its yaml representation
node = Node.from_yaml("path/file.yaml")
# Write a node object to a yaml file
node.to_yaml("path/file.yaml")

# Load the ServiceNow API configuration file and set up a session
snow_session = SnowSession("config.yaml")
# Fetch and cache the open requests from ServiceNow...
# ...or retrieve them from the internal cache
snow_session.requests()
# Return the faulty requests identified in the fetching phase
snow_session.faulty_reqs()
# Return the information of a single request.
# Fetches the requests first if the request cache is empty
snow_session.single_request(req_id)

Helpers

For convenience, we provide a set of functions to exploit the most common actions in action_helpers. The helpers implemented include:

from action_helpers import *

# Return the requests expiring in remaining_days days
# Optionally don't include the (current day + remaining_days) day
reqs = requests_expiring_in_x_days(remaining_days, less_than=False)

# Clean up all traces of a request from a node
purge_req_from_node(req_id, node_name, "path/to/hiera_files.yaml")

# Remove all traces of a particular request
clean_request(req_id, hiera_path)

# Remove all traces of requests that have just expired
clean_expired_requests(hiera_path)

# Check if there are new requests.
# Updates the nodes associated with new requests if the update parameter is True
new_requests, updated_nodes = check_for_new_requests(update=False)

# Update a node from the information of a request
updated_node = update_node_from_request(node, request_id)

# List the open requests associated with a user
requested_by_user = requests_associated_to_user(username)

# Clean up a node
delete_node("path/to/node.yaml")

Executables

Finally, we provide two executables, housekeeper.py, which synchronizes the configuration files with the latest information from the requests and warns users and administrators about deadlines; and format_police.py, which checks that both node files and requests are correctly formatted according to our constrained design guidelines.

1. Check for problematic active requests
2. Check issues in request
3. Check for problematic node files (no request tags)
4. Exit/Quit

What would you like to do?

3.2. System Design and Algorithms

In this section we describe and justify the system design. First we explain how, by enforcing a strict formatting, we simplify the parsing of nodes and requests and their data structure representations. Then, we introduce the hardware-labs’ request form in CERN’s service portal and the conventions adopted when filling in the administrator fields, which allow for a deterministic retrieval of request information as a Python key-value (field-contents) data structure. Next, we outline the process of retrieving the ServiceNow forms’ information and building a list of the open requests. Finally, we describe how everything comes together and the interactions with the user programmed into the system.

Figure 1 depicts a diagram of the system, where we can identify the different components mentioned beforehand and their dependencies.

**Figure 1**: New Request Management System.

Strict formatting and metadata extensions.

Puppet’s Hiera files are yaml files structured in key-value(s) blocks. The only essential block, whose existence all other blocks depend on, is the user data block, sssd::interactiveallowusers:. In our system, each node has its associated Hiera file. For instance, the configuration data for techlab-ops.cern.ch—such as users, rootusers, and extra packages to be installed—is described in the file techlab-ops.cern.ch.yaml, and will be combined with the hostgroup’s configuration before a Puppet configuration run.

Techlab and openlab’s nodes are usually shared between users from multiple requests. One of the most common operations when assigning a node to a new request is consulting the expiration date of the existing requests associated to said node. For this purpose, Techlab and openlab’s teams agreed on a way to enrich the information on the Hiera files by introducing metadata as formatted comments on the username, where each tag is preceded by a space plus two hashes, and is followed by a space.

sssd::interactiveallowusers:
  - xvallspl ##t From: To:01/01/2020
  - acatmilton ##t From: To:01/01/2020
  - espinete ##t From: To:01/01/2020
  - apurr ##t From: To:01/10/2020

Here, we build on the metadata-as-comments idea and extend the user-associated request information in the node with more informative tags: t, start and end dates in ISO format; r, request number; p, project description; e, experiment.

sssd::interactiveallowusers:
  - xvallspl ##t From:2019-03-01 To:2019-04-30 ##r RQF1243159 ##p Automatization of request management ##e Techlab
  - murray ##t From:2019-03-01 To:2019-04-30 ##r RQF1541658 ##p Testing interfaces for cat operated GPUs ##e Catlove
sudoers:
  - xvallspl

In addition, to reduce complexity, and to constrain and simplify the parsing of the Hiera files, we impose the following restrictions:

We only accept one level of indentation, with a two-space indent.
Every block starts with an unindented key, which is followed either by a single value, or a list of values in consecutive indented lines.
Users from the user block, and only from the user block, are always followed by metadata as comments, strictly following the previously discussed format.
The t and r tags always contain complete information.

Enforcing a strict formatting of the node files and requests also allows us to simplify their data structure representation and, with it, the operations of writing to and retrieving from a Hiera file the request information associated with a specific user of the node.
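The restrictions above are what make a hand-written parser feasible. The sketch below shows one possible way to read such a file into a dictionary; parse_blocks is a hypothetical name and the error handling is an assumption. Parsing by hand, rather than with a YAML library, keeps the metadata after each username, which a YAML parser would discard as a comment.

```python
def parse_blocks(text: str) -> dict:
    """Parse the constrained Hiera layout into {key: value or list of values}."""
    blocks = {}
    current = None  # key of the list block being filled, if any
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip()
        if not stripped:
            continue
        if stripped.startswith("  - ") and current is not None:
            # Two-space indented list item belonging to the latest key.
            blocks[current].append(stripped[4:])
        elif not stripped.startswith(" ") and stripped.endswith(":"):
            # Unindented key that opens a list of values.
            current = stripped[:-1]
            blocks[current] = []
        elif not stripped.startswith(" ") and ": " in stripped:
            # Unindented key followed by a single value on the same line.
            key, value = stripped.split(": ", 1)
            blocks[key] = value
            current = None
        else:
            raise ValueError(f"line {number} breaks the format: {line!r}")
    return blocks
```

Any deeper indentation or stray line fails immediately, in the spirit of a program that rejects files that do not follow the format.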

Finally, the program fails on input that does not follow the format described above. For prevention and format debugging, we provide an executable sanitizer, format_police.py, that helps identify and solve format problems in the node files or in the ServiceNow request forms.

ServiceNow’s form and adopted conventions

To reduce potential points of failure and avoid redundant typing of data, all node configuration data is introduced by the request administrator in the ServiceNow request form, and then propagated, including the metadata typed above, by the system to the node files. Any direct manual intervention in the node files is discouraged.

The administration section of the form has dedicated fields for users, nodes and start and end loan dates, which must be agreed on with the users. In addition, we can fill in any extra configuration (sudoers, Python packages required) in the Setup Comments field of the form. Keep in mind that the configuration data described in the form will be applied to each node assigned to the request.

However, for the request to be considered valid, the administrator form fields (Figure 2) should comply with their own restrictions:

If a request’s status is in progress, the loan date fields must be filled in.
Any block of data other than users has to be expressed as a key-value(s) pair on its own line in the Setup Comments field of the form.
Any list of users or of node configuration data must be formatted as comma separated values.
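As an illustration of these conventions, the following sketch turns the text of the Setup Comments field into a dictionary. The exact syntax of a line is not spelled out here, so the "key: value, value" layout is an assumption, as is the function name parse_setup_comments. Unlike the request dictionary used by the system, this sketch drops the trailing colon from the key.

```python
def parse_setup_comments(field: str) -> dict:
    """Turn the Setup Comments text into {key: [values]}."""
    blocks = {}
    for line in field.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumed layout: the key, a colon and a space, then the values.
        key, sep, values = line.partition(": ")
        if not sep:
            raise ValueError(f"expected 'key: value[, value...]', got {line!r}")
        # Lists are comma separated; surrounding whitespace is ignored.
        blocks[key] = [v.strip() for v in values.split(",") if v.strip()]
    return blocks
```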

Fetching the requests and managing computing nodes configurations

The communication with ServiceNow is handled by the SnowSession class in the main Python module, capitalizing on CERN’s ServiceNow Python API to retrieve the information from all open requests. Each instance of the class opens a new communicating session with the server.

Because queries to ServiceNow are costly, due to its internal database structure, SnowSession only fetches requests whose status is in progress and caches the retrieved requests. Only the most relevant request information is retrieved from the ServiceNow form fields and stored in a key-value data structure (a Python dictionary). For reference, the request dictionary has the following structure:

{
  'end_date': datetime.datetime(2020, 1, 19, 0, 0),
  'experiment': u'CatHEP',
  'id': u'RQF1412862',
  'nodes': [u'techlab-gpu-canthugeverycat'],
  'opened_by': u'Aaron Purr',
  'opened_by_id': u'aaronpurr',
  'project': u"Can't hug every cat",
  'setup_comments':
    {
      u'sudoers:': [u'aaronpurr']
    },
  'start_date': datetime.datetime(2019, 9, 20, 0, 0),
  'users': [u'aaronpurr', u'a.ham', u't.jefferson']
}

Here, experiment, project and setup_comments are the only fields allowed to be empty.
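The caching behaviour described above can be sketched as follows. This is not the SnowSession implementation: the class name CachedRequests is hypothetical, and the fetch callable stands in for the real, slow ServiceNow query so that the example stays self-contained.

```python
from typing import Callable, Dict, List, Optional


class CachedRequests:
    """Fetch open requests once and serve later lookups from memory."""

    def __init__(self, fetch: Callable[[], List[dict]]):
        self._fetch = fetch  # stands in for the slow ServiceNow query
        self._cache: Optional[Dict[str, dict]] = None

    def requests(self) -> List[dict]:
        """Return all open requests, querying the backend only once."""
        if self._cache is None:
            self._cache = {req["id"]: req for req in self._fetch()}
        return list(self._cache.values())

    def single_request(self, req_id: str) -> dict:
        """Return one request, filling the cache first if it is empty."""
        self.requests()
        return self._cache[req_id]
```

Indexing the cache by request id makes single lookups cheap, and a session that lives only for one run avoids the question of when to invalidate the cache.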

The Node data structure, which describes the configuration of a computing node, can be built both from requests and from Hiera files, with the Node.from_request("request_id") and Node.from_yaml("path/to/node_file.yaml") constructors respectively, and stored as a Hiera node configuration file by calling Node.to_yaml("path/to/node_file.yaml"). A Node instance contains the following information:

Users. Users with access to the node.
Associated Requests. Requests this node configuration is associated with.
Release date. Date when the node is expected to be completely freed.
Extra keys. Extra configuration blocks applied to this node, such as sudoers, rootusers and cvmfs mounted folders.
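One way to picture this data structure is the sketch below. It is an illustration under assumptions, not the actual Node class: NodeSketch and add_request are hypothetical names, and the release date is taken to be the latest end date among the associated requests.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class NodeSketch:
    users: List[str] = field(default_factory=list)
    # Maps request id -> loan end date for every request using this node.
    requests: Dict[str, datetime] = field(default_factory=dict)
    extra_keys: Dict[str, List[str]] = field(default_factory=dict)

    @property
    def release_date(self) -> Optional[datetime]:
        """Latest end date among the associated requests, or None if free."""
        return max(self.requests.values(), default=None)

    def add_request(self, request: dict) -> None:
        """Merge one request dictionary into this node's configuration."""
        self.requests[request["id"]] = request["end_date"]
        for user in request["users"]:
            if user not in self.users:
                self.users.append(user)
        for key, values in request.get("setup_comments", {}).items():
            merged = self.extra_keys.setdefault(key, [])
            for value in values:
                if value not in merged:
                    merged.append(value)
```

Because nodes are shared between requests, merging never removes users; removal would be the job of the clean-up helpers once a request expires.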

Executing the process and warning the users

The classes described above can be used to build or edit the node configuration files from the information of a request once the user request has been accepted and the administrators have completed the corresponding fields in the ServiceNow form.

We provide a convenient way to execute the complete request management process in housekeeper.py. This executable script performs the following actions in order, leveraging the functions in action_helpers.py:

Update node configuration files with any updates in the ServiceNow requests, for instance, loan extensions.
Identify new requests and update a stored list of open requests. Apply the configuration to the respective node Hiera files.
Clean the expired requests.
Warn users of expiring requests.
Warn administrators of changes made and expired requests.

For the last two steps, we implemented a request-aware email notification system. Figure 3 shows an example email. The email template is filled in with request-specific information (the user’s first name, the nodes associated with the request, and the loan days left) and provides the user with a clickable button that links directly to the request form, where they can apply for an extension if needed.
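Filling such a template can be sketched with Python’s standard string.Template. The wording, the plain-text layout, the function name render_reminder and the link used in the usage example are all placeholders; the actual email is not reproduced here.

```python
from string import Template

# Hypothetical template; the real wording and layout are not shown in the post.
REMINDER = Template(
    "Dear $first_name,\n"
    "your loan of $nodes expires in $days_left days.\n"
    "To ask for an extension, open your request: $link\n"
)


def render_reminder(first_name, nodes, days_left, link):
    """Fill the reminder template with request-specific information."""
    return REMINDER.substitute(
        first_name=first_name,
        nodes=", ".join(nodes),
        days_left=days_left,
        link=link,
    )
```

For example, render_reminder("Aaron", ["techlab-gpu-canthugeverycat"], 5, "https://example.invalid/RQF1412862") produces a three-line message that names the node, the days left and the request link.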

3.3. Automating the process

This full process can be automated to run unsupervised on a daily basis. However, for security reasons and for modularity, the node configuration files are stored on internal CERN git repositories for their root hostgroups, to which only members of the hostgroup are granted writing access. In addition, the ServiceNow session configuration file contains secrets, so it can’t be distributed openly. To overcome this concern, a sample configuration file is stored in tbag, CERN’s IT service for managing secrets, only retrievable by members of the hardware_labs hostgroup.

With the restrictions described in mind, the automation of the process can be programmed with an authenticated cron job that runs, for instance, the following script:

# Update the hostgroup's local repository
cd <user_path>/it-puppet-hostgroup-hardware_labs/
git fetch origin
git rebase origin/master
cd -

# Retrieve the configuration file from the secrets store
tbag show --hg hardware_labs --file config.yaml config.yaml
# Run the request management process
python housekeeper.py

# Publish the changes
cd <user_path>/it-puppet-hostgroup-hardware_labs/
git add .
git commit -m "[techlab] updates from the automated system"
git push origin master

Note that this is an authenticated job that needs to be executed by the hostgroup administrators, or both the secret retrieval and the git push command would fail.

4. Conclusions

This new request system reduces redundancy and points of failure, and allows a large portion of the request-handling process to be automated.