Wednesday, September 7, 2016

Monitoring Distributed Systems – Lessons from the RELEASE project

The observer effect tells us that we cannot observe a system without disturbing it; on the other hand, without observation we are unable to correct, evolve or optimise systems. In this post we first discuss the general question of observing distributed systems, and then we look at some of the ideas and prototypes for more effective monitoring that come out of the RELEASE project, which was focussed on building scalable distributed systems in Erlang.

Monitoring


We begin by exploring what it means to monitor a distributed system, how it is done, and the uses to which monitoring data can be put. We will, admittedly loosely, use the term “monitoring” to cover the activities of monitoring, observation, tracing and visualisation: what these have in common is that they extract information from a system beyond what it is required to produce in order to function, telling us how it is performing that function.

After setting out the various dimensions of monitoring, we conclude by discussing the ways in which monitoring can load a system. This sets the scene for a discussion of how to mitigate this loading, in the context of Erlang distributed systems.

Why monitor?


We monitor to make sure that a system is delivering what it should. This might be a matter of ensuring that it can maintain performance over time, perhaps as unread messages accumulate in mailboxes. It could mean observing how (eventual) consistency is managed, so that the system can respond appropriately despite disruptions to its infrastructure. It could also have a more active purpose: allowing the system to adapt to changing circumstances, whether through real-time reactive behaviour or through longer-term code evolution.

What are we monitoring?


In a distributed system two kinds of data can be observed: first, data originating at a single node, such as CPU usage or trace messages for particular function calls; and second, data arising from the traffic between different nodes.

As the examples just given show, these data can be event-based, such as trace messages about function calls or other system events; or they can be sampled observations, such as CPU usage. While the user controls how often a property of the system is sampled, trace messages are generated according to how the system behaves, and so the rate and scale of their production are not under user control.

Different levels of monitoring are possible, too. Simple data, such as CPU or memory usage, can be gathered for all systems, whereas in many cases the most useful data are application-aware, reflecting more directly the semantics of the system being observed.
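To make the sampled case concrete, here is a minimal sketch in Erlang of a poller that gathers a few node-level metrics at a user-chosen interval; the interval, the choice of metrics, and the use of io:format as a stand-in for shipping data to a collector are all illustrative choices, not a prescribed design.

    %% A minimal sketch: poll a few node-level metrics at a fixed interval.
    -module(sampler).
    -export([start/1]).

    start(IntervalMs) ->
        spawn(fun() -> loop(IntervalMs) end).

    loop(IntervalMs) ->
        Sample = #{memory    => erlang:memory(total),           % bytes in use
                   run_queue => erlang:statistics(run_queue),   % runnable processes
                   processes => erlang:system_info(process_count)},
        io:format("~p~n", [Sample]),   % stand-in for shipping to a collector
        timer:sleep(IntervalMs),
        loop(IntervalMs).

Trace messages, by contrast, arrive whenever the traced events happen, at whatever rate the running system generates them.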

When do we monitor?


Monitoring can be online: information about a system is available in real-time, so that problems can be identified and rectified as a system runs. A drawback of this approach is that data analysis is limited by the fact that it must produce results on the basis of what has happened up to a point during execution. It must also do this in a timely way, using a limited amount of data.

On the other hand, more complex and complete analyses can be produced post hoc using data generated during the system execution. The two are combined in a hybrid approach, in which offline analyses are repeatedly generated for successive portions of an execution.

How do we react?


It may be that the results of monitoring are presented to an operator, and it is up to her to decide how to react to what she sees, reconfiguring or rebooting parts of a system for instance. It is also possible to design automated systems that use monitoring data: these range from a simple threshold trigger fired, for example, by CPU loading reaching a particular level, to a fully autonomic system.
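At the simple end of that spectrum, a threshold trigger needs only a few lines. The sketch below, with an illustrative threshold and reaction, flags an Erlang process whose mailbox has grown past a limit; a real system might restart the process or shed load rather than merely logging.

    %% Sketch of a threshold trigger: warn when a process's mailbox
    %% exceeds a limit. Threshold and reaction are illustrative choices.
    check_mailbox(Pid, Threshold) ->
        case erlang:process_info(Pid, message_queue_len) of
            {message_queue_len, N} when N > Threshold ->
                error_logger:warning_msg("mailbox of ~p at ~p messages~n",
                                         [Pid, N]);
            _ ->
                ok
        end.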

The burden of monitoring


Monitoring requires data to be produced by the system, and for these data to be shipped around the system to a central monitoring point where storage, analysis and presentation can take place. We’ll look at these in more detail once we have discussed a particular concrete case: Erlang/OTP.

Erlang


Erlang is a concurrent, distributed, fault-tolerant programming language that comes with the Open Telecom Platform (OTP) middleware library. In distributed Erlang each node runs a copy of the BEAM virtual machine and runtime; separate nodes can run on the same host or on different hosts. The “out of the box” Erlang distribution model provides full connectivity between all nodes, but this can be modified by manually deploying and connecting “hidden” nodes, which are unconnected by default. Erlang nodes can also be partitioned into global groups, with each node in at most one group. (Note: there is current work on improving the scalability of distributed Erlang: link.)
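The difference between ordinary and hidden nodes can be seen from the Erlang shell; the host names here are placeholders.

    %% Connecting nodes from the shell. A node started with "erl -hidden"
    %% joins only the nodes it is explicitly connected to and is not
    %% shared transitively with the rest of the cluster.
    (a@host1)1> net_adm:ping(b@host2).   % connect; visible nodes are gossiped
    pong
    (a@host1)2> nodes().                 % ordinary (visible) connections
    [b@host2]
    (a@host1)3> nodes(hidden).           % hidden connections, if any
    []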

SD Erlang is an adaptation of distributed Erlang designed to be more scalable by limiting all-to-all connectivity to groups (called “s_groups”). These groups may overlap, thus providing indirect connectivity between nodes in different s_groups. SD Erlang also allows non-local spawning of processes on nodes chosen by various attributes, including measures of proximity.
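As a sketch of how this looks in code, the following uses the s_group API as described in the RELEASE project's documentation; exact signatures may vary between releases, and the node names and do_work/0 are placeholders.

    %% Create an s_group confining connectivity to its members, then
    %% spawn a process on a chosen member node in the usual way.
    {ok, group1, _Nodes} =
        s_group:new_s_group(group1, [node1@host, node2@host]),
    Pid = spawn(node2@host, fun() -> do_work() end).   % do_work/0 is hypothetical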

Improving monitoring 


In this section we look at a number of ways of improving the monitoring of distributed systems; these are illustrated by means of tools built to monitor Erlang and SD Erlang systems, but the lessons are more generally applicable.

SD-Mon is a tool designed specifically to monitor SD Erlang systems. It does so by means of a “shadow” network of agents, which collect and transmit monitoring data gathered from a running system. The agents run on separate nodes, and so interfere as little as possible with computations on the system nodes; their interconnections are also separate from the system's, so that transmitting the gathered data imposes no additional load on the system network.

SD-Mon is configured automatically from a system description file to match the initial system configuration. It also monitors changes to the s_group structure of a system and adapts to them automatically.



Filter data at source: BEAM changes


The Erlang BEAM virtual machine is equipped with a sophisticated tracing mechanism that allows trace messages to be generated by a wide range of program events. Tracing in this way typically generates a substantial volume of data, which must then be filtered out downstream. We were able instead to modify the tracing system itself to filter some messages “at source”, avoiding the generation and subsequent filtering of irrelevant data. These changes make the data gathered for Percept2 more compact.
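For context, the stock tracing facility already lets a tracer select events quite finely; the sketch below enables call tracing for a single function, with the calling shell as the tracer. Even so, every matching event still becomes a trace message, which is the cost the in-VM filtering avoids.

    %% Standard BEAM call tracing: trace calls to lists:sort/1 made by
    %% all newly spawned processes.
    erlang:trace(new, true, [call, timestamp]),
    erlang:trace_pattern({lists, sort, 1}, true, [global]).
    %% Matching calls then arrive at the tracer as messages such as:
    %%   {trace_ts, Pid, call, {lists, sort, [...]}, Timestamp}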

Enhanced analytics: Percept2


Percept2 provides post hoc, process-level visualisation of the working of the BEAM virtual machine on a single Erlang node. When a node runs on a multicore device, different processes can run on different cores, and Percept2 enhances the analysis of the original Percept (which is part of the Erlang distribution) in a number of ways. For instance, it distinguishes between processes that are potentially runnable and those that are actually running, and it collects runtime information about the dynamic calling structure of the program, thus supporting application-level observation of the system. Multiple instances of Percept2 can be run to monitor multiple nodes.
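A typical session, sketched here on the assumption that Percept2's entry points mirror the original Percept's (profile a run to file, analyse it, then browse the result); my_app:run/0 is a placeholder for the code under observation.

    percept2:profile("run.dat", {my_app, run, []}, [all]),  % trace a run to file
    percept2:analyze(["run.dat"]),                          % post hoc analysis
    percept2:start_webserver(8888).                         % browse at localhost:8888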

Improved infrastructure: Percept2


The post hoc analysis of Erlang trace files for visualisation in Percept2 is computationally intensive, but is substantially speeded up by processing the logs in parallel with concurrent processes on a multicore machine. Some trace information is also logged more efficiently using OS-level tracing with DTrace.
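The parallelisation pattern is the familiar one of farming independent chunks out to worker processes. This sketch is a generic illustration, not Percept2's actual code, and analyze_file/1 stands in for the per-file analysis pass.

    %% Farm trace files out to one worker process each, then collect
    %% the partial analyses in order.
    parallel_analyze(Files) ->
        Parent = self(),
        Refs = [begin
                    Ref = make_ref(),
                    spawn_link(fun() -> Parent ! {Ref, analyze_file(F)} end),
                    Ref
                end || F <- Files],
        [receive {Ref, Result} -> Result end || Ref <- Refs].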

Network information: Devo / SD-Mon


Existing monitoring applications for Erlang concentrate on single nodes. In building Devo, which visualises inter-node messaging intensity in real time, it was necessary to analyse Erlang trace information to identify inter-node messages. Devo and SD-Mon also show information about messaging through nodes that lie in the intersection of two s_groups and so provide the route for inter-group messages.
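Picking inter-node messages out of a trace stream amounts to comparing the sender's and receiver's nodes. The sketch below illustrates the idea (it is not Devo's actual code), using the trace events produced by the 'send' trace flag.

    %% Classify a send trace event as local or inter-node.
    handle_trace({trace, From, send, _Msg, To}) when is_pid(To) ->
        case node(To) =:= node(From) of
            true  -> local;
            false -> inter_node    % message crossed a node boundary
        end;
    handle_trace(_Other) ->
        ignored.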

Monitoring and deployment: WombatOAM


WombatOAM provides support for Erlang and other distributed systems that run on the BEAM virtual machine. As well as providing a platform for deploying systems, it also provides a dashboard for observing metrics, alerts and alarms across a system, and can be integrated with other monitoring and deployment infrastructure. WombatOAM is a commercial product from Erlang Solutions Ltd, developed on the RELEASE project.

Conclusions


The aim of this post has been to raise the general issue of monitoring advanced distributed systems. Having done so, we have been able, among other things, to illustrate some of the challenges that arise in monitoring distributed Erlang systems.

We are very grateful to the European Commission for its support of this work through the RELEASE project: EU-ICT Specific targeted research project (STREP) ICT-2012-287510. Thanks also to Maurizio Di Stefano for his contributions to this post.