Memory leak mitigation and failure prevention

In a cloud infrastructure, it is important to detect and mitigate memory leaks in critical infrastructure software before catastrophic failure takes place.

Problem description

Many long-running software services are susceptible to memory leaks. A service with a memory leak gradually consumes more and more memory as it runs, ultimately compromising its performance and stability. For critical infrastructure software, it is important to detect such leaks and prevent the catastrophic failures they can cause.

Fault class

  • Software error

  • Performance degradation

OpenStack projects used

There are at least three solution architectures.

  1. Local: mitigation decisions and actions are conducted locally on each server.

  2. Central: mitigation decisions and actions are orchestrated by a central service.

  3. Delegated: mitigation policy is defined centrally, but decisions and actions are carried out locally on each server.

Local

In the local architecture, a local utility can make the decision to restart a process or take other mitigating actions when the process exceeds certain fixed memory thresholds specified by the cloud operator. Candidate implementations include:

  • custom scripts.

  • native memory limit mechanisms (e.g., cgroups) that kill a process when its memory usage becomes too high, leaving another mechanism to restart it.

  • Monit.
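
As an illustration of the custom-script option, the following minimal sketch checks a process's resident memory against an operator-set limit. It assumes a Linux host where /proc/<pid>/status contains a VmRSS line. The function names and any limit values are hypothetical and are not taken from any of the tools listed above.

```python
from typing import Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Typical Linux format: "VmRSS:      123456 kB"
            return int(line.split()[1])
    return None  # e.g. kernel threads expose no VmRSS line


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the resident set size of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_rss_kib(status_file.read())


def exceeds_limit(rss_kib: Optional[int], limit_kib: int) -> bool:
    """True when a measurement exists and is above the operator-set limit."""
    return rss_kib is not None and rss_kib > limit_kib
```

A wrapper script could call read_rss_kib periodically and trigger a restart whenever exceeds_limit returns true. How the restart is performed is left to the operator's service manager.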

Central

In the central architecture, mitigation decisions are made by a central service that can use cloud-level information and policy not available locally. Mitigation actions can include the orchestration of graceful failovers that involve multiple servers.

There are three logical components to a solution.

  • Memory usage collection:

    • Monasca (monasca-agent)

    • Nagios

    • Zabbix

  • Mitigation decision:

    • Congress

    • Vitrage

    • Watcher

  • Mitigation action:

    • Mistral

    • Watcher

Delegated

The delegated architecture might be implemented as a mixture of the above two.

Remediation class

Proactive / preemptive

Fault detection

Definitive detection of memory leaks is an unsolved problem. For the purpose of this use case, a suspected memory leak can be identified using operator-set limits or a more generic procedure based on memory usage history and other relevant information.
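
One simple way to use memory usage history is to estimate how fast a process's memory is growing. The sketch below fits a least-squares line to (timestamp, usage) samples and flags the process when the slope exceeds a chosen rate. This is only an illustrative heuristic, not a method prescribed by this use case: the function names and the min_rate parameter are hypothetical, and steady growth can also come from legitimate causes such as cache warm-up.

```python
from typing import Sequence


def growth_rate(timestamps: Sequence[float], usage: Sequence[float]) -> float:
    """Least-squares slope of usage over time (usage units per second)."""
    n = len(timestamps)
    if n < 2 or n != len(usage):
        raise ValueError("need at least two paired samples")
    mean_t = sum(timestamps) / n
    mean_u = sum(usage) / n
    # Covariance of time and usage, and variance of time (both unnormalized).
    cov = sum((t - mean_t) * (u - mean_u) for t, u in zip(timestamps, usage))
    var = sum((t - mean_t) ** 2 for t in timestamps)
    if var == 0:
        raise ValueError("timestamps must not all be equal")
    return cov / var


def leak_suspected(timestamps: Sequence[float], usage: Sequence[float],
                   min_rate: float) -> bool:
    """Flag a suspected leak when memory grows faster than min_rate."""
    return growth_rate(timestamps, usage) > min_rate
```

For example, samples taken every 60 seconds that rise by 10 MiB each time give a slope of about 0.167 MiB per second, which would be flagged with min_rate set to 0.1.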

Inputs and decision-making

Inputs:
  • Memory usage by process.

  • Memory usage by server.

  • (Potentially) Memory usage history.

  • (Potentially) A list of candidate services or processes for memory leak mitigation.

Decision making:
  • The simplest case is when the operator prescribes memory limits for each relevant process. Mitigating actions are taken when the prescribed memory limits for a service/process are breached.

  • The appropriate memory limits for each service/process might be determined by an inductive algorithm. The subject is under active investigation by the research community (for selected references, see References).

  • When there are no prescribed memory limits, decisions can be made on the basis of a more generic procedure or policy. For example, a policy sketch may be as follows.

    • When a server’s overall memory usage exceeds 90% of available memory for a period of 10 minutes, take mitigating actions on the candidate services or processes, prioritized by parameters such as:

      • Each service’s total memory usage.

      • Each service’s historical memory usage.

      • Risk and level of disruption of mitigating action taken upon each service.
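
The policy sketch above can be expressed in code roughly as follows. This is a minimal illustration under stated assumptions: the Candidate fields, the integer disruption score, and the ordering rule (least disruptive first, then largest memory consumer) are hypothetical choices. Historical usage is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Candidate:
    name: str
    memory_bytes: int  # current total memory usage of the service
    disruption: int    # hypothetical operator-assigned cost; lower is safer


def sustained_breach(samples: Sequence[Tuple[float, float]],
                     threshold: float = 0.90,
                     window_s: float = 600.0) -> bool:
    """samples: (timestamp_s, used_fraction) pairs in ascending time order.

    True when the most recent unbroken run of samples above the threshold
    spans at least window_s seconds.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    run_start = None
    # Walk backwards from the newest sample until usage drops to the threshold.
    for ts, used in reversed(samples):
        if used <= threshold:
            break
        run_start = ts
    return run_start is not None and latest - run_start >= window_s


def prioritize(candidates: List[Candidate]) -> List[Candidate]:
    """Least disruptive first; among equals, largest memory consumer first."""
    return sorted(candidates, key=lambda c: (c.disruption, -c.memory_bytes))
```

With the default arguments, sustained_breach corresponds to the 90% for 10 minutes condition. When it returns true, mitigation would proceed through the list returned by prioritize until server memory usage drops back below the threshold.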

Remediation

Two main mitigating approaches are available:

  • Restart the service experiencing the memory leak.

  • Orchestrate a graceful failover.

Existing implementation(s)

Existing implementations are available for the local architecture. See Local.

Future work

If there is operator interest in the central or delegated architectures, future work would include implementing the architectures using the referenced projects and documenting the results.

Dependencies

Not applicable.

References

Matthias Hauswirth and Trishul M. Chilimbi. 2004. Low-overhead memory leak detection using adaptive statistical profiling. SIGOPS Oper. Syst. Rev. 38, 5 (October 2004), 156-164. DOI=http://dx.doi.org/10.1145/1037949.1024412

Vladimir Sor, Tarvo Treier, and Satish Narayana Srirama. 2013. Improving statistical approach for memory leak detection using machine learning. In 2013 IEEE International Conference on Software Maintenance, 544-547.