Retrying groups of tightly coupled tasks in Ansible

weixin_0010034

Published 2022-08-05 03:25:26

In some cases, the best way to handle failure in a distributed system is simply to try again. The same idea applies when configuring a fleet of machines with Ansible. Say, for example's sake, that a task fails on its first attempt for 10% of the targeted hosts because of some race condition, but those same hosts succeed on the second attempt. We could try to solve the race condition itself in many ways, but sometimes we have neither the time nor the required control over the system to do so. In that case, retrying the task may be the simplest, most efficient way to solve the problem.

I ran into this problem at work while writing a configuration to provision a rack of servers with Ansible. Typically, if an Ansible task may fail on the first try, one can repeat it like so:

- name: Some task that might fail
  failing_task:
    some: setting
  register: outcome
  retries: 3
  delay: 10
  until: outcome.result == 'success'

This is great! Ansible will now retry the task up to three times, waiting 10 seconds between attempts. At the end of each attempt the until condition is evaluated; if it does not evaluate to true, the task is repeated, provided there are retries left.
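For a more concrete picture, here is a minimal sketch of the same pattern using the built-in command module. The script path is hypothetical and stands in for whatever step occasionally fails; the condition assumes that a zero exit status means success.

```yaml
- name: Run a provisioning step that sometimes loses a race
  command: /usr/local/bin/provision-step   # hypothetical script path
  register: outcome
  retries: 3               # how many retries to allow
  delay: 10                # seconds to wait between attempts
  until: outcome.rc == 0   # stop retrying once the command exits with status 0
```

The registered variable is refreshed on every attempt, so the until expression always sees the result of the most recent run.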

Now, let's say that I have a group of tasks that needs to be repeated on failure, not just one as in the example above. Grouping tasks is possible in Ansible with a block, so maybe this will work:

---
- name: Group of tasks that are tightly coupled
  block:
  - name: Setup for the next task that needs to run after each failed attempt
    setting_up:
        some: prerequisite action

  - name: Some task that might fail
    failing_task:
        some: setting
    register: outcome
  retries: 3
  delay: 10
  until: outcome.result == 'success'

This, unfortunately, will not work, as Ansible does not currently support retries on a block. If you find yourself needing to repeat a group of tasks, this will work:

# filename: coupled_task_group.yml
- name: Group of tasks that are tightly coupled
  block:
    # 0 on the first pass, then incremented on every recursive include
    - name: Increment the retry count
      set_fact:
        retry_count: "{{ 0 if retry_count is undefined else retry_count | int + 1 }}"

    - name: Setup for the next task that needs to run after each failed attempt
      setting_up:
        some: prerequisite action

    - name: Some task that might fail
      failing_task:
        some: setting
  rescue:
    # Base case: give up once the group has been retried five times
    - fail:
        msg: Maximum retries of grouped tasks reached
      when: retry_count | int == 5

    - debug:
        msg: "Task Group failed, let's give it another shot"

    # Recursive step: include this same file again (the module is include_tasks)
    - include_tasks: coupled_task_group.yml

This looks really strange at first. You might ask why the block's rescue section includes the very task file in which the tasks are declared. What's happening is that we're using recursion to repeat the group of tasks until one of two things happens: either the tasks succeed (on the first try or on a retry), or the group has been retried five times, at which point the when condition on the fail task is met and throws us out of the loop. As with any recursive function, a base case is vital; without one, the function may call itself indefinitely. The same holds here: if we forgot to increment the retry_count variable on each pass, Ansible would keep retrying for as long as the tasks kept failing, until stopped by the user.
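To start the recursion, the task file has to be included once from a play. The following is a minimal sketch of such an entry point; the playbook name site.yml is hypothetical, and it assumes coupled_task_group.yml sits next to the playbook.

```yaml
# filename: site.yml (hypothetical playbook name)
- name: Provision hosts, retrying the coupled task group on failure
  hosts: all
  tasks:
    # Entry point of the recursion: this include runs the group once, and the
    # rescue section of the task file re-includes the file on failure.
    - name: Run the tightly coupled tasks
      include_tasks: coupled_task_group.yml

    # Reached only on hosts where the group eventually succeeded.
    - name: Report how many retries were needed
      debug:
        msg: "Task group succeeded after {{ retry_count }} retries"
```

Two things are worth keeping in mind with this approach. Values set with set_fact stay with the host for the rest of the run, so including the file a second time in the same play would continue counting from the earlier retry_count. Also, unlike the single-task version, nothing here waits between attempts; if a delay is wanted, one option might be a wait_for task with only a timeout placed in the rescue section before the include.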

In the future, this approach shouldn't be needed as a PR to add this functionality to Ansible is nearing completion. To see if it's been merged, check here.
