Agent Resiliency Guidelines
Occasionally an agent executing in our RelativityOne Compute platform will encounter connectivity issues, which can result in unhandled agent shutdowns or unpredictable behavior. We recommend adding retry logic, combined with detailed logging, around HTTP requests and fileshare operations. Agent developers can use the following pattern to control their agent's shutdown cycle more effectively.
The following examples use the Polly 5.7.0 package, which is the only Polly version the Relativity agent framework supports. Do not add this assembly to your application; Polly.dll is available for use by default.
Example retry pattern
Before implementing retries you should understand which errors are fatal and which are transient (likely to resolve on retry). In the following example, a PolicyBuilder handles all exceptions and filters them through a method that determines whether the exception is transient. This is where you add your own logic, based on testing failure scenarios in your agent, to decide which errors are transient. For example, a 401 Unauthorized error is likely not transient; a failed request in this state is unlikely to be resolved by a retry.
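As a starting point, the following is a minimal sketch of what such a filter might look like. It is illustrative only: the exception types treated as transient here are assumptions, and you should replace them with the failure types you actually observe when testing your agent.
// Illustrative sketch only: substitute the exception types and conditions
// that your own failure testing shows to be transient.
private bool IsTransientFailure(Exception ex)
{
    // Network-level failures (dropped connections, name resolution hiccups) are usually worth retrying.
    if (ex is System.Net.Http.HttpRequestException || ex is System.Net.WebException)
    {
        return true;
    }

    // Timeouts and transient I/O or socket failures (for example, fileshare access) often resolve on retry.
    if (ex is TimeoutException || ex is System.IO.IOException || ex is System.Net.Sockets.SocketException)
    {
        return true;
    }

    // Everything else (for example, a 401 Unauthorized surfaced as a service exception)
    // is treated as fatal so the agent does not retry requests that cannot succeed.
    return false;
}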
Retry pattern
// Define how we want to evaluate errors and handle exceptions.
var httpPolicyBuilder = Policy<QueryResult>.Handle<Exception>(ex => IsTransientFailure(ex));

// The following WaitAndRetry policy handles retries for a specified number of consecutive failures (MAX_CONSECUTIVE_FAILURES).
// The sleepDurationProvider function calculates the delay from the retryCount, allowing for a gradual increase in waiting time between calls.
// Additionally, an onRetry delegate logs details about the exception before each retry, aiding in troubleshooting.
var httpRetryPolicy = httpPolicyBuilder.WaitAndRetryAsync(
    retryCount: MAX_CONSECUTIVE_FAILURES,
    sleepDurationProvider: retryCount => TimeSpan.FromSeconds(Math.Pow(2, retryCount)),
    onRetry: (outcome, sleepDuration, retryCount, context) =>
    {
        _logger.LogError(outcome.Exception, $"An issue was detected with the host while making an HTTP request. Retry count: {retryCount}/{MAX_CONSECUTIVE_FAILURES}." +
            $" Remaining attempts until agent shutdown: {MAX_CONSECUTIVE_FAILURES - retryCount}.");
    }
);

// The fallback policy will be executed after all retries are exhausted.
// This is where you can forcibly shut down the agent, allowing it to restart in a healthy state.
// ShutDownAgentUnhealthyHost returns void, so the fallback supplies a default QueryResult to satisfy the policy's return type.
var httpFallbackPolicy = httpPolicyBuilder.FallbackAsync(cancellationToken =>
{
    ShutDownAgentUnhealthyHost();
    return Task.FromResult<QueryResult>(null);
});

// Wrap both policies; the fallback (outer) runs only after the retry policy (inner) gives up.
var httpPolicyWrapper = Policy.WrapAsync(httpFallbackPolicy, httpRetryPolicy);

// An example HTTP request using Object Manager, wrapped in the retry policy wrapper.
QueryResult objectManagerQueryResult = await httpPolicyWrapper.ExecuteAsync(() => MakeHttpRequest());
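For example, if MAX_CONSECUTIVE_FAILURES is defined as 5 (the value is up to you), the sleepDurationProvider above waits 2, 4, 8, 16, and 32 seconds between successive attempts; only after the final attempt fails does the fallback policy run.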
Shut down the agent
In the RelativityOne Compute platform, when retries fail or are not possible, it can be advantageous to exit the agent altogether and receive a new one. This approach applies to a variety of failure scenarios, including SQL timeouts. Shutting down and restarting the agent can add a few minutes to your agent's lifecycle or workflow, but it does not require end-user intervention to retry or restart jobs.
In the previous example you added a fallback policy; the following shutdown method is executed when all of the retries are exhausted. We recommend using this approach to stop the agent when the failure prevents your agent from completing its work. Shut down the existing agent by setting DidWork to false. As long as the agent's workload discovery API is returning a Size greater than None, a new agent will start in a healthy state to handle the work.
// We want to force the agent to exit in the event of a host failure.
// Setting DidWork to false will force the agent to break execution.
// If the agent's workload discovery endpoint is reporting a workload size greater than None, a new agent will start to handle the job.
private void ShutDownAgentUnhealthyHost()
{
    _logger.LogFatal("The agent was unable to do work because the host was not healthy or retries were exhausted. Now exiting agent.");
    base.DidWork = false;
}
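For context, the following sketch shows one way to bridge a synchronous Execute override to the asynchronous policy work shown earlier. It is illustrative only: ExecuteAsync and MakeHttpRequest are hypothetical helper names, and the sketch assumes your agent's base class exposes a synchronous Execute entry point and the DidWork property used above.
// Illustrative sketch: bridging a synchronous Execute override to the async policy work.
// ExecuteAsync and MakeHttpRequest are hypothetical helpers standing in for your agent's own logic.
public override void Execute()
{
    ExecuteAsync().GetAwaiter().GetResult();
}

private async Task ExecuteAsync()
{
    // Build the policies exactly as shown in the retry pattern example above.
    var httpPolicyBuilder = Policy<QueryResult>.Handle<Exception>(ex => IsTransientFailure(ex));
    var httpRetryPolicy = httpPolicyBuilder.WaitAndRetryAsync(
        retryCount: MAX_CONSECUTIVE_FAILURES,
        sleepDurationProvider: retryCount => TimeSpan.FromSeconds(Math.Pow(2, retryCount)),
        onRetry: (outcome, sleepDuration, retryCount, context) =>
            _logger.LogError(outcome.Exception, $"Transient failure; retry {retryCount}/{MAX_CONSECUTIVE_FAILURES}."));
    var httpFallbackPolicy = httpPolicyBuilder.FallbackAsync(cancellationToken =>
    {
        ShutDownAgentUnhealthyHost();
        return Task.FromResult<QueryResult>(null);
    });
    var httpPolicyWrapper = Policy.WrapAsync(httpFallbackPolicy, httpRetryPolicy);

    QueryResult result = await httpPolicyWrapper.ExecuteAsync(() => MakeHttpRequest());
    if (result == null)
    {
        // The fallback ran: ShutDownAgentUnhealthyHost has set DidWork to false,
        // so skip the remaining work and let the agent exit and restart in a healthy state.
        return;
    }

    // Continue with the agent's normal work using the result...
}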