Speed scale-out cluster
Feature availability
Available from: Resilio Active Everywhere v5.0.0
Available for: Windows, Linux, macOS Agents. Full sync shares. NFS/SMB/cloud storage. MC API.
Not available for: NAS and mobile Agents, MC Agent (regardless of its OS), caching gateways, TSS shares, Run Script Jobs.
Understanding scale-out clusters
Scale-out clusters are designed to maximize transfer speed and scalability by distributing workloads across multiple Agents. This architecture is ideal for environments that demand high-performance data movement and fault tolerance. Unlike traditional single-server configurations, a scale-out cluster dynamically assigns roles and optimizes resource utilization to eliminate single points of failure, ensuring seamless operation even if individual nodes go offline.
System requirements and expected performance
Management Console
Ensure that the Management Console meets the general system requirements.
Agents
A scale-out cluster typically includes multiple industry-standard systems (VMs or physical nodes), each running 1–2 Agents depending on available CPU cores. To maximize efficiency, scale up first - run multiple Agents per machine to utilize 10–20 Gbps of throughput; then scale out by adding more nodes as needed.
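As a rough illustration of this guidance, the sketch below estimates Agent and node counts for a target aggregate throughput. The per-Agent and per-node figures are assumptions drawn from the numbers in this section, not fixed product limits.

import math

# Back-of-the-envelope sizing sketch; the throughput constants are
# illustrative assumptions, not product limits.
PER_AGENT_GBPS = 10   # per-Agent network minimum from this section
AGENTS_PER_NODE = 2   # scale up: 1-2 Agents per machine

def size_cluster(target_gbps: float) -> tuple[int, int]:
    """Return (agents, nodes) for a target aggregate throughput."""
    agents = math.ceil(target_gbps / PER_AGENT_GBPS)
    nodes = math.ceil(agents / AGENTS_PER_NODE)
    return agents, nodes

print(size_cluster(40))  # -> (4, 2): four Agents across two nodes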
Scale-out deployment requirements:
- All Agents must run the same Resilio Agent version.
- All hosts running Agents must use the same operating system (you cannot have a mix of Agents running on, for example, Windows and Linux).
- All hosts running Agents must use the same hardware architecture (you cannot have a mix of Agents running on x64 and Arm hardware).
- Minimum number of Agents: 1.
- Maximum recommended number of Agents: 50 per scale-out cluster.
- Largest verified deployment: 40 Agents per scale-out cluster.
Reference VM recommendations:
- Google Cloud: n2-standard-8 for a single Agent
- ARM-based: c4a-standard-8
- Example cluster: five VMs (c6gn.2xlarge, 8 vCPU each) with one Agent per VM, achieving 40 Gbps upload to S3 within the same cluster. Spot instances can be used for Agents with the helper role, as assigned by the administrator.
Resource requirements:
- CPU: 8 cores per Agent on a VM. The total core count depends on the number of Agents per VM.
- RAM: 32 GB or more, depending on the number of files. In a scale-out scenario, folder trees are distributed unevenly across Agents: if a branch leader goes offline and a new branch follower is assigned, the previous branch follower retains the folder tree even if it comes back online, continuing to consume RAM. The same applies to instances manually downgraded to the helper role.
- Storage: Minimum 1 GB/s read/write speed per Agent, as measured by fio. Average indexing speed is 20,000 files/sec.
- Network: Minimum 10 Gbps per Agent, as measured by iperf between source and destination machines (TCP or UDP). To maximize scale-out performance, TCP multistreaming must be enabled for the Agents (a pre-flight check sketch is shown at the end of this section).
TCP is recommended for stable connections without packet loss, such as between data centers or between a data center and a cloud region. Most cloud providers and enterprise firewalls limit UDP traffic, while TCP connections are usually unrestricted.
ZGT is recommended for:
- High-latency connections (RTT > 200 ms)
- Connections with packet loss
- Unstable network environments
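As a hedged illustration, the sketch below wraps the fio and iperf3 tools to check a host against these minimums; the path, host, and benchmark parameters are placeholders, not prescribed values.

import subprocess

# Hypothetical pre-flight checks against the minimums above; adjust the
# placeholder path and host for your environment.
def check_storage(directory: str) -> None:
    # Sequential 1 MiB reads with direct I/O; compare the reported
    # bandwidth with the 1 GB/s per-Agent minimum.
    subprocess.run(
        ["fio", "--name=seqread", "--rw=read", "--bs=1M", "--size=4G",
         "--direct=1", f"--directory={directory}"],
        check=True,
    )

def check_network(dest_host: str) -> None:
    # 10 parallel TCP streams to an iperf3 server on the destination;
    # compare the total with the 10 Gbps per-Agent minimum.
    subprocess.run(["iperf3", "-c", dest_host, "-P", "10", "-t", "10"],
                   check=True)

if __name__ == "__main__":
    check_storage("/data/sync")    # placeholder storage path
    check_network("203.0.113.10")  # placeholder destination Agent host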
Load balancing in a scale-out cluster
Scale-out clusters incorporate automatic and configurable load-balancing mechanisms to optimize resource allocation and job execution. The key principles of load balancing include:
Jobs load balancing
The cluster leader assigns the branch leader and branch follower roles only to a specific number of Agents, based on their load, number of jobs, last assigned role, and file errors. The maximum number of branch followers per branch leader is controlled by the scaleout.redundancy_factor parameter (default: 1). Any remaining Agents are assigned the helper role.
Jobs are load-balanced in batches controlled by the scaleout.leader_job_block_size parameter (default: 10). Jobs are assigned to one Agent until its block is filled (i.e., up to 10 jobs), then to other Agents in a pseudo-random manner, and finally by available RAM.
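A minimal sketch of this batch assignment, assuming illustrative field names (the Agent's actual selection logic is internal and more involved):

import random

# Minimal sketch of the batch assignment described above; field names
# and tie-breaking details are assumptions, not the Agent's internal logic.
JOB_BLOCK_SIZE = 10  # scaleout.leader_job_block_size (default)

def pick_agent(agents: list[dict]) -> dict:
    # 1. Fill the current block: prefer an Agent with a partially filled block.
    partial = [a for a in agents if a["jobs"] % JOB_BLOCK_SIZE > 0]
    if partial:
        return partial[0]
    # 2. Start a new block on a pseudo-randomly chosen idle Agent.
    idle = [a for a in agents if a["jobs"] == 0]
    if idle:
        return random.choice(idle)
    # 3. All Agents already have full blocks: pick by available RAM.
    return max(agents, key=lambda a: a["free_ram"])

agents = [{"id": i, "jobs": 0, "free_ram": 32 - i} for i in range(3)]
for _ in range(25):
    pick_agent(agents)["jobs"] += 1
print([(a["id"], a["jobs"]) for a in agents])  # e.g. blocks of 10, 10, 5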
File transfers load balancing
File transfers are handled by helpers, ensuring optimized distribution of workload across the cluster. This follows standard scale-out operational logic.
Connectivity management
All cluster communication is routed through the branch leader, which determines internal connection endpoints. The scale-out implementation uses a redirect mechanism at the tunnel handshake level, allowing remote helpers or peers to locate and establish connections with an appropriate helper within the cluster.
Scale-out cluster roles
Group roles
Group roles define the cluster roles that an Agent in a scale-out cluster can assume.
Note
You can change the group role after the group is created.
- Cluster Member and Helper: The Agent can take any role in the cluster.
- Cluster Member: The Agent can be a cluster leader, branch leader, or branch follower but not a helper. At least one cluster member or cluster member and helper is required in the cluster. A cluster leader is elected among Agents with these roles. The cluster remains idle if a leader is not elected (e.g., all Agents with these roles are offline).
- Helper: The Agent can only serve as a helper and cannot take on other roles.
Group auto-assignment
The Add all new Agents and Add new Agents that match the rule assignment rules require at least one cluster member at creation. Auto-groups are supported, with helper as the default assigned role. Additional automation assigns specific roles to Agents through tagging. Applicable tag values: helper, cluster member, or cluster member and helper.
Cluster roles
- Cluster Leader: Responsible for assigning branch leader, branch follower, and helper roles (including to itself). Manages job status reporting. Each cluster configuration has one branch leader, with additional branch followers for load balancing. The leader builds cluster configurations based on updates from members, who report job load, memory usage, and previous role assignments.
- Branch leader: Responsible for external communication and serves as the primary source of the file tree - maintains file meta information (the file tree), scans, uploads, and downloads files. Assigns download operations to helpers.
- Branch follower: A cluster member that holds a copy of the file tree but doesn't share it outside the cluster - maintains the file tree (but not file metadata), merges it with the branch leader's within its own cluster, and scans. Its purpose is to become the new branch leader if the current one fails. A branch follower may be a helper at the same time.
- Helper: Responsible for data transfer. Requests download operations from the branch leader, executes them, and reports back. Does not store the file tree (although branch leaders or followers downgraded to helper retain it temporarily). Handles only delegated tasks.
- Cluster follower: A passive cluster member that participates in leader elections. An HA follower.
- Disconnected: A cluster member without a connection to the leader.
- Unassigned Role: An Agent that does not participate in leader elections and can go offline without disrupting the cluster.
External connections outside cluster
External connections outside the cluster are managed by the branch leader but can be delegated to a helper. Helpers attempt to reuse already established connections for downloading data, generally preferring to connect to the nearest Agent in a remote cluster or an external Agent.
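A minimal sketch of this endpoint preference, with assumed field names (the real selection logic is internal to the Agent):

# Minimal sketch of a helper's endpoint choice as described above;
# "nearest" is approximated by lowest round-trip time, an assumption.
def choose_endpoint(established: list[dict], remote_peers: list[dict]) -> dict:
    if established:
        return established[0]  # reuse an already established connection
    return min(remote_peers, key=lambda p: p["rtt_ms"])  # nearest remote Agent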
Failover
Two types of failovers may occur in a scale-out cluster.
- Cluster roles failover: Triggered when the cluster leader loses connection to other cluster members and a new leader is elected.
- Cluster transfers failover: Occurs when a branch leader or branch follower changes due to failure.
Cluster roles failover
This failover involves switching roles within the cluster. It occurs when the cluster leader disconnects from other cluster members due to a network disruption, the Agent process stopping, or a host system restart. Failover is determined by connectivity to other Agents, not to the Management Console (MC): if the leader stays connected to Agents but loses MC access, failover won't occur.
During failover, Agents in the scale-out cluster report the "failover" status in the job run and job event logs, and job progress is temporarily halted.
The process typically takes around 30 seconds. A new cluster leader is elected using the Raft algorithm and continues the job. If no leader is elected, failover continues until one is elected or the process times out.
The cluster leader configures the cluster based on updates from members, who report the number of active jobs, memory usage, and their role in the last configuration known to them. The new cluster configuration is built in two steps:
- Selecting from Agents that are already branch leaders or branch followers.
- Filling the remaining roles.
At all stages, Agents with file errors or excessive memory usage (over 90%) are excluded from selection.
While cluster failover is essentially a role re-assignment that tries to preserve previously assigned roles to minimize data transfer interruptions, it affects all jobs of the cluster and may slow down operations.
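The following sketch illustrates this two-step re-assignment; the exclusion thresholds come from the text above, while the ordering heuristics within each step are simplified assumptions.

# Simplified sketch of the two-step role re-assignment described above.
def eligible(agent: dict) -> bool:
    # Agents with file errors or memory usage over 90% are excluded.
    return not agent["file_errors"] and agent["mem_usage"] < 0.90

def rebuild_roles(agents: list[dict], redundancy_factor: int = 1) -> dict:
    candidates = [a for a in agents if eligible(a)]
    # Step 1: prefer Agents that already held branch leader/follower roles.
    keep = [a for a in candidates
            if a["last_role"] in ("branch_leader", "branch_follower")]
    # Step 2: fill the remaining roles from the other eligible Agents,
    # least loaded first (illustrative ordering).
    rest = sorted((a for a in candidates if a not in keep),
                  key=lambda a: (a["active_jobs"], a["mem_usage"]))
    ordered = keep + rest
    slots = 1 + redundancy_factor  # one branch leader plus its followers
    roles = {}
    for i, a in enumerate(ordered):
        if i == 0:
            roles[a["id"]] = "branch_leader"
        elif i < slots:
            roles[a["id"]] = "branch_follower"
        else:
            roles[a["id"]] = "helper"
    return roles

roles = rebuild_roles([
    {"id": "a1", "file_errors": False, "mem_usage": 0.40,
     "last_role": "branch_leader", "active_jobs": 3},
    {"id": "a2", "file_errors": False, "mem_usage": 0.95,  # excluded: >90% RAM
     "last_role": "branch_follower", "active_jobs": 2},
    {"id": "a3", "file_errors": False, "mem_usage": 0.30,
     "last_role": "helper", "active_jobs": 0},
])
print(roles)  # a1 keeps branch leader; a3 is promoted to branch follower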
Cluster transfers failover
It is similar to failover in High Availability groups. It is triggered by a branch leader failure due to:
- Disconnection from branch followers in the job.
- Deletion of the identifying .sync/ID file.
- A database error.
- Storage misconfiguration, causing branch followers to reference a different storage location than the branch leader.
Some of the operations are interrupted by the failover and started again after it, for example:
- Initial indexing of the job (Agent scanning the folder and building the folder database). The new branch follower will have to scan the whole folder before it continues the job.
- Trigger execution in a Distribution/Consolidation job; the new branch leader will start these from scratch.
- File downloads from the leader. If the Lazy indexing parameter in the Profile is enabled, remote Agents outside the scale-out cluster will restart active downloads from the new leader.
- Metadata recalculation. While file hashes are retained, metadata remains on the previous leader and needs to be recalculated.
Agents in such a cluster won't be able to properly detect file changes, and the job will get stuck. If metadata synchronization is disabled, rescanned files will receive a new Date modified timestamp. This can cause the receiving Agents to detect the files as updated, leading to either a re-upload of the files to cloud storage or simply an update of the Date modified timestamp without changing the file content. In Distribution or Consolidation Jobs, the destination Agents won't be able to seed such files.
Using scale-out clusters in Jobs
All Agents in the scale-out cluster synchronize the same physical location. If the storage location is misconfigured, the followers report the error and cease synchronization.
Only a direct path or a storage connector location is supported. Path macros are not supported for scale-out clusters; if used, the behavior is undefined. Changing the job path for an already configured job is not supported: remove the cluster from the job and add it again with the new path.
Resilio supports different Job configurations: cluster-to-cluster and cluster-to-Agents.
It's recommended to enable TCP multistream and increase the number of allowed connections for the Agents in the cluster before creating the Job:
{
'net.tcp.streams': '10', // 10 is the optimal value for the majority of cases
'transfer_peers_limit': 'X', // where X = (size of cluster) * 3, e.g. '15' for a five-Agent cluster
'overwrite_changes': 'yes' // otherwise, Agents in the cluster won't be able to properly detect file changes and the job will get stuck
}
The following parameters are enforced for scale-out clusters in jobs:
{
'disk.unwritten_max_load_factor': '5',
'disk.unwritten_async_max_load_factor': '10',
'disk.out_of_order_threads': '1'
}
Job run details and progress (number of files, size of the job, etc.) are reported by the cluster leader, which collects them from the branch leaders. The cluster leader reports the total dataset, while branch leaders report the part of the dataset they manage. Helpers' information is not taken into consideration.
The overall scale-out cluster transfer speed is calculated as the sum of the speeds of all Agents in the cluster. The total cluster file count and size are reported per branch leader.
To get a correct picture of how the job run is progressing, all leaders of the scale-out clusters in the job must be elected (not in the process of failover) and online (connected to the MC). Otherwise, the reported job progress will be reduced accordingly.
Additionally, a more detailed transfer status is available for each leader in its overview in the job run.

In addition to the existing statuses, scale-out cluster members report the following:
- working - Agent is a branch leader in the scale-out cluster and is performing a task. Check the leader Agent for details.
- active - Agent is a helper in scale-out cluster and is downloading or seeding files.
- inactive - Agent is passive and does not perform any activity in the scale-out cluster.
Supported Job types
Scale-out clusters can be used in Synchronization, File Caching, and Hybrid Work Jobs with the limitations listed in the Peculiarities and limitations section.
Synchronization of file permissions is also supported. A whole scale-out cluster can be selected as a Reference Agent.
Cross-platform synchronization of file permissions
Using scale-out clusters where the only Agent assigned the cluster member role runs on a system that cannot apply the replicated permissions (for example, a Linux Agent with replicated NTFS permissions, or vice versa) should be avoided. It may lead to unexpected permission issues and access problems. Always ensure that scale-out clusters are used on systems with compatible file permission structures.
Distribution/Consolidation jobs
Scale-out clusters can be used as source or destination in these jobs.
Triggers in the job are executed by the branch leader in the scale-out cluster. If failover happens during script execution, the newly elected branch leader starts executing the script from scratch. The Before finalizing download trigger is not supported for scale-out clusters.
An active job run can be stopped on any Agent from the scale-out cluster; the job run is then stopped for all Agents in the cluster.
Adding new Agents to an active job run with a scale-out cluster in it is not supported.
Restarting a job run that has a scale-out cluster in it is not supported; Agents will report an error about the misconfigured storage path.
The job run is aborted on all Agents in the scale-out cluster if an error from the "Abort on error" list appears on the cluster leader. The "Agent offline" error is ignored on all scale-out Agents.
AGENT_NAME and AGENT_ID tags are supported for scale-out clusters. The name and ID of the scale-out cluster are used instead of the Agent's.
Peculiarities and limitations
- Scale-out cluster does not support:
  - Agents older than version 5.0.0
  - Mobile Agents, Agents installed on NAS devices, the Management Console Agent
  - Run Script Jobs
  - Path macros
  - Changing the Job path
  - Adding new Agents to a Job run, or restarting a Job run, that has a scale-out cluster in it
  - Network policy rules
  - The Before finalizing download trigger
  - Job priorities
  - Pausing the Job run or a single Agent in the Job run. This causes cluster failover and may lead to re-indexing of the job from scratch.
- File query results are reported only from the cluster leader.
- Scale-out clusters are not compatible with Priority Agent functionality.
- There is no automatic support for scaling up or down.
- A temporary error "Share's identifying .sync/ID file is broken" may appear for scale-out clusters after changing the job type or recreating the job using the same file storage.
- The differential sync implementation is incomplete; helpers may re-download pieces.
- Split download of a single large file is not supported in OneDrive because OneDrive requires sequential writes.
- Helpers won't join a download if their tunnel peer is already seeding to their cluster.
- If a job completes its initial syncing and a new Agent is then added to a non-reference RW scale-out cluster, the initial sync is expected to restart. However, it finishes too quickly, and if the cluster leader changes to this new Agent during that time, it may lead to permission misalignment with the reference cluster.

