Scaling and Reliability - Windows Event Collection (aka Windows Event Forwarding)


Enterprise Scaling and Reliability

Performance and Scaling

Scaling WEC is a matter of scaling out, not up. Most of our customers run collectors on modest VMs with 8 GB RAM and 2-4 cores. Assuming reasonably modern hardware, we haven’t found other resources such as NIC or storage to be a limiting factor for WEC.

Building a bigger collector (scaling up) does not result in a linear increase in WEC throughput. Our most successful clients opt for more small collectors rather than a few large ones.

TIP: We recommend configuring the WinRM and WEC services to each run in their own process (sc.exe config <service> type= own) instead of sharing a process, and its memory and resources, with other Windows services. If you do, be sure to also configure the URL ACLs for each listener.
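A minimal sketch of the service-isolation part of this tip. It assumes the usual short service names, WinRM and Wecsvc, and only builds the sc.exe command lines rather than running them. The URL ACL changes are deliberately left out because the required values depend on your Windows version.

```python
# Assumed service short names: WinRM (Windows Remote Management) and
# Wecsvc (Windows Event Collector). Verify them on your collector first.
SERVICES = ("WinRM", "Wecsvc")


def own_process_commands(services=SERVICES):
    """Return sc.exe argument lists that set each service to run in its own process."""
    # sc.exe takes the option name and its value as separate tokens: "type=" "own".
    return [["sc.exe", "config", name, "type=", "own"] for name in services]


# Print the commands so they can be reviewed and run from an elevated prompt.
for argv in own_process_commands():
    print(" ".join(argv))
```

Generating the commands instead of executing them keeps the sketch safe to run anywhere; on a real collector you would review the output and apply it during a maintenance window.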

Load balancing between multiple collectors

The only disadvantage to using multiple collectors for scaling is that you have to assign different forwarders to each collector. Even though WEC is WinRM at the network level, and WinRM is ultimately HTTP, you cannot put a web or DNS load balancer between forwarders and collectors. This is because each collector keeps track of its forwarders and, crucially, of the bookmarks for each source log on each forwarder. Bookmarks are the placeholders that ensure WEC doesn’t miss or repeat events. Besides the burden of forwarder assignment, you must also duplicate the same subscriptions on each collector, differing only in the forwarders assigned to each collector.

How Supercharger does load balancing

Supercharger completely automates dynamic load balancing of any number of forwarders between your available collectors, for both Active Directory and Entra ID based environments. First you create a special Supercharger object called a LoadBalancer. Then:

Assign the collectors across which you want to spread the load of your forwarders. (You can easily add new or replace failed collectors anytime.)
Specify which forwarders should be distributed among the assigned collectors using either:
- Active Directory environments: Groups or LDAP queries specifying the desired computer accounts
- Entra-Joined PC environments: MS Graph query specifying the desired Windows 11+ devices
Select a Managed Filter (a Supercharger object comprising an XPath query) that specifies the source logs and events to forward
Select a destination log - Supercharger can create the log if needed on each collector

Supercharger queries AD or Entra for your desired computers and then assigns them evenly to each collector, creating the necessary subscriptions on each collector. Supercharger takes into account the status of each AD computer / Entra device and its "last seen" time to make sure that there's no imbalance between collectors due to inactive forwarders. As new computers are provisioned and old ones decommissioned, Supercharger adjusts the assignments to keep the load even across all the collectors in the load balancer.
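The idea of even assignment with inactive forwarders filtered out can be sketched as follows. This is an illustration only, not Supercharger's implementation: the function name, the data shapes and the 30-day idle threshold are all assumptions.

```python
from datetime import datetime, timedelta


def assign_forwarders(forwarders, collectors, now, max_idle=timedelta(days=30)):
    """Spread active forwarders evenly across collectors (round-robin).

    forwarders: dict of computer name -> (enabled, last_seen datetime)
    collectors: list of collector names
    Returns a dict of collector name -> list of forwarder names.
    """
    if not collectors:
        raise ValueError("at least one collector is required")
    # Skip disabled accounts and devices not seen recently, so that stale
    # entries do not make one collector look busier than it really is.
    active = sorted(
        name
        for name, (enabled, last_seen) in forwarders.items()
        if enabled and now - last_seen <= max_idle
    )
    assignment = {c: [] for c in collectors}
    for i, name in enumerate(active):
        assignment[collectors[i % len(collectors)]].append(name)
    return assignment


now = datetime(2024, 1, 31)
inventory = {
    "PC1": (True, datetime(2024, 1, 30)),
    "PC2": (True, datetime(2023, 6, 1)),    # stale: not seen for months
    "PC3": (False, datetime(2024, 1, 30)),  # disabled account
    "PC4": (True, datetime(2024, 1, 29)),
    "PC5": (True, datetime(2024, 1, 28)),
}
print(assign_forwarders(inventory, ["WEC1", "WEC2"], now))
# {'WEC1': ['PC1', 'PC5'], 'WEC2': ['PC4']}
```

A plain round-robin like this reshuffles forwarders whenever the inventory changes. A production balancer would keep existing assignments stable and only place new or orphaned forwarders, because moving a forwarder to a different collector puts log continuity at risk.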

How many forwarders per collector?

Windows Event Collection performance and scaling, like that of many technologies, is complex. We hesitate to provide rules of thumb in terms of number of forwarders because the quantity of events varies so widely depending on which logs and events you collect and whether your forwarders are workstations or servers. Other variables include your Windows audit policy, which can greatly impact the volume of security events, and, especially in the case of servers, the workload of specific forwarders. We have seen MS documents that suggest limiting collectors to 2,000 forwarders, but we have customers with 10,000 forwarders per collector. Until you know the average events per second (EPS) produced by forwarders in your environment, it is better to plan around total EPS than around forwarder count.

The reliable way to scale WEC

The best way to scale WEC is to begin with 100-1,000 forwarders and your planned subscription design, that is, the logs and events collected as determined by your XPath queries. Observe the CPU and memory utilization over a period of time likely to cover regular peaks, such as a week or month. Extrapolate from there. Then double the number of forwarders to verify that your projection holds. This will enable you to find out what your collector configuration can handle with your collection profile. You should aim for a 75-80% average utilization of collector resources at peak periods. Then take the total number of forwarders and divide it by how many forwarders your testing indicates one of your collectors can handle; the result is the number of collectors you need.
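As a worked example of that arithmetic, here is a small sketch. It assumes utilization grows roughly linearly with forwarder count, which is exactly the assumption the doubling test is meant to verify. The function names and sample numbers are illustrative.

```python
def forwarders_per_collector(tested_forwarders, peak_pct, target_pct=75):
    """Linearly extrapolate how many forwarders one collector can carry.

    tested_forwarders: number of forwarders in the pilot
    peak_pct: utilization (percent) observed at peak during the pilot
    target_pct: utilization (percent) you want at peak in production
    """
    if not 0 < peak_pct <= 100:
        raise ValueError("peak_pct must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return tested_forwarders * target_pct // peak_pct


def collectors_needed(total_forwarders, per_collector):
    """Round up: a partially loaded collector is still a collector."""
    return (total_forwarders + per_collector - 1) // per_collector


# Pilot: 1,000 forwarders drove the collector to 30% at peak.
capacity = forwarders_per_collector(1000, peak_pct=30)  # 2500
print(capacity, collectors_needed(40000, capacity))     # 2500 16
```

Use the scarcer of CPU and memory as the utilization figure, and rerun the calculation after the doubling test in case the growth turns out not to be linear.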

Your SIEM and other downstream consumers

Remember that getting events to your collector’s destination log is not the whole journey. Your downstream consumers have to keep up with the flow as well. We’ll discuss this more under Log Continuity.

Reliability

You can build a highly reliable and scalable logging pipeline with WEC. We have customers collecting events from more than 100,000 forwarders with strict availability requirements. But you do need to understand the unique architecture and idiosyncrasies of WEC.

WEC Health

WEC can occasionally throw errors or silently stop working. The errors are often confusing, and the silent failures are even more concerning.

We have, by necessity, become experts on WEC problems and we have assiduously worked to build this expertise into Supercharger. Here are key health issues we have isolated over the years and the remedies we’ve tested in the field.

Collector Availability

Common high-availability technologies such as Windows clustering and load balancers cannot be leveraged for WEC. However, hypervisor clusters supporting VM failover are highly effective for providing collector availability. Of course, a collector can still fail inside the VM, for instance due to a Windows update problem. So you still need a way to deal with a failed collector while preserving the flow of your logging pipeline: either quickly replace the collector with a standby, or redistribute the failed collector’s load (forwarders) to the remaining collectors until it is back up. You do not want to be in the position of having to stand up a new Windows server.

How Supercharger manages WEC health

Supercharger constantly monitors each collector, subscription (including the status of each assigned forwarder) and destination log. We identify over 50 different problem conditions, automatically remediate when possible, and otherwise alert you through the health status color of the object on the Supercharger dashboard and, if desired, by email. Supercharger understands the hierarchy of WEC objects: a problem health status on a lower object, such as a subscription, bubbles up as appropriate to higher objects like Collectors and LoadBalancers.
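To illustrate the idea of health status bubbling up a hierarchy, here is a minimal sketch. The status names and the "worst child wins" rule are assumptions made for the example; the actual rules Supercharger applies are not described here.

```python
from dataclasses import dataclass, field

# Hypothetical status names, ordered from best to worst.
SEVERITY = {"healthy": 0, "warning": 1, "critical": 2}


@dataclass
class Node:
    """A WEC object (load balancer, collector, subscription) with its own status."""
    name: str
    status: str = "healthy"
    children: list = field(default_factory=list)

    def effective_status(self):
        """Return the worst status found on this node or anything below it."""
        worst = self.status
        for child in self.children:
            child_status = child.effective_status()
            if SEVERITY[child_status] > SEVERITY[worst]:
                worst = child_status
        return worst


subscription = Node("Security-Subscription", status="critical")
collector = Node("WEC1", children=[subscription, Node("Application-Subscription")])
balancer = Node("LoadBalancer", children=[collector, Node("WEC2")])
print(balancer.effective_status())  # critical
```

Keeping each node's own status separate from its effective status means the dashboard can show both where a problem is visible and where it originated.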

Supercharger Load Balancers play a key role in reliability, and we recommend implementing all your subscriptions via load balancer objects even if your actual forwarder load does not require it. By using Load Balancers and other Supercharger policy objects such as Managed Event Filters, Managed LDAP and Graph queries, Collector Policies and Subscription Policies, you externalize all the configuration of your WEC environment so that even with a complete collector failure you lose nothing. Simply provision a new collector or assign a hot-standby to the load balancer and your WEC environment is healed.

Log Continuity

WEC does a reliable job of forwarding complete event logs, automatically recovering from network outages or collector reboots, and picking up where it left off on each source log. Between the forwarder and collector, the only ways events can be lost are:

Forwarder is delayed (by interrupted network communication or a failed collector) long enough for the entire source log to fill with unforwarded events, resulting in the oldest events being overwritten by newer events.
Load Balanced scenario: Forwarder is reassigned to the same subscription but on a different collector. Newly subscribed forwarders normally pick up only the most recent events unless you set ReadExistingEvents. Supercharger LoadBalancers never reassign forwarders to a different collector unless you determine a failed collector cannot be restored to health soon enough and you remove it from the LoadBalancer.
You change the XPath event filter on an existing subscription and ReadExistingEvents is false.

If you set a subscription's ReadExistingEvents to true, it is possible to get duplicate events in the latter 2 situations above.

But log continuity extends beyond the collector. Events can also be lost if your SIEM or other downstream consumer:

Is down long enough for the collector's destination log to fill up with unconsumed events, in which case the oldest events will be overwritten by newly received events
Is consistently consuming events at a rate lower than the collector is receiving events from forwarders
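Both downstream cases come down to the same arithmetic: how long before the backlog of unconsumed events exceeds what the destination log can hold. A rough sketch follows. It assumes you can express the log's capacity as an event count (for example, maximum log size divided by an average event size measured in your environment); the function and parameter names are illustrative.

```python
def seconds_until_overwrite(log_capacity_events, incoming_eps, consumed_eps, backlog_events=0):
    """Estimate how long until unconsumed events start being overwritten.

    Returns None when the consumer keeps up (backlog is not growing).
    """
    net_growth = incoming_eps - consumed_eps
    if net_growth <= 0:
        return None
    free_events = max(log_capacity_events - backlog_events, 0)
    return free_events / net_growth


# SIEM completely down: 2,000,000-event log filling at 500 EPS.
print(seconds_until_overwrite(2_000_000, incoming_eps=500, consumed_eps=0))    # 4000.0
# SIEM up but slow: it consumes 400 of the 500 incoming EPS.
print(seconds_until_overwrite(2_000_000, incoming_eps=500, consumed_eps=400))  # 20000.0
```

The second case is the more dangerous one in practice, because nothing is visibly down while the backlog grows.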

How Supercharger helps track log continuity

For continuity between forwarder and collector, Supercharger constantly identifies forwarders that fail to send events within a specified time period. If you've defined the subscription's policy as critical, it immediately flags the subscription as unhealthy due to insufficient forwarders, alerts you, and highlights the unhealthy forwarders.
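The general shape of such a check can be sketched as below. The data layout, the silence window and the 90% reporting threshold are assumptions chosen for illustration, not Supercharger's actual policy settings.

```python
from datetime import datetime, timedelta


def silent_forwarders(last_event_times, now, max_silence):
    """Return forwarders that have sent nothing within the allowed window.

    last_event_times: dict of forwarder name -> datetime of last event, or None
    """
    return sorted(
        name
        for name, last in last_event_times.items()
        if last is None or now - last > max_silence
    )


def subscription_healthy(last_event_times, now, max_silence, min_reporting_pct=90):
    """Healthy when at least min_reporting_pct percent of forwarders are reporting."""
    total = len(last_event_times)
    if total == 0:
        return False  # treated as unhealthy here: nothing is reporting at all
    reporting = total - len(silent_forwarders(last_event_times, now, max_silence))
    return reporting * 100 >= min_reporting_pct * total


now = datetime(2024, 1, 31, 12, 0)
seen = {
    "SRV1": datetime(2024, 1, 31, 11, 55),
    "SRV2": datetime(2024, 1, 30, 9, 0),   # silent for more than a day
    "SRV3": None,                          # never reported
    "SRV4": datetime(2024, 1, 31, 11, 0),
}
print(silent_forwarders(seen, now, timedelta(hours=4)))     # ['SRV2', 'SRV3']
print(subscription_healthy(seen, now, timedelta(hours=4)))  # False
```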

In load balanced subscriptions, Supercharger never reassigns a forwarder to a different collector unless absolutely necessary, that is, when you replace a permanently failed collector with a new one.

Supercharger allows you to mandate the use of ReadExistingEvents via subscription policy. ReadExistingEvents is an important WEC feature that mitigates the possible event-loss situations discussed above.

For continuity between collector and downstream consumers like SIEMs, Supercharger offers 2 new features:

Incoming log volume tracking. Supercharger writes its own special events to the collector's Application log, recording the quantity of events received by each destination log per specified time period. You can compare these numbers to the quantity of events consumed by your SIEM to find out whether it is keeping up.
Continuity tracer events. Supercharger can produce special tracer events in each destination log with a timestamp and counter. You can report on these events in your SIEM and identify any gaps.
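On the SIEM side, spotting gaps in tracer events can be as simple as the sketch below. It assumes the tracer counter increases by exactly 1 per tracer event; check the actual counter behavior before relying on it.

```python
def find_counter_gaps(counters):
    """Return (first_missing, last_missing) ranges between observed tracer counters."""
    seen = sorted(set(counters))  # tolerate duplicates and out-of-order arrival
    gaps = []
    for previous, current in zip(seen, seen[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


# Counters extracted from tracer events found in the SIEM for one destination log.
print(find_counter_gaps([1, 2, 3, 7, 8, 10]))  # [(4, 6), (9, 9)]
```

Each missing range points to a window in which ordinary events may also have been lost, which you can bound using the timestamps of the tracer events on either side of the gap.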
