Query Optimization
The straight forward way to execute a query is using query.ForEachEntity()
.
This implementation is super compact and provide a direct result during development. E.g.
This execution type creates a delegate (lambda) and the callback for every result add additional execution costs.
There are various ways to improve query execution significant.
Query creation remains unchanged for all variants by using the same Query<>()
instance.
Only the code related for query execution is different and require additional boilerplate code. This approach enables:
Query optimization can be postponed. E.g. a specific query turns out to be a bottleneck.
All optimization variants require only local code changes.
Query optimizations can be achieved by using various query technics like:
Query optimization examples are part of the unit tests see: Tests/ECS/Examples
Info In general friflo ECS provides in many areas only a single API to implement a specific behavior. This avoids confusion to select a one of multiple API variants if available. Query optimization and Batch operations are an exception to this rule. The reason is to provide provide room for significant performance gains if needed.
Boosted Query
🔥 Update - Introduced new query approach using Friflo.Engine.ECS.Boost.
Optimization: Use query.Each()
and a struct implementing Execute(...)
for maximum query performance.
This query approach is the most performant approach - except Query vectorization / SIMD. A unique feature of Friflo.Engine.ECS - it uses no unsafe code. This enables running the dll in trusted environments. For maximum performance unsafe code is required to elide bounds checks.
Instead of processing components of a query with query.ForEachEntity(...)
the MoveEach
struct below process components in its Execute()
method.
The processing of all query components is performed by query.Each(new MoveEach())
.
The performance gain compared with query.ForEachEntity(...)
is ~3x.
The method query.Each()
requires adding the dependency Friflo.Engine.ECS.Boost.
Enumerate Query Chunks
Optimization: Replace a query.ForEachEntity(( ... ) => lambda)
by two foreach
loops.
This approach avoids the more expensive lambda calls.
Also as described in the intro enumeration of a query result is fundamental for an ECS. Components are returned as Chunk's and are suitable for Vectorization - SIMD.
🔥 Update - Added example code for slower iteration alternative. The alternative should be used only for small result set with less than 10 entities.
Parallel Query Job
Optimization: Execute a query
on multiple CPU cores in parallel.
To minimize execution time for large queries a QueryJob can be used. It provides the same functionality as the foreach loop in example above but runs on multiple cores in parallel. E.g.
To enable running a query job a ParallelJobRunner is required.
The runner can be assigned to the EntityStore
or directly to the QueryJob
.
A ParallelJobRunner
instance is thread-safe and can / should be used for multiple / all query jobs.
In case of structural changes inside ForEach((...) => {...})
use CommandBuffer.Synced to record the changes.
These changes are adding / removing components, tags or child entities and the creation / deletion of entities.
Note: CommandBuffer
is not thread safe. CommandBuffer.Synced
is thread safe.
After RunParallel()
returns these changes can be applied to the EntityStore
by calling CommandBuffer.Playback()
.
Re-use QueryJob
The QueryJob
returned by query.ForEach(...)
is intended to be re-used.
This ensures that subsequent calls to queryJob.RunParallel()
do not perform allocations on GC heap.
Recommendation A parallel query achieves notable performance gains in case using only arithmetic computations like * / + - sin(), cos(), ... in the loop.
In case of using a CommandBuffer
and and applying massive entity changes the single threaded version is typically faster.
The reason is that entity changes applied to a CommandBuffer
requires heavy random memory access.
If doing this on multiple threads the CPU cores are competing with access to memory heap and CPU caches.
Query Vectorization / SIMD
Optimization: Utilize SIMD of your CPU to execute a query
.
SIMD architectures: SSE, AVX, AVX2, AVX-512, AdvSIMD, Neon, ...
The most efficient way to speedup query execution is vectorization. Vectorization is similar to loop unrolling - aka loop unwinding - but with hardware support. Its efficiency is superior to multi threading as it requires only a single thread to achieve the same performance gain. So other threads can still keep running without competing for CPU resources.
Note: Vectorization can be combined with multi threading to speedup execution even more. In case of a system with high memory bandwidth the speedup is speedup(SIMD) * speedup(multi threading). If SIMD or multi threading alone already reaches this bandwidth bottleneck their combination provide no performance gain.
The API provide a few methods to convert chunk components into System.Runtime.Intrinsics - Vectors.
E.g. AsSpan256<>
and StepSpan256
. See all methods at the Chunk - API.
The Span
retrieved from a chunk component has padding components at the end to enable vectorization without a scalar remainder loop.
The following examples shows how to increment all MyComponent.value
's by 1.
Last updated