Performance with System.Numerics and Multithreading #12

Open
GabrielMotaAlexandre opened this issue Sep 25, 2023 · 20 comments

Comments

@GabrielMotaAlexandre
Contributor

GabrielMotaAlexandre commented Sep 25, 2023

First of all thanks for the port.

I refactored the code to use Vector3 instead of RcVec3f, and DtCrowd ran roughly 4x faster.

Parallelism also improved a lot.

Why is RcVec3f being used instead?

@ikpil
Owner

ikpil commented Sep 25, 2023

Compatibility Issues

System.Numerics.Vector3 leverages SIMD (Single Instruction, Multiple Data) instructions for optimized vector operations, which can be very fast for certain operations. However, SIMD operations can significantly vary in performance depending on the data structures and types of operations involved.

For example, using Vector3 may actually be slower for performing operations on small-sized vectors or simple scalar operations. Furthermore, SIMD extensions may not be supported on all hardware, making it unavailable in certain environments.
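
Whether the runtime actually accelerates System.Numerics types on a given machine can be checked at startup. The snippet below is a minimal sketch, not part of DotRecast: `SimdProbe` and its members are hypothetical names, and it only relies on `Vector.IsHardwareAccelerated` and `Vector3.Dot` from System.Numerics. Both dot-product methods compute the same value; only the code the JIT emits may differ.

```csharp
using System;
using System.Numerics;

public static class SimdProbe
{
    // True when the JIT maps System.Numerics vector operations onto SIMD instructions.
    public static bool IsAccelerated => Vector.IsHardwareAccelerated;

    // Plain scalar dot product, the kind of code a hand-written vector struct uses.
    public static float DotScalar(float ax, float ay, float az, float bx, float by, float bz)
        => ax * bx + ay * by + az * bz;

    // The same operation through System.Numerics; it falls back to scalar code
    // when hardware acceleration is not available.
    public static float DotVector(Vector3 a, Vector3 b) => Vector3.Dot(a, b);
}
```

Logging `SimdProbe.IsAccelerated` alongside benchmark numbers makes results from different CPUs and virtual machines easier to compare.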

@GabrielMotaAlexandre
Contributor Author

GabrielMotaAlexandre commented Sep 25, 2023

I couldn't find any discussion of the downsides. I can imagine very few cases where Vector3 would be slower, but overall I thought it would be worth it.

@ikpil
Owner

ikpil commented Sep 26, 2023

Could you please provide your environment?
Could you provide me with the DLL that you've built?
Would you like to try running the build artifacts on a different CPU environment?

@GabrielMotaAlexandre
Contributor Author

I edited my comment in case it was misinterpreted.

@ikpil
Owner

ikpil commented Sep 27, 2023

@ikpil
Owner

ikpil commented Oct 28, 2023

I've added a new branch with a version that uses SIMD. While it needs testing on various architectures, it has already increased performance by over 50% on my current laptop. I'll continue with more R&D, and once I'm confident it's safe, I'll plan to merge it into the main branch.

@GabrielMotaAlexandre

@GabrielMotaAlexandre
Contributor Author

GabrielMotaAlexandre commented Oct 28, 2023

Thanks, great news.

@galvesribeiro

Hello folks!

Any updates on the SIMD support?

We are using it on the server side for an unannounced MMO, and results with this port are good. However, we would indeed like to leverage SIMD here.

On that subject: are DtCrowd.RequestMoveTarget(), AddAgent() and RemoveAgent() thread-safe?

Thanks!

@ikpil
Owner

ikpil commented Feb 28, 2024

  1. SIMD will be supported. However, we are currently working on fixing the SOH issue first.

  2. By default, RecastNavigation is not thread-safe, and the ported DotRecast is not thread-safe either. Therefore, either isolate each instance to a single thread or use a pool of query objects so that each thread works with its own instance.

@galvesribeiro

@galvesribeiro

galvesribeiro commented Feb 28, 2024

Thanks for the reply @ikpil

We only use DtCrowd and NavQuery/NavMesh. In that case, I guess we should hold a SemaphoreSlim(1,1) whenever we call Add/Remove agent and Update.

Would that be enough to protect the write operations while still allowing reads from any thread without a lock?

Again, thanks for the great work on this port!
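
The SemaphoreSlim(1,1) idea can be sketched as a small wrapper that serializes every call into a non-thread-safe object. This is only an illustration under assumptions: `AsyncGuard<T>` and `RunAsync` are hypothetical names that do not exist in DotRecast, and the wrapped object would be the DtCrowd instance. The sketch makes no claim that reads outside the gate are safe; if Update mutates agent state, reads probably need to go through the same gate or work on copied data.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of DotRecast: all access to the wrapped
// object goes through a single-entry semaphore.
public sealed class AsyncGuard<T> : IDisposable where T : class
{
    private readonly T _inner;
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public AsyncGuard(T inner) => _inner = inner ?? throw new ArgumentNullException(nameof(inner));

    // Waits asynchronously for the gate, runs the work, then always releases.
    public async Task<TResult> RunAsync<TResult>(Func<T, TResult> work, CancellationToken ct = default)
    {
        await _gate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            return work(_inner);
        }
        finally
        {
            _gate.Release();
        }
    }

    public void Dispose() => _gate.Dispose();
}
```

A caller would write something like `await guard.RunAsync(crowd => crowd.AddAgent(pos, option))`, so waiting callers yield instead of blocking a thread.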

@ikpil
Owner

ikpil commented Feb 28, 2024

I haven't tested this, but it should give you the general idea.
@galvesribeiro

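        using System;
        using System.Collections.Concurrent;
        using System.Threading.Tasks;
        // Also import the DotRecast namespaces that define DtCrowd, DtCrowdAgent,
        // DtCrowdAgentParams and RcVec3f (they differ between DotRecast versions).

        public class DtCrowdManager
        {
            private readonly DtCrowd _crowd;
            private readonly ConcurrentQueue<Action> _requests;

            public DtCrowdManager(DtCrowd crowd)
            {
                _crowd = crowd;
                _requests = new ConcurrentQueue<Action>();
            }

            // AddAsync - thread-safe. The request is queued and executed inside Update().
            // Callers should only read from the returned DtCrowdAgent.
            public Task<DtCrowdAgent> AddAsync(RcVec3f pos, DtCrowdAgentParams option)
            {
                var tcs = new TaskCompletionSource<DtCrowdAgent>(TaskCreationOptions.RunContinuationsAsynchronously);
                _requests.Enqueue(() =>
                {
                    try
                    {
                        // Runs on the update thread, so DtCrowd is never touched concurrently.
                        tcs.SetResult(_crowd.AddAgent(pos, option));
                    }
                    catch (Exception e)
                    {
                        tcs.SetException(e); // surface the failure to the awaiting caller
                    }
                });

                return tcs.Task;
            }

            // RemoveAsync - thread-safe. The request is queued and executed inside Update().
            public Task<bool> RemoveAsync(DtCrowdAgent ag)
            {
                var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
                _requests.Enqueue(() =>
                {
                    try
                    {
                        _crowd.RemoveAgent(ag);
                        tcs.SetResult(true);
                    }
                    catch (Exception e)
                    {
                        tcs.SetException(e);
                    }
                });

                return tcs.Task;
            }

            // Must be called from one thread only.
            public void Update(float dt)
            {
                // Apply all pending add/remove requests before stepping the simulation.
                while (_requests.TryDequeue(out var action))
                {
                    action.Invoke();
                }

                _crowd.Update(dt, null);
            }
        }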

@galvesribeiro

ConcurrentQueue does a hard lock internally. So the Enqueue() call is essentially blocking the thread.

We use DotRecast in a server which is based on Microsoft Orleans. Locks in that context are extremely harmful. So I guess in our case the SemaphoreSlim would be better.

@mellinoe

Any update on this? I'm not very concerned about the performance difference between vector types, but having to convert all of my vectors anytime I interact with this library makes it a lot more painful than necessary. Most game- and graphics-related libraries are using System.Numerics at this point, and it can't be overstated how much more convenient things are when everything is consistent.
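
The conversion overhead described here usually ends up as a pair of extension methods at the library boundary. The sketch below is illustrative only: `LibVec3` is a hypothetical stand-in for a library-specific vector type such as RcVec3f (whose actual field names may differ), and `ToNumerics`/`ToLib` are made-up helper names.

```csharp
using System.Numerics;

// Hypothetical stand-in for a library-specific vector type such as RcVec3f.
public struct LibVec3
{
    public float X, Y, Z;

    public LibVec3(float x, float y, float z)
    {
        X = x;
        Y = y;
        Z = z;
    }
}

public static class VecConversions
{
    // Copy component by component; no assumptions about memory layout.
    public static Vector3 ToNumerics(this LibVec3 v) => new Vector3(v.X, v.Y, v.Z);

    public static LibVec3 ToLib(this Vector3 v) => new LibVec3(v.X, v.Y, v.Z);
}
```

Every call across the boundary then needs a `.ToLib()` or `.ToNumerics()`, which is exactly the friction a shared vector type would remove.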

@ikpil
Owner

ikpil commented May 29, 2024

> Any update on this? I'm not very concerned about the performance difference between vector types, but having to convert all of my vectors anytime I interact with this library makes it a lot more painful than necessary. Most game- and graphics-related libraries are using System.Numerics at this point, and it can't be overstated how much more convenient things are when everything is consistent.

In conclusion, I plan to make the change around November 2024 when dotnet 6 support ends.

The issue here is a crash occurring due to memory corruption during SIMD operations in a specific environment.

Here's how I tested it:

  1. I built a DLL with dotnet 6 in release mode on Windows 64-bit with an AMD Ryzen 5600x.
  2. I copied this DLL to a 64-bit Rocky Linux 9 guest running under Hyper-V on 64-bit Windows and ran it there.
  3. At some point during specific SIMD operations, a crash occurred.
  4. Upon investigation, I found out that the crash happened during Vector3 SIMD operations.

So, I switched to using RcVec3f.

Here are similar reported issues:

@ikpil
Owner

ikpil commented May 29, 2024

The work to make the change is already completed, and I'm concerned that if we switch now, there might still be people using early versions of dotnet 6 who could experience crashes. 😓

@kaoraswoo

kaoraswoo commented Jun 10, 2024

Thank you for maintaining a project that is still being actively updated; it has been a great learning resource for me.
I would like to ask one more question about multi-threaded design for server-side use.

On our server, entities (agents) are managed through space partitioning (a quadtree or sectors). The update tick is called separately for each partitioned space, and the agents in that space are updated accordingly.

In the DtCrowd code and the code you provided, all agents are processed through GetActiveAgents within DtCrowd.

If agents are partitioned by space, what would be the best approach to handle this?

Is it structurally feasible to override DtCrowd's Update or GetActiveAgents so that it processes only a targeted subset of agents?

Additionally, I found another post of yours on the Unity Forum, where you replied to a question similar to this one:

The first approach is to implement it directly.

  • I used this method because I have a lot of monsters.

Another approach is to use a Crowd Manager with partitioning.

  • If the partitions are well-defined, you can even run it with multiple threads.

It seems your own approach is the direct one, without DtCrowd. Is that right?

You also mentioned partitioning for multiple threads. Could you explain both approaches in more detail?

@ikpil
Owner

ikpil commented Jun 11, 2024

> If agents are partitioned by space, what would be the best approach to handle this?

Could you provide more details on what it means for agents to be partitioned by space?

@kaoraswoo

kaoraswoo commented Jun 12, 2024

> > If agents are partitioned by space, what would be the best approach to handle this?
>
> Could you provide more details on what it means for agents to be partitioned by space?

To make the situation clearer, here is a more detailed explanation (written in Korean and then translated).

  • Entity: The unit of a moving object on the server. It owns a NavMesh agent.

  • Field: A unit that has one NavMesh map and a list of Sectors.

  • Sector: A logical grid cell (similar to a tile) at fixed 10 m intervals that Entities belong to. It is also the server's broadcasting unit.

A Field, which contains one navmesh, is spatially partitioned into Sectors at a fixed interval (10 m), and each Entity belongs to a Sector based on its position (it moves between Sectors as its position changes). An Entity's position is the same as the position of the CrowdAgent it holds (it is updated through the reference).

The Sectors' Update functions are scheduled by distance, so that Entities in Sectors updated at the same time cannot influence each other, and they are called concurrently from multi-threaded tasks.

For example, with Sectors numbered 1 to 100, 10 thread tasks call, within the same tick, the Update functions of Sectors that do not affect one another.
Each Sector's Update function then calls the Update function of the Entities in its area and performs its processing.

However, DtCrowd currently runs its logic in a single update tick on one DtCrowd instance (which includes the agent list). In a situation like mine, where each sector needs its own update tick (covering agents that are outside each other's influence range in the same tick), it does not seem possible to split up the update of DtCrowd's agents, which is why I asked.

I was wondering whether DtCrowd's many features that handle agent interactions naturally (steering, collision) could be used from multiple threads.

If simple multi-threaded access to DtCrowd on a single DtNavMesh is difficult, I think I will have to choose one of the following:

  1. Force one thread per Field and call that DtCrowd object's update from it.
  2. Skip agent-to-agent collision handling and use only FindPath from multiple threads when needed (for example, for monster pathfinding).
    (I plan to create multiple DtNavMeshQuery instances, one per thread task, on a single Field's DtNavMesh and process queries on each thread.)

These are the ideas I have come up with so far. If there are scenarios for using DotRecast and Detour efficiently, multi-threaded or not, on a server or in an MMORPG with large-scale battles between users, I would greatly appreciate any advice or ideas you could share.
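
Option 2 above, one query object per worker over shared read-only data, can be sketched with the thread-local overload of `Parallel.For`. Everything DotRecast-specific is replaced by assumptions here: `PathQuery` is a hypothetical stand-in for a DtNavMeshQuery-like object that keeps mutable scratch state, and `FindPathLength` is a toy computation rather than a real pathfinding call.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a query object that is not thread-safe
// because it keeps mutable working state between calls.
public sealed class PathQuery
{
    private int _scratch;

    public int FindPathLength(int start, int end)
    {
        _scratch = Math.Abs(end - start); // toy "search" that writes to shared scratch
        return _scratch;
    }
}

public static class SectorPathfinding
{
    // Each worker creates its own PathQuery once and reuses it for all
    // requests it processes, so no instance is shared between threads.
    public static int[] SolveAll((int start, int end)[] requests)
    {
        var results = new int[requests.Length];
        Parallel.For(
            0,
            requests.Length,
            () => new PathQuery(),                       // per-worker instance
            (i, loopState, query) =>
            {
                results[i] = query.FindPathLength(requests[i].start, requests[i].end);
                return query;                            // hand the instance to the next iteration
            },
            query => { });                               // nothing to clean up in this sketch
        return results;
    }
}
```

Each index of `results` is written by exactly one iteration, so the output array needs no locking.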

@ikpil
Owner

ikpil commented Jun 13, 2024

> If there are scenarios for using DotRecast and Detour efficiently, multi-threaded or not, on a server or in an MMORPG with large-scale battles between users

The DtCrowd example shows how many features are interconnected; it was not written with thread safety in mind.

There are many ways to build multi-threaded crowd processing, but the methods I know do not differ much from what @kaoraswoo mentioned.

Overall, there are two main approaches:

Agent Splitting
1-1. Assign agents to fields, isolate the fields, and process them on multiple threads.
1-2. Dynamically group agents into clusters, isolate the clusters, and process them on multiple threads.
1-3. Let agents only read other agents, isolate each agent, and process them on multiple threads.
1-4. With agents numbered 1 to 100, group them into 1-10, 11-20, and so on, and iterate over the groups on multiple threads (tolerating some error).

Functionality Splitting
2-1. Process the features that can be isolated on multiple threads.

In my experience, 1-1 and 1-4 are the easiest in terms of development convenience and productivity.
I mainly develop with 1-1 and then improve it using a profiler.

2-1 requires further research.

This is not a complete answer, but I hope it helps.
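
Approach 1-1 can be sketched as one crowd per field, with the fields stepped in parallel. This is a rough illustration under assumptions: `FieldCrowd` is a hypothetical stand-in that would wrap a DtCrowd in real code, its `Update` is a toy step, and `FieldScheduler` is a made-up name. The point is only that each field is touched by exactly one worker per tick.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for one field's crowd; real code would wrap a DtCrowd here.
public sealed class FieldCrowd
{
    private readonly List<float> _agentPositions = new List<float>();

    public int Ticks { get; private set; }

    public void AddAgent(float x) => _agentPositions.Add(x);

    public float PositionOf(int index) => _agentPositions[index];

    // Toy update: moves every agent forward by dt. Not thread-safe on its own.
    public void Update(float dt)
    {
        for (int i = 0; i < _agentPositions.Count; i++)
        {
            _agentPositions[i] += dt;
        }
        Ticks++;
    }
}

public static class FieldScheduler
{
    // Fields share no agents, so each one can be updated by a different worker.
    public static void Tick(IReadOnlyList<FieldCrowd> fields, float dt)
        => Parallel.ForEach(fields, field => field.Update(dt));
}
```

Agents that move between fields would have to be handed over between ticks, outside the parallel section, which this sketch does not cover.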

@GabrielMotaAlexandre
Contributor Author

Another change I made that improved both performance and readability was converting float collections to Vector3 collections.
EG
