
feat: add support for coordinator schemas #28031

Merged · 23 commits into master · Jan 31, 2025

Conversation

@Daesgar (Contributor) commented Jan 29, 2025

Problem

We run ClickHouse topologies in which nodes have different roles. The schema a node needs depends on its role, but we currently have no way to run different schema creation statements per role.

Changes

Add support for managing schemas on coordinator nodes. To achieve this:

  • Use the ClickhouseCluster class so we can run statements on any combination of nodes from the cluster
  • Use the posthog_migrations cluster, which contains all nodes, including coordinators. The cluster is only used to populate ClickhouseCluster info, so I added a new setting (CLICKHOUSE_MIGRATIONS_CLUSTER) to reference it and keep all the migrations running as expected. HostInfo now carries the node role (worker / coordinator) and cluster type (online / offline), which are fed from macros.
    • To read those macros correctly, I changed the query from system.clusters to clusterAllReplicas.
  • The migration function can now receive an additional parameter specifying which node role to target (see the sketch after this list)
  • By default, migrations run on all workers, to stay compatible with the previous behaviour
    • I will probably revisit this: most migrations run ON CLUSTER statements, so we would be issuing the same ON CLUSTER query from every worker. This should not be an issue, since all existing migrations are already applied and the local default cluster has only one node, but it is still redundant.
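
As an illustration of the new parameter, here is a minimal sketch of what a role-targeted migration could look like. The helper name run_sql_with_exceptions, its import path, and the DDL are assumptions made for the example, not confirmed API from this PR.

```python
# Hypothetical migration file; helper name and signature are assumptions.
from posthog.clickhouse.client.connection import NodeRole
from posthog.clickhouse.client.migration_tools import run_sql_with_exceptions  # assumed helper

COORDINATOR_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS coordinator_only_table (key String, value String)
ENGINE = MergeTree()
ORDER BY key
"""

operations = [
    # Default behaviour: run on all workers, as before this change.
    run_sql_with_exceptions(COORDINATOR_TABLE_SQL),
    # New behaviour: run only on coordinator nodes.
    run_sql_with_exceptions(COORDINATOR_TABLE_SQL, node_role=NodeRole.COORDINATOR),
]
```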

Before shipping this I need to:

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

Yes.

How did you test this code?

Tested locally by running all the migrations and adding new dummy ones to ensure they are created on the expected nodes.

Also tested the ClickHouse configuration for the hobby deployment to ensure that migrations run correctly.

@Daesgar Daesgar changed the title WIP feat: add support for coordinator schemas feat: add support for coordinator schemas Jan 29, 2025
@Daesgar Daesgar marked this pull request as ready for review January 30, 2025 12:40
@Daesgar Daesgar requested a review from a team as a code owner January 30, 2025 12:40
@orian orian self-requested a review January 30, 2025 12:43
@greptile-apps (bot) left a comment


PR Summary

This PR adds support for managing ClickHouse schemas across different node roles (coordinator/worker) in distributed clusters.

  • Added new posthog_migrations cluster in config.d/worker.xml and config.d/coordinator.xml to handle schema changes across all node types
  • Introduced NodeRole enum (ALL/COORDINATOR/WORKER) in client/connection.py to control migration execution targets
  • Added host_cluster_type and host_cluster_role macros in ClickHouse configs to identify node roles
  • Modified ClickhouseCluster class to filter operations by node role using map_all_hosts(node_role=NodeRole.X) (see the sketch after this list)
  • Added CLICKHOUSE_MIGRATIONS_CLUSTER setting (defaults to 'posthog_migrations') to manage schema changes across all nodes
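
A rough sketch of the role filtering described above; the enum values and the lowercase host_cluster_role comparison mirror the test snippet further down in this thread, while the standalone filter function is illustrative rather than the actual ClickhouseCluster code.

```python
from enum import Enum


class NodeRole(Enum):
    ALL = "ALL"
    COORDINATOR = "COORDINATOR"
    WORKER = "WORKER"


def filter_hosts_by_role(hosts, node_role: NodeRole):
    """Keep only hosts whose host_cluster_role macro matches the requested role."""
    if node_role == NodeRole.ALL:
        return list(hosts)
    return [host for host in hosts if host.host_cluster_role == node_role.value.lower()]
```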


20 file(s) reviewed, 11 comment(s)

Comment on lines +85 to +95
volumes:
  # this new entrypoint file is to fix a bug detailed here https://github.com/ClickHouse/ClickHouse/pull/59991
  # revert this when we upgrade clickhouse
  - ./docker/clickhouse/entrypoint.sh:/entrypoint.sh
  - ./posthog/idl:/idl
  - ./docker/clickhouse/docker-entrypoint-initdb.d:/docker-entrypoint-initdb.d
  - ./docker/clickhouse/config.xml:/etc/clickhouse-server/config.xml
  - ./docker/clickhouse/config.d/coordinator.xml:/etc/clickhouse-server/config.d/coordinator.xml
  - ./docker/clickhouse/users-dev.xml:/etc/clickhouse-server/users.xml
  - ./docker/clickhouse/user_defined_function.xml:/etc/clickhouse-server/user_defined_function.xml
  - ./posthog/user_scripts:/var/lib/clickhouse/user_scripts


style: consider deduplicating volume mounts between clickhouse and clickhouse-coordinator services using YAML anchors

.flox/env/manifest.toml (comment outdated and resolved)
</remote_servers>

<macros>
<shard>02</shard>


logic: shard number '02' may conflict with existing shards if not carefully coordinated across the cluster

def run_migration():
    if node_role == NodeRole.ALL:
        logger.info(" Running migration on coordinators and workers")
        return cluster.map_all_hosts(lambda client: client.execute(sql)).result()


logic: No error handling for failed migrations across nodes - could leave cluster in inconsistent state
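
One way to surface such failures is to aggregate per-host errors before raising. The sketch below uses plain concurrent.futures instead of the repo's FuturesMap helper, so names and structure are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_hosts(hosts, make_client, sql):
    """Run `sql` on every host and report all failing hosts instead of stopping at the first error."""
    failures = {}
    with ThreadPoolExecutor() as executor:
        # Submit one task per host; the default argument pins each host to its lambda.
        futures = {host: executor.submit(lambda h=host: make_client(h).execute(sql)) for host in hosts}
        for host, future in futures.items():
            try:
                future.result()
            except Exception as exc:
                failures[host] = exc
    if failures:
        raise RuntimeError(f"Migration failed on {len(failures)} host(s): {failures}")
```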

Comment on lines +104 to 107
SELECT host_address, port, shard_num, replica_num, getMacro('hostClusterType') as host_cluster_type, getMacro('hostClusterRole') as host_cluster_role
FROM clusterAllReplicas(%(name)s, system.clusters)
WHERE name = %(name)s and is_local
ORDER BY shard_num, replica_num


logic: Using is_local filter with clusterAllReplicas may return no results if default_database is set in remote_servers config. Ensure this is removed as mentioned in PR description.

@Daesgar (Author)


PR ready for that!
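
For context, the columns selected by the query above map onto per-host records roughly as follows; the real HostInfo in posthog.clickhouse.cluster may be shaped differently, so this dataclass is only an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostRecord:  # stand-in for the real HostInfo
    host_address: str
    port: int
    shard_num: int
    replica_num: int
    host_cluster_type: str  # from getMacro('hostClusterType'), e.g. "online" / "offline"
    host_cluster_role: str  # from getMacro('hostClusterRole'), e.g. "worker" / "coordinator"


def hosts_from_rows(rows):
    """Build host records from the clusterAllReplicas query results, one per row."""
    return [HostRecord(*row) for row in rows]
```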

    ) -> ClickhouseCluster:
        extra_hosts = []
        for host_config in map(copy, CLICKHOUSE_PER_TEAM_SETTINGS.values()):
-            extra_hosts.append(ConnectionInfo(host_config.pop("host")))
+            extra_hosts.append(ConnectionInfo(host_config.pop("host"), None))


style: Setting port to None for extra_hosts while adding port support could cause connection issues if the default port is not correct

@@ -53,12 +54,14 @@ def handle(self, *args, **options):
        self.migrate(CLICKHOUSE_HTTP_URL, options)

    def migrate(self, host, options):
        # Infi only creates the DB in one node, but not the rest. Create it before running migrations.
        self._create_database_if_not_exists(CLICKHOUSE_DATABASE, CLICKHOUSE_MIGRATIONS_CLUSTER)


style: Database creation should be wrapped in a try/catch block to handle potential connection failures gracefully

Comment on lines 113 to 114
        with default_client() as client:
            client.execute(f"CREATE DATABASE IF NOT EXISTS {database} ON CLUSTER '{cluster}'")


logic: Raw SQL string interpolation creates potential SQL injection risk. Use parameterized query instead

Suggested change
-        with default_client() as client:
-            client.execute(f"CREATE DATABASE IF NOT EXISTS {database} ON CLUSTER '{cluster}'")
+        with default_client() as client:
+            client.execute("CREATE DATABASE IF NOT EXISTS %(database)s ON CLUSTER %(cluster)s", {'database': database, 'cluster': cluster})

@Daesgar (Author)


The driver only supports parameter substitution for SELECT and INSERT queries. :/
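
Given that limitation, a common fallback is to validate the identifiers before interpolating them into the DDL string. This is a general pattern under that assumption, not code from this PR:

```python
import re

# Conservative identifier check; adjust if database or cluster names can contain other characters.
IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_database_on_cluster(client, database: str, cluster: str) -> None:
    for identifier in (database, cluster):
        if not IDENTIFIER_RE.match(identifier):
            raise ValueError(f"Unsafe identifier: {identifier!r}")
    client.execute(f"CREATE DATABASE IF NOT EXISTS {database} ON CLUSTER '{cluster}'")
```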



Some reptile comments are pure guesses, but others are quite good ;)

Comment on lines +93 to +98
def mock_get_task_function(_, host: HostInfo, fn: Callable[[Client], T]) -> Callable[[], T]:
    if host.host_cluster_role == NodeRole.WORKER.value.lower():
        times_called[NodeRole.WORKER] += 1
    elif host.host_cluster_role == NodeRole.COORDINATOR.value.lower():
        times_called[NodeRole.COORDINATOR] += 1
    return lambda: fn(Mock())


style: mock_get_task_function is using a private method name with double underscores. Consider using the public interface or documenting why the private method needs to be mocked.
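
For reference, if the double-underscore method does need to be stubbed, Python's name mangling means it must be patched under its mangled attribute name. A hedged sketch reusing the names from the snippet above:

```python
# ClickhouseCluster, cluster, and mock_get_task_function are assumed from the test snippet above.
from unittest.mock import patch

with patch.object(ClickhouseCluster, "_ClickhouseCluster__get_task_function", mock_get_task_function):
    cluster.map_all_hosts(lambda client: client.execute("SELECT 1")).result()
```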


cluster = ClickhouseCluster(bootstrap_client_mock)

times_called: defaultdict[NodeRole, int] = defaultdict(int)


style: defaultdict usage here could mask errors if NodeRole values are incorrect. Consider using a regular dict with explicit initialization.

@@ -130,15 +150,23 @@ def any_host(self, fn: Callable[[Client], T]) -> Future[T]:
        host = self.__hosts[0]
        return executor.submit(self.__get_task_function(host, fn))

-    def map_all_hosts(self, fn: Callable[[Client], T], concurrency: int | None = None) -> FuturesMap[HostInfo, T]:
+    def map_all_hosts(


map_hosts would be more appropriate, as it's more of map_hosts_by_role ;)

@Daesgar (Author)


100%, I'll update that

@orian (Contributor) left a comment


This was a nontrivial learning read ;-)

Some of the 10-line Python list comprehensions are quite hard for me to read.

@Daesgar Daesgar enabled auto-merge (squash) January 31, 2025 08:39
@Daesgar Daesgar merged commit 9b30f8d into master Jan 31, 2025
95 checks passed
@Daesgar Daesgar deleted the coordinator-migrations branch January 31, 2025 09:00
@sentry-io (bot) commented Jan 31, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ ServerException: DB::Exception: All connection tries failed. Log: (posthog.clickhouse.cluster in __init__)
  • ‼️ ServerException: DB::Exception: Received from ch23.posthog.net:9000. DB::Exception: Too many simultaneous queries.... (posthog.clickhouse.cluster in __init__)
  • ‼️ ServerException: DB::Exception: All connection tries failed. Log: (posthog.clickhouse.cluster in __init__)
  • ‼️ ServerException: DB::Exception: There was an error on [ch20.posthog.net:9000]: Code: 202. DB::Exception: Too many ... (posthog.management.commands.migrate_clickhouse ...)
  • ‼️ ServerException: DB::Exception: Received from ch23.posthog.net:9000. DB::Exception: Too many simultaneous queries.... (posthog.clickhouse.cluster in __init__)

