Config for deciding whether to use Iceberg Time type #11174
Conversation
As per the current design: if the Connect schema is int32 with logical type Time, the Iceberg table type is returned as TimeType. Spark does not support the Iceberg TimeType (https://iceberg.apache.org/docs/1.4.3/spark-writes/#iceberg-type-to-spark-type). In this PR, we have added a config to decide whether to use the Iceberg Time type or IntegerType. This config gives users the flexibility to choose IntegerType over TimeType, keeping full precision while remaining queryable via Spark.
Fixed the RecordConverter test to validate both the Iceberg Time type and the integer type cases.
Fixed the SchemaUtils test to handle the Iceberg time type and the integer type.
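The decision the PR describes can be sketched as follows. This is a minimal, self-contained illustration only: the config key name, the `toIcebergType` helper, and the string type names are hypothetical stand-ins for the actual connector config and the Iceberg `Types` API.

```java
// Minimal sketch of a config-driven mapping for Connect's Time logical type.
// The real converter would return org.apache.iceberg.types.Types instances;
// strings are used here to keep the example self-contained.
public class TimeTypeMapping {
    // Hypothetical config key mirroring the PR's proposal.
    static final String USE_TIME_TYPE = "iceberg.tables.use-time-type";

    // Map a Connect int32 field to an Iceberg type name, honoring the config.
    static String toIcebergType(String connectLogicalType, boolean useTimeType) {
        if ("org.apache.kafka.connect.data.Time".equals(connectLogicalType)) {
            // Spark cannot read Iceberg's time type, so optionally fall back
            // to a plain integer (millis since midnight) with no precision loss.
            return useTimeType ? "time" : "int";
        }
        return "int";
    }

    public static void main(String[] args) {
        System.out.println(toIcebergType("org.apache.kafka.connect.data.Time", true));
        System.out.println(toIcebergType("org.apache.kafka.connect.data.Time", false));
    }
}
```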
@bryanck can you please review this?
@fqaiser94 can you please review this?
Have you considered using an SMT for this? I'm reluctant to add configs for each type conversion scenario.
@bryanck thanks for the quick response. Kafka Connect SMTs are a bit slow, and adding one just for this conversion would unnecessarily lower performance. Also, since Spark has no plan to extend support for TimeType, and Spark is very frequently and widely used, I thought of offering this option via the config.
What we want is to keep transform logic out of the sink when possible. Our plan is to have a set of useful SMTs that we package with the connector, and this could be one of them.
Hi Bryan, thank you for your response. I completely agree with your view that the logic for Single Message Transforms (SMTs) should not be part of any sink. However, in the context of this PR, I have a few considerations about moving this to an SMT:
- Performance impact: While SMTs provide flexibility, they introduce per-record overhead. An SMT runs once per record, which may seem negligible, but becomes significant when processing millions of records.
- General applicability: SMTs are typically designed to work across different connectors, remaining connector-agnostic. Here, the transformation logic is closely tied to deciding whether to convert to Iceberg's time type; moving it to an SMT would produce one specifically tailored for the Iceberg sink, limiting its broader applicability.
- Overhead of moving logic: The current logic is a simple conditional check. Moving it to an SMT introduces an additional processing step, which could come at a higher cost for what is essentially a straightforward type conversion.
I'd be interested to hear your thoughts on this approach.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
Hi Bryan, I hope you're well. I wanted to follow up on this. It's been a while, and I'd appreciate the opportunity to discuss it further.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Hey! Just a quick follow-up: I encountered the same issue when using the files generated by the connector with Spark, and I came to the same conclusion as @kumarpritam863: I configured a property in my custom connector and am now using it in production. Regarding making this functionality available, as @bryanck says, we could include a simple SMT that transforms all the fields in the key/value that are of type Time into a long or a Timestamp from epoch. It may not be as efficient, but it separates the logic from the connector itself. I can handle this in a separate PR, but first the SMTs need to be merged as part of the Iceberg repo: #11936. Let me know if you have any other ideas @kumarpritam863
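The core of the SMT idea described above is just the per-field value conversion. A minimal stdlib-only sketch of that step, assuming the Connect convention that a Time logical value is a `java.util.Date` holding milliseconds since midnight (the class and method names here are illustrative, not the actual SMT from #11936):

```java
import java.util.Date;

// Sketch of the per-field conversion such an SMT would apply: Kafka Connect's
// Time logical type carries a java.util.Date whose getTime() is milliseconds
// since midnight; emitting it as a plain long of microseconds keeps the value
// Spark-readable while matching Iceberg's microsecond time granularity.
public class TimeFieldConverter {
    // Convert a Connect Time value (millis since midnight) to micros as a long.
    static long timeToMicros(Date connectTime) {
        return connectTime.getTime() * 1000L;
    }

    public static void main(String[] args) {
        // 10:30:00.000 after midnight is 37,800,000 ms.
        Date t = new Date(37_800_000L);
        System.out.println(timeToMicros(t));
    }
}
```

The same shape works for a Timestamp-from-epoch target; only the source unit interpretation changes.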
We have #11936 for a new SMT project. We can start adding some common transforms to that, rather than mixing transform logic into the sink itself. |