Data handling in dump_data_to_clickhouse #108
Hi @anfbermudezme, thank you for taking the time to write up this really thorough set of thoughts. There is definitely a lot we can do to improve the event sinks. I have always hoped that this implementation would be a temporary step and that we could eventually use the event bus for this, but adoption of the event bus has been slow and we don't want to force community adoption of it just for this purpose. I'll try to address some of the reasoning behind why things are the way they are below, but I would love to continue the conversation and find ways to improve the project.

Potential challenges
We have mitigations for this that delay sending the task until the transaction is committed. This should be the default for all sinks triggered inside MySQL transactions (see here for an example), but it is not currently, so that is one way things can definitely be improved. It also prevents issues where data arrives in ClickHouse and the transaction then rolls back, leading to inconsistencies. Not all sinks are tied directly to MySQL transactions, however, and things such as handling course data on publish can have more complicated failure states since Mongo is involved as well.
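For reference, the on-commit pattern looks roughly like this in Django. This is a minimal sketch assuming a Django `post_save` signal and a Celery task; the model name, handler name, and import paths are illustrative rather than the sink's exact code:

```python
from django.db import transaction
from django.db.models.signals import post_save
from django.dispatch import receiver

# Hypothetical names for illustration; the real sink wires this up differently.
from myapp.models import CourseOverview
from myapp.tasks import dump_data_to_clickhouse


@receiver(post_save, sender=CourseOverview)
def queue_clickhouse_dump(sender, instance, **kwargs):
    # Defer sending the Celery task until the surrounding MySQL transaction
    # commits. If the transaction rolls back, the callback never fires, so
    # ClickHouse never receives data that MySQL discarded.
    transaction.on_commit(
        lambda: dump_data_to_clickhouse.delay(object_id=str(instance.pk))
    )
```

If no transaction is open (autocommit), `transaction.on_commit` runs the callback immediately, so the pattern is safe to apply unconditionally.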
This is true; however, in most cases we're talking about a primary key lookup in MySQL, which is extremely efficient, so it's generally not a performance concern compared to everything else happening in the LMS. There are two other reasons it happens this way:

1. Some objects are too large to be stored in the Celery task table and would be truncated or lost if we tried to store the full serialized object. We could work around this by storing the serialized data somewhere else, but then we incur a lot of complexity and new race conditions.
2. Speaking of race conditions, the Celery tasks we use to send this data do not have a guaranteed order. By sending only the primary keys we lose some granularity in the ClickHouse data when someone saves multiple times in quick succession, but the cost of losing those intermediate rows seems worth it if we are guaranteed to have the most recent rows in ClickHouse, instead of potentially having an intermediate row accidentally sent last and overwriting the state with out-of-date data (a rough sketch of this property follows the list).
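To illustrate why sending primary keys tolerates out-of-order delivery: each task re-reads the row when it actually runs, so whichever task executes last writes the current state rather than a stale snapshot captured at queue time. This is a simplified sketch, not the sink's actual implementation, and the model, serializer, and client names are hypothetical:

```python
from celery import shared_task

# Hypothetical imports for illustration only.
from myapp.models import CourseOverview
from myapp.serializers import serialize_for_clickhouse
from myapp.clickhouse_client import insert_row


@shared_task
def dump_data_to_clickhouse(object_id):
    # Only the primary key travels through Celery, so the payload stays small
    # and the row is re-read at execution time. Even if two tasks for the same
    # object run out of order, both read the latest committed state, so the
    # last write to ClickHouse reflects the current data rather than an
    # intermediate save.
    instance = CourseOverview.objects.get(pk=object_id)
    insert_row(serialize_for_clickhouse(instance))
```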
This is true as well, and the separation of concerns is a difficult one to untangle. Since Aspects is not a required or even default component of the Open edX stack, it did not seem appropriate for other systems to have any knowledge of ClickHouse or of the opinionated serialization needed to store their data there. Since Aspects controls the ClickHouse schema, it has seemed to make more sense for it to take on the burden of making the serialization choices it needs for efficient storage.

Points for Consideration
I think that for transaction scenarios our best bet is to make sure every sink that can do so is set up to only trigger the sink on commit. Retries in general are definitely desirable. We have shied away from using Celery retries very much out of concern for scenarios like ClickHouse going down, Celery backing up on retries for many events, and other core services that share Celery with us failing or degrading as a result. This is an area where we could definitely improve, however!
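If retries were added, Celery's built-in retry options could keep a ClickHouse outage from backing up the shared workers indefinitely, by capping attempts and spreading them out with backoff. A sketch under those assumptions; the option names are standard Celery, but the values and the exception type are illustrative, not something the project has settled on:

```python
import requests
from celery import shared_task


@shared_task(
    bind=True,
    autoretry_for=(requests.RequestException,),  # retry only on connectivity-style failures
    retry_backoff=True,       # exponential backoff between attempts
    retry_backoff_max=600,    # cap the wait at 10 minutes
    retry_jitter=True,        # jitter so many failing tasks don't retry in lockstep
    max_retries=5,            # give up eventually so shared workers can't back up forever
)
def dump_data_to_clickhouse(self, object_id):
    # Fetch the object and send it to ClickHouse here; a raised
    # RequestException triggers an automatic, backed-off retry.
    ...
```

Anything that exhausts its retries would still need a fallback, such as logging the failed object id so it can be backfilled later.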
I think I covered this above, but am happy to talk about it more.

Questions
I'd definitely be happy to explore alternatives to the way things currently work!
I don't think that storing serialized data for every object is something we would want to pursue, based on the reasons above. I think that examining ways we could handle retries better, and making sure that all MySQL-based models execute on commit, would be good wins for us. Many of these challenges may go away in an event bus implementation (with new ones added, no doubt), but we still don't seem very close to that. I'm happy to continue the conversation here, or do a video call, or even carry this forward in a data working group meeting if you like. Thanks again!
There is also a new tool that we can likely use to replace the sinks as they exist today: https://github.com/Altinity/clickhouse-sink-connector We'll be evaluating it for the next release; it would simplify things a great deal (especially if they add the ability to append to tables rather than just overwriting with the latest data).
Hi @bmtcril, thank you for your detailed response and for sharing the update about the Altinity ClickHouse Sink Connector. It's exciting to see that there's potential for simplifying the current process significantly. The idea of appending to tables instead of overwriting them aligns well with ensuring data consistency, and I'm looking forward to seeing how this evaluation unfolds. If there's anything I can contribute to these discussions or to testing and evaluating the new connector, I'd be happy to help. Please let me know how I can assist or join the efforts to push these improvements forward.
@anfbermudezme Thank you for the offer! I don't know which areas of the platform you have experience with, but if you wanted to try to add the Altinity Sink Connector to the tutor-contrib-aspects project it would be helpful to know how that goes and whether there are any issues with the resulting data. @Ian2012 or I could help write a ticket for that work describing what we think needs to be done to test, but it's probably a fair amount of work requiring some knowledge of Tutor plugins, MySQL replication, and Docker/Compose. Does that sound like something you'd want to do?
Current Behavior
The dump_data_to_clickhouse method in the Aspects plugin currently queries the database to retrieve the object before sending it to ClickHouse. This approach ensures the process is fully asynchronous and doesn’t block the original request.
While this design makes sense to prevent blocking, it introduces some potential challenges:

- The task can be sent before the originating MySQL transaction commits (or after it rolls back), leading to inconsistent data in ClickHouse.
- Re-querying the database inside the task adds load for every save.
- The sink takes on responsibility for serializing other systems' data for ClickHouse, which blurs the separation of concerns.
Points for Consideration
While the current implementation aims to avoid breaking the original request in case of serialization or connectivity issues, I wanted to explore whether there's room for optimizations. Here are some ideas to consider:

- Only trigger the sink once the transaction has committed.
- Add retries for transient failures, such as ClickHouse being temporarily unavailable.
- Reconsider where the responsibility for serializing the data should live.
Questions
The current implementation works, but it may be worth discussing whether adjustments like retries or a shift in responsibility could improve its efficiency and maintainability. Thank you for considering this topic and for the work behind this plugin.