
Upgrade from 1.15 over 1.17 to 2.0 not possible (database corrupt) #268

Open
alexx-km opened this issue Jan 6, 2025 · 5 comments

Comments

@alexx-km

alexx-km commented Jan 6, 2025

Hello,

we're currently in the process of updating ClearML from 1.15 to 2.0 and tried to use the suggested docker-compose file from here to upgrade to the intermediate 1.17 version. After fixing some issues with permissions and with the fileserver's token authentication, I finally got the containers running without major issues and was able to access our project. Unfortunately, it seems that our database was not migrated correctly to the newer version. I see all experiments in the overview:
[Screenshot: experiment overview]

Some of the final metrics and general experiment information (runtime, etc.) are also still available, but the console output and metric graphs are missing:

[Screenshots: console output and metric graphs missing]

Do you have any idea what the issue could be? I attached the log file of the upgrade process to this issue: clearml_update_log.txt

Thank you in advance!

@evg-allegro
Contributor

Hi @alexx-km, task events come from Elasticsearch. Can you please check that the volume mappings for the elasticsearch service are the same in your current and previous docker-compose files?
If they are the same, please run the following command from the host:
curl -XGET "http://localhost:9200/_cat/indices/events*"
Are there any indices with red status? Can you share the output here?
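`_cat/indices` prints one line per index with the health status in the first column, so red indices can also be picked out programmatically. A minimal sketch, assuming the default column order (health, status, index, uuid, pri, rep, docs.count, ...) and Elasticsearch published on localhost:9200 as in the curl command above; the helper name `red_indices` is hypothetical.

```python
import sys
import urllib.request

# Same endpoint as the curl command; assumes port 9200 is published on the host.
CAT_URL = "http://localhost:9200/_cat/indices/events*"


def red_indices(cat_output: str) -> list[str]:
    """Return the names of indices whose health column is 'red'.

    Assumes the default `_cat/indices` text layout:
    health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
    A `?v` header row is harmless because its first field is 'health'.
    """
    red = []
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "red":
            red.append(fields[2])  # third column is the index name
    return red


if __name__ == "__main__":
    with urllib.request.urlopen(CAT_URL, timeout=10) as resp:
        text = resp.read().decode("utf-8")
    bad = red_indices(text)
    if bad:
        print("Red indices:", ", ".join(bad))
        sys.exit(1)
    print("No red events indices found.")
```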

@alexx-km
Author

alexx-km commented Jan 7, 2025

Thank you so much for your quick reply! I think that could be the issue. Because the workstation user that starts the Docker containers does not have UID 1000, I had to change the path mapping for elasticsearch
from: /opt/clearml/data/elastic_7:/usr/share/elasticsearch/data
to: /opt/clearml/data/elastic_7:/var/lib/elasticsearch/data
My workaround for version 1.15 was to specify the user 1001:1001 for elasticsearch in the docker-compose file, but this no longer works and throws the error:
clearml-elastic | ERROR: unable to create temporary keystore at [/usr/share/elasticsearch/config/elasticsearch.keystore.tmp], write permissions required for [/usr/share/elasticsearch/config] or run [elasticsearch-keystore upgrade], with exit code 78
Do you know whether it is still possible to migrate to the newer version with user 1001?

Also, when I try to execute the command you suggested, I get the following error:
curl: (7) Failed to connect to localhost port 9200: Connection refused
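`Connection refused` means nothing is accepting connections on port 9200, which is likely a consequence of the elasticsearch container exiting with code 78 as in the error above. A minimal sketch for checking whether the port is listening before querying the API; the helper name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError and socket timeouts are both subclasses of OSError.
        return False


if __name__ == "__main__":
    # 9200 is the Elasticsearch port used by the curl command in this thread.
    print("port 9200 open:", port_is_open("localhost", 9200))
```

If this prints `False`, the container logs for the elasticsearch service are the next place to look.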

@alexx-km
Author

alexx-km commented Jan 7, 2025

I think I fixed it by:

  1. setting the correct permissions to user 1000 with
    • sudo chown -R 1000:1000 /opt/clearml/data/elastic_7_config
    • sudo chown -R 1000:1000 /opt/clearml/data/elastic_7_logs
    • sudo chown -R 1000:1000 /opt/clearml/data/elastic_7
  2. running the upgrade script from 1.15 to 1.17
  3. running the upgrade script from 1.17 to 2.0
    • also had to disable token authentication with CLEARML__fileserver__auth__enabled: "false"
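To confirm that the ownership change in step 1 actually covered everything, one option is to list any path under the Elasticsearch directories that is not owned by UID 1000. A minimal sketch, assuming the three paths from the `chown` commands above and UID 1000 as the container user (as in this fix); `find_misowned` is a hypothetical helper.

```python
from pathlib import Path

# Paths taken from the chown commands in this thread; adjust to your own volume mappings.
ELASTIC_DIRS = [
    "/opt/clearml/data/elastic_7",
    "/opt/clearml/data/elastic_7_config",
    "/opt/clearml/data/elastic_7_logs",
]


def find_misowned(root: str, uid: int = 1000) -> list[str]:
    """Return paths under root (including root itself) whose owner UID differs from uid."""
    root_path = Path(root)
    if not root_path.exists():
        return []
    bad = []
    for path in [root_path, *root_path.rglob("*")]:
        # lstat so that a symlink's own owner is checked, not its target's.
        if path.lstat().st_uid != uid:
            bad.append(str(path))
    return bad


if __name__ == "__main__":
    for d in ELASTIC_DIRS:
        misowned = find_misowned(d)
        print(f"{d}: {len(misowned)} path(s) not owned by UID 1000")
```

Run it with `sudo` if the current user cannot read the directories, otherwise unreadable subdirectories may be skipped.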

Is there a way to verify that the migration was successful for all data in the database?

@oren-allegro

@alexx-km, you can check the container logs for errors, but in general: if you are able to view your tasks (experiments), MongoDB was upgraded successfully, and if you are able to see events (console logs, scalars, plots), Elasticsearch was also upgraded correctly.
The only other thing to check is the fileserver, if you are using it. You can verify that debug samples and model snapshots are visible and can be downloaded.
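As a supplementary check (an assumption on our part, not an official ClearML procedure), you could save the output of the `_cat/indices/events*` curl command from earlier in this thread before and after the upgrade and compare the `docs.count` column per index. A minimal sketch, assuming the default column order (docs.count is the seventh column) and that the upgrade keeps index names unchanged; `doc_counts` and `shrunk_indices` are hypothetical helpers.

```python
from typing import Optional


def doc_counts(cat_output: str) -> dict[str, int]:
    """Map index name -> docs.count from `_cat/indices` text output."""
    counts = {}
    for line in cat_output.splitlines():
        fields = line.split()
        # Skip blank lines, the optional `?v` header row, and red indices without stats.
        if len(fields) < 7 or not fields[6].isdigit():
            continue
        counts[fields[2]] = int(fields[6])
    return counts


def shrunk_indices(before: str, after: str) -> list[tuple[str, int, Optional[int]]]:
    """List (index, old_count, new_count) for indices that lost documents or disappeared."""
    old, new = doc_counts(before), doc_counts(after)
    problems = []
    for name, count in sorted(old.items()):
        new_count = new.get(name)  # None if the index is missing after the upgrade
        if new_count is None or new_count < count:
            problems.append((name, count, new_count))
    return problems
```

Growth in a count is expected if new tasks ran after the upgrade; only a drop or a missing index is worth investigating.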

@alexx-km
Author

alexx-km commented Jan 7, 2025

Alright, it seems to work. Thank you very much for your help 👍
