
Expected Performance? #6

Open
jlongland opened this issue Mar 8, 2018 · 19 comments

Comments

@jlongland
Contributor

Not really an issue, just more of a question. What is the expected performance of the data loader and what size of dumps has it been tested against?

I'm testing with skip_historical_imports = true and only_load_final = true, which amounts to a 1.1 GB dump for our institution. I've yet to successfully complete a full load, locally or on AWS. My local host is a very dated MacBook Pro, so I decided to try on AWS. Even with the loader running on a d2.xlarge and RDS on a db.r4.2xlarge, the loads are too slow to be useful.

Before I investigate too much further, I figured I'd touch base to find out more about how CDL has been tested.

@Mythra
Contributor

Mythra commented Mar 8, 2018

Hey @jlongland ,

To be honest, there hasn't been much production testing of the "Canvas-Data-Loader", since it's meant to be an example, or an interim solution. That being said, the solution is kept relatively slow for debugging's sake (e.g. we don't process rows of files in parallel, even though we very easily could), so it's easier to figure out where something went wrong.

Since Canvas Data dumps are posted once every 24 hours, a dump should take less than 24 hours to import, no matter the size of the account. That said, the CDL does load quite a bit into memory (and is pretty CPU-heavy), so this assumes a decent CPU and a decent amount of memory.

When we tested at InstructureCon for the reveal of this product, with a t2.large for the loader and a db.m3.xlarge for the database, we were able to import over 500 GB of data into 3 Postgres schemas in under 24 hours.

That being said, if you do see import times greater than 4 hours with those options on a machine with two CPU cores and at least 4 GB of RAM, plus a beefy database, we should probably take a look into it.

@jlongland
Contributor Author

Thanks for the quick reply @securityinsanity. I fully appreciate that it's an example/interim solution. Since it was written by someone who works on the product, I like how it deals with some of the nuances of Canvas Data, so I was hoping it might be a reasonable interim solution until we've sorted out our longer-term data lake approach.

I think I'll shift my approach to S3 and Redshift Spectrum rather than putting more time into an interim solution. But I'll try to circle back at some point to help out with this project. I put together a CloudFormation stack that runs CDL as a scheduled container task against RDS. Perhaps it might be useful to someone, especially if there were some parallel processing, though I understand the concerns about troubleshooting in that situation.

@stuartf
Contributor

stuartf commented Mar 8, 2018

I've noticed it's quite slow as well. As someone quite new to Canvas Data, I'm not clear on exactly what I miss by setting skip_historical_imports = true and only_load_final = true (i.e. if we want a complete picture of the data, should I set them both to false?). I've run with skip_historical_imports = true and only_load_final = false for as long as 15 days without it completing.

It doesn't look like it's saturating CPU, memory, or network on either the machine running the Rust loader or the Postgres DB.

If parallelizing the row-file processing is easy but makes debugging hard, what about a runtime option for disabling the parallelization? That way, if things go weird, you just set parallel = false and do the debugging.
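
Something like this hypothetical sketch, for instance (not CDL's actual code; it assumes a rayon dependency, and parallel, rows, and process_row are illustrative names):

    // Config-gated parallelism for row processing (illustrative only).
    use rayon::prelude::*;

    fn process_rows(parallel: bool, rows: &[String]) {
        if parallel {
            // Fan rows out across rayon's thread pool.
            rows.par_iter().for_each(|row| process_row(row));
        } else {
            // Sequential path: deterministic ordering for easier debugging.
            rows.iter().for_each(|row| process_row(row));
        }
    }

    fn process_row(row: &str) {
        // Stand-in for the real per-row work (e.g. the DB insert).
        println!("{}", row);
    }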

@jlongland
Contributor Author

jlongland commented Mar 8, 2018

The Canvas Data docs make vague references to historical dumps, but I haven't found it clearly stated anywhere. As you start to look at request data, you'll see that the incremental request data from the nightly dumps may have gaps. We've seen some rather large gaps; I'm not sure if that's just us or if it's more widespread. In addition to the nightly dumps, there's a monthly historical dump that contains all request files. You'll usually see it a couple of hours after your normal daily dump. It's common practice for Canvas Data users to drop the requests table and reload it using the historical dump. This is what skip_historical_imports controls.

My understanding is that only_load_final only processes the most recent dump. With the exception of the requests table, this will give you current state because the nightly dumps are complete dumps of all facts and dimensions. The only incremental data is requests, but I believe the docs caution that other tables could be handled incrementally. @securityinsanity is in the best position to confirm though.

Echoing the same observation: I'm not seeing any resource saturation while running the loader, on either the host or the database. There's a small CPU spike while all of the files are downloaded, but it tails off quickly. During my last test, I was seeing 7 inserts per second.

@Mythra
Contributor

Mythra commented Mar 8, 2018

Hey,

So, to clear some things up first: @jlongland is pretty much right on the money. Essentially, request data is only exported a day at a time (i.e. each day we export one day's worth of request data, not all the data that's ever existed, as we do for the rest of the tables). About once every two months (though the timing is variable), we do a "historical refresh", where we export all request data present since the beginning of your instance or March 2014 (whichever comes later).

We've heard from a lot of customers that they don't actually want this as part of their day-to-day import, since it can take much longer, so we have an option to skip downloading it entirely.

As for only_load_final, @jlongland is again right on the money. only_load_final = true is a way of not loading every dump. With it set to false, the Canvas Data Loader will attempt to process every dump it gets, one by one. So if you've had Canvas Data enabled for a while and are only just turning the loader on, without this option you would have to wait for it to import 20+ days' worth of dumps, which isn't really what you want. You just want the latest.
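
In rough pseudocode (hypothetical names and fields, not the actual CDL code), the selection logic amounts to:

    // Whether a given dump should be imported, given the two config options.
    struct Dump {
        id: String,
        is_historical_refresh: bool,
        is_latest: bool,
    }

    fn should_import(dump: &Dump, skip_historical_imports: bool, only_load_final: bool) -> bool {
        if skip_historical_imports && dump.is_historical_refresh {
            return false; // skip the big periodic requests refresh
        }
        if only_load_final && !dump.is_latest {
            // Matches the "Skipping dump: <id> due to only final selected" log line.
            println!("Skipping dump: {} due to only final selected", dump.id);
            return false;
        }
        true
    }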

Taking over 15 days to complete an import sounds very wrong on any decently powered hardware, even when importing all 20+ days' worth of dumps. What's the size of your DB? What's the size of the instance? What school are you with?

@stuartf
Contributor

stuartf commented Mar 8, 2018

I'm with Georgia Tech. The machine running the loader is a 4-core 2.7 GHz box with 8 GB of RAM; the Docker instance there says it's using ~6% CPU and a few hundred MB of RAM, though. Postgres is running on an 8-core 2.6 GHz box with 16 GB of RAM (the OS shows 5 GB used), with 5 Postgres processes using about 15% of a CPU each. These machines are both connected to gigabit Ethernet, but iftop only shows the loader transmitting at about 200 Kb and receiving 170 Kb. It didn't look like it was going any faster when I connected it to a Postgres Docker image on the same host as the loader.

@stuartf
Contributor

stuartf commented Mar 12, 2018

I tried running with only_load_final = true over the weekend.

INFO 2018-03-09T22:00:19Z: postgres: NOTICE: relation "account_dim" already exists, skipping
INFO 2018-03-09T22:00:20Z: postgres: NOTICE: relation "assignment_dim" already exists, skipping
INFO 2018-03-09T22:05:38Z: postgres: NOTICE: relation "assignment_fact" already exists, skipping
INFO 2018-03-09T22:09:03Z: postgres: NOTICE: relation "assignment_group_dim" already exists, skipping
INFO 2018-03-09T22:10:14Z: postgres: NOTICE: relation "assignment_group_fact" already exists, skipping
INFO 2018-03-09T22:11:25Z: postgres: NOTICE: relation "assignment_group_rule_dim" already exists, skipping
INFO 2018-03-09T22:12:36Z: postgres: NOTICE: relation "assignment_group_score_dim" already exists, skipping
INFO 2018-03-10T01:02:40Z: postgres: NOTICE: relation "assignment_group_score_fact" already exists, skipping
INFO 2018-03-10T01:37:15Z: postgres: NOTICE: relation "assignment_override_dim" already exists, skipping
INFO 2018-03-10T01:37:39Z: postgres: NOTICE: relation "assignment_override_fact" already exists, skipping
INFO 2018-03-10T01:38:05Z: postgres: NOTICE: relation "assignment_override_user_dim" already exists, skipping
INFO 2018-03-10T01:38:13Z: postgres: NOTICE: relation "assignment_override_user_fact" already exists, skipping
INFO 2018-03-10T01:44:10Z: postgres: NOTICE: relation "assignment_rule_dim" already exists, skipping
INFO 2018-03-10T01:48:14Z: postgres: NOTICE: relation "communication_channel_dim" already exists, skipping
INFO 2018-03-10T02:02:55Z: postgres: NOTICE: relation "communication_channel_fact" already exists, skipping
INFO 2018-03-10T02:15:30Z: postgres: NOTICE: relation "conversation_dim" already exists, skipping
INFO 2018-03-10T02:41:20Z: postgres: NOTICE: relation "conversation_message_dim" already exists, skipping
INFO 2018-03-10T03:32:49Z: postgres: NOTICE: relation "conversation_message_participant_fact" already exists, skipping
INFO 2018-03-10T05:08:49Z: postgres: NOTICE: relation "course_dim" already exists, skipping
INFO 2018-03-10T05:13:25Z: postgres: NOTICE: relation "course_score_dim" already exists, skipping
INFO 2018-03-10T05:53:57Z: postgres: NOTICE: relation "course_score_fact" already exists, skipping
INFO 2018-03-10T06:11:36Z: postgres: NOTICE: relation "course_section_dim" already exists, skipping
INFO 2018-03-10T06:16:42Z: postgres: NOTICE: relation "course_ui_canvas_navigation_dim" already exists, skipping
INFO 2018-03-10T06:16:42Z: postgres: NOTICE: relation "course_ui_navigation_item_dim" already exists, skipping
INFO 2018-03-10T09:42:22Z: postgres: NOTICE: relation "course_ui_navigation_item_fact" already exists, skipping
INFO 2018-03-10T13:26:19Z: postgres: NOTICE: relation "discussion_entry_dim" already exists, skipping
INFO 2018-03-10T13:28:38Z: postgres: NOTICE: relation "discussion_entry_fact" already exists, skipping
INFO 2018-03-10T13:30:45Z: postgres: NOTICE: relation "discussion_topic_dim" already exists, skipping
INFO 2018-03-10T13:33:15Z: postgres: NOTICE: relation "discussion_topic_fact" already exists, skipping
INFO 2018-03-10T13:35:29Z: postgres: NOTICE: relation "enrollment_dim" already exists, skipping
INFO 2018-03-10T16:31:47Z: postgres: NOTICE: relation "enrollment_fact" already exists, skipping
INFO 2018-03-10T19:24:09Z: postgres: NOTICE: relation "enrollment_term_dim" already exists, skipping
INFO 2018-03-10T19:24:09Z: postgres: NOTICE: relation "external_tool_activation_dim" already exists, skipping
INFO 2018-03-10T19:24:12Z: postgres: NOTICE: relation "external_tool_activation_fact" already exists, skipping
INFO 2018-03-10T19:24:16Z: postgres: NOTICE: relation "file_dim" already exists, skipping
INFO 2018-03-11T06:17:17Z: postgres: NOTICE: relation "file_fact" already exists, skipping
INFO 2018-03-11T12:45:39Z: postgres: NOTICE: relation "group_dim" already exists, skipping
INFO 2018-03-11T12:46:42Z: postgres: NOTICE: relation "group_fact" already exists, skipping
INFO 2018-03-11T12:47:39Z: postgres: NOTICE: relation "group_membership_dim" already exists, skipping
INFO 2018-03-11T12:51:41Z: postgres: NOTICE: relation "group_membership_fact" already exists, skipping
INFO 2018-03-11T13:22:56Z: postgres: NOTICE: relation "pseudonym_dim" already exists, skipping
INFO 2018-03-11T13:34:49Z: postgres: NOTICE: relation "pseudonym_fact" already exists, skipping
INFO 2018-03-11T16:10:34Z: postgres: NOTICE: relation "requests" already exists, skipping
INFO 2018-03-11T17:14:55Z: postgres: NOTICE: relation "role_dim" already exists, skipping
INFO 2018-03-11T17:14:55Z: postgres: NOTICE: relation "score_dim" already exists, skipping
INFO 2018-03-11T22:23:30Z: postgres: NOTICE: relation "score_fact" already exists, skipping
INFO 2018-03-12T04:45:31Z: postgres: NOTICE: relation "submission_comment_dim" already exists, skipping
INFO 2018-03-12T05:46:18Z: postgres: NOTICE: relation "submission_comment_fact" already exists, skipping
INFO 2018-03-12T07:34:44Z: postgres: NOTICE: relation "submission_comment_participant_fact" already exists, skipping

At that point I killed it because it's clear something is wrong.

@Mythra
Contributor

Mythra commented Mar 12, 2018

Hey @stuartf ,

There's actually no error output in that, and those log messages can legitimately appear while the program runs. Did you notice any "ERROR" output that led you to stop it, or was there some other sort of problem?

@jlongland
Contributor Author

jlongland commented Mar 12, 2018

Hey @securityinsanity,

Had a few minutes to run a test and gather some resource utilization metrics. I'm running CDL as a container task on ECS, using a d2.xlarge EC2 instance with no other services or tasks running. For the database, I'm using Aurora MySQL on a db.r4.2xlarge. I've sized both quite large because I was concerned I'd initially under-resourced them when I noticed the slow performance.

Tested with an empty schema with skip_historical_imports = true and only_load_final = true

RDS
(three RDS monitoring screenshots)

ECS
(ECS metrics screenshot)

EC2 Instance
(EC2 instance utilization screenshot)

I haven't had a chance to investigate further, but the insert/select throughput tails off, which seems a bit odd. Obviously parallelizing the inserts would be a big improvement, but it seems like something else is amiss here.

As time allows, I'm happy to help troubleshoot this further - CDL is one of the better loaders I've seen for Canvas Data. But for the time being, I'll probably shift approach since I know our data set will eventually be too large for CDL in its current state.

EDIT - After ~2 hours, it has created account_dim. assignment_dim is still in progress with 89k rows.

@stuartf
Contributor

stuartf commented Mar 12, 2018

There were no ERROR messages on that run, but look at the timestamps: it had been running for 3 days. We only have 1 semester of real data at this point.

@stuartf
Contributor

stuartf commented Mar 14, 2018

(screenshot: utilization graphs for the loader's Docker container)
Here are a few graphs showing the utilization of the Docker container running the loader. I bumped the DB connection pool from the default 10 up to 50 connections, suspecting that might be a bottleneck, but saw no improvement. This run is still going; here's what's logged, in case you want to match timestamps to the graph (note the 4-hour offset between local time and UTC).

 INFO 2018-03-14T18:40:01Z: cdl_runner: Setting up API Client...
 INFO 2018-03-14T18:40:01Z: cdl_runner: Connecting to RocksDB Store....
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: f7831135-c126-4b5e-aa92-e882b46bccc8 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: b7bda994-96a7-4893-8916-8e47173b5949 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 749f4337-84f3-428a-a82b-50dea982ab83 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 753a8e8c-6912-4579-9e50-85b5e6bf0c80 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 14e9c4ad-21fe-4ad1-8ee1-c4fc40af8952 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 2d96e8e1-65c6-4240-a14f-baaf8288ac2e due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: db9956e9-e89d-4756-891f-7d3fdc5b6662 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: e0dab072-ac90-47b9-8483-afb2362db9e3 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 0f70216f-8da3-46ce-a751-9a7a5640d866 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 53eec003-4cea-4f98-a64e-84edea7e764d due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 32da3c20-4ea3-4164-a012-e0b140aa5b3d due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 8cd3df66-62cc-4a28-8063-fc65788d6fc2 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 3eab7c8c-8744-46b7-bedc-fe19fd529fee due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 1831a15c-f117-49b4-8c46-2eb3b5897b1e due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 0cfacbb4-2af0-4aa3-bd96-0ed4cd9b7ead due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 23de5eda-53c9-4d1c-96a6-3f19cd5cebd6 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: d2221926-17df-455b-9e41-732511779fb4 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 02fa634a-9e29-40b7-bf8d-043c433225d3 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 80e76380-e518-40ee-ad2d-3d53d6eb70a4 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: b15c9bd2-67aa-4d58-9471-a8a6299fd679 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: cfe32a5a-dad4-4443-a96f-a52c0cbeb41c due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: dde29111-e7cc-4ad4-bcda-cba5eeb28dc5 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 5780478d-8839-4772-8e0d-ba9a7045106d due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 2711cbea-6c30-4c2a-9acc-0cdb9f1ef6fb due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 5232e23b-14b1-4d48-a856-d01cd7f9cdf3 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 27b3cefd-c38f-4985-8fd2-ebbaed13671d due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 1e30ab04-0fa0-4def-85f1-c18976a12e45 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: dab228e8-98d6-43bb-9fa5-f64e9732bea4 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: f1510d88-de08-4a26-b5f3-101790e3bc47 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 11f04540-19e5-4c71-b6fe-3128b07ad064 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 7c601e83-8b26-4634-9bfe-7672ebf09296 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: b6f87c3c-8f6f-4782-be38-990ab175daa3 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 02ffe741-ab43-4de6-ac24-fe53c9f33ad2 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 2b6d2510-ee88-4f2e-95bd-64f8ccd2681e due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 1fd5ccb2-79cc-4afa-9a25-8647d7d29c1e due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 3ef44b56-1f23-4485-b70d-273485111753 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 43365499-5527-4a63-809b-c76f59f41234 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 551046f8-d582-4044-8f5a-79c3b6dbcd43 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 7440f361-5fdf-4df6-91b2-f81d1048628e due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: e6aa0877-eb95-4349-9b92-f685e062b956 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: d8b49479-7b9d-4ef3-baab-60fd64e5f6fa due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 2555d4f7-e918-48a1-9d4e-5b0ee30a67d3 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 682d2792-e161-493a-b476-cb70a100c0ec due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: d072d141-a45a-4b4c-822a-97f9a266677e due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 4544d851-c5d4-40af-8aec-d40760f7cbcb due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 0ea0f52e-4a9b-4354-8d43-7f2b339dcf08 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: fa1c4489-47ea-4b58-aaa5-c91e276f5c67 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 1de99ff4-0fcc-4739-884e-abbe6067338d due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Skipping dump: 2900347e-7ad1-4fcf-9e15-e31b5dd119b0 due to only final selected
 INFO 2018-03-14T18:40:02Z: cdl_runner: Connecting to the DB
 INFO 2018-03-14T18:40:04Z: postgres: NOTICE: relation "account_dim" already exists, skipping
 INFO 2018-03-14T18:40:05Z: postgres: NOTICE: relation "assignment_dim" already exists, skipping
 INFO 2018-03-14T18:45:59Z: postgres: NOTICE: relation "assignment_fact" already exists, skipping
 INFO 2018-03-14T18:49:42Z: postgres: NOTICE: relation "assignment_group_dim" already exists, skipping
 INFO 2018-03-14T18:50:52Z: postgres: NOTICE: relation "assignment_group_fact" already exists, skipping
 INFO 2018-03-14T18:51:52Z: postgres: NOTICE: relation "assignment_group_rule_dim" already exists, skipping
 INFO 2018-03-14T18:52:56Z: postgres: NOTICE: relation "assignment_group_score_dim" already exists, skipping

@jlongland
Contributor Author

@stuartf I tried the same thing. I bumped the pool to 100 and there was no difference. I could see the connections on the database, but it didn't improve the throughput.

I know CDL can perform better as I did a few full loads in January and that's consistent with @securityinsanity's testing. I can only assume there's a change somewhere in the dependency chain that's causing this problem - but I'm not sure I have the time to go down that rabbit hole.

@nmadhok

nmadhok commented Aug 2, 2018

@jlongland @securityinsanity @stuartf Any further update on this? We ran the data loader to load data into MySQL, and after running it for 3-4 days we noticed it had only created the account_dim and assignment_dim tables in the database. We finally had to kill it, as it seemed to be either stuck or in an endless loop:

ERROR 2018-08-02T00:38:51Z: cdl_runner::db_client: insert error
ERROR 2018-08-02T00:38:51Z: cdl_runner::db_client: MySqlError { ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x8C\xA0\xF0\x9F...' for column 'description' at row 1 }
ERROR 2018-08-02T00:38:51Z: cdl_runner::importer: process -> for line in finalized_string -> !is_volatile -> ins_res -> is_err
ERROR 2018-08-02T00:38:51Z: cdl_runner::importer: Error(MysqlErr, State { next_error: None, backtrace: None })
 INFO 2018-08-02T00:38:52Z: cdl_runner: Connecting to the DB
ERROR 2018-08-02T04:19:29Z: cdl_runner::db_client: insert error
ERROR 2018-08-02T04:19:29Z: cdl_runner::db_client: MySqlError { ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x91\x8C</...' for column 'description' at row 1 }
ERROR 2018-08-02T04:19:29Z: cdl_runner::importer: process -> for line in finalized_string -> !is_volatile -> ins_res -> is_err
ERROR 2018-08-02T04:19:29Z: cdl_runner::importer: Error(MysqlErr, State { next_error: None, backtrace: None })
 INFO 2018-08-02T04:19:30Z: cdl_runner: Connecting to the DB
ERROR 2018-08-02T05:58:51Z: cdl_runner::db_client: insert error
ERROR 2018-08-02T05:58:51Z: cdl_runner::db_client: MySqlError { ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x91\x80</...' for column 'description' at row 1 }
ERROR 2018-08-02T05:58:51Z: cdl_runner::importer: process -> for line in finalized_string -> !is_volatile -> ins_res -> is_err
ERROR 2018-08-02T05:58:51Z: cdl_runner::importer: Error(MysqlErr, State { next_error: None, backtrace: None })
 INFO 2018-08-02T05:58:52Z: cdl_runner: Connecting to the DB
ERROR 2018-08-02T09:10:38Z: cdl_runner::db_client: insert error
ERROR 2018-08-02T09:10:38Z: cdl_runner::db_client: MySqlError { ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x89' for column 'title' at row 1 }
ERROR 2018-08-02T09:10:38Z: cdl_runner::importer: process -> for line in finalized_string -> !is_volatile -> ins_res -> is_err
ERROR 2018-08-02T09:10:38Z: cdl_runner::importer: Error(MysqlErr, State { next_error: None, backtrace: None })
 INFO 2018-08-02T09:10:39Z: cdl_runner: Connecting to the DB

We had skip_historical_imports = true and only_load_final = true. This almost makes the tool useless, and I'm wondering if anyone has actually gotten it to work.
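
A likely culprit for those ERROR 1366 messages: \xF0\x9F is the start of a 4-byte UTF-8 sequence (an emoji), and MySQL's legacy utf8 charset stores at most 3 bytes per character, so the affected columns generally need utf8mb4. A hypothetical fix, sketched with the mysql crate; the table name is illustrative and would need to match the failing column:

    use mysql::prelude::Queryable;

    // Convert a table (here assignment_dim, as an example) to utf8mb4 so
    // 4-byte characters such as emoji can be stored.
    fn widen_charset(conn: &mut mysql::Conn) -> Result<(), mysql::Error> {
        conn.query_drop(
            "ALTER TABLE assignment_dim \
             CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci",
        )
    }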

@stuartf
Contributor

stuartf commented Aug 2, 2018

We gave up trying to fix this: the tool isn't supported upstream, and we don't have any other Rust projects, so we're working on a custom solution that we'll be able to maintain. I can't say for sure right now whether we'll publish it publicly.

@UniversalSuperBox

I see that this thread has petered out a bit, but I wanted to add my experience to the ring.

I'm running CDL in a VM on a rather middling PC: an i5-6500T (4 cores to the VM), 8GB of RAM (3GB assigned to the VM), a SATA SSD...

I've noticed a few things:

  • CDL is not using much CPU at all. It's actually sitting at about 7% of one CPU. The VM in total is taking about 15% of its allotted CPU.
  • System load is at 1.00
  • The postgres workers are going to town on something.

The postgres workers being busy is not unprecedented given what CDL is doing, but I would expect it to be loading data much faster than this. A bodge solution I hacked together using information from the Canvas community (basically just dropping and re-adding all the tables via CSV imports) completed within minutes, while I've been waiting on this for over an hour and it's only on course_ui_navigation_item_fact.

I investigated a little further, checking out the queries that each of the runners was working on. I found some interesting data: https://paste.ubuntu.com/p/rkN8Bf5NW7/ (account identifier at the start of all data replaced with 555555). It seems like every single INSERT is accompanied by a matching DELETE. I suspect these DELETEs are what's taking most of the DB's time, since they have to search through a table that only gets longer as we continue to insert more data.

Looks like the DELETE is a required part of the operation for non-volatile tables in the code.
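
In other words, each row appears to go through something like this hypothetical sketch (illustrative table/column names, using the postgres crate):

    use postgres::{Client, Error};

    // The per-row pattern visible in the paste: a DELETE paired with each
    // INSERT, each statement its own round trip to the database.
    fn upsert_row(client: &mut Client, id: i64, name: &str) -> Result<(), Error> {
        client.execute("DELETE FROM account_dim WHERE id = $1", &[&id])?;
        client.execute(
            "INSERT INTO account_dim (id, name) VALUES ($1, $2)",
            &[&id, &name],
        )?;
        Ok(())
    }

With Postgres, INSERT ... ON CONFLICT (id) DO UPDATE could fold the two statements into one, though it requires a unique constraint on the key.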

@Mythra
Contributor

Mythra commented Oct 4, 2018

Yes, the DELETEs do take a while. This was a specifically requested feature of the CDL when we were rolling it out: users wanted their data to always be queryable, even if it was a little out of date. As such, if we're dealing with a table that doesn't give us a set of consistent ids, we delete rows one by one.

The biggest bottlenecks of the loader are really:

  • The fact that we have to iterate through every record to determine whether we need to delete a specific value.
  • The fact that we don't batch DELETEs/INSERTs at all (see the sketch at the end of this comment).
  • The fact that we're processing tables one at a time, without threading out.

All of these facts slow it down considerably, and we've so far avoided fixing these bottlenecks because we don't want people to rely on this full-time for working with Canvas Data. It's meant to be an example application showing users how to interact with the API, and how they could do things like "avoid historical refreshes".

One day down the road we could fix these and turn the CDL into a fully supported solution, but that would have to be a product decision.
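
To illustrate the batching point above: collapsing many single-row INSERTs into one statement and one round trip could look like this hypothetical sketch (not the CDL's code; postgres crate, illustrative names):

    use postgres::types::ToSql;
    use postgres::{Client, Error};

    fn insert_batch(client: &mut Client, rows: &[(i64, String)]) -> Result<u64, Error> {
        if rows.is_empty() {
            return Ok(0);
        }
        // Builds: INSERT INTO account_dim (id, name) VALUES ($1,$2),($3,$4),...
        let mut sql = String::from("INSERT INTO account_dim (id, name) VALUES ");
        let mut params: Vec<&(dyn ToSql + Sync)> = Vec::new();
        for (i, (id, name)) in rows.iter().enumerate() {
            if i > 0 {
                sql.push(',');
            }
            sql.push_str(&format!("(${},${})", 2 * i + 1, 2 * i + 2));
            params.push(id);
            params.push(name);
        }
        // One statement for the whole batch instead of one per row.
        client.execute(sql.as_str(), &params)
    }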

@Mythra
Contributor

Mythra commented Oct 15, 2018

I was able to get some time to work on this during hackweek and made some updates. Specifically:

  • Parallel processing of tables (multiple tables are now processed at a time).
  • A new all_tables_volatile flag, which skips the "DELETE" call on each insert and just performs an INSERT.

This doesn't fix all the potential footguns (reading each file entirely into memory and decompressing it, unnecessary clones, etc.). However, it should make this a slightly more useful example.
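
Conceptually, the new flag just short-circuits the per-row DELETE; a hypothetical sketch (illustrative names, postgres crate):

    use postgres::{Client, Error};

    fn load_row(
        client: &mut Client,
        all_tables_volatile: bool,
        id: i64,
        name: &str,
    ) -> Result<(), Error> {
        if !all_tables_volatile {
            // Old behavior: replace rows in place so the table stays queryable.
            client.execute("DELETE FROM account_dim WHERE id = $1", &[&id])?;
        }
        client.execute(
            "INSERT INTO account_dim (id, name) VALUES ($1, $2)",
            &[&id, &name],
        )?;
        Ok(())
    }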

@latifatbraimah

Hello @securityinsanity, @UniversalSuperBox made a remark about finding that some id columns get padded with 555555. I believe this is why I get the following error when I run this command:
[root@localhost home]# RUST_LOG=info ./target/release/cdl-runner

Error: 
 INFO 2018-11-13T16:42:52Z: cdl_runner: Setting up API Client...
 INFO 2018-11-13T16:42:52Z: cdl_runner: Connecting to RocksDB Store....
 INFO 2018-11-13T16:42:52Z: cdl_runner: Skipping dump: f2861f59-f11c-4c97-8d09-c349e7f4f3f6 
 INFO 2018-11-13T16:42:52Z: cdl_runner: Skipping dump: 259dfc10-69b7-4f9f-bd89-013082e6c1ef due to only final selected
 INFO 2018-11-13T16:42:52Z: cdl_runner: Skipping dump: a8d4271f-2013-4ff0-8813-b8aa79e85e0b due to only final selected
 INFO 2018-11-13T16:42:53Z: cdl_runner: Connecting to the DB
ERROR 2018-11-13T16:50:35Z: cdl_runner::db_client: insert error
ERROR 2018-11-13T16:50:35Z: cdl_runner::db_client: MySqlError { ERROR 1264 (22003): Out of range value for column 'position' at row 1 }
ERROR 2018-11-13T16:50:35Z: cdl_runner::importer: process -> for line in finalized_string -> !is_volatile -> ins_res -> is_err
ERROR 2018-11-13T16:50:35Z: cdl_runner::importer: Error(MysqlErr, State { next_error: None, backtrace: None })
 INFO 2018-11-13T16:50:35Z: cdl_runner: Done!

My question is: Is there a fix for this?

@UniversalSuperBox

Enabling all_tables_volatile is a vast performance gain. Without it turned on, I just had a run take 75 hours to complete. Turning on volatility made my next run take 1.5 hours.

For context, my /tmp/cdl-save is 1.8G
