Expected Performance? #6
Comments
Hey @jlongland , To be honest there hasn't been too much production testing for the Canvas-Data-Loader, since it's meant to be an example, or an interim solution. Because of that, the loader is kept relatively slow for debugging's sake (e.g. we don't process rows of files in parallel even though we very easily could), so it's easier to figure out where something went wrong. Since Canvas Data dumps are posted once every 24 hours, a dump should take less than 24 hours to import no matter the size of the account. That being said, the CDL does load quite a bit into memory (and is pretty CPU heavy), so this assumes a decent CPU and a decent amount of memory. When doing our testing at InstructureCon for the reveal of this product, with a t2.large for the loader and a db.m3.xlarge for the database, we were able to import over 500GB of data into 3 Postgres schemas in under 24 hours. If you do see import times greater than 4 hours on a machine with two CPU cores and at least 4GB of RAM with those options, and a beefy database, we should probably take a look into it.
Thanks for the quick reply @securityinsanity Fully appreciate that it's an example/interim solution. Since it's written by someone who works on the product, I like how it deals with some of the nuances of Canvas Data, so I was hoping it might be a reasonable interim solution until we've sorted out our longer-term data lake approach. I think I'll shift my approach to S3 and Redshift Spectrum rather than putting more time into an interim solution, but I'll try to circle back at some point to help out with this project. I put together a CloudFormation stack that runs CDL as a scheduled container task against RDS. Perhaps it might be useful to someone, especially if there were some parallel processing - though I understand the concerns about troubleshooting in that situation.
I've noticed it's quite slow as well. As someone quite new to Canvas Data I'm not clear on what exactly I miss by using those options. It doesn't look like it's saturating CPU, memory, or network on either the machine running Rust or the Postgres DB. If parallelizing the row file processing is easy but makes debugging hard, what about a runtime option for disabling the parallelization? That way if things go weird you just set that option and fall back to sequential processing.
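Something like the sketch below is the kind of toggle I have in mind - just an illustration using the rayon crate, with a made-up flag and a placeholder insert_row helper, not CDL's actual code:

```rust
// Hypothetical sketch of a config-driven switch between sequential and
// parallel row processing, assuming the `rayon` crate. Not CDL's real code.
use rayon::prelude::*;

fn process_rows(rows: &[String], parallel: bool) {
    if parallel {
        // Rows are handled across a thread pool; log output is no longer in
        // row order, which is what makes debugging harder.
        rows.par_iter().for_each(|row| insert_row(row));
    } else {
        // The sequential path keeps log output deterministic for troubleshooting.
        rows.iter().for_each(|row| insert_row(row));
    }
}

fn insert_row(row: &str) {
    // Placeholder for the real "parse row, issue INSERT" logic.
    println!("inserting: {}", row);
}
```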
The Canvas Data docs make vague references to historical dumps, but it's not clearly stated anywhere that I've found. As you start to look at request data, you'll see that the incremental request data from the nightly dumps may have gaps. We've seen some rather large gaps - I'm not sure if that's just us or if it's more widespread. In addition to the nightly dumps, there's a monthly historical dump that contains all request files. You'll usually see it a couple of hours after your normal daily dump. It's common practice for Canvas Data users to drop the requests table and reload using the historical dump. This is what skip_historical_imports skips. My understanding is that only_load_final means the loader only imports the most recent dump rather than every dump it finds. Echoing the same observation: I'm not seeing any resource saturation while running the loader on the host or the database. There's a small CPU spike as all of the files are downloaded, but then it tails off quickly. During my last test, I was seeing 7 inserts per second.
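To make the drop-and-reload pattern concrete, here's a rough sketch using the postgres and flate2 crates - the table name, COPY format, and single-file handling are my own assumptions about how you'd wire it up, not code taken from CDL:

```rust
// Sketch of "truncate requests, then reload it from a historical dump file".
// Assumes a gzipped, tab-separated file that matches the table's columns.
use std::fs::File;
use std::io::{self, BufReader};

use flate2::read::GzDecoder;
use postgres::Client;

fn reload_requests(client: &mut Client, historical_file: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Throw away the (possibly gappy) incremental request data first.
    client.batch_execute("TRUNCATE TABLE requests")?;

    // Stream the gzipped historical file straight into a COPY, so the whole
    // file never has to sit decompressed in memory.
    let mut reader = GzDecoder::new(BufReader::new(File::open(historical_file)?));
    let mut writer = client.copy_in("COPY requests FROM STDIN")?;
    io::copy(&mut reader, &mut writer)?;
    writer.finish()?;
    Ok(())
}
```

In practice the historical dump is split across many files, so you'd loop this over each of them.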
Hey, So to clear some things up first: @jlongland is pretty much right on the money. Essentially, request data is only exported a day at a time (e.g. each day we export one day's worth of request data, not all the data that's ever existed like for the rest of the tables). About once every two months (though that timing is variable), we do a "historical refresh". A historical refresh is where we do export all request data present since the beginning of your instance or March 2014 (whichever comes later). We've heard from a lot of customers that they don't actually want this in their day-to-day import since it can take much longer, and so we have an option to just skip downloading it entirely. As for only_load_final, @jlongland is again on the money. Taking over 15 days to complete an import sounds very wrong on any decently powered hardware, even for importing all 20+ days' worth of dumps. What's the size of your DB? What's the size of the instance? What school are you with?
I'm with Georgia Tech. The machine running the loader is a 4-core 2.7 GHz box with 8GB of RAM; the Docker container there says it's using ~6% CPU and a few hundred MB of RAM, though. Postgres is running on an 8-core 2.6 GHz machine with 16GB of RAM (OS showing 5GB used), with 5 Postgres processes using about 15% of a CPU each. These machines are both connected to gigabit ethernet, but the network is nowhere near saturated either.
I tried running with
At that point I killed it because it's clear something is wrong.
Hey @stuartf , There's actually no error output in that, and those log messages can legitimately appear while running the program. Did you actually notice any "ERROR" output that led you to stop it, or was there any other sort of problem?
Hey @securityinsanity, Had a few minutes to run a test and gather some resource utilization metrics. I'm running CDL as a container task on ECS using a d2.xlarge EC2 instance with no other services or tasks running. For the database, I'm using Aurora MySQL on a db.r4.2xlarge. I've sized both quite large because I was concerned I'd initially under-resourced both when I noticed the slow performance. I tested with an empty schema using the same options as in my original post. I haven't had a chance to investigate further, but the insert/select throughput tails off - which seems a bit odd. Obviously parallelizing the inserts would be a big improvement, but it seems like something else is amiss here. As time allows, I'm happy to help troubleshoot this further - CDL is one of the better loaders I've seen for Canvas Data. But for the time being, I'll probably shift approach since I know our data set will eventually be too large for CDL in its current state. EDIT - After ~2 hours, it's created account_dim. assignment_dim is still in progress with 89k rows.
There were no ERROR messages.
@stuartf I tried the same thing. I bumped the pool to 100 and there was no difference. I could see the connections on the database, but it didn't improve the throughput. I know CDL can perform better, as I did a few full loads in January and that's consistent with @securityinsanity's testing. I can only assume there's a change somewhere in the dependency chain that's causing this problem - but I'm not sure I have the time to go down that rabbit hole.
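For what it's worth, a bigger pool on its own shouldn't change anything if only one worker ever pulls from it. Roughly - and this assumes an r2d2-style pool, which may or may not be what CDL actually uses internally:

```rust
// Illustration: a pool with max_size(100) only adds idle connections unless
// multiple workers check connections out concurrently. Table name is made up.
use r2d2_postgres::{postgres::NoTls, PostgresConnectionManager};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let manager = PostgresConnectionManager::new(
        "host=localhost user=canvas dbname=canvas_data".parse()?,
        NoTls,
    );
    let pool = r2d2::Pool::builder().max_size(100).build(manager)?;

    // A single sequential loop like this only ever holds one connection, so
    // the other 99 sit idle and insert throughput is unchanged.
    let mut conn = pool.get()?;
    for i in 0..1000i64 {
        conn.execute("INSERT INTO example_table (id) VALUES ($1)", &[&i])?;
    }
    Ok(())
}
```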
@jlongland @securityinsanity @stuartf Any further update on this? We ran the data loader to load data into MySQL, and after running it for 3-4 days we noticed it had only created a few of the tables.
We had
We gave up trying to fix this, as the tool isn't supported upstream and we don't have any other Rust projects. We're working on a custom solution that we'll be able to maintain; I can't say for sure right now whether we'll publish it publicly.
I see that this thread has petered out a bit, but I wanted to throw my experience into the ring. I'm running CDL on a VM on a rather middling PC: an i5-6500T (4 cores to the VM), 8GB of RAM (3GB assigned to the VM), and a SATA SSD. I've noticed a few things:
The Postgres workers being busy is not surprising given what CDL is doing, but I would expect it to be loading data much faster than this. A bodge solution I hacked together using information from the Canvas community (basically just dropping and re-adding all the tables via CSV imports) completed within minutes, while I've been waiting on this for over an hour and it's only on course_ui_navigation_item_fact. I investigated a little further, checking out the queries that each of the runners was working on, and found some interesting data: https://paste.ubuntu.com/p/rkN8Bf5NW7/ (the account identifier at the start of all data replaced with 555555). It seems like every single INSERT is accompanied by a matching DELETE. I suspect these DELETEs are what's taking most of the DB's time, needing to search through a table that's only getting longer as we continue to insert more data. It looks like the DELETE is a required part of the operation for non-volatile tables in the code.
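Concretely, the pattern in the paste looks roughly like the first function below, and I'd naively have expected something more like the upsert in the second - table and columns are simplified here, and the upsert is just my speculation about an alternative, not anything CDL actually does:

```rust
// Illustration only: per-row DELETE + INSERT versus a Postgres upsert.
use postgres::{Client, Transaction};

// Roughly what the runners appear to be doing today: one DELETE and one
// INSERT per row, where the DELETE has to find the existing row first.
fn replace_row(tx: &mut Transaction, id: i64, name: &str) -> Result<(), postgres::Error> {
    tx.execute("DELETE FROM account_dim WHERE id = $1", &[&id])?;
    tx.execute("INSERT INTO account_dim (id, name) VALUES ($1, $2)", &[&id, &name])?;
    Ok(())
}

// A possible alternative: let Postgres handle "replace if it exists" in a
// single statement keyed on the primary key.
fn upsert_row(client: &mut Client, id: i64, name: &str) -> Result<(), postgres::Error> {
    client.execute(
        "INSERT INTO account_dim (id, name) VALUES ($1, $2) \
         ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name",
        &[&id, &name],
    )?;
    Ok(())
}
```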
Yes. The DELETEs do take a while. This was a specifically requested feature of the CDL when we were rolling it out: users wanted their data to always be queryable, even if it was a little out of date. As such, if we're dealing with a table that doesn't give us a set of consistent IDs, we delete rows one by one. The biggest bottlenecks of the loader are really:
All of these facts slow it down considerably, and we've so far avoided fixing these bottlenecks because we don't want people to rely on this as a full-time way of working with Canvas Data. It's meant to be an example application showing users how to interact with the API, and how they could do things like avoid historical refreshes. We could, one day down the road, fix these bottlenecks and turn the CDL into a fully supported solution, but that would have to be a product decision.
I was able to get some time to work on this during hackweek and made some updates. Specifically:
This doesn't fix all the potential footguns (reading each file entirely into memory before decompressing, unnecessary clones, etc.), but it should make this a slightly more useful example.
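For anyone curious, the streaming alternative to the "read the whole file into memory and decompress" footgun looks roughly like this - assuming gzipped, line-oriented dump files and the flate2 crate; it's a sketch, not the hackweek change itself:

```rust
// Sketch: decompress lazily while iterating lines, so memory use stays
// roughly constant no matter how large the decompressed file is.
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::read::GzDecoder;

fn for_each_row(path: &str, mut handle: impl FnMut(&str)) -> std::io::Result<()> {
    let reader = BufReader::new(GzDecoder::new(File::open(path)?));
    for line in reader.lines() {
        handle(&line?);
    }
    Ok(())
}
```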
Hello @securityinsanity, @UniversalSuperB made a remark about finding that some id columns get padded with 555555. I believe this is why I got the following error when I tried to run this command:
My question is: Is there a fix for this?
Enabling For context, my
Not really an issue, just more of a question. What is the expected performance of the data loader and what size of dumps has it been tested against?
I'm testing using `skip_historical_imports = true` and `only_load_final = true`, which amounts to a 1.1 GB dump for our institution. I've yet to successfully complete a full load, locally or on AWS. My local host is a very dated MacBook Pro, so I decided to try on AWS. Even with the loader running on a d2.xlarge and an RDS instance on db.r4.2xlarge, the loads are too slow to be useful.
Before I investigate too much further, I figured I'd touch base to find out more about how CDL has been tested.
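For reference, here's a minimal sketch of how those two settings could be modeled when deserializing the loader's config - the struct, crates, and comments are my own illustration based on the options above, not CDL's actual config code:

```rust
// Hypothetical settings struct; field names come from this issue, and the
// interpretation in the comments is an assumption, not documented behavior.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct LoaderSettings {
    // Skip the large monthly "historical refresh" dumps.
    skip_historical_imports: bool,
    // Only import the most recent dump rather than every available one.
    only_load_final: bool,
}

fn main() {
    let toml_text = "skip_historical_imports = true\nonly_load_final = true\n";
    let settings: LoaderSettings = toml::from_str(toml_text).expect("valid settings");
    println!("{:?}", settings);
}
```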