-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/listing #2
base: master
Are you sure you want to change the base?
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
@googlebot I signed it! |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
Thanks a lot, that sounds brilliant. I'll try having a look this week-end. |
@daneroo I'm going to have a look, but it's probably going to take me several days to go through it all. Do you prefer that I send comments in small matches as I go along, to make it like a conversation with a back and forth, or do you prefer that I try to review it all once and send the whole first round of comments once I'm fully done? And btw, I propose that we keep this PR as a draft and a discussion point, but we go through each interesting feature (for example starting with changes to navLeft), and for each of them you make a new PR as we go along. Works for you? |
I agree with making smaller changes, and using this PR and its comments as a draft. With respect to the loop in navLeft, the loop became necessary when I was skipping the actual downloading (-list option), where the navLeft was essentially in a tight loop, and the navigation did not always occur as expected. I can verify this again, and/or see how the control could be moved up to navN if it's required (to keep navLeft simpler). Thanks for bearing with me as I get used to this workflow. Let me see how I can break this up into smaller pieces, and I can comment back here. |
Cool.
Ok. I think for now we don't care about -list. So please do check if the changes to navLeft are needed for you without -list. And if yes, let's make that our first PR (with whatever other related changes in navN). If it turns out the changes to navLeft are not needed, let me know, and we'll get to the next interesting piece.
Don't worry, this is a very nice first contribution. It may take some time, but we'll get all the interesting bits in. :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to navLeft are not needed by themselves.
The reason I was concerned with traversal speed and timing in general, was to support these ideas:
- General speed increase (for >10k-100k items)
- Full traversal, but only download missing items (optimistic) - not assuming new files are at the top of the main page. i.e.
.lastdone
may not be enough to capture the state. - Full traversal, download and verify the integrity of previous items (pessimistic).
- I also attempted to have several worker tabs/windows to parallelize download - not promising for now.. 8-(
I realize most of these are probably out of scope for what you wanted, so I have started experimenting with different flows in another repo (using puppeteer/node.js), just to test the ideas.
I think perhaps the only thing of use would be the -headless
flag.
Do you have anything else in mind?
I think we're interested in that one. Say a friend suddenly shares photos with you. Now they're in your library, but they're not necessarily at the top of the feed. So we need a reliable way to do a rerun and get only these photos.
I don't know yet, I only had a focused look at navLeft so far. I'll have another go at the whole thing and let you know. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the very slow feedback. I should be able to do better from now on.
Summary:
We can then discuss what other feature is most useful if any:
|
Notes for PR(daneroo) feature/listing branch
I don't expect this to be merged as-is. I mostly wanted to share my experiments and get feedback.
If any of it is useful, I can craft proper commits with smaller features.
I am a first-time contributor, and not very experienced with Go, so would appreciate any feedback both on content and process.
Overview
When I first started to use the project, I experienced instability and would need to restart the download process a large number of times to get to the end. The process would timeout after 1000-4000 images. To test, I use two accounts one with ~700 images, and one with 31k images.
The timeouts were mostly coming from the
navLeft()
iterator, so I disabled actual downloading and focused on traversal. (-list
option). I also added an option (-all
) to bypass the incremental behavior of$dldir/.lastDone
. I then added-vt (VerboseTiming
, to isolate timing metrics reporting specifically.The last main issue was the degradation of performance over time. Probably due to resource leaks in the webapp, especially when iterating at high speed. I found that the simplest solution to this was simply to force a reload every 1000 iterations. This typically takes about
1s
and avoids the10X
slowdown I was observing.The following are details of each of the major changes.
navLeft()
By adding the
-list
option, its functionality is isolated.Remove
WaitReady("body",..)
, as it has no effect (The document stays loaded when we navigate from image to image), and replace it with a loop (untillocation != prevLocation
).When only listing, this is the critical part of the loop, so I spent some time experimenting with the timing of the chromedp interactions to optimize total throughput. What seemed to work best was an exponential backoff, so that we can benefit from the most likely fast navigation, but don't overtax the browser when we experience the occasional longer navigation delay (which can go up to multiple seconds).
We also pass in
prevLoaction
fromnavN()
, to support the newnavLeft()
loop, to optimize the throughput.navN()
-list
)batchSize=1000
) items.Termination Criteria
We explicitly fetch the lastPhoto, on the main album page (immediately after authentication), and rely on that, instead of
navLeft()
failing to navigate (location == prevLocation
). The lastPhoto is determined with a CSS selector evaluation (a[href^="./photo/"]
), which is fast and should be stable over time, as it uses the<a href=.. >
detail navigation mechanism of the main album page. This opens another possibility for iteration on the album page itself. (seelistFromAlbum
below)Another issue, although the
.lasDone
being captured on a successful run is useful, as photos are listed in EXIF date order. If older photos are added to the album they would not appear in subsequent runs. So it would be useful in general to rerun over the entire album (-all
).Headless
If we don't need the authentication flow, (and we persist the auth creds in the profile
-dev
), Chrome can be brought up in-headless
mode, which considerably reduces memory footprint (>2Gb
to<500Mb
), and marginally (23%) increases throughput for the listing experiment.listFromAlbum()
This is another experiment where the listing was entirely performed in the main album page (by scrolling incrementally). This is even faster. It would also allow iterating either forward or backward through the album.
In a further experiment, I would like to use this process as a coordinating mechanism and perform the actual downloads in separate (potentially multiple)
tabs/contexts
.Performance: An Argument for a periodic page reload and headless mode
Notice how the latency grows quickly without page reload (ms per iteration): [66, 74, 89, 219, 350, 1008,...], and the cumulative rate drops below
4 items/s
after6000
items.Whereas with page reloading every 1000 images, it is stable at
<60ms
with a sustained rate of17 items/s
:When invoked in
-headless
mode we can reduce latency to~47ms
or21 items/s
, also this reduces memory footprint from>2Gb
to about500Mb
When using
listFromAlbum()
, we can roughly double the iterations speed again to38 items/s
or25ms/item
All of these times were obtained on a MacBook Air (2015) / 8G RAM.
Future
navLeft()
- throw error if loc==prevLoc-verify
,-incr/-optimimistic