Performance improvement? #86
Shooter3k (13 Oct 2021):
Is there any way to improve the performance? When using -d 2 or higher, crawl times seem to balloon, to the point where runs take days and always end with me killing the task.
I'm not 100% sure what it's doing, but perhaps a way for it to make incremental check-ins to the output file, or to report its current progress (what it thinks so far), might solve the issue? Or perhaps the index just gets too big for it to process efficiently?
In any case, I'm looking forward to what people suggest.
digininja:
There are two problems: the app can't know the size of the site before it starts, so it can't show any kind of progress bar, and the app is single threaded, so every extra page adds time.
The way to improve it would be to rewrite it to be multithreaded and to separate the page parsing from the spider, so the spider can go as fast as it can and just throw all the pages at a parser, which can then slowly chomp through them.
I have considered rewriting it a few times but have never had the time.
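[Editor's note: a minimal sketch of the split being described, in Ruby since that is what CeWL is written in. This is not CeWL's actual code; the seed URL, the word-extraction regex, and all names are placeholders. The spider thread pushes page bodies onto a thread-safe queue while a parser thread chomps through them at its own pace.]

```ruby
require "net/http"
require "uri"

pages = Queue.new  # thread-safe FIFO shared by the two threads

spider = Thread.new do
  to_visit = ["https://example.com/"]       # hypothetical seed URL
  until to_visit.empty?
    url  = to_visit.shift
    body = Net::HTTP.get(URI(url))          # fetch at full speed
    pages << body                           # hand the page off, keep crawling
    # ... real link extraction would push new URLs onto to_visit here ...
  end
  pages << :done                            # sentinel: nothing more is coming
end

parser = Thread.new do
  words = Hash.new(0)
  while (body = pages.pop) != :done         # blocks until a page is available
    body.scan(/[a-zA-Z]+/) { |w| words[w.downcase] += 1 }   # the slow part
  end
  words.sort_by { |_, n| -n }.first(10).each { |w, n| puts "#{w} #{n}" }
end

[spider, parser].each(&:join)
```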
Shooter3k (13 Oct 2021):
I had a couple of thoughts about this that I'd like to throw out there.
1. I use an application called "Screaming Frog SEO Spider" and it has a progress bar that basically shows how much has been crawled out of how much the spider has found so far. Even in a single thread, the progress bar is constantly bouncing around a bit, but it lets you judge (roughly) where things are going. In other words, if the spider finds 10,000 pages in 2 seconds and then 30,000 in 6 seconds, you know it's going to take a really long time, whereas if it finds 6 pages in 2 seconds and then 12 pages in 6 seconds, it's probably not going to take very long. Here is a little screenshot of what their progress bar looks like a few seconds after starting a crawl, where you can already guess it's probably going to take a long time:
[Screenshot: <https://user-images.githubusercontent.com/10244114/137137575-7ce9b7c6-d407-4bf4-b9e3-82889d3fc93e.png>]
2. The second thought would be to add an optional parameter that has the spider dump its results to a file instead of indexing/crawling them. That would give the user the option to crawl them individually (likely running CeWL multiple times) on their own. If you really wanted to go the extra mile, you could also add an option for CeWL to crawl the results from the created file at a later time.
Overall, IMO, any option that provides some sort of progress, even an arbitrary "I'm still running, this is how much I've done so far and this is how much I think I still need to do", would be helpful. Right now, using -v or --debug is the only way to validate that it's still crawling and not hung up somewhere.
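[Editor's note: a "crawled X of Y found so far" display like that needs nothing beyond two counters the spider already maintains. A rough, self-contained sketch of the idea, with a simulated crawl standing in for real fetching; none of these names are CeWL internals.]

```ruby
# The percentage bounces around as new pages are discovered, but the
# trend shows whether the run will finish in seconds or take days.
def print_progress(crawled, discovered)
  pct = (100.0 * crawled / discovered).round(1)
  print "\rcrawled #{crawled} of #{discovered} found so far (#{pct}%) "
end

crawled    = 0
discovered = 1                                 # the seed URL counts
while crawled < discovered
  crawled    += 1
  discovered += rand(0..3) if discovered < 40  # stand-in for link extraction
  print_progress(crawled, discovered)
  sleep 0.05
end
puts
```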
digininja:
I could include something that said how much has been done so far, but there is no way to guess how much there is left to do, and as the parsing is done as the spider returns each page, each page hit is also a parse hit, so I couldn't say "hit X pages, parsed Y". The only way to do that would be to split the parser out of the spider and queue parsing as separate jobs. That would allow the spider to go at full speed, but it would require either writing each file to storage or a lot of memory, as each page would need to be cached somewhere until it is parsed.
For the second idea, saving the pages out is possible, but that would still require rewriting the main app to split the parser out of the spider so it could then be run on its own over the files.
The only other way to get a progress bar that means anything would be to run the spider at full speed on its own first to get an idea of the number of pages, then run the two combined. The problem with that is that some users would fire it off, the spider would get stuck in a loop somewhere, and they would never get any data back, as the actual parsing would never happen.
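[Editor's note: the save-to-storage variant described above would look roughly like the sketch below: the spider writes each fetched page to a spool directory, and a separate parsing pass walks the files whenever it is run. Purely illustrative; the directory name, seed URLs, and file layout are all assumptions, not CeWL features.]

```ruby
require "net/http"
require "uri"
require "fileutils"

SPOOL = "cewl_spool"              # hypothetical cache directory
FileUtils.mkdir_p(SPOOL)

# Phase 1: spider at full speed, writing each page body straight to disk
# so nothing has to sit in memory waiting to be parsed.
seeds = ["https://example.com/", "https://example.com/about"]  # placeholders
seeds.each_with_index do |url, i|
  body = Net::HTTP.get(URI(url))
  File.write(File.join(SPOOL, "page_#{i}.html"), body)
end

# Phase 2: run the parser over the spooled files on its own, any time later.
words = Hash.new(0)
Dir.glob(File.join(SPOOL, "*.html")).each do |path|
  File.read(path).scan(/[a-zA-Z]+/) { |w| words[w.downcase] += 1 }
end
puts "parsed #{words.size} unique words"
```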
Shooter3k:
Well, any options you're willing to add would be greatly appreciated. I love the app and use it a lot.
w4po (14 Oct 2023):
Hello @digininja,
I am trying to scrape the Ironman website to solve the last challenge of Cracking JWT keys (Obscure) <https://authlab.digi.ninja/JWT_Cracking>.
But the CeWL tool is really slow. In fact, it just sits there without making any requests, then after an hour or so it continues for a bit and then goes idle again, repeatedly. I had to hibernate my PC twice instead of shutting it down to keep the tool working.
I am using the latest version, CeWL 6.1 (Max Length), on Windows 11.
I had to use a proxy to monitor the work, as there is no indication of activity in the tool itself (it would be Cewl to show some kind of progress).
[Screenshots: the command being run, Task Manager, and the proxy traffic.]
Thanks for the Awesome Auth lab challenges.
digininja:
I've never used it on Windows so I don't know the base performance levels, but it shouldn't be that slow. I'll see if I can give it a run against the site later and see what speed I get.
w4po:
The same thing happens in WSL 2.0 Ubuntu. It's extremely slow; I think it starts at 1 or 2 requests per second, then the more requests it gathers, the slower it becomes. Maybe it's doing some comparison of the new words with the old ones to handle duplicates?
digininja:
I've just installed CeWL in Ubuntu in WSL2 and, against my site, it is making tens of requests per second. I've also checked in native Ubuntu and Debian and they are the same, tens per second.
That is in line with what it has historically done, so I'd guess it is your system that is having problems.
w4po:
It might be an issue with my system, even though I have a reasonably good one. I've conducted some additional tests on https://www.ironman.com, and I also tested it on your site, https://digi.ninja/. The slowdown shows up during the "Offsite link, not following: ..." phase, so I think the bottleneck is somewhere in the checking phase.
PS: I'm struggling with the JWT cracking Obscure level. Can you provide any hints?