This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Crawl Job Settings Documentation for 3.4.0 #343
Comments
Hey @mutlurasit, can you elaborate a little on what you are trying to do? The cxml file is essentially the configuration for a "type" of crawl: it describes how a crawl is set up to run.
Thanks for your reply, @hennekey. I am trying to do things like:
@mutlurasit Are you code-savvy? You can implement custom processors to replace the ones you see in the cxml file to achieve custom behavior, like creating separate WARC files per domain. I do not believe that the existing code contains that logic. Annual recrawling would be best accomplished with a cron schedule, I think, unless you want to keep the process running (and expect it to do so without issue) for the whole duration.
Thanks again, @hennekey. I can't say I am that competent with coding, but I would appreciate it if you could suggest any source I can dig around in. Do you know which part of the cxml (if any) deals with WARC creation? Regarding scheduling, you are right, that sounds reasonable; I will look into that as well.
This is where the code finds the writer (and consequently the file) to use to persist data to a WARC: https://github.com/internetarchive/heritrix3/blob/master/modules/src/main/java/org/archive/modules/writer/WARCWriterProcessor.java#L155
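On the cxml side of that question, the writer is wired in as an ordinary Spring bean. The fragment below is a minimal sketch, not a verified configuration. It assumes the bean id `warcWriter` and the property names `compress`, `prefix` and `maxFileSizeBytes` as they appear in the stock profile, and the values are purely illustrative, so compare it against your own crawler-beans.cxml before relying on it:

```xml
<!-- Sketch only: bean id, property names and values are assumptions
     based on the stock profile; check them against your own file. -->
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
  <!-- gzip each record as it is written -->
  <property name="compress" value="true" />
  <!-- illustrative file-name prefix for the WARC files -->
  <property name="prefix" value="EXAMPLE" />
  <!-- start a new WARC file once the current one reaches roughly this size -->
  <property name="maxFileSizeBytes" value="1000000000" />
</bean>
```

A bean like this writes every domain into the same set of files. Splitting output per domain would still need a custom processor, as mentioned earlier in the thread.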
Thank you very much one more time!
Hi! Regarding the question in this issue, is there any resource where the documentation is up to date and complete? I'm facing this problem because the only resources I have found are ReadTheDocs and the wiki, and they are either incomplete or basic (e.g.
Looks like the logToFile property exists (1) on DecideRuleSequence and (2) on everything inheriting from Scoper.
The Java API documentation is complete in the sense of listing every class and property.
Digging into the code is sometimes a practical necessity to fully understand some of the options and behavior. Heritrix has no dedicated developers and problems are generally solved by affected users contributing fixes. :-)
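To illustrate the first case, the property can be set on the scope bean. This is a sketch that assumes the scope bean is declared with id `scope`, as in the stock profile, and that the existing rule list stays as it is. Setting `logToFile` asks the rule sequence to record its decisions in a log file:

```xml
<!-- Sketch only: the bean id "scope" is assumed from the stock profile. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- enable logging of this sequence's decisions to a file -->
  <property name="logToFile" value="true" />
  <property name="rules">
    <list>
      <!-- keep the rules already present in your crawler-beans.cxml here -->
    </list>
  </property>
</bean>
```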
Hello,
I am trying to understand crawler-beans.cxml usage in Heritrix but can't find any best-practice articles or a detailed manual anywhere. There is some documentation floating around on the net and some explanation inside the .cxml itself, but both are either outdated or incomplete.
The wiki has a good page for basic usage but, as the name suggests, it is basic. The rest seems to refer to an older version of Heritrix.
Is there any source I am missing or a place where I can examine good practice examples?
Sorry if this is not the appropriate platform for the request.