Distri Configuration
In Distri, a master control machine assigns web download tasks to a number of distributed slave machines, which execute them. Distri therefore has two separate parts of code: the master part and the slave part.
The "curler.sh" file is the only code that needs to run on the slave machines. The location and permissions of this script need to be properly configured on every slave machine before the master can use it. The "deploy_curler.sh" script can be used to deploy and configure "curler.sh" on the slave machines.
To use Distri on the master machine, first initialize a DistriControl object as the central controlling object, then create one DistriHandle object for each slave machine. Each DistriHandle object also needs enough information about its slave machine (host name, user name, and key path) to access the machine and run the script on it.
DistriControl control = new DistriControl();
control.createHandle("<slave hostname>", "<user name>");
DistriHandle handle = control.getHandleByHost("<slave hostname>");
handle.setKeyPath("<path to the key>");
handle.connect();
To download a web page, create a DistriTask object that contains the web page's URL. The DistriControl object handles the task once it has been added to the task queue. Since the download requests are sent to the slave machines asynchronously, callback functions need to be set on the DistriTask object to handle the completion of the task. The information returned from the slave machine is delivered as a DistriResult object, which contains the HTTP headers, the web page, the status code, etc. It is fine to feed a large set of DistriTask objects into the DistriControl object's task queue in a very short amount of time: Distri balances the load among all the slave machines and returns the downloaded web pages as quickly as possible.
DistriTask task = new DistriTask("<url>");
control.addTask(task);
// Called when the task finishes.
task.setFinishCallback(new DistriTask.FinishCallback() {
    public void execute(DistriResult result) { ... }
});
// Called when the task fails.
task.setFailCallback(new DistriTask.FailCallback() {
    public void execute(DistriResult result) { ... }
});
// Called when the task is terminated.
task.setTerminateCallback(new DistriTask.TerminateCallback() {
    public void execute(DistriResult result) { ... }
});
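Because the callbacks run asynchronously, the calling thread usually needs some way to wait until every queued task has reported back. The following is a minimal, self-contained sketch of one common way to do that with a CountDownLatch. It does not use the Distri classes: fakeDownload is a hypothetical stand-in for an asynchronous download, and CallbackLatchSketch and downloadAll are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class CallbackLatchSketch {

    /** Hypothetical stand-in for an asynchronous download; not part of Distri. */
    static void fakeDownload(ExecutorService pool, String url, Consumer<String> onFinish) {
        pool.submit(() -> onFinish.accept("body of " + url));
    }

    /** Starts one fake download per URL and blocks until every callback has run. */
    static List<String> downloadAll(List<String> urls, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CountDownLatch pending = new CountDownLatch(urls.size());
        ConcurrentLinkedQueue<String> bodies = new ConcurrentLinkedQueue<>();
        try {
            for (String url : urls) {
                fakeDownload(pool, url, body -> {
                    bodies.add(body);      // collect the result (thread-safe queue)
                    pending.countDown();   // signal that this task has reported back
                });
            }
            // Wait for all callbacks, but give up after the timeout.
            if (!pending.await(timeoutSeconds, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for callbacks");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(bodies);
    }

    public static void main(String[] args) {
        List<String> bodies = downloadAll(
                Arrays.asList("http://example.com/a", "http://example.com/b"), 5);
        System.out.println(bodies.size() + " pages downloaded");
    }
}
```

With Distri, the same idea would mean calling countDown() from the finish, fail, and terminate callbacks of each task. This assumes that exactly one of the three callbacks fires per task; check the JavaDoc before relying on that.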
Distri also supports recursive web crawling: new tasks can be generated automatically from the links found in downloaded web pages and sent to the slave machines for recursive downloading. To use this feature, give the DistriTask object a positive recurseDepth when constructing it, and call task.setLinkSelector("<link pattern>") to specify the pattern of the links that should be crawled.
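To illustrate how a recursion depth and a link pattern interact, here is a minimal, self-contained sketch of a depth-limited crawl over an in-memory set of pages. It is not Distri's implementation. It assumes that the link pattern is a Java regular expression matched against the whole link and that a depth of 0 means "download only the start page"; Distri's actual pattern syntax and depth semantics are documented in the JavaDoc. RecursiveCrawlSketch, selectLinks, and crawl are illustrative names only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecursiveCrawlSketch {

    // Simplified href extractor; a real crawler would use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /** Returns the links in html that match linkPattern in full. */
    static List<String> selectLinks(String html, Pattern linkPattern) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            if (linkPattern.matcher(link).matches()) {
                links.add(link);
            }
        }
        return links;
    }

    /**
     * Breadth-first crawl over an in-memory "web" (URL -> HTML). Links are
     * followed at most recurseDepth levels below the start page. Returns the
     * scheduled URLs in discovery order.
     */
    static List<String> crawl(Map<String, String> web, String startUrl,
                              int recurseDepth, Pattern linkPattern) {
        Map<String, Integer> depthOf = new LinkedHashMap<>(); // also the visited set
        Deque<String> frontier = new ArrayDeque<>();
        depthOf.put(startUrl, 0);
        frontier.add(startUrl);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            int depth = depthOf.get(url);
            String html = web.get(url);
            if (html == null || depth >= recurseDepth) {
                continue; // page unavailable, or depth limit reached
            }
            for (String link : selectLinks(html, linkPattern)) {
                if (!depthOf.containsKey(link)) { // schedule each URL only once
                    depthOf.put(link, depth + 1);
                    frontier.add(link);
                }
            }
        }
        return new ArrayList<>(depthOf.keySet());
    }

    public static void main(String[] args) {
        Map<String, String> web = new HashMap<>();
        web.put("http://s/a", "<a href=\"http://s/b\">b</a> <a href=\"http://other/z\">z</a>");
        web.put("http://s/b", "<a href=\"http://s/c\">c</a>");
        System.out.println(crawl(web, "http://s/a", 1, Pattern.compile("http://s/.*")));
    }
}
```

In this sketch the link to http://other/z is never scheduled because it does not match the pattern, and http://s/c is not reached because it lies two levels below the start page while the depth limit is 1.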
For more information on how to use Distri, please refer to the JavaDoc.