A simple robots.txt service. MrRoboto will fetch and parse robots.txt files for you and indicate whether a path is crawlable by your user agent. It also has primitive support for the Crawl-delay directive.
Available in Hex, the package can be installed as:

- Add `mr_roboto` to your list of dependencies in `mix.exs`:

      def deps do
        [{:mr_roboto, "~> 1.0.0"}]
      end

- Ensure `mr_roboto` is started before your application:

      def application do
        [applications: [:mr_roboto]]
      end
Checking a URL is simple: just send `{:crawl?, {agent, url}}` to the `Warden` server.

    GenServer.call MrRoboto.Warden, {:crawl?, {"mybot", "http://www.google.com"}}
The `Warden` server will reply with `:allowed`, `:disallowed`, or `:ambiguous`. An `:ambiguous` response indicates that the path matched allow and disallow directives of the same length; in that case it is up to your discretion how to proceed.
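As a rough sketch, a crawler might branch on the reply like this (the `fetch/1` function and the conservative handling of `:ambiguous` are assumptions for illustration, not part of MrRoboto):

```elixir
url = "http://www.google.com"

case GenServer.call(MrRoboto.Warden, {:crawl?, {"mybot", url}}) do
  :allowed ->
    # Safe to crawl; `fetch/1` is a hypothetical HTTP client call.
    fetch(url)

  :disallowed ->
    # Respect the robots.txt Disallow rule.
    :skip

  :ambiguous ->
    # Allow and Disallow directives of equal length matched; this sketch
    # errs on the side of caution and skips the URL.
    :skip
end
```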
Crawl-delay isn't a very common directive, but MrRoboto attempts to support it. The `Warden` server supports a `delay_info` call which returns the delay value and the time the rule was last checked.

    iex> GenServer.call MrRoboto.Warden, {:delay_info, {"mybot", "http://www.google.com"}}
    %{delay: 1, last_checked: 1459425496}
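A minimal sketch of how that information could be used to throttle a crawler, assuming `delay` is in seconds and `last_checked` is a Unix timestamp as the example above suggests (the `CrawlThrottle` module and `maybe_wait/2` helper are illustrative, not part of the library):

```elixir
defmodule CrawlThrottle do
  # Sleep until at least `delay` seconds have passed since the rule
  # was last checked, then return :ok.
  def maybe_wait(agent, url) do
    %{delay: delay, last_checked: last_checked} =
      GenServer.call(MrRoboto.Warden, {:delay_info, {agent, url}})

    elapsed = System.system_time(:second) - last_checked

    if elapsed < delay do
      :timer.sleep((delay - elapsed) * 1000)
    end

    :ok
  end
end
```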