wikiracer is a Go application which plays "The Wikipedia Game". It takes a start page and an end page and follows hyperlinks to get from the start page to the end page as quickly as possible.
- Basic Usage
- Installation
- Run tests
- Profiling
- Why Go?
- Architecture overview
- Some strategies attempted
- Time spent on project
- Appendix
Start a wikiracer server on port 8000:

```
$ wikiracer
INFO[0000] Server is running at http://localhost:8000
```
In order to initiate a race, make a GET request to the server's `/race` endpoint with the following arguments.

- `starttitle` (required): The Wikipedia page to start from.
- `endtitle` (required): The Wikipedia page to find a path to.
- `nocache`: By default, the server caches all paths previously found. To ignore the cache for this race, set `nocache=1`.
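For example, the race shown below could be requested with a GET to `http://localhost:8000/race?starttitle=English%20language&endtitle=University%20of%20Pennsylvania` (the titles are just illustrative; any valid Wikipedia page titles work).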
The endpoint returns a JSON response containing a path from the start page to the end page and how long it took to find the path.
```
{
  "path": [
    "English language",
    "International Phonetic Alphabet",
    "University of Pennsylvania"
  ],
  "time_taken": "72.815763ms"
}
```
If no path was found within the time limit (see below), the JSON response will look like:

```
{
  "message": "no path found within 1m0s",
  "path": [],
  "time_taken": "1m0s"
}
```
The following environment variables can be used to customize the behavior of wikiracer.
- `WIKIRACER_PORT`: The port on which to run the HTTP server (default `8000`).
- `EXPLORE_ALL_LINKS`: Sometimes, the MediaWiki API doesn't return all links in one response. As a result, wikiracer continues to query the MediaWiki API until all the links are returned. If `EXPLORE_ALL_LINKS` is set to `"false"`, then wikiracer will not continue even if there are more links.
- `EXPLORE_ONLY_ARTICLES`: By default, wikiracer only searches the main Wikipedia namespace, which includes all encyclopedia articles, lists, disambiguation pages, and encyclopedia redirects. If `EXPLORE_ONLY_ARTICLES` is set to `"false"`, then wikiracer will explore all Wikipedia namespaces. (Read more about namespaces here.)
- `WIKIRACER_TIME_LIMIT`: The time limit for the race, after which wikiracer gives up. Must be a string which can be understood by `time.ParseDuration` (default `1m`).
- `NUM_FORWARD_LINKS_ROUTINES`: The number of concurrent `forwardLinks` workers to run (default 15).
- `NUM_BACKWARD_LINKS_ROUTINES`: The number of concurrent `backwardLinks` workers to run (default 15).
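For example, starting the server with `WIKIRACER_TIME_LIMIT=30s EXPLORE_ONLY_ARTICLES=false wikiracer` would give up after 30 seconds and explore every namespace (this invocation is only illustrative).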
Create a directory and clone the repo:
```
$ mkdir -p $GOPATH/src/github.com/sandlerben/wikiracer
$ cd $GOPATH/src/github.com/sandlerben/wikiracer
$ git clone https://github.com/sandlerben/wikiracer.git .
```
Install the dependencies using dep, the semi-official Go dependency management tool.
```
$ dep ensure
```
Finally, install wikiracer.
```
$ go install
```
The full test suite can be run with:
```
$ go test ./...
```
wikiracer exposes a pprof endpoint which allows it to be profiled in a few ways:

- Using go-torch, which can generate a flamegraph visualizing the program's workload. (Side note: I wrote this tool!)
- Using `go tool pprof`, which can create CPU profiles, memory profiles, and blocking profiles and visualize each in various ways.
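For reference, here is a minimal sketch of how a pprof endpoint is typically exposed in a Go HTTP server; wikiracer's actual wiring (it uses gorilla/mux) may differ:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // blank import registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With the import above, profiling data is served at
	// http://localhost:8000/debug/pprof/ for use with go tool pprof or go-torch.
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```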
I wrote this application in Go for a few reasons:
- I have experience using Go in the past from go-torch and transcribe4all.
- Go has extremely powerful concurrency primitives (goroutines and channels) and an excellent runtime/scheduler, which make writing fast, thread-safe concurrent programs more straightforward than in other languages.
- Go has a great standard library. In particular, it's easy to create a full-featured HTTP server and make HTTP requests using Go's built-ins.
The `wikiracer/web` package encapsulates the logic for handling HTTP requests. The package exposes two endpoints:

- `/race` returns a path from a start page to an end page.
- `/health` returns a message indicating that the server is alive and healthy.

The `wikiracer/web` package uses the `gorilla/mux` router, an extremely popular Go URL dispatcher.

The `wikiracer/web` package also features a simple cache of paths previously found (implemented as a Go map). That way, a path from some start page to some end page only needs to be found once.
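A minimal sketch of what such a cache might look like (the identifiers and the locking detail here are assumptions for illustration, not the actual code in `wikiracer/web`):

```go
package web

import "sync"

// pathCache maps "start|end" keys to previously found paths. A mutex guards
// the map because concurrent requests may read and write it at the same time.
type pathCache struct {
	mu    sync.RWMutex
	paths map[string][]string
}

func newPathCache() *pathCache {
	return &pathCache{paths: make(map[string][]string)}
}

func (c *pathCache) get(start, end string) ([]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	path, ok := c.paths[start+"|"+end]
	return path, ok
}

func (c *pathCache) put(start, end string, path []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.paths[start+"|"+end] = path
}
```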
The `wikiracer/race` package encapsulates the Wikipedia exploring logic; it is the most interesting part of the application.

A `race.Racer` encapsulates all the state needed for one race, including:

- The page to start at
- The page to find a path to
- A record of how pages found traversing forward from the start page were reached during the race
- A record of how pages found traversing backwards from the end page were reached during the race
- Synchronization primitives (`sync.Once`)
- Channels to allow concurrent workers to communicate
- A "meeting point" which can be used to compute a full path (as described below)
- Any errors that occurred during the race

A `race.Racer` exposes one public function, `Run`, which returns a path from a start page to an end page.
Under the hood, there is a lot going on inside the `race` package. Specifically, a number of `forwardLinks` workers and `backwardLinks` workers concurrently explore the graph of Wikipedia pages until a path to the end page is found.

At a high level, the Wikipedia graph is explored in the following manner:

- The start page is added to the `forwardLinks` channel and the end page is added to the `backwardLinks` channel.
- `forwardLinks` workers traverse the Wikipedia graph forward from the start page. At each iteration, they read a page from the `forwardLinks` channel and call the MediaWiki API to add all pages linked from that page to the channel. In other words, they expand the start page's connected component (technically, weakly connected component).
- `backwardLinks` workers traverse the Wikipedia graph backward from the end page. At each iteration, they read a page from the `backwardLinks` channel and call the MediaWiki API to add all pages which link to that page to the channel. In other words, they expand the end page's connected component.
These stages are described in more detail below.
Clearly, the `forwardLinks` and `backwardLinks` workers are extremely similar (and much of the code is shared between them). `forwardLinks` workers explore the start page's connected component and `backwardLinks` workers explore the end page's connected component. In order to accomplish this, `forwardLinks` workers query the `links` property of pages and `backwardLinks` workers query the `linkshere` property of pages.
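Concretely, forward exploration uses MediaWiki queries along the lines of `action=query&prop=links&titles=<page>&pllimit=max`, while backward exploration queries `prop=linkshere` instead (the exact parameters wikiracer sends may differ slightly from this illustration).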
These workers "work" as follows:

- At each iteration, a worker takes a page from either the `forwardLinks` or `backwardLinks` channel. It then queries for the page's `links` or `linkshere` property to get "neighboring" pages.
- Next, it checks if any of these "neighbors" crosses the cut between the start page's connected component and the end page's connected component. If any do, the worker sets the `meetingPoint` variable to that page and closes the `done` channel to signal that an answer was found.
- If a meeting point was not found, the worker makes a record of how it got to each neighbor. In other words, it adds a mapping from each neighbor to its parent page in `pathFromStartMap` or `pathFromEndMap`.
- When a `meetingPoint` is found, `pathFromStartMap` is used to recreate the path from `start` to `meetingPoint` and `pathFromEndMap` is used to recreate the path from `meetingPoint` to `end`.
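Putting the pieces above together, a rough sketch of one worker's loop might look like the following. The identifiers, struct layout, and helper signature here are assumptions for illustration, not excerpts from the real code:

```go
package race

import "sync"

// racerSketch is an illustrative stand-in for the real race.Racer state.
type racerSketch struct {
	done             chan struct{}
	forwardLinks     chan string
	pathFromStartMap sync.Map // page title -> page it was reached from (forward search)
	pathFromEndMap   sync.Map // page title -> page it was reached from (backward search)
	once             sync.Once
	meetingPoint     string
}

// forwardWorker sketches one forwardLinks worker. backwardLinks workers are
// symmetric: they query linkshere and swap the roles of the two maps.
func (r *racerSketch) forwardWorker(getLinks func(page string) []string) {
	for {
		select {
		case <-r.done:
			return // a path was found elsewhere, or the time limit was hit
		case page := <-r.forwardLinks:
			for _, neighbor := range getLinks(page) { // MediaWiki prop=links call
				// If the backward search has already reached this page, the two
				// connected components meet here and a full path exists.
				if _, ok := r.pathFromEndMap.Load(neighbor); ok {
					r.once.Do(func() {
						r.meetingPoint = neighbor
						close(r.done) // signal every worker to stop
					})
					return
				}
				// Otherwise, record how the neighbor was reached and keep exploring.
				// (The real channel must be buffered or drained concurrently so
				// this send does not block.)
				if _, seen := r.pathFromStartMap.LoadOrStore(neighbor, page); !seen {
					r.forwardLinks <- neighbor
				}
			}
		}
	}
}
```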
Lots more fun technical implementation details can be found in the appendix below.
The original version of wikiracer worked quite differently and it is fully described here.
At a high level, the old implementation worked as follows. Pages passed through a two-stage pipeline:
- A page was added to the `checkLinks` channel. A `checkLinks` worker checked if the page connected to the end page using the `pltitles` parameter of the MediaWiki API. If it did, wikiracer returned. If it did not, the page was added to the `getLinks` channel.
- A `getLinks` worker took the page and added all the "neighbor" pages linked from the page to the `checkLinks` channel.
While this approach worked for most cases, it performed poorly when few pages linked to the end page. For example, with an end page like "Segment", which is a disambiguation page linked to by only about a dozen other pages, wikiracer would not find an answer.
In contrast, the new approach, which also traverses backwards from the end page, can "escape" end pages with low in-degree: the backward search quickly reaches pages that are much more likely to be encountered while traversing forward from the start page.
As mentioned above, the number of `forwardLinks`/`backwardLinks` worker goroutines can be customized. However, I wanted to find the best default for most races.

First, some background. All workers check to see if a channel called `done` is closed before getting work from the `forwardLinks` or `backwardLinks` channels. The closing of `done` is the signal that all the workers should stop working and exit.
In the original implementation, the main request goroutine waited for all worker goroutines to exit using a `sync.WaitGroup`. When the end page was found, the following happened:
1. A worker found a `meetingPoint`. Success!
2. The worker closed `done` to signal that all the other workers could stop.
3. [Eventually, maybe after a second or more] All the workers reached the code which checks if `done` is closed and then exited.
4. After all workers exited, the main goroutine was unblocked and returned the path.
A key problem with this approach was that the time taken by step 3 grew with the number of concurrent workers.
I started with two workers of each type. While increasing the number of concurrent workers led to the end page being found faster, those gains were eaten up by a longer step 3.
I solved this problem by having the main goroutine wait for the `done` channel to be closed instead of using a `sync.WaitGroup`. In effect, this unblocks the main goroutine even before all the worker goroutines exit. This approach, coupled with increasing the number of concurrent workers, significantly improved the response time (cutting it by half or more in some cases!).
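A rough sketch of the difference from the request goroutine's point of view (names here are illustrative, not the real code):

```go
package race

import "sync"

// waitOld blocks until every worker has noticed that done is closed and has
// called wg.Done, and only then builds and returns the path.
func waitOld(wg *sync.WaitGroup, buildPath func() []string) []string {
	wg.Wait()
	return buildPath()
}

// waitNew unblocks as soon as done is closed, even though some workers may
// still be winding down in the background.
func waitNew(done <-chan struct{}, buildPath func() []string) []string {
	<-done
	return buildPath()
}
```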
Sometimes, the MediaWiki API doesn't return all links for a page in one response. The default behavior of the application is to query the API until all the links are returned. However, I hypothesized that not exploring all the links for a page would make wikiracer faster. My thoughts were the following:
- Links on a Wikipedia page are probably pretty similar (topic-wise) to that page.
- If all the links on every page were explored, the wikiracer could get "trapped" in one topic area of Wikipedia and keep exploring related pages.
- Therefore, ignoring some links on each page would let wikiracer reach other parts of Wikipedia and find the end page faster.
It turned out that when I tested this approach, wikiracer did consistently worse, even when the start page and end page were totally unrelated. (See for yourself by setting `EXPLORE_ALL_LINKS` to `false`.) This was the case for a few reasons.
- Sometimes the best way to get out of one part of Wikipedia is to get to a very general and link-abundant page (e.g. "United States"). Without exploring all links, these general pages would often be skipped.
- Not only that, but upon arriving on a general page with lots of links, only a small fraction of those links would be explored.
- Finally, sometimes there is really only one way to get from one page to another. For example, the best way to get from "USB-C" to "Weston, Florida" is to pass through "American Express" (any other path has many more hops). Without exploring all links, a "key" page like "American Express" can be easily missed.
- Core implementation: 15 hours.
  - Web: 1 hour.
  - Race: 14 hours.
- Testing and documentation: 3 hours.
There are many ways to parse JSON in Go, but I opted to use the `buger/jsonparser` library for a few reasons:

- It doesn't require you to recreate the structure of the JSON in a `struct` beforehand. This makes programming much faster.
- In benchmarks, `jsonparser` is as fast or faster than all other Go JSON parsing libraries. See here. It is 10x faster than the standard `encoding/json` package!
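As a sketch of the struct-free style this enables (the JSON shape and key names below are simplified stand-ins, not the exact MediaWiki response):

```go
package main

import (
	"fmt"

	"github.com/buger/jsonparser"
)

func main() {
	// Simplified stand-in document; real MediaWiki responses are more nested.
	data := []byte(`{"page": "English language", "links": [{"title": "International Phonetic Alphabet"}, {"title": "Latin script"}]}`)

	// Values are pulled out by key path, with no struct definitions needed.
	page, _ := jsonparser.GetString(data, "page")
	fmt.Println("page:", page)

	// Iterate over an array without unmarshalling the whole document.
	jsonparser.ArrayEach(data, func(value []byte, dataType jsonparser.ValueType, offset int, err error) {
		title, _ := jsonparser.GetString(value, "title")
		fmt.Println("link:", title)
	}, "links")
}
```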
Clearly, the MediaWiki API ought to be called as often as possible in order to find a path as fast as possible. However, the API documentation does not include a clear quota or request limit. Rather, the API will occasionally return `429 Too Many Requests`.
To get around this, I abstracted the core requesting code into a function called `loopUntilResponse` which makes a request. If it receives a `429 Too Many Requests`, it waits for 100 milliseconds and tries the request again.
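In sketch form (an illustration of the behavior described above, not the actual function):

```go
package race

import (
	"net/http"
	"time"
)

// loopUntilResponseSketch retries a GET whenever the API answers 429,
// backing off 100ms between attempts. Illustrative only.
func loopUntilResponseSketch(url string) (*http.Response, error) {
	for {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil
		}
		resp.Body.Close()                  // discard the 429 response
		time.Sleep(100 * time.Millisecond) // wait briefly, then retry
	}
}
```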
The time limit is enforced by the `giveUpAfterTime` worker. It takes a `time.Timer`, and when the `Timer` finishes, `giveUpAfterTime` reads a message from the `timer.C` channel and closes the `done` channel.
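Roughly (illustrative only; the real worker presumably guards the close, e.g. with the `sync.Once` mentioned earlier, so `done` cannot be closed twice):

```go
package race

import "time"

// giveUpAfterTimeSketch waits for the timer to fire, then signals all
// workers to stop by closing done.
func giveUpAfterTimeSketch(timer *time.Timer, done chan struct{}) {
	<-timer.C   // blocks until the time limit elapses
	close(done) // in the real code this close is guarded so it happens at most once
}
```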
Mock testing is key to isolating a specific part of the code in a unit test. Therefore, when testing the `race` package, I used `httpmock` to mock the responses to `http.Get`. When testing the `web` package, I used `mockery` and `testify` to create a mock `race.Racer` for testing.
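For example, a test in the `race` package might stub a MediaWiki response like this; the URL and JSON body are simplified placeholders, and this is a sketch of the general httpmock pattern rather than an excerpt from the real tests:

```go
package race

import (
	"net/http"
	"testing"

	"github.com/jarcoal/httpmock"
)

func TestGetLinksSketch(t *testing.T) {
	httpmock.Activate()
	defer httpmock.DeactivateAndReset()

	// Any http.Get against this URL now returns the canned body below.
	httpmock.RegisterResponder("GET", "https://en.wikipedia.org/w/api.php",
		httpmock.NewStringResponder(200, `{"query": {"pages": {}}}`))

	resp, err := http.Get("https://en.wikipedia.org/w/api.php")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != 200 {
		t.Fatalf("expected 200, got %d", resp.StatusCode)
	}
}
```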