Update cudnn-batch to next #70

Open
wants to merge 65 commits into
base: cudnn-batch-tomerge

alreadydone
Owner

No description provided.

TFiFiE and others added 30 commits September 3, 2018 14:56
Case-sensitive coordinates are a thing in SGF, not GTP.

Pull request leela-zero#1793.
* Install to ${CMAKE_INSTALL_BINDIR},
  some distros like to put games in /usr/games.

* Store/load leelaz_opencl_tuning and load weights file from
  system directories, i.e.
  ~/.local/share/leela-zero on Unix

* Better error reporting when network weights file is not found.

Pull request leela-zero#1618.
* MCTS: Skip the currently expanding child when doing UCT select.

Search threads should explore other nodes in this case, which saves
the search from some useless work.

This benefits batching support too. Before this change, all
threads could be busy-waiting for the first node to be expanded.

Rather than skipping the expanding node outright, give it a huge
virtual loss, which avoids a crash when only one child exists.

Pull request leela-zero#1794.
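
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual UCTNode code: the `Child` struct, the precomputed `score`, and the penalty constant `kExpandingPenalty` are all assumptions made for the example.

```cpp
#include <limits>
#include <vector>

// Hypothetical, simplified child record; not the real UCTNode.
struct Child {
    double score;    // selection value (e.g. a PUCT score), assumed precomputed
    bool expanding;  // true while another thread is expanding this child
};

// Returns the index of the best child, or -1 if there are no children.
int select_child(const std::vector<Child>& children) {
    // Assumed penalty: large enough to lose against any normal score,
    // but finite, so a lone expanding child can still be selected.
    constexpr double kExpandingPenalty = 1e6;
    int best = -1;
    double best_value = std::numeric_limits<double>::lowest();
    for (int i = 0; i < static_cast<int>(children.size()); ++i) {
        double value = children[i].score;
        if (children[i].expanding) {
            value -= kExpandingPenalty;  // acts like a huge virtual loss
        }
        if (value > best_value) {
            best_value = value;
            best = i;
        }
    }
    return best;
}
```

Because the penalty only lowers the value instead of removing the child, the function still returns a valid index when the expanding child is the only one.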
As demanded by GTP, improving the input handling of GameState::play_textmove 
in the process (which now would crash if given a pass or resignation).

Pull request leela-zero#1814.
Updated README to compile under Linux with Boost filesystem.

Required after 73f1f93.

Pull request leela-zero#1813.
According to @Atarust at leela-zero#1806 this fixes kernel compilation error with his
configuration. No performance difference.

Pull request leela-zero#1820.
Copying on weight construction keeps a copy of the weights on the host memory,
at least for recent NVIDIA GPUs. Creating a buffer and then copying later on
doesn't, and this saves memory.

Pull request leela-zero#1818.
* Thread-safe UCTNodePointer

This makes almost all UCTNodePointer operations thread-safe.
The only exceptions are destructors and when it is 'moved out'.
It should even handle concurrent inflate() calls properly.

Uses atomic operations to emulate locks only when needed.

This includes support for re-expansion by forcibly moving the state back
to INITIAL in a single-threaded context.

Pull request leela-zero#1764.
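
As a rough illustration of "atomic operations to emulate locks", a compare-and-swap state machine along these lines lets exactly one thread win the right to expand. The class name `ExpandGuard`, its method names, and the EXPANDING/EXPANDED states are assumptions for this sketch; the real UCTNodePointer may encode its state differently.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch, not the real UCTNodePointer/UCTNode code.
enum class State : std::uint8_t { INITIAL, EXPANDING, EXPANDED };

class ExpandGuard {
public:
    // Try to become the single thread that performs the expansion.
    // Succeeds only if the state is still INITIAL.
    bool acquire_expanding() {
        State expected = State::INITIAL;
        return m_state.compare_exchange_strong(expected, State::EXPANDING);
    }
    // Called by the winning thread once the expansion is complete.
    void expand_done() { m_state.store(State::EXPANDED); }
    // Called by the winning thread if the expansion is abandoned.
    void expand_cancel() { m_state.store(State::INITIAL); }
    bool expanded() const { return m_state.load() == State::EXPANDED; }

private:
    std::atomic<State> m_state{State::INITIAL};
};
```

Threads that lose the compare-and-swap never block; they simply see `false` and can go search elsewhere.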
Avoid having duplicate copies of the network weights in memory.

Pull request leela-zero#1795.
Fixes clang warning.

Pull request leela-zero#1841.
When doing auto precision detection, make sure the prior implementation
is destroyed before trying a new implementation.

Pull request leela-zero#1842.
* Count memory consumption of a search tree by introducing a
  referencer for UCTNodePointer and UCTNode.
* NNCache: Add method to get estimated memory consumption.
* Extend Network with methods to estimate network size, network cache
  size and resize cache.
* Estimate total memory consumption as
    estimated network size +
    number of gpus * 85MB +
    estimated tree size * 1.1 +
    estimated cache size * 1.1
* Add command `lz-setoption` which behaves like set_option from UCI spec.
* Add option to set maximum memory consumption in MB.
* Add option to configure ratio of memory reserved for nn cache
  and search tree.
* Add command 'lz-estimatememory' which shows estimated memory consumption.
* Initialize maximum tree size and cache size after
  the network initialization.

Pull request leela-zero#1741.
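
The total-memory estimate listed above can be written out as a small function. This is only a sketch: the function name and parameters are hypothetical, 85MB is taken as 85 * 1024 * 1024 bytes, and the 1.1 factor is applied with integer arithmetic.

```cpp
#include <cstddef>

// Assumption: "MB" in the commit message is treated as 1024 * 1024 bytes.
constexpr std::size_t MiB = 1024 * 1024;

// Hypothetical helper; all sizes are in bytes.
std::size_t estimate_total_memory(std::size_t network_size,
                                  std::size_t num_gpus,
                                  std::size_t tree_size,
                                  std::size_t cache_size) {
    // Fixed overhead per GPU: 85MB.
    const std::size_t gpu_overhead = num_gpus * 85 * MiB;
    // Tree and cache estimates are padded by 10% (the 1.1 factor).
    const auto padded = [](std::size_t bytes) { return bytes + bytes / 10; };
    return network_size + gpu_overhead + padded(tree_size) + padded(cache_size);
}
```

For example, a 100 MiB network on two GPUs with an empty tree and cache would be estimated at 270 MiB under these assumptions.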
If a node is fully expanded but is reverted to INITIAL state, there is no chance
it returns to EXPANDED state. Don't revert a node to INITIAL state if it is fully
expanded.

Some additional small bugfixes.

Pull request leela-zero#1851.
* Add policy prior in analysis output.
* Store policy as float instead of string in OutputAnalysisData.

Pull request leela-zero#1836.
We preferably store the analysis in the original format for sorting, and
only do the conversion for display at display time.

Don't use the completely meaningless tag "N" for move policy prior.
We'll use "prior" instead.

Winrate is currently output in 1/100th of a percent, so we'll use the
same format for priors. I'm not sure why winrate is not just using floats,
but I assume GUIs now already rely on this, and it might avoid some weird
bugs related to locale.
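
To illustrate the output format only: a value kept as a float for sorting could be converted at display time like this. The helper name is hypothetical, and rounding to nearest is an assumption (the real code may truncate).

```cpp
#include <cmath>

// Hypothetical helper: scale a probability in [0, 1] to 1/100th of a percent,
// so 0.5 becomes 5000. Rounding to nearest is an assumption of this sketch.
int to_hundredths_of_percent(float probability) {
    return static_cast<int>(std::lround(probability * 10000.0f));
}
```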
This adds Eigen as a default matrix/vector library via a submodule.
This has a load of advantages:

* It can be used as a replacement for a cBLAS library when it is
  not available, cannot be found, or is outdated compared to the
  compiler or CPU.
* Because Eigen is header only, it significantly eases the build
  prerequisites.
* The Eigen code paths are much more readable from a mathematical
  perspective.
* Eigen can optimize more heavily for known matrix sizes. The
  current code doesn't yet take advantage of this, though.

The downsides:

* Eigen might be a bit slower than other BLAS libraries. (Nevertheless,
  on my system it is faster than OpenBLAS)
* Binaries built with Eigen are optimized for the CPU they were compiled
  on and don't port as well to other CPUs. So you need separate binaries
  for wider client support.

* Default Eigen in CMake, add tests.

Default the Eigen library in CMake, as it's the fastest for most
contemporary CPUs and configurations, and the easiest to build.

We can optionally use BLAS by adding the USE_BLAS define, and
will try to locate BLAS/OpenBLAS if so. This is useful for
binaries for distribution such as our releases or distros.

Split all tests to cover both Eigen and BLAS.

Update build instructions to remove BLAS as a dependency, use CMake on
all Unixy platforms, and use HTTPS.

Pull request leela-zero#1692.
See discussion in pull request leela-zero#1642.

This adds an optional side to move in the lz-analyze command, instead
of only a posting interval. This makes the format more consistent with
all other GTP commands.

We check the amount and format of the arguments so we are backwards
compatible with GUIs that send the old format, i.e. with only a posting
interval.

Pull request leela-zero#1872.
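
The backwards-compatible argument check might look roughly like the following. This is a sketch under assumptions: `AnalyzeArgs` and `parse_analyze_args` are hypothetical names, the accepted color spellings (b, w, black, white, case-insensitive) follow the usual GTP convention, and the interval is assumed to be a non-negative integer.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical result type, for illustration only.
struct AnalyzeArgs {
    char side = '\0';  // 'b' or 'w' when a side to move was given, else '\0'
    int interval = 0;  // posting interval
};

// Parses the arguments that follow "lz-analyze".
// Returns false if the amount or format of the arguments is wrong.
bool parse_analyze_args(const std::string& text, AnalyzeArgs& out) {
    std::istringstream in(text);
    std::string first, second, extra;
    in >> first >> second >> extra;
    if (first.empty() || !extra.empty()) {
        return false;  // assumed: exactly one or two arguments
    }

    AnalyzeArgs args;
    std::string interval_token = first;
    if (!second.empty()) {
        // New format: <side> <interval>.
        std::string color;
        for (unsigned char c : first) {
            color += static_cast<char>(std::tolower(c));
        }
        if (color == "b" || color == "black") {
            args.side = 'b';
        } else if (color == "w" || color == "white") {
            args.side = 'w';
        } else {
            return false;
        }
        interval_token = second;
    }

    // Old format (or the tail of the new one): digits only.
    if (interval_token.size() > 9) {
        return false;  // keeps std::stoi within int range
    }
    for (unsigned char c : interval_token) {
        if (!std::isdigit(c)) {
            return false;
        }
    }
    args.interval = std::stoi(interval_token);
    out = args;
    return true;
}
```

With this shape, an old-style GUI sending only an interval and a new-style GUI sending a side plus an interval are both accepted, and anything else is rejected.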
Rework the Network initialization to pull out the OpenCL benchmarking
for precision autodetection. Add support to ForwardPipe to report
whether it is necessary to run the benchmarks. If the answer is no, and
fp16 compute works, we assume that's what we want.

This avoids the benchmarking overhead on modern AMD cards and probably
on the latest ones from NVIDIA too.

Pull request leela-zero#1873.
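
The decision described above can be sketched as a small function. Everything here is hypothetical (the `Fp16Probe` struct, its fields, and `choose_precision`); it only mirrors the stated logic: skip the benchmark when the pipe says it is unnecessary and fp16 compute works, otherwise compare measured speeds.

```cpp
enum class Precision { Single, Half };

// Hypothetical inputs to the decision; not the real ForwardPipe interface.
struct Fp16Probe {
    bool fp16_works;       // fp16 compute initialised and produced sane output
    bool needs_benchmark;  // whether the pipe reports benchmarking is necessary
    double single_speed;   // evals/s, only meaningful if benchmarked
    double half_speed;     // evals/s, only meaningful if benchmarked
};

Precision choose_precision(const Fp16Probe& p) {
    if (!p.fp16_works) {
        return Precision::Single;  // nothing to choose between
    }
    if (!p.needs_benchmark) {
        return Precision::Half;    // assume fp16 is what we want
    }
    // Otherwise pick whichever was faster in the benchmark.
    return p.half_speed > p.single_speed ? Precision::Half : Precision::Single;
}
```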
gcp and others added 30 commits October 12, 2018 09:26
Required on macOS, and probably other platforms.

Fixes issue leela-zero#1901.

Pull request leela-zero#1910.
Implement a few more parameters that can be set via lz-setoption,
specifically visits, playouts, pondering, resign threshold and the lag
buffer.

We currently don't check the provided values against the reported
min/max values but rely on the UI not to mess up. This could be
addressed in a refactoring. Similarly, commandline and setoption values
should probably be treated in a unified way.

Remove the bogus boolean return value from GTP processing functions.

Minor style fix for old GTP code.

Pull request leela-zero#1927.
We need to call UCTNode::create_children() even if we aren't expanding
because that moves our node's state from INITIAL to EXPANDED.

Pull request leela-zero#1928.
If the tuner failed during precision autodetection, the error output on stdout
was read as a GTP message.

Pull request leela-zero#1935.
* Fall back to single precision when the GPU claims fp16 support
  but it doesn't work.
* Net initialization fixes:
  - Try at least one selfcheck eval when autodetecting precision.
  - Revive selfcheck when using Eigen.

Pull request leela-zero#1934.
Many lz-setoption commands are forgetting to add the closing GTP = if
they are successful. This will freeze GUIs.

Fixes issue leela-zero#1940.
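
For context, GTP expects every command to be answered with a line starting with `=` (success) or `?` (failure), optionally followed by the command id, and terminated by an empty line. A minimal sketch of a helper that always emits a well-formed reply is shown below; the function name and the convention of passing a negative id for "no id" are assumptions of this example.

```cpp
#include <string>

// Hypothetical helper: builds a complete GTP response.
// Pass a negative id when the command carried no id.
std::string gtp_response(bool success, int id, const std::string& text) {
    std::string out = success ? "=" : "?";
    if (id >= 0) {
        out += std::to_string(id);
    }
    // A response ends with two newlines; a GUI waits until it sees them.
    out += " " + text + "\n\n";
    return out;
}
```

Routing every successful lz-setoption branch through one such helper is one way to avoid forgetting the closing `=`.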
We've supported HTTPS on the server side for a while now, make it the
default.
Colab has updated and the instructions here probably no longer work.
They should probably be hosted elsewhere, too.
Define a variable closer to the usage point.
* Separate FPU-reduction setting for root.
* Removed fpu_root_reduction.

Pull request leela-zero#1960.
Link to Google Cloud tutorial on Google Docs.

Pull request leela-zero#1961.
Delete outdated questions and answers.

Pull request leela-zero#196.
Disabling input buffering on Windows causes breakage that looks like
input buffering stays enabled. This was accounted for in the code, but
the #define check was against a non-default flag, and a different one than
the one used elsewhere.
Even though SGF defaults to size 19 boards, we should not try
to set up a board that size if LZ has not been compiled to support
it.

Pull request leela-zero#1964.
Without this, it's empirically not possible to load the current 256x40
networks on a 32-bit machine.
If we are trying to auto-select the best device for OpenCL, never select
a CPU. This will cause the engine to refuse to run when people are
trying to run the OpenCL version without a GPU or without GPU drivers,
instead of selecting any slow and suboptimal (and empirically extremely
broken) OpenCL-on-CPU drivers.

Falling back to CPU-only would be another reasonable alternative, but
doesn't provide an alert in case the GPU drivers are missing.

Improves behavior of issue leela-zero#1994.
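
The auto-selection rule can be sketched as below. The `Device` struct, its `score` field, and the error text are hypothetical; a real implementation would query the OpenCL platform for device types rather than use a prefilled list.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical device description for illustration.
struct Device {
    std::string name;
    bool is_cpu;
    int score;  // assumed preference score, higher is better
};

// Returns the index of the best non-CPU device.
// Throws if there is no such device, so the engine refuses to run
// instead of silently picking an OpenCL-on-CPU driver.
std::size_t auto_select_device(const std::vector<Device>& devices) {
    bool found = false;
    std::size_t best = 0;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (devices[i].is_cpu) {
            continue;  // never auto-select a CPU
        }
        if (!found || devices[i].score > devices[best].score) {
            best = i;
            found = true;
        }
    }
    if (!found) {
        throw std::runtime_error("No suitable OpenCL device found.");
    }
    return best;
}
```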
Fix full tuner for heterogeneous GPUs and auto precision detection.

--full-tuner implies --tune-only
--full-tuner requires an explicit precision

Fixes leela-zero#1973.

Pull request leela-zero#2004.
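
The two option rules above could be enforced after command-line parsing roughly like this. The struct, the function name, and the use of the string "auto" to mean "no explicit precision" are assumptions of this sketch, not the actual option-handling code.

```cpp
#include <string>

// Hypothetical parsed options for illustration.
struct TunerOptions {
    bool full_tuner = false;
    bool tune_only = false;
    std::string precision = "auto";  // assumed: "auto" means not explicit
};

// Applies the implied settings and returns an error message,
// or an empty string if the combination is valid.
std::string normalize_tuner_options(TunerOptions& opts) {
    if (opts.full_tuner) {
        opts.tune_only = true;  // --full-tuner implies --tune-only
        if (opts.precision == "auto") {
            return "--full-tuner requires an explicit precision";
        }
    }
    return "";
}
```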
Very minor speedup of about 2% with batch size of 1.
With batch size of 5 there is a speedup of about 5% with half precision
and 12% with single precision.

Out transformation memory accesses are almost completely coalesced
with the new kernel.

Pull request leela-zero#2014.
From upstream a807dcf0f8623d40dc5ce9d1eb00ffd0e46150c7.
* CPUPipe : change winograd transformation constants to an equation.

Combined with a series of strength reduction changes, 
improves netbench by about 8%.

* Convert some std::array into individual variables

For some reason this allows gcc to optimize the code better,
improving netbench by 2%.

Pull request leela-zero#2021.
Use hard-coded equations instead of matrix multiplication.

Pull request leela-zero#2023.
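
Assuming this refers to the Winograd transforms mentioned in the preceding commits, the idea can be illustrated on the small, well-known 1D F(2,3) input transform. This is deliberately not the tile size or code used by the project; it only shows how a multiplication by a constant matrix collapses into a few additions and subtractions.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// Generic version: multiply by the constant 4x4 matrix B^T of Winograd F(2,3).
Vec4 input_transform_matmul(const Vec4& d) {
    static const float Bt[4][4] = {
        {1.0f,  0.0f, -1.0f,  0.0f},
        {0.0f,  1.0f,  1.0f,  0.0f},
        {0.0f, -1.0f,  1.0f,  0.0f},
        {0.0f,  1.0f,  0.0f, -1.0f},
    };
    Vec4 out{};
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out[i] += Bt[i][j] * d[j];
        }
    }
    return out;
}

// Hard-coded version: the same result with no multiplications,
// since every entry of B^T is 0, 1 or -1.
Vec4 input_transform_equations(const Vec4& d) {
    return Vec4{{d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]}};
}
```

The hard-coded form skips all the multiplications by zero and one, which is the kind of strength reduction a compiler may not find on its own in the generic loop.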
Fix Validation -k option by reading its value before the parser is reused.

Pull request leela-zero#2024.