An optimized tree of integer tokens with counts and extra data in each node
This package provides an optimized tree structure containing integer tokens and extra data associated with them. This is intended for supporting certain machine learning applications involving strings and grammatical learning.
For now, install from the github repository.
This package includes tests written using the unittest
framework.
They can be run using, for example, nose
, which can be installed
using pip3 install nose
.
To run the tests, just use the following command in the top-level directory:
nosetests
To create a tree and add sequences, use something like the following:
from tokentree import TokenTree
t = TokenTree()
t.add((1, 2, 3))
t.add((1, 3, 8, 10))
The tree is designed to accomodate millions or even hundreds of millions of values. Insertion time is linear in the length of the sequence. The sequences can consist of arbitrary integers, positive or negative. Tokens must fit in a 64-bit integer.
It is possible to store extra data with each node. Functions for
merging extra data with each node are provided as arguments the
constructor (see below). The extra data is provided as the extra
keyword argument to add
.
It is also possible to specify a count or weight for each addition.
This value will be added to the count of each node in the path. The
value is stored as a floating point number and defaults to 1. The
optional count is specified with the count
keyword argument to
add
.
The following illustrates adding with extra data and weight:
t.add(seq, count=3.5, extra="noun")
The tree is created using the TokenTree()
constructor. This
constructor accepts the following keyword arguments:
merge_extra
: a function to merge theextra
argument in a newly added node to the information already in a tree node. This should return the new data. Default isNone
. The function is called asmerge_extra(seq, i, old_data, extra, count)
. For the first call,old_data
will be None.seq
is the sequence being added;i
is the position at the sequence (0 means at tree root,len(seq)
means inserting the last node).
The tree works as an iterator. This makes it very easy to enumerate through all paths in the tree. The enumerator will also enumerate through all prefixes of sequences that were added to the tree. It is guaranteed that the prefix of a sequence will always be enumerated before any longer sequences with the same prefix. In effect, the iteration is in depth-first order. Within each prefix, the sequences of the next length are enumerated in ascending order of their token values.
The following illustrates enumerating the sequences. For each
iteration, the iterator returns (seq, node)
, where seq
is a
tuple of integers, and node
is the node at that sequence.
import tokentree
t = TokenTree()
t.add((1, 2))
...
for seq, node in t:
print(node.get_count(), seq)
The find
function looks up a specific sequence from the tree. It is
possible to look up added sequences and their prefixes. It returns None
if the given sequence is not in the tree.
import tokentree
t = TokenTree()
t.add((1, 2))
...
node = t.find(seq)
if node is None:
print("Not found")
print(node.get_count(), node.get_token(), node.get_extra())
The get_token_count(token)
method on the tree returns the number
of times that the given token has been added to the tree (in any
position in any sequence), or when count
is used when adding, the
sum of all counts for all occurrences of the token.
This function merely retrieves precomputed data, so it is quite fast.
The get_root()
method returns the root node of the tree. The root
node always exists and corresponds to the empty sequence.
The get_count()
method returns the sum of all weights added to the
tree. it is the same as the value of get_count()
on the root
node.
A node is mostly opaque for applications. However, the following functions can be used to query information about a node.
The have_children()
method on a node returns True
if the node
has children, and False
otherwise.
The get_token()
method on a node returns the token of the node. The token
is always an integer value (but can be negative).
The get_count()
method on a node returns the count value of the
node. The count is the sum of the count
values of all sequences
that have been added that have the sequence leading to the node as
their prefix (including the count value for the sequence of the node).
It is possible to iterate over the children of the node by simply using the node as an iterator. The following illustrates this:
import tokentree
t = TokenTree()
t.add([1, 3, 4])
...
node = t.find([1, 3])
for child in node:
print(child.get_count(), child.get_token())
The official repository of this project is on github.
Please email to ylo at clausal.com if you wish to contribute or have patches or suggestions.
Copyright (c) 2018 Tatu Ylonen. This package is free for both commercial and non-commercial use. It is licensed under the MIT license. See the file LICENSE for details.
Credit and linking to the project's website and/or citing any future papers on the project would be highly appreciated.