A Naive Bayes machine learning classifier.
- Developed as an investigation into capabilities and implementation characteristics
- Written in the Rust programming language
- Supports sparse data that does not fit well into a vector (e.g. words in a document)
- Has mitigation for very small probabilities (floating point values)
- Not recommended for production use. There is no package for this on crates.io
- Install Rust using rustup https://rustup.rs/
- Clone the repository (see below)
- Run
cargo test
or
cargo build
Classify single characters as belonging to 1 of 3 categories.
use assert_approx_eq::assert_approx_eq;
use naive_bayes::FixedClassifier;
use naive_bayes::FixedLearner;
let classifier = FixedLearner::default()
.update_batch(&mut [['a'], ['b'], ['c'], ['a'], ['a']].iter(), "01")
.update_batch(&mut [['b'], ['b'], ['c'], ['b'], ['a']].iter(), "02")
.update_batch(&mut [['b'], ['c'], ['c'], ['b'], ['c']].iter(), "03")
.make_classifier();
let actual = classifier.classify(&['a']);
assert_eq!(actual.best().unwrap().0, "01");
assert_approx_eq!(actual.best().unwrap().1, 0.75, 0.1);
D: Copy + Debug + Eq + Hash - the type of data being classified, for example str or i64
const DS: usize - the size of the array of data in a training or classification instance
H: Copy + Debug + Eq + Hash - the target hypothesis/label/category/classification of the data
update - updates the learner with a single training instance (an instance is a fixed length array)
pub fn update(&mut self, data: &[D; DS], hypothesis: H) -> &mut Self
update_batch - updates the learner with a batch of training instances (an instance is a fixed length array) associated with a single hypothesis.
pub fn update_batch(&mut self, data: &mut dyn Iterator<Item = &[D; DS]>, hypothesis: H) -> &mut Self
classify - classifies a single data instance (one instance is a fixed length array of data values)
pub fn classify(&self, data: &[D; DS]) -> Results<H>
The Learner records the following counts, which are later converted into probabilities.
A count of the number of training instances for a given hypothesis/label/class:
count_hypotheses: HashMap<H, f64>
Each instance that is learnt or classified is represented as a fixed size array. So, for each separate array index in each training instance, a count is kept of the number of times each distinct data value occurred with a given hypothesis/label/class.
count_joint: [HashMap<(D, H), f64>; DS]
A count is kept of the number of training instances:
count_total: f64
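The counting above can be sketched as follows. This is a minimal standalone illustration using only the standard library, not the crate's actual code; the Learner struct here is a simplified stand-in with DS fixed at 2.

```rust
use std::collections::HashMap;

// Simplified sketch of the counts the Learner maintains,
// for instances with 2 data positions (DS = 2).
struct Learner {
    count_hypotheses: HashMap<&'static str, f64>,
    count_joint: [HashMap<(char, &'static str), f64>; 2],
    count_total: f64,
}

impl Learner {
    // Record one training instance: bump the hypothesis count, the
    // per-position joint (data, hypothesis) counts, and the total.
    fn update(&mut self, data: &[char; 2], hypothesis: &'static str) {
        *self.count_hypotheses.entry(hypothesis).or_insert(0.0) += 1.0;
        for (position, &d) in data.iter().enumerate() {
            *self.count_joint[position].entry((d, hypothesis)).or_insert(0.0) += 1.0;
        }
        self.count_total += 1.0;
    }
}

fn main() {
    let mut learner = Learner {
        count_hypotheses: HashMap::new(),
        count_joint: [HashMap::new(), HashMap::new()],
        count_total: 0.0,
    };
    learner.update(&['a', 'b'], "01");
    learner.update(&['a', 'c'], "01");
    assert_eq!(learner.count_total, 2.0);
    // 'a' was seen twice at position 0 with hypothesis "01".
    assert_eq!(learner.count_joint[0][&('a', "01")], 2.0);
}
```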
When a Classifier is required, the counts maintained by the Learner are converted into probabilities and a snapshot in time of these values is taken. Due to numerical issues, logarithms of these probabilities are used (see later sections for details).
The log probability of a specific hypothesis/label/class occurring in the training data:
log_priors : HashMap<H, f64>
For each position in each instance, the log probability of a specific data value occurring in the training data given a specific hypothesis/label/class:
log_likelihoods: [HashMap<D, Vec<(H, f64)>>; DS]
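The conversion from counts to log priors can be sketched as below. This is an illustrative standalone function, not the crate's actual code; the name log_priors is an assumption for this sketch.

```rust
use std::collections::HashMap;

// Convert raw hypothesis counts into log2 prior probabilities:
// log2(|h| / total), one entry per hypothesis.
fn log_priors(count_hypotheses: &HashMap<&str, f64>, count_total: f64) -> HashMap<String, f64> {
    count_hypotheses
        .iter()
        .map(|(h, count)| (h.to_string(), (count / count_total).log2()))
        .collect()
}

fn main() {
    let mut counts = HashMap::new();
    counts.insert("01", 5.0);
    counts.insert("02", 5.0);
    counts.insert("03", 10.0);
    let priors = log_priors(&counts, 20.0);
    // p("03") = 10/20 = 0.5, so log2 = -1.0 exactly.
    assert_eq!(priors["03"], -1.0);
}
```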
To classify an instance, find the probability of each hypothesis H by multiplying the prior probability of H with the likelihood of each data value in the instance given H. These values can be found by lookups in the data structures log_priors and log_likelihoods. As discussed below, adding logarithms is used instead of directly multiplying values.
The 'naive' assumption of 'naive bayes' assumes that the probability at each position in the input array is independent of the other positions. This means that the Learner processes each position in the input array separately. Also under this assumption, the Classifier can estimate the probability of a specific combination of data values at each array position by multiplying together the probabilities at each individual position.
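Under this assumption, scoring one hypothesis reduces to a sum of logarithms, sketched below with standard library code only (not the crate's actual implementation):

```rust
// Score a hypothesis by adding its log prior to the log likelihood of
// the observed data value at each array position. Adding log2
// probabilities is equivalent to multiplying the probabilities.
fn score(log_prior: f64, log_likelihoods_per_position: &[f64]) -> f64 {
    log_prior + log_likelihoods_per_position.iter().sum::<f64>()
}

fn main() {
    // p(h) = 0.5, p(d1|h) = 0.25, p(d2|h) = 0.5
    let s = score(0.5f64.log2(), &[0.25f64.log2(), 0.5f64.log2()]);
    // Exponentiating recovers p(h) * p(d1|h) * p(d2|h) = 0.0625.
    assert!((s.exp2() - 0.0625).abs() < 1e-12);
}
```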
It is important to avoid multiplication by probability estimates of zero, as this always results in a combined estimate of zero. For this reason, the Classifier will assume that missing values of p(d|h) have some small but non-zero estimate of the probability. The Learner determines p(d|h) given a count of |(d,h)| and count of |h|:
p(d|h) = |(d,h)| / |h|
- The probability above is conditional on h; the log2 of this value is precomputed and stored for later use by the Classifier.
- To reduce storage, only p(d|h) that occur in the training data are stored. For this reason, and as noted above, the Classifier will assume that missing values of p(d|h) have some small but non-zero estimate of the probability.
- To speed up classification, the mapping h->p(d|h) is stored in another mapping that uses d as the key. This requires only O(|i|) lookups, where |i| is the number of positions in the input array. The result is:
d -> h -> p(d|h)
HashMap<D, HashMap<H, f64>>
- To reduce storage again, the inner HashMap is converted to a more compact Vec. The result is:
d -> h -> p(d|h)
HashMap<D, Vec<(H, f64)>>
- update_batch looks up and updates the count of h once per batch.
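Building the compact d -> (h, p(d|h)) table from the counts can be sketched as follows. This is an illustrative std-only version with concrete types (char data, &str hypotheses), not the crate's generic code:

```rust
use std::collections::HashMap;

// Build the d -> [(h, log2 p(d|h))] table for one array position
// from joint counts |(d,h)| and hypothesis counts |h|.
fn build_likelihoods(
    count_joint: &HashMap<(char, &str), f64>,
    count_hypotheses: &HashMap<&str, f64>,
) -> HashMap<char, Vec<(String, f64)>> {
    let mut table: HashMap<char, Vec<(String, f64)>> = HashMap::new();
    for (&(d, h), &joint) in count_joint {
        let log_p = (joint / count_hypotheses[h]).log2(); // log2 p(d|h)
        table.entry(d).or_default().push((h.to_string(), log_p));
    }
    table
}

fn main() {
    let mut joint = HashMap::new();
    joint.insert(('a', "01"), 3.0);
    joint.insert(('b', "01"), 1.0);
    let mut hyp = HashMap::new();
    hyp.insert("01", 4.0);
    let table = build_likelihoods(&joint, &hyp);
    // p('a'|"01") = 3/4, so the stored value is log2(0.75).
    assert!((table[&'a'][0].1 - 0.75f64.log2()).abs() < 1e-12);
}
```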
When a unique data value does not often occur in a much larger training data set, its corresponding probability will be very small. This can lead to numerical underflow when using floating point multiplication.
Taking each small number to be the same value v, the product decreases exponentially with the number of factors n:
value = v^n
The example below shows n=3 but a production system would have a much larger value for n and so would run out of floating point precision very quickly.
step1 = 0.001 * 0.001
step1 = 0.000001
step2 = 0.000001 * 0.001
step2 = 0.000000001
step3 = 0.000000001 * 0.001
step3 = 0.000000000001
Taking logarithms converts multiplication of positive real numbers into addition of their logarithms (the logarithm is a homomorphism between the two operations). Adding logarithms does not lead to numerical underflow in the same way.
In other words, converting floating point values to their logarithms, adding them, and then converting the result back using exponentiation will preserve more precision.
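The difference can be demonstrated directly in Rust:

```rust
fn main() {
    // Multiply 0.001 by itself 200 times: 10^-600 is far below the
    // smallest representable f64, so the product underflows to 0.0.
    let product = (0..200).fold(1.0f64, |p, _| p * 0.001);
    assert_eq!(product, 0.0); // all information lost

    // Summing log2(0.001) 200 times keeps the magnitude representable
    // (about -1993), so hypotheses can still be compared.
    let log_sum: f64 = (0..200).map(|_| 0.001f64.log2()).sum();
    assert!(log_sum.is_finite());
}
```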
Normalisation in this context means scaling the values so that they sum to one, by dividing each value by the sum of all values. This is implemented as follows:
Work out the smallest relative precision of an IEEE double precision floating point value - the mantissa has 53 binary digits, so this is mantissa = 2^-53.
Using logarithms again, divide this precision by the number of values being multiplied; this gives a threshold.
For 3 values:
|V|=3
threshold = log2(mantissa) - log2(|V|)
threshold = -53 - 1.58496250072
threshold = -54.58
- Take the logarithm of each value v in V, giving logv.
- Discard any value logv that is less than the threshold.
- Exponentiate the remaining values logv.
- Divide each exponentiated value by their sum, as normal.
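The steps above can be sketched as a single function. This is an illustrative std-only version, not the crate's actual code; discarded values are reported as probability 0.0:

```rust
// Normalise a set of log2 values: drop anything below the precision
// threshold, exponentiate the rest, then divide by the sum.
fn normalise(log_values: &[f64]) -> Vec<f64> {
    // threshold = log2(2^-53) - log2(|V|)
    let threshold = -53.0 - (log_values.len() as f64).log2();
    let kept: Vec<f64> = log_values
        .iter()
        .map(|&lv| if lv < threshold { 0.0 } else { lv.exp2() })
        .collect();
    let sum: f64 = kept.iter().sum();
    kept.iter().map(|v| v / sum).collect()
}

fn main() {
    // log2 of 0.5, 0.25, and a vanishingly small value (discarded).
    let probs = normalise(&[-1.0, -2.0, -100.0]);
    assert!((probs[0] - 2.0 / 3.0).abs() < 1e-12);
    assert_eq!(probs[2], 0.0);
}
```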
Floating-point numbers have three components:
a sign bit
an exponent
a fraction (mantissa)
So that the final number is:
sign * 2^exponent * (1 + fraction)
Floating point numbers are stored with one fewer bit of fraction than they effectively have, because zero is represented specially and every normalised non-zero number has a leading 1 bit that does not need to be stored.
Combining this, the precision of a floating point number is:
32-bit float has 24 bits of precision (fraction) or approximately 7.22 decimal digits
64-bit double has 53 bits of precision (fraction) or approximately 15.95 decimal digits
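The decimal digit figures above follow from multiplying the bits of precision by log10(2):

```rust
fn main() {
    // Decimal digits of precision = bits of precision * log10(2).
    let float_digits = 24.0 * 2.0f64.log10();
    let double_digits = 53.0 * 2.0f64.log10();
    assert!((float_digits - 7.22).abs() < 0.01);
    assert!((double_digits - 15.95).abs() < 0.01);
}
```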
- test with larger data volumes
- document stream data (where size of input is unknown)
- document core algorithm, in google docs
- refactor into sparse matrix format
- refactor into normal versus log probability operations
- determine better value for LOG2_PLACEHOLDER_PROBABILITY
https://www.cs.rhodes.edu/~kirlinp/courses/ai/f18/projects/proj3/naive-bayes-log-probs.pdf
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
Naive Bayes is a very well known technique. There are many alternatives for Rust available through cargo: https://crates.io/search?q=naive%20bayes
MIT permissive license. See LICENSE for full license details.