Skip to content
/ Fred Public

A fast, scalable and light-weight C++ Fréchet and DTW distance library, exposed to python and focused on clustering of polygonal curves.

License

Notifications You must be signed in to change notification settings

derohde/Fred

Repository files navigation

Fred alt text

A fast, scalable and light-weight C++ Fréchet distance library, exposed to python and focused on (k,l)-clustering of polygonal curves.

Ingredients C++ Backend

import Fred.backend as fred

Number of Threads

By default, Fred will automatically determine the number of threads to use. If you want to set an upper limit, call fred.set_maximum_number_threads(number).

Curve

  • signature: fred.Curve(np.ndarray), fred.Curve(np.ndarray, str name)
  • properties: fred.Curve.values: curves as np.ndarray, fred.Curve.name: get name of curve, fred.Curve.dimensions: dimension of curve, fred.Curve.complexity: number of points of curve

Curves

  • signature: fred.Curves()
  • methods: fred.Curves.add(curve): add curve, fred.Curves[i]: get ith curve, len(fred.Curves): number curves, fred.Curves.simplify(l): return set of simplified curves
  • properties: fred.Curves.m: maximum complexity of the contained curves

continous Fréchet distance

  • signature: fred.continuous_frechet(curve1, curve2)
  • returns: fred.Continuous_Frechet_Result with members value, time_bounds: running-time for upper and lower bound, number_searches: number of free space diagrams built, time_searches: running-time for free spaces
continuous Frechet distance config
  • approximation error: fred.set_continuous_frechet_epsilon(epsilon) with parameter epsilon, which defaults to 0.001
  • rounding (rounding up to 3 decimals): fred.set_continuous_frechet_rounding(round) with parameter round, which defaults to true

discrete Fréchet distance

  • signature: fred.discrete_frechet(curve1, curve2)
  • returns: fred.Discrete_Frechet_Result with members value and time

Clustering

Distance_Matrix

A fred.Distance_Matrix() can be used to speed up consecutive calls of fred.discrete_klcenter and fred.discrete_klmedian. As the name suggests, it stores the distances already computed.

discrete (k,l)-center clustering (continuous Fréchet) -- multiple calls

  • from Approximating (k,l)-center clustering for curves
  • signature: fred.discrete_klcenter_multi(k, l, curves, distances, center_domain, random_first_center) with parameters
    • k: number of centers
    • l: maximum complexity of the centers, only used when center_domain is default value
    • distances: fred.Distance_Matrix
    • center_domain: possible centers, defaults to empty fred.Curves(), in this case the input is simplified and used as center domain
    • random_first_center: determines if first center is chosen uniformly at random or first curve is used as first center, optional, defaults to true
  • returns: fred.Clustering_Result with mebers
    • value: objective value
    • time: running-time
    • assignment: empty if compute_assignment has not been called

discrete (k,l)-median clustering (continuous Fréchet) -- multiple calls

  • Algorithm 6 in Coresets for (k,l)-Clustering under the Fréchet distance + simplification
  • signature: fred.discrete_klmedian_multi(k, l, curves, distances, center_domain) with parameters
    • k: number of centers
    • l: maximum complexity of the centers, only used when center_domain is default value
    • distances: fred.Distance_Matrix
    • center_domain: possible centers, optional parameter, if not given the input is simplified and used as center domain
  • returns: fred.Clustering_Result with mebers
    • value: objective value
    • time: running-time
    • assignment: empty if compute_assignment has not been called

discrete (k,l)-center clustering (continuous Fréchet) -- oneshot

  • from Approximating (k,l)-center clustering for curves
  • signature: fred.discrete_klcenter(k, l, curves, center_domain, random_first_center) with parameters
    • k: number of centers
    • l: maximum complexity of the centers, only used when center_domain is default value
    • center_domain: possible centers, optional parameter, if not given the input is simplified and used as center domain
    • random_first_center: determines if first center is chosen uniformly at random or first curve is used as first center, optional, defaults to true
  • returns: fred.Clustering_Result with mebers
    • value: objective value
    • time: running-time
    • assignment: empty if compute_assignment has not been called

discrete (k,l)-median clustering (continuous Fréchet) -- oneshot

  • Algorithm 6 in Coresets for (k,l)-Clustering under the Fréchet distance + simplification
  • signature: fred.discrete_klmedian(k, l, curves, center_domain) with parameters
    • k: number of centers
    • l: maximum complexity of the centers, only used when center_domain is default value
    • center_domain: possible centers, optional parameter, if not given the input is simplified and used as center domain
  • returns: fred.Clustering_Result with mebers
    • value: objective value
    • time: running-time
    • assignment: empty if compute_assignment has not been called

Clustering Result

  • signature: fred.Clustering_Result
  • methods: len(fred.Clustering_Result): number of centers, fred.Clustering_Result[i]: get ith center, fred.Clustering_Result.compute_assignment(fred.Curves): assigns every curve to its nearest center
  • members: value: objective value, time: running-time, assignment: empty if compute_assignment was not called

Cluster Assignment

  • signature: fred.Cluster_Assignment
  • methods: len(fred.Cluster_Assignment): number of centers, fred.Cluster_Assignment.count(i): number of curves assigned to center i, fred.Cluster_Assignment.get(i,j): get index of jth curve assigned to center i

Dimension Reduction via Gaussian Random Projection

Installation

Get requirements under Ubuntu: make pre

Python3 installation into userdir: make install

If something does not work with Boost

Manual installation of Boost

  • mkdir $HOME/boost (This folder is hardcoded in setup.py, another location won't work.)
  • cd /tmp
  • wget https://dl.bintray.com/boostorg/release/1.73.0/source/boost_1_73_0.tar.gz
  • tar -xzf boost_1_73_0.tar.gz
  • cd boost_1_73_0
  • ./bootstrap.sh --with-python=/usr/bin/python3
  • ./b2 install --prefix=$HOME/boost

After that, go back to Freds folder and run make clean and then make install

Test

Just run python py/test.py.

Mini Example

import Fred.backend as fred
import Fred
import numpy as np
import pandas as pd

curve1d = fred.Curve(np.array([1., 2.])) # Curve stores a polygonal curve with 
                                         # at least two points of at least one 
                                         # and equal number of dimensions

curve2d1 = fred.Curve(np.array([[1., 0.], [2., 1.], [3., 0.]])) # any number of dimensions and points works
curve2d2 = fred.Curve(np.array([[1., -1.], [2., -2.], [3., -1.]]), "optional name, e.g. displayed in plot") 

print(curve2d1)

Fred.plot_curve(curve2d1, curve2d2)
Fred.plot_curve(curve2d2, fred.weak_minimum_error_simplification(curve2d2, 2))

print("distance is {}".format(fred.continuous_frechet(curve2d1, curve2d2).value))

print("download HUGE curves") 

import requests, zipfile, io             # you can use all libraries 
                                         # that work with numpy to read data into fred
                                         
re = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00447/data.zip", stream=True)
zf = zipfile.ZipFile(io.BytesIO(re.content))

ps1 = fred.Curve(pd.read_csv(zf.open('PS1.txt'), delimiter="\t", header=None).values[:50], "PS1")
ps2 = fred.Curve(pd.read_csv(zf.open('PS2.txt'), delimiter="\t", header=None).values[:50], "PS2")
ps3 = fred.Curve(pd.read_csv(zf.open('PS3.txt'), delimiter="\t", header=None).values[:50], "PS3")
ps4 = fred.Curve(pd.read_csv(zf.open('PS4.txt'), delimiter="\t", header=None).values[:50], "PS4")
ps5 = fred.Curve(pd.read_csv(zf.open('PS5.txt'), delimiter="\t", header=None).values[:50], "PS5")
ps6 = fred.Curve(pd.read_csv(zf.open('PS6.txt'), delimiter="\t", header=None).values[:50], "PS6")

curves = fred.Curves() # for clustering or if you want to apply dimension reduction
                       # you need to encapsulate your curves in a Curves object
              
curves.add(ps1)
curves.add(ps2)
curves.add(ps3)
curves.add(ps4)
curves.add(ps5)
curves.add(ps6)

Fred.plot_curve(curves)

curves = fred.dimension_reduction(curves, 0.95) # fred is pretty fast but with high dimensional data
                                                # a dimension reduction massively improves running-time
                                                # even for smaller values of epsilon
                                                
Fred.plot_curve(curves)
                                  
# Oneshot clustering - if you already know the value of k
                                  
clustering = fred.discrete_klcenter(2, 10, curves) # fast but coarse
          
clustering = fred.discrete_klmedian(2, 10, curves) # slow but better results

print("clustering cost is {}".format(clustering.value))

for i, center in enumerate(clustering):
    print("center {} is {}".format(i, center))
    
    
Fred.plot_curve(clustering)

# Multiple clustering calls - if you need to find a suitable value for k

dm = fred.Distance_Matrix() # computing the Fréchet distance is costly,
                            # therefore we buffer each distance already computed to
                            # speed up consecutive clustering calls
                            
for k in range(2, 6):
    
    clustering = fred.discrete_klcenter_multi(k, 10, curves, dm)
    print("clustering cost is {}".format(clustering.value))
            
    clustering = fred.discrete_klmedian_multi(k, 10, curves, dm)
    print("clustering cost is {}".format(clustering.value))
    
clustering.compute_assignment(curves)

for i in range(0, len(clustering)):
    for j in range(0, clustering.assignment.count(i)):
        print("{} was assigned to center {}".format(curves[clustering.assignment.get(i,j)].name, clustering[i].name))