This repository contains the code for the ICLR 2019 paper *Learning-Based Frequency Estimation Algorithms*.
If you find the code useful, please cite our paper!
```
@inproceedings{hsu2019learning,
  title={Learning-Based Frequency Estimation Algorithms},
  author={Hsu, Chen-Yu and Indyk, Piotr and Katabi, Dina and Vakilian, Ali},
  booktitle={International Conference on Learning Representations},
  year={2019}
}
```
The code was developed under the following configuration:
- Ubuntu 14.04, Python 3.4
- Tensorflow-GPU 1.4.0, CUDA 8.0.61, cuDNN 6
To install the required Python packages, run `pip install -r requirement.txt`.
For setting up TensorFlow with GPU support (CUDA, cuDNN), see https://www.tensorflow.org/install
Internet traffic dataset (CAIDA) website: http://www.caida.org/data/passive/passive_dataset.xml
For each minute, we preprocessed the packet count and features of each unique internet flow into a file named `equinix-chicago.dirA.20160121-${minute}00.ports.npy`:
```
>>> import numpy as np
>>> data = np.load('equinix-chicago.dirA.20160121-140000.ports.npy').item()
>>> data.keys()
dict_keys(['y', 'x', 'note'])
>>> data['note']
'./data/2016/20160121-130000.UTC/equinix-chicago.dirA.20160121-140000'
>>> data['x'].shape  # 1142626 unique flows in this minute
(1142626, 11)
>>> data['x'][0][:8]  # source ip (first 4) and destination ip (last 4)
array([ 198., 115., 14., 163., 1., 91., 194., 1.])
>>> data['x'][0][8:10]  # source port and destination port
array([ 6., 35059.])
>>> data['x'][0][-1]  # protocol type
22.0
>>> data['y'][0]  # number of packets
153733
```
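For reference, here is a minimal sketch (not part of this repository) of how these per-minute files could be loaded and turned into feature/label arrays. The glob pattern, feature scaling, and `allow_pickle` flag are illustrative assumptions; the repository's own loading code may differ.

```python
# Minimal sketch of loading the per-minute CAIDA files described above.
# Assumptions: files sit in the current directory, and IP bytes / ports
# are scaled for use as neural-network features.
import glob
import numpy as np

def load_minute(path):
    # allow_pickle=True is needed on newer NumPy versions because the
    # file stores a pickled Python dict with keys 'x', 'y', and 'note'.
    data = np.load(path, allow_pickle=True).item()
    x, y = data['x'], data['y']
    ips = x[:, :8] / 255.0          # 8 IP bytes (source + destination)
    ports = x[:, 8:10] / 65535.0    # source and destination ports
    proto = x[:, 10:11]             # protocol type
    feats = np.concatenate([ips, ports, proto], axis=1)
    return feats.astype(np.float32), y.astype(np.float32)

if __name__ == '__main__':
    for path in sorted(glob.glob('equinix-chicago.dirA.20160121-*.ports.npy')):
        feats, counts = load_minute(path)
        print(path, feats.shape, counts.shape)
```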
Please request data access on the CAIDA website (https://www.caida.org/data/passive/passive_dataset_request.xml). We can share the preprocessed data once you email us your approval from CAIDA (approval usually takes 2-3 days).
Search query dataset (AOL) website: https://jeffhuang.com/search_query_logs.html
For each day, we preprocessed the number of times each search query appears into a separate file (`aol_00${day}_len60.npz`):
```
>>> import numpy as np
>>> data = np.load('aol_0075_len60.npz')
>>> data.files
['query_lens', 'char_list', 'counts', 'queries', 'query_char_ids']
>>> data['queries'][0]  # the search query
'google'
>>> data['counts'][0]  # the number of times it appears
3158
>>> data['char_list']  # character vocabulary
array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
       'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
       '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ' ', '.', '-',
       "'", '&', ';', 'nv'],
      dtype='<U2')
>>> data['query_char_ids'][0]  # the search query after character encoding (char -> int)
                               # (queries longer than 60 characters are truncated)
array([ 7., 15., 15.,  7., 12.,  5.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.])
>>> data['query_lens'][0]  # length of the query
6
```
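As a sanity check, the snippet below (not part of the repository) decodes `query_char_ids` back into a string. Judging from the `'google'` example above, each id appears to be its character's index in `char_list` plus one, with 0 used as padding; treat that offset as an assumption.

```python
# Decode character ids back to a query string (sketch, based on the
# apparent encoding: id = index into char_list + 1, 0 = padding).
import numpy as np

data = np.load('aol_0075_len60.npz')
char_list = data['char_list']

def decode(char_ids, length):
    return ''.join(char_list[int(i) - 1] for i in char_ids[:int(length)])

print(decode(data['query_char_ids'][0], data['query_lens'][0]))  # expected: 'google'
```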
The preprocessed data can be downloaded here: https://drive.google.com/open?id=1jWot63TJY8WHk5_bQPgIz-EPV0a6BDVX (~37 GB after unpacking)
For evaluation (see "Using the model's predictions" below), download the (query, counts) pairs here: https://drive.google.com/open?id=1RclH6mFvbGKm5aB4MQT9HVgQO94GLNmd (~30 GB after unpacking)
Example commands are in `run.sh`.
`run_ip_model.py` and `run_aol_model.py` contain the model definitions, training, and inference code for the internet traffic and search query datasets, respectively. The models are defined in `construct_graph()`. The code creates several folders:
- `./log`: training config, git commit, and terminal output for each run
- `./model`: saved model checkpoints (every `--eval_n_epochs` epochs)
- `./predictions`: saved inference results (predicted counts for the items) during training
- `./summary`: TensorBoard summary files
To run model inference only, add the flag `--evaluate` and use `--resume` to provide the path to a model checkpoint.
To change the neural network, training, and other parameters, please check the input arguments.
To use the model's predictions in the LearnedSketch algorithm, see the example commands in `eval.sh`.
The script runs each algorithm with different amounts of space and different numbers of hash functions. We use these results to generate the loss vs. space curves in the paper. The algorithms implemented by each Python script are listed below:
- `count_min_param.py`: Count-Min (CM)
- `cutoff_count_min_para.py`: Learned CM, Table Lookup CM, and Learned CM with Ideal Oracle (use `--count_sketch` to replace Count-Min with Count-Sketch)
The results and logs will be saved in `./param_results/`.
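For context, below is a small, self-contained sketch (not the repository's implementation) of the two ideas these scripts evaluate: a standard Count-Min sketch, and a learned variant in which items that an oracle predicts to be heavy hitters get their own exact counters while everything else shares the Count-Min tables. Hashing and bucket sizing are deliberately simplified here.

```python
# Illustrative sketch of Count-Min and a learned variant; simplified,
# not the code used in count_min_param.py / cutoff_count_min_para.py.
import random
from collections import defaultdict

class CountMin:
    def __init__(self, n_hashes, n_buckets, seed=0):
        rng = random.Random(seed)
        self.seeds = [rng.randrange(1 << 30) for _ in range(n_hashes)]
        self.n_buckets = n_buckets
        self.tables = [[0] * n_buckets for _ in range(n_hashes)]

    def _bucket(self, item, seed):
        return hash((seed, item)) % self.n_buckets

    def update(self, item, count=1):
        for seed, table in zip(self.seeds, self.tables):
            table[self._bucket(item, seed)] += count

    def query(self, item):
        # Count-Min only over-estimates, so take the minimum over rows.
        return min(table[self._bucket(item, seed)]
                   for seed, table in zip(self.seeds, self.tables))

class LearnedCountMin:
    """Predicted heavy hitters get exact counters; the rest share a Count-Min."""
    def __init__(self, predicted_heavy, n_hashes, n_buckets):
        self.heavy = set(predicted_heavy)
        self.exact = defaultdict(int)
        self.cm = CountMin(n_hashes, n_buckets)

    def update(self, item, count=1):
        if item in self.heavy:
            self.exact[item] += count
        else:
            self.cm.update(item, count)

    def query(self, item):
        return self.exact[item] if item in self.heavy else self.cm.query(item)
```

In the experiments, the set of predicted heavy hitters comes from the trained model's predicted counts (or from a table lookup / an ideal oracle), and the cutoff is chosen so that the exact counters plus the sketch fit the given space budget.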
To reproduce the loss vs. space curves in the paper using our trained model's predictions:
- Download the model predictions used in the paper here: https://drive.google.com/file/d/1PlmYUYEWHKJWOOyR1GrBuV3mcbSBdpiV/view
- Run `eval.sh` (see the comments for the different algorithms & baselines)
- Run `loss_vs_space_ip.sh` (or `loss_vs_space_aol.sh`) in the `plotting` folder to plot the figures

Alternatively, you can download the saved `param_results` here: https://drive.google.com/open?id=1n2jDVhvKPwtFevyej42hGRieOeqIYoSJ and run `loss_vs_space_ip.sh` (or `loss_vs_space_aol.sh`) directly.