A number of suggested fixes #1

Open: wants to merge 5 commits into base: master
33 changes: 33 additions & 0 deletions LICENSE
@@ -0,0 +1,33 @@
Preamble

This Simple Public License 2.0 (SimPL-2.0 for short) is a plain language implementation of GPL 2.0. The words are different, but the goal is the same - to guarantee for all users the freedom to share and change software. If anyone wonders about the meaning of the SimPL, they should interpret it as consistent with GPL 2.0. Original text available on the web: http://opensource.org/licenses/SimPL-2.0.

Simple Public License (SimPL) 2.0

The SimPL applies to the software's source and object code and comes with any rights that I have in it (other than trademarks). You agree to the SimPL by copying, distributing, or making a derivative work of the software.

You get the royalty free right to:

Use the software for any purpose;
Make derivative works of it (this is called a "Derived Work");
Copy and distribute it and any Derived Work.

If you distribute the software or a Derived Work, you must give back to the community by:

Prominently noting the date of any changes you make;
Leaving other people's copyright notices, warranty disclaimers, and license terms in place;
Providing the source code, build scripts, installation scripts, and interface definitions in a form that is easy to get and best to modify;
Licensing it to everyone under SimPL, or substantially similar terms (such as GPL 2.0), without adding further restrictions to the rights provided;
Conspicuously announcing that it is available under that license.

There are some things that you must shoulder:

You get NO WARRANTIES. None of any kind;
If the software damages you in any way, you may only recover direct damages up to the amount you paid for it (that is zero if you did not pay anything). You may not recover any other damages, including those called "consequential damages." (The state or country where you live may not allow you to limit your liability in this way, so this may not apply to you);

The SimPL continues perpetually, except that your license rights end automatically if:

You do not abide by the "give back to the community" terms (your licensees get to keep their rights if they abide);
Anyone prevents you from distributing the software under the terms of the SimPL.

If you have questions, please contact [email protected] regarding the license or use of this for industrial purposes.
21 changes: 13 additions & 8 deletions README.md
@@ -21,10 +21,14 @@ This package is written in C++ and Python. We require at least g++ version 5 and

3. Prerequisites

+Software:
++ C++ compiler
++ Python 2.7
+
The following packages are needed in Python for the code to run:

```
-C++, Python 2, ngram, sklearn, numpy, scipy, matlib
+ngram, sklearn, numpy, scipy, matlib
```

Remark: In order to install using pip, one will need to run the following commands if errors arise from the terminal due to recent changes with SSH in pip (Linux and MacOS)
@@ -38,8 +42,8 @@ pip2 install numpy scipy matplotlib

```
cd C++Codes
-g++ -std=c++11 *.cpp -fopenmp (on Windows and Linux)
-g++ *.cpp -fopenmp (on MacOS)
+g++ -o minhash -std=c++11 *.cpp -fopenmp (on Windows and Linux)
+g++ -o minhash *.cpp -fopenmp (on MacOS)
```

Remark: For mac users, the g++ version needs to be 5 or higher.
@@ -65,7 +69,7 @@ Use the C++ Package folder in this repository. This is a fast minhash package wh

1. Update the Config file for minhash and run the program (Remember to change the outputfile name option to Restaurant_pair.csv or the particular name of your data set.) The second and third arguments are K and L respectively.
```
-./a.out Config.txt 1 10
+./C++Codes/minhash config_restaurant.txt 1 10
```
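
For readers unfamiliar with the two numeric arguments, the following is a minimal Python sketch of how K and L are conventionally used in minhash LSH: each of L hash tables keys a record by K concatenated minhash values, and any two records that share a bucket in at least one table become a candidate pair. This is an illustration under assumptions, not the C++ package's implementation; the character-bigram shingling, the salted MD5 hashing, and all function names are hypothetical.

```python
import hashlib
from itertools import combinations


def shingles(text, n=2):
    """Return the set of character n-grams of a string (assumed tokenisation)."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash(tokens, seed):
    """Minimum of a seeded hash over the token set: one minhash value."""
    return min(
        int(hashlib.md5(("%d:%s" % (seed, t)).encode("utf-8")).hexdigest(), 16)
        for t in tokens
    )


def candidate_pairs(records, K, L):
    """Pair records that agree on all K minhashes in at least one of L tables."""
    pairs = set()
    for table in range(L):
        buckets = {}
        for rec_id, text in enumerate(records, start=1):  # record IDs start at 1
            tokens = shingles(text)
            # Each table uses its own K hash seeds (hypothetical seeding scheme).
            key = tuple(minhash(tokens, table * K + k) for k in range(K))
            buckets.setdefault(key, []).append(rec_id)
        for ids in buckets.values():
            pairs.update(combinations(ids, 2))
    return sorted(pairs)
```

In this scheme a larger K makes each bucket more selective (fewer candidate pairs), while a larger L gives similar records more chances to collide (more candidate pairs).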

The output is `Restaurant_pair.csv` where the output is candidate record pairs:
@@ -81,7 +85,7 @@ Rec1 Rec2
where there are many customizable options.

```
-Python pipeline.py --input Restaurant_pair.csv --goldstan data/Restaurant.csv --output any_custom_file_name
+python pipeline.py --input Restaurant_pair.csv --goldstan data/Restaurant.csv --output any_custom_file_name
```


@@ -105,7 +109,7 @@ ID RR (reduction ratio) LSHE

LSHE is the proposed estimator. RR is the reduction ratio of the number of sampled pairs used in the estimation out of total possible pairs.
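
As a rough illustration of the RR column, the sketch below computes the fraction of all possible record pairs that were actually sampled, which is how the sentence above reads. This is an assumption about the definition: the record-linkage literature often defines the reduction ratio as one minus this fraction, so check `pipeline.py` before relying on either form. The function names are hypothetical.

```python
def total_pairs(n_records):
    """Number of unordered record pairs among n_records records."""
    return n_records * (n_records - 1) // 2


def sampled_fraction(n_sampled_pairs, n_records):
    """Fraction of all possible pairs that were sampled (assumed meaning of RR)."""
    total = total_pairs(n_records)
    if total == 0:
        raise ValueError("need at least two records")
    return float(n_sampled_pairs) / total
```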

-#### fasthash Script
+#### Unique Entity Estimation Script

For better usability, an example script `run_script.sh` produces LSHE estimates and plots very similar to those in our paper. This script will run all four data sets, assuming the user has access to the two public data sets and two private data sets. To run the script, simply change into the main directory and then run

@@ -128,5 +132,6 @@ Year = {2018},
Journal = {Annals of Applied Statistics, To Appear}}
```

-#### Awknowledgements
-We would like to thank the Human Rights Data Analysis Group (HRDAG) for providing the data that has movitated this work. Specifically, we thank Megan Price and Patrick Ball for stimulating conversations and feedback that would have not made this work possible. This work would also have not been possible without the support and encouragement of Steve Fienberg and Lars Vilhuber.
+### Acknowledgements
+
+We would like to thank the Human Rights Data Analysis Group (HRDAG) for providing the data that has motivated this work. Specifically, we thank Megan Price and Patrick Ball for stimulating conversations and feedback, without which this work would not have been possible. This work would also not have been possible without the support and encouragement of Steve Fienberg and Lars Vilhuber.
4 changes: 2 additions & 2 deletions config_restaurant.txt
@@ -23,10 +23,10 @@ Thresh=3
#Give the input CSV file. The first line will be ignored (assumed to be the header). Every other line will be treated as a record.
#The line number of a record will be its ID. That is, the first line after the header is treated as the record with ID 1, etc.

-Input=data/restaurant.csv
+Input=data/Restaurant.csv
#Output File: this will contain a pair of record IDs in each line indicating a possible match.

-Output=restaurant_pair.csv
+Output=Restaurant_pair.csv
##############################################################################
#These are advanced parameters depending on memory
##############################################################################
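
The ID convention described in the input-file comments of this config (header skipped, first data line is record 1) can be mirrored in Python when joining the pair file back to the input. A minimal sketch, assuming a comma-delimited file in which no field contains an embedded newline; `load_records` is a hypothetical helper and not part of the repository.

```python
import csv


def load_records(lines, delimiter=","):
    """Map record ID -> parsed row; the header is skipped and IDs start at 1."""
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # discard the header line
    return {rec_id: row for rec_id, row in enumerate(reader, start=1)}
```

`lines` can be an open file handle or any iterable of strings, so a pair such as `3 17` in the output file can be resolved with `records[3]` and `records[17]`.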
20 changes: 10 additions & 10 deletions run_script.sh
@@ -3,30 +3,30 @@

#!/bin/bash

-g++-7 -std=c++11 C++Codes/*.cpp -o output -fopenmp
+g++-7 -std=c++11 C++Codes/*.cpp -o minhash -fopenmp

#For Restaurant
for ((i=6;i<=25;i+=6)) ;
do for ((j=1;j<=10; j++));
-do ./output config_restaurant.txt 1 $i; python pipeline.py --flag 0 --id $i --trainsize 0.3 --input restaurant_pair.csv --goldstan data/restaurant.csv --output log-restaurant ;
+do ./minhash config_restaurant.txt 1 $i; python pipeline.py --flag 0 --id $i --trainsize 0.3 --input restaurant_pair.csv --goldstan data/restaurant.csv --output log-restaurant ;
done
done

python plot.py --input log-restaurant --gt 753

#For CD
-for ((i=6;i<=20;i+=4)) ;
-do for ((j=1;j<=3; j++));
-do ./output config_cd.txt 1 $i; python pipeline.py --flag 0 --id $i --trainsize 0.5 --input cd_pair.csv --goldstan data/cd.csv --delimiter ';' --output log-cd ;
-done
-done
+# for ((i=6;i<=20;i+=4)) ;
+# do for ((j=1;j<=3; j++));
+# do ./minhash config_cd.txt 1 $i; python pipeline.py --flag 0 --id $i --trainsize 0.5 --input cd_pair.csv --goldstan data/cd.csv --delimiter ';' --output log-cd ;
+# done
+# done

-python plot.py --input log-cd --gt 9508
+# python plot.py --input log-cd --gt 9508

#For Voter
# for ((i=25;i<=40;i+=5)) ;
# do for ((j=1;j<=10; j++));
-# do ./output config_voter.txt 4 $i; python pipeline.py --flag 0 --id $i --trainsize 0.1 --input voter_pair.csv --goldstan data/voter.csv --delimiter ',' --c 0.0001 --output log-voter ;
+# do ./minhash config_voter.txt 4 $i; python pipeline.py --flag 0 --id $i --trainsize 0.1 --input voter_pair.csv --goldstan data/voter.csv --delimiter ',' --c 0.0001 --output log-voter ;
# done
# done

@@ -36,7 +36,7 @@ g++-7 -std=c++11 C++Codes/*.cpp -o output -fopenmp
# python preprocess.py

#for ((i=1;i<=10;i++)) ;
-# do ./output config_syria.txt 15 10; python pipeline_for_syria.py --input syria_pair.csv --output log-syria --rawdata data/syria.csv --goldstandpair data/syria_train.csv;
+# do ./minhash config_syria.txt 15 10; python pipeline_for_syria.py --input syria_pair.csv --output log-syria --rawdata data/syria.csv --goldstandpair data/syria_train.csv;
#done

#python count.py --input log-syria
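
The nested loops in this script sweep L over a range and repeat each setting several times. A small Python sketch of the same grid for the Restaurant block may help when checking how many runs a change produces; `plan_runs` is a hypothetical helper that only builds the command strings and does not execute anything.

```python
def plan_runs(binary, config, K, l_start, l_stop, l_step, repeats):
    """List the minhash command lines that the nested bash loops would issue."""
    commands = []
    # Mirrors: for ((i=l_start; i<=l_stop; i+=l_step))
    for L in range(l_start, l_stop + 1, l_step):
        # Mirrors: for ((j=1; j<=repeats; j++))
        for _ in range(repeats):
            commands.append("%s %s %d %d" % (binary, config, K, L))
    return commands


# Restaurant settings taken from the script: K=1, L from 6 to 25 in steps of 6,
# with 10 repetitions per value of L.
restaurant = plan_runs("./minhash", "config_restaurant.txt", 1, 6, 25, 6, 10)
```

For the Restaurant settings this yields 40 runs, with L taking the values 6, 12, 18, and 24.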
18 changes: 18 additions & 0 deletions setup.sh
@@ -0,0 +1,18 @@
# Setup script
# Assumes presence of Anaconda

# Create an environment
conda create --name LSH python=2.7
source activate LSH

# Install packages from Anaconda
conda install numpy
conda install scipy

# Install packages using pip
pip install --pre subprocess32
pip install ngram
pip install sklearn
pip install matlib

# this fails due to dependency failure: matlib.h
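
Since the last install step is reported to fail, one quick way to see which of the listed packages actually ended up importable in the environment is to probe them from Python. A minimal sketch, assuming the import names match the names used in this script, which is not guaranteed (for example, the `scikit-learn` distribution is imported as `sklearn`); `missing_modules` is a hypothetical helper.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Calling `missing_modules(["numpy", "scipy", "ngram", "sklearn", "matlib"])` inside the activated `LSH` environment lists whichever of those names still fail to import.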