Commit: Fixed README.md

drkostas committed May 15, 2020
1 parent 06850dc commit 1bc0cea
Showing 5 changed files with 273 additions and 214 deletions.
237 changes: 148 additions & 89 deletions README.md
# Template Python Project
[![CircleCI](https://circleci.com/gh/drkostas/HGN/tree/master.svg?style=svg)](https://circleci.com/gh/drkostas/HGN/tree/master)
[![GitHub license](https://img.shields.io/badge/license-GNU-blue.svg)](https://raw.githubusercontent.com/drkostas/HGN/master/LICENSE)

## Table of Contents
+ [About](#about)
+ [Getting Started](#getting_started)
    + [Prerequisites](#prerequisites)
    + [Set the required environment variables](#env_variables)
+ [Installing, Testing, Building](#installing)
+ [Running the code locally](#run_locally)
    + [Modifying the Configuration](#configuration)
+ [Deployment](#deployment)
+ [Continuous Integration](#ci)
+ [TODO](#todo)
+ [Built With](#built_with)
+ [License](#license)
+ [Acknowledgments](#acknowledgments)

## About <a name = "about"></a>
Code for the paper "[A Distributed Hybrid Community Detection Methodology for Social Networks.](https://www.mdpi.com/1999-4893/12/8/175)"
<br><br>
The proposed methodology is an iterative, divisive community detection process that combines the network topology features
of loose similarity and local edge betweenness with the user content information in order to remove the
inter-connecting edges and thus reveal the underlying community structure. Although this iterative process may sound
computationally demanding, it is not prohibitive in practice: the experimental results show that the aforementioned
measures are informative and representative enough that only a few iterations are needed to converge to the final
community hierarchy.
<br><br>
Implementation last tested with [Python 3.6](https://www.python.org/downloads/release/python-36),
[Apache Spark 2.4.5](https://spark.apache.org/docs/2.4.5/)
and [GraphFrames 0.8.0](https://github.com/graphframes/graphframes/tree/v0.8.0)
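
As a quick sanity check of this stack, a minimal smoke test along the following lines (a hedged sketch, not part of the project code; it assumes `pyspark` and the `graphframes` bindings are importable) should print version 2.4.5 and a trivial degree count:

```python
# Hedged smoke test for the Spark + GraphFrames stack (not part of the project code).
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = (SparkSession.builder
         .master("local[*]")
         .appName("hgn-smoke-test")  # hypothetical app name
         .getOrCreate())
print(spark.version)  # expected: 2.4.5

# A two-node toy graph: GraphFrames needs an `id` column and `src`/`dst` columns.
vertices = spark.createDataFrame([("a",), ("b",)], ["id"])
edges = spark.createDataFrame([("a", "b")], ["src", "dst"])
graph = GraphFrame(vertices, edges)
print(graph.degrees.collect())

spark.stop()
```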

## Getting Started <a name = "getting_started"></a>

These instructions will get you a copy of the project up and running on your local machine for development
and testing purposes. See [Deployment](#deployment) for notes on how to deploy the project on a live system.

### Prerequisites <a name = "prerequisites"></a>

You need to have a machine with Python 3.6, Apache Spark 2.4.5, GraphFrames 0.8.0,
and any Bash-based shell (e.g. zsh) installed. Apache Spark 2.4.5 also requires Java 8.


```bash
$ echo $SHELL
/usr/bin/zsh
```



### Set the required environment variables <a name = "env_variables"></a>

In order to run [main.py](main.py) or the tests, you will need to set the following
environment variables in your system (or in the [spark.env file](spark.env)):

```bash
$ export SPARK_HOME="<Path to Spark Home>"
$ export PYSPARK_SUBMIT_ARGS="--packages graphframes:graphframes:0.8.0-spark2.4-s_2.11 pyspark-shell"
$ export JAVA_HOME="<Path to Java 8>"

$ cd $SPARK_HOME
$ pwd
/usr/local/spark
$ ./bin/pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_252
Branch HEAD
Compiled by user centos on 2020-02-02T19:38:06Z
Revision cee4ecbb16917fa85f02c635925e2687400aa56b
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
```
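
If you want to verify those variables from Python before launching anything, a tiny pre-flight check (a sketch; the variable names follow [spark.env](spark.env)) could look like this:

```python
# Hedged pre-flight check: fail fast if the Spark-related variables are missing.
import os

for var in ("SPARK_HOME", "PYSPARK_SUBMIT_ARGS", "JAVA_HOME"):
    value = os.environ.get(var)
    if not value:
        raise EnvironmentError(f"{var} is not set -- see spark.env")
    print(f"{var}={value}")
```
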
## Installing, Testing, Building <a name = "installing"></a>

All the installation steps are handled by the [Makefile](Makefile); run `make help` to list the available
commands. For example, to clean the project:

```bash
$ make clean server=local
make delete_venv
make[1]: Entering directory '/home/drkostas/Projects/HGN'
Deleting venv..
rm -rf venv
make[1]: Leaving directory '/home/drkostas/Projects/HGN'
make clean_pyc
make[1]: Entering directory '/home/drkostas/Projects/HGN'
Cleaning pyc files..
find . -name '*.pyc' -delete
find . -name '*.pyo' -delete
find . -name '*~' -delete
make[1]: Leaving directory '/home/drkostas/Projects/HGN'
make clean_build
make[1]: Entering directory '/home/drkostas/Projects/HGN'
Cleaning build directories..
rm --force --recursive build/
rm --force --recursive dist/
rm --force --recursive *.egg-info
make[1]: Leaving directory '/home/drkostas/Projects/HGN'

```


## Running the code locally <a name = "run_locally"></a>

In order to run the code, you should first place the graph whose communities you want to identify under
[data/input_graphs](data/input_graphs).<br>
You will also need to create a yml config file for any new graph before executing [main.py](main.py).

### Modifying the Configuration <a name = "configuration"></a>

There are two preconfigured yml files: [confs/quakers.yml](confs/quakers.yml)
and [confs/hamsterster.yml](confs/hamsterster.yml), with the following structure:

```yaml
tag: dev  # Required
spark:
  - config:  # The spark settings
      spark.master: local[*]  # Required
      spark.submit.deployMode: client  # Required
      spark_warehouse_folder: data/spark-warehouse  # Required
      spark.ui.port: 4040
      spark.driver.cores: 5
      spark.driver.memory: 8g
      spark.driver.memoryOverhead: 4096
      spark.driver.maxResultSize: 0
      spark.executor.instances: 2
      spark.executor.cores: 3
      spark.executor.memory: 4g
      spark.executor.memoryOverhead: 4096
      spark.sql.broadcastTimeout: 3600
      spark.sql.autoBroadcastJoinThreshold: -1
      spark.sql.shuffle.partitions: 4
      spark.default.parallelism: 4
      spark.network.timeout: 3600s
    dirs:
      df_data_folder: data/dataframes  # Folder to store the DataFrames as parquets
      spark_warehouse_folder: data/spark-warehouse
      checkpoints_folder: data/checkpoints
      communities_csv_folder: data/csv_data  # Folder to save the computed communities as csvs
input:
  - config:  # All properties required
      name: Quakers
      nodes:
        path: data/input_graphs/Quakers/quakers_nodelist.csv2  # Path to the nodes file
        has_header: true  # Whether they have a header with the attribute names
        delimiter: ','
        encoding: ISO-8859-1
        feature_names:  # You can rename the attributes (their count should match the original columns)
          - id
          - Historical_Significance
          - Gender
          - Birthdate
          - Deathdate
          - internal_id
      edges:
        path: data/input_graphs/Quakers/quakers_edgelist.csv2  # Path to the edges file
        has_header: true  # Whether they have a header with the source and dest
        has_weights: false  # Whether they have a weight column
        delimiter: ','
    type: local
run_options:  # All properties required
  - config:
      cached_init_step: false  # Whether the cosine similarities and edge betweenness have already been computed
      # See the paper for info regarding the following attributes
      feature_min_avg: 0.33
      r_lvl1_thres: 0.50
      r_lvl2_thres: 0.85
      max_edge_weight: 0.50
      betweenness_thres: 10
      max_sp_length: 2
      min_comp_size: 2
      max_steps: 30  # Max steps for the algorithm to run if it doesn't converge
      features_to_check:  # Which attributes to take into consideration for the cosine similarities
        - id
        - Gender
output:  # All properties required
  - config:
      logs_folder: data/logs
      save_communities_to_csvs: false  # Whether to save the computed communities in csvs or not
      visualizer:
        dimensions: 3  # Dimensions of the scatter plot (2 or 3)
        save_img: true
        folder: data/plots
        steps:  # The steps to plot
          - 0   # The step before entering the main loop
          - -1  # The last step
```
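
To make the `input` section above concrete, here is a hedged sketch of how such a config maps onto a plain GraphFrames load (the project's actual loader also renames the columns according to `feature_names`; the `Name` source column below is a hypothetical example):

```python
# Hedged sketch: read the Quakers node/edge files exactly as the config describes.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.master("local[*]").getOrCreate()

nodes = (spark.read
         .option("header", True)           # has_header: true
         .option("delimiter", ",")         # delimiter: ','
         .option("encoding", "ISO-8859-1")
         .csv("data/input_graphs/Quakers/quakers_nodelist.csv2")
         .withColumnRenamed("Name", "id"))  # hypothetical original column name
edges = (spark.read
         .option("header", True)
         .option("delimiter", ",")
         .csv("data/input_graphs/Quakers/quakers_edgelist.csv2")
         .toDF("src", "dst"))              # has_weights: false -> two columns

graph = GraphFrame(nodes, edges)
print(graph.vertices.count(), graph.edges.count())
```
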
The `!ENV` flag indicates that an environment variable follows. For example, you can set:<br>`logs_folder: !ENV ${LOGS_FOLDER}`<br>
You can change the values and environment variable names as you wish.
If a yaml variable is renamed, added, or deleted, the corresponding changes should be reflected
in the [Configuration class](configuration/configuration.py) and the [yml_schema.json](configuration/yml_schema.json) too.
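
As an illustration, a minimal PyYAML constructor for such a tag (a sketch, not the project's exact implementation, which lives in [configuration/configuration.py](configuration/configuration.py)) could be:

```python
# Hedged sketch of resolving `!ENV ${VAR}` values while loading a yml file.
import os
import re
import yaml

ENV_PATTERN = re.compile(r"\$\{([^}]+)\}")

def env_constructor(loader, node):
    raw = loader.construct_scalar(node)
    for name in ENV_PATTERN.findall(raw):
        raw = raw.replace(f"${{{name}}}", os.environ.get(name, ""))
    return raw

yaml.SafeLoader.add_constructor("!ENV", env_constructor)

with open("confs/quakers.yml") as stream:
    config = yaml.load(stream, Loader=yaml.SafeLoader)
```
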
First, make sure you are in the created virtual environment:
```bash
$ source venv/bin/activate
(venv)
OneDrive/Projects/HGN dev
$ which python
/home/drkostas/Projects/HGN/venv/bin/python
(venv)
```

Now, in order to run the code you can either call `main.py` directly, or use the `hgn` console script.

```bash
$ python main.py -h
usage: main.py -c CONFIG_FILE [-d] [-h]

A Distributed Hybrid Community Detection Methodology for Social Networks.

Required Arguments:
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        The configuration yml file

Optional Arguments:
  -d, --debug           Enables the debug log messages
  -h, --help            Show this help message and exit

# Or

$ hgn --help
usage: hgn -c CONFIG_FILE [-d] [-h]

A Distributed Hybrid Community Detection Methodology for Social Networks.

Required Arguments:
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        The configuration yml file

Optional Arguments:
  -d, --debug           Enables the debug log messages
  -h, --help            Show this help message and exit
```
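
For example, a typical local run with debug logging would be `python main.py -c confs/quakers.yml -d` (or, equivalently, `hgn -c confs/quakers.yml -d`).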

## Deployment <a name = "deployment"></a>

It is recommended that you deploy the application to a Spark cluster; a minimal configuration sketch follows the list below.<br>Please see:
- [Spark Cluster Overview \[Apache Spark Docs\]](https://spark.apache.org/docs/latest/cluster-overview.html)
- [Apache Spark on Multi Node Cluster \[Medium\]](https://medium.com/ymedialabs-innovation/apache-spark-on-a-multi-node-cluster-b75967c8cb2b)
- [Databricks Cluster](https://docs.databricks.com/clusters/index.html)
- [Flintrock \[Cheap & Easy EC2 Cluster\]](https://github.com/nchammas/flintrock)
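
Whichever option you pick, in principle only the Spark-related settings in the yml config need to change. A hedged sketch (the master URL below is a placeholder for your own cluster):

```python
# Hedged sketch: point the session at a cluster master instead of local[*].
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://spark-master.example.com:7077")  # placeholder URL
         .config("spark.submit.deployMode", "client")
         .config("spark.executor.instances", "2")
         .getOrCreate())
```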


## Continuous Integration <a name = "ci"></a>

Continuous integration is handled by [CircleCI](https://www.circleci.com/) (see the badge at the top of this README).

## TODO <a name = "todo"></a>

Read the [TODO](TODO.md) to see the current task list.

## Built With <a name = "built_with"></a>

* [Apache Spark 2.4.5](https://spark.apache.org/docs/2.4.5/) - Fast and general-purpose cluster computing system
* [GraphFrames 0.8.0](https://github.com/graphframes/graphframes/tree/v0.8.0) - A package for Apache Spark which provides DataFrame-based Graphs.
* [CircleCI](https://www.circleci.com/) - Continuous Integration service


## License <a name = "license"></a>

This project is licensed under the GNU License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments <a name = "acknowledgments"></a>

* Thanks to PurpleBooth for the [README template](https://gist.github.com/PurpleBooth/109311bb0361f32d87a2)

2 changes: 1 addition & 1 deletion TODO.md
- [X] Add ColorLogging
- [X] Update requirements
- [X] Check that current unit tests are working
- [X] Fix README
- [ ] Create unit tests for the HGN code
- [ ] Add Metadata Class that stores step times into a database
2 changes: 1 addition & 1 deletion main.py
def _argparser() -> argparse.Namespace:
    """Setup the argument parser."""

    parser = argparse.ArgumentParser(
        description='A Distributed Hybrid Community Detection Methodology for Social Networks.',
        add_help=False)
    # Required Args
    required_arguments = parser.add_argument_group('Required Arguments')
6 changes: 3 additions & 3 deletions spark.env
export SPARK_HOME=
export PYSPARK_SUBMIT_ARGS="--packages graphframes:graphframes:0.8.0-spark2.4-s_2.11 pyspark-shell"
export JAVA_HOME=