Data, Data-Mining and Visualization for the RESTify experiment.
This repository hosts the sources and raw input data that allow replication of the empirical findings of the RESTify experiment. The data can be reproduced and inspected with a Jupyter Notebook instance, or, for more experienced users and collaborators, with a preconfigured PyCharm project.
To replicate our data analysis, you have four options:
- Inspect the rendered preview on GitHub, using only your browser.
  => You will see all figures of this paper, pre-rendered. However, you will not be able to modify or execute the notebook, and some internal links may not work.
- Deploy a local Jupyter Notebook as a preconfigured Docker container.
  => The fastest and simplest way to replicate our work and findings.
- Manually set up a local Jupyter Notebook.
  => Similar to the previous option, you can replicate the work and findings. The manual setup requires proficiency with Python installations.
- Manually run individual parts of the data analysis with the PyCharm IDE.
  => Full access to all implementation details. The preferred option for peer reviewers, software developers, and data scientists who want to investigate and understand our work. You can replicate our findings and additionally inspect the implementation, debug the code, verify the correctness of our implementation, and, if desired, build on top of it.
This repository hosts a Docker configuration that creates a containerized Jupyter Notebook instance with all runtime dependencies. The notebook allows you to locally replicate our methodology and all findings, together with in-depth explanations.
Instructions for Docker (MacOS / Linux host):
- Install Docker.
  (After install, test your setup with: `docker run hello-world`)
- Clone this repository:
  `git clone https://github.com/m5c/RestifyJupyter.git`
- Build and run the Jupyter Notebook container:
  `cd RestifyJupyter; ./docker-autostart.sh`
  (On Linux, you may need to prefix the `docker` command with `sudo`.)
- Access the Notebook: http://127.0.0.1:8889/notebooks/Restify.ipynb
If you see a notebook with all paper figures and stats, you have succeeded.
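If you prefer to verify the deployment from a script, the optional Python snippet below (not part of the repository) simply checks that the containerized notebook server answers on port 8889, matching the URL from the instructions above.

```python
# Optional sanity check, not part of the repository: confirm the containerized
# Jupyter server from the instructions above is reachable on port 8889.
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:8889", timeout=5) as response:
    # Any successful response (typically HTTP 200) means the container is up.
    print("Notebook server responded with HTTP status", response.status)
```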
This section explains how to run the Jupyter Notebook instance natively. For this to work, you must install all runtime dependencies. The steps below install the dependencies in a virtual environment, so your system-wide Python installation stays clutter-free.
- Install Python 3.9 or newer. Make sure the newly installed Python version is set as default. Verify with: `python --version`
- Go into the project and create a new virtual environment (a local Python folder with all dependencies):
  `cd RestifyJupyter`
  `python3 -m venv .env`
  `source .env/bin/activate`
- Install all required Python libraries, using the `pip3` package manager:
  `pip3 install pandas numpy matplotlib plotly scipy statsmodels seaborn jupyter`
  You can also install them all at once with `pip3 install -r requirements.txt`. (A short dependency sanity check is sketched after this list.)
- Start the Notebook:
  `jupyter notebook`
- Access the Notebook: http://localhost:8888/notebooks/Restify.ipynb
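To double-check that the virtual environment resolved all libraries (for example before starting the notebook), you can run a short import check like the sketch below. It is only an illustration and not part of the repository.

```python
# Illustrative dependency check, not part of the repository: run inside the
# activated virtual environment to confirm all analysis libraries import cleanly.
import importlib

for package in ("pandas", "numpy", "matplotlib", "plotly",
                "scipy", "statsmodels", "seaborn", "jupyter"):
    module = importlib.import_module(package)
    print(package, getattr(module, "__version__", "installed"))
```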
Complementary to replicating our results with the Jupyter Notebook, you can also directly execute the Python code used for data mining. This option provides in-depth access to implementation details and is intended for data scientists who want to either:
- Validate the correctness of our extracted data at the code level.
- Enrich the data analysis we implemented with additional insights.
All runtime dependencies, including Python itself, can be installed directly from PyCharm; however, it is important that the IDE is configured to use the correct interpreter.
- Install PyCharm. The free Community Edition is sufficient.
- Install the `python3` interpreter. You find a corresponding option in the `PyCharm -> Settings` menu.
- Install all required libraries. Open the `PyCharm -> Settings -> Project -> Interpreter` menu, click the `+` sign, then install everything listed in `requirements.txt`.
- Install PyLint. Open the plugins menu: `PyCharm -> Settings -> Plugins`.
- Configure PyLint to use the root `.pylintrc` config file, so it correctly resolves imports.
- Select the desired run configuration to replicate any of our results:
  - For every code cell of the Notebook, there is a corresponding preconfigured run configuration.
  - We recommend running the `run_all_pseudo_cell.py` script, which recreates all statistical figures and listings from the paper (a command-line sketch follows below).
- Inputs: The Notebook works on the CSV data stored in `source-csv-files`. It is the same data as provided in our replication bundle.
- Outputs:
  - Figures are generated to `generated-plots`.
  - Intermediate CSV files are generated to `generated-csv-files`.
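If you prefer the command line over PyCharm's run configurations, the sketch below shows one way to invoke the all-in-one script and then list what appeared in the output folders. It assumes the script can be launched standalone from the repository root with the virtual environment active; this is our assumption, not a documented entry point.

```python
# Hypothetical command-line alternative to the PyCharm run configurations:
# launch the all-in-one script, then count the artefacts in the output folders.
# Assumes execution from the repository root with the virtual environment active.
import pathlib
import subprocess

subprocess.run(["python3", "run_all_pseudo_cell.py"], check=True)

for folder in ("generated-plots", "generated-csv-files"):
    files = sorted(pathlib.Path(folder).glob("*"))
    print(f"{folder}: {len(files)} generated files")
```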
This section is only relevant for data analysts who want to tweak the notebook output / visualization, or reuse part of the codebase for similar project layouts.
For scatter plots and scatter series you can easily change how samples are annotated: just pass a different `LabelMaker` at the moment of scatter instantiation. `LabelMaker`s are defined in `restify_mining/scatter_plotters/extractors`. If you wish to annotate only selected dots, edit the `labeloverride.csv` and use a custom `LabelMaker`. A minimal illustrative sketch follows the list of label makers below.
- To remove all labels, use the `EmptyLabelMaker`.
- To annotate full codenames (colour + animal), use the `FullLabelMaker`.
- To annotate group-internal codenames (only animal), use the `AnimalLabelMaker`.
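The sketch below only illustrates the idea of exchanging label strategies when a scatter plot is instantiated. The tiny stand-in classes and the plotting helper are hypothetical; the real `EmptyLabelMaker`, `FullLabelMaker` and `AnimalLabelMaker` implementations, with their actual constructor signatures, live in `restify_mining/scatter_plotters/extractors` and should be consulted for the real API.

```python
# Illustrative only: the real label makers live in restify_mining/scatter_plotters/extractors.
# These stand-ins and the plot helper are hypothetical and only mirror the idea of
# swapping the annotation strategy at scatter construction time.
import matplotlib.pyplot as plt


class FullLabelMakerStub:
    """Stand-in: annotate every sample with its full codename (colour + animal)."""

    def make_labels(self, samples):
        return [f"{s['colour']}-{s['animal']}" for s in samples]


class EmptyLabelMakerStub:
    """Stand-in: suppress all annotations."""

    def make_labels(self, samples):
        return ["" for _ in samples]


def plot_scatter(samples, label_maker):
    """Hypothetical scatter helper: the injected label maker decides each dot's annotation."""
    xs = [s["x"] for s in samples]
    ys = [s["y"] for s in samples]
    plt.scatter(xs, ys)
    for x, y, label in zip(xs, ys, label_maker.make_labels(samples)):
        plt.annotate(label, (x, y))
    plt.show()


samples = [{"x": 1.0, "y": 2.0, "colour": "red", "animal": "fox"},
           {"x": 3.0, "y": 4.0, "colour": "blue", "animal": "owl"}]
plot_scatter(samples, FullLabelMakerStub())  # swap in EmptyLabelMakerStub() to drop all labels
```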
This software is released under the open source MIT License.
- Principal Investigator: Maximilian Schiedermeier
- Academic Supervisors: Bettina Kemme, Jorg Kienzle
- Implementation: Maximilian Schiedermeier
- Research Ethics Board Advisor: Lynda McNeil