Pectus Finance challenge - Survey data analysis

How it is structured

This challenge was approached as an exploration effort in Jupyter Notebooks because of time constraints, so there is no "main" function or file that starts the application. Running the notebooks in the following order is advisable, both because the Markdown explanations in each notebook build on the previous ones and because intermediate Parquet files are created during the exploration:

data_1.ipynb
data_2.ipynb
data_3.ipynb
analyses.ipynb -- This must be run last because it depends on the Parquet files.
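
The order above can also be scripted. The following is a minimal sketch and not part of the repository: it assumes the `jupyter nbconvert` command is available on the PATH, and the helper names `ordered_notebooks`, `build_command` and `run_all` are hypothetical. Rather than hard-coding file names, it runs every `data_*.ipynb` notebook in name order and leaves any other notebook (the final analysis one) for last.

```python
import subprocess
from pathlib import Path


def ordered_notebooks(folder="."):
    """Return data_*.ipynb notebooks in name order, then any other notebooks."""
    notebooks = sorted(p.name for p in Path(folder).glob("*.ipynb"))
    data = [n for n in notebooks if n.startswith("data_")]
    rest = [n for n in notebooks if not n.startswith("data_")]
    return data + rest


def build_command(notebook):
    """Build the nbconvert call that executes one notebook and saves it in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook,
    ]


def run_all(folder="."):
    """Execute the notebooks one after another, stopping at the first failure."""
    for notebook in ordered_notebooks(folder):
        subprocess.run(build_command(notebook), cwd=folder, check=True)
```

Calling `run_all()` from the repository root would execute the notebooks sequentially. Reading the Markdown cells in each notebook is still the intended way to follow the reasoning.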

How to set up the environment

PySpark and Spark SQL were used to manipulate the data much like a regular database, so some external dependencies are needed before running the notebooks (besides installing Jupyter itself):

Install Apache Spark. The version used was 3.2.1, downloadable here.
Set the SPARK_HOME environment variable to the path of the unzipped Spark folder.
[If on Windows] Download the following winutils.exe package and set the HADOOP_HOME environment variable to its path.

Dependencies were managed with Poetry, but Poetry is not a hard requirement for running these notebooks, since the project depends only on the pyspark==3.2.1 and findspark==2.0.1 packages.
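
As a quick way to confirm the setup before opening the notebooks, the sketch below checks the environment variables described above and then starts a local Spark session through findspark. It is an illustration under assumptions, not code taken from the notebooks: the helper names `missing_spark_vars` and `create_session`, the `local[*]` master and the application name are all hypothetical choices.

```python
import os


def missing_spark_vars(env=None, on_windows=None):
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    on_windows = (os.name == "nt") if on_windows is None else on_windows
    # HADOOP_HOME is only needed on Windows (for winutils.exe).
    required = ["SPARK_HOME"] + (["HADOOP_HOME"] if on_windows else [])
    return [name for name in required if not env.get(name)]


def create_session(app_name="survey-analysis"):
    """Start a local SparkSession after locating Spark through SPARK_HOME."""
    missing = missing_spark_vars()
    if missing:
        raise RuntimeError(
            "Set these environment variables first: " + ", ".join(missing)
        )

    # Imported lazily so the check above works even without the packages.
    import findspark

    findspark.init()  # makes the pyspark under SPARK_HOME importable

    from pyspark.sql import SparkSession

    # "local[*]" is an assumed master: run Spark locally on all cores.
    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()
```

If `create_session()` returns without raising, the environment should be ready for the notebooks.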

Some design decisions

As mentioned, the project is structured as a collection of notebooks, but only because of the time frame (see WISHLIST.md for a production-ready checklist for this project).
It was split into separate notebooks because of the extensive exploration and corner-case analysis required.
The cleaned data was persisted in Apache Parquet format to improve performance in the final notebook, analyses.ipynb.
The decisions behind every transformation, query and assumption are written as Markdown cells in each notebook. Again, read the notebooks in order to understand them fully, because similar logic is not re-explained in every notebook.
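
The Parquet hand-off between the cleaning notebooks and the final one could look roughly like the sketch below. It is an illustration only: the `output` folder, the dataset names and the helpers `parquet_path`, `save_cleaned` and `load_as_view` are assumptions, and the actual paths and names are the ones defined in the notebooks.

```python
from pathlib import Path


def parquet_path(name, base_dir="output"):
    """Build the path for one intermediate dataset (hypothetical layout)."""
    return str(Path(base_dir) / f"{name}.parquet")


def save_cleaned(df, name, base_dir="output"):
    """Persist a cleaned Spark DataFrame, replacing a previous run's output."""
    df.write.mode("overwrite").parquet(parquet_path(name, base_dir))


def load_as_view(spark, name, base_dir="output"):
    """Read a persisted dataset back and register it for Spark SQL as `name`."""
    df = spark.read.parquet(parquet_path(name, base_dir))
    df.createOrReplaceTempView(name)
    return df
```

With a running session `spark`, a cleaning notebook would call something like `save_cleaned(cleaned_df, "survey_1")`, and the final notebook would call `load_as_view(spark, "survey_1")` before querying it with `spark.sql("SELECT COUNT(*) FROM survey_1")`. Parquet keeps the schema and stores data by column, so the final notebook does not need to re-parse or re-clean the raw files.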

Feedback/Notes

This is a really nice challenge, both for its difficulty and for the room it leaves for improvement, such as consuming APIs or using more robust libraries or packages to, for example, analyze text and extract more data.

jvaesteves/employee_survey_data_analysis
