diff --git a/.gitignore b/.gitignore index f30d07c7..cf752a94 100644 --- a/.gitignore +++ b/.gitignore @@ -7,7 +7,7 @@ __pycache__/ .ipynb_checkpoints */.ipynb_checkpoints/* -# 3W toolkit documentation +# 3W Toolkit documentation html/ *.html diff --git a/3W_DATASET_STRUCTURE.md b/3W_DATASET_STRUCTURE.md index f36aac0c..a5dab89a 100644 --- a/3W_DATASET_STRUCTURE.md +++ b/3W_DATASET_STRUCTURE.md @@ -1,6 +1,6 @@ -The 3W dataset consists of multiple CSV files saved in the [dataset](dataset) directory and structured as follows. +The 3W Dataset consists of multiple CSV files saved in the [dataset](dataset) directory and structured as follows. There are two types of subdirectory: -* The [folds](dataset/folds) subdirectory holds all 3W dataset configuration files. For each specific project released in the 3W project there will be a file that will specify how and which data must be loaded for training and testing in multiple folds of experimentation. This scheme allows implementation of cross validation and hyperparameter optimization by the 3W toolkit users. In addition, this scheme allows the user to choose some specific characteristics to the desired experiment. For example: whether or not simulated and/or hand-drawn intances should be considered in the training set. It is important to clarify that specifying which instances make up which folds will always be random but fixed in each configuration file. This is considered necessary so that results obtained for the same problem with different approaches can be compared; -* The other subdirectories holds all 3W dataset data files. The subdirectory names are the instances' labels. Each file represents one instance. The filename reveals its source. All files are standardized as follow. There are one observation per line and one series per column. Columns are separated by commas and decimals are separated by periods. 
The first column contains timestamps, the last one reveals the observations' labels, and the other columns are the Multivariate Time Series (MTS) (i.e. the instance itself). \ No newline at end of file +* The [folds](dataset/folds) subdirectory holds all 3W Dataset configuration files. For each specific project released in the 3W Project there will be a file that will specify how and which data must be loaded for training and testing in multiple folds of experimentation. This scheme allows implementation of cross validation and hyperparameter optimization by the 3W Toolkit users. In addition, this scheme allows the user to choose some specific characteristics to the desired experiment. For example: whether or not simulated and/or hand-drawn instances should be considered in the training set. It is important to clarify that specifying which instances make up which folds will always be random but fixed in each configuration file. This is considered necessary so that results obtained for the same problem with different approaches can be compared; +* The other subdirectories hold all 3W Dataset data files. The subdirectory names are the instances' labels. Each file represents one instance. The filename reveals its source. All files are standardized as follows. There is one observation per line and one series per column. Columns are separated by commas and decimals are separated by periods. The first column contains timestamps, the last one reveals the observations' labels, and the other columns are the Multivariate Time Series (MTS) (i.e. the instance itself). 
\ No newline at end of file diff --git a/3W_TOOLKIT_STRUCTURE.md b/3W_TOOLKIT_STRUCTURE.md index dfacf9de..20ece6c1 100644 --- a/3W_TOOLKIT_STRUCTURE.md +++ b/3W_TOOLKIT_STRUCTURE.md @@ -1,4 +1,4 @@ -The 3W toolkit is a software package written in Python 3 structured in the following sub-modules: +The 3W Toolkit is a software package written in Python 3 structured in the following sub-modules: * **base**: groups the objects used by the other sub-modules; * **dev**: has all the resources related to development of Machine diff --git a/BACKLOG.md b/BACKLOG.md index 6c0706ff..290ba5c0 100644 --- a/BACKLOG.md +++ b/BACKLOG.md @@ -1,20 +1,20 @@ -The list of priority improvements for the 3W project that we intend to develop collaboratively with the community is detailed below. +The list of priority improvements for the 3W Project that we intend to develop collaboratively with the community is detailed below. -* Extend the 3W dataset with more instances of new event types; -* Finalize incorporation of MAIS into the 3W toolkit; +* Extend the 3W Dataset with more instances of new event types; +* Finalize incorporation of MAIS into the 3W Toolkit; * Evaluate and if appropriate start using [Git LFS](https://git-lfs.com/); * Configure other GitHub resources that may be useful for our development. What resources exactly? * Incorporate and provide in this repository documentation automatically generated from docstrings. How exactly? * Review strategy for generating `folds_clf_XX.csv`; * Review strategy for virtual environment specification (`environment.yml`); * Develop a `setup.py`. Is this module interesting for our project? 
-* Develop tool to generate `diff` between versions of the 3W dataset -* Improve presentation of the [3W dataset citation list](LIST_OF_CITATIONS.md); +* Develop tool to generate `diff` between versions of the 3W Dataset +* Improve presentation of the [3W Dataset citation list](LIST_OF_CITATIONS.md); * Develop unit tests for the main methods and functions; * Set up action for automatic execution of unit tests after creating PRs; * Establish coding guidelines. Which one? * Reevaluate the use of the [rolling_window.py](toolkit/rolling_window.py). Is there a better option or a newer version? * Evaluate inclusion of specific features for hyperparameter optimization; * Assess feasibility and benefits of using [Sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html); -* Evaluate the use of [Docker](https://www.docker.com/) to facilitate the use of the 3W toolkit and the approval of contributions; +* Evaluate the use of [Docker](https://www.docker.com/) to facilitate the use of the 3W Toolkit and the approval of contributions; * Establish one or more time-related metrics for anomaly detection and classification. \ No newline at end of file diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index a4002fd6..15b718d7 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -12,9 +12,9 @@ [semver]: https://semver.org [semver-shield]: https://img.shields.io/badge/semver-2.0.0-blue -# Welcome to the 3W project contributing guide +# Welcome to the 3W Project contributing guide -:+1::tada::sparkles: Thank you for investing your time in contributing to the 3W project! :sparkles::tada::+1: +:+1::tada::sparkles: Thank you for investing your time in contributing to the 3W Project! :sparkles::tada::+1: We expect to receive various types of contributions from individuals, research institutions, startups, companies and partner oil operators. 
@@ -26,8 +26,8 @@ In this guide we present how you can propose each type of contributions that we * [Making questions](#making-questions) * [Before contributing](#before-contributing) * [Levels for contributions](#levels-for-contributions) - * [3W dataset's structure](#3w-datasets-structure) - * [3W toolkit's structure](#3w-toolkits-structure) + * [3W Dataset's structure](#3w-datasets-structure) + * [3W Toolkit's structure](#3w-toolkits-structure) * [Executing examples](#executing-examples) * [Proposing contributions](#proposing-contributions) * [Citation](#citation) @@ -35,14 +35,14 @@ In this guide we present how you can propose each type of contributions that we * [Documentation improvements](#documentation-improvements) * [Cosmetic improvements](#cosmetic-improvements) * [Other improvements](#other-improvements) - * [New 3W dataset's overviews](#new-3w-datasets-overviews) + * [New 3W Dataset's overviews](#new-3w-datasets-overviews) * [New approaches and algorithms](#new-approaches-and-algorithms) * [Additional requirements](#additional-requirements) * [Backlog](#backlog) # Getting started -The recommended first step is to read this [README](README.md) for an overview of the 3W project. +The recommended first step is to read this [README](README.md) for an overview of the 3W Project. # Making questions @@ -60,7 +60,7 @@ It is also very important to know, participate and follow the discussions. Click ## Levels for contributions -We expect to receive contributions at different levels, as shown in the figure below. Objects with background in yellow indicate types of contributions enabled by the 3W project current version. The other objects above the 3W project indicate types of contributions that will be enabled in the next versions. Some examples of contributions at each level are: +We expect to receive contributions at different levels, as shown in the figure below. 
Objects with background in yellow indicate types of contributions enabled by the 3W Project current version. The other objects above the 3W Project indicate types of contributions that will be enabled in the next versions. Some examples of contributions at each level are: * Level 1: * You can identify and report issues with data or annotations; @@ -81,17 +81,17 @@ We expect to receive contributions at different levels, as shown in the figure b ![Levels for contributions](images/levels_for_contributions.png) -## 3W dataset's structure +## 3W Dataset's structure -At level 1, the 3W dataset consists of all CSV files in the subdirectories of the [dataset](dataset) directory and structured as detailed [here](3W_DATASET_STRUCTURE.md). +At level 1, the 3W Dataset consists of all CSV files in the subdirectories of the [dataset](dataset) directory and structured as detailed [here](3W_DATASET_STRUCTURE.md). -## 3W toolkit's structure +## 3W Toolkit's structure -At level 2, the 3W toolkit is implemented in sub-modules as discribed [here](3W_TOOLKIT_STRUCTURE.md). +At level 2, the 3W Toolkit is implemented in sub-modules as described [here](3W_TOOLKIT_STRUCTURE.md). ## Executing examples -To execute examples of how to use the 3W toolkit available in this repository, see the instructions related to [reproducibility](README.md#reproducibility). +To execute examples of how to use the 3W Toolkit available in this repository, see the instructions related to [reproducibility](README.md#reproducibility). # Proposing contributions @@ -101,7 +101,7 @@ For each type of expected contribution, there is a subsection below with specifi ## Citation -As far as we know, the 3W dataset was useful and cited by the works listed [here](LIST_OF_CITATIONS.md). If you know any other paper, master's degree dissertation or doctoral thesis that cites the 3W dataset, we will be grateful if you let us know by commenting [this](https://github.com/Petrobras/3W/discussions/3) **discussion**. 
If you use any resource published in this repository, we ask that it be properly cited in your work. Click on the ***Cite this repository*** link on this repository landing page to access different citation formats supported by the GitHub citation feature. +As far as we know, the 3W Dataset was useful and cited by the works listed [here](LIST_OF_CITATIONS.md). If you know any other paper, master's degree dissertation or doctoral thesis that cites the 3W Dataset, we will be grateful if you let us know by commenting [this](https://github.com/Petrobras/3W/discussions/3) **discussion**. If you use any resource published in this repository, we ask that it be properly cited in your work. Click on the ***Cite this repository*** link on this repository landing page to access different citation formats supported by the GitHub citation feature. ## Bugs @@ -115,21 +115,21 @@ It is important to keep in mind that this toolkit's documentation is generated i ## Cosmetic improvements -Changes that are cosmetic in nature and do not add anything substantial to the stability, functionality, or testability of the 3W project are also welcome. In this case, please create a **pull requests** on a branch called `cosmetic_improvements` directly. +Changes that are cosmetic in nature and do not add anything substantial to the stability, functionality, or testability of the 3W Project are also welcome. In this case, please create a **pull requests** on a branch called `cosmetic_improvements` directly. ## Other improvements -If you intend to work and propose a more significant improvement, please consult our [backlog](BACKLOG.md) first. If you have any questions about the most aligned strategy for the 3W project, please consult or create **discussions**. When your improvement is ready, please create a **pull request** on a branch called `other_improvements`. +If you intend to work and propose a more significant improvement, please consult our [backlog](BACKLOG.md) first. 
If you have any questions about the most aligned strategy for the 3W Project, please consult or create **discussions**. When your improvement is ready, please create a **pull request** on a branch called `other_improvements`. It is important to keep in mind that all source code is implemented according to the style guide established by [PEP 8](https://peps.python.org/pep-0008/). This is guaranteed with the use of the [Black formatter](https://github.com/psf/black) with default options. Therefore, while codes have lines up to 88 characters (Black formatter's default option), each line with docstring or comment must be up to 72 characters long as established in PEP 8. -## New 3W dataset's overviews +## New 3W Dataset's overviews -Visualization is one of the most important steps in this type of project. Therefore, you can propose [Jupyter Notebooks](https://jupyter.org/) with different views. For this, submit a **pull request** on a branch called `new_3w_datasets_overviews` with a file named `overviews\[your_name_here]\main.ipynb` that you've developed. If we like your overview, your file could be listed in this repository as a 3W toolkit's example of use. +Visualization is one of the most important steps in this type of project. Therefore, you can propose [Jupyter Notebooks](https://jupyter.org/) with different views. For this, submit a **pull request** on a branch called `new_3w_datasets_overviews` with a file named `overviews\[your_name_here]\main.ipynb` that you've developed. If we like your overview, your file could be listed in this repository as a 3W Toolkit's example of use. ## New approaches and algorithms -Would you like to share in this repository as 3W toolkit's examples of use approaches and algorithms for already incorporated problems? 
The procedure for this is to submit a **pull request** on a branch called `new_approaches_and_algorithms` with [Jupyter Notebooks](https://jupyter.org/) that you've developed in the directory corresponding to the chosen problem. +Would you like to share in this repository as 3W Toolkit's examples of use approaches and algorithms for already incorporated problems? The procedure for this is to submit a **pull request** on a branch called `new_approaches_and_algorithms` with [Jupyter Notebooks](https://jupyter.org/) that you've developed in the directory corresponding to the chosen problem. Specific problems will be incorporated into this project gradually. At this point, we can work on: @@ -144,4 +144,4 @@ Here are additional requirements for contributions to be incorporated into this # Backlog -The list of priority improvements for the 3W project that we intend to develop collaboratively with the community is detailed in the file [BACKLOG.md](BACKLOG.md). \ No newline at end of file +The list of priority improvements for the 3W Project that we intend to develop collaboratively with the community is detailed in the file [BACKLOG.md](BACKLOG.md). \ No newline at end of file diff --git a/LIST_OF_CITATIONS.md b/LIST_OF_CITATIONS.md index 1e7f77e9..1db8ccfe 100644 --- a/LIST_OF_CITATIONS.md +++ b/LIST_OF_CITATIONS.md @@ -1,4 +1,4 @@ -As far as we know, the 3W dataset was useful and cited by the works listed below. If you know any other paper, final graduation project, master's degree dissertation or doctoral thesis that cites the 3W dataset, we will be grateful if you let us know by commenting [this](https://github.com/Petrobras/3W/discussions/3) discussion. If you use any resource published in this repository, we ask that it be properly cited in your work. Click on the ***Cite this repository*** link on this repository landing page to access different citation formats supported by the GitHub citation feature. 
+As far as we know, the 3W Dataset was useful and cited by the works listed below. If you know any other paper, final graduation project, master's degree dissertation or doctoral thesis that cites the 3W Dataset, we will be grateful if you let us know by commenting [this](https://github.com/Petrobras/3W/discussions/3) discussion. If you use any resource published in this repository, we ask that it be properly cited in your work. Click on the ***Cite this repository*** link on this repository landing page to access different citation formats supported by the GitHub citation feature. 1. R.E.V. Vargas, C.J. Munaro, P.M. Ciarelli. A methodology for generating datasets for development of anomaly detectors in oil wells based on Artificial Intelligence techniques. I Congresso Brasileiro em Engenharia de Sistemas em Processos. 2019. https://www.ufrgs.br/psebr/wp-content/uploads/2019/04/Abstract_A019_Vargas.pdf. diff --git a/README.md b/README.md index c76d86a0..57821561 100644 --- a/README.md +++ b/README.md @@ -23,10 +23,10 @@ * [Licenses](#licenses) * [Versioning](#versioning) * [Questions](#questions) -* [3W dataset](#3w-dataset) +* [3W Dataset](#3w-dataset) * [Structure](#structure) * [Overview](#overview) -* [3W toolkit](#3w-toolkit) +* [3W Toolkit](#3w-toolkit) * [Structure](#structure-1) * [Incorporated Problems](#incorporated-problems) * [Examples of Use](#examples-of-use) @@ -34,9 +34,9 @@ # Introduction -This is the first repository published by Petrobras on GitHub. It supports the 3W project, which aims to promote experimentation and development of Machine Learning-based approaches and algorithms for specific problems related to detection and classification of undesirable events that occur in offshore oil wells. +This is the first repository published by Petrobras on GitHub. 
It supports the 3W Project, which aims to promote experimentation and development of Machine Learning-based approaches and algorithms for specific problems related to detection and classification of undesirable events that occur in offshore oil wells. -The 3W project is based on the 3W dataset, a database described in [this paper](https://doi.org/10.1016/j.petrol.2019.106223), and on the 3W toolkit, a software package that promotes experimentation with the 3W dataset for specific problems. The name **3W** was chosen because this dataset is composed of instances from ***3*** different sources and which contain undesirable events that occur in oil ***W***ells. +The 3W Project is based on the 3W Dataset, a database described in [this paper](https://doi.org/10.1016/j.petrol.2019.106223), and on the 3W Toolkit, a software package that promotes experimentation with the 3W Dataset for specific problems. The name **3W** was chosen because this dataset is composed of instances from ***3*** different sources and which contain undesirable events that occur in oil ***W***ells. ## Motivation @@ -51,23 +51,23 @@ Creating a dataset and making it public to be openly experienced can greatly fom The 3W is the first pilot of a Petrobras' program called [Conexões para Inovação - Módulo Open Lab](https://tecnologia.petrobras.com.br/modulo-open-lab). This pilot is an ***open project*** composed by two major resources: -* The [3W dataset](#3w-dataset), which will be evolved and supplemented with more instances from time to time; -* The [3W toolkit](#3w-toolkit), which will also be evolved (in many ways) to cover an increasing number of undesirable events during its development. +* The [3W Dataset](#3w-dataset), which will be evolved and supplemented with more instances from time to time; +* The [3W Toolkit](#3w-toolkit), which will also be evolved (in many ways) to cover an increasing number of undesirable events during its development. 
-Therefore, our strategy is to make these resources publicly available so that we can develop the 3W project with a global community collaboratively. +Therefore, our strategy is to make these resources publicly available so that we can develop the 3W Project with a global community collaboratively. ## Ambition With this project, Petrobras intends to develop (fix, improve, supplement, etc.): -* The [3W dataset](#3w-dataset) itself; -* The [3W toolkit](#3w-toolkit) itself; +* The [3W Dataset](#3w-dataset) itself; +* The [3W Toolkit](#3w-toolkit) itself; * Approaches and algorithms that can be incorporated into systems dedicated to monitoring undesirable events in offshore oil wells during their respective drilling, completion and production phases; * Tools that can be useful for our ambition. ## Governance -The 3W project was conceived and publicly launched on May 30, 2022 as a strategic action by Petrobras, led by its department responsible for Flow Assurance and its research center ([CENPES](https://www.petrobras.com.br/inovacao-e-tecnologia/centro-de-pesquisa)). Since then, 3W has become increasingly consolidated at Petrobras in several aspects: more professionals specialized in labeling instances, more projects and teams using the resources made available by 3W, more investment in developing the digital tools needed to label and export instances, more interest in including different types of undesirable events that occur in wells during the drilling, completion and production phases, etc. +The 3W Project was conceived and publicly launched on May 30, 2022 as a strategic action by Petrobras, led by its department responsible for Flow Assurance and its research center ([CENPES](https://www.petrobras.com.br/inovacao-e-tecnologia/centro-de-pesquisa)). 
Since then, 3W has become increasingly consolidated at Petrobras in several aspects: more professionals specialized in labeling instances, more projects and teams using the resources made available by 3W, more investment in developing the digital tools needed to label and export instances, more interest in including different types of undesirable events that occur in wells during the drilling, completion and production phases, etc. Due to this evolution, from May 1st, 2024 the 3W's governance is now done with the participation of the Petrobras' department responsible for Well Integrity. @@ -85,19 +85,19 @@ It is also very important to know, participate and follow the discussions. See t ## Licenses -All the code of this project is licensed under the [Apache 2.0 License][apache] and all 3W dataset data files (CSV files in the subdirectories of the [dataset](dataset) directory) are licensed under the [Creative Commons Attribution 4.0 International License][cc-by]. +All the code of this project is licensed under the [Apache 2.0 License][apache] and all 3W Dataset data files (CSV files in the subdirectories of the [dataset](dataset) directory) are licensed under the [Creative Commons Attribution 4.0 International License][cc-by]. ## Versioning -In the 3W project, three types of versions will be managed as follows. +In the 3W Project, three types of versions will be managed as follows. 
-* Version of the 3W toolkit: specified in the [__init__.py](toolkit/__init__.py) file; -* Version of the 3W dataset: specified in the [dataset.ini](dataset/dataset.ini) file; -* Version of the 3W project: specified with tags in the git repository; +* Version of the 3W Toolkit: specified in the [__init__.py](toolkit/__init__.py) file; +* Version of the 3W Dataset: specified in the [dataset.ini](dataset/dataset.ini) file; +* Version of the 3W Project: specified with tags in the git repository; * We will exclusively use the semantic versioning defined in https://semver.org; * Versions will always be updated manually; -* Versioning of the 3W toolkit and 3W dataset are completely independent of each other; -* The version of the 3W project will be updated whenever, and only when, there is a new commit in the `main` branch of the repository, regardless of the updated resource: 3W toolkit, 3W dataset, project documentation, example of use, etc; +* Versioning of the 3W Toolkit and 3W Dataset are completely independent of each other; +* The version of the 3W Project will be updated whenever, and only when, there is a new commit in the `main` branch of the repository, regardless of the updated resource: 3W Toolkit, 3W Dataset, 3W Project's documentation, example of use, etc; * We will only use annotated tags and for each tag there will be a release in the remote repository (GitHub); * Content for each release will be automatically generated with functionality provided by GitHub. @@ -105,23 +105,23 @@ In the 3W project, three types of versions will be managed as follows. See the discussions section. If you don't get clarification, please open discussions to ask your questions so we can answer them. 
-# 3W dataset +# 3W Dataset To the best of its authors' knowledge, this is the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark dataset for development of machine learning techniques related to inherent difficulties of actual data. For more information about the theory behind this dataset, refer to the paper **A realistic and public dataset with rare undesirable real events in oil wells** published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223)). ## Structure -The 3W dataset consists of all CSV files in the subdirectories of the [dataset](dataset) directory and structured as detailed [here](3W_DATASET_STRUCTURE.md). +The 3W Dataset consists of all CSV files in the subdirectories of the [dataset](dataset) directory and structured as detailed [here](3W_DATASET_STRUCTURE.md). ## Overview -A 3W dataset's general presentation with some quantities and statistics is available in [this](overviews/_baseline/main.ipynb) Jupyter Notebook. +A 3W Dataset's general presentation with some quantities and statistics is available in [this](overviews/_baseline/main.ipynb) Jupyter Notebook. -# 3W toolkit +# 3W Toolkit -The 3W toolkit is a software package written in Python 3 that contains resources that make the following easier: +The 3W Toolkit is a software package written in Python 3 that contains resources that make the following easier: -* [3W dataset](#3w-dataset) overview generation; +* [3W Dataset](#3w-dataset) overview generation; * Experimentation and comparative analysis of Machine Learning-based approaches and algorithms for specific problems related to undesirable events that occur in offshore oil wells during their respective drilling, completion and production phases; * Standardization of key points of the Machine Learning-based algorithm development pipeline. 
@@ -129,7 +129,7 @@ It is important to note that there are arbitrary choices in this toolkit, but th ## Structure -The 3W toolkit is implemented in sub-modules as discribed [here](3W_TOOLKIT_STRUCTURE.md). +The 3W Toolkit is implemented in sub-modules as described [here](3W_TOOLKIT_STRUCTURE.md). ## Incorporated Problems @@ -141,9 +141,9 @@ All specification is detailed in the [CONTRIBUTING GUIDE](CONTRIBUTING.md). ## Examples of Use -The list below with examples of how to use the 3W toolkit will be incremented throughout its development. +The list below with examples of how to use the 3W Toolkit will be incremented throughout its development. -* 3W dataset's overviews: +* 3W Dataset's overviews: * [Baseline](overviews/_baseline/main.ipynb) * [André Machado's overview](overviews/AndreMachado/main.ipynb) * Binary classifier of Spurious Closure of DHSV: @@ -153,7 +153,7 @@ For a contribution of yours to be listed here, follow the instructions detailed ## Reproducibility -For all results generated by the 3W toolkit to be consistent, we recommend you create and use a virtual environment with the packages versions specified in the [environment.yml](environment.yml), which was generated with [conda](https://docs.conda.io). Our current recommendation is to use the conda distributed by [Miniforge](https://conda-forge.org/download/). Download and install Miniforge according to the official instructions. Open a prompt on your operating system (Windows, Linux or MacOS). Make sure the current directory is the directory where you have the 3W. Run the following commands as needed: +For all results generated by the 3W Toolkit to be consistent, we recommend you create and use a virtual environment with the package versions specified in the [environment.yml](environment.yml), which was generated with [conda](https://docs.conda.io). Our current recommendation is to use the conda distributed by [Miniforge](https://conda-forge.org/download/). 
Download and install Miniforge according to the official instructions. Open a prompt on your operating system (Windows, Linux or MacOS). Make sure the current directory is the directory where you have the 3W. Run the following commands as needed: * To create a virtual environment from our [environment.yml](environment.yml): ``` @@ -163,7 +163,7 @@ $ conda env create -f environment.yml ``` $ conda activate 3W ``` -* To use the 3W toolkit resources interactively: +* To use the 3W Toolkit resources interactively: ``` $ python ``` diff --git a/dataset/README.md b/dataset/README.md index 854c015b..34717b66 100644 --- a/dataset/README.md +++ b/dataset/README.md @@ -16,11 +16,11 @@ # Introduction -All 3W dataset data files (CSV files in the subdirectories of the [dataset](dataset) directory) are licensed under the [Creative Commons Attribution 4.0 International License][cc-by]. +All 3W Dataset data files (CSV files in the subdirectories of the [dataset](dataset) directory) are licensed under the [Creative Commons Attribution 4.0 International License][cc-by]. # Release Notes -Each subsection below contains release notes for a specific 3W dataset version. Differences from the immediately previous version are highlighted. +Each subsection below contains release notes for a specific 3W Dataset version. Differences from the immediately previous version are highlighted. ## 1.0.0 @@ -43,12 +43,12 @@ Highlights: 1. Normal periods of certain instances with anomalies have been increased as possible. We tried to have instances with minimum normal periods of 1 hour; 1. Names of certain files with instances have changed due to increased normal periods; 1. Periods of certain instances have been relabeled; -1. Time series of certain variables were added because these variables were contextualized after the previous version of the 3W dataset was created; -1. 
Time series of certain variables were removed because these variables were decontextualized after the previous version of the 3W dataset was created; -1. Time series of certain variables have been completely changed due to these variables having been recontextualized after the creation of the previous version of the 3W dataset; +1. Time series of certain variables were added because these variables were contextualized after the previous version of the 3W Dataset was created; +1. Time series of certain variables were removed because these variables were decontextualized after the previous version of the 3W Dataset was created; +1. Time series of certain variables have been completely changed due to these variables having been recontextualized after the creation of the previous version of the 3W Dataset; 1. Certain variable values ​​have undergone minimal change due to different rounding; -1. The 3W dataset's main configuration file ([dataset.ini](dataset.ini)) has been updated; -1. The Jupyter Notebook with the [3W dataset's baseline general presentation](../overviews/_baseline/main.ipynb) has been updated. +1. The 3W Dataset's main configuration file ([dataset.ini](dataset.ini)) has been updated; +1. The Jupyter Notebook with the [3W Dataset's baseline general presentation](../overviews/_baseline/main.ipynb) has been updated. ## 1.1.1 @@ -59,5 +59,5 @@ Highlights: 1. Issue #60 has been resolved; 1. Issue #65 has been resolved; 1. Certain variable values ​​have undergone minimal change due to different rounding; -1. The 3W dataset's main configuration file ([dataset.ini](dataset.ini)) has been updated; -1. The Jupyter Notebook with the [3W dataset's baseline general presentation](../overviews/_baseline/main.ipynb) has been updated. \ No newline at end of file +1. The 3W Dataset's main configuration file ([dataset.ini](dataset.ini)) has been updated; +1. 
The Jupyter Notebook with the [3W Dataset's baseline general presentation](../overviews/_baseline/main.ipynb) has been updated. \ No newline at end of file diff --git a/dataset/dataset.ini b/dataset/dataset.ini index 2625b5b2..df87a234 100644 --- a/dataset/dataset.ini +++ b/dataset/dataset.ini @@ -1,14 +1,14 @@ -# 3W dataset's main configuration file. +# 3W Dataset's main configuration file. # -# All settings inherent in the 3W dataset that can be used by your -# consumers, including the 3W toolkit, are maintained in this file. +# All settings inherent in the 3W Dataset that can be used by your +# consumers, including the 3W Toolkit, are maintained in this file. # In this file, we use the configuration language supported by the # configparser module. # Versions in gereral # [Versions] -# 3W dataset version (may be different than 3W toolkit version) +# 3W Dataset version (may be different than 3W Toolkit version) DATASET = 1.1.1 # This section defines descriptions of all columns of CSV data files @@ -25,7 +25,7 @@ T-JUS-CKGL = Temperature downstream of the GLCK [oC] QGL = Gas lift flow [sm3/s] class = Label of the observation -# Common properties of all event types covered by the 3W project +# Common properties of all event types covered by the 3W Project # [Events] # Internal names of all event types diff --git a/overviews/AfranioMelo/0-normal-data-eda.ipynb b/overviews/AfranioMelo/0-normal-data-eda.ipynb index b3e7a6ed..3f509b63 100644 --- a/overviews/AfranioMelo/0-normal-data-eda.ipynb +++ b/overviews/AfranioMelo/0-normal-data-eda.ipynb @@ -310,7 +310,7 @@ "source": [ "## Instances and time periods\n", "\n", - "According to the definition of the dataset's authors, instances are blocks corresponding to specific time intervals. The dataset contains 1984 instances, of which 597 are normal." + "According to the definition of the 3W Dataset's authors, instances are blocks corresponding to specific time intervals. 
The dataset contains 1984 instances, of which 597 are normal." ] }, { diff --git a/overviews/AndreMachado/main.ipynb b/overviews/AndreMachado/main.ipynb index 865c57c6..6a39ea12 100644 --- a/overviews/AndreMachado/main.ipynb +++ b/overviews/AndreMachado/main.ipynb @@ -4,14 +4,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# 3W dataset's General Presentation" + "# 3W Dataset's General Presentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This is a general presentation of the 3W dataset, to the best of its authors' knowledge, the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark dataset for development of machine learning techniques related to inherent difficulties of actual data.\n", + "This is a general presentation of the 3W Dataset, to the best of its authors' knowledge, the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark dataset for development of machine learning techniques related to inherent difficulties of actual data.\n", "\n", "For more information about the theory behind this dataset, refer to the paper **A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells** published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223))." ] @@ -27,7 +27,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This Jupyter Notebook presents a new 3w dataset overview. For this, One **interactive plot graph** from a specific instance from an event class is presented. \n", + "This Jupyter Notebook presents a new 3w Dataset overview. For this, One **interactive plot graph** from a specific instance from an event class is presented. \n", "By default, the instance is downsampling (n=100) and applied Z-score Scaler.\n", "To help the visualization transient labels were changed to '0.5'." 
] @@ -63513,7 +63513,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this part, we generate a complete interactive HTML report from the data set. It is possible to have a complete view of the 3W dataset of one event class, such as the number of lines, number of columns (variables), number of missing values (null cells, NaNs), duplicate lines, size, and the types of variables that we have in the database. In addition, the tool also brings statistics, histograms, interactions, and correlations.\n", + "In this part, we generate a complete interactive HTML report from the data set. It is possible to have a complete view of the 3W Dataset of one event class, such as the number of lines, number of columns (variables), number of missing values (null cells, NaNs), duplicate lines, size, and the types of variables that we have in the database. In addition, the tool also brings statistics, histograms, interactions, and correlations.\n", "\n", "In the Warnings field, the report already brings some things that we will have to be careful about when analyzing the dataset. 
With this, it is possible to assess the need or not to perform some initial treatment on the data, before starting the exploration.\n", "\n", diff --git a/overviews/_baseline/main.ipynb b/overviews/_baseline/main.ipynb index 62502769..4255230b 100644 --- a/overviews/_baseline/main.ipynb +++ b/overviews/_baseline/main.ipynb @@ -4,14 +4,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# 3W dataset's General Presentation" + "# 3W Dataset's General Presentation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "This is a general presentation of the 3W dataset, to the best of its authors' knowledge, the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark dataset for development of machine learning techniques related to inherent difficulties of actual data.\n", + "This is a general presentation of the 3W Dataset, to the best of its authors' knowledge, the first realistic and public dataset with rare undesirable real events in oil wells that can be readily used as a benchmark dataset for development of machine learning techniques related to inherent difficulties of actual data.\n", "\n", "For more information about the theory behind this dataset, refer to the paper **A Realistic and Public Dataset with Rare Undesirable Real Events in Oil Wells** published in the **Journal of Petroleum Science and Engineering** (link [here](https://doi.org/10.1016/j.petrol.2019.106223))." ] @@ -27,7 +27,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This Jupyter Notebook presents the 3W dataset in a general way. For this, some tables, graphs, and statistics are presented." + "This Jupyter Notebook presents the 3W Dataset in a general way. For this, some tables, graphs, and statistics are presented." 
] }, { @@ -937,7 +937,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The following table shows the amount of instances that compose the 3W dataset, by knowledge source (real, simulated and hand-drawn instances) and by instance label." + "The following table shows the amount of instances that compose the 3W Dataset, by knowledge source (real, simulated and hand-drawn instances) and by instance label." ] }, { @@ -1092,7 +1092,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Considering only **real instances** and **threshold of 1%**, the 3W dataset has the following amount of instances." + "Considering only **real instances** and **threshold of 1%**, the 3W Dataset has the following amount of instances." ] }, { @@ -1199,7 +1199,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If **simulated instances** are also considered, the amount of instances in 3W dataset become the one listed below." + "If **simulated instances** are also considered, the amount of instances in 3W Dataset become the one listed below." ] }, { @@ -1281,7 +1281,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "After also considering the **hand-drawn instances**, we get the final amount of instances in 3W dataset." + "After also considering the **hand-drawn instances**, we get the final amount of instances in 3W Dataset." 
] }, { diff --git a/problems/01_binary_classifier_of_spurious_closure_of_dhsv/README.md b/problems/01_binary_classifier_of_spurious_closure_of_dhsv/README.md index 1d99a4bd..06f7e206 100644 --- a/problems/01_binary_classifier_of_spurious_closure_of_dhsv/README.md +++ b/problems/01_binary_classifier_of_spurious_closure_of_dhsv/README.md @@ -3,7 +3,7 @@ The main aspects of this problem are: * It is a binary classifier in the sense that labels associated with Spurious Closure of DHSV are considered examples of the positive class and all other labels are considered examples of the negative class; -* It is an OVA (one versus all) classifier. The negative class has examples extracted from normal and all other event types present in the 3W dataset; +* It is an OVA (one versus all) classifier. The negative class has examples extracted from normal and all other event types present in the 3W Dataset; * Spurious Closure of DHSV transients are treated as different from Spurious Closure of DHSV steady state; * Only real instances are used. 
diff --git a/problems/01_binary_classifier_of_spurious_closure_of_dhsv/_baseline/main.ipynb b/problems/01_binary_classifier_of_spurious_closure_of_dhsv/_baseline/main.ipynb index 5c67822d..2bcb2658 100644 --- a/problems/01_binary_classifier_of_spurious_closure_of_dhsv/_baseline/main.ipynb +++ b/problems/01_binary_classifier_of_spurious_closure_of_dhsv/_baseline/main.ipynb @@ -11,13 +11,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This is an **example** of how to use the 3W toolkit, a software package written in Python 3 that contains resources that make the following easier:\n", + "This is an **example** of how to use the 3W Toolkit, a software package written in Python 3 that contains resources that make the following easier:\n", "\n", - "* 3W dataset overview generation;\n", + "* 3W Dataset overview generation;\n", "* Experimentation and comparative analysis of Machine Learning-based approaches and algorithms for specific problems related to undesirable events that occur in offshore oil wells during their respective production phases;\n", "* Standardization of key points of the Machine Learning-based algorithm development pipeline.\n", "\n", - "The 3W toolkit and the 3W dataset are major resources that compose the 3W project, a pilot of a Petrobras' program called [Conexões para Inovação - Módulo Open Lab](https://prd.hotsitespetrobras.com.br/pt/nossas-atividades/tecnologia-e-inovacao/conexoes-para-inovacao/) that promotes experimentation of Machine Learning-based approaches and algorithms for specific problems related to undesirable events that occur in offshore oil wells." 
+ "The 3W Toolkit and the 3W Dataset are major resources that compose the 3W Project, a pilot of a Petrobras' program called [Conexões para Inovação - Módulo Open Lab](https://prd.hotsitespetrobras.com.br/pt/nossas-atividades/tecnologia-e-inovacao/conexoes-para-inovacao/) that promotes experimentation of Machine Learning-based approaches and algorithms for specific problems related to undesirable events that occur in offshore oil wells." ] }, { @@ -31,11 +31,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This [Jupyter Notebooks](https://jupyter.org/) presents a **basic** example of how to use the 3W toolkit's resources to develop an experiment for a specific problem.\n", + "This [Jupyter Notebooks](https://jupyter.org/) presents a **basic** example of how to use the 3W Toolkit's resources to develop an experiment for a specific problem.\n", "\n", "You can adapt this example to experiment other approaches. To do so, follow the instructions included in the following codes as comments.\n", "\n", - "**IMPORTANT**: in order to experiment very different approaches with other Machine Learning pipelines, we need to evolve the 3W toolkit first. Your help with this is greatly appreciated." + "**IMPORTANT**: in order to experiment very different approaches with other Machine Learning pipelines, we need to evolve the 3W Toolkit first. Your help with this is greatly appreciated." ] }, { @@ -96,7 +96,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As the 3W toolkit defines and standardizes a number of things, we don't need to worry about labels and IDs associated with the specific event type chosen, number of folds, and which folds consider which instances." + "As the 3W Toolkit defines and standardizes a number of things, we don't need to worry about labels and IDs associated with the specific event type chosen, number of folds, and which folds consider which instances." 
] }, { @@ -130,7 +130,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can see below that the 3W toolkit has methods to extract samples for both training and testing, and also to calculate metrics for each fold." + "You can see below that the 3W Toolkit has methods to extract samples for both training and testing, and also to calculate metrics for each fold." ] }, { @@ -151,7 +151,7 @@ " # \n", " # It is interesting to mention that the metrics obtained with this \n", " # simple approach seem to be good because of the considerable imbalance \n", - " # of the 3W dataset that is not addressed by the 3W toolkit.\n", + " # of the 3W Dataset that is not addressed by the 3W Toolkit.\n", " #\n", " # You can modify this section to try other more interesting approaches.\n", " # All you have to do is generate an array (numpy.ndarray) `y_pred` \n", @@ -180,7 +180,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The 3W toolkit provides a method for retrieving and presenting the metrics calculated for each fold." + "The 3W Toolkit provides a method for retrieving and presenting the metrics calculated for each fold." ] }, { diff --git a/toolkit/__init__.py b/toolkit/__init__.py index c1aabd5f..4e274e71 100644 --- a/toolkit/__init__.py +++ b/toolkit/__init__.py @@ -1,9 +1,9 @@ -"""This is the 3W toolkit, a software package written in Python 3 that -is one of the 3W project's major components. +"""This is the 3W Toolkit, a software package written in Python 3 that +is one of the 3W Project's major components. This toolkit contains resources that make the following easier: -- 3W dataset overview generation; +- 3W Dataset overview generation; - Experimentation and comparative analysis of Machine Learning-based approaches and algorithms for specific problems related to undesirable events that occur in offshore oil wells during their respective @@ -28,7 +28,7 @@ - Binary Classifier of Spurious Closure of DHSV. 
Examples of how to use this toolkit will be incremented throughout its -development. Please, check the project's README.md file for more details. +development. Please, check the 3W Project's README.md file for more details. It is important to note that there are arbitrary choices in this toolkit, but they have been carefully made to allow adequate comparative diff --git a/toolkit/base.py b/toolkit/base.py index bd7c82a5..6d1f5d47 100644 --- a/toolkit/base.py +++ b/toolkit/base.py @@ -22,7 +22,7 @@ # Methods # def load_config_in_dataset_ini(): - """Loads all configurations present in the 3W dataset's main + """Loads all configurations present in the 3W Dataset's main configuration file. Raises: @@ -30,7 +30,7 @@ def load_config_in_dataset_ini(): Exception: Error if the configuration file cannot be loaded. Returns: - dict: Dict with all configurations present in the 3W dataset's + dict: Dict with all configurations present in the 3W Dataset's main configuration file. This dict is formated with the basic configuration language used by the configparser module. @@ -38,7 +38,7 @@ def load_config_in_dataset_ini(): # Check if the configuration file exists in the expected path if not exists(PATH_DATASET_INI): raise Exception( - f"the 3w dataset's main configuration file was not found " + f"the 3w Dataset's main configuration file was not found " f"in {PATH_DATASET_INI}" ) @@ -49,14 +49,14 @@ def load_config_in_dataset_ini(): dataset_ini.read(PATH_DATASET_INI) except Exception as e: raise Exception( - f"the 3w dataset's main configuration file " + f"the 3w Dataset's main configuration file " f"({PATH_DATASET_INI}) could not be loaded. 
{e}" ) return dict(dataset_ini) -# Loads all configurations present in the 3W dataset's main +# Loads all configurations present in the 3W Dataset's main # configuration file and provides specific configurations in different # granularity and formats # @@ -101,14 +101,14 @@ def load_config_in_dataset_ini(): # class EventType: """This class encapsulates properties (constants and default values) - for each type of event covered by the 3W project.""" + for each type of event covered by the 3W Project.""" def __init__(self, event_name): """Initializes an event. Args: event_name (srt): Event type name to be initialized. This - name must be a section name in the 3W dataset's main + name must be a section name in the 3W Dataset's main configuration file. """ event_section = DATASET_INI.get(event_name) diff --git a/toolkit/misc.py b/toolkit/misc.py index fc7a2de4..06605f78 100644 --- a/toolkit/misc.py +++ b/toolkit/misc.py @@ -1,4 +1,4 @@ -"""This is the 3W toolkit's miscellaneous sub-module. +"""This is the 3W Toolkit's miscellaneous sub-module. All resources that do not fit in the other sub-modules are define here. """ @@ -40,7 +40,7 @@ def label_and_file_generator(real=True, simulated=False, drawn=False): """This is a generating function that returns tuples for all indicated instance sources (`real`, `simulated` and/or `hand-drawn`). Each tuple refers to a specific instance and contains - its label (int) and its full path (Path). All 3W dataset's instances + its label (int) and its full path (Path). All 3W Dataset's instances are considered. Args: @@ -84,7 +84,7 @@ def label_and_file_generator(real=True, simulated=False, drawn=False): def get_all_labels_and_files(): """Gets lists with tuples related to all real, simulated, or - hand-drawn instances contained in the 3w dataset. Each list + hand-drawn instances contained in the 3w Dataset. Each list considers instances from a single source. 
Each tuple refers to a specific instance and contains its label (int) and its full path (Path). @@ -108,26 +108,26 @@ def get_all_labels_and_files(): def create_table_of_instances(real_instances, simulated_instances, drawn_instances): """Creates a table of instances (pandas.DataFrame) that shows the - amount of instances that compose the 3W dataset, by knowledge source + amount of instances that compose the 3W Dataset, by knowledge source (real, simulated and hand-drawn instances) and by instance label. Args: real_instances (list): List with tuples related to all - real instances contained in the 3w dataset. Each tuple + real instances contained in the 3w Dataset. Each tuple must refer to a specific instance and must contain its label (int) and its full path (Path). simulated_instances (list): List with tuples related to all - simulated instances contained in the 3w dataset. Each tuple + simulated instances contained in the 3w Dataset. Each tuple must refer to a specific instance and must contain its label (int) and its full path (Path). drawn_instances (list): List with tuples related to all - hand-drawn instances contained in the 3w dataset. Each tuple + hand-drawn instances contained in the 3w Dataset. Each tuple must refer to a specific instance and must contain its label (int) and its full path (Path). Returns: pandas.DataFrame: The created table that shows the amount of - instances that compose the 3W dataset, by knowledge source + instances that compose the 3W Dataset, by knowledge source (real, simulated and hand-drawn instances) and by instance label. """ @@ -192,7 +192,7 @@ def filter_rare_undesirable_events(toi, threshold, simulated=False, drawn=False) Args: toi (pandas.DataFrame): Table that shows the amount of instances - that compose the 3W dataset, by knowledge source (real, + that compose the 3W Dataset, by knowledge source (real, `simulated` and `hand-drawn` instances) and by instance label. This object is not modified in this function. 
threshold (float): Relative limit that establishes rare event @@ -307,7 +307,7 @@ def create_and_plot_scatter_map(real_instances): Args: real_instances (list): List with tuples related to all - real instances contained in the 3w dataset. Each tuple + real instances contained in the 3w Dataset. Each tuple must refer to a specific instance and must contain its label (int) and its full path (Path). @@ -461,7 +461,7 @@ def count_properties_instances(instances): def calc_stats_instances(real_instances, simulated_instances, drawn_instances): - """Calculates the 3W dataset's fundamental aspects related to + """Calculates the 3W Dataset's fundamental aspects related to inherent difficulties of actual data. Three statistics are calculated: Missing Variables, Frozen Variables, and Unlabeled Observations. All instances, regardless of their source, influence @@ -469,15 +469,15 @@ def calc_stats_instances(real_instances, simulated_instances, drawn_instances): Args: real_instances (list): List with tuples related to all - real instances contained in the 3w dataset. Each tuple + real instances contained in the 3w Dataset. Each tuple must refer to a specific instance and must contain its label (int) and its full path (Path). simulated_instances (list): List with tuples related to all - simulated instances contained in the 3w dataset. Each tuple + simulated instances contained in the 3w Dataset. Each tuple must refer to a specific instance and must contain its label (int) and its full path (Path). drawn_instances (list): List with tuples related to all - hand-drawn instances contained in the 3w dataset. Each tuple + hand-drawn instances contained in the 3w Dataset. Each tuple must refer to a specific instance and must contain its label (int) and its full path (Path).