# Phase 3: Scoping
The word "scoping" can have different meanings to different people, but at its core it means to assess or investigate something. This phase is indeed an investigation: the goal is to explore the supporting data and test some of your hypotheses and assumptions. Note that the goal of this phase is not necessarily to create a proof-of-concept; this phase usually has no concrete deliverables and is more open-ended. Rather, this phase is aimed at understanding the size, structure and complexity of the supporting data and at generating a clearer understanding of the problem and what a solution could look like. Ultimately, this phase is aimed at determining whether or not it is possible to deliver a successful project.
## Step 1: Cleaning your data and assessing quality
Usually this phase starts with a data exploration, during which you should be asking questions such as:
- Are the data in a clean and usable format?
- If not, what manual or automated effort would be required to get the data into a usable format?
- How complex are the transformations required?
Once you have the data in a usable format, you should ask standard EDA (exploratory data analysis) questions: What abnormalities does the dataset have? What do the distributions look like? What do outliers look like? How are missing values recorded? How many missing values are there? Are there any unusual entries that require explanation?
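
As a minimal sketch of how a few of these questions can be answered for a single numeric column, the snippet below counts missing values and flags outliers. The sample values, the `MISSING_MARKERS` tuple and the `profile_column` helper are all hypothetical, and the 1.5 × IQR fence is only a common rule of thumb, not a definition of what counts as an outlier in your data.

```python
import statistics

# Hypothetical column of daily order values. Missing entries are recorded
# inconsistently: as None, as an empty string and as the sentinel -999.
raw_values = [120.0, 135.5, None, 128.0, "", 131.0, -999, 126.5, 980.0, 129.0]
MISSING_MARKERS = (None, "", -999)  # assumed markers; confirm with the data owner


def profile_column(values, missing_markers=MISSING_MARKERS):
    """Summarise missing values, the median and IQR-based outliers for one column."""
    present = [v for v in values if v not in missing_markers]
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Rule-of-thumb fences: anything beyond 1.5 * IQR from the quartiles.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n_total": len(values),
        "n_missing": len(values) - len(present),
        "median": median,
        "outliers": [v for v in present if v < lower or v > upper],
    }


summary = profile_column(raw_values)
print(summary)
```

Note that the sentinel `-999` would have badly distorted the distribution had it not been treated as missing, which is why the question "how are missing values recorded?" should be answered before any summary statistics are trusted.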
:::{.infobox}
**Data access and security**
Getting access to data can often be a challenge due to security policies that may be in place. This is especially common in financial services, but is certainly not limited to this industry. There are many ways of making data access secure, so you may have to work with the data administrators to work out a safe and secure plan that works for them. We have each had to endure inconveniences to access data, such as sitting in data-secure rooms with no internet access, having encrypted laptops sent to us to use for our work, or accessing data via VPNs and cloud-based servers that restricted access based on multi-factor authentication. For some industries, we have even had to go through extensive background checks. There are often quite a few hurdles to clear when it comes to data security, so we recommend discussing this with your client early in order to get them out of the way.
:::
What defines clean data can vary greatly across professions and industries. For some people, hundreds of differently structured Excel files that are human-readable would be considered clean data; for many data scientists this would be a nightmare, requiring a substantial investment to merge all of the files so that they can be analysed as a coherent whole. Databases or data warehouses, in contrast, tend not to suffer from such problems of *data structure*. However, even well-structured data can require substantial cleaning. For example, the dataset could contain duplicate entries, missing values or inconsistently recorded measurements, to name but a few potential data cleanliness problems. While exploring and working with a poorly structured or unclean dataset may be agonising, knowing the state of the data that you will use for the project is critical in order to accurately appreciate the scale of the prospective project. Unexpected data cleaning can take weeks and will invariably cause delays to your project, so this insight will allow you to budget appropriately when creating your project plan.
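
The three cleanliness problems named above (duplicate entries, missing values and inconsistently recorded measurements) can be counted quickly before you commit to an estimate. The following is a rough sketch only: the records, the field layout and the `cleanliness_report` helper are hypothetical, and real data will need checks specific to its own quirks.

```python
from collections import Counter

# Hypothetical records: (customer_id, date, weight as recorded).
records = [
    ("C001", "2021-03-01", "2.5 kg"),
    ("C002", "2021-03-01", "2500 g"),
    ("C001", "2021-03-01", "2.5 kg"),  # exact duplicate of the first row
    ("C003", "2021-03-02", "3.1kg"),
    ("C004", "2021-03-02", ""),        # missing measurement
]


def cleanliness_report(rows):
    """Count exact duplicates, rows with empty fields and the units in use."""
    row_counts = Counter(rows)
    duplicates = {row: n for row, n in row_counts.items() if n > 1}
    rows_with_missing = sum(1 for row in rows if any(field == "" for field in row))
    units = Counter()
    for _, _, weight in rows:
        if weight:
            # Strip the leading number to leave only the unit suffix.
            units[weight.lstrip("0123456789. ").strip()] += 1
    return {
        "duplicates": duplicates,
        "rows_with_missing": rows_with_missing,
        "units": dict(units),
    }


report = cleanliness_report(records)
print(report)
```

More than one unit appearing in the `units` tally is a sign that measurements were recorded inconsistently and will need converting before analysis.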
## Step 2: Testing assumptions
The next step is reducing some of the uncertainty and risk by carrying out hypothesis testing. What do we mean by that? In part, we mean statistical hypothesis testing in the truest sense of the word: formulating null and alternative hypotheses and applying appropriate statistical tests to see how well the evidence supports them. However, we also mean testing more general ideas about the data, usually based on anecdotal evidence or domain experience from those who know the business best. For example, your client may be confident that sales are highest on Fridays -- use that as the basis for an experiment in your EDA and see if the data really do support that notion. Often with such assumptions, the data will indeed confirm what your client knows (or, more accurately, *thinks* they know) about the data. But sometimes those assumptions are wrong, and you are unable to find support for the assumption **in the data you have**.
It is important to note here that hypothesis tests that fail to support your client's assumptions do not necessarily mean that those assumptions are false. We emphasised the words "in the data you have" in the preceding paragraph for a reason: your data may not be representative of reality. This touches on a fundamental notion in data science -- that data is not the same as ground truth. Data are artifacts and are affected by biases in collection methodology, study design, storage, aggregation and interpretation. In short, the data that comes to you has been affected by the many design choices that were made by others -- choices about every step on the journey between the real world and the number you see in front of you. No dataset is ever completely unbiased. For an in-depth discussion of this topic, we strongly recommend [this article](https://medium.com/@angebassa/data-alone-isnt-ground-truth-9e733079dfd4){target="_blank"} by Angela Bassa.
Does this mean that data science is worthless? Of course not. But it does mean that you should keep in mind that some bias will be baked into every dataset. Your job is to identify where biases may lie and account for them in your mission to "reach a factual understanding of truth" [@bassa_2017]. In an ideal world, you (the data scientist) would have a say in how data are collected and curated, therefore allowing you to understand what design choices were made. In reality this almost never happens, leaving you to work with data that has been affected by the choices of others. During this phase of the project, you will want to validate some of the foundational assumptions upon which your project will be based. Our advice is to do this as scientifically as possible so that you can maximise your chances of accurately assessing what will be involved to deliver the project objectives.
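
To make the Friday sales example from earlier in this step concrete, here is one way such an assumption could be tested. This is a sketch under stated assumptions, not a prescribed method: the sales figures are invented, `permutation_test` is a hypothetical helper, and a one-sided permutation test on the difference in means is just one reasonable choice of test. The null hypothesis is that the day labels are interchangeable, so shuffling them should produce differences as large as the observed one fairly often.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """One-sided permutation test: is mean(group_a) greater than mean(group_b)?"""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean(pooled[:n_a]) - mean(pooled[n_a:]) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical daily sales figures, invented for illustration only.
friday_sales = [212, 198, 225, 240, 207, 231]
other_day_sales = [180, 195, 172, 201, 188, 176, 190, 183, 199, 185, 178, 192]

observed_diff, p_value = permutation_test(friday_sales, other_day_sales)
print(f"Observed difference in means: {observed_diff:.2f}")
print(f"Estimated one-sided p-value: {p_value:.4f}")
```

A small p-value is evidence for the client's belief in the data you have; a large one means only that this dataset does not support it, which, as discussed above, is not the same as the belief being false.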
## What you hope to learn from scoping
We recommend that you time-box the scoping phase of your project. Exactly how much time you spend will vary: in some cases a few hours of exploring the data may be sufficient, in others you may want to spend several weeks going through a formal in-depth EDA and data validation activity. Time-boxing allows the team to focus their efforts on the most pressing concerns without starting to implement the entire project. Remember, **the goal is not to get started with producing results per se, but rather to reduce risk and explore possible solutions.** In software engineering, this is often done in the form of a design sprint. In data science, we can also work in sprints but the process is slightly different as we need to validate a hypothesis -- the process is far more scientific. Doing this work beforehand helps us drastically reduce uncertainties and ensures that projects which are unlikely to succeed don’t get started; if we discover there is insufficient or insignificant data to reach the goal then we can pause the analysis in order to collect more suitable data, delaying the project until its goals are achievable.
The reader might think that doing this phase is hindering the project -- delaying it from really getting started and providing little in the way of tangible outcomes. In our experience, this is untrue: when we scope out a project we can more accurately predict what the outcome will be and what resources are needed to achieve the end goal. This translates into less risk for clients and data scientists alike. Being able to quickly demonstrate progress or build a basic proof-of-concept reassures all stakeholders that the project can succeed. For the data scientist, the lower level of risk helps us to estimate the time required for the remainder of the work more confidently, thereby avoiding the temptation to add budget padding to defend against unexpected obstacles. We have personally had a 100% success rate in winning project proposals when we have included this stage as a discrete engagement. The reason is clear -- having had time to scope out the project, we could write proposals that were more competitive because they lacked the padding often used to protect against project uncertainty.
:::{.infobox}
**Using science with models**
It is essential to note that general engineering projects differ from data science projects. Data science projects need to be scientifically correct, and for this you need to follow a scientific process. It is not enough to make a library call to train a model and assume it will perform a certain way just because you tested a few cases. The models you create need to be robust, or you will get widely varying results that cannot be explained. Making models robust to outliers is an important step in modelling. So be warned: it is often not enough to have used a machine learning library in the past. You should be capable of following the scientific process required to build what is necessary for a data science project to succeed.
:::
In this phase, it is also important to start thinking about how you will test your final solution. Can you use TDD (test-driven development)? This strategy requires you to set up extensive test cases now that you will use to validate your final solution later. Doing this will let you think through the problem more clearly and write down what criteria the result will adhere to.
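
As an illustration of what writing the test cases first might look like, the sketch below fixes two acceptance criteria before any real model exists and runs them against a deliberately naive placeholder. Everything here is hypothetical: the tolerance, the held-out cases, and the `naive_forecast` and `check_solution` names are assumptions made for the example.

```python
# Hypothetical acceptance criteria, written down before any model exists.
MAX_MEAN_ABS_ERROR = 15.0  # assumed tolerance agreed with the client

# Hypothetical held-out cases: (recent daily sales, next day's actual sales).
TEST_CASES = [
    ([100, 104, 98], 101),
    ([250, 245, 260], 255),
    ([40, 42, 41], 45),
]


def naive_forecast(history):
    """Placeholder solution: predict the mean of the recent history."""
    return sum(history) / len(history)


def check_solution(forecast):
    """Run the acceptance criteria against any candidate forecast function."""
    errors = []
    for history, actual in TEST_CASES:
        prediction = forecast(history)
        # Criterion 1: a sales forecast can never be negative.
        assert prediction >= 0, f"negative forecast for {history}"
        errors.append(abs(prediction - actual))
    mean_abs_error = sum(errors) / len(errors)
    # Criterion 2: the average error must stay within the agreed tolerance.
    assert mean_abs_error <= MAX_MEAN_ABS_ERROR, f"MAE too high: {mean_abs_error:.2f}"
    return mean_abs_error


baseline_error = check_solution(naive_forecast)
print(f"Baseline mean absolute error: {baseline_error:.2f}")
```

Because `check_solution` accepts any function with the same signature, the final model can later be passed through exactly the same checks, and the placeholder's score doubles as a baseline that the real solution should beat.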
## Is success possible?
Recall from Figure \@ref(fig:bottom-fig) that after this scoping phase, you should assess whether your prospective project is viable. Now is the time to answer that question based on what you learned from this phase of the project. If you believe that you will be able to plan and deliver a successful project, then you should move on to Phase 4.
If you are unsure, or believe you cannot deliver a successful outcome, then you will need to go back to the client to either re-define success (Phase 2) or establish a different business case (Phase 1). This result is not a failure but rather an example of your due diligence arriving at an important finding. While it may not be the outcome you or your client wants, it is essential that you allow the scoping project to fail if need be. At the very least you saved yourself from failing at a larger scale. In this scenario, it is useful to clearly identify the reasons for your misgivings and consider how they may be overcome. For instance, would the collection of more data change your mind? Can the data be enriched with another source? Perhaps the incoming data you have been given suffers from some sort of collection or aggregation bias that you cannot account for? If you can somehow make a good recommendation for how the project can still be successful with a few concrete changes, that would be a very valid outcome. For example, can click-stream data be collected from the website if there are insufficient orders to make decent recommendations for customers who have never bought an item? This would solve the cold start problem many recommendation engines face.
If you are happy that your project "has legs", then it's time to move on to Phase 4: Project definition. This is where you concretely design the project, including recommended approach, phases, milestones, timelines, resources and budgets. It's a big stage and a big chapter, and one that will be critical for the success of most projects. See you there!