# Deep dive into Phase 5 – exploration, execution and pivots {#execution}
At this point, you have done your homework. You have explored the project’s feasibility, you have scoped the work and you have put together a project plan that resonates with your client/manager. You have determined that the proposed work makes sense in terms of feasibility and business value to your client/manager, and you have decided that it passes muster in a larger, contextual sense. You have prepared, worked hard and thought about what lies ahead of you. It’s time to start with the implementation phase!
However, the fact that the project has begun does not mean that you will have no more questions or no further need to consult subject matter experts. It is, therefore, essential that the stakeholders involved in the project remain available for question-and-answer sessions. Questions will range from data clarification issues to more specific industry knowledge questions and how the end-user will be using the product. The more the stakeholders remain involved, the more input they will have along the way and the happier they will be with the result. We recommend scheduling time slots that together add up to at least half a day a week for clarifying meetings.
At this point, we would like to address a common misconception: that developers and data scientists need to have domain expertise to solve domain-specific problems. While we do find that familiarity with a specific industry can certainly help a data science project get started, we have rarely seen that industry knowledge is essential. We would further argue that a lack of domain expertise can be a positive thing, for it allows one to view a problem in a way that is unencumbered by prior assumptions and biases. Naturally, anyone working in a new field will have a lot to learn and may get off to a slow start, but in our experience, as long as subject matter experts are available to help answer questions, prior industry knowledge is not essential.
In thinking about our projects and how we have worked, we generally have four main stages to project execution: research, prototyping, building and evaluation. Naturally, every project is different and has different needs, but rarely have our projects not at least touched on each. Similarly, the movement between them is not always linear or ordered, so your project may well not follow such a smooth trajectory.
In our experience, we normally spend 25% of our time researching, 25% on prototyping and the remaining 50% of our time on building and evaluating. It should be noted, however, that some projects are, by definition, experimental with a proof-of-concept (POC) as the objective. In such projects, research and prototyping carry a bit more weight. Even so, these projects usually require substantial documentation, reporting and knowledge transfer. Thus, the amount of time you spend on developing your POC may increase, but you should not underestimate the time required for the documentation and knowledge transfer. Often this is more time-consuming than building a solution, and you should budget your time accordingly.
We group research and prototyping under a broader umbrella of “proof-of-concept (POC)”. During the POC stage of a project, your goal should be to answer two questions: “Is the goal of the project possible?” and “How should I get to it?” These loosely correspond to the research and prototyping stages of a project, respectively. Naturally, these questions are related – you can’t predict that a goal is possible without having some idea of how to get there. Often you have several possible approaches to choose from; you want to identify the best of those possibilities, test it on your data and determine if it is a sensible approach to build and develop further.
Similarly, building and evaluation often go hand-in-hand, as we discuss in more detail below. We have also included “go to client”, which is less of a stage and more of a call to check in that should occur frequently.
## Research
Nearly every project includes periods of research. What do we mean by research? Essentially, we mean taking an in-depth look into the problem at hand. This includes considering different ways to frame the question and weighing the potential approaches to answer it.
Almost invariably this will include an online search. Many data scientists will have a library of resources they can turn to, which could include books, blogs, articles, colleagues and mentors, or even tweets. You will also probably gather a list of websites that you can turn to for information.
We all learn in different ways, but you should bear in mind that talking to an expert is often the most efficient way to get information; the time you save can be put towards other activities that could add value to the project. (Naturally, you will need to have a network of people you can turn to, and this is not something that can be built quickly. Your network of colleagues will likely be one of your most valuable assets, and you should not underestimate how important it is to build, nurture and contribute to it. Please see chapter \@ref(help) for tips and advice on how to build your professional network.)
While gathering information from experts can go a long way towards understanding the task at hand, it is important to do your own research as well. Do your homework to find out how industry leaders solve problems similar to those that you are facing. While you may not need to use state-of-the-art technology (and very often you won’t), it is good to know what benefits the most cutting-edge technologies offer, as well as what costs they incur. Indeed, every solution will have pros and cons and it is your task to make an informed decision. The time you spend researching will translate into substantial improvements in efficiency. Consider whether you can find research papers describing or comparing similar approaches and whether libraries already exist with basic implementations of the algorithms you need. Utilising what is available lets you invest your time in creating something new and maybe even allows you to give back to the community. For instance, you may end up extending someone else’s open-source library. If so, this can be a good opportunity to contribute a pull/merge request – most developers welcome suggestions (although you should look for a contributing guide before creating one). Did you end up comparing different techniques? Publish a paper or blog post about it. Others will appreciate the information and you will get your name out into the wider community.
Learning a new field can be overwhelming. Yet as data scientists, we often need to have a thorough understanding of the work or algorithmic approaches before we can build a solution. While this may seem daunting, you should remember that when you are new to a data science project, you have a window of time in which you're allowed to be ignorant. In other words, it's acceptable to not know the field because you're new to it. That empowers you to ask questions, even ones that you worry may be "stupid". This window of opportunity will not be open forever, so use it when you're starting and don't be self-conscious about it!
When immersing yourself in a new field, it is good to be aware of biases and of industry knowledge that insiders treat as general knowledge. It is easy to miss such information and not even realise that you are missing it. A good example of this is the financial sector and the stock market. When predicting stock prices, most people are aware that they need to do backtesting to test for statistical significance. Without any further financial knowledge, you might simply think that when your model has significant predictive power, you will make money on the stock market. Many amateur algorithmic traders have found out from experience that it isn’t that simple. They simply didn’t have the financial knowledge to see the pitfalls in their analysis and so their algorithms failed due to a lack of information.
## Prototyping
Once you have researched possible approaches to your project’s goal, it’s time to choose the best candidate and see if it will work. This is the goal of prototyping. You may have found through your research that there is a single, clear best way to reach a project’s goal. In that case, formal prototyping is somewhat superfluous and you can move on to building directly. In other cases, there is no clear best way to achieve a project’s goal and you have to compare the possible approaches empirically, trying several options on your data to make an informed decision about which seems best based on the evidence you gather. In either case, it is a good idea to build a very simple prototype for your project so that you can be confident that what you are going to build out in the next stage is, in fact, a viable solution.
Once you have chosen an approach that you believe will work, you can move on to the building stage. (However, before doing so, it’s a good idea to check in with your client/manager, as we discuss below.)
On the other hand, your prototyping efforts may give you unwelcome news: that your top choice candidate approach does not do well with your data. In this case, we’re afraid you will have to start the research again, re-evaluating if you still think that the result is possible and exploring another candidate approach for prototyping. This is frustrating but important – research can be slow and you may have to be patient. We all want to start building a solution, but building the wrong solution will be costly! Be sure to use your research time wisely so that you gather accurate, reliable information and make well-informed decisions.
As mentioned above, half of your hands-on time in your project (what we call “exploration, execution and pivots”) should be spent on building your solution. Therefore, by this point in your project, you should have decided on a course of action, selected the algorithms you intend to use, demonstrated that the chosen solution is going to work and started iterating on putting it together into a coherent product. If, at the halfway point, you do not have a working prototype, it’s time to think carefully about whether you will be able to deliver the expected outcome. Discussing the state of the project is especially important in this situation; while it may not be a conversation you want to have, waiting to communicate the state of a project that may be in trouble will usually make the situation worse, not better.
## Build, assess, rinse and repeat
For many data scientists (the authors included) the building stage is where the fun happens. Here you focus on developing your prototype further to create a product that meets – and sometimes exceeds – your project’s objectives. The objective could be anything from a great model that runs locally, to a full-fledged solution that scales in production.
If you have entered this phase with a respectable prototype, you will most likely start the build by trying to improve upon your results. Often this comes down to expanding your dataset and using scientific creativity to devise new ways of using it. Correspondingly, feature engineering is often a big part of the building process. If you have identified shortcomings in the data you started with, you may want to look for ways to bring in other data. Sometimes this involves external datasets that are publicly available. For example, the Index of Multiple Deprivation can be a very useful dataset to include in your analysis when trying to find relationships between UK geography and an observation.
Building is iterative by nature. This applies not only to the analysis but also to the code that underlies it. A good example of this is in code optimisation: we all want fast code, but trying to get the code logic working while simultaneously optimising the code can be difficult. Our advice is to get it working first, then worry about making it fast and robust. Often framing your coding work in the context of building a software package or library can help: the process of building a package/library inherently forces you to write better code that is more robust and efficient. Similarly, testing your code rigorously will force it to be better and less brittle.
As you are building and testing the output, you will undoubtedly explore various possible approaches and assess their strengths and weaknesses as you refine the work. But how should you make these comparisons – what should you take into account when deciding on how to proceed? Assessment is a key part of this stage that can help ensure that you are making wise choices.
:::{.infobox}
**Understanding what's going on inside your model**
A common notion is that artificial intelligence is complicated and cannot be understood. It is often treated as a black box with a magical outcome. Developers who use machine learning libraries without fully understanding them often create or support this misconception. This viewpoint, however, originates from a lack of knowledge. Most machine learning libraries are open source, and a skilled data scientist will be able to dig into their models and understand or alter the results they produce.
At their core, machine learning models are mathematical models -- data is manipulated through a series of mathematical operations. These operations are not random but should be well chosen to produce favourable results. Understanding this process lets you understand the limitations and the overall behaviour of the model. Once this is properly understood, one can make deliberate changes that improve the model overall, remove bias or simply deal better with outliers that had been neglected previously.
**What is open source?**
Open-source software is any computer software that is distributed with its source code available for modification. That means it usually includes a license for programmers to change the software in any way they choose: they can fix bugs, improve functions, or adapt the software to suit their own needs.
As a side note, take care when choosing your machine learning toolkit. When using closed and proprietary systems like those provided by IBM, Google and Amazon, data scientists will lose visibility and the ability to understand the system fully.
:::
When many non-specialists think of assessment, the word “accuracy” comes to mind. Experienced data scientists are wary of this term, as it means something very specific (the proportion of predictions that are correct) and, on its own, is seldom a meaningful metric for evaluating model performance, particularly when classes are imbalanced. You should also consider other metrics such as precision, recall or F1 score. However, assessment extends far beyond model performance; many factors will dictate what the right choice is for your project. For example, you may need a highly interpretable model, or you may need something fast. Exactly what is important for your project can vary, and you should keep this in mind throughout this phase. How you assess your work should not be an afterthought; on the contrary, you should think about how you intend to measure performance from the outset.
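To make the point about accuracy concrete, the sketch below computes accuracy, precision, recall and F1 from scratch on a made-up, imbalanced dataset (5 positive cases out of 100). The labels, both sets of predictions and the function name `classification_metrics` are illustrative assumptions, not results from a real project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Ratios with a zero denominator are reported as 0.0 (a convention chosen here).
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 5 positive cases followed by 95 negative cases.
y_true = [1] * 5 + [0] * 95

# A "model" that always predicts the negative class.
always_negative = [0] * 100

# A model that finds 4 of the 5 positives but raises 6 false alarms.
finds_positives = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

lazy = classification_metrics(y_true, always_negative)
useful = classification_metrics(y_true, finds_positives)
print(lazy)
print(useful)
```

On this invented data the always-negative model has the higher accuracy (0.95 versus 0.93) even though it never finds a single positive case; only recall and F1 expose the difference.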
Having a meaningful way to assess your work will also be important for benchmarking. Benchmarks are a handy way to set a standard against which you can compare your model’s performance. We encourage you to set benchmarks early in your project and set aside time to develop these benchmarks. Knowing what the standard was before your project can give you a demonstrable way to show added value from your work. Your final results should always be quantifiable, so set yourself up for success: define your KPIs and measure your final success against them.
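One simple way to set a benchmark, sketched here under stated assumptions, is to score a naive baseline with the same metric you use for your model. In this example the baseline always predicts the mean of the training targets and the metric is mean absolute error; all numbers and names are invented for illustration.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Invented data: targets seen during training and targets held out for testing.
train_targets = [10.0, 12.0, 11.0, 13.0]
test_targets = [12.0, 14.0, 10.0]

# Hypothetical predictions produced by the model under evaluation.
model_predictions = [11.5, 13.0, 10.5]

# Naive benchmark: always predict the training mean.
baseline_value = mean(train_targets)
baseline_predictions = [baseline_value] * len(test_targets)

baseline_mae = mean_absolute_error(test_targets, baseline_predictions)
model_mae = mean_absolute_error(test_targets, model_predictions)

print(f"baseline MAE: {baseline_mae:.3f}")
print(f"model MAE:    {model_mae:.3f}")
print(f"improvement:  {baseline_mae - model_mae:.3f}")
```

Reporting the gap between the two numbers, rather than the model's score alone, gives your client/manager a reference point for the value the work has added.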
### Testing
As with software development, data science projects should be rigorously tested to ensure model performance. Create test cases and make sure your model adheres to them. Later, if you need to update the model with new data, these test cases should still hold, which makes any update more robust. Testing is something of an art form, and it takes a lot of experience and practice to become good at devising a sensible, rigorous testing scheme. Furthermore, the notion of “unit testing” is not entirely sufficient in data science. Nonetheless, it’s good to have an understanding of testing principles.
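As a rough illustration of what such test cases might look like, the sketch below checks behaviour that should hold however the model is retrained, rather than exact output values. The function `predict_churn_probability` is a hypothetical stand-in for a trained model, and the three expectations are assumptions chosen for the example.

```python
def predict_churn_probability(months_inactive, support_tickets):
    """Hypothetical stand-in for a trained model's predict function."""
    score = 0.05 + 0.10 * months_inactive + 0.03 * support_tickets
    return min(max(score, 0.0), 1.0)  # clamp to the range [0, 1]


def test_output_is_a_valid_probability():
    # Every prediction must lie between 0 and 1, whatever the inputs.
    for months in range(0, 24):
        for tickets in range(0, 10):
            assert 0.0 <= predict_churn_probability(months, tickets) <= 1.0


def test_longer_inactivity_never_lowers_predicted_risk():
    # Directional expectation: more inactivity should not reduce the risk.
    for months in range(0, 12):
        later = predict_churn_probability(months + 1, 2)
        earlier = predict_churn_probability(months, 2)
        assert later >= earlier


def test_active_customer_is_low_risk():
    # A known, easy case that any acceptable model should get right.
    assert predict_churn_probability(0, 0) < 0.5
```

Functions named `test_...` like these are collected automatically by pytest, or they can simply be called directly from a script.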
If you or a colleague has made changes to some code and want to replace the old version, how do you know it has not introduced any problems (let alone that it’s improved)? Unit tests help ensure this for software, along with code reviews of course. For data science work, we’ve found it critical to let reviewers inspect results from experiments.
Another common use case for unit testing in software is to ensure things work properly before automatic deployment to a production environment. For machine learning models in a production environment, you can run code that retrains your model on new data automatically and redeploys your updated model to production. In this situation, you may want the deployment to fail if your updated model fails to meet certain conditions. To handle this, abort the deployment if your retraining script throws an error. This allows you to build any gatekeeping checks you wish while ensuring consistent quality.
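The gatekeeping idea can be sketched as follows: the retraining script compares the new model's evaluation metrics with minimum thresholds and signals failure if any is missed. The metric names, the thresholds and the helpers `check_quality_gates` and `run_gate` are hypothetical; how a failure actually stops a deployment depends on your own pipeline.

```python
import sys


class QualityGateError(Exception):
    """Raised when a retrained model misses a required threshold."""


def check_quality_gates(metrics, minimums):
    """Raise QualityGateError if any metric is missing or below its minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}={value} (required >= {minimum})")
    if failures:
        raise QualityGateError("; ".join(failures))


def run_gate(metrics, minimums):
    """Return a process exit code: 0 to allow deployment, 1 to block it."""
    try:
        check_quality_gates(metrics, minimums)
    except QualityGateError as err:
        print(f"Deployment blocked: {err}", file=sys.stderr)
        return 1
    print("All quality gates passed.")
    return 0


# Hypothetical thresholds and evaluation results for a retrained model.
minimums = {"precision": 0.80, "recall": 0.70}
new_model_metrics = {"precision": 0.84, "recall": 0.65}

exit_code = run_gate(new_model_metrics, minimums)
# A real retraining script would finish with sys.exit(exit_code).
```

Here recall falls below its threshold, so `exit_code` is 1. Most CI/CD tools treat a non-zero exit code as a failed step, which is what stops the updated model from being deployed.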
::: {.infobox}
***A note on notes***
Any experienced researcher will know that a critical part of the work is in keeping detailed records. This is no different for the data scientist – keeping track of results, as well as detailed records of assumptions and choices that you make, is essential. Aside from the fact that this is scientific best practice, it also makes practical sense: it is not uncommon to examine your final results and realise that something doesn’t make intuitive sense. In this case, you will have to do some detective work to figure out why. Good notes that include information about the assumptions and potential errors you made along the way will make that task a lot easier.
In a recent episode of one of our favourite podcasts, [Not So Standard Deviations](http://nssdeviations.com/85-oldold-with-special-guest-jenna-krall){target="_blank"}, host Roger Peng and guest Jenna Krall discussed the topic of how assumptions and undocumented workflows can impact data science research. If you are interested in learning more about this topic, then [Project TIER](https://www.projecttier.org/) is a great place to start.
While saying that you should take good notes may seem obvious in principle, it is very easy to fall short in practice. Data science projects can move fast, and we all get caught up in the excitement of discovery or writing code that works well. It’s good to be excited – the work is exciting – but don’t let yourself forget to be a good scientist in the process.
:::
## Evaluate
As you are building, it’s a good idea to keep in mind the four levels of project evaluation we outlined in Chapter 4. Recall that we described them as follows:
- The **process level** is focused on the actions taken towards producing deliverables.
- The **product level** is concerned with the deliverables themselves and whether they meet the technical requirements of the project.
- The **business level** describes how well the project brings value to your client/manager.
- The **contextual level** is the most abstract and relates to the circumstances surrounding a project and the externalities that affect it.
We encourage you to revisit these levels often throughout your project. During the Build stage, you will mostly focus on the process and product levels. But it is also important to not lose sight of the higher business and contextual levels. When you have finished building and are ready to hand over your work to your client/manager, consider how well your project, as a whole, has satisfied these higher, more abstract levels of evaluation. If you feel that you have done a good job on all four levels, then you can be reasonably confident that you have designed, executed and delivered something of value.
## Go to client/manager
Going to your client/manager is not a stage in project execution per se because this should be something that is done regularly throughout the life of your project. It’s not something that should be done only at the end once the project is complete. Instead, your client/manager should be contacted frequently as they are vitally important to the success of your project. You should include them frequently and give them a deep understanding of what you are working on.
Some find client/manager interactions to be difficult and stressful, and even the easiest client/manager relationships can become strained when a project doesn’t go according to plan. The most common problem is that people are busy and you might feel uncomfortable asking for their time. You may find that your client/manager does not want to be involved, or thinks that they don’t have much to contribute. Quite the contrary is true in reality: their involvement is crucial for project success. We recommend you emphasise this point early in your working relationship and make sure that your client/manager knows the expectations you have for their involvement.
The contrary can also be true: you could work with people who are overbearing and want to be too involved. This can often come across as micromanagement and, as such, it is important to set boundaries early on. Agree on when you will have meetings and give updates so that you can free the rest of your day for productive work. Either way, work on building a trusting relationship.
Often clients/managers mistakenly believe that their role is to give you their requirements, data and infrastructure and that you will go away and silently build a product that is exactly what they imagined. Naturally, this is misguided: data science projects are complex and involve many decisions. While you may know the dataset and problem well, you will not be in a position to understand the business case as well as the client/manager. Some decisions are yours to make, but the responsibility for others lies squarely with the client/manager. You should not try to make these decisions on your own. If your client/manager resists, you should make every attempt to get your client/manager to understand that data science projects require an iterative process and the more that you can get their feedback the more the final product will meet their requirements.
Once you have convinced your client/manager of the importance of their involvement, plan regular meetings. To keep these meetings going, you need to make them see that the meetings are valuable to them, so use the time provided effectively. You can do this by planning the meeting and taking ownership of the agenda. It is a good idea to prepare slides with any findings you have uncovered, visual representations of what you are working on or simply a list of questions to go through. People tend to be able to focus better on the problem at hand when they have something to look at.
Your client/manager will have a vision that they need to realise. You need to transform that vision into something tangible. What you end up delivering will probably not match their vision exactly: in some ways it might be better and in other ways worse. Everything is a trade-off, and we need to make choices or the project will never finish and we would need an infinite budget. The problem is often that they will not understand the technical challenges and the trade-offs that you make. It is therefore essential that you get buy-in for the decisions that you make. Explain how making a certain decision impacts their vision but gets them the tangible results required.
Non-technical people often have different terms and terminology that they use to describe problems. Do ask them to define and clarify terminology so that you remain on the same page. Don’t assume that you can look this up later or that you are working with the same definition. Make notes of these definitions and use them when you talk to your client/manager later. Collaboration is a two-way street so it can also be really useful to carefully educate the people that you work with. When teaching others you don’t want to come across as an arrogant know-it-all that knows things other people don’t. Instead, focus on the collaborative aspect and phrase it in a way that states that you can learn from each other. Everyone does have different knowledge and knowledge sharing is in general very powerful. The more that people understand how machine learning works the more they will understand the challenges you face and how their knowledge can improve the outcome.
At some large organisations, there is currently a movement where AI departments are seated close to the CEO and other decision-makers. It has been shown that having casual interactions with AI practitioners increases AI-favourable decision-making. This makes sense, as AI remains top of mind if you are constantly interacting with AI practitioners. The takeaway here is that as a data scientist you should increase your interactions with the decision-makers, as this results in a win-win situation for everyone involved.
An important tool to use when speaking to your client/manager is repetition. Repetition builds trust, as mirroring the problem in their own words makes them feel that you have understood it. So, whenever possible, repeat the problem you are solving with the exact words they used to describe it. Emphasise that you understand the problem and, if needed, shift the conversation by saying something like: “however, in data science we use...”. Remember that you should be leading the conversation and the meeting as a whole. Aim to be doing 80% of the talking and choose what you say wisely.
When dealing with other people, it is important to remain kind, positive and in control of your emotions, especially when situations become difficult. Dealing with highly demanding clients/managers can be extremely frustrating and stressful. It can be good to remember that the situation is probably more stressful for them than it is for you. Management also has targets to meet, and they are ultimately responsible for the success of the project even though the execution is out of their control. Empathise with them and make them feel in control by involving them in the process as much as possible. It helps to always remain positive but realistic when speaking about your project, as this gives hope without overselling your deliverables.
Whenever problems do occur, don’t avoid them; they don’t go away just because you don’t discuss them. This is the main reason why projects fail: challenges aren’t communicated or addressed early on. Rather than see it as a problem, tackle it as a challenge, make a plan and present this new plan in a positive light. It is important to never go to your client/manager without a plan and expect them to help you figure out how your project can be saved. Your client/manager needs results, and if the results they are expecting are impossible to obtain, make a plan B and figure out how you could still provide value. Your client/manager wants to know that you are the best choice to work with and that you can overcome challenges and tackle them effectively. Tackling problems can allow you to show your true creativity, ensuring that you remain their first choice on future projects.
As a data scientist, you are your client’s/manager’s guide. You need to take responsibility for the entire project’s lifecycle. This process is in no way easy but if you define success correctly, plan, execute and communicate effectively then your project will succeed.