<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type" /><title>Taming Data with Science | Robin Gower, Equal Experts</title><meta charset="utf-8" /><meta content="ie=edge" http-equiv="x-ua-compatible" /><meta content="width=device-width, initial-scale=1" name="viewport" /><link href="shower/themes/material/styles/styles.css" rel="stylesheet" /><link href="https://fonts.googleapis.com/css?family=Lato|Montserrat:700&display=swap" rel="stylesheet" /><!-- %link(href="css/syntax-highlighting.css" rel="stylesheet")/ --><link href="css/extra.css" rel="stylesheet" /></head><body class="shower list"><header class="caption"><h1>Taming Data with Science</h1><h2>Lessons learned from analysis, explained with information theory</h2><p>Information Entropy is a way of measuring data in terms of the amount of uncertainty it resolves. We'll use this perspective to explore techniques for structuring and analysing your data. You will learn practical ideas for how to extract more value from your data and leave with a framework for understanding the value proposition of data-driven products.</p></header><section class="slide title clear"><img src="pictures/slide-rule-wide.png" height="100%" style="z-index: -1" class="place" /><h1>Taming Data with Science</h1><p>Lessons learned from analysis<br />explained with information theory</p><br /><br /><h2>Robin Gower, Equal Experts</h2></section><section class="slide"><img src="pictures/coat.svg" class="cover place right" /><br /><br /><br /><h2>Do I need to<br />bring a coat?</h2></section><section class="slide"><h2>What's the weather like?</h2><pre>
<code>
Location: Berlin
Date: 2019-09-24
Temperature: 15°
</code></pre><footer class="footer"><p>Let's imagine that before you left to come here, you needed to figure out whether to wear a coat.</p><p>You might request the weather data to do that. There's 408 bits of data here (using UTF-8 encoding) and a lot of waste. We know where and when we are...</p></footer></section><section class="slide"><h2>Is it cold in Berlin today?</h2><pre>
<code>
Yes, you'll need a coat
</code></pre><footer class="footer"><p>If we rephrase the question, and remove the redundant content from the answer, we can cut the data down to 184 bits with no loss of information.</p></footer></section><section class="slide"><h2>Should I take a coat?</h2><pre>
<code style="font-size: 50px">
✓
</code></pre><footer class="footer"><p>Although all we really need is a single bit.</p></footer></section>
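<section class="slide"><h2>Counting the bits</h2><p>A rough sketch in Python of the bit counts quoted in these notes, assuming UTF-8 encoding:</p><pre>
<code>
verbose = "Location: Berlin\nDate: 2019-09-24\nTemperature: 15°"
concise = "Yes, you'll need a coat"
for message in (verbose, concise):
    # 8 bits per encoded byte: 408 bits, then 184 bits
    print(len(message.encode("utf-8")) * 8, "bits")
</code></pre><footer class="footer"><p>The decision itself only needs a single bit: coat or no coat.</p></footer></section>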
<section class="slide"><h2 class="shout">Big Data !=<br />Big Information</h2><footer class="footer"><p>Pursuing or glorifying lots of data for its own sake is a bit daft. It's like boasting about how much email you have in your inbox. You can increase the amount of data without increasing the amount of information or knowledge.</p><p>This is what information entropy is telling us. Information is not necessarily proportional to data. We can decrease the amount of data without losing information.</p><p>Indeed, the whole purpose behind Big Data technologies is to extract small information from big data, i.e. to extract structure from unstructured data - to find the vital email amongst the spam.</p></footer></section><section class="slide"><h2>Why should you care about taming your data?</h2><p class="next">Understand what information is contained in your data</p><p class="next">Discover what's really valuable about your data</p><p class="next">Learn how to structure and organise it effectively</p></section><section class="slide"><h2>Structure of the talk</h2><ol><li>Theory: Information entropy</li><li>Practice: Tame your data for analysis</li></ol></section><section class="slide"><h2 class="shout">Information Entropy</h2><footer class="footer">A way to measure the irreducible content of an information source</footer></section><section class="slide clear"><h2>Communication Theory</h2><figure><img alt="Shannon's communication system" src="pictures/Shannon_communication_system.svg" width="80%" class="cover place bottom" /><figcaption class="copyright bottom">Shannon (1948) A Mathematical Theory of Communication</figcaption></figure><footer class="footer"><p>Information theory is really communication theory. It's a mathematical theory, not a physical one. It's quite abstract.</p><p>The model defines data as a symbolic representation of an informative message that's used for transmission.</p><p>It's fortunate that it came about in the time of telegraph communication; it's harder to imagine such a simple abstraction being obvious today.</p></footer></section><section class="slide clear"><div class="columns two"><div style="text-align: center"><p>Source<br />↓<br />Encoder<br />↓<br />Symbol<br />↓<br />Decoder<br />↓<br />Destination</p></div><div class="next"><p>Freedom of expression<br /><br /><br /><br />Measurable quantity<br /><br /><br /><br />Thirst for knowledge</p></div></div><footer class="footer"><p>Communication theory concerns itself with choosing the optimum coding for a message. This should be chosen to maximise the information throughput of a channel with fixed capacity. This gives the sender the maximum freedom of expression and best quenches the receiver's thirst for knowledge.</p><p>We measure the amount of information in terms of how much the symbolic representation is able to convey.</p></footer></section><section class="slide clear"><div class="columns two"><div style="text-align: center"><p>Source<br />↓<br />Encoder<br />↓<br />Symbol<br />↓<br />Decoder<br />↓<br />Destination</p></div><div><p>Unobserved choices<br /><br /><br /><br />Measure of information entropy<br /><br /><br /><br />Resolved uncertainty</p></div></div><footer class="footer"><p>Information entropy invites us to make this measurement in terms of the number of choices that the sender could make (that weren't observed, and so need to be communicated) and, equivalently, in terms of the amount of uncertainty that is resolved upon receipt of the message.</p><p>If the source is a letter of the alphabet, this involves much more choice than a traffic light, which can only be one of three colours.</p><p>From this perspective, information is that which you otherwise don't know - that which needs to be conveyed.</p></footer></section><section class="slide"><p class="note">So how do we measure choice or uncertainty?</p><div class="next"><h2>What's more surprising?</h2><p>Correctly guessing the result of:</p><ul><li>tossing a coin or</li><li>rolling a dice?</li></ul></div><footer class="footer"><p>You'd be more surprised if I guessed a dice roll correctly than a coin flip.</p><p>The outcome of a dice roll is more surprising as there are more possible outcomes (more options to choose from, more uncertainty).</p><p>We can use probability as a measure of information content.</p></footer></section><section class="slide clear"><img src="pictures/fair_coin_3_4.png" class="cover place left" /><img src="pictures/fair_dice_3_4.png" class="cover place right" /><footer class="footer"><p>width shows the number of possible outcomes, height shows the probability of each</p><p>both are uniform distributions - each outcome is equally likely, maximising surprise</p><p>***the bars for the dice are lower than for the coin - each outcome is less likely, so learning it resolves more uncertainty***</p><p>This is a fair coin; what about a biased one?</p></footer></section><section class="slide clear"><img src="pictures/fair_coin_2_4.png" class="cover place left" /><img src="pictures/biased_coin_2_4.png" class="cover place next" /><img src="pictures/trick_coin_2_4.png" class="cover place right next" /><footer class="footer"><p>Heads is the mode (the most common value, where the distribution peaks) and is more likely than tails. We're expecting heads even before the coin lands. We'd be more surprised to see tails.</p><p>What if we pursue this further: what would the probability distribution look like for a two-headed coin?</p><p>The outcome is certain; there's no information conveyed by learning the outcome as you knew it all along.</p></footer></section>
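<section class="slide"><h2>Measuring surprise</h2><p>A small sketch in Python, anticipating the information content formula defined later in the talk - more surprising outcomes carry more bits:</p><pre>
<code>
from math import log2

def surprise(p):
    """Bits of information in an outcome that has probability p."""
    return -log2(p)

print(surprise(1/2))  # fair coin: 1.0 bit
print(surprise(1/6))  # fair dice: ~2.58 bits
print(surprise(3/4))  # biased coin, heads: ~0.42 bits
print(surprise(1/4))  # biased coin, tails: 2.0 bits
</code></pre><footer class="footer"><p>The two-headed coin's only outcome has probability 1 and so carries 0 bits.</p></footer></section>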
<section class="slide"><h2>How can knowing the distribution of responses reduce the amount of data we need to communicate?</h2></section><section class="slide clear"><img src="pictures/morse_comparison.png" class="cover place" /><footer class="footer"><p>But how can knowing about the distribution of outcomes reduce the amount of information that needs to be transmitted?</p><p>The trick comes from our choice of coding system. If we know about the distribution of possible messages, we can assign the more common ones a shorter code, saving the longer codes for the less common messages. This reduces the average expected message length, allowing us to communicate more information within a limited channel capacity.</p><p>Morse code provides a good example of this. The more common letters get shorter codes - compare e.g. E and Z. Morse actually figured this out by visiting a mechanical print workshop and counting the number of letters in a typesetter's toolbox.</p></footer></section><section class="slide"><h2>Sending one toss at a time</h2><div class="columns two"><div><p>Outcome Probabilities:$$\begin{align} P(H) &= 3/4 \\ P(T) &= 1/4\\ \end{align}$$</p></div><div class="next"><p>Simple binary coding:$$\begin{align} H &= 0 \\ T &= 1\\ \end{align}$$</p></div></div><p class="next">$$\begin{align} \text{Average bits per toss} = & P(H) \times 1 \text{ bit} + \\ & P(T) \times 1 \text{ bit} \\ = & 1 \text{ bit} \end{align}$$</p><footer class="footer">Let's imagine trying to communicate the outcome of tossing the biased coin - here, a coin that lands heads up three quarters of the time. If we communicate the outcome of each coin toss one at a time, then we could send a single bit each time. We can imagine a better encoding though, which makes use of what we know about the probability distribution to compress our messages. We can communicate the same amount of information in less data (fewer bits). To do this we group outcomes (coin tosses) into blocks, and send multiple results at a time.</footer></section><section class="slide clear"><h2>Block encoding (two tosses)</h2><div class="columns two"><div><p>Outcome Probabilities:$$\begin{align} P(HH) &= 9/16 \\ P(HT) &= 3/16 \\ P(TH) &= 3/16 \\ P(TT) &= 1/16 \\ \end{align}$$</p></div><div class="next"><p>Huffman coding:$$\begin{align} HH &= 0 \\ HT &= 10 \\ TH &= 110 \\ TT &= 111 \\ \end{align}$$</p></div></div><p class="next">$$\begin{align} \text{Average bits per sequence} = & P(HH) \times 1 \text{ bit } + P(HT) \times 2 \text{ bits } + \\ & P(TH) \times 3 \text{ bits } + P(TT) \times 3 \text{ bits} \\ = & 1.6875 \text{ bits} \\ \text{Average bits per toss} = & 1.6875/2 \\ = & 0.84375 \text{ bits} \\ \end{align}$$</p><footer class="footer">If we send the results of a sequence of two tosses at once, we have the distribution of outcomes you see on the left. A pair of heads is 9 times more common on this biased coin than a pair of tails. A Huffman coding system assigns shorter codes to more common sequences and longer codes to less common ones. Instead of using 2 bits to communicate the results of two tosses, we sometimes send 1 and sometimes 3. On average though, the number of bits required per toss is less than 1! If we think about the trick coin with two heads, we only need to send one result ever - to communicate which way the coin is biased - and then this counts for an infinite number of tosses, averaging out to 0 information per toss.</footer></section>
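<section class="slide"><h2>Checking the arithmetic</h2><p>A quick sketch in Python re-deriving the figures on the previous slide, using the Huffman code lengths shown there:</p><pre>
<code>
code_lengths = {"HH": 1, "HT": 2, "TH": 3, "TT": 3}
probabilities = {"HH": 9/16, "HT": 3/16, "TH": 3/16, "TT": 1/16}

bits_per_pair = sum(probabilities[s] * code_lengths[s] for s in code_lengths)
print(bits_per_pair)      # 1.6875 bits per two tosses
print(bits_per_pair / 2)  # 0.84375 bits per toss
</code></pre></section>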
<section class="slide"><h2>Information Entropy (motivation)</h2><p class="next">Tells us the theoretical limit for compression</p><p class="next">The irreducible information content of a source/variable</p><p class="next">A way of thinking about how valuable data is</p><footer class="footer">So the choice of coding system can radically compress the amount of data needed to communicate a given amount of information. But how far can we go with this? How good is the best possible coding system? This is what information entropy tells us.</footer></section><section class="slide"><h2>Information Content</h2><p>Bits of information required to communicate an outcome$$\begin{aligned} I(\text{outcome}) &= \log_2 \frac{1}{P(\text{outcome})} \\ &= - \log_2 P(\text{outcome}) \end{aligned}$$</p><footer class="footer">We begin with a definition of information content for a single outcome. In the previous example we knew from our coding system how many bits were required for each. We introduce the logarithm here as it tells us, for a given base, how many digits we need to communicate a given number of values. For binary digits, bits, we use base 2. For decimal digits, base 10. We can communicate a million values with 6 decimal digits, hence "log base 10 of a million is 6". We take the logarithm to find out how many bits we need to encode a given number of outcomes. We use the reciprocal of the probability (one over the probability of the outcome) to find the number of equally likely outcomes that would have that probability.</footer></section><section class="slide"><h2>Information Entropy</h2><p>Average bits of information per outcome$$H(\text{source}) = \sum_{\text{outcome} \in \text{source}} {-P(\text{outcome}) \log_2 {P(\text{outcome})}}$$</p><footer class="footer">Then we can calculate the information entropy by summing over all of the possible outcomes in a source.</footer></section>
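<section class="slide"><h2>Entropy of the biased coin</h2><p>A minimal sketch in Python applying the entropy formula to the biased coin from the earlier slides:</p><pre>
<code>
from math import log2

def entropy(probabilities):
    """Average bits of information per outcome."""
    return sum(-p * log2(p) for p in probabilities)

print(entropy([3/4, 1/4]))  # ~0.811 bits per toss (biased coin)
print(entropy([1/2, 1/2]))  # 1.0 bit per toss (fair coin)
</code></pre><footer class="footer"><p>The two-toss block code above averaged 0.84375 bits per toss; longer blocks would approach this 0.811-bit limit. For the two-headed coin, the single certain outcome gives an entropy of zero.</p></footer></section>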
<section class="slide clear"><img src="pictures/password_strength.png" height="100%" class="cover place" /></section><section class="slide"><h2>Information Entropy (interpretation)</h2><ul><li class="next">more information requires more data</li><li class="next">more data doesn't mean more information</li><li class="next">rare things are more informative</li><li class="next">data is part of a conversation</li><li class="next">add value by resolving uncertainty</li></ul></section><section class="slide"><p class="shout">More information<br />requires<br />more data</p><footer class="footer">Most obviously, that means more variables, more examples, a longer history and so on are all welcome.</footer></section><section class="slide"><h2>Retain precision</h2><p class="next">Keep more decimal places than you need to show</p><p class="next">Don't convert continuous variables (numbers) to discrete ones (categories)</p><p class="next">You can't unsimplify</p><footer class="footer">Categorisation is particularly risky if you're doing it before collection - you can't reverse this step, but you can always apply categories after. If you are categorising, be clear on your purpose: is there a natural set of partitions (e.g. days, weeks, months) or do you need to see the distribution of the data first to identify equally sized buckets?</footer></section><section class="slide clear"><img src="pictures/EnhancedImage.gif" height="100%" class="cover place" /></section><section class="slide clear"><figure><img src="pictures/image_completion.gif" class="cover place top" /><figcaption class="copyright bottom">https://bamos.github.io/2016/08/09/deep-completion/</figcaption></figure><footer class="footer">Images can be completed because the model encodes expectations about faces from data it's already seen (and remembers). You can unsimplify, but you need a knowledge base to train on. This technique couldn't guess someone's phone number or license plate. The model learns a representation that acts as a coding of the data it's seen.</footer></section><section class="slide clear"><figure><img src="pictures/lod-cloud-sm.jpg" class="cover place top" /><figcaption class="copyright bottom">CC BY https://www.lod-cloud.net/</figcaption></figure><footer class="footer">Use identifiers! Link to other sources - like the whole web of data, standardising vocabularies, etc. You can also declare facts and give them identifiers - e.g. adding classifications to data in order to handle them abstractly, such as customer lifetime value.</footer></section><section class="slide"><p class="shout">More data doesn't<br />mean more information</p></section><section class="slide"><h2>Normalisation</h2><p>A way to structure your data to reduce redundancy</p></section><section class="slide"><h3>1NF: one value per cell</h3><table><tbody><tr><th>Show</th><td>Actor</td></tr><tr><td>Brooklyn 99</td><td class="mark">Stephanie Beatriz, Terry Crews</td></tr></tbody></table><table class="next"><tbody><tr><th>Show</th><th>Actor</th></tr><tr><td>Brooklyn 99</td><td>Stephanie Beatriz</td></tr><tr><td>Brooklyn 99</td><td>Terry Crews</td></tr></tbody></table><footer class="footer">If we have more than one value in the cell, it requires other columns to apply to all actors and prevents us from saying something specific about each actor. In the revised version we can distinguish each of them.</footer></section><section class="slide"><h3>2NF: cells depend on the whole key</h3><div class="columns three"><div></div><div><table><tbody><tr><th>Show</th><td>Rating</td><th>Language</th><td>Subtitles</td></tr><tr><td>Rick and Morty</td><td class="mark">4</td><td>English</td><td>Available</td></tr><tr><td>Rick and Morty</td><td class="mark">4</td><td>German</td><td>Unavailable</td></tr></tbody></table></div><div></div></div><div class="columns three next"><div><table><tbody><tr><th>Show</th><td>Rating</td></tr><tr><td>Rick and Morty</td><td>4</td></tr></tbody></table></div><div><br /></div><div><table><tbody><tr><th>Show</th><th>Language</th><td>Subtitles</td></tr><tr><td>Rick and Morty</td><td>English</td><td>Available</td></tr><tr><td>Rick and Morty</td><td>German</td><td>Unavailable</td></tr></tbody></table></div></div><footer class="footer">The ratings column only depends upon the show, whereas the subtitles column depends upon the show and the language. If the show's rating changes, it needs to be updated in two rows at once. This redundancy is removed in the 2NF version.</footer></section><section class="slide"><h3>3NF: cells only depend on the key</h3><div class="columns three"><div><br /></div><div><table><tbody><tr><th>Show</th><td>Genre</td><td>Sub-Genre</td></tr><tr><td>Doctor Who</td><td class="mark">Fiction</td><td class="mark">Sci-Fi</td></tr></tbody></table></div><div></div></div><div class="columns three next"><div><table><tbody><tr><th>Show</th><td>Sub-Genre</td></tr><tr><td>Doctor Who</td><td>Sci-Fi</td></tr></tbody></table></div><div></div><div><table><tbody><tr><th>Sub-Genre</th><td>Genre</td></tr><tr><td>Sci-Fi</td><td>Fiction</td></tr></tbody></table></div></div><footer class="footer">Here the Genre and Sub-Genre columns both depend on the key, but they're not independent of each other: Genre depends on Sub-Genre. If we update the sub-genre of a show, the genre also needs to be updated. In the revised version, we only need to update the show's sub-genre in one place.</footer></section>
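<section class="slide"><h2>Normalisation in practice</h2><p>A hypothetical sketch of the 2NF split in code; pandas is an assumption here, any data frame library would do:</p><pre>
<code>
import pandas as pd

# Denormalised: the rating is repeated on every language row
shows = pd.DataFrame([
    {"show": "Rick and Morty", "rating": 4, "language": "English", "subtitles": True},
    {"show": "Rick and Morty", "rating": 4, "language": "German", "subtitles": False},
])

# Normalised: each fact is stated exactly once
ratings = shows[["show", "rating"]].drop_duplicates()
subtitles = shows[["show", "language", "subtitles"]]
</code></pre><footer class="footer"><p>A rating change now touches a single row in one table, rather than every language row.</p></footer></section>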
<section class="slide"><p class="note">What data scientists want...</p><h2>Tidy Data</h2><table><tbody><tr><th>Example ID</th><td>Feature A</td><td>Feature B</td><td>Feature C</td><td>Classification</td></tr><tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr><tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr><tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr></tbody></table><footer class="footer">This normalised form is what data scientists want: so-called tidy data. Each row is an example; each column provides an independent attribute - an explanatory feature. Ideally the data would also include a classification or some other attribute that is to be explained (although exploratory techniques exist where this is unknown - e.g. you're just looking for patterns, not seeking to predict something).</footer></section><section class="slide"><h2>Orthogonality</h2><p class="next">Informative variables correlate with the objective<br /><span class="next">but not with each other</span></p><figure><img src="pictures/types-of-correlation.png" class="cover place right" /><figcaption class="copyright bottom right">via visme.co</figcaption></figure><footer class="footer">Mutual information is another source of redundancy. Multi-collinearity can lead to over-fitting and model volatility.</footer></section><section class="slide clear"><h2>MECE</h2><figure><img src="pictures/mece.png" class="cover" /></figure><footer class="footer">Likewise, our categorical variables should capture reality in groups which have no overlap and leave no gaps. Overlap and gaps mean uncertainty.</footer></section><section class="slide"><h2 class="shout">Rarity is<br />informative</h2></section><section class="slide"><h2>Visualise your probability distributions</h2></section><section class="slide clear"><figure><img src="pictures/polish-exams.png" class="cover place" /><figcaption class="copyright bottom">Polish Central Examination Board Matura Test Scores 2013 cke.edu.pl</figcaption></figure><footer class="footer">Can you guess the minimum score to pass this exam?</footer></section><section class="slide clear"><figure><img src="pictures/anscombes-quartet.png" class="cover place" /><figcaption class="copyright bottom">Anscombe (1973) Graphs in Statistical Analysis reproduced in Anthony (2012) Qlik Design Blog</figcaption></figure></section><section class="slide clear"><img src="pictures/award-distribution.png" class="cover place" /><footer class="footer"><p>here we have a practical example I did a few years ago to analyse open data about charitable awards in the UK</p><p>the empirical cumulative distribution function plots steps with height `1/n` for each of the `n` data points</p><p>the median average is at 50%, half the height of the plot</p><p>here we're also using a log scale for the horizontal x axis, as the distribution is skewed with a long tail of very few, very high value grants</p><p>this is particularly useful for comparing distributions; you can see the award value thresholds as vertical lines</p></footer></section>
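<section class="slide"><h2>Sketching an ECDF</h2><p>A minimal sketch of an empirical cumulative distribution plot with a log-scaled x axis, assuming numpy and matplotlib; synthetic lognormal values stand in for the award data:</p><pre>
<code>
import numpy as np
import matplotlib.pyplot as plt

def plot_ecdf(values, label=None):
    """Step up by 1/n at each of the n sorted data points."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    plt.step(x, y, where="post", label=label)

plot_ecdf(np.random.lognormal(mean=9, sigma=1.5, size=1000), label="awards")
plt.xscale("log")                # long tails are easier to compare on a log axis
plt.axhline(0.5, linestyle="--") # the median sits at half the height of the plot
plt.show()
</code></pre></section>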
<section class="slide"><h2 class="shout">Data is part of<br />a conversation</h2></section><section class="slide"><figure><img src="pictures/seidenstrasse_ccc_cropped.jpg" width="100%" class="place bottom" /><figcaption class="copyright bottom white">CC BY flikr.com/photos/der_robert</figcaption></figure><footer class="footer">Given the power of data, and in particular the astonishing advances from machine learning, we're at risk of fetishising data itself. Communication theory encourages us instead to think about data in the context of a conversation that should be informative. We can negotiate the nature of that conversation. This can radically affect the value and usefulness of the data that is created.</footer></section><section class="slide"><h2>Start by defining the question</h2><footer class="footer"><p>Before starting any data work or analysis, you need to define the question. This ensures the message will be informative.</p></footer></section><section class="slide"><p class="note">Defining the Question...</p><h2 class="next">What decision needs to be taken?</h2><p class="next">If in doubt, follow the money!</p><footer class="footer"><p>All too often, people jump to a conclusion.</p></footer></section><section class="slide"><h2>Incorporate Context</h2><p>e.g. Unitless measures</p></section><section class="slide clear"><figure><img src="pictures/more_than_fair_share.png" class="cover" /><figcaption class="copyright right"><a href="https://www.reddit.com/r/dataisbeautiful/comments/d1psti/oc_which_countries_produce_a_greater_proportion/ezp0mmh/?utm_source=reddit&utm_medium=usertext&utm_name=dataisbeautiful&utm_content=t1_ezp0rw7">reddit.com/u/UnrequitedReason on /r/DataIsBeautiful</a></figcaption></figure><footer class="footer"><p>Unitless measures are proportions or rates: values that have been denominated. The denominator serves to transform the values to a common scale. This allows the value to incorporate the context that you'd otherwise need to know to interpret it.</p><p>In this example, we want to know which countries are polluting more than their fair share. We begin with CO2 emissions, in tonnes. But is a million tonnes a lot? This is then denominated by the world total to give a percentage per country. This in turn can be improved by incorporating further context - namely that different countries are different sizes. We end up with a simple metric - less than one is good, more than one is bad, 4x is twice as bad as 2x.</p></footer></section><section class="slide"><h2>Provide Metadata</h2><p>Describe and explain your data</p><p>Aid discovery and interpretation</p><p>Track provenance</p><p>Handles to manipulate your data</p></section><section class="slide"><h2>Don't put data in the keys</h2><pre>
<code>
{
"2017": 8000,
"2018": 10000,
"2019": 15000
}
</code></pre></section><section class="slide"><h2>Data frames are easier to manipulate</h2><pre>
<code>
[
{ "date": "2017", "pageviews": 8000 },
{ "date": "2018", "pageviews": 10000 },
{ "date": "2019", "pageviews": 15000 }
]
</code></pre></section><section class="slide"><h2 class="shout">Add value by<br />resolving<br />uncertainty</h2></section><section class="slide"><h2>Information Entropy</h2><ul><li>more information requires more data - retain precision, link up data</li><li>more data doesn't mean more information - normalised tidy data, MECE, orthogonality</li><li>rare things are more informative - visualise distributions</li><li>data is part of a conversation - start by defining the question/decision, incorporate context, use metadata</li><li>add value by resolving uncertainty</li></ul></section><section class="slide title clear"><img src="pictures/slide-rule-wide.png" height="100%" style="z-index: -1" class="place" /><h1>Taming Data with Science</h1><p><a href="https://github.com/robsteranium/tame-data-with-science">github.com/robsteranium/tame-data-with-science</a></p><p><a href="mailto:[email protected]">[email protected]</a></p><p><a href="http://twitter.com/robsteranium">@robsteranium</a></p></section><div class="progress"></div><script src="shower/shower.min.js"></script><script src="mathjax/tex-chtml.js" id="MathJax-Script"></script></body></html>