<!-- % add reference **** -->
Bias and Fairness {#chap:bias}
================
**Kit T. Rodolfa, Pedro Saleiro, and Rayid Ghani**
Interest in algorithmic fairness and bias has been growing recently (for good reason), but it's easy to get lost in the large number of definitions and metrics. There are many different, often competing, ways to measure whether a given model is statistically "fair," but it's important to start from the social and policy goals for equity and fairness and map those to the statistical properties we want our models to have in order to achieve those goals. In this chapter, we provide an overview of these statistical metrics along with some concrete examples to help you navigate these concepts and understand the trade-offs involved in choosing to optimize one metric over others, focusing on the metrics relevant to the binary classification methods frequently used in risk-based models for policy settings.
Introduction
------------
In Chapter [Machine Learning](#chap:ml), you learned about several of the
concepts, tools, and approaches used in the field of machine learning and
how they can be used in the social sciences. That chapter focused on evaluation
metrics such as precision (positive predictive value), recall
(sensitivity), area under the curve (AUC), and accuracy that are often used
to measure the performance of machine learning methods. In most (if not
all) public policy problems, a key goal for the analytical systems
being developed is to help achieve equitable outcomes.
When machine learning models are being used to make decisions, they
cannot be separated from the social and ethical context in which they
are applied, and those developing and deploying these models must take
care to do so in a manner that accounts for both accuracy and fairness.
In this chapter, we will discuss sources of potential bias in your
modeling pipeline, as well as some of the ways that bias introduced by a
model can be measured, with a particular focus on classification
problems. Unfortunately, just as there is no single machine learning
algorithm that is best suited to every application, no one fairness
metric will fit every situation. However, we hope this chapter will
provide you with a grounding in the available ways of measuring
algorithmic fairness that will help you navigate the trade-offs involved
in putting these into practice in your own applications.
Sources of Bias {#sec:biassources}
---------------
Bias may be introduced into a machine learning project at any step along
the way, and it is important to think carefully through each potential
source and how it may affect your results. Some sources
may be difficult to measure precisely (or even at all), but that doesn't
mean these potential biases can simply be ignored when developing
interventions or performing analyses.
### Sample Bias
You’re likely familiar with sampling issues as a potential source of
bias in the contexts of causal inference and external validity in the
social science literature. A biased sample can be just as problematic
for machine learning as it can be for inference, and predictions made on
individuals or groups not represented in the training set are likely to
be unreliable. As such, any application of machine learning should start
with a careful understanding of the data generating process for the training
and test sets. What is the relevant population for the project and how
might some individuals be incorrectly excluded or included from the data
available for modeling or analysis?
If there is a mismatch between the available training data and the
population to whom the model will be applied, you may want to consider
whether it is possible to collect more representative data. A model to
evaluate the risk of health violations at restaurants may be of limited
applicability if the only training data available is based on
inspections that resulted from reported complaints. In such a case, an
initial trial of randomized inspections might provide a more
representative dataset. However, this may not always be possible. For
instance, in the case of bail determinations, labeled data will only be
available for individuals who are released under the existing system.
More broadly, several questions are worth asking about your sample:
- How does the available training data relate to the population that the
model will be applied to? If there is a mismatch, is it possible to
collect more appropriate data? In the bail example, subsequent outcome
data exist only for individuals who were released in the past.
- Even if the training data matches the population, are there underlying
systemic biases involved in defining that population in general (for
instance, over-policing of black neighborhoods)?
- For data with a time component, or for models that will be deployed to
aid future decisions, are there relevant policy changes in the past that
may make data from certain periods less relevant, or pending policy
changes going forward that may affect the modeling population?
Measurement here might be difficult, but it is helpful to think through each
of these questions in detail. Often, other sources of data (even in
aggregate form) can provide some insight into how representative your data
may be, including census data, surveys, and academic studies in the
relevant area.
### Label (Outcome) Bias
Regardless of whether your dataset reflects a representative sample of
the relevant population for your intervention or analysis, there may
also be bias inherent in the labels (that is, the measured outcomes)
associated with individuals in that data.
One mechanism by which bias may be introduced is in how the label/outcome itself
is defined. For instance, a study of recidivism might use a new arrest
as an outcome variable when the outcome it really cares about is whether a
new crime was committed. However, if some groups are policed more heavily than others,
using arrests to define the outcome variable may introduce bias into the
system’s decisions. Similarly, a label that relies on the number of days
an individual has been incarcerated would reflect known biases in
sentence lengths between black and white defendants.
A related mechanism is measurement error. Even when the outcome of
interest is well-defined and can be measured directly, bias may be
introduced through differential measurement accuracy across groups. For
instance, data collected through survey research might suffer from
language barriers or cultural differences in social desirability that
introduce measurement errors across groups.
### Machine Learning Pipeline Bias {#sec:mlbiasexamples}
Biases can be introduced by the handling and transformation of data
throughout the machine learning pipeline as well, requiring careful
consideration as you ingest data, create features, and model outcomes of
interest. Below are a few examples at each stage of the process, but
these are far from exhaustive and intended to help motivate thinking
about how bias might be introduced in your own projects.
**Ingesting Data:** The process of loading, cleaning, and reconciling
data from a variety of data sources (often referred to as ETL) can
introduce a number of errors that might have differential downstream
impacts on different populations:
- Are your processes for matching individuals across data sources
equally accurate across different populations? For instance, married
vs maiden names may bias match rates against women, while
inconsistencies in handling of multi-part last names may make
matching less reliable for Hispanic individuals.
- Nickname dictionaries used in record reconciliation might be derived
from different populations than your population of interest.
- A data loading process that drops records with “special characters”
might inadvertently exclude names with accents or tildes.
**Feature Engineering:** Biases are easy to introduce during the process
of constructing features, both in the handling of features that relate
directly to protected classes as well as information that correlates
with these populations (such as geolocation). A few examples include:
- Dictionaries to infer age or gender from name might be derived from
a population that is not relevant to your problem.
- Handling of missing values and combining “other” categories can
become problematic, especially for multi-racial individuals or
people with non-binary gender.
- Thought should be given to how race and ethnicity indicators are
collected – are these self-reported, recorded by a third party, or
inferred from other data? The data collection process may inform the
accuracy of the data and how errors differ across populations.
- Features that rely on geocoding to incorporate information based on
distances or geographic aggregates may miss homeless individuals or
provide less predictive power for more mobile populations.
**Modeling:** The model itself may introduce bias into decisions made
from its scores by performing worse on some groups relative to others
(many examples have been highlighted in the popular press recently, such as
racial biases in facial recognition algorithms and gender biases in
targeting algorithms for job advertisement on social media). Because of
the complex correlation structure of the data, it generally isn’t
sufficient to simply leave out the protected attributes and assume this
will result in fair outcomes. Rather, model performance across groups
needs to be measured directly in order to understand and address any
biases. However, there are many (often incompatible) ways to define
fairness and Section [metrics](#sec:metrics) will take a closer look at these
options in much more detail.
Much of the remainder of this chapter focuses on how we might define and
measure fairness at the level of the machine learning pipeline itself.
In Section [metrics](#sec:metrics), we will introduce several of the metrics
used to measure algorithmic fairness and in Section [applications](#sec:applications) we discuss how these can be used in the
process of evaluating and selecting machine learning models.
### Application Bias
A final potential source of bias worth considering is how the model or
analysis might be put into use in practice. One way this might happen is
through heterogeneity in the effectiveness of an intervention across
groups. For instance, imagine a machine learning model to identify
individuals most at risk for developing diabetes in the next 3 years for
a particular preventive treatment. If the treatment is much more
effective for individuals with a certain genetic background relative to
others, the overall outcome of the effort might be to exacerbate
disparities in diabetes rates even if the model itself is modeling risk
in an unbiased way.
Likewise, it is important to be aware of the risk of discriminatory
applications of a machine learning model. Perhaps a model developed to
screen out unqualified job candidates is only “trusted” by a hiring
manager for female candidates but often ignored or overridden for men.
In a perverse way, applying an unbiased model in such a context might
serve to increase inequities by giving bad actors more information with
which to (wrongly) justify their discriminatory practices.
While there may be relatively little you can do to detect or mitigate
these types of bias at the modeling stage, performing a trial to compare
current practice with a deployed model can be instructive where doing so
is feasible. Keep in mind, of course, that the potential for machine
learning systems to be applied in biased ways shouldn’t be construed as
an argument against developing these systems at all any more than it
would be reasonable to suggest that current practices are likely to be
free of bias. Rather, it is an argument for thinking carefully about
both the status quo and how it may change in the presence of such a
system, putting in place legal and technical safeguards to help ensure
that these methods are applied in socially responsible ways.
### Considering Bias When Deploying Your Model
Ultimately, what we care about is how putting a
model into practice will affect some overall concept of social welfare
and fairness, as influenced by all of these possible sources of bias. While
this is generally impossible to measure in a quantitative way, it can
provide a valuable framework for qualitatively evaluating the potential
impact of your model. For most of the remainder of this chapter, we
consider a set of more quantitative metrics that can be applied
specifically to the predictions of a machine learning pipeline. Keep in
mind, however, that these metrics only apply to the sample and labels you
have: ignoring other sources of bias that may be at play in the underlying
data generating process could result in unfair outcomes even when applying
a model that appears to be "fair" by your chosen metric.
Dealing with Bias
-----------------
### Define Bias {#sec:metrics}
Section [bias examples](#sec:mlbiasexamples) provided some examples
for how bias might be introduced in the process of using machine learning
to work with a dataset. While far from exhaustive as a source of potential
bias in an overall application, these biases can be more readily measured
and addressed through choices made during data preparation, modeling,
and model selection. This section focuses on detecting and understanding
biases introduced at this stage of the process.
One key challenge, however, is that there is no universally-accepted
definition of what it means for a model to be fair. Take the example of
a model being used to make bail determinations. Different people might
consider it “fair” if:
- It makes mistakes about denying bail to an equal number of white and
black individuals
- The chances that a given black or white person will be wrongly
denied bail is equal, regardless of race
- Among the jailed population, the probability of having been wrongly
denied bail is independent of race
- For people who should be released, the chances that a given black or
white person will be denied bail is equal
In different contexts, reasonable arguments can be made for each of
these potential definitions, but unfortunately, not all of them can hold
at the same time. The remainder of this section explores these competing
options and how to approach them in more detail.
### Definitions
Most of the metrics used to assess model fairness relate either to the
types of errors a model might make or how predictive the model is across
different groups. For binary classification models (which we focus on
here), these are generally derived from values in the *confusion matrix*
(see Figure 6.9 and Section 6.6.2 for more details):
- **True Positives ($TP$)** are individuals for whom both the model
prediction and actual outcome are positive labels.
- **False Positives ($FP$)** are individuals for whom the model
predicts a positive label, but the actual outcome is a negative
label.
- **True Negatives ($TN$)** are individuals for whom both the model
prediction and actual outcome are negative labels.
- **False Negatives ($FN$)** are individuals for whom the model
predicts a negative label, but the actual outcome is a positive
label.
Based on these four categories, we can calculate several ratios that are
instructive for thinking about the equity of a model’s predictions in
different situations (Sections [punitive example](#sec:punitiveexample) and [assistive example](#sec:assistiveexample) provide some detailed examples here):
- **False Positive Rate ($FPR$)** is the fraction of individuals with
negative actual labels who the model misclassifies with a positive
predicted label: $FPR = FP / (FP+TN)$
- **False Negative Rate ($FNR$)** is the fraction of individuals with
positive actual labels who the model misclassifies with a negative
predicted label: $FNR = FN / (FN+TP)$
- **False Discovery Rate ($FDR$)** is the fraction of individuals who
the model predicts to have a positive label but for whom the actual
label is negative: $FDR = FP / (FP+TP)$
- **False Omission Rate ($FOR$)** is the fraction of individuals who
the model predicts to have a negative label but for whom the actual
label is positive: $FOR = FN / (FN+TN)$
- **Precision** is the fraction of individuals who the model predicts
to have a positive label about whom this prediction is correct:
$\textrm{precision} = TP / (FP+TP)$
- **Recall** is the fraction of individuals with positive actual
labels who the model has correctly classified as such:
$\textrm{recall} = TP / (FN+TP)$
For the first two metrics ($FPR$ and $FNR$), notice that the denominator
is based on actual outcomes (rather than model predictions), while in
the next two ($FDR$ and $FOR$) the denominator is based on model
predictions (whether an individual falls above or below the threshold
used to turn model scores into 0/1 predicted classes). The final two
metrics relate to correct predictions rather than errors, but are
directly related to error measurements (that is,
$\textrm{recall} = 1-FNR$ and $\textrm{precision} = 1-FDR$) and may
sometimes have better properties for calculating model bias.
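To make these definitions concrete, here is a minimal sketch (in Python, not taken from the original text) of how these metrics might be computed separately for each group from a table of predicted and actual labels. The data frame and its column names are hypothetical placeholders.

```python
# A sketch, assuming a pandas DataFrame with 0/1 actual and predicted labels
# plus a column identifying each individual's group.
import pandas as pd

def group_metrics(df, group_col, label_col, pred_col):
    """Compute FPR, FNR, FDR, FOR, precision, and recall for each group."""
    rows = []
    for group, g in df.groupby(group_col):
        tp = ((g[pred_col] == 1) & (g[label_col] == 1)).sum()
        fp = ((g[pred_col] == 1) & (g[label_col] == 0)).sum()
        tn = ((g[pred_col] == 0) & (g[label_col] == 0)).sum()
        fn = ((g[pred_col] == 0) & (g[label_col] == 1)).sum()
        rows.append({
            group_col: group,
            "FPR": fp / (fp + tn) if (fp + tn) else float("nan"),
            "FNR": fn / (fn + tp) if (fn + tp) else float("nan"),
            "FDR": fp / (fp + tp) if (fp + tp) else float("nan"),
            "FOR": fn / (fn + tn) if (fn + tn) else float("nan"),
            "precision": tp / (fp + tp) if (fp + tp) else float("nan"),
            "recall": tp / (fn + tp) if (fn + tp) else float("nan"),
        })
    return pd.DataFrame(rows)

# Example usage with hypothetical columns "race", "label", and "predicted":
# metrics = group_metrics(df, "race", "label", "predicted")
```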
Notice that the metrics defined here require the use of a threshold to
turn modeled scores into 0/1 predicted classes and are therefore most
useful when either a threshold is well-defined for the problem (e.g.,
when available resources mean a program can only serve a given number of
individuals) or where calculating these metrics at different threshold
levels might be used (along with model performance metrics) to choose a
threshold for application. In some cases, it may also be of interest to
think about equity across the full distribution of the modeled score.
Common practices in these situations are to look at how model
performance metrics such as the area under the receiver-operator curve
($AUC-ROC$) or model calibration compare across subgroups (such as by
race, gender, age). Or, in cases where the underlying causal
relationships are well-known, counterfactual methods may also be used to
assess a model’s bias (these methods may also be useful when you suspect
label bias to be an issue in your data). We don’t explore these topics
deeply here, but refer you to the relevant references if you would
like to learn more ****.
### Choosing Bias Metrics
Any of the metrics defined above can be used to calculate disparities
across groups in your data and (unless you have a perfect model) many of
them cannot be balanced across subgroups at the same time. As a result,
one of the most important — and frequently most challenging — aspects of
measuring bias in your machine learning pipeline is simply understanding
how “fairness” should be defined for your particular case.
In general, this requires consideration of the project’s goals and a
detailed discussion between the data scientists, decision makers, and
those who will be affected by the application of the model. Each
perspective may have a different concept of fairness and a different
understanding of harm involved in making different types of errors, both
at individual and societal levels. Importantly, data scientists have a
critical role in this conversation, both as the experts in understanding
how different concepts of fairness might translate into metrics and
measurement and as individuals with experience deploying similar models.
While there is no universally correct definition of fairness, nor one
that can be learned from the data, this doesn’t excuse the data
scientists from responsibility for taking part in the conversation
around fairness and equity in their models and helping decision makers
understand the options and trade-offs involved.
Practically speaking, coming to an agreement on how fairness should be
measured in a purely abstract manner is likely to be difficult. Often it
can be instructive instead to explore different options and metrics
based on preliminary results, providing tangible context for potential
trade-offs between overall performance and different definitions of
equity and helping guide stakeholders through the process of deciding
what to optimize. The remainder of this section looks at some of the
metrics that may be of particular interest in different types of
applications:
- If your intervention is punitive in nature (e.g., determining whom
to deny bail), individuals may be harmed by intervening on them in
error so you may care more about metrics that focus on false
positives. Section [punitive example](#sec:punitiveexample) provides
an example to guide you through what some of these metrics mean in this case.
- If your intervention is assistive in nature (e.g., determining who
should receive a food subsidy), individuals may be harmed by failing
to intervene on them when they have need, so you may care more about
metrics that focus on false negatives. Section
[assistive example](#sec:assistiveexample) provides an example to guide you
through metrics that may be applicable in this case.
- If your resources are significantly constrained such that you can
only intervene on a small fraction of the population in need, some
of the metrics described here may be of limited use and Section
[constrained assistive](#sec:constrainedassistive) describes this
case in more detail.
### Punitive Example {#sec:punitiveexample}
When the application of a risk model is punitive in nature, individuals
may be harmed by being incorrectly included in the “high risk”
population that receives an intervention. In an extreme case, we can
think of this as incorrectly detaining an innocent person in jail.
Hence, with punitive interventions, we focus on bias and fairness
metrics based on false positives.
We might naturally think about the number of people wrongly jailed from
each group as a reasonable place to start for assessing whether our model
is biased. Here, we are concerned with statements like “twice as many
people from Group A were wrongly convicted as from Group B.”
\^(In probabilistic terms, we could express this as:
$$P(\textrm{wrongly jailed, group $i$}) = C~~~\forall~i$$ Where $C$ is a
constant value. Or, alternatively,
$$\frac{FP_i}{FP_j} = 1~~~\forall~i,j$$ Where $FP_i$ is the number of
false positives in group $i$.)
However, it is unclear whether differences in the number of false
positives across groups reflect unfairness in the model. For instance,
if there are twice as many people in Group A as there are in Group B,
some might deem the situation described above as fair from the
standpoint that the composition of the false positives reflects the
composition of the groups. This brings us to our second metric:
By accounting for differently sized groups, we ask the question, “just
by virtue of the fact that an individual is a member of a given group,
what are the chances they’ll be wrongly convicted?”
\^(in terms of probability,
$$P(\textrm{wrongly jailed $\mid$ group $i$}) = C~~~\forall~i$$ Where
$C$ is a constant value. Or, alternatively,
$$\frac{FP_i}{FP_j} = \frac{n_i}{n_j}~~~\forall~i,j$$ Where $FP_i$ is
the number of false positives and $n_i$ the total number of individuals
in group $i$.)
While this metric might feel like it meets a reasonable criterion of
avoiding treating groups differently in terms of classification errors,
there are other sources of disparities we might care about as well. For
instance, suppose there are 10,000 individuals in Group A and 30,000 in
Group B. Suppose further that 100 individuals from each group are jailed,
with 10 Group A people wrongly convicted and 30 Group B people wrongly
convicted. We’ve balanced the number of false positives by group size
(0.1% for both groups) so there are no disparities as far as this metric
is concerned, but note that 10% of the jailed Group A individuals are
innocent compared to 30% of the jailed Group B individuals. The next
metric is concerned with unfairness in this way:
The False Discovery Rate ($FDR$) focuses specifically on the people who
are affected by the intervention — in the example above, among the 200
people in jail, what are the group-level disparities in rates of wrong
convictions? The jail example is particularly instructive here as we
could imagine the social cost of disparities manifesting directly
through inmates observing how frequently different groups are wrongly
convicted.
\^(In probabilistic terms,
$$P(\textrm{wrongly jailed $\mid$ jailed, group $i$}) = C~~~\forall~i$$
Where $C$ is a constant value. Or, alternatively,
$$\frac{FP_i}{FP_j} = \frac{k_i}{k_j}~~~\forall~i,j$$ Where $FP_i$ is
the number of false positives and $k_i$ the total number of *jailed*
individuals in group $i$.)
The False Positive Rate ($FPR$) focuses on a different subset,
specifically, the individuals who should **not** be subject to the
intervention. Here, this would ask, “for an *innocent* person, what are
the chances they will be wrongly convicted by virtue of the fact that
they’re a member of a given group?”
\^(In probabilistic terms,
$$P(\textrm{wrongly jailed $\mid$ innocent, group $i$}) = C~~~\forall~i$$
Where $C$ is a constant value. Or, alternatively,
$$\frac{FP_i}{FP_j} = \frac{n_i \times (1-p_i)}{n_j \times (1-p_j)}~~~\forall~i,j$$
Where $FP_i$ is the number of false positives, $n_i$ the total number of
individuals, and $p_i$ is the prevalence (here, rate of being truly
guilty) in group $i$.)
The difference between choosing to focus on the $FPR$ and choosing group
size-adjusted false positives is somewhat nuanced and warrants
highlighting:
- Having no disparities in group size-adjusted false positives implies
that, if I were to choose a random person from a given group
(regardless of group-level crime rates or their individual guilt or
innocence), I would have the same chance of picking out a wrongly
convicted person across groups.
- Having no disparities in $FPR$ implies that, if I were to choose a
random *innocent* person from a given group, I would have the same
chance of picking out a wrongly convicted person across groups.
By way of example, imagine you have a society with two groups (A and B)
and a criminal process with equal $FDR$ and group-size adjusted false
positives with:
- Group A has 1000 total individuals, of whom 100 have been jailed
with 10 wrongfully convicted. Suppose the other 900 are all guilty.
- Group B has 3000 total individuals, of whom 300 have been jailed
with 30 wrongfully convicted. Suppose the other 2700 are all
innocent.
\^(In this case, $$\begin{aligned}
&\frac{FP_A}{n_A} = \frac{10}{1000} = 1.0\% \\
&FDR_A = \frac{10}{100} = 10.0\% \\
&FPR_A = \frac{10}{10} = 100.0\%\end{aligned}$$
while, $$\begin{aligned}
&\frac{FP_B}{n_B} = \frac{30}{3000} = 1.0\% \\
&FDR_B = \frac{30}{300} = 10.0\% \\
&FPR_B = \frac{30}{2730} = 1.1\%\end{aligned}$$)
That is,
- A randomly chosen individual has the same chance (1.0%) of being
wrongly convicted regardless of which group they belong to
- In both groups, a randomly chosen person who is convicted has the
same chance (10.0%) of actually being innocent
- HOWEVER, an innocent person in Group A is certain to be wrongly
convicted, nearly 100 times the rate of an innocent person in Group
B
While this is an exaggerated case for illustrative purposes, there is a
more general principle at play here, namely: when prevalences differ
across groups, disparities cannot be eliminated from both the $FPR$ and
group-size adjusted false positives at the same time (in the absence of
perfect prediction).
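The numbers in this illustration are easy to verify directly; the following minimal sketch (in Python, not from the original text) simply recomputes the three metrics from the counts given above.

```python
# Counts from the worked example: group size, number jailed, false positives
# (wrongly jailed), and true negatives (innocent and not jailed).
groups = {
    # Group A: 1000 people, 100 jailed (10 wrongly), remaining 900 all guilty
    "A": {"n": 1000, "jailed": 100, "fp": 10, "tn": 0},
    # Group B: 3000 people, 300 jailed (30 wrongly), remaining 2700 all innocent
    "B": {"n": 3000, "jailed": 300, "fp": 30, "tn": 2700},
}

for name, g in groups.items():
    fp_by_size = g["fp"] / g["n"]        # group-size adjusted false positives
    fdr = g["fp"] / g["jailed"]          # false discovery rate
    fpr = g["fp"] / (g["fp"] + g["tn"])  # false positive rate
    print(f"Group {name}: FP/n = {fp_by_size:.1%}, FDR = {fdr:.1%}, FPR = {fpr:.1%}")

# Both groups show FP/n = 1.0% and FDR = 10.0%, but FPR is 100.0% for
# Group A and 1.1% for Group B, as described in the text.
```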
While there is no universal rule for choosing a bias metric (or set of
metrics) to prioritize, it is important to keep in mind that there are
both theoretical and practical limits on the degree to which these
metrics can be jointly optimized.
Balancing these trade-offs will generally require some degree of
subjective judgment on the part of policy makers. For instance, if there
is uncertainty in the quality of the labels (e.g., how well can we truly
measure the size of the innocent population?), it may make more sense in
practical terms to focus on the group-size adjusted false positives than
on the $FPR$.
### Assistive Example {#sec:assistiveexample}
By contrast to the punitive case, when the application of a risk model
is assistive in nature, individuals may be harmed by being incorrectly
excluded from the “high risk” population that receives an intervention.
Here, we use identifying families to receive a food assistance benefit
as a motivating example. Where the punitive case focused on errors of
inclusion through false positives, most of the metrics of interest in
the assistive case focus on analogues that measure errors of omission
through false negatives.
#### Count of False Negatives
A natural starting point for understanding whether a program is being
applied fairly is to count how many people it is missing from each
group, focusing on statements like “twice as many families with need for
food assistance from Group A were missed by the benefit as from Group
B.”
\^(In probabilistic terms, we could express this as:
$$P(\textrm{missed by benefit, group $i$}) = C~~~\forall~i$$ Where $C$
is a constant value. Or, alternatively,
$$\frac{FN_i}{FN_j} = 1~~~\forall~i,j$$ Where $FN_i$ is the number of
false negatives in group $i$.)
Differences in the number of false negatives by group, however, may be
relatively limited in measuring equity when the groups are very
different in size. If there are twice as many families in Group A as in
Group B in the example above, the larger number of false negatives might
not be seen as inequitable, which motivates our next metric:
#### Group Size-Adjusted False Negatives
To account for differently sized groups, one way of phrasing the
question of fairness is to ask, “just by virtue of the fact that an
individual is a member of a given group, what are the chances they will
be missed by the food subsidy?”
\^(That is, in terms of probability,
$$P(\textrm{missed by benefit $\mid$ group $i$}) = C~~~\forall~i$$ Where
$C$ is a constant value. Or, alternatively,
$$\frac{FN_i}{FN_j} = \frac{n_i}{n_j}~~~\forall~i,j$$ Where $FN_i$ is
the number of false negatives and $n_i$ the total number of families in
group $i$.)
While avoiding disparities on this metric focuses on the reasonable goal
of treating different groups similarly in terms of classification
errors, we may also want to directly consider two subsets within each
group: (1) the set of families not receiving the subsidy, and (2) the
set of families who would benefit from receiving the subsidy. We take a
closer look at each of these cases below.
#### False Omission Rate
The False Omission Rate ($FOR$) focuses specifically on people on whom
the program doesn’t intervene – in our example, the set of families not
receiving the food subsidy. Such families will either be true negatives
(that is, those not in need of the assistance) or false negatives (that
is, those who did need assistance but were missed by the program), and
the $FOR$ asks what fraction of this set fall into the latter category.
\^(In probabilistic terms,
$$P(\textrm{missed by program $\mid$ no subsidy, group $i$}) = C~~~\forall~i$$
Where $C$ is a constant value. Or, alternatively,
$$\frac{FN_i}{FN_j} = \frac{n_i-k_i}{n_j-k_j}~~~\forall~i,j$$ Where
$FN_i$ is the number of false negatives, $k_i$ the number of families
receiving the subsidy, and $n_i$ is the total number of families in
group $i$.)
In practice, the $FOR$ can be a useful metric in many situations,
particularly because need can often be more easily measured among
individuals not receiving a benefit than among those who do (for
instance, when the benefit affects the outcome on which need is
measured). However, when resources are constrained such that a program
can only reach a relatively small fraction of the population, its
utility is more limited. See [constrained assistive](#sec:constrainedassistive)
for more details on this case.
#### False Negative Rate
The False Negative Rate ($FNR$) focuses instead on the set of people
with need for the intervention. In our example, this asks the question,
“for a family that needs food assistance, what are the chances they will
be missed by the subsidy by virtue of the fact they’re a member of a
given group?”
\^(In probabilistic terms,
$$P(\textrm{missed by subsidy $\mid$ need assistance, group $i$}) = C~~~\forall~i$$
Where $C$ is a constant value. Or, alternatively,
$$\frac{FN_i}{FN_j} = \frac{n_i \times p_i}{n_j \times p_j}~~~\forall~i,j$$
Where $FN_i$ is the number of false negatives, $n_i$ the total number of
individuals, and $p_i$ is the prevalence (here, rate of need for food
assistance) in group $i$.)
As with the punitive case, there is some nuance in the difference
between choosing to focus on group-size adjusted false negatives and the
$FNR$ that is worth pointing out:
- Having no disparities in group size-adjusted false negatives implies
that, if I were to choose a random family from a given group
(regardless of group-level nutritional outcomes or their individual
need), I would have the same chance of picking out a family missed
by the program across groups.
- Having no disparities in $FNR$ implies that, if I were to choose a
random family *with need for assistance* from a given group, I would
have the same chance of picking out one missed by the subsidy across
groups.
- Unfortunately, disparities in both of these metrics cannot be
eliminated at the same time, except where the level of need is
identical across groups or in the generally unrealistic case of
perfect prediction.
### Special Case: Resource-Constrained Programs {#sec:constrainedassistive}
In many real-world applications, programs
may only have sufficient resources to serve a small fraction of
individuals who might benefit. In these cases, some of the metrics
described here may prove less useful. For instance, where the number of
individuals served is much smaller than the number of individuals with
need, the false omission rate will converge on the overall prevalence,
and it will prove impossible to balance $FOR$ across groups.
In such cases, group-level recall may provide a useful metric for
thinking about equity, asking the question, “given that the program
cannot serve everyone with need, is it at least serving different
populations in a manner that reflects their level of need?”
\^(In probabilistic terms,
$$P(\textrm{received subsidy $\mid$ need assistance, group $i$}) = C~~~\forall~i$$
Where $C$ is a constant value. Or, alternatively,
$$\frac{TP_i}{TP_j} = \frac{n_i \times p_i}{n_j \times p_j}~~~\forall~i,j$$
Where $TP_i$ is the number of true positives, $n_i$ the total number of
individuals, and $p_i$ is the prevalence (here, rate of need for food
assistance) in group $i$.)
Note that, unlike the metrics described above, using recall as an equity
metric doesn’t explicitly focus on the mistakes being made by the
program, but rather on how it is addressing need within each group.
Nevertheless, balancing recall is equivalent to balancing the false
negative rate across groups (note that $recall = 1-FNR$), but may be a
more well-behaved metric for resource-constrained programs in practical
terms. When the number of individuals served is small relative to need,
$FNR$ will approach 1 and ratios between group-level $FNR$ values will
not be particularly instructive, while ratios between group-level recall
values will be meaningful.
As an aside, a focus on recall can also provide a lever that a program
can use to consider options for achieving programmatic or social goals.
For instance, if underlying differences in prevalence across groups are
believed to be a result of social or historical inequities, a program
may want to go further than balancing recall across groups, focusing
even more heavily on historically under-served groups. One rule of thumb
we have used in these cases is to balance recall relative to prevalence
(however, there’s no theoretically “right” choice here and it’s
generally best to consider a range of options):
\^($$\frac{recall_i}{recall_j} = \frac{p_i}{p_j}~~~\forall~i,j$$)
The idea here is that (assuming the program is equally effective across
groups) balancing recall will seek to improve outcomes at an equal rate
across groups without affecting underlying disparities, while a heavier
focus on previously under-served groups will seek both to improve
outcomes across groups and to close these gaps over time.
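As a rough sketch of how this rule of thumb might be checked in practice, the following Python snippet (hypothetical data frame and column names, not from the original text) computes group-level recall and prevalence and compares their ratios to a chosen reference group.

```python
# A sketch, assuming a labeled evaluation set with 0/1 actual and predicted
# labels and a group column; the reference group is an arbitrary choice.
import pandas as pd

def recall_vs_prevalence(df, group_col, label_col, pred_col, reference_group):
    stats = []
    for group, g in df.groupby(group_col):
        tp = ((g[pred_col] == 1) & (g[label_col] == 1)).sum()
        positives = (g[label_col] == 1).sum()
        stats.append({
            group_col: group,
            "prevalence": positives / len(g),
            "recall": tp / positives if positives else float("nan"),
        })
    out = pd.DataFrame(stats).set_index(group_col)
    ref = out.loc[reference_group]
    out["recall_ratio"] = out["recall"] / ref["recall"]
    out["prevalence_ratio"] = out["prevalence"] / ref["prevalence"]
    return out

# Balanced recall corresponds to recall_ratio == 1 for every group; the
# prevalence-weighted rule of thumb corresponds to recall_ratio roughly
# tracking prevalence_ratio.
```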
Mitigating Bias {#sec:applications}
---------------
The metrics described in this chapter can be put to use in a variety of
ways: auditing existing models and processes for equitable results, in
the process of choosing a model to deploy, or in making choices about
how a chosen model is put into use. This section provides some details
about how you might approach each of these tasks.
### Auditing Model Results
Because the metrics described here rely only on the predicted and actual
labels, no specific knowledge of the process by which the predicted
labels are determined is needed to make use of them to assess bias and
fairness in the results. Given this sort of labeled outcome data for any
existing or proposed process, these tools can be applied to help
understand whether that process is yielding equitable results (for the
various possible definitions of “equitable” described above).
Note that the existing process need not be a machine learning model:
these equity metrics can be calculated for any set of decisions and
outcomes, regardless of whether the decisions are derived from a model,
judge, case worker, heuristic rule, or other process. And, in fact, it
will generally be useful to measure equity in any existing
process that a model might augment or replace, to help understand
whether applying the model would improve, degrade, or leave
unchanged the fairness of the existing system.
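As a minimal illustration (a sketch, not a prescribed implementation), the snippet below audits an arbitrary set of 0/1 decisions and outcomes, whether they come from a model or an existing human process, by computing group-level error metrics and expressing each as a ratio to a chosen reference group. The column names and reference group are hypothetical.

```python
import pandas as pd

def audit_decisions(df, group_col, outcome_col, decision_col, reference_group):
    """FDR and FPR per group, expressed as ratios to a reference group."""
    rows = []
    for group, g in df.groupby(group_col):
        fp = ((g[decision_col] == 1) & (g[outcome_col] == 0)).sum()
        tp = ((g[decision_col] == 1) & (g[outcome_col] == 1)).sum()
        tn = ((g[decision_col] == 0) & (g[outcome_col] == 0)).sum()
        rows.append({
            group_col: group,
            "FDR": fp / (fp + tp) if (fp + tp) else float("nan"),
            "FPR": fp / (fp + tn) if (fp + tn) else float("nan"),
        })
    out = pd.DataFrame(rows).set_index(group_col)
    # Disparity ratios relative to the reference group (1.0 means parity).
    return out / out.loc[reference_group]

# Hypothetical usage for an existing bail process:
# audit = audit_decisions(decisions_df, "race", "rearrested", "detained", "white")
```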
### Model Selection
As described in Chapter 6, many different types of models (each in turn
with many tunable hyperparameters) can be brought to bear on a given
machine learning problem, making the task of selecting a specific model
to put into use an important step in the process of model development.
Chapter 6 described how this might be done by considering a model’s
performance on various evaluation metrics, as well as how consistent
that performance is across time or random splits of the data. This
framework for model selection can naturally be extended to incorporate
equity metrics; however, doing so introduces a layer of complexity in
determining how to evaluate trade-offs between overall performance and
predictive equity.
Just as there is no one-size-fits-all metric for measuring equity that
works in all contexts, you might choose to incorporate fairness in the
model selection process in a variety of different ways. Here are a
couple of options we have considered (though certainly not an exhaustive
list):
- If many models perform similarly on overall evaluation metrics of
interest (say, above some reasonable threshold), how do they vary in
terms of equitability?
- How much “cost” in terms of performance do you have to pay to reach
various levels of fairness? Think of this as creating a menu of
options to explicitly show the trade-offs involved. For instance,
imagine your best-performing model has a precision of 0.75 but FDR
ratio of 1.3, but you can reach an FDR ratio of 1.2 by selecting a
model with precision of 0.73, or a ratio of 1.1 at a precision of
0.70, or FDR parity at a precision of 0.64.
- You may want to consider several of the equity metrics described
above and might look at the model that performs best on each metric
of interest (perhaps above some overall performance threshold) and
consider choosing between these options.
- If you are concerned about fairness across several subgroups (e.g.,
multiple categories of race/ethnicity, different age groups, etc),
you might consider exploring the models that perform best for each
subgroup in addition to those that perform similarly across groups
- Another option might be to develop a single model selection
parameter that penalizes performance by how far a model is from
equity and explore how model choice changes based on how heavily you
weight the equity parameter. Note, however, that when you are
comparing equity across more than two groups, you will need to find
a means of aggregating these to a single value (e.g., you might look
at the average disparity, largest disparity, or use some weighting
scheme to reflect different costs of disparities favoring different
groups)
In most cases, this process will yield a number of options for a final
model to deploy: some with better overall performance, some with better
overall equity measures, and some with better performance for specific
subgroups. Unlike model selection based on performance metrics alone,
the final choice between these will generally involve a judgment call
that reflects the project’s dual goals of balancing accuracy and equity.
As such, the final choice of model from this narrowed menu of options is
best treated as a discussion between the data scientists and
stakeholders in the same manner as choosing how to define fairness in
the first place.
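To illustrate, here is a minimal sketch (in Python, not from the original text) of assembling the kind of "menu" of trade-offs described above: for each candidate model we record an overall performance metric (precision at the intervention threshold) and an equity metric (the largest FDR ratio across groups). The model names are hypothetical; the numbers simply mirror the illustrative figures given in the list above.

```python
# Hypothetical summary of candidate models, each already evaluated for
# precision and for its worst-case FDR ratio across groups.
model_results = [
    {"model": "rf_1",  "precision": 0.75, "fdr_ratio": 1.3},
    {"model": "rf_2",  "precision": 0.73, "fdr_ratio": 1.2},
    {"model": "gbm_1", "precision": 0.70, "fdr_ratio": 1.1},
    {"model": "lr_1",  "precision": 0.64, "fdr_ratio": 1.0},
]

# Keep only models above a minimum performance threshold, then sort by the
# equity metric, making the performance "cost" of each step toward parity
# explicit for stakeholders.
min_precision = 0.60
menu = sorted(
    (m for m in model_results if m["precision"] >= min_precision),
    key=lambda m: m["fdr_ratio"],
)
for m in menu:
    print(f'{m["model"]}: precision={m["precision"]:.2f}, FDR ratio={m["fdr_ratio"]:.1f}')
```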
### Other Options for Mitigating Bias
Beyond incorporating measurements of equity into your model selection
process, these measurements can also inform how you put the chosen model into
action. In general, disparities will vary as you vary the threshold used
for turning continuous scores into 0/1 predicted classes. While many
applications will dictate the total number of individuals who can be
selected for intervention, it may still be useful to consider lower
thresholds. For instance, in one project we saw large $FDR$ disparities
across age and race in our models when selecting the top 150 individuals
for an intervention (a number dictated by programmatic capacity), but
these disparities were mitigated by considering the top 1000 with
relatively little cost in precision. This result suggested a strategy
for deployment: use the model to select the 1000 highest risk
individuals and randomly select 150 individuals from this set to stay
within the program’s capacity while balancing equity and performance.
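The deployment strategy described here is simple to express in code; the sketch below (hypothetical data frame, column name, and parameter values) takes the top-scoring pool and randomly selects the number of individuals the program can actually serve.

```python
import pandas as pd

def select_for_intervention(scores, pool_size=1000, capacity=150, seed=0):
    """Take the top `pool_size` individuals by score, then randomly sample
    `capacity` of them for the intervention."""
    pool = scores.sort_values("score", ascending=False).head(pool_size)
    return pool.sample(n=capacity, random_state=seed)

# Hypothetical usage: scores_df has one row per individual with a "score" column.
# selected = select_for_intervention(scores_df)
```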
Another approach that can work well in some situations is to consider
using different thresholds across groups to achieve more equitable
results. This is perhaps most robust where the metric of interest is
monotonically increasing or decreasing with the number of individuals
chosen for intervention (such as recall). This can be formulated in two
ways:
- For programs that have a target scale but may have some flexibility
in budgeting, you can look at to what extent the overall size of the
program would need to increase to achieve equitable results (or
other fairness goals such as those described in
[constrained assistive](#sec:constrainedassistive)).
In this case, interventions don’t
need to be denied to any individuals in the interest of fairness,
but the program would incur some additional cost in order to achieve
a more equitable result.
- If the program’s scale is a hard constraint, you may still be able
to use subgroup-specific thresholds to achieve more equitable
results by selecting fewer of some groups and more of others
relative to the single threshold. In this case, the program would
not need additional costs of expansion, but some individuals who
might have received the intervention based just on their score would
need to be replaced by individuals with somewhat lower scores from
under-represented subgroups.
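As a minimal sketch of the group-specific threshold idea, the following Python snippet (assuming a validation set with known labels and hypothetical column names) chooses a per-group list size, which is equivalent to a per-group score threshold, so that recall is roughly equal across groups.

```python
import pandas as pd

def group_list_sizes_for_recall(df, group_col, label_col, score_col, target_recall):
    """For each group, find the smallest score-ranked list size at which
    within-group recall reaches the target."""
    sizes = {}
    for group, g in df.groupby(group_col):
        g = g.sort_values(score_col, ascending=False).reset_index(drop=True)
        total_pos = g[label_col].sum()
        cum_pos = g[label_col].cumsum()
        reached = cum_pos >= target_recall * total_pos
        sizes[group] = int(reached.idxmax()) + 1 if reached.any() else len(g)
    return sizes

# Hypothetical usage:
# sizes = group_list_sizes_for_recall(val_df, "race", "label", "score", 0.5)
# sum(sizes.values()) gives the total program size needed for this target.
```

With a fixed overall capacity, the target recall could be tuned until the summed list sizes match the budget, which corresponds to the second formulation above: holding the total fixed while reallocating slots across groups.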
As you’re thinking about equity in the application of your machine
learning models, it’s also particularly important to keep in mind that
measuring fairness in a model’s predictions is only a proxy for what you
fundamentally care about: fairness in outcomes in the presence of the
model. As a model is put into practice, you may find that the program
itself is more effective for some groups than others, motivating either
additional changes in your model selection process or customizing
interventions to the specific needs of different populations (or
individuals). Incorporating fairness into decisions about who is chosen
to receive an intervention is an important first step, but shouldn’t be
mistaken for a comprehensive solution to disparities in a program’s
application and outcomes.
%cite LA ACM FAT* paper
Some work is also being done investigating means for incorporating bias
and fairness more directly in the process of model development itself.
For instance, in many cases, different numbers of examples across groups
or unmeasured variables may contribute to a model having higher error
rates on some populations than others, and additional data collection
(either more examples or new features) may help mitigate these biases
where doing so is feasible ****. Other work is being done to explore the
results of incorporating equity metrics directly into the loss functions
used to train some classes of machine learning models, making balancing
accuracy and equity an aspect of model training itself ****. While we
don’t explore these more advanced topics in depth here, we refer you to
the cited articles for more detail.
Further Considerations
----------------------
### Compared to What?
While building machine learning models that are completely free of bias
is an admirable goal, it may not always be an achievable one.
Nevertheless, even an imperfect model may provide an improvement over
current practices depending on the degree of bias involved in existing
processes. It’s important to be cognizant of the existing context and
make measurements of equity for current practices as well as new
algorithms that might replace or augment them. The status quo shouldn’t
be assumed to be free of bias because it is familiar any more than
algorithms should be assumed capable of achieving perfection simply
because they are complex. In practice, a more nuanced view is likely to
yield better results: new models should be rigorously compared with
current results and implemented when they are found to yield
improvements but continually refined to improve on their outcomes over
time as well.
### Costs to Both Errors
In the examples in Section [metrics](#sec:metrics), we focused on programs that
could be considered purely assistive or purely punitive to illustrate
some of the relevant metrics for such programs. While this
classification may work for some real-world applications, in many others
there will be costs associated with both errors of inclusion and errors
of exclusion that need to be considered together in deciding both on how
to think about fairness and how to put those definitions into practice
through model selection and deployment. For the bail example, there are
of course real costs to society both of jailing innocent people and
releasing someone who does, in fact, commit a subsequent crime. In many
programs where individuals may be harmed by being left out, errors of
inclusion may mean wasted resources or even political backlash about
excessive budgets.
In theory, you might imagine being able to assign some cost to each type
of error — as well as to disparities in these errors across groups — and
make a single, unified cost-benefit calculation of the net result of
putting a given model into application in a given way. In practice, of
course, making even a reasonable quantitative estimate of the
individual and social costs of these different types of errors is likely
infeasible in most cases. Instead, a more practical approach generally
involves exploring a number of different options through different
choices of models and parameters and using these options to motivate a
conversation about the program’s goals, philosophy, and constraints.
### What is the Relevant Population?
Related to the sample bias discussed in [bias sources](#sec:biassources),
understanding the relevant population for your machine learning problem
is important both to the modeling itself and to your measures of equity.
Calculation of metrics like the group-size adjusted false positive rate
or false negative rate will vary depending on who is included in the
denominator.
For instance, imagine modeling who should be selected to receive a given
benefit using data from previous applicants and looking at racial equity
based on these metrics. What population is actually relevant to thinking
about equity in this case? It might be the pool of applicants available
in your data, but it just as well might be the set of people who might
apply if they had knowledge of the program (regardless of whether or not
they actually do), or perhaps even the population at large (for
instance, as measured by the census). Each of those choices could
potentially lead to different conclusions about the fairness of the
program’s decisions (either in the presence or absence of a machine
learning model), highlighting the importance of understanding the
relevant population and who might potentially be left out of your data
as an element of how fairness is defined in your context. Keep in mind
that determining (or at least making a reasonable estimate of) the
correct population may at times require collecting additional data.
### Continuous Outcomes
For the sake of simplicity, we focused here on binary classification
problems to help illustrate the sorts of considerations you might
encounter when thinking about fairness in the application of machine
learning techniques. However, these considerations do of course extend
to other types of problems, such as regression models of continuous
outcomes.
In these cases, bias metrics can be formulated around aggregate
functions of the errors a model makes on different types of individuals
(such as the mean squared error and mean absolute error metrics you are
likely familiar with from regression) or tests for similarity of the
distributions of these errors across subgroups. Working with continuous
outcomes adds an additional layer of complexity in terms of defining
fairness to account for the magnitude of the errors being made (e.g.,
how do you choose between a model that makes very large errors on a
small number of individuals vs one that makes relatively small errors on
a large number of individuals?).
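As a brief sketch of what this might look like in practice (hypothetical column names, not from the original text), the snippet below computes group-level mean absolute error for a regression model and compares the error distributions of two groups with a two-sample Kolmogorov-Smirnov test.

```python
import pandas as pd
from scipy.stats import ks_2samp

def compare_group_errors(df, group_col, actual_col, pred_col):
    """Return per-individual errors and mean absolute error by group."""
    errors = df[pred_col] - df[actual_col]
    mae_by_group = errors.abs().groupby(df[group_col]).mean()
    return errors, mae_by_group

# Hypothetical usage with columns "gender", "income", and "predicted_income":
# errors, mae_by_group = compare_group_errors(df, "gender", "income", "predicted_income")
# stat, p_value = ks_2samp(errors[df["gender"] == "F"], errors[df["gender"] == "M"])
```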
If you would like to learn more about understanding bias and fairness in
machine learning problems with continuous outcomes, we suggest
consulting **** for a useful overview.
### Considerations for Ongoing Measurement
The role of a data scientist is far from over when their machine
learning model gets put into production. Making use of these models
requires ongoing curation, both to guard against degradation in
performance or fairness and to continually improve outcomes. The
vast majority of models you put into production will make mistakes, and
a responsible data scientist will seek to look closely at these mistakes
and understand — on both individual and population levels — how to learn
from them to improve the model. Ensuring errors are balanced across
groups is a good starting point, but seeking to reduce these errors over
time is an important aspect of fairness as well.
One challenge you may face in making these ongoing improvements to your
model is with measuring outcomes in the presence of a program that seeks
to change them. In particular, the measurement of true positives and
false positives in the absence of knowledge of a counterfactual (that
is, what would have happened in the absence of intervention) may be
difficult or impossible. For instance, among families who have improved
nutritional outcomes after receiving a food subsidy, you may not be able
to ascertain which families’ outcomes were actually helped by the
program versus which would have improved on their own, obscuring any
measure of recall you might use to judge performance or equity.
Likewise, for individuals denied bail, you cannot know if they actually
would have fled or committed a crime had they been released, making
metrics like false discovery rate impossible to calculate.
During a model’s pilot phase, initially making predictions without
taking action or using the model in parallel with existing processes can
help mitigate some of these measurement problems over the short term.
Likewise, when resources are limited such that only a fraction of
individuals can receive an intervention, using some degree of randomness
in the decision-making process can help establish the necessary
counterfactual. However, in many contexts, this may not be practical or
ethical, and you will need to consider other means for ongoing measurement.