%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Author template for Operations Research (opre) for articles with no e-companion (EC)
%% Mirko Janc, Ph.D., INFORMS, [email protected]
%% ver. 0.95, December 2010
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\documentclass[opre,blindrev]{informs3}
\documentclass[opre,nonblindrev]{informs3} % current default for manuscript submission
\OneAndAHalfSpacedXI % current default line spacing
%%\OneAndAHalfSpacedXII
%%\DoubleSpacedXII
%%\DoubleSpacedXI
% If hyperref is used, dvi-to-ps driver of choice must be declared as
% an additional option to the \documentclass. For example
%\documentclass[dvips,opre]{informs3} % if dvips is used
%\documentclass[dvipsone,opre]{informs3} % if dvipsone is used, etc.
%%% OPRE uses endnotes. If you do not use them, put a percent sign before
%%% the \theendnotes command. This template does show how to use them.
\usepackage{endnotes}
\let\footnote=\endnote
\let\enotesize=\normalsize
\def\notesname{Endnotes}%
\def\makeenmark{$^{\theenmark}$}
\def\enoteformat{\rightskip0pt\leftskip0pt\parindent=1.75em
\leavevmode\llap{\theenmark.\enskip}}
% Private macros here (check that there is no clash with the style)
% Natbib setup for author-year style
\usepackage{natbib}
\bibpunct[, ]{(}{)}{,}{a}{}{,}%
\def\bibfont{\small}%
\def\bibsep{\smallskipamount}%
\def\bibhang{24pt}%
\def\newblock{\ }%
\def\BIBand{and}%
%% Setup of theorem styles. Outcomment only one.
%% Preferred default is the first option.
\TheoremsNumberedThrough % Preferred (Theorem 1, Lemma 1, Theorem 2)
%\TheoremsNumberedByChapter % (Theorem 1.1, Lemma 1.1, Theorem 1.2)
\ECRepeatTheorems
%% Setup of the equation numbering system. Outcomment only one.
%% Preferred default is the first option.
\EquationsNumberedThrough % Default: (1), (2), ...
%\EquationsNumberedBySection % (1.1), (1.2), ...
% In the reviewing and copyediting stage enter the manuscript number.
%\MANUSCRIPTNO{} % When the article is logged in and DOI assigned to it,
% this manuscript number is no longer necessary
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{setspace}
\usepackage{paralist}
\usepackage{graphicx}
\usepackage{url}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{multicol}
\usepackage{csvsimple}
\allowdisplaybreaks[4]
% Frequently used general mathematics
\newcommand{\R}{{\mathbb{R}}}
\newcommand{\Rp}{\R^+}
\newcommand{\Z}{{\mathbb{Z}}}
\newcommand{\Zp}{\Z^+}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\N}{\mathbb{N}}
% Commands for probability
\renewcommand{\P}{\mathbb{P}}
\newcommand{\E}{\mathbb{E}}
%\renewcommand{\P}[1]{\P \left[ #1 \right]}
\newcommand{\e}[1]{\E \left[ #1 \right]}
\newcommand{\ee}[2]{\E_{#1} \left[ #2 \right]}
% Definitions of variables
\newcommand{\X}{X}
\newcommand{\x}{\mathbf{x}}
\newcommand{\xh}{\hat{\x}}
\newcommand{\lh}{\hat{\lambda}}
\newcommand{\mh}{\hat{\mu}}
\newcommand{\xs}{\x^*}
\newcommand{\xit}{\tilde{\mathbf{\xi}}}
\newcommand{\zt}{\tilde{z}}
\newcommand{\zs}{z^*}
\newcommand{\bpi}{\mathbf{\pi}}
\newcommand{\bpih}{\hat{\bpi}}
% Further variables
\newcommand{\y}{\mathbf{y}}
\renewcommand{\c}{\mathbf{c}}
\newcommand{\A}{\mathbf{A}}
\renewcommand{\b}{\mathbf{b}}
\newcommand{\g}{\mathbf{g}}
\newcommand{\D}{\mathbf{D}}
\newcommand{\B}{\mathbf{B}}
\renewcommand{\d}{\mathbf{d}}
\newcommand{\T}{\mathbf{T}}
\newcommand{\M}{\mathbf{M}}
\renewcommand{\h}{\mathbf{h}}
% Probability vectors
\newcommand{\q}{\mathbf{q}}
\newcommand{\p}{\mathbf{p}}
% Convergence of \plp
\newcommand{\qtrue}{\q^{\text{true}}}
\newcommand{\ob}{\bar{\omega}}
% Useful mathematics functions
\newcommand{\keywords}[1]{\par\noindent\enspace\ignorespaces\textbf{Keywords:} #1}
% \newcommand{\keywords}[1]{\par\addvspace\baselineskip\noindent\keywordname\enspace\ignorespaces #1}
% \DeclareMathOperator*{\argmin}{argmin}
% \theoremstyle{plain}
% \newtheorem{theorem}{Theorem}
% \newtheorem{lemma}[theorem]{Lemma}
% \newtheorem{proposition}[theorem]{Proposition}
% \newtheorem{corollary}[theorem]{Corollary}
%
% \theoremstyle{definition}
% \newtheorem{definition}[theorem]{Definition}
%
% \theoremstyle{remark}
% \newtheorem{remark}[theorem]{Remark}
\newtheorem{property}{Property}
\newcommand{\st}{\mbox{s.t.}}
% Naming shortcuts
\newcommand{\plp}{$\phi$LP-2}
\bibliographystyle{ormsv080}
%%%%%%%%%%%%%%%%
\begin{document}
%%%%%%%%%%%%%%%%
% Outcomment only when entries are known. Otherwise leave as is and
% default values will be used.
%\setcounter{page}{1}
%\VOLUME{00}%
%\NO{0}%
%\MONTH{Xxxxx}% (month or a similar seasonal id)
%\YEAR{0000}% e.g., 2005
%\FIRSTPAGE{000}%
%\LASTPAGE{000}%
%\SHORTYEAR{00}% shortened year (two-digit)
%\ISSUE{0000} %
%\LONGFIRSTPAGE{0001} %
%\DOI{10.1287/xxxx.0000.0000}%
% Author's names for the running heads
% Sample depending on the number of authors;
% \RUNAUTHOR{Jones}
% \RUNAUTHOR{Jones and Wilson}
% \RUNAUTHOR{Jones, Miller, and Wilson}
% \RUNAUTHOR{Jones et al.} % for four or more authors
% Enter authors following the given pattern:
%\RUNAUTHOR{Love and Bayraksan}
\RUNAUTHOR{\ }
% Title or shortened title suitable for running heads. Sample:
% \RUNTITLE{Bundling Information Goods of Decreasing Value}
% Enter the (shortened) title:
%\RUNTITLE{Phi-Divergence Constrained Ambiguous Stochastic Programs}
\RUNTITLE{\ }
% Full title. Sample:
% \TITLE{Bundling Information Goods of Decreasing Value}
% Enter the full title:
\TITLE{Phi-Divergence Constrained Ambiguous Stochastic Programs for Data-Driven Optimization}
% Block of authors and their affiliations starts here:
% NOTE: Authors with same affiliation, if the order of authors allows,
% should be entered in ONE field, separated by a comma.
% \EMAIL field can be repeated if more than one author
\ARTICLEAUTHORS{%
\AUTHOR{David K.\ Love}
\AFF{American Express, New York, NY,
\EMAIL{[email protected]}
%\AFF{Graduate Program in Applied Mathematics, University of Arizona, %\EMAIL{[email protected]}%, \url{http://math.arizona.edu/~dlove/}
}
\AUTHOR{G\"{u}zin~Bayraksan}
\AFF{Department of Integrated Systems Engineering, The Ohio State University, \EMAIL{[email protected]}%, %\url{http://www-iwse.eng.ohio-state.edu/biosketch\_GBayraksan.cfm}
}
% Enter all authors
} % end of the block
\ABSTRACT{% should be <= 200 words. Exactly at that!
This paper investigates the use of $\phi$-divergences in ambiguous (or distributionally robust) two-stage stochastic programs.
Classical stochastic programming assumes that the distribution of the uncertain parameters is known.
However, the true distribution is unknown in many applications.
Especially in cases where there is little data or not much trust in the data, an ambiguity set of distributions can be used to hedge against the distributional uncertainty.
$\phi$-divergences (e.g., Kullback-Leibler divergence, $\chi^2$ distance, etc.) provide a natural way to create an ambiguity set of distributions that are centered around a nominal distribution.
The nominal distribution can be obtained by using observed data, expert opinions, simulations, and so forth.
In this paper, we present a classification of $\phi$-divergences to elucidate their use for models with different properties and sources of data.
We illustrate our classification on $\phi$-divergences that result in common risk optimization models.
A condition for assessing the value of collecting additional data is derived, and we demonstrate that the $\phi$-divergence-based ambiguous program behaves essentially the same as the associated non-ambiguous stochastic program as more data is collected.
We present a decomposition-based solution algorithm to solve the resulting model.
Finally, we demonstrate the behavior of $\phi$-divergences in an optimization setting on a numerical example.
}%
% Sample
%\KEYWORDS{deterministic inventory theory; infinite linear programming duality;
% existence of optimal policies; semi-Markov decision process; cyclic schedule}
% Fill in data. If unknown, outcomment the field
\KEYWORDS{Ambiguous stochastic programming, distributionally robust optimization, phi-divergences, data-driven optimization}
%Optimization under uncertainty, water resources management, ambiguous stochastic programming, robust optimization, environmental sustainability}
%\HISTORY{}
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Samples of sectioning (and labeling) in IJOC
% NOTE: (1) \section and \subsection do NOT end with a period
% (2) \subsubsection and lower need end punctuation
% (3) capitalization is as shown (title style).
%
%\section{Introduction.}\label{intro} %%1.
%\subsection{Duality and the Classical EOQ Problem.}\label{class-EOQ} %% 1.1.
%\subsection{Outline.}\label{outline1} %% 1.2.
%\subsubsection{Cyclic Schedules for the General Deterministic SMDP.}
% \label{cyclic-schedules} %% 1.2.1
%\section{Problem Description.}\label{problemdescription} %% 2.
\section{Introduction}
Many optimization problems can be modeled by stochastic programs minimizing the expected value of an uncertain objective function.
However, if the distribution of the uncertain parameters used in the model is incorrect, the stochastic program can give highly suboptimal results.
This concern has led to the development of a modeling technique that replaces the probability distribution by a set of distributions.
Then, the model optimizes the worst-case expectation with respect to the distributions in this set to hedge against the distributional uncertainty.
This set of distributions is referred to as the {\it ambiguity set} of distributions, sometimes called the {\it uncertainty set}.
This approach is not new, with early results dating back to \cite{scarf1958min} and to Dupa{\v{c}}ov{\'{a}}, writing as \cite{zackova1966minimax}.
This type of model has been referred to as an {\it ambiguous stochastic program}; see, for instance, \cite{pflug2007ambiguity} and \cite{erdogan2006ambiguous}.
More recently, this approach has been called {\it distributionally robust optimization} \citep{delage2010distributionally,goh_sim_10,mehrotra_papp_14,hanasusanto_etal_15}.
There are different ways to form the ambiguity set of distributions.
% (see Section~\ref{sec:lit}).
One recent approach that has been proposed by \citet{bental2013robust} uses a set of distributions that are sufficiently close to a given ``nominal'' distribution according to a $\phi$-divergence.
$\phi$-divergences quantify distances between probability distributions; we will shortly review them in Section \ref{sec:phi_divergences}.
Of particular interest is the case when the nominal distribution takes the form of the empirical distribution determined by direct observation of data.
However, this is not the only means of obtaining data.
In addition to direct observation, data can come from simulations, forecasts, or expert opinions.
% that the decision maker would especially like to be robust against.
%\subsection{Contributions}
In this paper\footnote{This work is largely derived from the dissertation \cite{love_13}.
An earlier version of Section \ref{sec:classification} has appeared in the conference paper \cite{love2014classification}, and select results have been summarized in the tutorial \cite{bayraksan_love_15}.
This paper contains more results, full derivations and proofs, and further numerical illustration of results.
Parts of \cite{bayraksan_love_15} have been included by permission.}
we adapt the $\phi$-divergence based ambiguity sets to two-stage stochastic linear programs with recourse.
We call the resulting model two-stage $\phi$-divergence constrained ambiguous stochastic linear program with recourse and denote it as \plp.
% and examine its properties.
While we focus on \plp, some of our results apply to a broad class of distributionally robust optimization problems using $\phi$-divergences---we will point to these shortly.
%Throughout the paper, we consider distributions with finite support.
Data-driven models in the literature typically use empirical probabilities obtained through direct observations.
In our modeling framework, we also allow unobserved data points (e.g., those given by expert opinions) to be represented in the model with zero nominal probabilities.
We examine this case in more detail in the paper.
%\subsection{Related Literature}
Stochastic programs with uncertain objective functions have long been studied by applying the minimax approach to an expected cost; see, e.g., \cite{zackova1966minimax} and \cite{dupacova_87}.
Two seminal papers by \cite{shapiro2002minimax} and \cite{shapiro2004class} developed methods for converting stochastic minimax problems into equivalent stochastic programs with a certain distribution, laying the foundation for a commonly used reformulation technique.
In recent years, there has been a growing interest in distributionally robust methods.
One common method for forming the ambiguity set is moment based, where all distributions that have the same moments (mean, variance, covariance, etc.) are admitted into the set.
An early example comes from \citet{scarf1958min}, who provided a distributionally robust model for the newsvendor problem.
More recent works using moment-based ambiguity sets include \cite{delage2010distributionally} and \cite{wiesemann2013distributionally}.
Probability metrics, including the Kantorovich or Wasserstein metric \citep{pflug2007ambiguity,eskuhn_15}, Prokhorov metric \citep{erdogan2006ambiguous}, and $\zeta$-structure metrics \citep{zhao2015}, have also been used. % to form ambiguity sets.
\cite{hanasusanto_etal_15} provide a comprehensive review of different types of ambiguity sets.
We refer the readers to this paper and references therein for more details on different types of ambiguity sets.
As mentioned above, \cite{bental2013robust} first systematically studied the $\phi$-divergence based models and their computational tractability.
\cite{jiang2012data} investigated $\phi$-divergence based ambiguous chance-constrained programs, providing an exact approach to solve them; see also \cite{yanikoglu2012}.
Specific $\phi$-divergences---including the $\chi^2$-distance \citep{klabjan2013robust}, Kullback-Leibler divergence \citep{calafiore2007ambiguous,hukullback,wang2010likelihood} and the variation distance \citep{jiang2015variation}---were also studied.
\citet{hukullback} and \citet{jiang2015variation} differ from this work and the others by considering continuous distributions.
Close to our work, \cite{bertsimas_gupta_kallus_14} study robust problems with ambiguity sets formed via goodness-of-fit test statistics. Some of their results include $\phi$-divergences, but they consider other tests as well.
Our work unites these papers with various $\phi$-divergences in the finite support case, providing insight into conditions where each $\phi$-divergence should be used.
To the best of our knowledge, this is the first paper examining the behavior of different $\phi$-divergences in an optimization setting.
%\citet{bental2010soft} presents a ``soft'' robust model that allows for changing the level of robustness across the uncertainty set.
%Although it is not a distributionally robust stochastic program, this soft robust model, like the distributionally robust models, takes the form of a convex risk measure.
The contributions of this paper, along with a motivation to study the corresponding research questions, are as follows:
\begin{itemize}
\item[(i)]
%One of the open problems identified by \citet{bental2013robust} was to study the performance of different $\phi$-divergences.
Given that there are many $\phi$-divergences, a decision maker is left with the question of how each divergence behaves for their particular problem and which one to choose.
\begin{itemize}
\item In this paper we provide a classification of $\phi$-divergences that begins to answer this open question.
Our classification is based on the types of distributions admitted into the ambiguity set.
This analysis provides insight into which class of $\phi$-divergence may be most useful to which type of data and decision maker.
\item Our main classification is a general feature of $\phi$-divergences, and it applies to a broader class of $\phi$-divergence constrained distributionally robust problems than the ones presented in this paper.
\end{itemize}
\item[(ii)] In a data-driven setting, several important questions arise.
What happens as we add one more data point?
Will our solution change, and if so, will the overall cost decrease?
Can we determine which scenarios, when sampled, result in a better (lower-cost) solution?
Can we characterize the behavior of the problem as we add more data?
We provide answers to these questions.
\begin{itemize}
\item First, we provide a simple condition to determine if sampling from a particular scenario will lower the cost, which again can be generalized beyond the \plp\ setting.
We refer to this as the {\it value of additional data}.
\item Next, in a data-driven setting---where random data is collected to form the ambiguity set of distributions using $\phi$-divergences---we show that asymptotically, \plp\ behaves essentially the same as a stochastic program with the (unknown) true distribution.
\end{itemize}
\item[(iii)] Stochastic programs often become quite large, which raises questions of computational tractability.
We devise a modified Benders decomposition that can be used to solve \plp\ efficiently by solving only linear problems.
\item[(iv)] Finally, we present examples of $\phi$-divergences that result in commonly used risk models and illustrate our classification on these models.
We also numerically illustrate our results on a small electricity generation example.
\end{itemize}
\smallskip
%\subsection{Organization}
The rest of the paper is organized as follows.
Section \ref{sec:phi_divergences} introduces $\phi$-divergences and lists several useful properties that are used throughout the paper.
Section \ref{sec:plp2} presents the derivation of $\phi$-divergence constrained ambiguous two-stage stochastic programs with recourse and discusses their basic properties. %, which will be used in later sections.
Data-driven properties of \plp\ are explored in Section \ref{sec:properties}.
Section \ref{sec:classification} presents a classification of $\phi$-divergences and illustrates different classes using risk models.
Then, Section \ref{sec:soln_algorithm} discusses a decomposition method for solving the \plp.
Section \ref{sec:comp_results} numerically illustrates the results of the paper, and finally
%some of the properties of the \plp\ model.
Section \ref{sec:plp_conclusions} concludes with a summary and future work.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{$\phi$-Divergences} %Introduction to
\label{sec:phi_divergences}
In this section we define the concept of a $\phi$-divergence and review several properties of $\phi$-divergences that will be used throughout the paper.
\citet{pardo2005statistical} provides a good overview of the known properties of $\phi$-divergences.
We refer the readers to this book for further details.
Many results in this section can also be found in \cite{bental1991certainty,bental2013robust}.
In the finite case, $\phi$-divergences are used to measure the distance between two non-negative vectors $\p = (p_1, \dots, p_n)^T$ and $\q = (q_1, \dots, q_n)^T$.
Specifically, when $\p$ and $\q$ are probability vectors (i.e., satisfying $\sum_{\omega=1}^n p_\omega = \sum_{\omega=1}^n q_\omega = 1$), $\phi$-divergences are used to quantify the distance between two discrete distributions with finite support.
The $\phi$-divergence is defined by
\[
I_\phi(\p,\q) = \sum_{\omega=1}^n q_\omega \phi\left(\frac{p_\omega}{q_\omega}\right),
\]
where $\phi(t)$, called the {\it $\phi$-divergence function}, is a convex function on $t \geq 0$ with $\phi(1) = 0$.
Additionally, it is defined that $0 \phi(a/0) = a \lim_{t \rightarrow \infty} \frac{\phi(t)}{t}$ for $a>0$ and $0 \phi(0/0) = 0$.
When both $\p$ and $\q$ are probability vectors---which is our setup throughout this paper---we can further assume that $\phi(t) \geq 0$ without loss of generality.
Observe that the function $\phi(t)$ can be modified as $\psi(t) = \phi(t) + c(t-1)$ with an appropriately chosen constant $c$ such that $\psi(t) \geq 0$ for all $t\geq 0$ and $I_\psi(\p,\q) = I_\phi(\p,\q)$ for all probability vectors $\p,\q$.
If $\phi(t)$ is differentiable at $t = 1$, this modification can be done by selecting $c = -\phi'(1)$.
%; see, e.g., the Likelihood divergence and Burg entropy in Table~\ref{tb:phi_definitions}.
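For instance, the Likelihood divergence $\phi_l(t) = -\log t$ in Table~\ref{tb:phi_definitions} has $\phi_l'(1) = -1$; choosing $c = 1$ yields $\psi(t) = -\log t + t - 1$, which is the Burg entropy and is nonnegative for all $t \geq 0$.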
Throughout the remainder of this paper, we assume $\phi(t) \geq 0$ for $t\geq 0$, although we give an example of a $\phi$-divergence that does not satisfy this condition in Table~\ref{tb:phi_definitions} below.
%We will discuss this case shortly.
We extend $\phi(t)$ to the set of reals by setting $\phi(t)=+\infty$ for $t<0$.
We make a technical---but not restrictive---assumption
%because we will use $\phi$-divergences in an optimization setting.
%We assume $\phi$ is a closed function.
that $\phi$ is a closed function because we will use $\phi$-divergences in an optimization setting.
For a proper convex function, closedness (i.e., its epigraph being closed) is the same as lower semi-continuity \citep{rockafellar_70}.
(Recall that $\phi$ is a proper function because for at least one $t$ ($t=1$), $\phi(t)=0<\infty$, and $\phi(t)>-\infty$, for all $t\in\mathbb{R}$.)
%Lower semi-continuity is a natural assumption because we will use $\phi$-divergences in an optimization setting.%, and it is satisfied by common $\phi$-divergences (see Table~\ref{tb:phi_definitions}).
%This assumption is naturally satisfied by $\phi$-divergences that are commonly used in the literature (see Table~\ref{tb:phi_definitions}).
Lower semi-continuity is a desirable property in our setting, and it is satisfied by common $\phi$-divergences (see Table~\ref{tb:phi_definitions}).
$\phi$-divergences are not, in general, metrics.
Most $\phi$-divergences do not satisfy the triangle inequality and many are not symmetric in the sense that $I_\phi(\p,\q) \neq I_\phi(\q,\p)$.
One exception is the variation distance, which is equivalent to the $L^1$-distance between the vectors (see Table~\ref{tb:phi_definitions}).
A $\phi$-divergence has an {\it adjoint}, defined by
\begin{equation} \label{eq:adjoint}
\tilde{\phi}(t) = t \phi\left(\frac{1}{t}\right),
\end{equation}
which satisfies all criteria for a $\phi$-divergence \citep{bental1991certainty} and has the property that $I_{\tilde{\phi}}(\p,\q) = I_\phi(\q,\p)$.
Divergences that are symmetric with respect to the input vectors are known as {\it self-adjoint}.
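For example, a direct computation from (\ref{eq:adjoint}) shows that the Kullback-Leibler divergence and the Burg entropy are adjoints of one another: with $\phi_{kl}(t) = t\log t - t + 1$, we have $\tilde{\phi}_{kl}(t) = t\,\phi_{kl}(1/t) = -\log t + t - 1 = \phi_b(t)$.
The variation and Hellinger distances, in contrast, are self-adjoint (see Table~\ref{tb:phi_definitions}).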
An important function related to the $\phi$-divergence function is its {\it convex conjugate}, which is used, for instance, in the dual problem formulation (Section \ref{ssec:form}).
We will also use the properties of the conjugate for our classification (Section \ref{sec:classification}).
The conjugate $\phi^* : \R \rightarrow \R \cup \{\infty\}$ is defined as
\begin{equation} \label{eq:conjugate}
\phi^*(s) = \sup_{t \geq 0} \{st - \phi(t)\}.
\end{equation}
It is a nondecreasing convex function that may take the value $+\infty$ above some upper bound $\bar{s}$.
Because $\phi$ is a proper closed convex function, $\phi^{**}=\phi$ and $t \in \partial \phi^*(s)$ if and only if $s \in \partial \phi(t)$ \citep[Corollary 23.5.1]{rockafellar_70}.
We will use the latter property in our analysis.
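As an illustration of (\ref{eq:conjugate}), consider the Kullback-Leibler divergence $\phi_{kl}(t) = t\log t - t + 1$.
The supremum of $st - \phi_{kl}(t)$ over $t \geq 0$ is attained where $s - \log t = 0$, i.e., at $t = e^s$, and substituting back gives $\phi_{kl}^*(s) = e^s - 1$, as listed in Table~\ref{tb:phi_definitions}.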
Table \ref{tb:phi_definitions} lists some common examples of $\phi$-divergences, along with their adjoints and conjugates.
The value of the conjugate is listed only in its domain, i.e., $\{s : \phi^*(s) < \infty\}$.
Most of these $\phi$-divergences are widely used in statistics and information theory.
%Using $\phi$-divergences to model ambiguous probability distributions is an attractive approach because it uses the data directly---only those data points or scenarios of interest are used in the calculations.
%%These scenarios can come from direct observation, results of simulation, or from expert opinion that the decision maker would especially like to be robust against.
%Because the \plp\ depends only on these scenarios, the size of the problem is polynomial in the sample size, making the \plp\ computationally tractable.
Because many $\phi$-divergences are commonly used in statistics---e.g., to conduct goodness-of-fit tests \citep{pardo2005statistical}---they provide natural ways to deal with data and distributions.
Consequently, models based on $\phi$-divergences can be more data driven.
For instance, many $\phi$-divergences use more distributional information than moments.
Another advantage is that they form convex ambiguity sets.
This opens up the tools of convex analysis and allows computationally tractable models.
Finally, they encompass a fairly large class of problems, including some important risk-averse optimization problems.
In Section \ref{ssec:special_phi}, we present $\phi$-divergences that assign a distance of either $0$ or $\infty$, which result in commonly used risk models.
%; we will provide some examples in Section~\ref{ssec:special_phi}.
Table \ref{tb:phi_definitions} lists a divergence, labeled ``Likelihood,'' that is somewhat different from the others.
The Likelihood divergence is equivalent to the Burg entropy when comparing probability vectors, but it does not satisfy the normalizing condition $\phi(t) \geq 0$.
This divergence is included because \citet{wang2010likelihood} use it to formulate a distributionally robust problem in which the ambiguity set of distributions has a sufficiently high empirical likelihood.
They refer to this method as the Likelihood Robust Optimization.
We also note that \cite{calafiore2007ambiguous} and \cite{hukullback} use a different naming convention than the one given here, referring to the Burg entropy as the Kullback-Leibler (KL) divergence---reversing the order of the arguments $\p$ and $\q$ relative to the notation presented here.
In this paper, $\q$ denotes the nominal distribution.
%, which is the distribution found directly via data or a distribution that is believed to represent the 'true' distribution but with some uncertainty.
\begin{table}
\TABLE
{
Definitions of some common $\phi$-divergences, along with their adjoints $\tilde{\phi}(t)$ and conjugates $\phi^*(s)$
\label{tb:phi_definitions}
}
{\begin{tabular}{lccccc}
\hline
Divergence & $\phi$ & $\tilde{\phi}$ & $\phi(t),\ t \geq 0$ & $I_\phi(\p,\q)$ & $\phi^*(s)$ \\
\hline
Kullback-Leibler & $\phi_{kl}$ & $\phi_b$ & $t\log t - t + 1$ & $\sum p_\omega \log\left(\frac{p_\omega}{q_\omega}\right)$ & $e^s - 1$ \\
Burg Entropy & $\phi_b$ & $\phi_{kl}$ & $-\log t + t - 1$ & $\sum q_\omega \log\left(\frac{q_\omega}{p_\omega}\right)$ & $-\log(1-s),\ s < 1$ \\
J-Divergence & $\phi_j$ & $\phi_j$ & $(t-1)\log t$ & $\sum (p_\omega - q_\omega) \log\left(\frac{p_\omega}{q_\omega}\right)$ & No closed form \\
Likelihood & $\phi_l$ & $t\log t $ & $-\log t$ & $\sum q_\omega \log\left(\frac{q_\omega}{p_\omega}\right)$ & $-\log(-s) - 1,\ s < 0$ \\
$\chi^2$-Distance & $\phi_{\chi^2}$ & $\phi_{m\chi^2}$ & $\frac{1}{t} (t-1)^2$ & $\sum \frac{(p_\omega-q_\omega)^2}{p_\omega}$ & $2 - 2\sqrt{1-s},\ s \leq 1$ \\
Modified $\chi^2$-Dist. & $\phi_{m\chi^2}$ & $\phi_{\chi^2}$ & $(t-1)^2$ & $\sum \frac{(p_\omega - q_\omega)^2}{q_\omega}$ & $\begin{cases} -1 & s < -2 \\ s + \frac{s^2}{4} & s \geq -2 \end{cases}$ \\
% $\chi$-div, $\theta > 1$ & $\phi_\chi^\theta$ & $t^{1-\theta}\phi_\chi^\theta$ & $|t-1|^\theta$ & $\sum q_\omega |1-\frac{p_\omega}{q_\omega}|^\theta$ & $\begin{cases} -1 & s \leq -\theta \\ s + (\theta-1)\left(\frac{|s|}{\theta}\right)^\frac{\theta}{\theta-1} & s \geq -\theta \end{cases}$ \\
Variation Distance & $\phi_v$ & $\phi_v$ & $|t-1|$ & $\sum |p_\omega - q_\omega|$ & $\begin{cases} -1 & s \leq -1 \\ s & -1 \leq s \leq 1 \end{cases}$ \\
Hellinger Distance & $\phi_h$ & $\phi_h$ & $(\sqrt{t} - 1)^2$ & $\sum (\sqrt{p_\omega} - \sqrt{q_\omega})^2$ & $\frac{s}{1-s},\ s < 1$ \\
\hline
\end{tabular}}
{}
\end{table}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{$\phi$-Divergence Constrained Ambiguous Stochastic Program}
\label{sec:plp2}
%In this section we provide primal and dual formulations and basic properties of two-stage ambiguous stochastic linear programs constructed via $\phi$-divergences.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Formulation}
\label{ssec:form}
We begin with a two-stage stochastic linear program with recourse (SLP-2).
%, which we will soon present its corresponding \plp.
Let $\x$ be a vector of first-stage decision variables with cost vector $\c$, constraint matrix $\A$, and right-hand-side vector $\b$.
We assume the random parameters of the second-stage problem have a finite distribution with realizations indexed by $\omega = 1, \dots, n$ and probabilities denoted by $q_\omega$, $\omega = 1, \dots, n$.
%and probabilities denoted by...
We refer to realizations interchangeably as scenarios.
The SLP-2 is given by
\begin{equation}
\min_\x \ \left\{ \c\x + \sum_{\omega=1}^n q_\omega h_\omega(\x) : \ \A\x = \b, \x \geq 0 \right\}, \label{eq:slp_first_stage}
\end{equation}
where
\begin{align}
h_\omega(\x) = \min_{\y^\omega} \ & \left\{ \g^\omega \y^\omega : \ \D^\omega \y^\omega = \B^\omega \x + \d^\omega, \y^\omega \geq 0 \right\}, \ \ \omega = 1, \dots, n. \label{eq:slp_second_stage}
\end{align}
For a given scenario $\omega$, the second-stage decision variables $\y^\omega$ are optimized with respect to cost vector $\g^{\omega}$.
The second-stage constraints, with recourse matrix $\D^{\omega}$, depend on the first-stage variables $\x$ through the technology matrix $\B^{\omega}$, which appears on the right-hand side of the constraints along with $\d^{\omega}$.
Throughout the rest of the paper, we denote the first-stage feasible region as $X = \{\x : \ \A\x = \b, \x \geq 0\}$.
The first stage of a prototypical SLP-2 allocates capacities to different electricity generators before demand and reliability of the generators are known.
Then, once the electricity demand and generator reliabilities become known, the second stage provides electricity to demand sites at the least cost.
We will consider this problem for our numerical results in Section~\ref{sec:comp_results}.
%We assume relatively complete recourse; i.e., the second-stage problems (\ref{eq:slp_second_stage}) are feasible for every feasible solution $\x$ of the first-stage problem (\ref{eq:slp_first_stage}).
%We also assume that the second-stage problems are dual feasible for every feasible solution $\x$ of the first-stage problem.
The SLP-2 formulation assumes that $\q$ is known.
However, in many applications, the distribution is itself unknown, and there is no reliable way to obtain the probabilities of scenarios $\omega$.
By replacing the specific distribution in SLP-2 with a set of distributions sufficiently close to the nominal distribution $\q$ with respect to a $\phi$-divergence, we create the \plp\ model.
In the \plp, the objective function is minimized with respect to the worst-case distribution selected from the ambiguity set of distributions.
The resulting minimax formulation of \plp\ is
\begin{equation}
\min_{\x \in X} \max_{\p \in \mathcal{P}} \left\{ \c\x + \sum_{\omega=1}^{n} p_\omega h_\omega(\x) \right\}, \label{eq:plp_primal}
\end{equation}
where the ambiguity set is
\begin{align}
\mathcal{P} = & \Bigg\{ \p : \ \sum_{\omega = 1}^{n} q_\omega \phi\left(\frac{p_\omega}{q_\omega}\right) \leq \rho, \label{eq:plp_primal_divergence} \\
& \ \sum_{\omega=1}^{n} p_\omega = 1, \label{eq:plp_primal_probability} \\
& \ p_\omega \geq 0,\ \forall \omega \Bigg\}. \label{eq:nonneg}
\end{align}
We refer to (\ref{eq:plp_primal_divergence}) as the $\phi$-divergence constraint, while (\ref{eq:plp_primal_probability}) and (\ref{eq:nonneg}) simply ensure that $\p$ is a probability vector.
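For instance, for the variation distance $\phi_v(t) = |t-1|$ (see Table~\ref{tb:phi_definitions}), the ambiguity set is simply the intersection of the probability simplex with an $L^1$-ball of radius $\rho$ centered at the nominal distribution,
\[
\mathcal{P} = \left\{ \p : \ \sum_{\omega=1}^{n} |p_\omega - q_\omega| \leq \rho, \ \sum_{\omega=1}^{n} p_\omega = 1, \ p_\omega \geq 0, \ \forall \omega \right\}.
\]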
We will discuss how to determine $\rho$ in (\ref{eq:plp_primal_divergence}) shortly in Section \ref{ssec:robust_level}.
Taking the Lagrangian dual of the inner maximization problem, with dual variables $\lambda$ and $\mu$ associated with constraints (\ref{eq:plp_primal_divergence}) and (\ref{eq:plp_primal_probability}), respectively, and then combining the two minimizations gives \plp\ in dual form
\begin{align}
\min_{\x,\lambda,\mu} \ & \c\x + \mu + \rho \lambda + \lambda \sum_{\omega=1}^{n} q_\omega \phi^*\left(\frac{h_\omega(\x) - \mu}{\lambda}\right) \label{eq:plp_two_stage} \\
\st \ & \x \in X \nonumber \\
& h_\omega(\x) - \mu \leq \left( \lim_{t \rightarrow \infty} \frac{\phi(t)}{t} \right) \lambda, \ \forall \omega \label{eq:plp_feas_constraint}\\
& \lambda \geq 0. \nonumber
\end{align}
When $\lambda =0$, the last term in the objective function (\ref{eq:plp_two_stage}) has the following interpretations: $0\phi^*(b/0)=0$ if $b\leq 0$ and $0\phi^*(b/0)=+\infty$ if $b > 0$.
%where $h_\omega(\x)$ and the second-stage problems are as given in (\ref{eq:slp_second_stage}).
When $\rho>0$, $\q$ strictly satisfies the $\phi$-divergence constraint $I_{\phi}(\q,\q)=0<\rho$.
So, the Slater condition holds, and we have strong duality.
Some $\phi$-divergences, like the J-divergence, have no closed-form representation of $\phi^*$.
However, they can be expressed as the sum of other $\phi$-divergences with closed-form conjugates.
For example, the sum of the Burg entropy and the KL divergence gives the J-divergence.
In this case, the dual can be formed similarly; see \cite{bental2013robust} for details.
Theorem 1 of \cite{bental2013robust} contains a derivation of the dual problem, which is reprinted as part of the proof of Proposition \ref{prop:pop}.
The right-hand side of (\ref{eq:plp_feas_constraint}) contains a limit.
This constraint results from a dual feasibility consideration.
When this limit is finite, i.e., $\lim_{t \rightarrow \infty} \frac{\phi(t)}{t}= \bar{s}<\infty$, then for any $s> \bar{s}$, $\phi^*(s)=\infty$.
Therefore, $\phi$-divergences with a finite limit (like the variation distance) induce this constraint.
In (\ref{eq:plp_feas_constraint}), we moved $\lambda$ to the right-hand side to allow for $\lambda=0$.
On the other hand, when this limit is $\infty$ (like the KL divergence), $\phi^*(s)<\infty$ for any finite value of $s$.
In this case,
%all values are dual feasible, and
constraint (\ref{eq:plp_feas_constraint}) can be removed from the formulation.
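As a concrete illustration, for the Kullback-Leibler divergence we have $\phi^*(s) = e^s - 1$ and $\lim_{t \rightarrow \infty} \phi_{kl}(t)/t = \infty$, so constraint (\ref{eq:plp_feas_constraint}) drops out and, using $\sum_{\omega=1}^{n} q_\omega = 1$, the dual (\ref{eq:plp_two_stage}) becomes
\[
\min_{\x \in X, \ \lambda \geq 0, \ \mu} \ \c\x + \mu + (\rho - 1)\lambda + \lambda \sum_{\omega=1}^{n} q_\omega \exp\left(\frac{h_\omega(\x) - \mu}{\lambda}\right).
\]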
Observe, in particular, that the dual formulation remains valid even when $q_\omega = 0$ for some $\omega$.
%(see proof of Proposition \ref{prop:pop} and Section~\ref{sec:classification} for more details).
We will discuss this case in more detail in Section~\ref{sec:classification} (e.g., proof of Proposition \ref{prop:pop}).
%For some $\phi$-divergences, dant.
%However, other $\phi$-divergences, like the variation distance, have a finite limit, inducing this constraint.
%When this limit is $\infty$, on the other hand, $\phi^*(s)<\infty$ for any finite value of $s$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Basic Properties of \plp}
\label{ssec:basicprop}
In this section we list some basic properties of \plp.
Some of these properties have already been noted earlier (e.g., by \citet{bental2013robust,bental_teboulle_07} and by others for specific $\phi$-divergences), but we list them for completeness.
These properties help with our specialized solution method and our classification of $\phi$-divergences, and we refer to them in later sections.
Throughout the rest of the paper, we use the notation
\begin{equation}
s_\omega = \frac{h_\omega(\x) - \mu}{\lambda}. \label{eq:s_omega_definition}
\end{equation}
Furthermore, $(\x^*, \p^*)$ denotes the optimal primal solution, and $(\x^*, \lambda^*, \mu^*)$ denotes the optimal dual solution.
Optimal $s_\omega^*$ can then be found by plugging in the respective optimal solutions in (\ref{eq:s_omega_definition}).
We assume that $X$ is nonempty and compact and that the second-stage problems (\ref{eq:slp_second_stage}) are primal and dual feasible for all $\omega$ and all $\x \in X$.
This ensures that each $h_\omega(\x)$ is finite and that both SLP-2 and \plp\ have finite optimal values.
The basic properties of \plp\ are as follows.
\begin{property}
\label{property:convex}
\plp\ is a convex program.
\end{property}
\begin{property}
\label{property:coherent_risk_measure}
\plp\ is equivalent to minimizing a coherent risk measure.
\end{property}
\begin{property}
\label{property:time_structure}
\plp\ preserves the time structure of SLP-2.
\end{property}
\begin{property} {\sc (Primal-Dual Relation.)}
\label{property:primal_dual_relation}
The (optimal) worst-case probabilities $p^*_\omega$ can be calculated with the equations
% \begin{align}
\begin{equation}\label{eq:p_worst}
\frac{p_\omega^*}{q_\omega} \in \partial \phi^*\left(s_\omega^*\right), \ \ \ \ \ \sum_{\omega=1}^n q_\omega \phi\left(\frac{p_\omega^*}{q_\omega}\right) = \rho, \ \ \ \ \ \sum_{\omega=1}^n p_\omega^* = 1
\end{equation}
when $\lambda^*>0$ and $q_\omega >0$.
With $\lambda^*>0$ and $q_\omega =0$, $p_\omega^* \in \partial \phi^*\left(s_\omega^*\right) q_\omega$ (i.e., $p_\omega^* = 0$) when $s_\omega^* < \bar{s}$; otherwise (i.e., when $s_\omega^* = \bar{s}<\infty$) use the last two equations in (\ref{eq:p_worst}).
With $\lambda^*=0$, set $p_\omega^* = 0$ when $h_\omega(\x^*) - \mu^* < 0$; otherwise (when $h_\omega(\x^*) - \mu^* = 0$), use the last two equations in (\ref{eq:p_worst}), but with the regular $\phi$-divergence constraint $I_{\phi}(\p,\q)\leq \rho$ instead of $I_{\phi}(\p,\q)=\rho$.
%
%$\phi$-divergence constraint (\ref{eq:plp_primal_divergence}) and $\sum_\omega p^*_\omega = 1$.
\end{property}
A coherent risk measure---first proposed by \cite{Artzner_et_al_1999}, and later refined by several authors including \cite{rockafellar2007coherent,shaDR:09}---has desirable properties, including convexity and monotonicity.
These properties, combined with the facts that $h_\omega(\x)$ is convex over $X$ and that $X$ is a convex set, imply Property \ref{property:convex}.
See also Proposition \ref{prop:convex} in Section \ref{sec:soln_algorithm} and \cite{bental2013robust}.
Observe that Property \ref{property:coherent_risk_measure} is valid even when we have $q_\omega=0$ for some $\omega$.
The case of $q_\omega = 0$ plays an important role in the classification presented in Section \ref{sec:classification}.
Therefore, we provide a proof of Property \ref{property:coherent_risk_measure} in this case in Appendix \ref{sec:apx_proof}.
%In Section \ref{ssec:special_phi}, we present $\phi$-divergences that result in commonly used coherent risk measures.
Property \ref{property:time_structure} helps with our decomposition-based solution method described in Section \ref{sec:soln_algorithm}.
The preservation of time structure can be seen in (\ref{eq:plp_two_stage}).
Rewriting it slightly, we obtain
\begin{equation}
\label{eq:dec}
\min_{\x \in X,\lambda \geq 0,\mu} \left\{ \c\x + \mu + \rho \lambda + \sum_{\omega=1}^{n} q_\omega h_\omega^{\dagger}(\x, \lambda, \mu) \colon \ s_\omega \leq \bar{s} \right\}.
\end{equation}
The above formulation preserves the two-stage structure of the SLP-2.
The first-stage variables can now be viewed as $\x, \lambda$, and $\mu$.
The expectation is taken using the nominal probability vector $\q$.
Finally, $ h^{\dagger}_\omega(\x, \lambda, \mu) = \lambda \phi^*\left(\frac{h_\omega(\x) - \mu}{\lambda} \right)$,
%or equivalently $\lambda\phi^*(s_\omega)$,
where $h_\omega(\x)$ are defined as before.
%In Section \ref{sec:soln_algorithm}, we decompose the problem by scenario and convert (sub-)gradients of $h_\omega(\x)$ to (sub-)derivatives of $\phi^*\left(s_\omega\right)$.
Property \ref{property:primal_dual_relation} lists the first-order necessary conditions for optimality.
The appearance of the conjugate $\phi^*$ in the objective of (\ref{eq:plp_two_stage}) gives a method for retrieving the worst-case distribution from the dual problem.
It uses the fact that $\frac{p^*_\omega}{q_\omega} \in \partial \phi^*(s^*_\omega)$ if and only if $s^*_\omega \in \partial \phi\left(\frac{p^*_\omega}{q_\omega}\right)$.
In many cases, the first equation in (\ref{eq:p_worst}) is sufficient to calculate $p_\omega^*$.
In addition, $\phi^*$ is often differentiable, in which case we have the relationship $p_\omega^* = q_\omega \phi^{* \prime}(s_\omega^*)$.
Observe that because $\phi$ is a proper closed convex function, so is $\phi^*$ \citep[Theorem 12.2]{rockafellar_70}.
Hence, $\phi^*$ is subdifferentiable on the relative interior of its domain \citep[Theorem 23.4]{rockafellar_70}.
The boundary of the domain of $\phi^*$---that is, at $s=\bar{s}$ when $\bar{s}<\infty$---might require special care, where we need the primal feasibility conditions (\ref{eq:plp_primal_divergence}) and (\ref{eq:plp_primal_probability}).
The $\phi$-divergence constraint in (\ref{eq:p_worst}) is written as an equality because with $\lambda^*>0$, the complementary slackness dictates that this constraint must be active.
With $\lambda^*=0$, we have the regular $\phi$-divergence constraint (\ref{eq:plp_primal_divergence}).
While Property \ref{property:primal_dual_relation} summarizes how to obtain $p^*_\omega$ for the cases when $\lambda^* = 0$ or $q_\omega = 0$, we will discuss these special cases in more detail in Section \ref{sec:classification}.
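As an illustration of Property \ref{property:primal_dual_relation}, consider the Kullback-Leibler divergence with $\lambda^* > 0$ and $q_\omega > 0$ for all $\omega$.
Here $\phi^*(s) = e^s - 1$ is differentiable, so the first condition in (\ref{eq:p_worst}) gives $p_\omega^* = q_\omega e^{s_\omega^*}$; that is, the worst-case distribution is an exponential tilting of the nominal distribution that shifts weight toward scenarios with larger second-stage costs $h_\omega(\x^*)$, while $\lambda^*$ and $\mu^*$ are determined by the remaining two conditions in (\ref{eq:p_worst}).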
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The Level of Robustness}
\label{ssec:robust_level}
The literature on $\phi$-divergences provides some insight on choosing a reasonable asymptotic value of $\rho$ in the data-driven setting.
In this setting, $\q$ is generated from observations, where scenario $\omega$ has been observed $N_\omega$ times with $N = \sum_{\omega=1}^n N_\omega$ total observations.
So, the nominal probability of scenario $\omega$ is set to be $q_\omega = \frac{N_\omega}{N}$.
When $\phi$ is twice continuously differentiable around $1$ with $\phi^{\prime \prime}(1)>0$, \citet[Theorem 3.1]{pardo2005statistical} shows that the statistic
\[
T^\phi_N(\q^N,\qtrue) = \frac{2N}{\phi''(1)} I_{\phi}(\q^N, \qtrue)
%\sum_{\omega=1}^n \qtrue_\omega \phi\left(\frac{q^N_\omega}{\qtrue_\omega}\right)
\]
converges in distribution to a $\chi^2$-distribution with $n-1$ degrees of freedom, where $\q^N$ denotes the empirical distribution ($q^N_\omega = N_\omega/N$) and $\qtrue$ denotes the underlying true distribution.
Most $\phi$-divergences in Table~\ref{tb:phi_definitions} satisfy this differentiability condition.
\citet{bental2013robust} then use this result to suggest the asymptotic value
\begin{equation} \label{eq:asymptotic_rho}
\rho = \frac{\phi''(1)}{2N} \chi^2_{n-1,1-\alpha},
\end{equation}
where $\chi^2_{n-1,1-\alpha}$ is the $1-\alpha$ percentile of a $\chi^2_{n-1}$ distribution.
This choice of $\rho$ produces an approximate $1-\alpha$ confidence region on the true distribution.
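As an illustrative calculation, suppose there are $n = 5$ scenarios, $N = 100$ observations, and $\alpha = 0.05$, so that $\chi^2_{4,0.95} \approx 9.49$.
Then (\ref{eq:asymptotic_rho}) gives $\rho \approx \frac{2}{200}(9.49) \approx 0.095$ for the modified $\chi^2$-distance ($\phi''(1) = 2$) and $\rho \approx 0.047$ for the Kullback-Leibler divergence ($\phi''(1) = 1$); in either case, $\rho$ shrinks proportionally to $1/N$ as more observations are collected.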
To correct for small sample sizes and for more details, we refer the readers to \cite{pardo2005statistical} and \cite{bental2013robust}.
%We are now ready to present the main contributions of this paper.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Data-Driven Properties}
\label{sec:properties}
In this section we assume the nominal probabilities $\q$ are the empirical probabilities $\q^{N}$---i.e., $q_\omega=q_\omega^N = {N_\omega}/{N}$---and provide insight into how the \plp\ changes as data is added.
First, we investigate how \plp\ might change with a single additional observation in Section \ref{ssec:value}.
Next, we examine what happens as more and more data is gathered with asymptotic results in Section \ref{ssec:epiconvergence}.
This analysis must consider how the level of robustness $\rho$ changes as additional observations are obtained.
Therefore, in this section, we use the notation $\rho_N$ to emphasize the dependence of $\rho$ on the number of observations.
To be consistent with the known $\phi$-divergence results stated in Section \ref{ssec:robust_level}, we set $\rho_N = \frac{\rho_0}{N}$.
Observe that $\rho_0=\frac{\phi''(1)}{2} \chi^2_{n-1,1-\alpha}$ in (\ref{eq:asymptotic_rho}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Value of Additional Data} \label{ssec:value}
With a data-driven formulation such as \plp,
%it is natural to ask how the optimal value and solution changes as more data is gathered.
%In particular, %for robust formulations like \plp\
one might be concerned about being overly conservative in the problem formulation and thus missing the opportunity to find a better solution with respect to the true distribution.
For \plp, this means that the initial model is likely to be more conservative in an effort to be robust, while the new information could make the model less conservative.
This would happen, for instance, when the new information removes the current worst-case distribution from the ambiguity set.
Below, we present a simple method of determining if taking an additional observation will allow for a lower-cost solution.
%Below, we present a simple method of determining if taking an additional sample will eliminate the old worst-case distribution and allow for a lower-cost solution.
%We also provide a way of estimating the probability of sampling such an observation.
%Our main goal is to come up with simple conditions on how an additional data affects the problem by using the current solution.
Our main goal is to derive simple conditions using only the current solution.
In particular, we would like to use only the current optimal worst-case probabilities $p_\omega^*$, nominal probabilities $q_\omega$, and the number of observations $N$.
One could, of course, solve the problem with an additional observation of $\omega$, see if the optimal value is lowered, and check this for every scenario $\omega$.
We would like to avoid re-solving the problem.
Toward this end, we first provide a general result.
Using this general result, we then provide simpler conditions for a subset of the $\phi$-divergences by using only $\q$, $\p^*$ and $N$ in Corollary \ref{cor:cost_decrease_trick}.
\begin{proposition}
\label{prop:value}
Let $(\x^*_N,\mu^*_N,\lambda^*_N)$ solve the $N$-observation (dual) problem with $q_\omega = \tfrac{N_\omega}{N}$.
Suppose $s^*_\omega = \dfrac{h_\omega(\x^*_N) - \mu^*_N}{\lambda^*_N}$ is finite and $\phi^*$ is subdifferentiable at $s^*_\omega $ for all $\omega=1,\ldots,n$.
An additional observation of scenario $\hat{\omega}$ will result in a decrease in the worst-case expected cost of \plp\ if the following condition is satisfied
\begin{equation} \label{eq:cost_decrease_cond}
\sum_{\omega=1}^n q_\omega \phi^{*\prime}\left(\frac{N}{N+1}s^*_\omega\right) \left(\frac{N}{N+1}s^*_\omega\right) > \phi^*\left(\frac{N}{N+1}s^*_{\hat{\omega}}\right).
\end{equation}
In (\ref{eq:cost_decrease_cond}), $\phi^{*\prime}(s)$ denotes the derivative of $\phi^{*}$ at $s$ if it is differentiable, and it denotes a subgradient of $\phi^{*}$ at $s$ otherwise.
\end{proposition}
\begin{proof}{\sc Proof of Proposition \ref{prop:value}.}
For ease of exposition, assume $\phi^*$ is differentiable.
% The proof works without this assumption by considering a subgradient.
We begin the proof with the change of variables $\kappa = \frac{\lambda}{N}$ and note that $N\rho_N = (N+1)\rho_{N+1} =\rho_0$ is constant.
With this change of variables, the objective function of the $N$-observation problem is given by
\[
f_N(\x,\mu,\kappa) = c\x + \mu + \rho_0 \kappa + \sum_{\omega = 1}^n N_\omega \left[ \kappa \phi^*\left(\frac{h_\omega(\x) - \mu}{\kappa N} \right) \right].
\]
Let $z_N^*$ be the optimal value and $(\x^*_N,\mu^*_N,\kappa^*_N)$ be the optimal solution of the $N$-observation dual problem (\ref{eq:plp_two_stage}) with the change of variables.
% and objective function $f_N(\x,\mu,\kappa)$.
%$ = \min_{\x \in X,\mu,\kappa \geq 0} f_N(\x,\mu,\kappa)$.
We wish to find a simple estimate of the decrease in the optimal cost $z_N^* - z_{N+1}^*$ associated with taking an additional observation of a specific scenario $\hat{\omega}$.
In particular, we look for a condition under which $z_N^* - z_{N+1}^* > 0$.
% Let $(\x^*_N,\mu^*_N,\kappa^*_N)$ minimize $f_N$.
Notice that $(\x^*_N,\mu^*_N,\kappa^*_N)$ is a feasible but not necessarily an optimal solution to the $(N+1)$-observation problem.
%with an additional observation of scenario $\hat{\omega}$.
Then, $z_N^* - f_{N+1}(\x^*_N,\mu^*_N,\kappa^*_N)$ provides a lower bound on the decrease in optimal cost $z_N^* - z_{N+1}^*$.
We will find scenarios $\hat{\omega}$ such that $z_N^* - f_{N+1}(\x^*_N,\mu^*_N,\kappa^*_N) > 0$.
The objective of the $(N+1)$-observation problem for a given $(\x,\mu,\kappa)$ is $ f_{N+1}(\x,\mu,\kappa) = \c\x + \mu + \rho_0 \kappa + \sum_{\omega = 1}^n N'_\omega \left[ \kappa \phi^*\left(\frac{h_\omega(\x) - \mu}{\kappa (N+1)} \right) \right]$, where $N'_\omega$ is the number of observations of scenario $\omega$ after $N+1$ total observations (i.e., $N'_{\hat{\omega}}=N_{\hat{\omega}}+1$ and $N'_\omega = N_\omega$ for $\omega \neq \hat{\omega}$).
Then, $z_N^* - z_{N+1}^*$ is bounded below by $\kappa \sum_{\omega=1}^n \left[ N_\omega \phi^*\left(\frac{h_\omega(\x) - \mu}{\kappa N} \right) - N'_\omega \phi^*\left(\frac{h_\omega(\x) - \mu}{\kappa (N+1)} \right) \right]$ evaluated at $(\x^*_N,\mu^*_N,\kappa^*_N)$, and this bound must be positive to guarantee a drop in the optimal cost.
If $\hat{\omega}$ is the next scenario observed, we can rewrite this condition as
\begin{equation} \label{eq:raw_cond}
\kappa \sum_{\omega=1}^n N_\omega \left[ \phi^*\left(\frac{h_\omega(\x) - \mu}{N\kappa} \right) - \phi^*\left(\frac{h_\omega(\x) - \mu}{(N+1)\kappa} \right) \right] - \kappa \phi^*\left(\frac{h_{\hat{\omega}}(x) - \mu}{(N+1)\kappa}\right) > 0.
\end{equation}
Let $s^N_\omega = \frac{h_\omega(\x) - \mu}{\kappa N}$ and $s^{N+1}_\omega = \frac{h_\omega(\x) - \mu}{\kappa (N+1)}$, and note that $s^{N+1}_\omega = \tfrac{N}{N+1} s^N_\omega$.
% Let $|\Delta s| = | s^{N+1}_\omega - s^N_\omega | = \frac{|s^{N+1}_\omega|}{N}$.
The difference $\phi^*(s^N_\omega) - \phi^*(s^{N+1}_\omega)$ will be bounded from below via the gradient inequality.
% Note that $\phi^*(s)$ is nondecreasing because $0 \leq t = \phi^{*\prime}(s)$ for some $t$.
% First, for $s^N_\omega > 0$, $\phi^*(s^N_\omega) - \phi^*(s^{N+1}_\omega) \geq \phi^{*\prime}(s^{N+1}_\omega) |\Delta s|$.
% Then for $s^N_\omega < 0$, $\phi^*(s^{N+1}_\omega) - \phi^*(s^N_\omega) \leq \phi^{*\prime}(s^{N+1}_\omega) |\Delta s|$ and thus $\phi^*(s^N_\omega) - \phi^*(s^{N+1}_\omega) \geq -\phi^{*\prime}(s^{N+1}_\omega) |\Delta s|$.
% Then both cases reduce to $\phi^*(s^N_\omega) - \phi^*(s^{N+1}_\omega) \geq \frac{1}{N} \phi^{*\prime}(s^{N+1}_\omega) s^{N+1}_\omega$.
Because $\phi^*(s)$ is convex, the gradient inequality gives $\phi^*(s^N_\omega) - \phi^*(s^{N+1}_\omega) \geq \frac{1}{N} \phi^{*\prime}(s^{N+1}_\omega) s^{N+1}_\omega$.
Using this inequality, we can guarantee (\ref{eq:raw_cond}) is satisfied with the condition $\kappa \sum_{\omega=1}^n \frac{N_\omega}{N} \phi^{*\prime}(s^{N+1}_\omega) s^{N+1}_\omega - \kappa \phi^*\left(\frac{h_{\hat{\omega}}(\x) - \mu}{(N+1)\kappa}\right) > 0$.
Rearranging and dividing by $\kappa > 0$ gives
\begin{equation*} %\label{eq:main_value_derivation}
\sum_{\omega=1}^n \frac{N_\omega}{N} \phi^{*\prime}(s^{N+1}_\omega) s^{N+1}_\omega > \phi^*(s^{N+1}_{\hat{\omega}}).
\end{equation*}
Finally, we return to the original variables with the substitution $s^{N+1}_\omega = \frac{N}{N+1} s^*_\omega, \forall \omega$.
\Halmos
\end{proof}
%\end{proposition}
We can interpret \eqref{eq:cost_decrease_cond} as follows.
If an additional observation is taken from the unknown distribution and the resulting observed scenario $\hat{\omega}$ satisfies (\ref{eq:cost_decrease_cond}) with the current solution, then the $(N+1)$-observation problem will have a lower cost than the $N$-observation problem that was already solved.
%This is equivalent to saying that an additional observation of $\hat{\omega}$ will rule out the computed worst-case distribution given by $\{p_\omega\}_{\omega=1}^{n}$ in \eqref{eq:p_worst}.
Observe that the statement of Proposition \ref{prop:value} eliminates the case $\lambda_N^* = 0$.
A closer look at (\ref{eq:raw_cond}) and the version of (\ref{eq:raw_cond}) after the use of the gradient inequality, and recalling that $\kappa = \lambda /N$, reveals that the condition (\ref{eq:cost_decrease_cond}) is not satisfied at $\lambda_N^* = 0$.
%\bigskip
% Hence, $\phi^*$ is subdifferentiable on the relative interior of its domain. %%%\citep[Theorem xxx]{rockafellar_70}.
% The assumption that $s^*_\omega$ is finite together with constraint (\ref{eq:plp_feas_constraint}) ensures that $\frac{N}{N+1}s^*_\omega$ is in the relative interior of the domain of $\phi^{*}$; so a subgradient exists.
%\bigskip
It is possible to simplify condition \eqref{eq:cost_decrease_cond} for some $\phi$-divergences, and we detail this in the corollary below.
Condition \eqref{eq:cost_decrease_cond} uses the dual formulation and the (sub)gradient $\phi^{* \prime}$.
This allows us to use the primal-dual relationship $\left(p_{\hat{\omega}}^* = \phi^{* \prime}(s_{\hat{\omega}}^*) q_{\hat{\omega}}\right)$ to provide simplified conditions in terms of $\q$, $\p^*$, and $N$.
\begin{corollary}
\label{cor:cost_decrease_trick}
Let $\p^*=(p_1^*, p_2^*,\ldots, p_n^*)$ solve the $N$-observation (primal) problem with $q_\omega = \tfrac{N_\omega}{N}$.
An additional observation of scenario $\hat{\omega}$ will result in a decrease in the worst-case expected cost of \plp\ if the condition corresponding to the chosen $\phi$-divergence below is satisfied:\vspace*{-0.1in}
\begin{multicols}{2}
\begin{description}
\item[Burg entropy:] $\frac{p_{\hat{\omega}}^*}{q_{\hat{\omega}}} < \frac{N}{N+1}$, %(or, $p_{\hat{\omega}}<\frac{N_{\hat{\omega}}}{N}$)
\item[$\chi^2$-distance:] $\sum_\omega \frac{(q_\omega)^2}{p_\omega^*} + \sqrt{\frac{N+1}{N}} < 2 \frac{q_{\hat{\omega}}}{p_{\hat{\omega}}^*}$,
\item[Hellinger:] $\sum_\omega q_\omega \sqrt{\frac{p_\omega^*}{q_\omega}} + \sqrt{\frac{p_{\hat{\omega}}^*}{q_{\hat{\omega}}}} < 2 \frac{N}{N+1}$,
\item[Modified $\chi^2$:] $2 \sum_\omega \frac{(p_\omega^*)^2}{q_\omega} > \left(\frac{p_{\hat{\omega}}^*}{q_{\hat{\omega}}}\right)^2 + \left(\frac{N+1}{N}\right)^2$.
\end{description}
\end{multicols}
\end{corollary}
\begin{proof}{\sc Proof of Corollary \ref{cor:cost_decrease_trick}.}
% We use a small trick to transform condition (\ref{eq:cost_decrease_cond}) into a form that is easier to work with.
%Recall that
For any real number $c$, we can define $\phi_c(t) = \phi(t) + c(t-1)$, which satisfies $I_{\phi_c}(\p,\q) = I_\phi(\p,\q)$ for probability vectors $\p$ and $\q$.
This changes the conjugate as $\phi_c^*(s) = \phi^*(s-c) + c$.
For some $\phi$, we can choose $c$ such that $\phi_c^{*\prime}(s)$ is separable, i.e., $\phi_c^{*\prime}(as) = f(a) \phi_c^{*\prime}(s)$ for some function $f$.
Using this separability, we can simplify (\ref{eq:cost_decrease_cond}) for some $\phi$ by choosing:\vspace*{-0.15in}
\begin{multicols}{2}
\begin{description}
\item[Burg entropy:] $c = -1$, so $\phi^{*\prime}_{b,c}(s) = -\frac{1}{s}$, $s<0$,
\item[$\chi^2$-distance:] $c = -1$, so $\phi^{*\prime}_{\chi^2, c}(s) = \frac{1}{\sqrt{-s}}$, $s<0$,
\item[Hellinger:] $c = -1$ so $\phi^{*\prime}_{h,c}(s) = \frac{1}{s^2}$, $s<0$,
\item[Modified $\chi^2$:] $c = 2$, so
\[
\phi^{*\prime}_{m\chi^2, c}(s) = \
\begin{cases}
0 & s < 0 \\
\frac{1}{2} s & s \geq 0.
\end{cases}
\]
\end{description}
\end{multicols}
\noindent We illustrate the rest of the steps using Burg entropy.
Because $I_{\phi_c}(\p,\q) = I_\phi(\p,\q)$, we can equivalently solve \plp\ by using $\phi_c$ instead.
Applying (\ref{eq:cost_decrease_cond}) to $\phi_{b,c}$, after some algebra, we obtain the simplified condition $- \left( \frac{N}{N+1} \right)s_{\hat{\omega}}^* >1$ for Burg entropy.
Property \ref{property:primal_dual_relation} yields the relationship $\frac{p^*_{\hat{\omega}}}{q_{\hat{\omega}}}=\frac{-1}{s_{\hat{\omega}}^*}$.
Substituting this into the simplified condition gives the desired result.
The modified $\chi^{2}$ case must consider the split at $s=0$, which is easily handled.
The left-hand side of (\ref{eq:cost_decrease_cond}) contains terms of the form $\phi^{*\prime}_{m\chi^2, c}(s_{\omega}^*)s_{\omega}^*$.
Both this term and $2\left(\frac{p_{\omega}^*}{q_\omega}\right)^2$ equal $0$ if $s_{\omega}^* < 0$ and $(s_{\omega}^*)^2/2$ otherwise, so the two coincide.
The right-hand side can be handled similarly, resulting in the final condition presented above.
\Halmos
\end{proof}
Let us take a closer look into Corollary~\ref{cor:cost_decrease_trick}'s condition for Burg entropy.
Recall that with Burg entropy, we obtain the likelihood robust optimization model of \cite{wang2010likelihood}.
Furthermore, because the roles of $\p$ and $\q$ are interchanged, some authors refer to it as the KL divergence.
In this case, we have the condition $p_{\hat{\omega}}^* < \frac{N}{N+1}q_{\hat{\omega}}$.
This means that the \plp\ has assigned a worst-case probability $p_{\hat{\omega}}^*$ that is less than the slightly adjusted observed frequency $q_{\hat{\omega}}=\frac{N_{\hat{\omega}}}{N}$ of scenario $\hat{\omega}$.
The \plp\ focuses on the worst-case cost within the ambiguity set.
Therefore, it tends to assign higher probabilities $p_\omega^*$ to costly
scenarios.
Because $p_{\hat{\omega}}^* < \frac{N}{N+1}q_{\hat{\omega}}$, the condition in Corollary~\ref{cor:cost_decrease_trick} suggests that $\hat{\omega}$ might not be a very costly scenario.
Observing this scenario one more time makes it more likely under the nominal (and, we believe, the true) distribution.
Consequently, a lower-cost scenario being more likely would decrease the optimal cost.
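As a minimal illustration, the sketch below (in Python, with hypothetical $\p^*$, $\q$, and $N$) flags the scenarios that satisfy the Burg-entropy condition of Corollary~\ref{cor:cost_decrease_trick}.
\begin{verbatim}
import numpy as np

def burg_good_scenarios(p_star, q, N):
    """Indices omega-hat with p*/q < N/(N+1): one more observation of such a
    scenario is guaranteed to lower the worst-case cost (Burg entropy case)."""
    p_star, q = np.asarray(p_star), np.asarray(q)
    return np.where(p_star / q < N / (N + 1.0))[0]

# Hypothetical worst-case and empirical distributions, N = 20 observations.
p_star = np.array([0.15, 0.30, 0.55])
q      = np.array([0.25, 0.35, 0.40])
print(burg_good_scenarios(p_star, q, N=20))
\end{verbatim}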
The simple conditions in Proposition~\ref{prop:value} and Corollary~\ref{cor:cost_decrease_trick} provide insight into different scenarios for a decision maker.
Let $L = \left\{ \hat{\omega} : \sum_{\omega=1}^n q_\omega \phi^{*\prime}\left(\frac{N}{N+1}s^*_\omega\right) \left(\frac{N}{N+1}s^*_\omega\right) > \phi^*\left(\frac{N}{N+1}s^*_{\hat{\omega}}\right) \right\}$.
%That is, $L$ gives the set of scenarios that, if sampled one more observation, would result in a decrease in the optimal cost in \plp.
Set $L$ divides the scenarios into two groups: observing one more sample of a scenario in $L$ guarantees a drop in the overall cost.
Therefore, these scenarios can be considered `good' or `optimistic' scenarios.
Note that scenarios not in $L$ can also result in a cost decrease because condition \eqref{eq:cost_decrease_cond} is sufficient but not necessary.
The numerical experiments in Section \ref{sec:comp_results} suggest that $L$ is an adequate indicator of `good' scenarios for our test problem.
Finally, one might be interested in obtaining a lower bound on the probability that the next sample will decrease the optimal cost.
An approximate lower bound on the probability of selecting a sample in $L$ can be found by solving
\begin{equation}
\min_{r} \left\{ \sum_{\omega \in L} r_\omega \colon\ r \in \mathcal{P} \right\}. \label{eq:lb_probability}
\end{equation}
Because we do not know the true distribution, we find the minimum probability of the scenarios in $L$ within the ambiguity set defining \plp.
One can solve (\ref{eq:lb_probability}) by taking its dual.
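Alternatively, (\ref{eq:lb_probability}) can be handed directly to an off-the-shelf nonlinear solver. The sketch below (in Python, using the modified $\chi^2$-divergence and hypothetical $\q$, $L$, and radius) is a minimal illustration of this computation and is not meant to replace the dual approach.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def prob_lower_bound(L, q, rho, phi):
    """Minimize sum_{w in L} r_w over the phi-divergence ball of radius rho
    around q, i.e., problem (eq:lb_probability)."""
    n = len(q)
    c = np.zeros(n)
    c[list(L)] = 1.0
    cons = [{'type': 'eq',   'fun': lambda r: np.sum(r) - 1.0},
            {'type': 'ineq', 'fun': lambda r: rho - np.sum(q * phi(r / q))}]
    res = minimize(lambda r: c @ r, x0=np.array(q, dtype=float),
                   bounds=[(0.0, 1.0)] * n, constraints=cons, method='SLSQP')
    return res.fun

# Hypothetical data: modified chi^2 divergence phi(t) = (t-1)^2, radius rho.
q = np.array([0.25, 0.35, 0.40])
print(prob_lower_bound(L={0, 1}, q=q, rho=0.05, phi=lambda t: (t - 1.0) ** 2))
\end{verbatim}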
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Asymptotic Analysis}
\label{ssec:epiconvergence}
We now show that \plp\ behaves essentially the same as the corresponding SLP-2 with the (unknown) true distribution $\qtrue$ as $N\rightarrow \infty$.
This requires that the sequence of nominal probabilities $\q$ converge to $\qtrue$ with probability one (w.p.1) uniformly in $\omega$---a condition satisfied by the assumed empirical probabilities under mild assumptions.
To emphasize the nominal distribution's dependence on $N$, we use $\q^N$ in this section.
We begin by showing that the worst-case probabilities $\p^*$ obtained by solving \plp\ have a similar asymptotic behavior as $\q^N$.
\begin{proposition} \label{prop:weak_conv}
Suppose $\phi(t) \geq 0$ has a unique root at $t = 1$ and the observations are independent and identically distributed from a distribution with probability mass function $\qtrue$.
%in a way that $q^N_\omega$ obeys the Strong Law of Large Numbers (SLLN) for each $\omega$.
Then, w.p.1, $\sup_\omega |p^*_{\omega} - q^{\text{true}}_\omega| \rightarrow 0$ as $N \rightarrow \infty$.
%Let $p \neq \qtrue$.
\end{proposition}
\begin{proof}{\sc Proof of Proposition \ref{prop:weak_conv}.}
Because $q_\omega^N=\frac{N_\omega}{N}$ obeys the strong law of large numbers and $\omega =1,\ldots,n$ is a finite set, we have uniform convergence over $\omega$.
That is, $\sup_\omega |q^N_{\omega} - q^{\text{true}}_\omega| \rightarrow 0$ as $N \rightarrow \infty$, w.p.1.
Now, on a sample path that this convergence occurs, we will show that for all $\epsilon > 0$, there exists $N'$ (depending on the sample path) such that $\forall N \geq N'$ (on that path), $I_{\phi}(\p^*,\q^N) \leq \frac{\rho_0}{N}$ implies $\sup_\omega |p^*_\omega - q^{\text{true}}_\omega| \leq \epsilon$.
Because such sample paths have measure 1, our desired result will occur w.p.1.
Below, for simplicity of notation, we suppress the dependence on the particular sample path.
First, note that $\sup_\omega |p_\omega^* - q^{\text{true}}_\omega| \leq \sup_\omega |p_\omega^* - q^N_\omega| + \sup_\omega |q^N_\omega - q^{\text{true}}_\omega|$.
Assume, again for simplicity, $\epsilon$ is chosen so that $\min_{\omega} q^{\text{true}}_\omega > \frac{\epsilon}{2}$.
Let $N''$ be such that $\sup_\omega |q^N_\omega - q^{\text{true}}_\omega| \leq \frac{\epsilon}{2}$ for all $N \geq N''$.
This implies $q^N_\omega>0$ for all $N \geq N''$ for all $\omega$.
Now suppose $\sup_\omega |p_\omega^* - q^N_\omega| > \frac{\epsilon}{2}$.
Then, for at least one $\omega$---let's denote it $\ob$---we have either $p_{\ob}^* >q_{\ob}^N + \frac{\epsilon}{2}$ or $p_{\ob}^* < q_{\ob}^N - \frac{\epsilon}{2}$.
In either case, because $\phi(t)\geq 0$ is a convex function with a root at $t=1$, we can say for $\ob$
$$
\phi \left( \frac{p_{\ob}^*}{q^N_{\ob}} \right) \geq \min\left\{ \phi\left( \frac{q^N_{\ob}+\tfrac{\epsilon}{2}}{q^N_{\ob}} \right), \phi\left( \frac{q^N_{\ob}-\tfrac{\epsilon}{2}}{q^N_{\ob}} \right) \right\} \geq \min\left\{ \phi\left( 1+\frac{\epsilon}{2}\right), \phi\left( 1-\frac{\epsilon}{2}\right) \right\}.
$$
The last inequality follows from the facts that $\frac{a+\frac{\epsilon}{2}}{a} \geq 1+ \frac{\epsilon}{2}$ and $\frac{a-\frac{\epsilon}{2}}{a} \leq 1- \frac{\epsilon}{2}$ for $0<a\leq 1$, together with the monotonicity of $\phi$ on either side of $t=1$.
Putting this all together,
% under the set forth assumptions, we have
\begin{align}
I_{\phi}(\p^*,\q^N) & = \sum_{\omega=1}^n q^N_\omega \phi\left( \frac{p_\omega^*}{q^N_\omega} \right) \nonumber \\
% & = \bar{s} \mathbb{I}_{\bar{s} < \infty} \sum_{\omega \in Z} p_\omega + \sum_{\omega \notin Z} q^N_\omega \phi\left( \frac{p_\omega}{q^N_\omega} \right) \nonumber \\
& \geq \min_{\omega} \{q^N_\omega\} \cdot \phi \left( \frac{p_{\ob}^*}{q^N_{\ob}} \right) \nonumber \\
% & \geq \bar{s} \mathbb{I}_{\bar{s} < \infty} \sum_{\omega \in Z} p_\omega + \min_{\omega \notin Z} \{q^N_\omega\} \cdot \min\left\{ \phi\left(1+\frac{\epsilon}{2}\right), \phi\left(1-\frac{\epsilon}{2}\right) \right\} \label{eq:asymptotic_proof_phi_substitution} \\
& \geq \min_{\omega} \left\{ q^{\text{true}}_\omega - \frac{\epsilon}{2} \right\} \cdot \min\left\{ \phi\left(1+\frac{\epsilon}{2}\right), \phi\left(1-\frac{\epsilon}{2}\right) \right\} \label{eq:last}.
\end{align}
The right-hand side of (\ref{eq:last}) is positive because $\phi$ has a unique root at $t=1$.
By choosing $N'\geq N''$ to satisfy $\min_{\omega} \left\{ q^{\text{true}}_\omega - \frac{\epsilon}{2} \right\} \cdot \min\left\{ \phi\left(1+\frac{\epsilon}{2}\right), \phi\left(1-\frac{\epsilon}{2}\right) \right\} > \frac{\rho_0}{N'}$, we see that $\sup_\omega |p_\omega^* - q^N_\omega| > \frac{\epsilon}{2}$ implies $I_\phi(\p^*,\q^N) > \frac{\rho_0}{N}$ for all $N \geq N'$.
Because the $\phi$-divergence constraint ensures $I_\phi(\p^*,\q^N) \leq \frac{\rho_0}{N}$, we must have $\sup_\omega |p_\omega^* - q^N_\omega| \leq \frac{\epsilon}{2}$ for all $N \geq N'$, and the desired result follows from the triangle inequality above.
\Halmos
\end{proof}
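The choice of $N'$ in the last step can be made concrete. The following sketch (in Python, with hypothetical $\qtrue$, $\epsilon$, and $\rho_0$, and assuming Burg entropy for $\phi$) computes the threshold beyond which the radius $\rho_0/N$ falls below the lower bound in (\ref{eq:last}).
\begin{verbatim}
import numpy as np

# Burg entropy phi(t) = -log t + t - 1 (an assumed choice for illustration).
phi = lambda t: -np.log(t) + t - 1.0

q_true = np.array([0.2, 0.3, 0.5])   # hypothetical true distribution
eps, rho0 = 0.1, 1.0                 # hypothetical tolerance and radius constant

# Any p with sup_w |p_w - q^N_w| > eps/2 has I_phi(p, q^N) at least `lower',
# once q^N is within eps/2 of q_true (the bound in the proof):
lower = np.min(q_true - eps / 2) * min(phi(1 + eps / 2), phi(1 - eps / 2))

# Smallest N' with rho0 / N' strictly below that bound; for N >= N' the
# divergence-ball constraint forces sup_w |p*_w - q^N_w| <= eps/2.
N_prime = int(np.ceil(rho0 / lower)) + 1
print(lower, N_prime)
\end{verbatim}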
We are now ready to present the main result on the asymptotic behavior of \plp.
\begin{theorem}
\label{thm:epiconvergence}
% Assume $X \neq \emptyset$ is compact and problem (\ref{eq:slp_second_stage}) is primal and dual feasible for all $\omega$ and $\x \in X$.
Suppose $\phi(t) \geq 0$ has a unique root at $t = 1$ and the observations are independent and identically distributed from a distribution with probability mass function $\qtrue$.
Then, the optimal value of \plp\ given in (\ref{eq:plp_two_stage}) converges to that of SLP-2 given in (\ref{eq:slp_first_stage}) with $\qtrue$, and all limit points of the solutions $\x^*$ of \plp\ solve SLP-2 with $\qtrue$ as $N \rightarrow \infty$, w.p.1.
\end{theorem}
\begin{proof}{\sc Proof of Theorem \ref{thm:epiconvergence}.}
The assumptions on problems (\ref{eq:slp_second_stage}) and the set $X$ stated in Section \ref{ssec:basicprop} ensure that $\sup_{\omega, \x \in X}|h_\omega(\x)|<C$ for some constant $C<\infty$ and that SLP-2 with $\qtrue$ and \plp\ have finite optimal solutions w.p.1.
Let $f(\x,\omega)=\c\x + h_\omega(\x)$.
View the objective of \plp\ as $\ee{\p^*}{f(\x,\omega)}=\sum_{\omega=1}^{n}p^*_{\omega,N}(\x) f(\x,\omega)$ and the objective of SLP-2 with $\qtrue$ as $\ee{\qtrue}{f(\x,\omega)}=\sum_{\omega=1}^{n}q^{\text{true}}_\omega f(\x,\omega)$.
Here, $\p^*$ depends on the number of observations $N$, the actual observations collected, and also $\x$.
So, we use the longer notation $p^*_{\omega,N}(\x)$ inside the summation for clarity.
Following the same arguments as in the proof of Proposition \ref{prop:weak_conv}, we have for each $\x \in X$, $\sup_\omega |p^*_{\omega, N}(\x) - q^{\text{true}}_\omega| \rightarrow 0$ as $N \rightarrow \infty$, w.p.1.
Using this result, we can obtain a pointwise strong law of large numbers (SLLN)---that is, for each $\x \in X$, $|\ee{\p^*}{f(\x,\omega)}-\ee{\qtrue}{f(\x,\omega)}|\rightarrow 0$ as $N\rightarrow \infty$ w.p.1.
Because $X$ is convex and compact, $f(\x,\omega)$ is convex and continuous on $X$ for all $\omega$, and $\ee{\qtrue}{f(\x,\omega)}$ is finite valued and continuous on $X$, the desired result follows (see, e.g., Theorem 4 of \cite{shapiro_03}).
\Halmos
\end{proof}
The non-negativity condition on $\phi$ in Proposition \ref{prop:weak_conv} and Theorem \ref{thm:epiconvergence} is satisfied by every divergence in Table \ref{tb:phi_definitions} except Likelihood (which, however, can be rewritten as Burg entropy).
The unique root requirement, on the other hand, is violated for the special cases to be introduced in Section \ref{ssec:special_phi}.
%Proposition \ref{prop:weak_conv} implies that the worst-case distributions of (\ref{eq:plp_primal}) converge weakly to $\qtrue$, which is used to show the desired result below.
%In the next theorem, we establish the proof that the optimal value and solution of \plp\ converges to that of the SLP-2 with distribution $\qtrue$ by establishing the epiconvergence of \plp\ to SLP-2.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{A Classification of $\phi$-Divergences}
\label{sec:classification}
Given that there are many $\phi$-divergences to choose from, it is important to study how $\phi$-divergences act within an ambiguous (or, distributionally robust) stochastic optimization model.
We present a classification of $\phi$-divergences into four types, resulting from an examination of the limiting behavior of $\phi(t)$ as $t \searrow 0$ and $t \nearrow \infty$.
We begin in Section~\ref{ssec:suppressandpop} by defining two behaviors---suppressing and popping of scenarios---that result in our main classification.
Additional details on these behaviors along with a subclassification follow in Sections~\ref{ssec:suppress} and \ref{ssec:pop}.
Different classes of $\phi$-divergences may be suitable for different problem types and desired qualities in the ambiguous model.
We discuss modeling considerations with respect to our classification in Section \ref{ssec:modeling}, and we demonstrate the classification on risk models in Section \ref{ssec:special_phi}.
Throughout this section, we assume $0<\rho<\infty$, unless otherwise stated.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Suppressing and Popping of Scenarios: A Main Classification}
\label{ssec:suppressandpop}
Recall the definition of the ambiguity set $\mathcal{P}$, in particular, the $\phi$-divergence constraint
\[
\sum_{\omega = 1}^{n} q_\omega \phi\left(\frac{p_\omega}{q_\omega}\right) \leq \rho.
\]
Above, $\phi$ has arguments given by ratios of probabilities, $\tfrac{p_\omega}{q_\omega}$.
The case $q_\omega > 0$, $p_\omega = 0$ means that a scenario with a positive nominal probability has been assigned zero probability by the ambiguous counterpart problem.
This case corresponds to the limit of $\phi$ as $t \searrow 0$.
On the other hand, the case $q_\omega = 0$, $p_\omega > 0$ could mean that scenario $\omega$ has never been observed before---so it has a zero probability in the nominal distribution---but the ambiguous counterpart problem has assigned it a positive probability.
Recall that, by definition, $0 \phi(a/0) = a \lim_{t \rightarrow \infty} \frac{\phi(t)}{t}$, and so in the latter case, we need to examine $\lim_{t \nearrow \infty} \frac{\phi(t)}{t}$.
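Computationally, these conventions can be implemented directly. The sketch below (in Python, with hypothetical $\p$ and $\q$, and using Burg entropy, for which $\lim_{t \rightarrow \infty} \phi(t)/t = 1$) evaluates $I_\phi(\p,\q)$ accordingly.
\begin{verbatim}
import numpy as np

def phi_divergence(p, q, phi, phi_slope_at_inf):
    """I_phi(p, q) with the conventions 0*phi(0/0) = 0 and
    0*phi(a/0) = a * lim_{t -> inf} phi(t)/t."""
    total = 0.0
    for pw, qw in zip(p, q):
        if qw > 0:
            total += qw * phi(pw / qw)
        elif pw > 0:                 # q_w = 0 but p_w > 0
            total += pw * phi_slope_at_inf
        # q_w = p_w = 0 contributes nothing
    return total

# Burg entropy: phi(t) = -log t + t - 1, with lim_{t->inf} phi(t)/t = 1.
burg = lambda t: -np.log(t) + t - 1.0
print(phi_divergence([0.2, 0.3, 0.5], [0.25, 0.0, 0.75], burg, 1.0))
\end{verbatim}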
Consider each of the limiting cases in more detail:
\begin{itemize}
\item {\sc Case 1:} ($q_\omega > 0$ but $p_\omega = 0$)
We call this the ``{\bf Suppressing}'' behavior because a scenario with a positive probability in the nominal distribution can take zero probability in the ambiguous problem.