forked from matthewjdenny/GERGM
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathgergm.Rd
302 lines (262 loc) · 15.7 KB
/
gergm.Rd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/gergm.R
\name{gergm}
\alias{gergm}
\title{A Function to estimate a GERGM.}
\usage{
gergm(formula, covariate_data = NULL, normalization_type = c("log",
"division"), network_is_directed = TRUE, use_MPLE_only = c(FALSE, TRUE),
transformation_type = c("Cauchy", "LogCauchy", "Gaussian", "LogNormal"),
estimation_method = c("Metropolis", "Gibbs"),
maximum_number_of_lambda_updates = 10,
maximum_number_of_theta_updates = 10,
number_of_networks_to_simulate = 500, thin = 1, proposal_variance = 0.1,
downweight_statistics_together = TRUE, MCMC_burnin = 100, seed = 123,
convergence_tolerance = 0.5, MPLE_gain_factor = 0,
acceptable_fit_p_value_threshold = 0.05, force_x_theta_updates = 1,
force_x_lambda_updates = 1, output_directory = NULL, output_name = NULL,
generate_plots = TRUE, verbose = TRUE,
hyperparameter_optimization = FALSE, stop_for_degeneracy = FALSE,
target_accept_rate = 0.25, theta_grid_optimization_list = NULL,
beta_correlation_model = FALSE, weighted_MPLE = FALSE,
fine_grained_pv_optimization = FALSE, parallel = FALSE,
parallel_statistic_calculation = FALSE, cores = 1,
use_stochastic_MH = FALSE, stochastic_MH_proportion = 0.25,
estimate_model = TRUE, slackr_integration_list = NULL, ...)
}
\arguments{
\item{formula}{A formula object that specifies the relationship between
statistics and the observed network. Currently, the user may specify a model
using any combination of the following statistics: `out2stars(alpha = 1)`,
`in2stars(alpha = 1)`, `ctriads(alpha = 1)`, `mutual(alpha = 1)`,
`ttriads(alpha = 1)`, `absdiff(covariate = "MyCov")`,
`sender(covariate = "MyCov")`,
`reciever(covariate = "MyCov")`, `nodematch(covariate)`,
`nodemix(covariate, base = "MyBase")`, `netcov(network)` and
`edges(alpha = 1, method = c("regression","endogenous"))`. If the user
specifies `nodemix(covariate, base = NULL)`, then all levels of the
covariate will be matched on. Note that the `edges` term must be specified if
the user wishes to include an intercept
(strongly recommended). The user may select the "regression" method
(default) to include an intercept in the lambda transformation of the
network, or "endogenous" to include the intercept as in a traditional ERGM
model. To use exponential downweighting for any of the network level terms,
simply specify a value for alpha less than 1. The `(alpha = 1)` term may be
omitted from the structural terms if no exponential downweighting is
required. In this case, the terms may be provided as: `out2star`, `in2star`,
`ctriads`, `recip`, `ttriads`. If the network is undirected the user may only
specify the following terms: `twostars(alpha = 1)`, `ttriads(alpha = 1)`,
`absdiff(covariate = "MyCov")`,
`sender(covariate = "MyCov")`, `nodematch(covariate)`,
`nodemix(covariate, base = "MyBase")`, `netcov(network)` and
`edges(alpha = 1, method = c("regression","endogenous"))`. In some cases,
the user may only wish to calculate endogenous statistics for edges between
some subset of the nodes in the network. For each of the endogenous
statistics, the user may optionally specify a `covariate` and `base` field
such as in `in2stars(covariate = "Type", base = "C")`. This will add an
in-2star statistic for the subnetwork defined by actors who match each level
of the categorical variable "Type" (in this example), and exclude the
subnetwork for type "C", if the `base` argument is provided. If the `base`
argument is excluded, then terms will be added to the model for all levels of
the statistic. This can be a useful option if the user believes that a
network property varies with some property of nodes.}
\item{covariate_data}{A data frame containing node level covariates the user
wished to transform into sender or reciever effects. It must have row names
that match every entry in colnames(raw_network), should have descriptive
column names. If left NULL, then no sender or reciever effects will be
added.}
\item{normalization_type}{If only a raw_network is provided the function
will automatically check to determine if all edges fall in the [0,1] interval.
If edges are determined to fall outside of this interval, then a trasformation
onto the interval may be specified. If "division" is selected, then the data
will have a value added to them such that the minimum value is atleast zero
(if necessary) and then all edge values will be divided by the maximum to
ensure that the maximum value is in [0,1]. If "log" is selected, then the data
will have a value added to them such that the minimum value is atleast zero
(if necessary), then 1 will be added to all edge values before they are logged
and then divided by the largest value, again ensuring that the resulting
network is on [0,1]. Defaults to "log" and need not be set to NULL if
providing covariates as it will be ignored.}
\item{network_is_directed}{Logical specifying whether or not the observed
network is directed. Default is TRUE.}
\item{use_MPLE_only}{Logical specifying whether or not only the maximum pseudo
likelihood estimates should be obtained. In this case, no simulations will be
performed. Default is FALSE.}
\item{transformation_type}{Specifies how covariates are transformed onto the
raw network. When working with heavly tailed data that are not strictly
positive, select "Cauchy" to transform the data using a Cauchy distribution.
If data are strictly positive and heavy tailed (such as financial data) it is
suggested the user select "LogCauchy" to perform a Log-Cauchy transformation
of the data. For a tranformation of the data using a Gaussian distribution,
select "Gaussian" and for strictly positive raw networks, select "LogNormal".
The Default value is "Cauchy".}
\item{estimation_method}{Simulation method for MCMC estimation. Default is
"Metropolis", which allows for the most flexible model specifications, but
may also be set to "Gibbs", if the user wishes to use Gibbs sampling.}
\item{maximum_number_of_lambda_updates}{Maximum number of iterations of outer
MCMC loop which alternately estimates transform parameters and ERGM
parameters. In the case that data_transformation = NULL, this argument is
ignored. Default is 10.}
\item{maximum_number_of_theta_updates}{Maximum number of iterations within the
MCMC inner loop which estimates the ERGM parameters. Default is 100.}
\item{number_of_networks_to_simulate}{Number of simulations generated for
estimation via MCMC. Default is 500.}
\item{thin}{The proportion of samples that are kept from each simulation. For
example, thin = 1/200 will keep every 200th network in the overall simulated
sample. Default is 1.}
\item{proposal_variance}{The variance specified for the Metropolis Hastings
simulation method. This parameter is inversely proportional to the average
acceptance rate of the M-H sampler and should be adjusted so that the average
acceptance rate is approximately 0.25. Default is 0.1.}
\item{downweight_statistics_together}{Logical specifying whether or not the
weights should be applied inside or outside the sum. Default is TRUE and user
should not select FALSE under normal circumstances.}
\item{MCMC_burnin}{Number of samples from the MCMC simulation procedure that
will be discarded before drawing the samples used for estimation.
Default is 100.}
\item{seed}{Seed used for reproducibility. Default is 123.}
\item{convergence_tolerance}{Threshold designated for stopping criterion. If
the difference of parameter estimates from one iteration to the next all have
a p -value (under a paired t-test) greater than this value, the parameter
estimates are declared to have converged. Default is 0.5, which is quite
conservative.}
\item{MPLE_gain_factor}{Multiplicative constant between 0 and 1 that controls
how far away the initial theta estimates will be from the standard MPLEs via
a one step Fisher update. In the case of strongly dependent data, it is
suggested to use a value of 0.10. Default is 0.}
\item{acceptable_fit_p_value_threshold}{A p-value threshold for how closely
statistics of observed network conform to statistics of networks simulated
from GERGM parameterized by converged final parameter estimates. Default value
is 0.05.}
\item{force_x_theta_updates}{Defaults to 1 where theta estimation is not
allowed to converge until thetas have updated for x iterations . Useful when
model is not degenerate but simulated statistics do not match observed network
well when algorithm stops after first y updates.}
\item{force_x_lambda_updates}{Defaults to 1 where lambda estimation is not
allowed to converge until lambdas have updated for x iterations . Useful when
model is not degenerate but simulated statistics do not match observed network
well when algorithm stops after first y updates.}
\item{output_directory}{The directory where you would like output generated
by the GERGM estimation proceedure to be saved (if output_name is specified).
This includes, GOF, trace, and parameter estimate plots, as well as a summary
of the estimation proceedure and an .Rdata file containing the GERGM object
returned by this function. May be left as NULL if the user would prefer all
plots be printed to the graphics device.}
\item{output_name}{The common name stem you would like to assign to all
objects output by the gergm function. Default value of NULL will not save any
output directly to .pdf files, it will be printed to the console instead. Must
be a character string or NULL. For example, if "Test" is supplied as the
output_name, then 4 files will be output: "Test_GOF.pdf", "Test_Parameter_Estim
ates.pdf", "Test_GERGM_Object.Rdata", "Test_Estimation_Log.txt", and
"Test_Trace_Plot.pdf"}
\item{generate_plots}{Defaults to TRUE, if FALSE, then no diagnostic or
parameter plots are generated.}
\item{verbose}{Defaults to TRUE (providing lots of output while model is
running). Can be set to FALSE if the user wishes to see less output.}
\item{hyperparameter_optimization}{Logical indicating whether automatic
hyperparameter optimization should be used. Defaults to FALSE. If TRUE, then
the algorithm will automatically seek to find an optimal burnin and number of
networks to simulate, and if using Metropolis Hasings, will attempt to select
a proposal variance that leads to a acceptance rate within +-0.05 of
target_accept_rate. Furthermore, if degeneracy is detected, the algorithm
will attempt to adress the issue automatically. WARNING: This feature is
experimental, and may greatly increase runtime. Please monitor console
output!}
\item{stop_for_degeneracy}{When TRUE, automatically stops estimation when
degeneracy is detected, even when hyperparameter_optimization is set to TRUE.
Defaults to FALSE.}
\item{target_accept_rate}{The target Metropolis Hastings acceptance rate.
Defaults to 0.25}
\item{theta_grid_optimization_list}{Defaults to NULL. This highly
experimental feature may allow the user to address model degeneracy arising
from a suboptimal theta initialization. It performs a grid search around the
theta values calculated via MPLE to select a potentially improved
initialization. The runtime complexity of this feature grows exponentially in
the size of the grid and number of parameters -- use with great care. This
feature may only be used if hyperparameter_optimization = TRUE, and if a list
object of the following form is provided: list(grid_steps = 2,
step_size = 0.5, cores = 2, iteration_fraction = 0.5). grid_steps indicates
the number of steps out the grid search will perform, step_size indicates the
fraction of the MPLE theta estimate that each grid search step will change by,
cores indicates the number of cores to be used for parallel optimization, and
iteration_fraction indicates the fraction of the number of MCMC iterations
that will be used for each grid point (should be set less than 1 to speed up
optimization). In general grid_steps should be smaller the more structural
parameters the user wishes to specify. For example, with 5 structural
parameters (mutual, ttriads, etc.), grid_steps = 3 will result in a (2*3+1)^5
= 16807 parameter grid search. Again this feature is highly experimental and
should only be used as a last resort (after playing with exponential
downweighting and the MPLE_gain_factor).}
\item{beta_correlation_model}{Defaults to FALSE. If TRUE, then the beta
correlation model is estiamted. A correlation network must be provided, but
all covariates and undirected statistics may be supplied as normal.}
\item{weighted_MPLE}{Defaults to FALSE. Should be used whenever the user is
specifying statistics with alpha downweighting. Tends to provide better
initialization when downweight_statistics_together = FALSE.}
\item{fine_grained_pv_optimization}{Logical indicating whether fine grained
proposal variance optimization should be used. This will often slow down
proposal variance optimization, but may provide better results. Highly
recommended if running a correlation model.}
\item{parallel}{Logical indicating whether the weighted MPLE objective and any
other operations that can be easily paralllelized should be calculated in
parallel. Defaults to FALSE. If TRUE, a significant speedup in computation
may be possible.}
\item{parallel_statistic_calculation}{Logical indicating whether network
statistics should be calculated in parallel. This will tend to be slower for
networks with les than ~30 nodes but may provide a substantial speedup for
larger networks.}
\item{cores}{Numeric value defaulting to 1. Can be set to any number up to the
number of threads/cores available on your machine. Will be used to speed up
computations if parllel = TRUE.}
\item{use_stochastic_MH}{A logical indicating whether a stochastic approximation
to the h statistics should be used under Metropolis Hastings in-between
thinned samples. This may dramatically speed up estimation. Defualts to FALSE.
HIGHLY EXPERIMENTAL!}
\item{stochastic_MH_proportion}{Percentage of dyads/triads to use for
approximation, defaults to 0.25.}
\item{estimate_model}{Logical indicating whether a model should be estimated.
Defaults to TRUE, but can be set to FALSE if the user simply wishes to return
a GERGM object containing the model specification. Useful for debugging.}
\item{slackr_integration_list}{An optional list object that contains
information necessary to provide updates about model fitting progress to a
Slack channel (https://slack.com/). This can be useful if models take a long
time to run, and you wish to receive updates on their progress (or if they
become degenerate). The list object must be of the following form:
list(model_name = "descriptive model name", channel = "#yourchannelname",
incoming_webhook_url = "https://hooks.slack.com/services/XX/YY/ZZ"). You
will need to set up incoming webhook integration for your slack channel and
then paste in the URL you get from slack into the incoming_webhook_url field.
If all goes well, and the computer you are running the GERGM estimation on
has internet access, your slack channel will receive updates when you start
estimation, after each lambda/theta parameter update, if the model becomes
degenerate, and when it completes running.}
\item{...}{Optional arguments, currently unsupported.}
}
\value{
A gergm object containing parameter estimates.
}
\description{
The main function provided by the package.
}
\examples{
\dontrun{
set.seed(12345)
net <- matrix(rnorm(100,0,20),10,10)
colnames(net) <- rownames(net) <- letters[1:10]
formula <- net ~ mutual + ttriads
test <- gergm(formula,
normalization_type = "division",
network_is_directed = TRUE,
use_MPLE_only = FALSE,
estimation_method = "Metropolis",
number_of_networks_to_simulate = 40000,
thin = 1/10,
proposal_variance = 0.5,
downweight_statistics_together = TRUE,
MCMC_burnin = 10000,
seed = 456,
convergence_tolerance = 0.01,
MPLE_gain_factor = 0,
force_x_theta_updates = 4)
}
}