Cluster solutions issues #20

Open
dill opened this issue Dec 8, 2015 · 15 comments
@dill
Contributor

dill commented Dec 8, 2015

@erex reports the following problem with the cluster example data in #19:

[1] "C:/Users/eric/readdst-workshop/D70Cluster solutions"
Loading required package: RODBC
Error in data.frame(Region.Label = names(Effort.by.region), CoveredArea = CoveredArea,  : 
  arguments imply differing number of rows: 2, 0
Called from: data.frame(Region.Label = names(Effort.by.region), CoveredArea = CoveredArea, 
    Effort = as.vector(Effort.by.region))
@erex
Member

erex commented Dec 8, 2015

The pellet survey project has 8 strata plus global-level multipliers.

Cluster solutions is exactly the minke data, with the exception of an invented column containing categorical cluster size (groups of size 1, size 2 and more than size 2).

@erex
Member

erex commented Dec 8, 2015

I'm now leaving to catch the evening bus for home, can check back later.

@dill changed the title from "Cluster solutions issue" to "Cluster solutions issues" on Dec 8, 2015
@dill
Contributor Author

dill commented Dec 8, 2015

The issue with the cluster data here was that the merge used to get the units conversion is case sensitive, and within a single project Distance stores the string for nautical miles as both "Nautical miles" and "nautical miles " (note the differing case and the trailing space).

This should be fixed in 8820620.
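
For anyone hitting something similar, here is a minimal sketch of how inconsistent unit strings can empty a merge, and one way to guard against it. This is not the code from the commit above: the table layout, the field names and the `normalise_unit` helper are all hypothetical, and only the two unit strings come from this issue. An empty lookup of this kind may be what produces a `2, 0` row mismatch like the one in the error message.

```r
# Two spellings of the same unit, as reported in this issue
project_units <- data.frame(
  Field = c("Effort", "Distance"),   # hypothetical field names
  Unit  = c("Nautical miles", "nautical miles "),
  stringsAsFactors = FALSE
)

# Hypothetical conversion lookup (1 nautical mile = 1852 m)
conversion <- data.frame(
  Unit       = "nautical miles",
  Conversion = 1852,
  stringsAsFactors = FALSE
)

# Exact-match merge: neither spelling matches, so no rows come back
naive <- merge(project_units, conversion, by = "Unit")

# Normalise case and surrounding whitespace on both sides before merging
normalise_unit <- function(x) tolower(trimws(x))
project_units$Unit <- normalise_unit(project_units$Unit)
conversion$Unit    <- normalise_unit(conversion$Unit)
fixed <- merge(project_units, conversion, by = "Unit")

nrow(naive)  # 0
nrow(fixed)  # 2
```

Normalising both sides of the join keeps the lookup table itself free to use whatever spelling it likes.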

Although tests can run, the results are rather different:

         Statistic Distance_value   mrds_value  Rel_diff Pass
1                n    8.80000e+01  8.80000e+01 0.0000000    ✓
2       parameters    2.00000e+00  2.00000e+00 0.0000000    ✓
3              AIC    3.18723e+02  4.86369e+01 0.8474008
4          Chi^2 p    9.10454e-01  4.61390e-01 0.4932303
5              P_a    6.03467e-01  6.22440e-01 0.0314400
6          CV(P_a)    1.15354e-01  1.07308e-01 0.0697462
7   log-likelihood   -1.57362e+02 -2.23184e+01 0.8581710
8          density    4.99112e-02  6.71855e-02 0.3460998
9      CV(density)    2.73226e-01  3.32031e-01 0.2152260
10     density lcl    2.90415e-02  3.28856e-02 0.1323657
11     density ucl    8.57783e-02  1.37260e-01 0.6001767
12      density df    4.19452e+01  1.06757e+01 0.7454848
13     individuals    3.57020e+04  4.80589e+04 0.3461117
14 CV(individuals)    2.73226e-01  3.32031e-01 0.2152260
15 individuals lcl    2.07740e+04  2.35236e+04 0.1323586
16 individuals ucl    6.13590e+04  9.81846e+04 0.6001655
17  individuals df    4.19452e+01  1.06757e+01 0.7454848

(For example for the first analysis therein.)
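
For reading the table: the Rel_diff column appears to be the absolute difference scaled by the Distance value. That is inferred from the numbers above rather than taken from the readdst source, so treat the `rel_diff` function below as an assumption; how the Pass column is decided is not covered here.

```r
# Relative difference, assuming the Distance for Windows value is the reference
rel_diff <- function(distance_value, mrds_value) {
  abs(distance_value - mrds_value) / abs(distance_value)
}

aic_diff <- rel_diff(3.18723e+02, 4.86369e+01)  # AIC row
pa_diff  <- rel_diff(6.03467e-01, 6.22440e-01)  # P_a row

round(c(AIC = aic_diff, P_a = pa_diff), 4)
```

Both values agree with the Rel_diff column to the precision of the printed inputs.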

I think this is connected to #17.

(Not sure what you mean by the "pellet survey".)

@erex
Member

erex commented Dec 8, 2015

Pellet survey--dung survey. Line transect sampling of deer pellets -> density estimate of pellets -> decomposition and deposition rates (multipliers) -> deer density estimate
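
As a rough illustration of that chain, a minimal sketch of how the multipliers convert a pellet density into an animal density. All numbers and variable names are hypothetical and are not from the pellet project; the form used here (divide by deposition rate times mean time to decay) is the usual dung-count calculation and is assumed to match what the project does.

```r
# All values are hypothetical, for illustration only
pellet_density  <- 50000  # pellet groups per km^2, from the line transect analysis
deposition_rate <- 20     # pellet groups produced per deer per day
decay_time      <- 100    # mean number of days for a pellet group to decay

# Each deer "maintains" deposition_rate * decay_time pellet groups on the ground
deer_density <- pellet_density / (deposition_rate * decay_time)
deer_density  # deer per km^2
```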

@erex
Member

erex commented Dec 8, 2015

sorry to bounce around among projects (difference between home and office machines)

Here is a non-cluster-size example, Stratify solutions from our intro workshop, which convert_project swallowed fine.

Here are results of 3 analyses inside that project:

[[1]]
[[1]][[1]]
[[1]][[1]]$`Full geog stratification`
         Statistic Distance_value mrds_value Rel_diff     Pass
1                n      8.800e+01  8.800e+01  0.00000 <U+2713>
2       parameters      4.000e+00  2.000e+00  0.50000         
3              AIC      3.158e+02  4.864e+01  0.84598         
4   log-likelihood     -1.539e+02 -2.232e+01  0.85498         
5          density      2.091e-02  2.935e-02  0.40409         
6      CV(density)      2.779e-01  2.787e-01  0.00305         
7      density lcl      1.177e-02  1.615e-02  0.37189         
8      density ucl      3.713e-02  5.336e-02  0.43704         
9       density df      1.741e+01  1.172e+01  0.32708         
10     individuals      1.495e+04  2.100e+04  0.40412         
11 CV(individuals)      2.779e-01  2.787e-01  0.00305         
12 individuals lcl      8.420e+03  1.155e+04  0.37194         
13 individuals ucl      2.656e+04  3.817e+04  0.43704         
14  individuals df      1.741e+01  1.172e+01  0.32708         

[[1]][[1]]$`Pooled f(0)`
         Statistic Distance_value   mrds_value  Rel_diff     Pass
1                n    8.80000e+01  8.80000e+01 0.0000000 <U+2713>
2       parameters    2.00000e+00  2.00000e+00 0.0000000 <U+2713>
3              AIC    3.18723e+02  4.86369e+01 0.8474008         
4          Chi^2 p    9.10454e-01  4.61390e-01 0.4932303         
5              P_a    6.03467e-01  6.22440e-01 0.0314400         
6          CV(P_a)    1.15354e-01  1.07308e-01 0.0697462         
7   log-likelihood   -1.57362e+02 -2.23184e+01 0.8581710         
8          density    2.28327e-02  2.93538e-02 0.2856038         
9      CV(density)    3.08211e-01  2.78722e-01 0.0956775         
10     density lcl    1.20496e-02  1.61491e-02 0.3402187         
11     density ucl    4.32656e-02  5.33558e-02 0.2332140         
12      density df    1.58317e+01  1.17172e+01 0.2598875         
13     individuals    1.63330e+04  2.09973e+04 0.2855731         
14 CV(individuals)    3.08211e-01  2.78722e-01 0.0956775         
15 individuals lcl    8.61900e+03  1.15517e+04 0.3402598         
16 individuals ucl    3.09490e+04  3.81663e+04 0.2331985         
17  individuals df    1.58317e+01  1.17172e+01 0.2598875         

[[1]][[1]]$`No stratification`
         Statistic Distance_value   mrds_value  Rel_diff     Pass
1                n    8.80000e+01  8.80000e+01 0.0000000 <U+2713>
2       parameters    2.00000e+00  2.00000e+00 0.0000000 <U+2713>
3              AIC    3.18723e+02  4.86369e+01 0.8474008         
4          Chi^2 p    9.10454e-01  4.61390e-01 0.4932303         
5              P_a    6.03467e-01  6.22440e-01 0.0314400         
6          CV(P_a)    1.15354e-01  1.07308e-01 0.0697462         
7   log-likelihood   -1.57362e+02 -2.23184e+01 0.8581710         
8          density    2.63774e-02  2.93538e-02 0.1128394         
9      CV(density)    2.63180e-01  2.78722e-01 0.0590549         
10     density lcl    1.56077e-02  1.61491e-02 0.0346851         
11     density ucl    4.45785e-02  5.33558e-02 0.1968968         
12      density df    3.62001e+01  1.17172e+01 0.6763205         
13     individuals    1.88680e+04  2.09973e+04 0.1128506         
14 CV(individuals)    2.63180e-01  2.78722e-01 0.0590549         
15 individuals lcl    1.11640e+04  1.15517e+04 0.0000000 <U+2713>
16 individuals ucl    3.18880e+04  3.81663e+04 0.1968847         
17  individuals df    3.62001e+01  1.17172e+01 0.6763205    

Hope to see the back of my marking tomorrow. Perhaps at that point, we might discuss what types of analyses CANNOT be resolved between DisWin and mrds (perhaps the cluster size issue falls therein). Then we can draft a list of pertinent DisWin projects to put through the grinder looking for specific problems; at the moment it is a scattergun approach.

Perhaps it might be adequate to check up to density estimates, and ignore abundance estimates and their measures of precision.

@dill
Contributor Author

dill commented Dec 9, 2015

53863b8 includes support for estimating cluster size in abundance and density estimates in the same ways as Distance for Windows.

I think the remaining issues with the data set are down to stratification. This needs to be resolved by building a wrapper around dht that selects the data correctly and estimates abundance at the correct levels. The re-write of the model definition parsing in fe51030 makes this much easier.

@erex
Member

erex commented Dec 9, 2015

Getting easier to run projects through readdst so perhaps a clearer picture is emerging. Current install is indeed DistanceDevelopment-readdst-53863b8

Through it, I have run CovarWhaleSim-solutions, and as you have previously seen, that project has 3 analyses that check out quite well, but the last 4 analyses (hn with hour, hn with mstdo and hour, hn with mstdo with 10% right, and hn with hour and mstdo with 10% right 1) perform poorly.


[[1]][[1]]$`hn with hour`
         Statistic Distance_value   mrds_value Rel_diff     Pass
1                n     60.0000000  6.00000e+01 0.000000 <U+2713>
2       parameters      2.0000000  2.00000e+00 0.000000 <U+2713>
3              AIC    125.0299988  1.54867e+02 0.238638         
4          Chi^2 p      0.6395112  1.56671e-05 0.999976         
5              P_a      0.4945715  1.00000e+00 1.021952         
6          CV(P_a)      0.0941000  2.14482e-02 0.772070         
7   log-likelihood    -60.5149803 -7.54335e+01 0.246526         
8            K-S p      0.7072740  1.67876e-06 0.999998         
9           C-vM p      0.8000000  5.52476e-07 0.999999         
10         density      0.0347093  1.71662e-02 0.505428         
11     CV(density)      0.1401000  1.06083e-01 0.242803         
12     density lcl      0.0259801  1.34206e-02 0.483428         
13     density ucl      0.0463713  2.19572e-02 0.526492         
14      density df     21.4410152  7.60773e+00 0.645179         
15     individuals    347.0000000  1.71662e+02 0.505297         
16 CV(individuals)      0.1401000  1.06083e-01 0.242803         
17 individuals lcl    260.0000000  1.34206e+02 0.483823         
18 individuals ucl    464.0000000  2.19572e+02 0.526785         
19  individuals df     21.4410152  7.60773e+00 0.645179         

[[1]][[1]]$`hn with mstdo and hour`
         Statistic Distance_value   mrds_value Rel_diff     Pass
1                n     60.0000000  6.00000e+01 0.000000 <U+2713>
2       parameters      3.0000000  3.00000e+00 0.000000 <U+2713>
3              AIC    113.1439972  1.56867e+02 0.386436         
4          Chi^2 p      0.5712746  5.21006e-06 0.999991         
5              P_a      0.4219032  1.00000e+00 1.370212         
6          CV(P_a)      0.1159000  1.71946e-02 0.851642         
7   log-likelihood    -53.5720100 -7.54335e+01 0.408076         
8            K-S p      0.3851634  1.67876e-06 0.999996         
9           C-vM p      0.6000000  5.52476e-07 0.999999         
10         density      0.0406876  1.71662e-02 0.578097         
11     CV(density)      0.1556000  1.05306e-01 0.323228         
12     density lcl      0.0296601  1.34266e-02 0.547317         
13     density ucl      0.0558150  2.19473e-02 0.606784         
14      density df     29.6221104  7.38805e+00 0.750590         
15     individuals    407.0000000  1.71662e+02 0.578226         
16 CV(individuals)      0.1556000  1.05306e-01 0.323228         
17 individuals lcl    297.0000000  1.34266e+02 0.547925         
18 individuals ucl    558.0000000  2.19473e+02 0.606679         
19  individuals df     29.6221104  7.38805e+00 0.750590         

[[1]][[1]]$`hn with mstdo with 10% right`
         Statistic Distance_value  mrds_value   Rel_diff     Pass
1                n     54.0000000  54.0000000 0.00000000 <U+2713>
2       parameters      2.0000000   2.0000000 0.00000000 <U+2713>
3              AIC     76.0550537  76.2739950 0.00287872         
4          Chi^2 p      0.1665688   0.3637715 1.18391200         
5              P_a      0.6138144   0.6096259 0.00682376         
6          CV(P_a)      0.0994000   0.1228850 0.23626810         
7   log-likelihood    -36.0275192 -36.1369975 0.00303874         
8            K-S p      0.7076659   0.6911267 0.02337157         
9           C-vM p      0.7000000   0.6564949 0.00000000 <U+2713>
10         density      0.0396030   0.0396757 0.00183616         
11     CV(density)      0.1409000   0.1609168 0.14206390         
12     density lcl      0.0296574   0.0286044 0.03550722         
13     density ucl      0.0528838   0.0550322 0.04062503         
14      density df     24.4481945  28.6593354 0.17224750         
15     individuals    396.0000000 396.7571762 0.00000000 <U+2713>
16 CV(individuals)      0.1409000   0.1609168 0.14206390         
17 individuals lcl    297.0000000 286.0437720 0.00000000 <U+2713>
18 individuals ucl    529.0000000 550.3222664 0.00000000 <U+2713>
19  individuals df     24.4481945  28.6593354 0.17224750         

[[1]][[1]]$`hn with hour and mstdo with 10% right 1`
         Statistic Distance_value  mrds_value Rel_diff     Pass
1                n     54.0000000  54.0000000 0.000000 <U+2713>
2       parameters      3.0000000   3.0000000 0.000000 <U+2713>
3              AIC     78.0150070  93.3695673 0.196816         
4          Chi^2 p      0.1115609   0.0127706 0.885528         
5              P_a      0.6130095   1.0000000 0.631296         
6          CV(P_a)      0.1009000   0.0126617 0.874512         
7   log-likelihood    -36.0075111 -43.6847837 0.213213         
8            K-S p      0.7579985   0.0118290 0.984394         
9           C-vM p      0.7000000   0.0048213 0.993112         
10         density      0.0396550   0.0241873 0.390056         
11     CV(density)      0.1420000   0.1007692 0.290357         
12     density lcl      0.0296395   0.0190993 0.355614         
13     density ucl      0.0530549   0.0306309 0.422656         
14      density df     24.9746819   7.2261257 0.710662         
15     individuals    397.0000000 241.8734451 0.390747         
16 CV(individuals)      0.1420000   0.1007692 0.290357         
17 individuals lcl    296.0000000 190.9925272 0.354755         
18 individuals ucl    531.0000000 306.3091750 0.423147         
19  individuals df     24.9746819   7.2261257 0.710662         


@erex
Member

erex commented Dec 9, 2015

Moving on to cluster size estimation, the project called D70 Cluster solutions brings no joy:

[[2]][[1]]$`E(s) by ln(s)_g(x)`
         Statistic Distance_value   mrds_value  Rel_diff     Pass
1                n    8.80000e+01  8.80000e+01 0.0000000 <U+2713>
2       parameters    2.00000e+00  2.00000e+00 0.0000000 <U+2713>
3              AIC    3.18723e+02  4.86369e+01 0.8474008         
4          Chi^2 p    9.10454e-01  4.61390e-01 0.4932303         
5              P_a    6.03467e-01  6.22440e-01 0.0314400         
6          CV(P_a)    1.15354e-01  1.07308e-01 0.0697462         
7   log-likelihood   -1.57362e+02 -2.23184e+01 0.8581710         
8          density    4.99112e-02  2.93538e-02 0.4118794         
9      CV(density)    2.73226e-01  2.78722e-01 0.0201151         
10     density lcl    2.90415e-02  1.61491e-02 0.4439313         
11     density ucl    8.57783e-02  5.33558e-02 0.3779799         
12      density df    4.19452e+01  1.17172e+01 0.7206538         
13     individuals    3.57020e+04  2.09973e+04 0.4118742         
14 CV(individuals)    2.73226e-01  2.78722e-01 0.0201151         
15 individuals lcl    2.07740e+04  1.15517e+04 0.4439347         
16 individuals ucl    6.13590e+04  3.81663e+04 0.3779843         
17  individuals df    4.19452e+01  1.17172e+01 0.7206538         

[[2]][[1]]$`truncation E(s)`
         Statistic Distance_value   mrds_value   Rel_diff     Pass
1                n    8.80000e+01  8.80000e+01 0.00000000 <U+2713>
2       parameters    2.00000e+00  2.00000e+00 0.00000000 <U+2713>
3              AIC    3.18723e+02  4.86369e+01 0.84740080         
4          Chi^2 p    9.10454e-01  4.61390e-01 0.49323030         
5              P_a    6.03467e-01  6.22440e-01 0.03144005         
6          CV(P_a)    1.15354e-01  1.07308e-01 0.06974621         
7   log-likelihood   -1.57362e+02 -2.23184e+01 0.85817100         
8          density    4.88947e-02  2.93538e-02 0.39965260         
9      CV(density)    2.81150e-01  2.78722e-01 0.00863715         
10     density lcl    2.80664e-02  1.61491e-02 0.42461100         
11     density ucl    8.51800e-02  5.33558e-02 0.37361130         
12      density df    4.63110e+01  1.17172e+01 0.74698810         
13     individuals    3.49750e+04  2.09973e+04 0.39964930         
14 CV(individuals)    2.81150e-01  2.78722e-01 0.00863715         
15 individuals lcl    2.00760e+04  1.15517e+04 0.42460150         
16 individuals ucl    6.09310e+04  3.81663e+04 0.37361510         
17  individuals df    4.63110e+01  1.17172e+01 0.74698810         

[[2]][[1]]$`Post-stratified E(s)_pooled f(0)_regr`
         Statistic Distance_value   mrds_value  Rel_diff     Pass
1                n    8.80000e+01  8.80000e+01 0.0000000 <U+2713>
2       parameters    2.00000e+00  2.00000e+00 0.0000000 <U+2713>
3              AIC    3.18723e+02  4.86369e+01 0.8474008         
4          Chi^2 p    9.10454e-01  4.61390e-01 0.4932303         
5              P_a    6.03467e-01  6.22440e-01 0.0314400         
6          CV(P_a)    1.15354e-01  1.07308e-01 0.0697462         
7   log-likelihood   -1.57362e+02 -2.23184e+01 0.8581710         
8          density    5.23999e-02  2.93538e-02 0.4398110         
9      CV(density)    2.32301e-01  2.78722e-01 0.1998326         
10     density lcl    3.32411e-02  1.61491e-02 0.5141826         
11     density ucl    8.26011e-02  5.33558e-02 0.3540543         
12      density df    9.51765e+01  1.17172e+01 0.8768894         
13     individuals    3.74820e+04  2.09973e+04 0.4398040         
14 CV(individuals)    2.32301e-01  2.78722e-01 0.1998326         
15 individuals lcl    2.37780e+04  1.15517e+04 0.5141854         
16 individuals ucl    5.90860e+04  3.81663e+04 0.3540558         
17  individuals df    9.51765e+01  1.17172e+01 0.8768894         

[[2]][[1]]$`Post-stratified E(s)_strat f(0)_regr`
         Statistic Distance_value mrds_value Rel_diff     Pass
1                n      8.800e+01  8.800e+01   0.0000 <U+2713>
2       parameters      6.000e+00  2.000e+00   0.6667         
3              AIC      3.174e+02  4.864e+01   0.8468         
4   log-likelihood     -1.527e+02 -2.232e+01   0.8538         
5          density      5.403e-02  2.935e-02   0.4567         
6      CV(density)      2.475e-01  2.787e-01   0.1264         
7      density lcl      3.330e-02  1.615e-02   0.5150         
8      density ucl      8.769e-02  5.336e-02   0.3915         
9       density df      9.204e+01  1.172e+01   0.8727         
10     individuals      3.865e+04  2.100e+04   0.4567         
11 CV(individuals)      2.475e-01  2.787e-01   0.1264         
12 individuals lcl      2.382e+04  1.155e+04   0.5150         
13 individuals ucl      6.272e+04  3.817e+04   0.3915         
14  individuals df      9.204e+01  1.172e+01   0.8727         

[[2]][[1]]$`Post-stratified  E(s) using mean `
         Statistic Distance_value   mrds_value  Rel_diff     Pass
1                n    8.80000e+01  8.80000e+01 0.0000000 <U+2713>
2       parameters    2.00000e+00  2.00000e+00 0.0000000 <U+2713>
3              AIC    3.18723e+02  4.86369e+01 0.8474008         
4          Chi^2 p    9.10454e-01  4.61390e-01 0.4932303         
5              P_a    6.03467e-01  6.22440e-01 0.0314400         
6          CV(P_a)    1.15354e-01  1.07308e-01 0.0697462         
7   log-likelihood   -1.57362e+02 -2.23184e+01 0.8581710         
8          density    5.93492e-02  2.93538e-02 0.5054048         
9      CV(density)    2.45132e-01  2.78722e-01 0.1370303         
10     density lcl    3.66852e-02  1.61491e-02 0.5597923         
11     density ucl    9.60151e-02  5.33558e-02 0.4442976         
12      density df    7.64981e+01  1.17172e+01 0.8468297         
13     individuals    4.24530e+04  2.09973e+04 0.5053997         
14 CV(individuals)    2.45132e-01  2.78722e-01 0.1370303         
15 individuals lcl    2.62410e+04  1.15517e+04 0.5597843         
16 individuals ucl    6.86810e+04  3.81663e+04 0.4442967         
17  individuals df    7.64981e+01  1.17172e+01 0.8468297         

@dill
Contributor Author

dill commented Dec 10, 2015

I think hn with mstdo with 10% right has fairly good agreement; it's the other models with hour in them that are problematic.

It looks to me after some investigation like the hour covariate isn't doing much (in mrds-land at least):

[image: screen shot 2015-12-09 at 12 02 57]

There isn't an obvious relationship with observed distance (which we might expect?) and it seems to throw off the optimiser. It also looks like the values are all very similar.

Could you try re-running all these analyses (in CovarWhaleSim-solutions) in DISTANCE to check that these results are correct? Seems odd that what appears in R to be a spurious covariate should be reasonable in DISTANCE.

@erex
Member

erex commented Dec 10, 2015

Reran the CovarWhaleSim analyses again in D7B3 and results are the same as previously. As the project name implies, these are indeed data that I simulated (in WiSP if memory serves back in 2004). I manufactured the data in such a way that there would be a meaningful covariate (MSTDO) and a worthless covariate (Hour) so that my students would 'believe' that model selection can find patterns.

[image: capture]

I'm not sure what you mean about Hour appearing reasonable to D7B3: the model with Hour as the only covariate is 13.8 AIC units worse than the model with MSTDO, and the model with both Hour and MSTDO is 1.9 AIC units worse than the MSTDO-only model. I interpret the 1.9 Delta-AIC as implying there is no information in the added covariate Hour: the AIC worsens by roughly the penalty for adding 1 (meaningless) parameter. The 'Hour' parameter in the MSTDO and Hour model is estimated as:

 A( 2)     0.1966E-01   0.7744E-01

This is a further signal from D7B3 that there is nothing happening with that covariate.
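
The same pattern can be checked from the Distance_value columns posted earlier in this thread for the two 10% right-truncated models. The sketch below uses AIC = 2k - 2 log-likelihood with the log-likelihoods and parameter counts from those tables. Reading the two columns of the A( 2) line as estimate and standard error is an assumption about the log layout.

```r
aic <- function(loglik, k) 2 * k - 2 * loglik

# Distance_value columns from the tables posted above
aic_mstdo      <- aic(-36.0275192, k = 2)  # hn with mstdo with 10% right
aic_mstdo_hour <- aic(-36.0075111, k = 3)  # hn with hour and mstdo with 10% right 1

# The extra parameter costs 2 AIC units and buys almost no log-likelihood
delta_aic   <- aic_mstdo_hour - aic_mstdo
loglik_gain <- -36.0075111 - (-36.0275192)

# Hour coefficient from the log line above, assumed to be estimate then SE
hour_est <- 0.1966e-01
hour_se  <- 0.7744e-01
hour_ratio <- hour_est / hour_se

round(c(delta_aic = delta_aic, loglik_gain = loglik_gain, hour_ratio = hour_ratio), 3)
```

A Delta-AIC close to 2 with a log-likelihood gain near zero, and an estimate well inside one standard error of zero, are both what a covariate with no information would produce.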

@dill
Contributor Author

dill commented Dec 12, 2015

Thanks for doing this!

Hm. Okay, I'm going to leave this issue for now but open a separate issue at #23. My feeling is that this might be more of an optimisation-related issue.

@dill
Contributor Author

dill commented Dec 17, 2015

I'm a little confused about one of the analyses in the cluster project:

$`Post-stratified E(s)_strat f(0)_regr`
        Statistic Distance_value     mrds_value       Rel_diff Pass
1               n             88             88              0    ✓
2      parameters              6              4      0.3333333
3             AIC     317.379913   315.78955667    0.005010851
4  log-likelihood    -152.690002 -153.894778335    0.007890355
5         density     0.05403275     0.06260623      0.1586719
6     CV(density)      0.2474518      0.3224963      0.3032693
7     individuals          38650  44783.2363988      0.1586866
8 CV(individuals)      0.2474518      0.3224963      0.3032693

Here I don't even get the number of parameters right. I'm unsure about where the extra parameters come from.

The model definition is as follows:

> model_definitions[[3]]
 [1] "Engine=CDS;"
 [2] "Options;"
 [3] "Stratification=Post-stratify /LayerType=30 /FieldName=Cluster strat;"
 [4] "Sample /LayerType=20;"
 [5] "Selection=Specify;"
 [6] "Confidence=95;"
 [7] "Print=Selection;"
 [8] "End;"
 [9] "Data /Structure=Flat;"
[10] "End;"
[11] "Estimate;"
[12] "Distance;"
[13] "Density by All;"
[14] "Density by Stratum /Design=Strata /Weight=None;"
[15] "Encounter by Stratum;"
[16] "Detection by Stratum;"
[17] "Size by Stratum;"
[18] "Estimator /Key=HA /Adjust=CO /NAP=0;"
[19] "Monotone=Strict;"
[20] "Pick=AIC;"
[21] "GOF;"
[22] "Cluster /Bias=GXLOG;"
[23] "VarN=Empirical;"
[24] "End;"

@erex does this correspond to what you see in DISTANCE? Are you able to paste the log output (parameter estimates) from DISTANCE? Thanks!

@dill dill mentioned this issue Dec 17, 2015
@erex
Member

erex commented Dec 17, 2015

Human description of this (admittedly perverse) analysis:

"In the analysis “Post-stratified E(s)_strat f(0)_regr”, the detection function has been estimated separately in each cluster size stratum. The detection functions are different from each other - it looks like cluster sizes 3 and above are detected with certainty almost all the way out to 1.2nm."

DisWin is fitting a 2-parameter hazard rate model to each of 3 strata (where strata are defined as "detections with cluster size=1, detections with cluster size 2 and detections with cluster size>=3"); that is how 6 parameters come to be.

Log window contents for Post-stratified E(s)_pooled f(0)_regr

 This is mcds.exe version 6.2.0     
 Options;                                                                      
 Type=Line;                                                                    
 Length /Measure='Nautical mile';                                              
 Distance=Perp /Measure='Nautical mile';                                       
 Area /Units='Square nautical mile';                                           
 Object=Cluster;                                                               
 SF=1.0;                                                                       
 Selection=Specify;                                                            
 Confidence=95;                                                                
 Print=Selection;                                                              
 End;                                                                          
 Data /Structure=Flat;                                                         
 Fields=STR_LABEL, STR_AREA, SMP_LABEL, SMP_EFFORT, DISTANCE, SIZE;            
 Infile=C:\Users\eric\AppData\Local\Temp\dstFB94.tmp /NoEcho;                  
 Data will be input from file - [...]APPDATA\LOCAL\TEMP\DSTFB94.TMP
 End;                                                                          
 Dataset has been stored.
 Estimate;                                                                     
 Distance /Nclass=7 /Width=1.5 /Left=0;                                        
 Density=All;                                                                  
 Density=Stratum /Design=Strata /Weight=None;                                  
 Encounter=Stratum;                                                            
 Detection=Stratum;                                                            
 Size=Stratum;                                                                 
 Estimator /Key=HA /Adjust=CO /NAP=0;                                          
 Monotone=Strict;                                                              
 Pick=AIC;                                                                     
 GOF;                                                                          
 Cluster /Bias=GXLOG;                                                          
 VarN=Empirical;                                                               
 End;                                                                          
      ** Warning: Parameter  6 is at an upper bound. **
** Warning: Exact distance values, rather than distance intervals, have been used in size bias regression calculations. **

Also notice that your element [3] of model_definitions (where the phantom stratification appears) does not appear in the commands echoed in the DisWin log window.

The 6 parameter estimates you requested are:
Strat1:
A( 1) 0.9650 0.8472E-01
A( 2) 12.49 7.704
Strat2:
A( 3) 0.4248 0.2557
A( 4) 1.441 0.7711
Strat3:
A( 5) 1.219 0.1688
A( 6) 20.00 33.78

This third stratum is for "big" clusters that the detection function tries to fit with a flat hazard rate out to 1.2nm, and then falls off a cliff because no clusters of that size are seen beyond 1.2nm, hence the (non-meaningful) upper bound warning.
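
To see the cliff numerically, a minimal sketch evaluating a hazard-rate key function with the six estimates above. It assumes the form g(x) = 1 - exp(-(x/scale)^(-shape)) and that the first parameter of each pair is the scale and the second the shape; neither is confirmed by the log itself, and adjustment terms are ignored.

```r
# Hazard-rate key function (assumed parameterisation, no adjustment terms)
hazard_rate <- function(x, scale, shape) 1 - exp(-(x / scale)^(-shape))

# Estimates quoted above, read as (scale, shape) pairs per stratum
strata <- data.frame(
  stratum = c("Strat1", "Strat2", "Strat3"),
  scale   = c(0.9650, 0.4248, 1.219),
  shape   = c(12.49, 1.441, 20.00),
  stringsAsFactors = FALSE
)

# Perpendicular distances in nautical miles (truncation width is 1.5)
x <- c(0.5, 1.0, 1.2, 1.3, 1.4)

# One column of detection probabilities per stratum
g <- sapply(seq_len(nrow(strata)), function(i) {
  hazard_rate(x, strata$scale[i], strata$shape[i])
})
dimnames(g) <- list(paste0("x=", x), strata$stratum)

round(g, 3)
```

Under these assumptions the Strat3 column stays near 1 out to about 1.2nm and then drops sharply, which is the shape described above.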

As I said, this is a pretty pathological case; unlikely any user is going to want to perform this kind of analysis on their data.

@dill
Contributor Author

dill commented Dec 17, 2015

Thanks for this thorough investigation Eric!

My hunch was that this was what was happening. I don't know if the general case of "post-stratification by some covariate" is useful anyway. I think that should just consist of setting the Region.Label to be the appropriate covariate values and the corresponding Areas to the whole study area (does that sound correct?).
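
A minimal sketch of that idea in base R, with made-up observations and a made-up study area. The column names follow the Region.Label/Area convention mentioned above; the data, the `study_area` value and the stratum labels are hypothetical, and the sample table (which would also need a Region.Label) is left out.

```r
# Hypothetical observations with a cluster size column
obs <- data.frame(
  object       = 1:6,
  Sample.Label = c("T1", "T1", "T2", "T2", "T3", "T3"),
  size         = c(1, 2, 5, 1, 3, 2),
  stringsAsFactors = FALSE
)

study_area <- 1000  # hypothetical whole-study-area size

# Post-stratify: Region.Label becomes the cluster size class (1, 2, 3 or more)
obs$Region.Label <- as.character(cut(
  obs$size,
  breaks = c(0, 1, 2, Inf),
  labels = c("size 1", "size 2", "size 3+")
))

# Every post-stratum covers the whole study area
region_table <- data.frame(
  Region.Label = c("size 1", "size 2", "size 3+"),
  Area         = study_area,
  stringsAsFactors = FALSE
)

obs
region_table
```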

The mismatch between the log window and what is stored in the database, in terms of the MCDS command language, is very odd. @lenthomas do you have any ideas about this?

@lenthomas
Member

My initial instinct is that neither Distance nor readdst should bother translating post-stratified analyses. Seems to me like our time would be better spent with other details than that! I suggest just giving a warning and not converting those analyses?

For RDistance, users can always set up data using R to do post-stratification.

However, if you do want to implement it, fine by me. I do remember my SQL post-stratification code was a bit epic, to deal with post-stratification at the sample and observation layers...
