Questions about edm-multi #21

eric-bunch · 2016-02-17T21:31:40Z

I've been going through the code for edm-multi.cpp, and I some general confusion about the algorithm being carried out. I'll keep my questions limited at first so that perhaps the answers to them can alleviate my confusion on the rest.

Is the statistic used in the algorithm to detect multiple breakouts (in edm-multi.cpp) to measure the significance of a breakout at a certain point the same as in that for detection of a at most one breakout (i.e. in edmTail.cpp)? I feel like I understand the process used in edmTail, especially thanks to the paper written introducing the breakout detection algorithm. But it looks like edm-multi measures something slightly different--as best as I can tell it's looking for a shift in the median with some sort of optional penalization.
Is it possible to get some explanation as to what is going on in lines 73-78 of edm-multi.cpp? I think one of the main things confusing me about this is the F[t] term in the definition of tmp. It seems like this would have an unwanted accumulative effect on the statistic.
Am I correct in saying that when *G = Quadratic, breakouts that occur earlier in the time series are favored more than those that occur later? It seems like the more breakouts that have been observed, the more this term will penalize the tmp statistic.

The text was updated successfully, but these errors were encountered:

putnam120 · 2016-02-18T16:10:58Z

Thanks for your interest in our work. Below are the answers to your
questions. Don't hesitate to ask followup questions if you have any, or any
other questions for that matters.

In the method to detect multiple breakouts there is no significance
test unfortunately. This just tries to maximize the divergence measure
between adjacent segments. While using penalization to fight against
overfitting/oversegmentation.
The point of F[s] is to store the value of the optimal segmentation of
the first s observations. Thus the best way to segment the first t
observations is to assume the last breakpoint is a s<t and then add on the
change caused by the new segment. You then iterate over all possible values
of s and keep the best one.
No having *G = Quadratic should not favor earlier breakouts. This choice
will more negatively penalize the use of additional breakouts, as compared
to using the linear penalization option. One thing to remember is that the
breakpoints used to segment the first t observations isn't necessarily a
subset of the breakpoints used to segment the first t+n (n>0) observations.

Sorry for the delayed response, but hopefully I was able to answer your
questions.

On Wed, Feb 17, 2016 at 1:31 PM, Eric Bunch [email protected]
wrote:

I've been going through the code for edm-multi.cpp, and I some general
confusion about the algorithm being carried out. I'll keep my questions
limited at first so that perhaps the answers to them can alleviate my
confusion on the rest.

Is the statistic used in the algorithm to detect multiple breakouts
(in edm-multi.cpp) to measure the significance of a breakout at a certain
point the same as in that for detection of a at most one breakout (i.e. in
edmTail.cpp)? I feel like I understand the process used in edmTail,
especially thanks to the paper written introducing the breakout detection
algorithm. But it looks like edm-multi measures something slightly
different--as best as I can tell it's looking for a shift in the median
with some sort of optional penalization.
2.

Is it possible to get some explanation as to what is going on in lines
73-78 of edm-multi.cpp? I think one of the main things confusing me about
this is the F[t] term in the definition of tmp. It seems like this
would have an unwanted accumulative effect on the statistic.
3.

Am I correct in saying that when *G = Quadratic, breakouts that occur
earlier in the time series are favored more than those that occur later? It
seems like the more breakouts that have been observed, the more this term
will penalize the tmp statistic.

—
Reply to this email directly or view it on GitHub
#21.

Nicholas James

eric-bunch · 2016-02-19T22:55:37Z

Thanks for the response. With regards to #1, that makes more sense with how I was understanding the algorithm. So there is no way to make use of the significance measure described in the paper for multiple breakouts? Would it be possible to determine the bestStat value for each point and each segmentation, and then do some sort of filtering process based on the min_size and some sort of tolerance parameter? Or is that approach too naive?

I think the answer to 2 makes sense. I think maybe I'm overthinking it, but I'll give it some time to digest.

So the penalization term beta*G(number[t]) will only be nonzero when there have been breakpoints detected in the segment associated with t, correct?

Another question I'm having is will the left segment ever be the union of two disjoint intervals? The reason I'm asking that is on lines 47 and 48 the values {Z[prev[min_size - 1]], ..., Z[min_size - 2]} are inserted into the left tree. But since the value of prev[s] is only ever changed on line 78, and s is iterating from 2*min_size ton, then prev[min_size - 1] never gets altered. So prev[min_size - 1] will always be 0. Then on line 54, the left tree is filled further so that in total it has {Z[0], ...., Z[t-1]}. Then going to lines 61 & 62, if prev[t] > prev[t-1], the elements {Z[prev[t-1]],...,Z[prev[t]]} are removed from the left tree, leaving {Z[0],...,Z[prev[t-1]-1], Z[prev[t]+1,...,Z[t-1]]} in the left tree. This seems odd to me; I would expect that something like only {Z[prev[t]]+1,...,Z[t-1]} in the left tree.

By the way, I really enjoyed reading through the paper (I'm assuming you are the author Nicholas on that), and think the method is really cool. And the package you've written is great. I don't know how the multi version works quite yet, but it works really well!

putnam120 · 2016-03-29T15:14:24Z

I'm so sorry about the delay on getting back to your latest questions.

For your questions about beta*G(number[t]) it makes sense that the function
is equal to zero when number[t] = 0. However, this doesn't have to be the
case. Just as long a G( ) is an increasing function it's use should still
make sense.

Actually, Z[0], Z[1], ... , Z[prev[t]] - 1 are removed from the left tree.
You are correct in that prev[min_size - 1] will always be zero but that's
fine. I think you forgot that the first time this happens that prev[t-1] =
0. And after that it will be updated. Note that when iterating the value of
t we always start at min_size so we avoid the problem that you mentioned.

Hope that helps. Once again sorry for the delay.

On Fri, Feb 19, 2016 at 2:55 PM, Eric Bunch [email protected]
wrote:

Thanks for the response. With regards to #1
#1, that makes more
sense with how I was understanding the algorithm. So there is no way to
make use of the significance measure described in the paper for multiple
breakouts? Would it be possible to determine the bestStat value for each
point and each segmentation, and then do some sort of filtering process
based on the min_size and some sort of tolerance parameter? Or is that
approach too naive?

I think the answer to 2 makes sense. I think maybe I'm overthinking it,
but I'll give it some time to digest.

So the penalization term beta*G(number[t]) will only be nonzero when
there have been breakpoints detected in the segment associated with t,
correct?

Another question I'm having is will the left segment ever be the union of
two disjoint intervals? The reason I'm asking that is on lines 47 and 48
the values {Z[prev[min_size - 1]], ..., Z[min_size - 2]} are inserted
into the left tree. But since the value of prev[s] is only ever changed
on line 78, and s is iterating from 2*min_size ton, then prev[min_size -
1] never gets altered. So prev[min_size - 1] will always be 0. Then on
line 54, the left tree is filled further so that in total it has {Z[0],
...., Z[t-1]}. Then going to lines 61 & 62, if prev[t] > prev[t-1], the
elements {Z[prev[t-1]],...,Z[prev[t]]} are removed from the left tree,
leaving {Z[0],...,Z[prev[t-1]-1], Z[prev[t]+1,...,Z[t-1]]} in the left
tree. This seems odd to me; I would expect that something like only
{Z[prev[t]]+1,...,Z[t-1]} in the left tree.

By the way, I really enjoyed reading through the paper (I'm assuming you
are the author Nicholas on that), and think the method is really cool. And
the package you've written is great. I don't know how the multi version
works quite yet, but it works really well!

—
Reply to this email directly or view it on GitHub
#21 (comment)
.

Nicholas James

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Questions about edm-multi #21

Questions about edm-multi #21

eric-bunch commented Feb 17, 2016

putnam120 commented Feb 18, 2016

eric-bunch commented Feb 19, 2016

putnam120 commented Mar 29, 2016

Questions about edm-multi #21

Questions about edm-multi #21

Comments

eric-bunch commented Feb 17, 2016

putnam120 commented Feb 18, 2016

eric-bunch commented Feb 19, 2016

putnam120 commented Mar 29, 2016