Skip to content
This repository has been archived by the owner on Nov 13, 2021. It is now read-only.

Questions about edm-multi #21

Open
eric-bunch opened this issue Feb 17, 2016 · 3 comments
Open

Questions about edm-multi #21

eric-bunch opened this issue Feb 17, 2016 · 3 comments

Comments

@eric-bunch
Copy link

I've been going through the code for edm-multi.cpp, and I some general confusion about the algorithm being carried out. I'll keep my questions limited at first so that perhaps the answers to them can alleviate my confusion on the rest.

  1. Is the statistic used in the algorithm to detect multiple breakouts (in edm-multi.cpp) to measure the significance of a breakout at a certain point the same as in that for detection of a at most one breakout (i.e. in edmTail.cpp)? I feel like I understand the process used in edmTail, especially thanks to the paper written introducing the breakout detection algorithm. But it looks like edm-multi measures something slightly different--as best as I can tell it's looking for a shift in the median with some sort of optional penalization.
  2. Is it possible to get some explanation as to what is going on in lines 73-78 of edm-multi.cpp? I think one of the main things confusing me about this is the F[t] term in the definition of tmp. It seems like this would have an unwanted accumulative effect on the statistic.
  3. Am I correct in saying that when *G = Quadratic, breakouts that occur earlier in the time series are favored more than those that occur later? It seems like the more breakouts that have been observed, the more this term will penalize the tmp statistic.
@putnam120
Copy link
Contributor

Thanks for your interest in our work. Below are the answers to your
questions. Don't hesitate to ask followup questions if you have any, or any
other questions for that matters.

  1. In the method to detect multiple breakouts there is no significance
    test unfortunately. This just tries to maximize the divergence measure
    between adjacent segments. While using penalization to fight against
    overfitting/oversegmentation.
  2. The point of F[s] is to store the value of the optimal segmentation of
    the first s observations. Thus the best way to segment the first t
    observations is to assume the last breakpoint is a s<t and then add on the
    change caused by the new segment. You then iterate over all possible values
    of s and keep the best one.
  3. No having *G = Quadratic should not favor earlier breakouts. This choice
    will more negatively penalize the use of additional breakouts, as compared
    to using the linear penalization option. One thing to remember is that the
    breakpoints used to segment the first t observations isn't necessarily a
    subset of the breakpoints used to segment the first t+n (n>0) observations.

Sorry for the delayed response, but hopefully I was able to answer your
questions.

On Wed, Feb 17, 2016 at 1:31 PM, Eric Bunch [email protected]
wrote:

I've been going through the code for edm-multi.cpp, and I some general
confusion about the algorithm being carried out. I'll keep my questions
limited at first so that perhaps the answers to them can alleviate my
confusion on the rest.

Is the statistic used in the algorithm to detect multiple breakouts
(in edm-multi.cpp) to measure the significance of a breakout at a certain
point the same as in that for detection of a at most one breakout (i.e. in
edmTail.cpp)? I feel like I understand the process used in edmTail,
especially thanks to the paper written introducing the breakout detection
algorithm. But it looks like edm-multi measures something slightly
different--as best as I can tell it's looking for a shift in the median
with some sort of optional penalization.
2.

Is it possible to get some explanation as to what is going on in lines
73-78 of edm-multi.cpp? I think one of the main things confusing me about
this is the F[t] term in the definition of tmp. It seems like this
would have an unwanted accumulative effect on the statistic.
3.

Am I correct in saying that when *G = Quadratic, breakouts that occur
earlier in the time series are favored more than those that occur later? It
seems like the more breakouts that have been observed, the more this term
will penalize the tmp statistic.


Reply to this email directly or view it on GitHub
#21.

  • Nicholas James

@eric-bunch
Copy link
Author

Thanks for the response. With regards to #1, that makes more sense with how I was understanding the algorithm. So there is no way to make use of the significance measure described in the paper for multiple breakouts? Would it be possible to determine the bestStat value for each point and each segmentation, and then do some sort of filtering process based on the min_size and some sort of tolerance parameter? Or is that approach too naive?

I think the answer to 2 makes sense. I think maybe I'm overthinking it, but I'll give it some time to digest.

So the penalization term beta*G(number[t]) will only be nonzero when there have been breakpoints detected in the segment associated with t, correct?

Another question I'm having is will the left segment ever be the union of two disjoint intervals? The reason I'm asking that is on lines 47 and 48 the values {Z[prev[min_size - 1]], ..., Z[min_size - 2]} are inserted into the left tree. But since the value of prev[s] is only ever changed on line 78, and s is iterating from 2*min_size ton, then prev[min_size - 1] never gets altered. So prev[min_size - 1] will always be 0. Then on line 54, the left tree is filled further so that in total it has {Z[0], ...., Z[t-1]}. Then going to lines 61 & 62, if prev[t] > prev[t-1], the elements {Z[prev[t-1]],...,Z[prev[t]]} are removed from the left tree, leaving {Z[0],...,Z[prev[t-1]-1], Z[prev[t]+1,...,Z[t-1]]} in the left tree. This seems odd to me; I would expect that something like only {Z[prev[t]]+1,...,Z[t-1]} in the left tree.

By the way, I really enjoyed reading through the paper (I'm assuming you are the author Nicholas on that), and think the method is really cool. And the package you've written is great. I don't know how the multi version works quite yet, but it works really well!

@putnam120
Copy link
Contributor

I'm so sorry about the delay on getting back to your latest questions.

For your questions about beta*G(number[t]) it makes sense that the function
is equal to zero when number[t] = 0. However, this doesn't have to be the
case. Just as long a G( ) is an increasing function it's use should still
make sense.

Actually, Z[0], Z[1], ... , Z[prev[t]] - 1 are removed from the left tree.
You are correct in that prev[min_size - 1] will always be zero but that's
fine. I think you forgot that the first time this happens that prev[t-1] =
0. And after that it will be updated. Note that when iterating the value of
t we always start at min_size so we avoid the problem that you mentioned.

Hope that helps. Once again sorry for the delay.

On Fri, Feb 19, 2016 at 2:55 PM, Eric Bunch [email protected]
wrote:

Thanks for the response. With regards to #1
#1, that makes more
sense with how I was understanding the algorithm. So there is no way to
make use of the significance measure described in the paper for multiple
breakouts? Would it be possible to determine the bestStat value for each
point and each segmentation, and then do some sort of filtering process
based on the min_size and some sort of tolerance parameter? Or is that
approach too naive?

I think the answer to 2 makes sense. I think maybe I'm overthinking it,
but I'll give it some time to digest.

So the penalization term beta*G(number[t]) will only be nonzero when
there have been breakpoints detected in the segment associated with t,
correct?

Another question I'm having is will the left segment ever be the union of
two disjoint intervals? The reason I'm asking that is on lines 47 and 48
the values {Z[prev[min_size - 1]], ..., Z[min_size - 2]} are inserted
into the left tree. But since the value of prev[s] is only ever changed
on line 78, and s is iterating from 2*min_size ton, then prev[min_size -
1] never gets altered. So prev[min_size - 1] will always be 0. Then on
line 54, the left tree is filled further so that in total it has {Z[0],
...., Z[t-1]}. Then going to lines 61 & 62, if prev[t] > prev[t-1], the
elements {Z[prev[t-1]],...,Z[prev[t]]} are removed from the left tree,
leaving {Z[0],...,Z[prev[t-1]-1], Z[prev[t]+1,...,Z[t-1]]} in the left
tree. This seems odd to me; I would expect that something like only
{Z[prev[t]]+1,...,Z[t-1]} in the left tree.

By the way, I really enjoyed reading through the paper (I'm assuming you
are the author Nicholas on that), and think the method is really cool. And
the package you've written is great. I don't know how the multi version
works quite yet, but it works really well!


Reply to this email directly or view it on GitHub
#21 (comment)
.

  • Nicholas James

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants