Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defragment framework interfaces into context based calls #107

Open
hongchaodeng opened this issue Jan 14, 2015 · 27 comments
Open

Defragment framework interfaces into context based calls #107

hongchaodeng opened this issue Jan 14, 2015 · 27 comments

Comments

@hongchaodeng
Copy link

Problem Description

The latest build failure was caused in the race:

  1. Everything worked correctly at epoch 6
  2. After finished sending the data to parent and before new epoch set, a slave died. It restarted in a new node and repeated the same computation.
  3. In regression framework,
func (t *dummySlave) ParentDataReady(parentID uint64, req string, resp []byte) {
        ...
    if len(children) != 0 {
        t.framework.FlagMetaToChild("ParamReady")
    } else {
        // On leaf node, we can immediately return by and flag parent
        // that this node is ready.
        t.framework.FlagMetaToParent("GradientReady")
    }
}

In FlagMetaToX, it prepended epoch to the string and set it in etcd. However, the current implementation is wrong about synchronizing epoch here:

func (f *framework) FlagMetaToChild(meta string) {
    value := fmt.Sprintf("%d-%s", f.epoch, meta)
    f.etcdClient.Set (ChildMetaPath, value, 0 )
}

The problem is, a new epoch 7 was set, and it flagged meta with the epoch 7. But it's actually dealing with data in epoch 6!

Analysis

Apparently there are many ways to resolve it. We can stop all work on hand at the end of each epoch. But this is not code scalable -- if we write more remote requests, we also need to consistently close them, and heavy burden on debugging. Or we can just let user pass the epoch to FlagMetaToChild. This violates our design, and I am aware that keeping track of epoch, etc. info should be done in framework, not user level.

Proposed Solution

After a second thought to framework interface, actually there are two types of work in it. One type is epoch specific: FlagMetaToX, DataRequest, (maybe) IncEpoch. The rest of APIs are the other type, epoch agnostic.

I propose to defragment FlagMetatoX, DataRequest into a different interface:

type Context:
  GetEpoch, GetFromID
  FlagMetatoChild, FlagMetatoParent, DataRequest

And change Task interface:

func (t *dummySlave) ParentDataReady(ctx Context, req string, resp []byte) {
    if len(children) != 0 {
        ctx.FlagMetaToChild("ParamReady")
    } else {
        // On leaf node, we can immediately return by and flag parent
        // that this node is ready.
        ctx.FlagMetaToParent("GradientReady")
    }
}

In this way, the context will help us track epoch and it's easy for use to synchronized them under the framework hood.

@hongchaodeng
Copy link
Author

@xiang90 @xiaoyunwu
I am aware of the latest build failure and I have some thoughts on it. Help read and discuss it here. Let's define the context interface quickly as first step and then I can start implementing it.

@xiaoyunwu
Copy link
Contributor

"In FlagMetaToX, it prepended epoch to the string and set it in etcd."
Why do we need to set that in etcd?

@hongchaodeng
Copy link
Author

@xiaoyunwu

// epoch is prepended to meta. When a new one starts and replaces

@hongchaodeng
Copy link
Author

More about the "epoch-meta":

  1. At start-up, a new node will read meta from neighbors.
  2. But it didn't know whether the meta is flagged at current or previous epoch.
  3. it parses epoch from "epoch-meta" and only passes the meta to user if epoch is current.

@xiaoyunwu
Copy link
Contributor

ok, I got it, you mean the epoch is in meta, which is in etcd.

But I still do not understand why we have issues in
func (f *framework) FlagMetaToChild(meta string) {
value := fmt.Sprintf("%d-%s", f.epoch, meta)
f.etcdClient.Set (ChildMetaPath, value, 0 )
}

Can you show a list of event that problem will arise? What if we prepend the epoch info to req in the framework, so that we know which epoch is this request about? Basically doing the same thing like we did to meta.

@hongchaodeng
Copy link
Author

Request has epoch. In spirit of #101 , we add epoch to all communication between frameworks.

What if we prepend the epoch info to req in the framework, so that we know which epoch is this request about?

This is user API. User didn't add epoch to the meta. We need to separate framework and task layers. See my Analysis section at the top.

@xiaoyunwu
Copy link
Contributor

I know, user call our framework with req, in our implementation we prepend epcoh, before we return result, we take out epcoh from req, that way, user do not have to know anything, but we have info we need to do the right thing.

@hongchaodeng
Copy link
Author

That's right. That's why I am proposing to give context to user and that context carries epoch, etc. info.

@xiaoyunwu
Copy link
Contributor

but can we encode context into req then take it out?

On Thu, Jan 15, 2015 at 5:16 PM, Hongchao Deng [email protected]
wrote:

That's right. That's why I am proposing to give context to user and that
context carries epoch, etc. info.


Reply to this email directly or view it on GitHub
#107 (comment)
.

@hongchaodeng
Copy link
Author

There are many choices to give context to user:

  1. We can give context at SetEpoch which is the only thing we need to solve this problem.
  2. We can give context at all callbacks (DataReady, MetaReady, SetEpoch...) which provides more flexibility to user.
  3. We can also give context in other ways like encoding it into req.

It's decision making. I only care letting user call APIs in this way.

There is one more important thing to discuss: what should be called via Context not Framework? My list:

  • FlagMetaToX
  • DataRequest
  • IncEpoch

@xiaoyunwu
Copy link
Contributor

So u do not think pretend epoch would work?

@hongchaodeng
Copy link
Author

I didn't catch your question.. Could you use example?

@xiaoyunwu
Copy link
Contributor

I think we should stay in simple if possible, what do u think?

@hongchaodeng
Copy link
Author

Do you have simpler solution for this?

@xiaoyunwu
Copy link
Contributor

My preference is that we encoding the context into req and also response,
this way, user has one less thing to worry about. What do you think?

Otherwise, we have to again explain what is context, and how to use it.
Unless there are absolute reason for that, we should stay away.

On the separate note, I am wondering whether we can use the following
interface
ParentDataReady(parentID uint64, req string, resp ByteSink)
ChildDataReady(childID uint64, req string, resp ByteSink)

where ByteSink

type ByteSink interface{
ReadAll(r io.Reader)
}

this way the application can get access to the raw data stream directly,
that can be a big saving when we copy gigabytes data around.

We can also make

// These are payload for application purpose.
ServeAsParent(fromID uint64, req string) []byte
ServeAsChild(fromID uint64, req string) []byte
use writer interface
ServeAsParent(fromId uint64, req string, io.Writer)
for the same reason.

Xiaoyun

On Thu, Jan 15, 2015 at 5:23 PM, Hongchao Deng [email protected]
wrote:

There are many choices to give context to user:

  1. We can give context at SetEpoch if we assume this is only epoch
    specific context
  2. We can give context at all callbacks?
  3. We can also give context in other ways like encoding it into req.

It's decision making. I only care letting user call APIs in this way.

There is one more important thing to discuss: what should be called via
Context not Framework? My list:

  • FlagMetaToX
  • DataRequest
  • IncEpoch


Reply to this email directly or view it on GitHub
#107 (comment)
.

@hongchaodeng
Copy link
Author

Let's talk with examples.

Let's assume we have encoded context into req:

func (t *dummySlave) ParentDataReady(parentID uint64, req Request, resp Resp) {
        ...
    if len(children) != 0 {
        req.Context.FlagMetaToChild("ParamReady")
    } else {
        req.Context.FlagMetaToParent("GradientReady")
    }
}

So my question is, what APIs should be included in Context not Framework? My list:

  • FlagMetaToX
  • DataRequest
  • IncEpoch

@xiaoyunwu
Copy link
Contributor

Is it possible to do away with the concept of context?

I still believe we can handle it without task implementation ever know
about context or epoch, am I missing something?

Xiaoyun

On Thu, Jan 15, 2015 at 6:37 PM, Hongchao Deng [email protected]
wrote:

Let's talk with examples.

Let's assume we have encoded context into req:

func (t *dummySlave) ParentDataReady(parentID uint64, req Request, resp Resp) {
...
if len(children) != 0 {
req.Context.FlagMetaToChild("ParamReady")
} else {
req.Context.FlagMetaToParent("GradientReady")
}
}

So my question is, what APIs should be included in Context not Framework?
My list:

  • FlagMetaToX
  • DataRequest
  • IncEpoch


Reply to this email directly or view it on GitHub
#107 (comment)
.

@hongchaodeng
Copy link
Author

See my analysis section and first paragraph in solution section. Without context, there is no way between framework and task to determine whether they are talking in the same epoch.

@xiaoyunwu
Copy link
Contributor

Ok, Let me try to sequence it in a way that makes sense.

in epoch 6, task implementation says:
t.framework.FlagMetaToChild("ParamReady")

before framework.FlagMetaToX executed,
epoch got move to epoch 7

in the function, the epoch 7 is encoded in FlagMetaToChild
func (f *framework) FlagMetaToChild(meta string) {
value := fmt.Sprintf("%d-%s", f.epoch, meta)
f.etcdClient.Set (ChildMetaPath, value, 0 )
}

I agree with you that instantiate a new context for each epoch is good idea. But can it be a parameter instead of a type.

@xiaoyunwu
Copy link
Contributor

What is the reason that we should include incEpoch in context?

Here is my proposal:
on famework we add

GetContext() Context

Where Context is a opaque type that includes epoch id.

in Context it supports the follow three functions:
FlagMetaToParent(meta string)
FlagMetaToChild(meta string)
DataRequest

What do you guys think?

@hongchaodeng
Copy link
Author

Adding GetContext() in framework is the same thing of FlagMetaToChild().

What is the reason that we should include incEpoch in context?

I mentioned we need discussion on this. Currently I am fine only adding FlagMetaToX(). IncEpoch() in context indicates more explicit operation: we expected to increment at a specific epoch we know, not a global one user couldn't be sure.

@xiaoyunwu
Copy link
Contributor

What I really mean is following.

task implementation first get context from framework, that context then can
be used for the following three methods:
FlagMetaToX and DataRequest

Is this what you have in mind, if so, we are on the same page.

On Fri, Jan 16, 2015 at 4:30 PM, Hongchao Deng [email protected]
wrote:

Adding GetContext() in framework is the same thing of FlagMetaToChild().

What is the reason that we should include incEpoch in context?

I mentioned we need discussion on this. Currently I am fine only adding
FlagMetaToX().


Reply to this email directly or view it on GitHub
#107 (comment)
.

@hongchaodeng
Copy link
Author

No.

Task get context from each callback method (i.e. DataReady, MetaReady, SetEpoch, etc.). Because it's during this communication between them that epoch shift might happen.

@xiaoyunwu
Copy link
Contributor

Using a state pattern, let's introduce a state call Context. Slightly different from Hongchao's suggestion, I think we can get away with just state object with no method.

So I take it that we both agree that the following method in framework can be changed into
FlagMetaToParent(context Context, meta string)
FlagMetaToChild(context Context, meta string)
DataRequest(context Context, toID uint64, meta string)

As suggested by Hongchao, we can pass context as parameter to all the task implementation callback functions:
// NOTE: the meta/data ready notifications follow at-least-once fault
// tolerance semantics
ParentMetaReady(context Context, parentID uint64, meta string)
ChildMetaReady(context Context, childID uint64, meta string)
ParentDataReady(context Context, parentID uint64, req string, resp []byte)
ChildDataReady(context Context, childID uint64, req string, resp []byte)

The fundamental reason for the current test failure is that we need to have a state machine inside task implementation which handle its event in the right order. But I think add this context is a reasonable thing to do.
@hongchao,@xiang what do you think?

@hongchaodeng
Copy link
Author

It's too complicated to have APIs like this:

FlagMetaToParent(context Context, meta string)
FlagMetaToChild(context Context, meta string)
DataRequest(context Context, toID uint64, meta string)

We shouldn't rely on user to pass in a context. It will only add burden of checking and complication.

@xiang90
Copy link
Contributor

xiang90 commented Jan 19, 2015

@fengjingchao How about using a generic context data struct? For example, an embed hash map? And we always attach Context to each request?

Actually we can use this small lib for context and cancellation:
http://blog.golang.org/context
http://godoc.org/golang.org/x/net/context

@xiang90
Copy link
Contributor

xiang90 commented Jan 19, 2015

/cc @xiaoyunwu

Context is widely used in golang world. So we still can preserve the simplicity we want by not introducing "new" concepts. Actually this might also simplify the code by avoiding customized cancellation solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants