Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The system shuts down automatically if the template creation fails. #463

Merged
merged 7 commits into from
Nov 25, 2024

Conversation

YangYumings
Copy link
Contributor

@YangYumings YangYumings commented Nov 22, 2024

问题

  • 当注册元数据方法抛出异常后,bm 卡死不退出。这是由于抛出异常的线程少走了一次 CyclicBarrier 的 await() ,导致其他线程一直处于等待状态。

场景

将 device 均分给每个 schemaClient 创建,创建 template -> 创建 database -> 激活 template -> 创建时间序列,使用 template 元数据的流程存在顺序关系:

  • 创建 template 只需一个线程执行一次即可,完成 template 创建后,其他线程再创建database;
  • 激活 template 前要保证所有的 database 已被创建;
  • 创建时间序列前,需要所有线程完成 template 的激活;
    所以,此处使用 3 个 CyclicBarrier :templateBarrier、schemaBarrier、activateTemplateBarrier。

解法

  1. 在 basemode 下设置 volatile stopAllSchemaClient,默认为 false, 当有创建元数据方法抛出异常后在 catch 中将其改为 true;
  2. 使用 try finally 保证不论方法是否正确执行,都会执行 await();
  3. 在每个 await() 下加上对 stopAllSchemaClient 的检查,当为 true 时,提前结束方法,从而结束当前线程;

@@ -66,6 +66,7 @@ public abstract class BaseMode {
protected List<DataClient> dataClients = new ArrayList<>();
protected List<SchemaClient> schemaClients = new ArrayList<>();
protected Measurement baseModeMeasurement = new Measurement();
public static volatile boolean stopAllSchemaClient = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感觉加这个有点依赖全局变量了,不太优雅

目前看代码这个变量有两个作用,一是元数据注册线程之间的通信,二是BaseMode依赖这个变量判断是否注册成功

对于一,我问了chatgpt ”如果一个线程想让其它正在等待的线程出现异常该怎么做“,它说“调用 reset() 方法会打破当前的屏障,并使所有正在等待的线程抛出 BrokenBarrierException”

对于二,我看了一遍调用过程,是从BaseMode一层层调用到ModelStrategy的,最好通过返回值的方式去传达元数据注册是否成功

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

方法一:这种之前尝试过,reset 后其余的线程能顺利执行,但我觉得既然注册 template 都失败了,那注册 database、与实践序列也没有必要了。
方法二:这个方式逻辑上确实清晰,但这个可能需要把其他数据库得方法也要修改,改动的多一些,当失败后打印失败信息,然后直接 return false。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

其实不存在方法一、二之分吧,只是说如果不用全局变量的话怎么处理这俩问题
对于“reset 后其余的线程能顺利执行”,reset后其它线程应该是否为BrokenBarrierException?catch到这个异常就不继续执行
对于“其他数据库得方法也要修改”,这个确实是个问题,能否给registerSchema接口改一下让接口抛异常?其它数据库的实现就不用改了

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| 对于“reset 后其余的线程能顺利执行”,reset后其它线程应该是否为BrokenBarrierException?catch到这个异常就不继续执行

感觉可以按照这个搞?如果是这样的话那就处理一下异常就行,不用额外维护一个全局变量,可能更清晰?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

当存在线程被屏障阻塞时,使用 reset 后能使其余线程抛出 BrokenBarrierException,然后 catch 后不继续运行也没有问题。对于二返回值有两个想法:

一、使用 callable 替换 runnable ,在主线程中获取每一个 schemaclient 的结果,当存在一个线程结果为 false,主线程注册元数据也返回 false,然后停止 bm。但是 callable 开销要比 runnable 大,担心影响测试。
二、使用全局变量,默认为 true,当一个 schemaClient 线程失败后,将其改为 false。当所有 schemaClient 结束后,主线程根据全局变量的结果决定是否继续执行,这个开销小一些。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

使用方法一。

Comment on lines 84 to 109
try {
if (config.isTEMPLATE() && templateInit.compareAndSet(false, true)) {
Template template = null;
if (config.isTEMPLATE() && !schemaList.isEmpty()) {
template = createTemplate(schemaList.get(0));
}
int sessionIndex = random.nextInt(sessionListMap.size());
Session templateSession = new ArrayList<>(sessionListMap.keySet()).get(sessionIndex);
registerTemplate(templateSession, template);
}
} finally {
templateBarrier.await();
if (BaseMode.isStopAllSchemaClient()) {
return;
}
int sessionIndex = random.nextInt(sessionListMap.size());
Session templateSession = new ArrayList<>(sessionListMap.keySet()).get(sessionIndex);
registerTemplate(templateSession, template);
}
templateBarrier.await();
for (Map.Entry<Session, List<TimeseriesSchema>> pair : sessionListMap.entrySet()) {
registerDatabases(pair.getKey(), pair.getValue());
try {
for (Map.Entry<Session, List<TimeseriesSchema>> pair : sessionListMap.entrySet()) {
registerDatabases(pair.getKey(), pair.getValue());
}
} finally {
schemaBarrier.await();
if (BaseMode.isStopAllSchemaClient()) {
return;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

感觉没必要try多次?一个大try就行?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

使用多个 try 是保证每个屏障都能正确执行,防止死锁。不过要是使用上面提到的方法二,这里确实使用一个 try 就 ok。这样我去看看 方法二 都要修改哪些部分。

Copy link
Contributor

@OneSizeFitsQuorum OneSizeFitsQuorum Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果上面用 reset 的话应该就用一个就行?

Copy link
Contributor

@OneSizeFitsQuorum OneSizeFitsQuorum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@OneSizeFitsQuorum OneSizeFitsQuorum merged commit bb4035a into thulab:master Nov 25, 2024
1 check passed
Copy link
Collaborator

@liyuheng55555 liyuheng55555 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice and clear~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants