Summary

When evaluating Inception_v3, running multi-GPU parallel evaluation via oneflow.distributed.launch causes the process to die with signal 11 (SIGSEGV).
Code to reproduce bug

First, build an inception_v3 model:
import oneflow as flow
import flowvision

model = flowvision.models.inception_v3(pretrained=False)
model = flow.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False)
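The report doesn't show device placement. Following OneFlow's DDP tutorial, the module is presumably moved to the GPU before wrapping; a minimal sketch (an assumption, not from the original report):

import oneflow as flow
import flowvision

# Under oneflow.distributed.launch, "cuda" resolves to each spawned
# process's own GPU, so .to("cuda") is enough per rank.
model = flowvision.models.inception_v3(pretrained=False).to("cuda")
model = flow.nn.parallel.DistributedDataParallel(model, broadcast_buffers=False)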
Then run evaluation:
import time

import torch
import torchmetrics
import oneflow as flow

import utils  # project-local helpers providing MetricLogger


@flow.no_grad()
def evaluate(model, data_loader, device, print_freq=4, eval_max_steps=-1,
             num_threads=1, model_name="Inception"):
    cpu_device = flow.device("cpu")
    flow.set_num_threads(num_threads)
    model.eval()
    metric_logger = utils.MetricLogger(delimiter=" ")
    num_classes, task, average = 1000, "multiclass", "macro"
    metric_collection = torchmetrics.MetricCollection({
        'Accuracy': torchmetrics.Accuracy(task=task, num_classes=num_classes, average=average).to('cpu'),
        'Precision': torchmetrics.Precision(task=task, num_classes=num_classes, average=average).to('cpu'),
        'Recall': torchmetrics.Recall(task=task, num_classes=num_classes, average=average).to('cpu'),
        "AUROC": torchmetrics.AUROC(task=task, num_classes=num_classes, average=average).to('cpu'),
    })
    for (images, labels), i, global_step in metric_logger.log_every(data_loader, print_freq, 0, is_eval=True):
        images, labels = images.to(device), labels.to(device)

        model_time = time.time()
        preds = model(images)  # the SIGSEGV is raised here
        model_time = time.time() - model_time

        if model_name == "Inception":
            preds = preds[0]  # keep only the main logits from inception's output
        preds = preds.softmax(dim=1).cpu()

        evaluator_time = time.time()
        # torchmetrics expects torch tensors; bridge from oneflow via numpy
        preds = torch.from_numpy(preds.numpy())
        labels = torch.from_numpy(labels.numpy())
        batch_metrics = metric_collection.forward(preds, labels)
        evaluator_time = time.time() - evaluator_time

        metric_logger.update(model_time=model_time, evaluator_time=evaluator_time)
        if 0 < eval_max_steps <= i:
            break

    # gather the stats from all processes
    metric_logger.synchronize_between_processes()
    val_metrics = metric_collection.compute()
    eval_res = {
        "Accuracy": val_metrics["Accuracy"].item(),
        "Precision": val_metrics["Precision"].item(),
        "Recall": val_metrics["Recall"].item(),
        "AUROC": val_metrics["AUROC"].item(),
    }
    print("Averaged stats:", metric_logger)
    model.train()
    return eval_res
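The report doesn't include the surrounding eval.py driver. A hypothetical minimal one (the synthetic data and sizes below are assumptions, not from the report) could look like:

import oneflow as flow

# Hypothetical glue code: a synthetic loader stands in for the real dataset
# so evaluate() can be called under the launcher.
device = flow.device("cuda")  # per-process GPU under oneflow.distributed.launch

images = flow.randn(16, 3, 299, 299)   # inception_v3 expects 299x299 inputs
labels = flow.randint(0, 1000, (16,))
dataset = flow.utils.data.TensorDataset(images, labels)
data_loader = flow.utils.data.DataLoader(dataset, batch_size=4)

print(evaluate(model, data_loader, device))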
Finally, launch the test with:

python -m oneflow.distributed.launch --nproc_per_node 2 --master_port 12345 eval.py

The run fails at preds = model(images) with:

subprocess.CalledProcessError: Command [xxxxx] died with <Signals.SIGSEGV: 11>.
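Not in the original report, but a useful reproduction detail: the standard-library fault handler prints a Python traceback when a worker segfaults, and it can be enabled in the spawned processes via an environment variable:

PYTHONFAULTHANDLER=1 python -m oneflow.distributed.launch --nproc_per_node 2 --master_port 12345 eval.py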
System Information

python3 -m oneflow --doctor
version: 0.9.0 (git_commit: 381b12c)