
Reproducing Results (Table 1) #3

Open
nil123532 opened this issue Feb 8, 2025 · 7 comments

Comments

@nil123532

Dear Authors,

I have a question regarding your evaluation methodology. In the paper, you mention that “for each l, we draw instances randomly from the training data while ensuring balanced class proportions.” However, from the code, it appears that the evaluation is performed using dlval, which is built from data that is disjoint from the training set. Could you please clarify how this is intended to work and how one should reproduce Table 1 in your paper?

@lukasthede
Collaborator

Dear nil123532,

To analyze the tradeoff between labeling costs and performance, we vary the number of available labels in our experiments. To ensure an equal number of labeled samples per class, we divide the total number of labels by the number of classes and randomly select the required number of samples for each class. You can find the corresponding implementation in lines 103–118 of Semi-Supervised/datasets/cifar.py.
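For illustration only (the actual implementation is in lines 103–118 of Semi-Supervised/datasets/cifar.py), a minimal sketch of this balanced label selection could look like the following; the function name and arguments are hypothetical, not the repository's API:

import numpy as np

def sample_balanced_labels(labels, n_labeled, n_classes, seed=0):
    """Randomly pick n_labeled // n_classes sample indices per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    per_class = n_labeled // n_classes
    selected = []
    for c in range(n_classes):
        class_idx = np.flatnonzero(labels == c)          # indices of samples with label c
        selected.extend(rng.choice(class_idx, size=per_class, replace=False))
    return np.array(selected)

# e.g. sample_balanced_labels(train_labels, n_labeled=120, n_classes=20)
# returns 6 randomly chosen indices per class.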

Table 1 in our paper provides an overview of the quality of the generated artificial expert labels. Specifically, we compute the F-0.5 score between the artificial expert labels and the ground-truth expert labels on the training set. This metric helps assess how well the artificial expert labels approximate the true expert labels.
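For reference, the standard definition of the F-beta score (a general fact, not specific to this repository) is

$$F_{\beta} = \frac{(1 + \beta^2)\cdot \mathrm{precision}\cdot \mathrm{recall}}{\beta^2\cdot \mathrm{precision} + \mathrm{recall}},$$

so with $\beta = 0.5$ the metric weights precision more heavily than recall.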

Let me know if you have any further questions!

Best,
Lukas

@nil123532
Author

Thank you for your prompt response both here and via email.

From my understanding, once training is complete, you generate the expert labels from the binary output and then evaluate using the F0.5 score. Could you please let me know if you have code available for that evaluation? Also, could you clarify which dataset you use for F0.5 score evaluation: the test set or a combination of the labeled and unlabeled training sets?

Best,
Nilesh Ramgolam

@lukasthede
Collaborator

Hi Nilesh,

That’s correct. We evaluate the artificial expert labels by computing the F0.5 score between the artificial and true expert labels on the test set of CIFAR-100. While we do not have dedicated code for this in our repository, computing the F0.5 score is straightforward—for example, you can use the fbeta_score function from the sklearn library.
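For example, a minimal sketch with illustrative arrays (not data from the paper):

from sklearn.metrics import fbeta_score

# Binary expert-correctness labels: 1 = expert correct, 0 = expert incorrect (illustrative values)
y_true = [1, 0, 1, 1, 0, 1]   # ground-truth expert correctness
y_pred = [1, 0, 1, 0, 0, 1]   # artificial expert correctness
score = fbeta_score(y_true, y_pred, beta=0.5)
print(f"F0.5 = {score:.3f}")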

Let me know if you need further clarification!

Best,
Lukas

@nil123532
Author

Thank you,
I shall continue my quest of reproducing results!

@nil123532
Author

Hi,

I have a few additional clarifications.
It appears that the artificially generated label is a binary output indicating whether the expert is correct or not.
Do we then use those artificially generated binary labels together with dlval (which also contains binary expert labels) to compute the F0.5 score?

Thanks.

@nil123532 reopened this Feb 11, 2025
@nil123532
Author

nil123532 commented Feb 12, 2025

Hello,

I hope you’re doing well. I noticed that in the embedding model’s learning rate graph, the LR quickly decays to 8e-4, which might be due to the scheduler’s step() being called after every mini-batch (on line 99). As a result, the schedule reaches the 160-step milestone almost immediately. It might be more appropriate to call the scheduler’s step() method once per epoch, so the learning rate decays at the intended intervals.

[Image: learning rate curve of the embedding model, showing the rapid decay to 8e-4]

Because of this rapid LR decay, the model only reached around 63% accuracy on CIFAR, which is relatively low for an EfficientNet-based approach. I made a small change to increment the scheduler’s epoch counter only after each epoch instead of after every batch (a sketch of per-epoch stepping is shown below).
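For reference, a minimal, self-contained sketch of stepping a MultiStepLR scheduler once per epoch; the model, data, and hyperparameters here are placeholders, not the repository's actual values (only the 160-epoch milestone follows the discussion above):

import torch
from torch import nn

model = nn.Linear(10, 2)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[160], gamma=0.1)

for epoch in range(200):
    for _ in range(100):                      # placeholder mini-batches
        x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                          # one scheduler step per epoch, not per mini-batch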

Separately, regarding the training of the “expertise” model (a linear model), I noticed that loss_x decreases while loss_u increases and then remains at that level for the rest of the 50 epochs. I would appreciate any guidance on how best to address this issue.

[Image: loss_x and loss_u curves for the expertise model over 50 epochs]

At the moment, I suspect the embedding model might not have been trained properly, which could explain why I’m seeing an F0.5 score of 74% (for n_labelled = 120) instead of the 84% reported for Embedding-FixMatch. With the scheduler stepped once per epoch, I get an F0.5 score of 76%. Any advice or suggestions would be greatly appreciated.

Thank you, and I look forward to your response.

@nil123532
Author

nil123532 commented Feb 13, 2025

import torch
from sklearn.metrics import fbeta_score


def evaluate_f0_5_sklearn(model, ema_model, emb_model, dataloader, beta=0.5):
    """
    Evaluate the F0.5 score on a binary classification task (expert correct vs. incorrect)
    using sklearn's fbeta_score.

    :param model: Trained model for predictions.
    :param ema_model: Optional EMA model; if not None, we use the EMA model for predictions.
    :param emb_model: Embedding model that provides get_embedding().
    :param dataloader: DataLoader where each batch yields (images, labels, im_ids).
                       The 'labels' here should be binary: 1 if the expert is correct, 0 if not.
    :param beta: Value of beta for fbeta_score (default 0.5).
    :return: Computed F0.5 score (float).
    """
    model.eval()
    y_true = []
    y_pred = []

    with torch.no_grad():
        for ims, lbs, im_id in dataloader:
            ims = ims.cuda()
            lbs = lbs.cuda()

            # Compute embeddings, then use the EMA model if available; otherwise use the main model
            embedding = emb_model.get_embedding(batch=ims)
            if ema_model is not None:
                logits = ema_model(embedding)
            else:
                logits = model(embedding)

            # Convert logits to predicted labels (0 or 1)
            output = torch.softmax(logits, dim=1)
            predicted_class = torch.argmax(output, dim=1)  # shape [batch_size]

            # Accumulate ground truth and predictions
            y_true.extend(lbs.cpu().tolist())
            y_pred.extend(predicted_class.cpu().tolist())

    # Compute F0.5 using scikit-learn
    f05_score = fbeta_score(y_true, y_pred, beta=beta)
    return f05_score

Here's my script to calculate the F0.5 score. Note that the dataloader yields items containing binary expert labels.
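For example, assuming a trained model, its EMA copy, the embedding model, and a validation loader with binary expert-correctness labels (all placeholder names here):

score = evaluate_f0_5_sklearn(model, ema_model, emb_model, val_loader)
print(f"F0.5 on expert-correctness labels: {score:.3f}")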
