
Reproducing Results (Table 1) #3

Open
nil123532 opened this issue Feb 8, 2025 · 7 comments

Comments

@nil123532

Dear Authors,

I have a question regarding your evaluation methodology. In the paper, you mention that “for each l, we draw instances randomly from the training data while ensuring balanced class proportions.” However, from the code, it appears that the evaluation is performed using dlval, which is built from data that is disjoint from the training set. Could you please clarify how this is intended to work and how one should reproduce Table 1 in your paper?

@lukasthede
Collaborator

Dear nil123532,

To analyze the tradeoff between labeling costs and performance, we vary the number of available labels in our experiments. To ensure an equal number of labeled samples per class, we divide the total number of labels by the number of classes and randomly select the required number of samples for each class. You can find the corresponding implementation in lines 103–118 of Semi-Supervised/datasets/cifar.py.
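For illustration only (the actual implementation is in lines 103–118 of Semi-Supervised/datasets/cifar.py), a minimal sketch of this balanced label selection could look like the following; the function name and arguments are hypothetical, not the repository's API:

import numpy as np

def sample_balanced_labels(labels, n_labeled, n_classes, seed=0):
    """Randomly pick n_labeled // n_classes sample indices per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    per_class = n_labeled // n_classes
    selected = []
    for c in range(n_classes):
        class_idx = np.flatnonzero(labels == c)          # indices of samples with label c
        selected.extend(rng.choice(class_idx, size=per_class, replace=False))
    return np.array(selected)

# e.g. sample_balanced_labels(train_labels, n_labeled=120, n_classes=20)
# returns 6 randomly chosen indices per class.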

Table 1 in our paper provides an overview of the quality of the generated artificial expert labels. Specifically, we compute the F-0.5 score between the artificial expert labels and the ground-truth expert labels on the training set. This metric helps assess how well the artificial expert labels approximate the true expert labels.
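For reference, the standard definition of the F-beta score (a general fact, not specific to this repository) is

$$F_{\beta} = \frac{(1 + \beta^2)\cdot \mathrm{precision}\cdot \mathrm{recall}}{\beta^2\cdot \mathrm{precision} + \mathrm{recall}},$$

so with $\beta = 0.5$ the metric weights precision more heavily than recall.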

Let me know if you have any further questions!

Best,
Lukas

@nil123532
Author

Thank you for your prompt response both here and via email.

From my understanding, once training is complete, you generate the expert labels from the binary output and then evaluate using the F0.5 score. Could you please let me know if you have code available for that evaluation? Also, could you clarify which dataset you use for F0.5 score evaluation: the test set or a combination of the labeled and unlabeled training sets?

Best,
Nilesh Ramgolam

@lukasthede
Collaborator

Hi Nilesh,

That’s correct. We evaluate the artificial expert labels by computing the F0.5 score between the artificial and true expert labels on the test set of CIFAR-100. While we do not have dedicated code for this in our repository, computing the F0.5 score is straightforward—for example, you can use the fbeta_score function from the sklearn library.
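For example, a minimal sketch with illustrative arrays (not data from the paper):

from sklearn.metrics import fbeta_score

# Binary expert-correctness labels: 1 = expert correct, 0 = expert incorrect (illustrative values)
y_true = [1, 0, 1, 1, 0, 1]   # ground-truth expert correctness
y_pred = [1, 0, 1, 0, 0, 1]   # artificial expert correctness
score = fbeta_score(y_true, y_pred, beta=0.5)
print(f"F0.5 = {score:.3f}")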

Let me know if you need further clarification!

Best,
Lukas

@nil123532
Author

Thank you,
I shall continue my quest of reproducing results!

@nil123532
Author

Hi,

I have a few additional clarifications.
It appears that the artificially generated label is a binary output indicating whether the expert is correct or not.
Do we then use those artificially generated binary labels together with dlval (which also contains binary expert labels) to compute the F0.5 score?

Thanks.

@nil123532 reopened this Feb 11, 2025
@nil123532
Author

nil123532 commented Feb 12, 2025

Hello,

I hope you’re doing well. I noticed that in the embedding model’s learning rate graph, the LR quickly decays to 8e-4, which might be due to the scheduler’s step() being called after every mini-batch (on line 99). As a result, the schedule reaches the 160-step milestone almost immediately. It might be more appropriate to call the scheduler’s step() method once per epoch, so the learning rate decays at the intended intervals.

[Image: learning rate curve of the embedding model, showing the rapid decay to 8e-4]

Because of this rapid LR decay, the model only reached around 63% accuracy on CIFAR, which is relatively low for an EfficientNet-based approach. I made a small change to increment the scheduler’s epoch counter only after each epoch instead of after every batch (a sketch of per-epoch stepping is shown below).
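For reference, a minimal, self-contained sketch of stepping a MultiStepLR scheduler once per epoch; the model, data, and hyperparameters here are placeholders, not the repository's actual values (only the 160-epoch milestone follows the discussion above):

import torch
from torch import nn

model = nn.Linear(10, 2)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[160], gamma=0.1)

for epoch in range(200):
    for _ in range(100):                      # placeholder mini-batches
        x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                          # one scheduler step per epoch, not per mini-batch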

Separately, regarding the training of the “expertise” model (a linear model), I noticed that loss_x decreases while loss_u increases and then remains at that level for the rest of the 50 epochs. I would appreciate any guidance on how best to address this issue.

[Image: loss_x and loss_u curves for the expertise model over 50 epochs]

At the moment, I suspect the embedding model might not have been trained properly, which could explain why I’m seeing an F0.5 score of 74% (for n_labelled = 120) instead of the 84% reported for Embedding-FixMatch. With the scheduler stepped once per epoch, I get an F0.5 score of 76%. Any advice or suggestions would be greatly appreciated.

Thank you, and I look forward to your response.

@nil123532
Author

nil123532 commented Feb 13, 2025

import torch
from sklearn.metrics import fbeta_score


def evaluate_f0_5_sklearn(model, ema_model, emb_model, dataloader, beta=0.5):
    """
    Evaluate the F0.5 score on a binary classification task (expert correct vs. incorrect)
    using sklearn's fbeta_score.

    :param model: Trained model for predictions.
    :param ema_model: Optional EMA model; if not None, we use the EMA model for predictions.
    :param emb_model: Embedding model that provides get_embedding().
    :param dataloader: DataLoader where each batch yields (images, labels, im_ids).
                       The 'labels' here should be binary: 1 if the expert is correct, 0 if not.
    :param beta: Value of beta for fbeta_score (default 0.5).
    :return: Computed F0.5 score (float).
    """
    model.eval()
    y_true = []
    y_pred = []

    with torch.no_grad():
        for ims, lbs, im_id in dataloader:
            ims = ims.cuda()
            lbs = lbs.cuda()

            # Compute embeddings, then use the EMA model if available; otherwise use the main model
            embedding = emb_model.get_embedding(batch=ims)
            if ema_model is not None:
                logits = ema_model(embedding)
            else:
                logits = model(embedding)

            # Convert logits to predicted labels (0 or 1)
            output = torch.softmax(logits, dim=1)
            predicted_class = torch.argmax(output, dim=1)  # shape [batch_size]

            # Accumulate ground truth and predictions
            y_true.extend(lbs.cpu().tolist())
            y_pred.extend(predicted_class.cpu().tolist())

    # Compute F0.5 using scikit-learn
    f05_score = fbeta_score(y_true, y_pred, beta=beta)
    return f05_score

Here's my script to calculate the F0.5 score. Note that the dataloader yields items containing binary expert labels.
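For example, assuming a trained model, its EMA copy, the embedding model, and a validation loader with binary expert-correctness labels (all placeholder names here):

score = evaluate_f0_5_sklearn(model, ema_model, emb_model, val_loader)
print(f"F0.5 on expert-correctness labels: {score:.3f}")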
