In some cases, such as mask_enc[-1] and mask_enc[-4], the mask covers only the first frame.
(There are 2 frames with 16 patches each, so the indices [[ 4, 5, 8, 9, 10, 11, 14, 15]] fall entirely within the first frame, because every index below 16 belongs to that frame.)
In this case, for some batches, the model appears to see patches from only some of the frames (e.g., 4 of 8 frames) and must reconstruct the patches of all frames from those few (e.g., reconstruct 8 frames from patches in 4 of them).
Is my analysis correct? If so, this may differ from the paper's description, which says the same spatial mask is used for all frames:
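The index arithmetic behind this observation can be sketched as follows. This is a minimal illustration, assuming patches are flattened frame by frame (frame-major order); `locate_patch` is a hypothetical helper, not part of the repository.

```python
# Hypothetical helper: split a flat patch index into (frame, position in frame).
# Assumes frame-major flattening, i.e. index = frame * patches_per_frame + position.
def locate_patch(index, patches_per_frame):
    return divmod(index, patches_per_frame)

# Indices taken from the example above: 2 frames, 16 patches per frame.
indices = [4, 5, 8, 9, 10, 11, 14, 15]
frames_touched = sorted({locate_patch(i, 16)[0] for i in indices})
print(frames_touched)  # [0]
```

Every index maps to frame 0, so no patch of the second frame is selected.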
3D Multi-Block Masking. We use a simple 3D extension of the block masking strategy employed
for images (Bao et al., 2021). Given a video, we sample several (possibly overlapping) spatially
continuous blocks with various aspect ratios and take their union to construct a single mask. This
spatial mask is then repeated across the entire temporal dimension. Masking a large continuous
block that covers the full temporal dimension limits information leakage due to the spatial and
temporal redundancy of videos, and results in a harder prediction task (Tong et al., 2022).
If so, the masking strategy does not limit information leakage as intended.
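One way to test whether a sampled mask matches the paper's description is to compare the spatial positions selected in each frame. The sketch below assumes frame-major flattening of the patch indices; `is_repeated_across_frames` is a hypothetical name used only for illustration.

```python
def is_repeated_across_frames(indices, num_frames, patches_per_frame):
    """Return True if every frame selects the same set of spatial positions."""
    per_frame = [set() for _ in range(num_frames)]
    for i in indices:
        frame, pos = divmod(i, patches_per_frame)
        per_frame[frame].add(pos)
    return all(s == per_frame[0] for s in per_frame)

# The mask from the example above touches the first frame only.
first_frame_only = [4, 5, 8, 9, 10, 11, 14, 15]
# The same spatial positions repeated in the second frame (offset of 16).
repeated = first_frame_only + [i + 16 for i in first_frame_only]

print(is_repeated_across_frames(first_frame_only, 2, 16))  # False
print(is_repeated_across_frames(repeated, 2, 16))          # True
```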
Question 02) The visible and masked index sets do not seem to add up to the total number of patches.
When I print the shape of each mask, I get output like the following:
There are 32 patches (2 frames * 16 patches per frame = 32), but the sum of the mask lengths is less than the total patch count.
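One possible explanation, offered as an assumption rather than a confirmed reading of the repository: if every sample's index list is truncated to the shortest list in the batch so that the lists can be stacked into one tensor (as the min_keep variables in the code later in this issue suggest), the two lengths no longer have to sum to 32. The batch below is made up for illustration, and `truncate_to_min` is a hypothetical helper.

```python
def truncate_to_min(masks):
    """Cut every index list to the length of the shortest one so they can be stacked."""
    min_keep = min(len(m) for m in masks)
    return [m[:min_keep] for m in masks]

total = 32  # 2 frames * 16 patches per frame

# Made-up batch of two samples; within each sample, enc and pred partition all 32 patches.
enc = [list(range(0, 12)), list(range(0, 20))]
pred = [list(range(12, 32)), list(range(20, 32))]

enc_t, pred_t = truncate_to_min(enc), truncate_to_min(pred)
print(len(enc_t[0]) + len(pred_t[0]))  # 24, which is less than 32
```

Here the shortest encoder list and the shortest predictor list come from different samples, so after truncation each sample keeps 12 + 12 = 24 indices.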
Discussion
The second question may not be a real problem: each sample uses a subset of the visible patches to reconstruct a subset of the input video, and partial reconstruction in MAE has been shown to be effective [1].
[1] CrossMAE: Rethinking Patch Dependence for Masked Autoencoders
Approach (if the analysis is correct and the behavior is not intended)
The first issue, however, can affect performance, because the masking method is meant to block information leakage between frames, specifically to prevent the model from copying nearby patches from other frames.
To resolve this, I think the mask blocks should be sampled for a single frame and then repeated along the time axis, offsetting the indices by the number of patches per frame.
I hope this discussion helps improve the clarity of the source code and the paper.
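The idea can be sketched in plain Python, independent of the repository code. This is a simplified illustration under stated assumptions: a single rectangular block is sampled (the paper takes the union of several blocks), indices are flattened frame-major, and `sample_block` and `repeat_over_time` are hypothetical helpers.

```python
import random

def sample_block(height, width, block_h, block_w, rng):
    """Sample one rectangular block of spatial patch indices within a single frame."""
    top = rng.randrange(height - block_h + 1)
    left = rng.randrange(width - block_w + 1)
    return {r * width + c
            for r in range(top, top + block_h)
            for c in range(left, left + block_w)}

def repeat_over_time(spatial_indices, num_frames, patches_per_frame):
    """Repeat per-frame indices for every frame, offset by patches_per_frame per frame."""
    return [t * patches_per_frame + i
            for t in range(num_frames)
            for i in sorted(spatial_indices)]

rng = random.Random(0)
height = width = 4
num_frames = 2
patches_per_frame = height * width

masked_2d = sample_block(height, width, 2, 2, rng)          # masked positions in one frame
visible_2d = set(range(patches_per_frame)) - masked_2d      # remaining positions

mask_pred = repeat_over_time(masked_2d, num_frames, patches_per_frame)
mask_enc = repeat_over_time(visible_2d, num_frames, patches_per_frame)
print(len(mask_pred), len(mask_enc))  # 8 24
```

Because both index lists are built from the same per-frame partition, every frame is masked at the same spatial positions, and before any truncation the two lists together cover all 32 patches.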
Thanks.
Update
The code below is one possible way to fix the mask sampling method.
collated_masks_pred, collated_masks_enc = [], []
min_keep_enc = min_keep_pred = self.duration * self.height * self.width
for _ in range(batch_size):

    empty_context = True
    while empty_context:

        # Sample the mask for a single frame only: shape (1, height, width)
        mask_e = torch.ones((1, self.height, self.width), dtype=torch.int32)
        for _ in range(self.npred):
            mask_e *= self._sample_block_mask(p_size)
        mask_e = mask_e.flatten()

        mask_p = torch.argwhere(mask_e == 0).squeeze()
        mask_e = torch.nonzero(mask_e).squeeze()

        empty_context = (len(mask_e) == 0)
        if not empty_context:
            min_keep_pred = min(min_keep_pred, len(mask_p))
            min_keep_enc = min(min_keep_enc, len(mask_e))
            collated_masks_pred.append(mask_p)
            collated_masks_enc.append(mask_e)

if self.max_keep is not None:
    min_keep_enc = min(min_keep_enc, self.max_keep)
# --
return self._truncate_mask(collated_masks_enc, min_keep_enc), self._truncate_mask(collated_masks_pred, min_keep_pred)

def _truncate_mask(self, masks, min_keep):
    result = []
    for cm in masks:
        # Randomly choose min_keep of the per-frame indices
        idx = torch.randperm(len(cm))[:min_keep]
        cm = cm[idx]
        # Mark the chosen positions in a single frame, then repeat along the time axis
        tmp = torch.zeros((1, self.height, self.width), dtype=torch.int32)
        tmp.flatten()[cm] = 1
        tmp = tmp.repeat(self.duration, 1, 1)
        # Convert back to flat indices over all frames
        tmp = torch.nonzero(tmp.flatten()).squeeze()
        result.append(tmp)
    return torch.utils.data.default_collate(result)
As a sanity check, I ran the code without the line "tmp = torch.nonzero(tmp.flatten()).squeeze()".
The outputs look like this:
If there's substantial information leakage due to this unintended mask sampling behavior, it could compromise the model's temporal learning capabilities by simplifying the learning task. Definitely interested in learning more.
Has anyone managed to train and evaluate with the vjepa code?
Are there any updates on this issue? I am trying to understand the vjepa masking strategy: i) the number of non-masked patches plus the number of masked patches does not match the total number of patches; ii) of the two types of masks (short-range and long-range), only the first seems to be used in the predictor's forward pass.
Hi, I read the JEPA paper, and it seems to be an effective way to learn temporal information, better than other work such as VideoMAE and UMT.
I have a question about the mask sampling.
To be clear, I do not mean to review or criticize the paper, but I want to reproduce the work exactly.
Question 01) When I instantiate a mask generator and then sample a mask, it sometimes masks only the first N frames.
For example, the source code below reproduces the situation.
It outputs: