Q: so for App flow dataset, the only feature is time? #25

Open
mw66 opened this issue Apr 14, 2023 · 2 comments

mw66 commented Apr 14, 2023

https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L19-L22

These lines extract time, weekday, hour, and month,

which are then used here:

https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L54-L57

I'm just wondering:

  1. Why not use other fields, for example zone (converted to an integer), as extra features? How does the model perform in that case?

  2. Or: if the training data contains only the single time feature (without weekday, hour, month), will the model still perform well?

Sorry if these are silly questions; I'd like to hear your insight.

Thanks.


Zhazhan commented Apr 14, 2023

Hi,

  1. The 'zone' and 'app_name' information is actually used; see https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L13 and https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L57. Each 'app_name' in each 'zone' corresponds to one time series, so we convert the 'app_name' and 'zone' information into an integer, namely the 'seq_id'.
  2. It is also possible to make predictions based solely on the historical time series. We introduced these covariates in our implementation to follow previous work.
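As a rough illustration of the first point, the snippet below shows one way a pair of ('app_name', 'zone') keys could be turned into an integer id per time series. It is only a sketch on invented toy data: it uses pandas `groupby(...).ngroup()`, which may differ from how preprocess_flow.py actually numbers its series, and `seq_ids` is a name made up for this example.

```python
import pandas as pd

# Toy records; column names follow the thread, the values are invented.
df = pd.DataFrame({
    "app_name": ["a", "a", "b", "b", "a", "a"],
    "zone":     ["z1", "z1", "z1", "z1", "z2", "z2"],
    "time":     ["2023-01-01 00:00", "2023-01-01 01:00"] * 3,
    "value":    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# ngroup() numbers each (app_name, zone) pair in sorted key order,
# so every row of the same series gets the same integer id.
df["seq_id"] = df.groupby(["app_name", "zone"]).ngroup()

# Hypothetical lookup table from (app_name, zone) to the integer id.
seq_ids = (
    df.drop_duplicates(["app_name", "zone"])
      .set_index(["app_name", "zone"])["seq_id"]
      .to_dict()
)
print(seq_ids)
```

With an id like this, the model can tell series apart without 'zone' or 'app_name' appearing as separate feature columns.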


mw66 commented Apr 15, 2023

OK, so app_name and zone are there, but what about the previous values of the raw input sequence (inside the window)?

Let's check the raw input sequence data, in:
https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L17-L26

        single_df = grouped_data[i][1].drop(labels=['app_name', 'zone'], axis=1).sort_values(by="time", ascending=True)
        times = pd.to_datetime(single_df.time)
        single_df['weekday'] = times.dt.dayofweek / 6
        single_df['hour'] = times.dt.hour / 23
        single_df['month'] = times.dt.month / 12
        temp_data = single_df.values[:, 1:]    # L22, 'time' column is dropped here
        if (temp_data[:, 0] == 0).sum() / len(temp_data) > 0.2:
            continue

        all_data.append(temp_data)

We can see that temp_data[:, 0] is the raw input sequence: 'app_name' and 'zone' are dropped on L17 and 'time' is dropped on L22, so temp_data[:, 0] is the 'value' column of the original csv file.
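The column reasoning above can be checked on a small made-up group. This is a sketch only: it assumes the csv columns are ordered app_name, zone, time, value (the order is not shown in the thread), and `group` and `remaining_columns` are names invented for the example.

```python
import pandas as pd

# One invented (app_name, zone) group, deliberately out of time order.
group = pd.DataFrame({
    "app_name": ["a", "a", "a"],
    "zone": ["z1", "z1", "z1"],
    "time": ["2023-04-14 02:00", "2023-04-14 00:00", "2023-04-14 01:00"],
    "value": [30.0, 10.0, 20.0],
})

# Same steps as the quoted preprocessing lines.
single_df = group.drop(labels=["app_name", "zone"], axis=1).sort_values(by="time", ascending=True)
times = pd.to_datetime(single_df.time)
single_df["weekday"] = times.dt.dayofweek / 6
single_df["hour"] = times.dt.hour / 23
single_df["month"] = times.dt.month / 12

# Slicing from column 1 onward removes 'time' and leaves 'value' first.
remaining_columns = list(single_df.columns[1:])
temp_data = single_df.values[:, 1:]

print(remaining_columns)
print(temp_data[:, 0])
```

Under that column-order assumption, column 0 of temp_data is 'value', followed by the three scaled calendar covariates.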

Then, in
https://github.com/ant-research/Pyraformer/blob/master/preprocess_flow.py#L55

  single_data[:, 0] = seq_data.copy()

so single_data[:, 0] holds the real raw input sequence data,

but in https://github.com/ant-research/Pyraformer/blob/master/data_loader.py#L513-L518

        cov = all_data[:, :, 1:]   # the real raw input sequence data 'value' (all_data[:, :, 0]) dropped?

        split_start = len(label[0]) - self.pred_length + 1
        data, label = split(split_start, label, cov, self.pred_length)

        return data, label

the 'value' channel appears to be dropped from the training data?

That's my question: are the previous values of the raw input sequence not used at all in training?
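To make the slicing concrete, here is a small NumPy sketch on a toy array. It shows that `all_data[:, :, 1:]` does leave channel 0 out of `cov`, but also that the quoted call passes `label` to `split` alongside `cov`, so past values could still reach the model through `label`. Two things here are assumptions, not facts about the repository: that `label` holds channel 0, and the helper `hypothetical_window`, which is not the repo's `split()`.

```python
import numpy as np

# Toy array shaped (num_series, seq_len, num_features);
# feature 0 plays the role of 'value'.
num_series, seq_len, num_features = 2, 5, 4
all_data = np.arange(num_series * seq_len * num_features, dtype=float)
all_data = all_data.reshape(num_series, seq_len, num_features)

label = all_data[:, :, 0]   # assumption: label carries the 'value' channel
cov = all_data[:, :, 1:]    # covariates only; channel 0 is absent here

def hypothetical_window(label_row, cov_row, start, length):
    """Illustrative only, NOT the repo's split(): re-attach past values to cov."""
    values = label_row[start:start + length, None]
    return np.concatenate([values, cov_row[start:start + length]], axis=1)

window = hypothetical_window(label[0], cov[0], start=0, length=3)
print(cov.shape)     # covariate channels only
print(window.shape)  # values plus covariates for one window
```

Whether training really sees the past values therefore depends on what `split` does with `label`, which the quoted lines do not show.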
