drop NaN before sending to ML? #3

Open
FlorinAndrei opened this issue Dec 20, 2019 · 3 comments

Comments

@FlorinAndrei
Contributor

FlorinAndrei commented Dec 20, 2019

https://github.com/PlatorSolutions/quarterly-earnings-machine-learning-algo/blob/4ce8e1e829ed6f6ecd74f8abbfaf91114af1201b/cloudml_prepare_local_csv.py#L31

Should there be a df.dropna(subset=['prc_change_t2']) here?

I collected the data for the last 10 years, which gives me 91k rows at that point. But if I run .dropna(subset=['prc_change_t2']), only about 20k rows remain. I think the rows with NaN should not be sent to ML at all.

@FlorinAndrei
Contributor Author

FlorinAndrei commented Dec 20, 2019

The same goes for the text column. I ran dropna() on that one too, and the final table now has about 18k rows.
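A minimal sketch of the two filtering steps described so far, on a made-up four-row table. Only the column names `prc_change_t2`, `text`, and `filename` come from this thread; the values and the variable names `labeled` and `clean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the prepared table; the values are invented for illustration.
df = pd.DataFrame({
    "filename": ["a.txt", "b.txt", "c.txt", "d.txt"],
    "text": ["q3 results", None, "guidance raised", "revenue miss"],
    "prc_change_t2": [0.04, -0.01, np.nan, -0.07],
})

rows_before = len(df)

# Step 1: drop rows whose label (price change) is missing.
labeled = df.dropna(subset=["prc_change_t2"])

# Step 2: drop rows whose text is missing.
clean = labeled.dropna(subset=["text"])

# Report how many rows survive each step.
print(rows_before, len(labeled), len(clean))
```

Printing the row count after each step makes it easy to see how much of the table each filter removes.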

@FlorinAndrei
Contributor Author

FlorinAndrei commented Dec 20, 2019

Finally, should the filename column even be kept in the CSV? It's not used by ML at all, is it?

The CSV is big enough as it is, so we might as well trim the columns that aren't used.
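For illustration, one way to drop an unused column before writing the CSV. This is a sketch on toy data, not the change that ended up in the repository; the in-memory buffer stands in for the real output file.

```python
import io
import pandas as pd

# Toy table; only the column names come from this thread.
df = pd.DataFrame({
    "filename": ["a.txt", "d.txt"],
    "text": ["q3 results", "revenue miss"],
    "prc_change_t2": [0.04, -0.07],
})

# errors="ignore" keeps this safe if the column was already removed upstream.
trimmed = df.drop(columns=["filename"], errors="ignore")

# Write to an in-memory buffer instead of a real file for this example.
buffer = io.StringIO()
trimmed.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```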

@Ben-Sherman
Owner

Ben-Sherman commented Dec 21, 2019

Yes, you're right about dropping the filename column. I added that in c5b1458.

As for another dropna, there should be nothing left to drop at that point if you're joining with the financial dataframe, because that column has already had its nulls dropped there: https://github.com/PlatorSolutions/quarterly-earnings-machine-learning-algo/blob/c5b145852bbf8f17f3e472eb5fb319e254a554a3/cloudml_prepare_local_csv.py#L8
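Whether NaNs can reappear after the join depends on the join type, which this thread does not state. The sketch below uses toy data and a hypothetical `ticker` key (an assumption, not a column confirmed by the source) to show one possible way the two observations could both hold: an inner join keeps only matched rows, while a left join from the text side reintroduces NaN for unmatched rows.

```python
import numpy as np
import pandas as pd

# Financial side: nulls in the label column are dropped up front.
financial = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],          # hypothetical join key
    "prc_change_t2": [0.04, np.nan, -0.07],
}).dropna(subset=["prc_change_t2"])

# Text side: has rows with no surviving match on the financial side.
text = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "text": ["q3 results", "guidance raised", "revenue miss", "new ceo"],
})

inner = text.merge(financial, on="ticker", how="inner")
left = text.merge(financial, on="ticker", how="left")

# Count missing labels after each kind of join.
inner_missing = int(inner["prc_change_t2"].isna().sum())
left_missing = int(left["prc_change_t2"].isna().sum())
print(inner_missing, left_missing)
```

If the row count falls sharply after a later dropna, checking the `how` argument of the merge (or the equivalent join call) would be a reasonable first step.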


2 participants