Bug report on the pandas merge lecture #198

Closed
jlperla opened this issue Jan 3, 2022 · 6 comments · May be fixed by #199

Comments

@jlperla
Member

jlperla commented Jan 3, 2022

From @jstac and an emailed bug report from Guilaume

I was looking at the merge part of the pandas lecture in the QE for data science series.
The airline example contains code that does not work: the airline codes are not matched with their descriptions.
It looks like replacing

avg_delays_w_code = avg_delays.join(carrier_code)

with

avg_delays_w_code = avg_delays.merge(carrier_code, left_index=True, right_on='Code')

does the trick.
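
A minimal sketch of the proposed merge fix. The two frames below are made-up stand-ins for the lecture's data (the column names ArrDelay and Description and all values are assumptions). The only thing taken from the report is that the left frame is indexed by carrier code and the right frame has a Code column.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's avg_delays: indexed by carrier code.
avg_delays = pd.DataFrame(
    {"ArrDelay": [5.0, 12.5]},
    index=pd.Index(["AA", "DL"], name="Carrier"),
)

# Hypothetical stand-in for carrier_code: default integer index, codes in a column.
carrier_code = pd.DataFrame(
    {
        "Code": ["AA", "DL", "UA"],
        "Description": ["American Airlines", "Delta Air Lines", "United Air Lines"],
    }
)

# Match the left frame's index against the right frame's "Code" column.
avg_delays_w_code = avg_delays.merge(carrier_code, left_index=True, right_on="Code")
print(avg_delays_w_code)
```

Because merge defaults to an inner join, carriers that appear only in carrier_code (UA here) are dropped from the result.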

@wupeifan Fyi. @jbrightuniverse could help fix if you concur with this.

@wupeifan
Contributor

wupeifan commented Jan 3, 2022

I'm not sure whether this is a bug, because it works fine on my side. After all, DataFrame.join should be equivalent to what the merge line does. Maybe @jbrightuniverse can help verify whether this is indeed a problem?

@jlperla
Member Author

jlperla commented Jan 3, 2022

Totally up to you, Peifan, if you think this is expected behavior.

@jbrightuniverse
Collaborator

When I copy the example and run it on my end in a Python script, I get the same set of NaN entries.

When I use the fix avg_delays_w_code = avg_delays.merge(carrier_code, left_index=True, right_on='Code'), it works.

According to the documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html, join() called without any keys joins "index on index" (I'm not entirely sure what that refers to). I haven't debugged far enough to determine why that produces all NaN, but the merge line definitely works because it explicitly matches the Code column of the right-hand DataFrame against the index of the left-hand one.
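
A small reproduction of what "index on index" means here, using made-up data (the frames, column names, and values are assumptions, not the lecture's). When the right frame keeps its default integer index, none of its labels equal the carrier codes in the left index, so the left join finds no matches and fills the right-hand columns with NaN.

```python
import pandas as pd

# Hypothetical left frame: indexed by carrier code.
avg_delays = pd.DataFrame(
    {"ArrDelay": [5.0, 12.5]},
    index=pd.Index(["AA", "DL"], name="Carrier"),
)

# Hypothetical right frame: default integer index (0, 1, 2), codes in a column.
carrier_code = pd.DataFrame(
    {
        "Code": ["AA", "DL", "UA"],
        "Description": ["American Airlines", "Delta Air Lines", "United Air Lines"],
    }
)

# join() with no keys compares index labels: "AA"/"DL" against 0/1/2.
joined = avg_delays.join(carrier_code)
print(joined)

# No label matches, so every right-hand column is entirely missing.
print(joined[["Code", "Description"]].isna().all())
```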

For reference, we currently see this:
[image: screenshot of the current output]
I think we should be seeing this instead:
[image: screenshot of the expected output]

@jbrightuniverse
Collaborator

It could have something to do with the left dataset not having a specified index; I can check this later today.

@wupeifan
Contributor

wupeifan commented Jan 4, 2022

@jbrightuniverse Thanks, let me have a look at this. It could be because of different versions of Pandas (because it worked well on the Syzygy server).

@wupeifan
Contributor

wupeifan commented Jan 4, 2022

OK. This is because pd.read_csv changed its default behavior by setting an additional level of numerical indices.
Changing avg_delays_w_code = avg_delays.join(carrier_code) to avg_delays_w_code = avg_delays.join(carrier_code.set_index("Code")) would be a minimal working adjustment.
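
A minimal sketch of that adjustment on made-up data (the frames, column names, and values are assumptions). It assumes carrier_code arrives with a default integer index and a Code column, so moving Code into the index lets join() line up the two frames by carrier code.

```python
import pandas as pd

# Hypothetical left frame: indexed by carrier code.
avg_delays = pd.DataFrame(
    {"ArrDelay": [5.0, 12.5]},
    index=pd.Index(["AA", "DL"], name="Carrier"),
)

# Hypothetical right frame: default integer index, codes in a column.
carrier_code = pd.DataFrame(
    {
        "Code": ["AA", "DL", "UA"],
        "Description": ["American Airlines", "Delta Air Lines", "United Air Lines"],
    }
)

# Move "Code" into the index so both frames share carrier codes as labels.
avg_delays_w_code = avg_delays.join(carrier_code.set_index("Code"))
print(avg_delays_w_code)
```

Unlike the merge variant, this keeps the left frame's index and does not carry a separate Code column in the result.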
