Bug report on the pandas merge lecture #198

jlperla · 2022-01-03T19:20:02Z

From @jstac and an emailed bug report from Guilaume

I was looking at the merge part of the Pandas lecture in the QE for data science series.
The airline example contains code that does not work: the airline codes do not match the descriptions.
It looks like replacing

avg_delays_w_code = avg_delays.join(carrier_code )

with

avg_delays_w_code = avg_delays.merge(carrier_code, left_index = True, right_on = 'Code')

does the trick.
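A minimal sketch of the proposed merge call. The two frames are hypothetical stand-ins, not the lecture's dataset, and the column names `ArrDelay` and `Description` are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's avg_delays: indexed by carrier code.
avg_delays = pd.DataFrame(
    {"ArrDelay": [5.0, 12.5]},
    index=pd.Index(["AA", "DL"], name="Carrier"),
)

# Hypothetical stand-in for carrier_code: codes live in a column,
# and the index is the default 0, 1, 2.
carrier_code = pd.DataFrame(
    {
        "Code": ["AA", "DL", "UA"],
        "Description": ["American Airlines", "Delta Air Lines", "United Air Lines"],
    }
)

# Match the left frame's index against the right frame's "Code" column.
avg_delays_w_code = avg_delays.merge(carrier_code, left_index=True, right_on="Code")
print(avg_delays_w_code)
```

Because the key on the right is named explicitly, each carrier code on the left is paired with its description regardless of how `carrier_code` is indexed.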

@wupeifan Fyi. @jbrightuniverse could help fix if you concur with this.


wupeifan · 2022-01-03T20:49:18Z

I don't know whether it's a bug or not because it's working well on my side. After all, pd.join is exactly equivalent to what the merge line does. Maybe @jbrightuniverse can help verify whether this is indeed a problem?

jlperla · 2022-01-03T21:57:32Z

Totally up to you Peifan if you think this is expected behavior

jbrightuniverse · 2022-01-04T19:29:42Z

When I copy the example and run it on my end in a Python script, I get the same set of NaN entries.

When I use the fix avg_delays_w_code = avg_delays.merge(carrier_code, left_index = True, right_on = 'Code'), it works.

According to the documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html, calling join() without specifying any keys joins "index on index" (I'm not entirely sure what that refers to). I haven't debugged far enough to determine why that produces all NaN, but the merge line definitely works because it's specifically matching on the Code column of the right-hand dataset against the existing entries on the left.
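One possible illustration of "index on index", using hypothetical stand-in frames rather than the lecture's data (the column names `ArrDelay` and `Description` are assumptions). Here the left index holds carrier codes while the right frame only has the default integer index, so no labels match:

```python
import pandas as pd

# Hypothetical left frame: the index holds the carrier codes.
avg_delays = pd.DataFrame(
    {"ArrDelay": [5.0, 12.5]},
    index=pd.Index(["AA", "DL"], name="Carrier"),
)

# Hypothetical right frame: codes are an ordinary column; the index is 0, 1, 2.
carrier_code = pd.DataFrame(
    {
        "Code": ["AA", "DL", "UA"],
        "Description": ["American Airlines", "Delta Air Lines", "United Air Lines"],
    }
)

# join() with no keys compares avg_delays.index ("AA", "DL")
# with carrier_code.index (0, 1, 2). Nothing lines up, so the
# columns coming from carrier_code are filled with NaN.
joined = avg_delays.join(carrier_code)
print(joined)
```

Under these assumptions the left rows survive (the default is a left join), but every value taken from the right frame is missing.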

For reference, we currently see this:

I think we should be seeing this instead:

jbrightuniverse · 2022-01-04T19:43:22Z

It could have something to do with the left dataset not having a specified index; I can check this later today.

wupeifan · 2022-01-04T19:46:39Z

@jbrightuniverse Thanks, let me have a look at this. It could be because of different versions of Pandas (because it worked well on the Syzygy server).

wupeifan · 2022-01-04T20:26:23Z

OK. This is because pd.read_csv changed its default behavior by setting an additional level of numerical indices.
Changing avg_delays_w_code = avg_delays.join(carrier_code ) to avg_delays_w_code = avg_delays.join(carrier_code.set_index("Code")) would be a minimal working adjustment.
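A short sketch of the `set_index` adjustment on hypothetical stand-in data (not the lecture's dataset; `ArrDelay` and `Description` are assumed column names):

```python
import pandas as pd

# Hypothetical left frame, indexed by carrier code.
avg_delays = pd.DataFrame(
    {"ArrDelay": [5.0, 12.5]},
    index=pd.Index(["AA", "DL"], name="Carrier"),
)

# Hypothetical right frame as it might come out of pd.read_csv
# with a default integer index.
carrier_code = pd.DataFrame(
    {
        "Code": ["AA", "DL", "UA"],
        "Description": ["American Airlines", "Delta Air Lines", "United Air Lines"],
    }
)

# Move "Code" into the index so both frames are indexed by carrier code;
# the index-on-index join then finds matching labels.
avg_delays_w_code = avg_delays.join(carrier_code.set_index("Code"))
print(avg_delays_w_code)
```

With both frames indexed by carrier code, the default left join keeps the rows of `avg_delays` and attaches the matching description to each.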

wupeifan mentioned this issue Jan 4, 2022

Fix an issue in merge lecture #199

Open

doctor-phil closed this as completed Oct 5, 2022
