Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() does #2225

Open
nikeshv opened this issue Nov 4, 2022 · 1 comment


nikeshv commented Nov 4, 2022

Hi,
I have a Koalas DataFrame with Age and Income columns. I calculated the z-score of each column and then computed the norm of the two z-scores (Age_zscore and Income_zscore) into a new column named sq_dist. When I call idxmin() on the new column, it does not return the index of the minimum value.
I did the same operations on a pandas DataFrame, and there idxmin() does return the index of the minimum value.

Please find attached the notebook for step by step operations I performed.
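
To pin down what result is expected independently of either library, here is a minimal pure-Python sketch of the same computation: per-column z-scores, the Euclidean norm of each row, and the position of the smallest norm. The helper name `sq_dist_argmin` and the toy data are hypothetical and only for illustration. It assumes the population standard deviation (ddof=0), which is the default of `scipy.stats.zscore`.

```python
import math
import statistics

def sq_dist_argmin(ages, incomes):
    """Return (position, value) of the smallest z-score norm."""
    # Population mean and standard deviation (ddof=0) per column.
    age_mu, age_sd = statistics.fmean(ages), statistics.pstdev(ages)
    inc_mu, inc_sd = statistics.fmean(incomes), statistics.pstdev(incomes)
    # Euclidean norm of the (Age_zscore, Income_zscore) pair for each row.
    dists = [
        math.hypot((a - age_mu) / age_sd, (i - inc_mu) / inc_sd)
        for a, i in zip(ages, incomes)
    ]
    # Position of the smallest norm (first occurrence on ties).
    pos = min(range(len(dists)), key=dists.__getitem__)
    return pos, dists[pos]

# Toy data: the row (30, 30) sits exactly at both column means.
pos, value = sq_dist_argmin([10, 20, 30, 40, 50], [50, 10, 30, 20, 40])
```

Note that despite its name, sq_dist holds the norm itself rather than the squared norm. The position of the minimum is the same either way.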

cmd1
import databricks.koalas as ks
import pandas as pd
import random

cmd2
#Create Sample dataframe in Koalas
df = ks.DataFrame.from_dict({
'Age': [random.randint(0, 100000) for i in range(100000)],
'Income': [random.randint(0, 100000) for i in range(100000)]
})

print(df.head(5))
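
Because the sample data above comes from unseeded `random.randint` calls, every run produces different values, which makes the mismatch harder to compare across runs. A minimal sketch of a seeded variant follows. The helper name `make_sample` and its parameters are hypothetical and not part of the original notebook.

```python
import random

def make_sample(n, seed=0, upper=100000):
    """Build the Age/Income columns from a seeded, local generator."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return {
        'Age': [rng.randint(0, upper) for _ in range(n)],
        'Income': [rng.randint(0, upper) for _ in range(n)],
    }

sample = make_sample(5)
```

The resulting dict could then be passed to `ks.DataFrame.from_dict(...)` in place of the inline dict in cmd2.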

cmd3
import scipy.stats as stats
import numpy as np
ks.set_option('compute.ops_on_diff_frames', True)
df['Income_zscore'] = ks.Series(stats.zscore(df['Income'].to_numpy()))
df['Age_zscore'] = ks.Series(stats.zscore(df['Age'].to_numpy()))
df['sq_dist'] = [np.linalg.norm(i) for i in df[['Income_zscore','Age_zscore']].to_numpy()]
ks.set_option('compute.ops_on_diff_frames', False)

cmd4
#display(df)

cmd5
# Get the index label of the minimum sq_dist
minindex = df['sq_dist'].idxmin()
minindex

cmd6
#display min value of sq_dist
df['sq_dist'].iloc[minindex]
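
One thing that may be worth ruling out here, without claiming it is the cause of the mismatch: `idxmin()` returns an index label, while `.iloc` looks rows up by position. The two agree only when labels and row positions coincide. The pandas-only sketch below uses a toy Series (an assumption for illustration, not data from this notebook) to show how they can diverge.

```python
import pandas as pd

# Toy Series whose index labels differ from the row positions.
s = pd.Series([3.0, 1.0, 2.0], index=[2, 0, 1])

label = s.idxmin()            # index label of the minimum value
by_label = s.loc[label]       # label-based lookup: the minimum itself
by_position = s.iloc[label]   # positional lookup using the label: a different row
```

If the row order of the Koalas DataFrame does not follow its index labels, looking the value up with `.loc[minindex]` instead of `.iloc[minindex]` would be the safer comparison against the SQL result.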

cmd7
df.to_spark().createOrReplaceTempView("koalastable")

cmd8
%sql
select min(sq_dist) from koalastable -- This doesn't match the value we got in cmd6

cmd9
# Do the same operations with pandas
df_spark = df.to_spark()
stats_array = np.array(df_spark.select('Age', 'Income').collect())
normalized_data = stats.zscore(stats_array, axis=0)
df_pd = pd.DataFrame(data=normalized_data, columns=['Age', 'Income'])
df_pd['sq_dist'] = [np.linalg.norm(i) for i in normalized_data]
df_pd.head(5)

cmd10
minindex_pd=df_pd['sq_dist'].idxmin()
minindex_pd

cmd11
# Minimum of sq_dist using pandas
df_pd['sq_dist'].iloc[minindex_pd]

cmd12
spark.createDataFrame(df_pd).createOrReplaceTempView("pandastable")

cmd13
%sql
select min(sq_dist) from pandastable -- This matches the value we got in cmd11


nikeshv commented Nov 4, 2022
