Spark preprocessor optimization #123

basavaraj29 · 2022-11-07T18:54:35Z

removing id assignment for edges
using zipWithIndex instead of repartition(1) and windowing
partitionBy([src_bucket, dst_bucket])
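
To illustrate the last item: grouping edges by a (src_bucket, dst_bucket) pair puts every edge bucket in its own group, which is what partitioning output by those two columns would achieve. The PR text does not show how bucket IDs are derived, so the sketch below assumes each bucket is a contiguous range of node IDs of a fixed size; `bucket_of`, `group_by_bucket`, and `partition_size` are hypothetical names, and this is plain Python rather than the Spark code in the PR.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int, int]  # (src_id, rel_id, dst_id)


def bucket_of(node_id: int, partition_size: int) -> int:
    """Assumption: node partitions are contiguous ID ranges of equal size."""
    return node_id // partition_size


def group_by_bucket(
    edges: Iterable[Edge], partition_size: int
) -> Dict[Tuple[int, int], List[Edge]]:
    """Group edges by their (src_bucket, dst_bucket) key."""
    buckets: Dict[Tuple[int, int], List[Edge]] = defaultdict(list)
    for src, rel, dst in edges:
        key = (bucket_of(src, partition_size), bucket_of(dst, partition_size))
        buckets[key].append((src, rel, dst))
    return dict(buckets)


if __name__ == "__main__":
    sample = [(0, 0, 5), (1, 1, 2), (4, 0, 7), (6, 2, 1)]
    for key, group in sorted(group_by_bucket(sample, partition_size=4).items()):
        print(key, group)
```

Keying on both bucket IDs rather than only the source bucket means a reader can load exactly one edge bucket without scanning edges that belong to other destination partitions.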

todo:

custom binary writer to eliminate intermediate csv


shivaram · 2022-11-07T18:57:55Z

This is great. Do we have any numbers on how much this improves pre-processing?

basavaraj29 · 2022-11-07T19:19:23Z

On the freebase 86m dataset, the Spark preprocessor earlier took ~70 min. It now takes ~10 min. I guess the id assignment for the edges (which we don't really need) was the major bottleneck. Also, earlier we were triggering a repartition(1) on both the nodes and relations dataframes. Now we have replaced that with Spark's library function zipWithIndex.
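
A rough sketch of why zipWithIndex avoids the single-partition bottleneck: contiguous indices can be assigned by counting the elements in each partition, taking a prefix sum of those counts to get each partition's starting offset, and then numbering elements locally. The plain-Python model below only illustrates that idea; it is not Spark's implementation and not the code in this PR, and `partition_offsets` and `zip_with_index` are hypothetical names.

```python
from itertools import accumulate
from typing import Hashable, List, Sequence, Tuple


def partition_offsets(partitions: Sequence[Sequence[Hashable]]) -> List[int]:
    """Start index of each partition: an exclusive prefix sum of partition sizes."""
    sizes = [len(part) for part in partitions]
    return [0] + list(accumulate(sizes))[:-1]


def zip_with_index(
    partitions: Sequence[Sequence[Hashable]],
) -> List[List[Tuple[Hashable, int]]]:
    """Pair every element with a global index without merging the partitions.

    Each partition only needs its own start offset, so the numbering step
    can run on all partitions independently.
    """
    offsets = partition_offsets(partitions)
    return [
        [(item, start + i) for i, item in enumerate(part)]
        for part, start in zip(partitions, offsets)
    ]


if __name__ == "__main__":
    node_partitions = [["a", "b"], ["c"], [], ["d", "e"]]
    print(partition_offsets(node_partitions))
    print(zip_with_index(node_partitions))
```

With repartition(1), by contrast, every row has to be moved to one partition before a window function can number them, so a single task ends up doing all the work.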

thodrek · 2022-11-07T20:54:23Z

Excellent work!


removing id assignment to edges, using zipwithindex instead of repati…

cbc9b0a

…tion(1) and windowing

basavaraj29 requested a review from JasonMoho November 7, 2022 18:54

fixed lint issues

dabdabf

partition by both src and dst bucket id

8e1a31f
