Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RING-44425 - Comments for SPARK scripts #48

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

TrevorBenson
Copy link
Member

There are a few spots with ???? which either the entire operation I wanted to confirm before adding a comment, or there is a description, but I am not 100% confident in its accuracy and need to compare some more current output from the scripts to be sure.

Copy link
Contributor

@scality-fno scality-fno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stashing my review here — to be continued.
I'm considering going all the way down to the very variables that are used, to have clear view on what data frames we create and filter out, etc.

scripts/S3_FSCK/s3_fsck_p0.py Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
@TrevorBenson
Copy link
Member Author

TrevorBenson commented Sep 28, 2023

stashing my review here — to be continued. I'm considering going all the way down to the very variables that are used, to have clear view on what data frames we create and filter out, etc.

I've applied most. Questions posed to outstanding suggestions.

… github.com:scality/spark into improvement/RING-44425-S3_FSCK-scripts-add-comments
@TrevorBenson TrevorBenson changed the title Comments for SPARK scripts RING-44425 - Comments for SPARK scripts Oct 2, 2023
Copy link
Contributor

@scality-fno scality-fno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New pass on P0. TODO: evaluate the need for each column to exist.

scripts/S3_FSCK/s3_fsck_p0.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p0.py Show resolved Hide resolved
@scality-fno scality-fno marked this pull request as ready for review October 9, 2023 15:05
@TrevorBenson
Copy link
Member Author

New pass on P0. TODO: evaluate the need for each column to exist.

Let's keep the focus on comments, including if columns are used or not used in later steps.

Once we get this approved we can use a new ticket to suggest minimizing the data written to fields actually used in later scripts. 🤞 with our comments detailing how it all works we can get approval to change the actual scripts.

@TrevorBenson TrevorBenson force-pushed the improvement/RING-44425-S3_FSCK-scripts-add-comments branch from e99b340 to 50c5108 Compare October 10, 2023 15:46
@scality-fno
Copy link
Contributor

New pass on P0. TODO: evaluate the need for each column to exist.

Let's keep the focus on comments, including if columns are used or not used in later steps.

Once we get this approved we can use a new ticket to suggest minimizing the data written to fields actually used in later scripts. 🤞 with our comments detailing how it all works we can get approval to change the actual scripts.

👍
I'll make the time for P1, then

@TrevorBenson
Copy link
Member Author

TrevorBenson commented Oct 10, 2023 via email

@TrevorBenson TrevorBenson force-pushed the improvement/RING-44425-S3_FSCK-scripts-add-comments branch from 3eb5bc2 to 644b54e Compare October 10, 2023 20:50
Copy link
Contributor

@scality-fno scality-fno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"P1" review: thanks to @fra-scality and his jupyter notebooks, figured out how the dataframes were used. The more I read it, the less I understand. Either I'm missing sth obvious or the naming is terrible.

scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p1.py Outdated Show resolved Hide resolved
Copy link
Contributor

@scality-fno scality-fno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite sure there's not much to say (in the current design) about P2, P3 and P4.
One question I have is: can we make the P2 anti-join operation more efficient?

scripts/S3_FSCK/s3_fsck_p2.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p2.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p2.py Outdated Show resolved Hide resolved
scripts/S3_FSCK/s3_fsck_p4.py Outdated Show resolved Hide resolved
dfCOSsingle = dfCOSsingle.withColumn("ringkey",dfCOSsingle["_c1"])
# ???
dfCOSsingle = dfCOSsingle.withColumn("_c1",F.expr("substring(_c1, 1, length(_c1)-14)"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This operation likely explains why the withColumn() is used to duplicate _c1 into ringkey instead of a withColumnRenamd() for dfCOSsingle.

dfnew = rdd.flatMap(lambda x: x).toDF()

single = "%s://%s/%s/s3fsck/s3-dig-keys.csv" % (PROTOCOL, PATH, RING)
# write the dataframe to a csv file with a header
# output structure: (digkey, sproxyd input key, subkey if available)
dfnew.write.format("csv").mode("overwrite").options(header="true").save(single)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want correct headers, this write operation w/ header="true" is where we get our first instance of generic _c0, _c1 column names. We could perform:

Suggested change
dfnew.write.format("csv").mode("overwrite").options(header="true").save(single)
dfnew = dnew.withColumnRenamed("_c0", "digkey).withColumnRenamed("_c1", "input_key").withColumnRenamed("_c2", "subkey")
dfnew.write.format("csv").mode("overwrite").options(header="true").save(single)

Requires updating the p2 script to read the new column names instead of the generic ones.

# e.g. 555555A4948FAA554034E155555555A61470C07A,8000004F3F3A54FFEADF8C00000000511470C070,g1disk1,0
# Required Fields:
# - _c1 (main chunk)
# - _c3 (FLAG)
df = spark.read.format("csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").load(files)
Copy link
Member Author

@TrevorBenson TrevorBenson Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another spot we can inject valid headers prior to later commands, making them a bit simpler to comprehend:

Suggested change
df = spark.read.format("csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").load(files)
df = spark.read.format("csv").option("header", "false").option("inferSchema", "true").option("delimiter", ",").load(files)\
df = df.withColumnRenamed("_c0", "ringkey").withColumnRenamed("_c1", "mainchunk").withColumnRenamed("_c2", "disk").withColumnRenamed("_c3", "flag")

in this example I name the _c0 (ring chunk keys) as ringkey, instead of the naming _c1, the main chunk, as ringkey. I think this could potentially reduce confusion if we decide to be very specific and use explicit terms for each data type

  • ringkey (or ring_key) # The 30-33 chunk keys and 70-7B chunk keys
  • mainchunk (or main_chunk) # The 30 or 70 main chunk (aka zero keys)
  • disk
  • flag
  • inputkey (or input_key) # The sproxyd input key
  • digkey (or dig_key) # The md5sum digged from the main chunk

I like the underscore versions for better readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whatever's easier to read is fine by me

@TrevorBenson
Copy link
Member Author

"P1" review: thanks to @fra-scality and his jupyter notebooks, figured out how the dataframes were used. The more I read it, the less I understand. Either I'm missing sth obvious or the naming is terrible.

It might be useful to contribute those to the repository so we can improve upon them as needed. We might need to repeat this process for SOFS_FSCK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants