Duplicate ERA content found by Google Search Console #3372
New items found in the last two weeks:
Rough query to find Items without an attached file (some may be missing a file on purpose)
Rough query to find Theses without an attached file (some may be missing a file on purpose)
Find Items with the same title as a record without an attachment
Find Theses with the same title as a record without an attachment
Find Items with the same title as a record without an attachment: add the sometimes-present name of the attached file to help
Find Theses with the same title as a record without an attachment: add the sometimes-present name of the attached file to help (rough sketches of these queries follow below)
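These could look roughly like the following Rails console sketch. It assumes the files are Active Storage attachments recorded with name = "file" (as referenced later in this issue) and that Item/Thesis expose a `files` association and a `title` column; the app's real association and column names may differ.

```ruby
# Hypothetical sketch -- not the exact queries that were run.
def ids_with_file(model)
  ActiveStorage::Attachment
    .where(record_type: model.name, name: "file")
    .pluck(:record_id)
end

# 1. Items / Theses without an attached file (some may be intentional)
items_without_file  = Item.where.not(id: ids_with_file(Item))
theses_without_file = Thesis.where.not(id: ids_with_file(Thesis))

# 2. Items sharing a title with a record that has no attachment
dupe_candidates = Item.where(title: items_without_file.pluck(:title))
                      .where.not(id: items_without_file.select(:id))

# 3. Same list, with the (sometimes present) attached filename to aid review
dupe_candidates.map do |item|
  filename = item.try(:files)&.first&.filename&.to_s # assumed association name
  [item.id, item.title, filename]
end
```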
Notes:
- Resources with similar metadata but different file attachments will likely need additional metadata to avoid Google labeling them as "Duplicate, Google chose different canonical than user".
- Updated list sent to the ERA service team for review.
- Related to #3289
When the "Duplicate without user-selected canonical" sitemap filter is applied in Google Search Console, three items appeared where Google thinks the content is similar to another item in the sitemap. Inspecting these with the Google Search Console URL inspection tool, the "User-declared canonical" and "Google-selected canonical" appear very similar. E-mail sent to the erahelp team for advice (Jan 31; ref. #3289 (comment)).
The next week, the Google Search Console reported additional items. These items seem to be older (i.e., not added in the last week).
Question: is there a way to test for duplicates more efficiently than Google?
Attempt 1: use the Active Storage database table active_storage_blobs, specifically its checksum column, to verify each attachment is unique and thereby find any duplicate items. The number of blobs and attachments seems high relative to the number of Items and Theses. Could this be related to #3248?
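For example, the raw counts can be compared from the Rails console (assuming the standard Active Storage models):

```ruby
ActiveStorage::Blob.count        # rows in active_storage_blobs
ActiveStorage::Attachment.count  # rows in active_storage_attachments
Item.count
Thesis.count
```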
Let's test: each active_storage_blob should appear only once for each unique attachment (i.e., no two blobs should share a checksum), right?
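A rough way to check, grouping blobs by checksum and keeping only the checksums that appear more than once (a sketch, not necessarily the exact query that was run):

```ruby
duplicate_checksums = ActiveStorage::Blob
                        .group(:checksum)
                        .having("COUNT(*) > 1")
                        .count
duplicate_checksums.size # number of distinct checksums stored more than once
```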
Why are there so many blobs with the same checksum? Let's filter the attachment count by record_type.
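Something along these lines (sketch), restricting the duplicate-checksum count to attachments on Item and Thesis records:

```ruby
item_thesis_dupes = ActiveStorage::Attachment
                      .joins(:blob)
                      .where(record_type: %w[Item Thesis])
                      .group("active_storage_blobs.checksum")
                      .having("COUNT(*) > 1")
                      .count
item_thesis_dupes.size # noticeably smaller than the unfiltered count
```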
Why the decrease in numbers?
Maybe due to the filter on record_type? The answer seems to be "yes", based on the output below.
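One way to confirm is to break the attachment rows down by record_type; the DraftItem and DraftThesis copies appear to account for the difference (sketch):

```ruby
ActiveStorage::Attachment.group(:record_type).count
# e.g. => { "Item" => ..., "Thesis" => ..., "DraftItem" => ..., "DraftThesis" => ... }
```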
In the list of duplicated checksums, let's find all the record_ids that have attachments pointing at a duplicated checksum (Item or Thesis record types with attachment name = "file"). This output will also return draft records if they are attached to a duplicate checksum.
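A sketch of that lookup (variable names are illustrative only):

```ruby
# Checksums duplicated among Item/Thesis attachments named "file"
dupe_checksums = ActiveStorage::Attachment
                   .joins(:blob)
                   .where(name: "file", record_type: %w[Item Thesis])
                   .group("active_storage_blobs.checksum")
                   .having("COUNT(*) > 1")
                   .pluck("active_storage_blobs.checksum")

# Every record attached to one of those checksums -- note that draft
# records sharing the same file show up here as well
records_with_dupes = ActiveStorage::Attachment
                       .joins(:blob)
                       .where(active_storage_blobs: { checksum: dupe_checksums })
                       .pluck(:record_type, :record_id, "active_storage_blobs.checksum")
```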
Let's filter out the DraftItem and DraftThesis records.
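Continuing the sketch above:

```ruby
published_dupes = records_with_dupes.reject do |record_type, _record_id, _checksum|
  %w[DraftItem DraftThesis].include?(record_type)
end
```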
Let's write this to a CSV file
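For example (the file name and column layout here are assumptions):

```ruby
require "csv"

CSV.open("duplicate_attachments.csv", "w") do |csv|
  csv << %w[record_type record_id checksum]
  published_dupes.each { |row| csv << row }
end
```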
Let's check if there are records (Item & Thesis) with multiple attachments with the same checksum (i.e., a file attached to a record multiple times):
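A sketch of that check, grouping attachments by record and checksum and keeping the groups larger than one:

```ruby
ActiveStorage::Attachment
  .joins(:blob)
  .where(record_type: %w[Item Thesis])
  .group(:record_type, :record_id, "active_storage_blobs.checksum")
  .having("COUNT(*) > 1")
  .count
# => { ["Item", <record_id>, <checksum>] => 2, ... } where a file is attached to the same record twice
```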
Are these intentional?
Let's output this nicely, in a similar format to the duplicate records finder.
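Roughly, grouping the rows by checksum for readability (the duplicate records finder's exact format isn't shown here, so this layout is a guess):

```ruby
published_dupes.group_by { |_type, _id, checksum| checksum }.each do |checksum, rows|
  puts "checksum: #{checksum}"
  rows.each { |record_type, record_id, _| puts "  #{record_type} #{record_id}" }
end
```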
Google sheet shared: https://docs.google.com/spreadsheets/d/1khOWEk2XusG98vafWBgzACmbM-a5TR7K4Xzy1VcZl6M/edit#gid=1219983193