Data source problem: the new dataset has duplications and misses some documents #113

Open
andrew2net opened this issue Jul 12, 2024 · 6 comments


andrew2net commented Jul 12, 2024

The new dataset contains 417 duplicates. In every case, the duplicated documents have identical URLs.
The URLs of the duplicated documents are listed in log.txt.
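
A check of this kind can be sketched in a few lines. This is only an illustration, assuming the URLs have already been extracted from the dataset into a list; the function name `duplicated_urls` and the sample URLs are hypothetical and are not taken from log.txt.

```python
from collections import Counter


def duplicated_urls(urls):
    """Return {url: count} for every URL that occurs more than once."""
    counts = Counter(urls)
    return {url: n for url, n in counts.items() if n > 1}


# Hypothetical sample; real URLs would come from the dataset records.
sample = [
    "https://example.org/a.pdf",
    "https://example.org/b.pdf",
    "https://example.org/a.pdf",
]

for url, n in duplicated_urls(sample).items():
    print(f"{url} appears {n} times")
```

Counting by URL only shows that two records point at the same document; it does not show whether the records themselves differ.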

Originally posted by @andrew2net in #112 (comment)

@andrew2net

Also, the MODS dataset has 193 fewer documents than allrecords.xml. Because some documents exist only in the MODS dataset, more than 193 documents from allrecords.xml are missing from it. The differences are listed in diff.txt, which reads as follows:

  • Lines that are only in the old dataset are prefixed with <.
  • Lines that are only in the new dataset are prefixed with >.
  • If a file name has changed, it will be shown as the file from the old dataset and then the file from the new dataset.
  • Each change is preceded by the line numbers that apply to that change. The format is start,end for ranges of lines, or just a single number for single lines. The line numbers for the first file and the second file are separated by c (for changed), a (for added), or d (for deleted).
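
The format described above can be split into its parts with a short script. This is a minimal sketch, assuming diff.txt is in the normal `diff` output format with one document identifier per line; the function name `split_diff` and the identifiers in the sample are made up for illustration.

```python
import re

# Change command in normal diff output, e.g. "5d4", "3,4c3,5", "7a8,9".
CHANGE_RE = re.compile(r"^\d+(?:,\d+)?[acd]\d+(?:,\d+)?$")


def split_diff(diff_text):
    """Split normal diff output into old-only lines, new-only lines, and change commands."""
    only_old, only_new, commands = [], [], []
    for line in diff_text.splitlines():
        if line.startswith("< "):
            only_old.append(line[2:])
        elif line.startswith("> "):
            only_new.append(line[2:])
        elif CHANGE_RE.match(line):
            commands.append(line)
        # The "---" separator inside a "c" change is ignored.
    return only_old, only_new, commands


# Made-up sample, not taken from diff.txt.
sample = """2d1
< doc-B
4a4
> doc-X
6c6
< doc-F
---
> doc-F2
"""

old_only, new_only, commands = split_diff(sample)
print("only in old dataset:", old_only)
print("only in new dataset:", new_only)
print("net difference:", len(old_only) - len(new_only))
```

The net difference is the number of old-only lines minus the number of new-only lines. So if the new dataset is 193 documents smaller overall and k documents appear only in it, then 193 + k documents from the old dataset are absent.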

ronaldtse changed the title from "The new dataset has duplications and misses some docs." to "Data source problem: the new dataset has duplications and misses some documents" on Jul 12, 2024

ronaldtse commented Jul 12, 2024

@andrew2net in the diff file I see a number of "IPD" entries.

I believe only the NIST CSRC source provides IPDs, other draft entries, and the CSWPs. The NIST-Tech-Pubs repository never provided these. Can you update the diff file? Thanks.

@andrew2net

@ronaldtse do you mean entries such as NIST IR 8320C ipd? Such documents were present in allrecords.xml:

   ...
   <query key="IR">
      <doi type="report-paper_title">10.6028/NIST.IR.8320C.ipd</doi>
      ...
   <query key="IR">
      <doi type="report-paper_title">10.6028/NIST.IR.8286D.ipd</doi>
      ...
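
Entries like these can be listed with the standard library XML parser. This is a rough sketch only: it assumes `doi` elements without an XML namespace, and the `<records>` wrapper, the function name `draft_dois`, and the third DOI in the sample are placeholders rather than the real structure of allrecords.xml.

```python
import xml.etree.ElementTree as ET


def draft_dois(xml_text, suffix=".ipd"):
    """Return the text of every <doi> element whose value ends with the suffix."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter("doi")
        if el.text and el.text.strip().endswith(suffix)
    ]


# Simplified sample: the <records> wrapper and the last DOI are placeholders.
sample = """<records>
  <query key="IR">
    <doi type="report-paper_title">10.6028/NIST.IR.8320C.ipd</doi>
  </query>
  <query key="IR">
    <doi type="report-paper_title">10.6028/NIST.IR.8286D.ipd</doi>
  </query>
  <query key="IR">
    <doi type="report-paper_title">10.0000/example.final</doi>
  </query>
</records>"""

for doi in draft_dois(sample):
    print(doi)
```

If the real file declares a namespace, the tag passed to `iter` would need the namespace prefix in `{uri}doi` form.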

@ronaldtse

Very interesting. I will report this to NIST.

@ronaldtse

The new dataset has 417 duplications. In each duplication case, the URLs of duplicated documents are identical. Here are the URLs of duplicated docs: log.txt

We now have a command that diffs the duplicated docs to see what happened:

I have posted the detailed diff to NIST here:
