Data source problem: the new dataset has duplications and misses some documents #113
Also, the MODS dataset has 193 fewer docs than allrecords.xml. In addition, some docs exist only in the MODS dataset, so more than 193 docs from allrecords.xml are missing. The full list of differences is in diff.txt.
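The counting argument above can be illustrated with a small sketch. The identifiers below are made up for illustration and are not taken from diff.txt or from either dataset; `compare_ids` is a hypothetical helper, not part of any existing tool. It shows why the net difference in totals understates the number of missing docs when each side contains entries the other lacks.

```python
def compare_ids(allrecords_ids, mods_ids):
    """Return (missing_from_mods, only_in_mods, net_difference)."""
    allrecords = set(allrecords_ids)
    mods = set(mods_ids)
    # Docs present in allrecords.xml but absent from the MODS dataset.
    missing_from_mods = sorted(allrecords - mods)
    # Docs present only in the MODS dataset.
    only_in_mods = sorted(mods - allrecords)
    # The headline number: difference between the two totals.
    net_difference = len(allrecords) - len(mods)
    return missing_from_mods, only_in_mods, net_difference


# Hypothetical identifiers, for illustration only.
allrecords_ids = ["A", "B", "C", "D", "E"]
mods_ids = ["A", "B", "X"]

missing, extra, net = compare_ids(allrecords_ids, mods_ids)
print(f"net difference: {net}")
print(f"missing from MODS: {len(missing)} {missing}")
print(f"only in MODS: {len(extra)} {extra}")
```

In this toy case the totals differ by 2, but 3 docs are missing, because one doc exists only on the MODS side: missing = net difference + docs only in MODS.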
@andrew2net in the diff file I see a number of "IPD" entries. I believe only the NIST CSRC source provides IPDs, other draft entries, and the CSWPs. These were never provided by the NIST-Tech-Pubs repository. Can you update the diff file? Thanks.
@ronaldtse do you mean ...
<query key="IR">
<doi type="report-paper_title">10.6028/NIST.IR.8320C.ipd</doi>
...
<query key="IR">
<doi type="report-paper_title">10.6028/NIST.IR.8286D.ipd</doi>
...
Very interesting. I will report this to NIST.
We now have a command that diffs the duplicated docs to see what happened: I have posted the detailed diff to NIST here:
The new dataset has 417 duplicates. In each case, the URLs of the duplicated documents are identical.
Here are the URLs of duplicated docs:
log.txt
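For anyone who wants to reproduce a check like this, here is a minimal sketch of grouping docs by URL and reporting the URLs shared by more than one doc. It is not the command mentioned in this thread and does not read the real dataset; the doc ids and `example.org` URLs are placeholders, and `find_duplicate_urls` is a hypothetical helper.

```python
from collections import defaultdict


def find_duplicate_urls(docs):
    """Map each URL that appears on more than one doc to the ids sharing it.

    `docs` is an iterable of (doc_id, url) pairs.
    """
    by_url = defaultdict(list)
    for doc_id, url in docs:
        by_url[url].append(doc_id)
    # Keep only URLs that are attached to two or more docs.
    return {url: ids for url, ids in by_url.items() if len(ids) > 1}


# Placeholder records, for illustration only.
docs = [
    ("doc-1", "https://example.org/a.pdf"),
    ("doc-2", "https://example.org/b.pdf"),
    ("doc-3", "https://example.org/a.pdf"),
]

duplicates = find_duplicate_urls(docs)
for url, ids in duplicates.items():
    print(url, ids)
```

Grouping by URL only flags candidates; whether two entries with the same URL are true duplicates or differ in other fields still needs a field-by-field diff.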
Originally posted by @andrew2net in #112 (comment)