(URGENT) Failing to parse a number of original NIST PubIDs from NIST Library's allrecords.xml
#177
Comments
We have to parse all these and provide a mapping, provide these in an index that is rendered. (Old ID and the new PubID) |
@mico the |
@andrew2net will you provide the mapping for @mico to implement the PubID substitutes? Thanks. |
I found some incorrect IDs in the |
@mico what is the progress with this issue? |
@andrew2net which IDs are not parsing correctly right now? Do you have a code snippet that does that?
Are you saying that pubid-nist needs to handle these changes?
|
The current parser can't parse all the IDs and can't solve some ongoing issues. For example, it parses the parts of an ID independently and cannot detect whether a reference has incorrect parts.

    def refparts
      {
        prefix: match(/^(NIST|NBS)/, text),
        series: match(/(SP|FIPS|CSWP|IR|ITL\sBulletin|White\sPaper)(?=\.|\s)/, text),
        code: match(/(?<=\.|\s)[0-9-]+(?:(?!(ver|r|v|[Pp]t)\d|-add\d?)[A-Za-z-])*/, text),
        prt: match(/(?:(?<dl>\.)?pt(?(<dl>)-)|\sPart\s)(?<val>[A-Z\d]+)/, text),
        vol: match(/(?:(?<dl>\.)?v(?(<dl>)-)|\sVol\.\s)(?<val>\d+)/, text),
        ver: match(/(?:(?<dl>\.)?\s?ver|\sVer\.\s)(?<val>\d(?(<dl>)[-\d]|[.\d])*)/, text)&.gsub(/-/, "."),
        rev: match(/(?:(?:(?<dl>\.)|[^a-z])r|\sRev\.\s)(?(<dl>)-)(?<val>\d+)/, text),
        add: match(/(?:(?<dl>\.)?add|\/Add)(?(<dl>)-)(?<val>\d*)/, text),
        draft: !(match(/\((?:Draft|PD)\)/, text).nil? && @opts[:stage].nil?),
      }
    end

    # Returns the named capture "val" if the regex defines one,
    # otherwise the whole matched string; nil when nothing matches.
    def match(regex, code)
      m = regex.match(code)
      return unless m

      m.named_captures["val"] || m.to_s
    end

I didn't test the parser against all of the
I think the pubid-nist should just log incorrect IDs. We will handle them in the relaton-nist. |
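To illustrate the point about independent part matching, here is a minimal sketch. The patterns are deliberately simplified and hypothetical; they are not the relaton-nist or pubid-nist grammar. It contrasts per-part regexes, which happily extract parts from an ID with trailing junk, against a single anchored pattern that must consume the whole string.

```ruby
# Simplified, hypothetical patterns for illustration only.
PARTS = {
  prefix: /^(NIST|NBS)/,
  series: /(SP|FIPS|IR)(?=\.|\s)/,
  code: /(?<=\.|\s)\d+(-\d+)?/,
}.freeze

# One anchored pattern: the entire ID must match, so leftovers are rejected.
WHOLE = /\A(?<prefix>NIST|NBS)[.\s](?<series>SP|FIPS|IR)[.\s](?<code>\d+(?:-\d+)?)(?:r(?<rev>\d+))?\z/.freeze

# Each part is matched on its own; unmatched text elsewhere goes unnoticed.
def independent_parts(text)
  PARTS.transform_values { |re| re.match(text)&.to_s }
end

# Returns a Hash of parts, or nil when the ID as a whole is malformed.
def strict_parts(text)
  m = WHOLE.match(text)
  m && m.named_captures.transform_keys(&:to_sym)
end

p independent_parts("NIST SP 800-53 garbage!!")
p strict_parts("NIST SP 800-53 garbage!!")
p strict_parts("NIST SP 800-53r5")
```

With the anchored approach, a nil result is the signal that an ID could be logged as incorrect rather than silently half-parsed.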
@andrew2net Thanks for the snippet, I was actually wondering if you have a snippet of how pubid-nist does not parse correctly for all IDs. Our principle is: "pubid-nist must parse all NIST PubIDs, whether parsing them right or wrong". @mico could you please help? |
@ronaldtse Oh, here is the snippet:

    require 'pubid/nist'

    File.read('allrecords.txt').each_line do |id|
      Pubid::Nist::Identifier.parse id
    rescue Pubid::Core::Errors::ParseError
      puts id
    end

|
These are the failing IDs. The snippet I used to run:

    require 'pubid/nist'

    # Parse every unique ID and report the ones that raise.
    File.read('allrecords.txt').split("\n").sort.uniq.each do |id|
      Pubid::Nist::Identifier.parse(id.strip)
    rescue StandardError
      puts "ERROR: '#{id.strip}' failed parsing"
    end

Failed IDs:
|
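A variant of the snippets above that collects the failures instead of printing them may be easier to diff between parser versions. This is only a sketch: `failing_ids` is a hypothetical helper, and the demo parser is a stand-in so the example runs without the gem. With pubid-nist installed, the callable could wrap `Pubid::Nist::Identifier.parse` as used earlier in this thread.

```ruby
# Hypothetical helper: returns the sorted, unique IDs that `parser` rejects.
# `parser` is any callable that raises on failure.
def failing_ids(lines, parser)
  ids = lines.map(&:strip).reject(&:empty?).uniq.sort
  ids.each_with_object([]) do |id, failed|
    begin
      parser.call(id)
    rescue StandardError
      failed << id
    end
  end
end

# Stand-in parser for the demo: accepts only IDs that start with NIST or NBS.
demo_parser = lambda do |id|
  raise ArgumentError, "cannot parse #{id}" unless id.match?(/\A(NIST|NBS)\s/)

  id
end

puts failing_ids(["NIST SP 800-53\n", "bogus id\n", "bogus id\n", "\n"], demo_parser)
```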
allrecords.xml
@ronaldtse there are many identifiers in formats I didn't have to test with before. |
@mico yes I realized, but we do really need to parse all of them ASAP. Let's make the necessary modifications. Thanks. |
@mico the allrecords dataset contains more than 19k IDs. There are a few odd IDs like |
@mico can you please help expedite this issue so that @andrew2net can finally integrate pubid-nist into relaton-nist? Thanks! |
Currently we use two datasets: CSRC and Allrecords.
CSRC is prioritized, but it has only 719 documents. So if a document can't be found in the CSRC dataset, relaton-nist tries to find it in Allrecords.
To handle these datasets we need to parse all the document IDs that they contain. Two files with the IDs are attached. The pubs-export.txt file contains all the IDs from the CSRC dataset. The allrecords.txt file contains all the IDs from the Allrecords dataset.
The Allrecords dataset contains some incorrect IDs. I found these documents and their correct IDs:
pubs-export.txt
allrecords.txt
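The lookup order described above (the prioritized dataset first, then Allrecords) can be sketched as follows. Everything here is an assumption for illustration: the datasets are modeled as plain Hashes keyed by document ID, `find_document` is a hypothetical name, and the keys are placeholders rather than real identifiers. The actual relaton-nist lookup code is not shown in this issue.

```ruby
# Hypothetical stand-ins: each dataset is a Hash from document ID to a record.
def find_document(id, primary, allrecords)
  if primary.key?(id)
    { source: :primary, record: primary[id] }
  elsif allrecords.key?(id)
    { source: :allrecords, record: allrecords[id] }
  end
  # Falls through to nil when neither dataset knows the ID.
end

# Placeholder keys and records, not real NIST identifiers.
primary = { "ID-A" => "record A (prioritized dataset)" }
allrecords = {
  "ID-A" => "record A (Allrecords)",
  "ID-B" => "record B (Allrecords)",
}

p find_document("ID-A", primary, allrecords)
p find_document("ID-B", primary, allrecords)
p find_document("ID-C", primary, allrecords)
```

The fallback only helps if the same ID string is produced for both datasets, which is why every ID in both files needs to parse.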
This issue blocks relaton/relaton-nist#63.