-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address Cusips with more than 9 characters #81
Comments
It seems we'd have to tweak the regular expressions to capture these. I'm not sure that this would be worth the effort, as we'd need to filter out a lot of rubbish that we'd be adding. library(dplyr, warn.conflicts = FALSE)
library(DBI)
Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")
pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")
cusip_cik <- tbl(pg, "cusip_cik")
cusip_cik %>% filter(nchar(cusip) > 9)
#> # Source: lazy query [?? x 6]
#> # Database: postgres [[email protected]:5432/crsp]
#> # … with 6 variables: file_name <chr>, cusip <chr>, check_digit <int>,
#> # cik <int>, company_name <chr>, formats <chr> Created on 2020-04-22 by the reprex package (v0.3.0) |
@iangow Yes, I wrote the initial python code to only accept cusip candidates up to 9 characters, hence the result you have found in the code snippet in the post above. Tweaking the regular expressions would not be that difficult, in my opinion, we'd just have to put the upper limit on the number of characters matched for a 'cusip' up to 12, as these could correspond to an ISIN. Anything more than that, however, is clearly a mistake. I tend to agree with you, in that it would add work in filtering out rubbish. It is possible that we have missed some potential cik-cusip matches due to some filers quoting the ISIN number, which contains the cusip number if it has been derived from a security with a cusip number, which is why I made this issue. I'm not sure exactly how many matches we might have lost (I'm guessing it is rather a small number, though I'm not sure if it is small enough to make it insignificant). |
@iangow Is this a priority? I think it would be helpful to do a little cost-benefit analysis. In my opinion, going beyond 9 digits to capture 12 digit ISIN numbers, and then to derive 9-digit cusips from the ISINs is not worth it. It seems that doing so will just introduce more noise for little gain. If you agree with me, I think we close this |
Just add |
@iangow
This issue is in reference to these points, in #76. I think an alternative option would be to keep the full raw character string, and having a separate field which classifies whether it is a cusip or not. That would allow us the opportunity to see if the ISIN's contain the correct CUSIPs, if we so choose.
The text was updated successfully, but these errors were encountered: