Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address Cusips with more than 9 characters #81

Open
bdcallen opened this issue Apr 14, 2020 · 5 comments
Open

Address Cusips with more than 9 characters #81

bdcallen opened this issue Apr 14, 2020 · 5 comments
Assignees
Labels

Comments

@bdcallen
Copy link
Contributor

@iangow

We have different results when the CUSIPs over 9 digits. I think the solution here is to write the regular expressions to capture the full CUSIP and to ditch CUSIPs that are too long to be valid. Capturing part of a twelve-digit CUSIP is just a recipe for bad matches in my view.

We have some CUSIPs that are actually ISINs (these will appear as "bad CUSIPs" if we apply the idea in (4) above). I'm not sure we want to bother with these. Only if we had a number of valid ISINs that yielded CUSIP-CIK matches that we wouldn't otherwise get would be get into the business of detecting ISINs and converting them to CUSIPs.

This issue is in reference to these points, in #76. I think an alternative option would be to keep the full raw character string, and having a separate field which classifies whether it is a cusip or not. That would allow us the opportunity to see if the ISIN's contain the correct CUSIPs, if we so choose.

@iangow
Copy link
Owner

iangow commented Apr 22, 2020

It seems we'd have to tweak the regular expressions to capture these. I'm not sure that this would be worth the effort, as we'd need to filter out a lot of rubbish that we'd be adding.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik")

cusip_cik %>% filter(nchar(cusip) > 9)
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [[email protected]:5432/crsp]
#> # … with 6 variables: file_name <chr>, cusip <chr>, check_digit <int>,
#> #   cik <int>, company_name <chr>, formats <chr>

Created on 2020-04-22 by the reprex package (v0.3.0)

@bdcallen
Copy link
Contributor Author

@iangow Yes, I wrote the initial python code to only accept cusip candidates up to 9 characters, hence the result you have found in the code snippet in the post above. Tweaking the regular expressions would not be that difficult, in my opinion, we'd just have to put the upper limit on the number of characters matched for a 'cusip' up to 12, as these could correspond to an ISIN. Anything more than that, however, is clearly a mistake. I tend to agree with you, in that it would add work in filtering out rubbish. It is possible that we have missed some potential cik-cusip matches due to some filers quoting the ISIN number, which contains the cusip number if it has been derived from a security with a cusip number, which is why I made this issue. I'm not sure exactly how many matches we might have lost (I'm guessing it is rather a small number, though I'm not sure if it is small enough to make it insignificant).

@iangow
Copy link
Owner

iangow commented Apr 22, 2020

Let's push this down the priority list. Focus instead on tweaks that can be evaluated with data we already have. For example, I think #82 and #77 should be higher on the list.

@bdcallen
Copy link
Contributor Author

@iangow Is this a priority? I think it would be helpful to do a little cost-benefit analysis. In my opinion, going beyond 9 digits to capture 12 digit ISIN numbers, and then to derive 9-digit cusips from the ISINs is not worth it. It seems that doing so will just introduce more noise for little gain. If you agree with me, I think we close this

@iangow
Copy link
Owner

iangow commented Oct 18, 2020

Just add defer label for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants