Address Cusips with more than 9 characters #81

bdcallen · 2020-04-14T09:01:11Z

We have different results when the CUSIPs over 9 digits. I think the solution here is to write the regular expressions to capture the full CUSIP and to ditch CUSIPs that are too long to be valid. Capturing part of a twelve-digit CUSIP is just a recipe for bad matches in my view.

We have some CUSIPs that are actually ISINs (these will appear as "bad CUSIPs" if we apply the idea in (4) above). I'm not sure we want to bother with these. Only if we had a number of valid ISINs that yielded CUSIP-CIK matches that we wouldn't otherwise get would be get into the business of detecting ISINs and converting them to CUSIPs.

This issue is in reference to these points, in #76. I think an alternative option would be to keep the full raw character string, and having a separate field which classifies whether it is a cusip or not. That would allow us the opportunity to see if the ISIN's contain the correct CUSIPs, if we so choose.

iangow · 2020-04-22T16:28:03Z

It seems we'd have to tweak the regular expressions to capture these. I'm not sure that this would be worth the effort, as we'd need to filter out a lot of rubbish that we'd be adding.

library(dplyr, warn.conflicts = FALSE)
library(DBI)

Sys.setenv(PGHOST = "10.101.13.99", PGDATABASE = "crsp")

pg <- dbConnect(RPostgres::Postgres())
rs <- dbExecute(pg, "SET search_path TO edgar")

cusip_cik <- tbl(pg, "cusip_cik")

cusip_cik %>% filter(nchar(cusip) > 9)
#> # Source:   lazy query [?? x 6]
#> # Database: postgres [[email protected]:5432/crsp]
#> # … with 6 variables: file_name <chr>, cusip <chr>, check_digit <int>,
#> #   cik <int>, company_name <chr>, formats <chr>

^{Created on 2020-04-22 by the reprex package (v0.3.0)}

bdcallen · 2020-04-22T17:01:44Z

@iangow Yes, I wrote the initial python code to only accept cusip candidates up to 9 characters, hence the result you have found in the code snippet in the post above. Tweaking the regular expressions would not be that difficult, in my opinion, we'd just have to put the upper limit on the number of characters matched for a 'cusip' up to 12, as these could correspond to an ISIN. Anything more than that, however, is clearly a mistake. I tend to agree with you, in that it would add work in filtering out rubbish. It is possible that we have missed some potential cik-cusip matches due to some filers quoting the ISIN number, which contains the cusip number if it has been derived from a security with a cusip number, which is why I made this issue. I'm not sure exactly how many matches we might have lost (I'm guessing it is rather a small number, though I'm not sure if it is small enough to make it insignificant).

iangow · 2020-04-22T17:14:59Z

Let's push this down the priority list. Focus instead on tweaks that can be evaluated with data we already have. For example, I think #82 and #77 should be higher on the list.

bdcallen · 2020-10-18T14:57:15Z

@iangow Is this a priority? I think it would be helpful to do a little cost-benefit analysis. In my opinion, going beyond 9 digits to capture 12 digit ISIN numbers, and then to derive 9-digit cusips from the ISINs is not worth it. It seems that doing so will just introduce more noise for little gain. If you agree with me, I think we close this

iangow · 2020-10-18T15:50:03Z

Just add defer label for now.

iangow assigned bdcallen Apr 22, 2020

iangow added the assigned to project label Apr 23, 2020

bdcallen added defer assigned to project and removed assigned to project labels Oct 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Address Cusips with more than 9 characters #81

Address Cusips with more than 9 characters #81

bdcallen commented Apr 14, 2020

iangow commented Apr 22, 2020

bdcallen commented Apr 22, 2020

iangow commented Apr 22, 2020

bdcallen commented Oct 18, 2020

iangow commented Oct 18, 2020

Address Cusips with more than 9 characters #81

Address Cusips with more than 9 characters #81

Comments

bdcallen commented Apr 14, 2020

iangow commented Apr 22, 2020

bdcallen commented Apr 22, 2020

iangow commented Apr 22, 2020

bdcallen commented Oct 18, 2020

iangow commented Oct 18, 2020