attempt a record linkage between EIA utilities and CorpWatch's SEC filers #2337
I did an initial pass at record linkage with just 2005 data and here are some questions I'm left with. Pretty much zero tuning went into this model so results are, as expected, nowhere near perfect. I'm using

This first chart shows the match weight for each of the "comparison levels". For example, if the utility names of two records have a Levenshtein edit distance < 2, then it will be given a positive match weight of ~8 for the utility name comparison. If the city names are very different, it will be given a very negative match weight.

This next chart shows how records that the model deems to be a match are distributed amongst the comparison levels. Of note here is that of the matching records, ~97% have a utility name Levenshtein edit distance > 5. Additionally, of the matching records, 98% have an exact match on city.

I'll note here that the "match set" that the model is using here isn't actually its predictions. It's based on a set of conditions that I gave it that I think should be true for matching records. I estimated the recall with these rules to be 70%. These rules definitely need some tuning to make that "match set" better.

Takeaways:
With just the 2005 data, the model matches about 60% of SEC companies and 10% of EIA companies (assuming that the threshold for a correct match is a score of .5 or greater). There are far fewer SEC companies so this is sort of expected. I think with some pretty basic tuning, results will get much better. |
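For readers unfamiliar with match weights, here is a minimal sketch of how they are usually defined, assuming the model follows the standard Fellegi-Sunter formulation (the thread does not confirm the exact internals). The `m` and `u` values below are made-up illustrations, not numbers from this model. A comparison level's weight is the log2 ratio of how often that level occurs among true matches (`m`) versus non-matches (`u`), and a match probability of .5 corresponds to a total weight of 0.

```python
import math


def match_weight(m: float, u: float) -> float:
    """log2 Bayes factor for one comparison level.

    m: P(level observed | records are a true match)
    u: P(level observed | records are not a match)
    """
    return math.log2(m / u)


def match_probability(prior: float, weights: list[float]) -> float:
    """Combine a prior match probability with per-comparison weights."""
    prior_weight = math.log2(prior / (1 - prior))
    total = prior_weight + sum(weights)
    return 2**total / (1 + 2**total)


# Hypothetical values: a near-exact name match that is common among true
# matches (m) but very rare among non-matches (u) gets a weight close to 8.
name_weight = match_weight(m=0.512, u=0.002)

# Hypothetical values: very different cities are rare among true matches
# but common among non-matches, so the weight is strongly negative.
city_weight = match_weight(m=0.01, u=0.9)

print(round(name_weight, 2), round(city_weight, 2))
print(match_probability(prior=0.5, weights=[name_weight, city_weight]))
```

Because the weights add, one comparison with an extreme weight (city, in the charts above) can dominate the final score regardless of what the name comparison says.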
I'm not sure what you mean by this:
I may not have enough context here but IIRC, the CorpWatch database has a table that's all of the companies, covering both the parents and subsidiaries, and I thought that the record linkage problem we wanted to do was to find matches between that big SEC list of companies and the EIA list of utilities, without (initially) concerning ourselves about which role they're playing. With that linkage, we'd hopefully be able to join the SEC/CorpWatch company ID into the

Why did you choose 2005 to start with? I could imagine the addresses associated with utilities changing a fair bit from year to year, and it might be that they change at different times in the two datasets (or even that they report entirely different addresses to SEC vs. EIA -- incorporation location in Delaware vs. operational HQ?). Might it make sense to try doing the record linkage without considering year to start with? Just get a deduplicated list of all the companies in both datasets with all the names and addresses they've ever reported?

IIRC, we are not currently harvesting utilities that only show up as owners (see #1393), which means you'll need to compile your own all-encompassing list of potential utility names and addresses, based on both the

On the Zip codes, I would think that it would be common for one city to have many zip codes, and rare for one zip code to have multiple cities (probably only in really rural places), but I'm sure both of them happen. But I would expect each full address to have an almost perfectly 1:1 relationship with zip code (unless the zip code boundaries got moved, which happens occasionally I think). I bet there are some standard address normalization libraries out there we could use. Or I think USPS has an API for standardizing addresses too, but it might be annoying to use / rate-limited.

I agree with your intuition that the company names should be highly weighted. You could have lots of companies registered to the same (or almost the same) address, as often happens with PO Boxes in Delaware. But at the same time, I think a lot of parents and subsidiaries will have similar names. Like, look at how many utility names contain the word "Duke" in the EIA data.

I think the exact match + Levenshtein distance <= 2 criteria might be too stringent, or not the right way to do it. I'm sure there are misspellings, but I think differences will more frequently arise from things like Inc vs. Incorporated vs. Corp. (or lack of Inc at all), LLC vs. Limited or Ltd. Maybe some address-style normalization of these words makes sense? |
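A rough sketch of the kind of suffix normalization suggested above, using only the standard library. The `CANONICAL` and `LEGAL_SUFFIXES` tables and the `normalize_name` helper are hypothetical and deliberately tiny; they are not taken from the PUDL codebase or from any existing cleaning library, and a real mapping would need many more entries.

```python
import re

# Hypothetical mapping of long legal forms to a single short form.
CANONICAL = {
    "incorporated": "inc",
    "corporation": "corp",
    "company": "co",
    "limited": "ltd",
}
# Hypothetical set of trailing tokens that carry little identifying signal.
LEGAL_SUFFIXES = {"inc", "corp", "co", "ltd", "llc"}


def normalize_name(name: str, drop_suffixes: bool = False) -> str:
    """Lowercase, strip punctuation, and canonicalize legal-form words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    tokens = [CANONICAL.get(tok, tok) for tok in cleaned.split()]
    if drop_suffixes:
        # Remove trailing legal forms so "Duke Energy" matches "Duke Energy Inc".
        while tokens and tokens[-1] in LEGAL_SUFFIXES:
            tokens.pop()
    return " ".join(tokens)


print(normalize_name("Duke Energy Corporation"))
print(normalize_name("Duke Energy Corp."))
print(normalize_name("Acme Power, L.L.C.", drop_suffixes=True))
```

Whether to drop the legal form entirely or only canonicalize it is a judgment call: dropping it helps when one dataset omits "Inc" altogether, but it also makes distinct legal entities with the same base name look identical.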
@katie-lamb @cmgosnell @jrea-rmi Some work from Climate Trace that might be interesting in the context of this issue:
Seems like it could be more granular and complete, at least in the US:
|
That's probably a significant fraction of electricity sector emissions, but doesn't get us owned generation mwh and will miss a lot of utilities that we'd do analysis on. So unfortunately probably not a new complete resource we could use |
Got it. This makes sense. The names actually will match up then.
This was mostly because if you ignore year as a blocking rule, there are a ton of essentially duplicate records on both sides that I thought would potentially mess with matching. And I wanted instantaneously fast results lol. Agree that a good next step is going to be deduplicating both datasets. In the SEC data there’s a column for min year and max year, which I assume represent when there was a change in address, and for what years that address is applicable. There’s also a year column that falls within that range. I thought it might work to just match this year column with the EIA records, but agree that some potentially bad address reporting makes this not a great strategy.
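One way the deduplication step could look, as a sketch only: collapse the yearly records into one entry per company that keeps every name and address it has ever reported. The field names (`company_id`, `name`, `address`, `year`) and the sample rows are invented for illustration and do not reflect the actual SEC or EIA schemas.

```python
from collections import defaultdict


def collapse_years(records: list[dict]) -> dict:
    """Group yearly records by company ID, keeping all names/addresses seen."""
    companies = defaultdict(
        lambda: {"names": set(), "addresses": set(), "years": set()}
    )
    for rec in records:
        entry = companies[rec["company_id"]]
        # Light cleaning so trivially different spellings collapse together.
        entry["names"].add(" ".join(rec["name"].lower().split()))
        entry["addresses"].add(" ".join(rec["address"].lower().split()))
        entry["years"].add(rec["year"])
    return dict(companies)


# Invented example rows: the same company reported in two different years.
rows = [
    {"company_id": 1, "name": "Duke Energy Corp", "address": "526 S Church St", "year": 2005},
    {"company_id": 1, "name": "Duke  Energy Corp", "address": "550 S Tryon St", "year": 2006},
    {"company_id": 2, "name": "Acme Power LLC", "address": "PO Box 1", "year": 2005},
]
collapsed = collapse_years(rows)
print(len(collapsed), sorted(collapsed[1]["years"]))
```

Linking on the collapsed entries sidesteps the question of whether an address changed in the same year in both datasets, at the cost of comparing sets of names and addresses rather than single values.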
Ah interesting, I didn’t realize this.
Ya I agree that Levenshtein maybe isn’t the right metric here. I did some very basic string cleaning and normalization on the addresses, but definitely more normalization is an easy next step. |
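To make the concern about Levenshtein concrete, here is a small standard-library implementation of the edit distance (a textbook dynamic-programming version, not the one used by the linkage model). The example names are illustrative. A pure legal-form difference such as "Corp" vs. "Corporation" already costs 7 edits, far beyond a threshold of 2, even though the two strings almost certainly name the same company.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete char_a
                    curr[j - 1] + 1,     # insert char_b
                    prev[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        prev = curr
    return prev[-1]


# Same company, different legal-form spelling: 7 inserted characters.
print(levenshtein("duke energy corp", "duke energy corporation"))  # 7
```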
@katie-lamb if you haven't seen it, the OS-Climate CompanyNameCleaner has some useful tools for cleaning company names |
I made a few quick changes to the preprocessing and reran the model:
Results were slightly better, but still too much weight is put on Here are the charts comparing match weight. As you can see, still too much weight is put on city and not enough on utility name. |
Could we use TF-IDF to vectorize the utility names (or other text fields) and cosine similarity to compare them? Or is the menu of similarity metrics hard coded as part of Splink? |
Ya that's a good question and something I'm trying to figure out with the CCAI work. The hardcoded similarity metrics are Jaccard, Jaro-Winkler, and Levenshtein, but they're subclasses of a generic Distance Metric class that I can use to implement a cosine similarity. I'm not sure how it would actually perform with the model. I think it could work. Here's a recent issue about this in the |
The Inverse Document Frequency (IDF) part of TF-IDF deals with the presence of common (and thus not very important) words like "Limited" and "Corp" nicely. If the matching is being impacted more by word-level differences than myriad misspellings, maybe word-level tokenization rather than length-N substrings would be good enough, in which case it wouldn't blow up memory in the same way that it does in the FERC Plant ID assignments? |
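A minimal sketch of the word-level TF-IDF plus cosine similarity idea, written in plain Python so it stands alone. It is not wired into Splink or any other linkage library, the smoothed IDF formula is one common choice among several, and the company names are invented.

```python
import math
from collections import Counter


def tfidf_vectors(names: list[str]) -> list[dict]:
    """Word-level TF-IDF vectors (as sparse dicts) for a list of names."""
    docs = [name.lower().split() for name in names]
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF: words that appear in every name get the lowest weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented names: "corp" appears everywhere, so it contributes little.
names = ["duke energy corp", "duke energy ohio corp", "acme power corp", "zenith gas corp"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # shares "duke" and "energy": high
print(cosine(vecs[0], vecs[2]))  # shares only "corp": low
```

With word-level tokens the vocabulary stays roughly the size of the set of distinct words in the names, which is the reason to expect a smaller memory footprint than with length-N substrings.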
I'm closing this as it was a good first pass to get us to understand what a full integration and linkage would entail. Next steps coming soon & will link back to this PR |
This is a next step out of the preliminary investigation from #2225
In scope
Out-of Scope