Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate integrating SEC 10-K EX-21 subsidiary owner info with PUDL #2225

Closed
zaneselvans opened this issue Jan 21, 2023 · 11 comments
Closed
Assignees
Labels
inframundo new-data Requests for integration of new data. rmi

Comments

@zaneselvans
Copy link
Member

zaneselvans commented Jan 21, 2023

Many utilities identified in FERC, EIA, and EPA data are subsidiaries owned by some larger utility holding company. Understanding these ownership relationships can be helpful in understanding the economic incentives of utilities, e.g. which electric utilities are intimately linked to natural gas companies through being owned by the same parent.

@MichaelTiemannOSC pointed me at CorpWatch which has scraped data describing these relationships from the SEC 10-K filings, where it shows up in Exhibit 21, aka EX-21. They make the data freely available for download in bulk or via a RESTful API. The 10-K is reported in XBRL, so it might be possible to download the bulk data directly from SEC and create our own database.

@zaneselvans zaneselvans added the new-data Requests for integration of new data. label Jan 21, 2023
@zaneselvans zaneselvans changed the title Investigate bringing SEC 10-K EX-21 subsidiary ownership information into PUDL Investigate integrating SEC 10-K EX-21 subsidiary owner info with PUDL Jan 21, 2023
@MichaelTiemannOSC
Copy link

Do let me know when you attempt to process Constellation Energy Corp's 2022 report from whatever sources you can find.
I'm looking at their 2022 Sustainability Report (https://www.constellationenergy.com/content/dam/constellationenergy/pdfs/Constellation-2022-Sustainability-Report.pdf), where I find both a claim of delivering over 1.6 dekatherms of natural gas in 2021 (page 21), as well as a Scope 3 category 11 emissions disclosure totaling 69,126 kt CO2e out of a total reported Scope 3 of 93,647 kt CO2e (page 59). The problem?

ureg('1.6e9 dekatherm').to('kt CO2e')
<Quantity(84653.60969816365, 'CO2e * kt')>

Either they know how to deliver gas that burns 22% more efficiently than EPA knows about, or ??? Will be very interested to see how this ties to other disclosures and/or mass/energy/volume identities.

@jdangerx jdangerx moved this to 🆕 New in Catalyst Megaproject Feb 7, 2023
@jdangerx jdangerx moved this from 🆕 New to 📋 Backlog in Catalyst Megaproject Feb 7, 2023
@cmgosnell cmgosnell self-assigned this Feb 13, 2023
@cmgosnell cmgosnell added the rmi label Feb 13, 2023
@cmgosnell cmgosnell moved this from 📋 Backlog to 🔖 Ready and Prioritized in Catalyst Megaproject Feb 13, 2023
@cmgosnell
Copy link
Member

cmgosnell commented Feb 17, 2023

Hokay. here is some learnings from a preliminary investigation:

the original PDF forms

From what i can tell, the 10k filer report all of their subsidiaries and the percentage ownership for each of those subsidiaries. seems pretty good!

Corpwatch Access

it's pretty simple! they have bulk downloads of both CSV and MySQL versions. I've only looked into the CSVs, but it seems like they both have the same structure. There is also api access! Which i assume would be nice for more one-off questions.

Columns shared with EIA

  • utility name
  • city
  • state
  • postal_code

This is actually a fair amount of information for a record linkage since location is so specific.

Data Coverage

It looks like some tables go back to 2001 but most go back to 2003 👍🏻

The one bummer in here is that it doesn't seem like the ownership % is reported in this corpwatch bulk download :-(

@zaneselvans
Copy link
Member Author

zaneselvans commented Feb 20, 2023

That is really too bad about the percentage ownership not being available. 😢

Looking at the CorpWatch API GitHub repo, it seems like most of the codebase is like 13+ years old, and written in Perl. I wonder how far automated PDF parsing has come since then, and how hard it would be to re-write this system to use e.g. Layout Parser (GitHub repo) which @TrentonBush and @katie-lamb have been talking about for that other project. Especially if there's a well defined document structure that's shared across all of the EX-21 forms?

Edit: of course there is no standard document layout. That would be too easy. A few examples:

@zaneselvans
Copy link
Member Author

There's also TabulaPDF, but it hasn't made a release since 2018.

@zaneselvans
Copy link
Member Author

The Corpwatch DB looks like it tracks all the companies -- both parents and subsidiaries -- and counts up how many parents each subsidiary has, and how many subsidiaries each parent has, so even without the actual percentages there's some measure of how concentrated ownership is. It might be that in most cases the utility subsidiaries are wholly owned, so you could assume it's just 100%.

Big caveats on the data quality though -- it might be that misspellings or name changes mean that the counts of parents / subsidiaries are off, since they aren't using unique universal IDs.

Depending on what fraction of these relationships are simple for the electric utilities, it might be tractable to try and fill in the missing ones by hand. Or just the number of owners of a subsidiary could be taken as a gross estimate of how distributed the ownership is (assume each of 2 owners owns 50%, each of 3 owners owns 33% etc.) Wouldn't be great, but would be a lot easier than trying to parse out the wide variety of tables with nesting relationships that respondents are using to report this information.

@zaneselvans
Copy link
Member Author

@jrea-rmi do you want to talk about what we might do next?

@MichaelTiemannOSC
Copy link

FWIW, there are PRs for Pint (hgrecco/pint#1615) and a Pint-Pandas branch (https://github.com/MichaelTiemannOSC/pint-pandas/tree/ducks-unlimited) that make it possible to use uncertainties, so you can make them 50/50 +/- 50%.

@jrea-rmi
Copy link
Collaborator

In our reverse engineering of ERM with EIA860 ownership information, we see that ~90% of power plant owners are wholly owned by a single parent. About a third of the multiple owner cases are 50-50 ownership.

image

utility_id_eia_to_parent_name.csv

In our use of these relationships, we do need the fraction owned so we can calculate the amount of generation and emissions attributable to a ticker. 90% is pretty good, but I definitely don't want to go backwards with data quality from what we already have.

The main benefits I see in CorpWatch are:

  • time-dependence, whereas ours is for a single point in time, which allows us to backtest our calculations and would make manual maintenance a much lower burden in the future
  • comprehensiveness (if that's true, after record linkage, which I presume will have at least a few % error or non-matches) because our reverse engineering requires some manual review of conflicting cases, and also only covers EIA860 but not EIA861 entities

Overall at the moment, it'd be amazing to have a source I'm not convinced that the data quality or coverage will be good enough that we'll use it. It's hard to guess what that will be until record linkages have been attempted. So if you're feeling optimistic then I'm interested in seeing what comes of your attempt, but if not then maybe this stays on the backburner...

@jrea-rmi
Copy link
Collaborator

A summary comment: I'm very curious, but the lack of % ownership is an issue and I'd want to understand the coverage after record linkage better (measured by the fraction of capacity/generation/emissions covered) and if there are gaps in coverage, would Catalyst be willing to maintain overrides/additions to sub-parent ownership information?

@zaneselvans zaneselvans moved this from 🔖 Backlog to 🚧 In progress in Catalyst Megaproject Feb 23, 2023
@zaneselvans
Copy link
Member Author

If the data you currently have indicates that 90% of the owners that show up in EIA have a single parent, and we can confirm that the Corpwatch data agrees (based on linking SEC companies to EIA companies), it seems like checking the remaining 10% with multiple owners manually might be tractable. It would be several hundred records that need to be updated once a year right?

So it seems like the potential weak point is the record linkage between the EIA-860 utilities and the SEC companies. The location data is an extremely specific discriminator, and so potentially very helpful in doing the linkage, but it seems like it's probably a legal address (HQ, incorporation address of record, mailing address, etc), rather than a permanent physical location like we have with the plants, so we might get a significant number of false negatives -- where different addresses have been reported to SEC vs. EIA for some reason. Many record linkage systems are designed with deduplicating addresses in mind, so they should work well when the reported addresses represent the same place even if they're not totally identical. I don't think we'll know to what extent the addresses (or names) being reported to the two agencies are just flat out different without trying to do the linkage. But @katie-lamb and @cmgosnell have gotten good at doing initial drafts of the record linkages.

@jrea-rmi
Copy link
Collaborator

yes, around 500 records to check per year

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
inframundo new-data Requests for integration of new data. rmi
Projects
Archived in project
Development

No branches or pull requests

5 participants