Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database schema #6

Open
ChrisBeeley opened this issue Mar 1, 2022 · 16 comments
Open

Database schema #6

ChrisBeeley opened this issue Mar 1, 2022 · 16 comments

Comments

@ChrisBeeley
Copy link

When sharing analytic code it's important to establish the extent to which it is safe or desirable to share database schema (raw SQL statements, column names, DB and table names...)

I have been told that we cannot share material of this sort because a) sharing DB and table names poses a security risk b) database vendors hold copyright of schema and names, although these objections appear to be contradicted by other government guidance

I can't find this addressed in the policy or any of the linked documents. It would be very useful to have this explicitly addressed in this guidance

@otlah
Copy link
Contributor

otlah commented Mar 1, 2022

Absolutely, it's not addressed here (yet), and it would be useful to include that kind of detail. I fully agree with the CDDO guidance but I'll leave this issue open while we investigate the copyright concern further.

@connor1q
Copy link

connor1q commented Mar 8, 2022

Hi @ChrisBeeley,

Do you think we could tackle some of those challenges head on with a view to debunking them? I've encountered similar pushback where the risk being highlighted is scary enough to block progress but also vague enough that you can't make sensible decisions about risk.

Could we trace back to the source of the objection around table and column names and try to pin that down (within scope of NHS)?

@ChrisBeeley
Copy link
Author

Hi @ChrisBeeley,

Do you think we could tackle some of those challenges head on with a view to debunking them? I've encountered similar pushback where the risk being highlighted is scary enough to block progress but also vague enough that you can't make sensible decisions about risk.

Could we trace back to the source of the objection around table and column names and try to pin that down (within scope of NHS)?

Absolutely. A definitive statement on security seems relatively straightforward and I would hope that it is added to this guidance in due course.

As for the copyright issue, somebody needs to confront the vendors about this and get them to make a statement on it one way or the other. I tried and was met with a wall of silence. It's a simple question with a simple answer and the vendors need to play fair and answer it

@otlah
Copy link
Contributor

otlah commented Mar 9, 2022

Hi folks, if you'd like to work together to propose a security statement on this topic I can check it through and find the right space for it next week?

@connor1q
Copy link

connor1q commented Mar 9, 2022

Hi @otlah, @ChrisBeeley

I'll try to get hold of someone from security who can give a steer.

While table names should never be risky, I could imagine a situation where the CNAME address for a database might be risky. Security tends to take the onion philosophy - so the more protection the better. We just need to be able to identify a sensible balance of risk versus value in sharing code.

For that type of environment configuration though (database address), our open code guidance should explain how to avoid storing those things in open code. E.g. using a secrets manager or using environment variables. That type of thing would go along other code security practices like running githooks to avoid credentials being pushed to git and using .gitignore files.

The value of open code comes from seeing the business logic - not the database details. So it is no loss to mask that info.

@ChrisBeeley
Copy link
Author

Indeed, yes, I have no problem with running odbcConnect(Sys.getenv("DB_NAME")) or whatever it is it's column names that are problematic. Short of renaming the entire table they can pop up throughout the DB and analytic code. It's hard to believe that there would be any problems caused by my publishing time_from_referral = df$date_seen - df$date_referred (assuming they were real column names) but we have mountains of unpublished code that looks just like that

@connor1q
Copy link

connor1q commented Apr 7, 2022

Hi @ChrisBeeley @otlah
I've not been able to get hold of a security person for this unfortunately.

I have seen that the Goldacre report is out and sets out a very strong expectation that code should be open: https://www.gov.uk/government/publications/better-broader-safer-using-health-data-for-research-and-analysis/better-broader-safer-using-health-data-for-research-and-analysis

@ChrisBeeley
Copy link
Author

But is itself frustratingly short on detail

@connor1q
Copy link

connor1q commented Apr 7, 2022

That's true - but probably expected for a report like this. The report does waggle its eyebrows suggestively at the right organisations to get involved in setting out the specifics.

E.g. governance people
image

and the ICO
image

Perhaps we could convene a wider group to focus on this draft policy and try to elaborate on some of the missing pieces?

@ChrisBeeley
Copy link
Author

Indeed, yes. Not a criticism of the report, which I think is great. Just saying we need a different approach here. A technocratic, this is the answer, tablets of stone, job done type thing.

100% up for convening a "detail" group, you know where to find me

@wbryant
Copy link

wbryant commented Apr 8, 2022

I'd also be more than happy to contribute to a group like this.

@otlah
Copy link
Contributor

otlah commented Apr 12, 2022

Hi folks, we've still got a couple of major groups to get feedback from (including in UK research funding), but once that feedback's in we're intending to gather a list of issues that have to be solved before we can publish a version 1.0, and a list that we can take longer over to unstick. I'm stoked to have a bunch of volunteers to help populate those lists!

@sebbacon
Copy link

In OpenSAFELY, we got permission from our two main vendors (EMIS and TPP) to publish their schemas (explicitly, and/or implicitly via code). So there's a good precedent, at least.

@susheel
Copy link
Collaborator

susheel commented May 26, 2022

At HDR UK we have technical metadata (database schemas) for 472 out of 766 datasets (61%) that we have been publishing since 2020 on the Innovation Gateway - https://web.www.healthdatagateway.org/search?search=&technicaldetails=Contains+Technical+Metadata&tab=Datasets

@ChrisBeeley
Copy link
Author

ChrisBeeley commented May 26, 2022

In OpenSAFELY, we got permission from our two main vendors (EMIS and TPP) to publish their schemas (explicitly, and/or implicitly via code). So there's a good precedent, at least.

My Trust does use a TPP database, so that might be an angle for me to try @sebbacon . Don't suppose the permission is published anywhere, or generalisable, or anything I can use to try to push things on where I am?

@ChrisBeeley
Copy link
Author

At HDR UK we have technical metadata (database schemas) for 472 out of 766 datasets (61%) that we have been publishing since 2020 on the Innovation Gateway - https://web.www.healthdatagateway.org/search?search=&technicaldetails=Contains+Technical+Metadata&tab=Datasets

This is amazing. I had no idea this existed. I need to go and read this carefully. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants