Screening Private Tags #112
… and values from specified tags. Changes inspired by modifications within howardpchen's fork of pydicom/deid.
Added sections describing the usage of the new private tag screening functionality which was added to the replace_identifiers method.
What exactly do you mean by screening (for example how is it distinguished from the remove private flags option?) |
For our deidentification efforts, it is necessary for us to include private tags, yet still perform some level of deidentification on those included private tags. To accomplish this, the functionality was developed in this pull request to allow the caller to specify public tags that contain certain PHI values and use the values from those tags to compare against the values from the private tags. In other words, if the PatientName tag contains "SIMPSON^HOMER", scan and remove private tags that contain the values of SIMPSON or HOMER. I have also added functionality to enable the scanning and removal of private tags given a specified regex pattern - the intent with this was to be able to find and remove private tags containing values that match patterns which would lead us to believe that the value was a patient identifier. |
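The screening idea described above can be sketched in a few lines. This is a minimal illustration that uses plain dictionaries rather than pydicom datasets; `screen_private_tags`, its parameters, and the sample tag numbers are hypothetical and are not taken from the pull request.

```python
import re

def screen_private_tags(public, private, source_fields, patterns=()):
    """Return the private tags whose values appear to contain PHI.

    public: dict of keyword -> value (e.g. {"PatientName": "SIMPSON^HOMER"})
    private: dict of (group, element) -> value
    source_fields: public keywords whose values are treated as PHI
    patterns: regex strings that flag a value as a likely identifier
    """
    # Collect the PHI terms, splitting person names on the DICOM '^' separator.
    terms = set()
    for keyword in source_fields:
        value = str(public.get(keyword, ""))
        terms.update(part for part in value.split("^") if part)

    compiled = [re.compile(p) for p in patterns]
    flagged = []
    for tag, value in private.items():
        text = str(value)
        if any(t in text for t in terms) or any(c.search(text) for c in compiled):
            flagged.append(tag)
    return flagged


# Invented sample data, for illustration only.
public = {"PatientName": "SIMPSON^HOMER", "Modality": "CT"}
private = {
    (0x0011, 0x1010): "Scanned for HOMER SIMPSON",
    (0x0023, 0x1040): "helical, 120 kVp",
    (0x0031, 0x1101): "Q1234567",
}
flagged = screen_private_tags(public, private, ["PatientName"], patterns=[r"^Q\d+"])
```

Here the first private tag is flagged because it contains a name component, and the third because it matches the identifier-like pattern; the acquisition note in the second tag is left alone.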
Changed default on screen_values from [] to None.
The core of deid is having the ability to apply custom actions to particular header tags, and in this light, a private tag is no different than that, other than being private. I would be open to pursue an implementation that doesn't treat private tags as fundamentally different, as this leads to a lot of redundant code. Instead, the same functions / flow should be applied to the tags. Let me know if this makes sense. |
I think you would want to add the private tag keys to the list of fields, and then run the actions as previously done. Then the same actions would be applied to the tags as if they were any other field. The user could still blindly remove all of them, of course. I don't think the screen variables would need to be added, the default would be to not just screen header fields, but all fields (including private). |
I understand what you're saying, but I would also contend that private tags are fundamentally different. While public tags are named, standard elements, private tags are unnamed, vendor and device specific. If we were to add private tags to the list of fields and use recipe-defined actions to deidentify them, we would potentially need to have separate recipes for each individual make/model of devices used to acquire the images in an image set. |
Okay, so regardless of the recipe, then does this really come down to adding a custom set of additional fields to be parsed? (I am guessing you are running everything from within Python and not using a recipe?). The solution that we figure out should be able to support both use cases:
|
We're definitely using recipes. Recipe rules are still used for defining how to deidentify and replace values within the public tags. We just need a way to enable the caller to define public tags that contain values that we want to use to scan values of private tags to determine if they have PHI. Personally, I don't know how much benefit there would be to specifying specific private tag rules within a recipe. First, I think this would break one of the great things about the recipe file - it's human readable. Adding private tags into the recipe, you would be forced to specify rules by the tag number (0031, 1101). Second, the vendor/machine specific nature of these tags makes reuse almost impossible. I am currently using the new private tag screening logic as an addition to the baseline deidentification that is occurring as defined in the recipe.
screen_values.append(ScreenValue(type='value', tag='AccessionNumber', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientID', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientBirthDate', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='OtherPatientIDs', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientTelephoneNumbers', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientAddress', split=True, separator=' ', minvaluelen=4, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientName', split=True, separator='^', minvaluelen=4, pattern=None))
screen_values.append(ScreenValue(type='pattern', tag=None, split=False, separator=None, minvaluelen=None, pattern=r'^(Q\d+)|(\d{7,})'))
screen_values.append(ScreenValue(type='pattern', tag=None, split=False, separator=None, minvaluelen=None, pattern=r'.*\^+.*'))
cleaned_header = replace_identifiers(dicom_files=currentfile,
                                     ids=ids,
                                     deid=deidrecipe,
                                     overwrite=True,
                                     remove_private=False,
                                     output_folder=deiddir,
                                     screen_private=True,
                                     screen_values=screen_values)
As an aside - I'm having trouble tracking down the formatting that the CI job is failing (this is my first real python project, so I'm far from an expert in python standards and best practices). |
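For readers who have not opened the pull request, the following sketch shows one way the `ScreenValue` entries above could be interpreted. The field names mirror the keyword arguments in the snippet, but the definition itself and the `search_terms` helper are assumptions made for illustration, not the code in the PR. In particular, treating `minvaluelen` as an inclusive minimum is a guess.

```python
from collections import namedtuple

# Assumed shape, inferred from the keyword arguments used above.
ScreenValue = namedtuple(
    "ScreenValue", ["type", "tag", "split", "separator", "minvaluelen", "pattern"]
)

def search_terms(header, screen_value):
    """Derive the strings to look for from one 'value' entry (hypothetical helper)."""
    if screen_value.type != "value":
        return []
    raw = str(header.get(screen_value.tag, ""))
    parts = raw.split(screen_value.separator) if screen_value.split else [raw]
    minimum = screen_value.minvaluelen or 0
    # Drop empty or very short fragments, which would match far too broadly.
    return [p for p in parts if p and len(p) >= minimum]


# Invented sample header, for illustration only.
header = {"PatientName": "SIMPSON^HOMER^J", "PatientID": "12345"}
name_rule = ScreenValue("value", "PatientName", True, "^", 4, None)
id_rule = ScreenValue("value", "PatientID", False, None, 0, None)
name_terms = search_terms(header, name_rule)
id_terms = search_terms(header, id_rule)
```

The minimum length is what keeps a middle initial such as "J" from flagging every private tag that happens to contain that letter.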
Don't worry about formatting for now - it's likely just a mismatch for the version of black. I probably need to think this through - I have to be honest that the implementation is a bit messy - it should be taking advantage of functions that already exist, instead of adding redundancy, and a lot of nesting of for loops and if statements (it's hard to follow). Let me know if you have other ideas, I can keep thinking as well. |
I thought of another approach to this, and I’d like to get your perspective before proceeding down the path. I appreciate the time and thought that you've already put into this! Really, when looking at this problem there are two issues that need to be solved:
For determining which values to look for within the tags, I would add two new action types which could be used in recipe files. Unlike the rest of the rules in the recipe, these action types would not act on specific tags in the header, but would collect data to be used at a later point to identify additional tags for removal.
Next, in replace_identifiers I would still need to rescan tags after recipe actions have been evaluated and processed. This would allow me to interrogate the values of each tag and determine if the value in the tag matches one of the patterns defined as VALUEPATTERN or REGEXPATTERN. For now, for our purpose, I would like to limit this to removing tags that match the patterns. While I could see this being potentially extended to replacements or jitters, I’m not sure – and haven’t thought about how replacement and jitter would play with the subcomponents of fields we’d be using to identify the tags on which to act. |
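To step back a bit, the pattern of the filters is the following:
[ACTION] [SUBSET] [CUSTOMIZATION]
For example, here is how I jitter any header fields that end with "Date" based on the variable "jitter."
JITTER endswith:Date var:jitter
And the reason actions go first is because you can read the recipe left to right - and it makes sense! I want to jitter stuff that ends with date using a variable "jitter." Or I want to remove tags that start with Patient:
REMOVE startswith:patient
So what seems logical to me is to not add something like VALUEPATTERN or REGEXPATTERN - what kind of an action is that? But rather to use the same action tags, assume that private tags are included too, and then do something like:
REMOVE contains:Patient
Or if you were looking for Simpson, you would do something like:
REPLACE ALL func:look_for_simpson
and note that it's not "contains" that is actually used, it's regular expressions (re): https://github.com/pydicom/deid/blob/61a3619fcf8735f425acd6e0ddb3babc61d8dc94/deid/dicom/fields.py#L132
See https://pydicom.github.io/deid/user-docs/recipe-headers/ for more examples. If I'm not understanding you correctly, please provide a very simple case. E.g.:
- I have a header with fields and values X:a, Y:b...
- I want to... etc.
because what you are describing above could be accomplished with the current actions and filters and custom function, unless I'm misunderstanding something. |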
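As a rough illustration of how a subset expression such as `endswith:Date` or `contains:Patient` can select header fields with regular expressions, consider the sketch below. It is a simplified stand-in and not deid's actual implementation; `select_fields` is a hypothetical name, and the case-insensitive matching is an assumption.

```python
import re

def select_fields(expression, fields):
    """Select field names for 'ALL', a bare name, or 'startswith:/endswith:/contains:'."""
    if expression == "ALL":
        return list(fields)
    if ":" not in expression:
        # A bare name selects that field only, if it is present.
        return [f for f in fields if f == expression]
    kind, _, pattern = expression.partition(":")
    if kind == "startswith":
        regex = re.compile("^(%s)" % pattern, re.IGNORECASE)
    elif kind == "endswith":
        regex = re.compile("(%s)$" % pattern, re.IGNORECASE)
    elif kind == "contains":
        regex = re.compile(pattern, re.IGNORECASE)
    else:
        raise ValueError("unknown expander: %s" % kind)
    return [f for f in fields if regex.search(f)]


fields = ["PatientName", "PatientBirthDate", "StudyDate", "Modality"]
dates = select_fields("endswith:Date", fields)
patient = select_fields("startswith:patient", fields)
```

Because the part after the colon is compiled as a regular expression, a pattern like `contains:Name|ID` would also work in this sketch, which is the point being made about "contains" not being a plain substring test.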
Late to the discussion.
I have tried using the current system to address private tags before and
found shortcomings - where public tags can be addressed via recipe with a
name (either directly named or with "endswith:Date" etc), private tags have
no generally agreed upon names, so can't be reliably addressed with a
typical recipe. Also, I had trouble finding a way to manipulate private
tags because much of the existing code addresses public tags only.
For example, if a private tag is (0011, 0010) and holds patient name per
vendor decision, but (0023,0040) contains useful information you want to
keep, how might we target (0011, 0010) with a recipe while keeping (0023,
0040)?
The second point here is more technical... I think the recipes actually
ignore private tags in most cases. The deid.dicom.fields.get_fields and
deid.dicom.header.replace_identifiers functions rely on pydicom's
Dataset.dir() function, which returns only the list of public tags
(deid.dicom.header.get_identifiers() calls get_fields and has the same
problem). For this reason, AFAIK most functions of the package would touch
only public tags (except for remove_private_identifiers which calls
pydicom's Dataset.remove_private_tags() and works as expected).
…-hc
On Fri, Feb 28, 2020, 12:50 PM Vanessasaurus ***@***.***> wrote:
To step back a bit, the pattern of the filters is the following:
[ACTION] [SUBSET] [CUSTOMIZATION]
For example, here is how I jitter any header fields that end with "Date"
based on the variable "jitter."
JITTER endswith:Date var:jitter
And the reason actions go first is because you can read the recipe left to
right - and it makes sense! I want to jitter stuff that ends with date
using a variable "jitter." Or I want to remove tags that
start with Patient
REMOVE startswith:patient
So what seems logical to me is to not add something like VALUEPATTERN or
REGEXPATTERN- what kind of an action is that? But rather to use the same
action tags, assume that private tags are included too, and then do
something like:
REMOVE contains:Patient
Or if you were looking for Simpson, you would do something like:
REPLACE ALL func:look_for_simpson
and note that it's not "contains" that is actually used, it's regular
expressions (re)
https://github.com/pydicom/deid/blob/61a3619fcf8735f425acd6e0ddb3babc61d8dc94/deid/dicom/fields.py#L132
See https://pydicom.github.io/deid/user-docs/recipe-headers/ for more
examples.
If I'm not understanding you correctly, please provide a very simple case.
E.g.:
- I have a header with fields and values X:a, Y:b...
- I want to... etc.
because what you are describing above could be accomplished with the
current actions and filters and custom function, unless I'm
misunderstanding something.
|
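@howardpchen I think we would address this by simply (given that private tags aren't all flagged to be removed) adding them to the list of fields to be parsed over (at the beginning of the function, instead of just taking dir(dicom), we would add the private tags to that list).
I think you'd want to use the ALL identifier to check all (including the weird names), and then have the logic for blanking in a custom function - the function would replace whatever content is there based on what it finds. |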
Yes, but expand_field_expression in deid.fields, which captures the ALL
identifier in recipe, also uses Dataset.dir() and therefore processes only
public tags.
That's the whole problem :(
I spent a long time tracking down this problem last year...
See the implementation of the dir function in pydicom's Dataset
https://github.com/pydicom/pydicom/blob/master/pydicom/dataset.py
It identifies the tag names as a set and matches each to a dicom tag in the
file. This works when each DICOM tag has a unique name. This is true for
standard DICOM tags. However private tags don't just have weird names,
their names also don't have to be unique (I think most are just called
"Private Creator" - like if a DICOM file has 100 private tags each with
different values, all 100 can be called "Private Creator"). Basically to
manipulate priv tags we can't use names and have to go with the tag (i.e.
the group and numbers) directly.
I hope I'm explaining the situation better...
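A small sketch of why names cannot serve as keys for private elements, while (group, element) pairs can. It uses plain tuples instead of pydicom objects, and the element values are invented. The one DICOM fact it relies on is that private elements live in odd-numbered groups.

```python
# Each element: ((group, element), name, value). Sample data is invented.
elements = [
    ((0x0010, 0x0010), "PatientName", "SIMPSON^HOMER"),
    ((0x0011, 0x0010), "Private Creator", "VENDOR_A"),
    ((0x0011, 0x1010), "Private tag data", "SIMPSON^HOMER"),
    ((0x0023, 0x0010), "Private Creator", "VENDOR_B"),
]

def is_private(tag):
    """Private DICOM elements have an odd group number."""
    group, _ = tag
    return group % 2 == 1

# Keying by name silently collapses the two "Private Creator" elements.
by_name = {name: value for _, name, value in elements}
# Keying by tag keeps every element addressable.
by_tag = {tag: value for tag, _, value in elements}
private_tags = [tag for tag in by_tag if is_private(tag)]
```

With four elements in the header, the name-keyed dictionary ends up with three entries, so one private element can no longer be reached; the tag-keyed dictionary keeps all four.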
…On Fri, Feb 28, 2020 at 3:31 PM Vanessasaurus ***@***.***> wrote:
Also, I had trouble finding a way to manipulate private tags because much
of the existing code addresses public tags only.
@howardpchen <https://github.com/howardpchen> I think we would address
this by simply (given that private tags aren't all flagged to be removed)
adding them to the list of fields to be parsed over (at the beginning of
the function, instead of just taking dir(dicom), we would add the private
tags to that list.
private tags have no generally agreed upon names, so can't be reliably
addressed with a
typical recipe
I think you'd want to use the ALL identifier to check all (including the
weird names), and then have the logic for blanking in a custom function -
the function would replace whatever content is there based on what it finds.
|
@howardpchen why the sad face? You are speaking about this code like it's not changeable. There is never a problem that can't be fixed. For both of those cases, we would have a function to return all fields (with include_private as True or False, possibly) and then have it include private tags. Yes, private tags might take a little more work to get the group and numbers to be used as keys, but there is no reason we can't do that. |
Yes. Love that spirit!
Glad we got that sorted out!
Proposal by @wetzelj may or may not fit the design philosophy of deid, but
there exists a valid need to be able to manipulate private tags in a way
that the current deid package can be improved upon.
|
Agreed! If either of you can provide a dicom dataset with private fields and an example of the functionality that you want, I can give this a shot. I don't work in this domain anymore, but I'm happy to try and help. |
While I can't provide it today, I'd be happy to offer up a sample dataset. A lot of the discussion today has been about accessing the private tags, but there's another important piece to this - dynamically determining which private tags need to be acted on. Rather than forcing the builder of the recipe to write a rule: I'd like to do something that allows the recipe builder to specify a known tag (PatientName) and use the value to determine which other tags need to be removed. While I know it's not an "action" - this was my mindset with the REGEXPATTERN and VALUEPATTERN proposal. Sure, this could be accomplished with a function, but I'm also trying to enable this type of functionality without the need for a custom function. pydicom/deid is serving as the core for an application which wraps the functionality to expose it to semi-technical end users. While these users can understand and will build recipe files, breaking into python code to build custom functions is beyond their expertise. The discussion and exchange of ideas on this has been awesome! Thank you both! |
Hmm, so if you want to achieve this:
Couldn't you just dicom.get("PatientName") and then use that in a REPLACE filter, with a custom function (as I've shown previously?). The pattern matching is exactly what contains:[pattern] does.
We could consider adding a filter ANY, which would be like all, but instead provide a function to filter (and return true or false)
Where in the above, we would derive the name and then use it in the function is_name. |
I've attached a dump of a sample dicom header which has private tags included, and contains PHI within those private tags (obviously for uploading purposes, this has all been thoroughly deidentified, but tags and content consistency have been maintained).
I'd like to make sure that I understand your suggestion of using a custom function:
Assuming we have the recipe entry - and have added functionality for "ANY":
The corresponding is_name function would look something like this:
def is_name(self, dicom, field, value):
    currentvalue = dicom.get(value)
    name = dicom.get('PatientName')
    splitvalues = name.split('^')
    for phi in splitvalues:
        if len(phi) > 4 and phi in currentvalue:
            return True
    return False
With this approach, we'd need to define "is_*****" for every piece of phi that we'd potentially want to scan other tags for containing:
...or we could just group all of these together and have a single custom function for all phi components, which could get any finite list of phi components and scan for those values (is_phi). Is my understanding of your suggestion correct? If so, the reason this breaks under our use case is that no matter what we do with this type of a solution, the list of phi being interrogated must be defined within the custom function. We're trying to find a way to expose the list of elements to look for in tags without requiring our end users to write custom functions. |
You could write any custom function with multiple searches, or just keep them modular (it's up to you). There are a few bugs in your function - the value is already the currentvalue (it's passed to the function) and value is also the second argument. So you don't need to get it again. There is also no class here, so no need for self. It's simply a function that you add to the dictionary of some item you want cleaned / changed using it.
def is_name(dicom, value, field):
    name = dicom.get('PatientName')
    splitvalues = name.split('^')
    for phi in splitvalues:
        if len(phi) > 4 and phi in value:
            return True
    return False
Currently, the func works to return a value - so it fits to work with replace or add (so you could return an empty value, for example). To get this working with REMOVE, we would add the check for "func" here (it's currently part of the function parse_value which isn't used for remove). Then we would add the requirement that a function intended to be used with REMOVE must return True/False. Does that make sense? |
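To make the proposed contract concrete, here is a sketch of a REMOVE step driven by a function that returns True or False. The `remove_where` helper and the dictionary header are hypothetical; deid's real action handling is more involved, and the argument order simply follows the `is_name(dicom, value, field)` definition in this comment.

```python
def is_name(dicom, value, field):
    """Return True if `value` contains a component of the patient's name."""
    name = str(dicom.get("PatientName", ""))
    for phi in name.split("^"):
        if len(phi) > 4 and phi in str(value):
            return True
    return False

def remove_where(dicom, predicate):
    """Return a copy of `dicom` without the fields the predicate flags (hypothetical)."""
    # Decide against the untouched header first, so that removing one field
    # cannot change the answer for fields examined later.
    to_remove = [f for f, v in dicom.items() if predicate(dicom, v, f)]
    return {f: v for f, v in dicom.items() if f not in to_remove}


# Invented sample header, for illustration only.
header = {
    "PatientName": "SIMPSON^HOMER",
    "StudyDescription": "CT chest for SIMPSON",
    "Modality": "CT",
}
cleaned = remove_where(header, is_name)
```

Collecting the fields to remove before deleting anything is one way to keep the source field (PatientName here) readable for the whole scan, even though it is itself removed.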
Here is a simple implementation of what I'm describing above - #113 along with documentation. What I haven't added to this yet is to include all private tags (this would need further testing). |
Sorry for the bugs... In addition to using javascript syntax for split, I reversed the parameter order in the definition. The self-reference is just a detail of my implementation/use of pydicom/deid. I've wrapped the custom functions in a class.
# Incorrect definition:
# def is_name(self, dicom, field, value):
# Correct definition:
def is_name(self, dicom, value, field):
Regarding your other comments:
In the utils\actions.py parse_value function the "func:is_name" string is split and Given this, we really do need to get the value from by calling dicom.get(). Regardless I think there's an underlying issue with taking this approach for the value scanning. Assuming that we're processing a recipe action like Obviously, the error situation above could be worked around by changing the is_name function to avoid removing the fields it is using as value sources, but doing something like below turns the ALL keyword into "ALL-except what I'm using in this function"... which would be far less than ideal.
def is_name(dicom, value, field):
    currentvalue = str(dicom.get(field))
    if field != 'PatientName':
        name = dicom.get('PatientName')
        splitvalues = str(name).split('^')
        for phi in splitvalues:
            if len(phi) > 4 and phi in currentvalue:
                return True
    return False
Ultimately, I really would love to get this to a point where these rules for searching and scanning could be defined within a recipe rather than custom functions. It would be great if we could get to a solution where functionality traditionally requiring a custom function could be defined in a recipe (and as a result, easily exposed to the deid client). I want to make sure that I am respectful of your time... at any point if you feel that it would be best for me to just proceed in my own fork, don't hesitate to say so. |
Yes you are correct- value is the original value (see here). Apologies for my oversight, I never actually used it, and after all these years I forgot. I'll update the examples accordingly.
This would be an issue with any custom function + REMOVE regardless. This could originally be controlled by way of the order of actions listed in the deid recipe, and to some extent this would still be possible if you put all the REMOVE at the end. And I hope you see that even for other actions that change fields, this is still an issue (albeit a different issue). What you really want to do is remove (or perform an action) based on using some unknown text from a field. What we really need is something like this:
And the
but that's a little ugly. Do you have a suggestion that doesn't require a func? And either way, to get around needing some cache of values, we'd need some other cache of fields or similar. |
So for this example:
This could fit into the current
And it reads nicely, I want to remove all the fields that match that regular expression. Func (as implemented now) would still work nicely. I still have a hard time wrapping my head around value pattern (at least reading it fresh every time) but if I understand correctly, why not create a different kind of setting that can extract and set some custom field, e.g.,
To say "Set field "patient_name" to be whatever we find in PatientName, and then remove that for all subsequent tags. The assumption with
Thoughts? |
And note that we are discussing design for the PR here #113. I've taken a look at the sample dicom you sent me with private tags, and this will be addressed as another PR following this one to keep them in scope. It's a bit more complex because we are trying to use two different kinds of things (a string that indicates a field vs. a tag element) to do the same thing. I need to look at newer versions of Pydicom and see if there is different representation of the tags. Likely we'll just need to do a check and then handle the setting of any updated value slightly differently. |
Yep, I do see that this is a problem across the board for removals and change fields. After I get the core functionality working as desired, my plan is to build a recipe builder/validator which would highlight recipe rules that are potentially in conflict (and potentially be in the library). I haven't thought through this other than at a high level, as an example, it would highlight the conflicts if a recipe contained duplicate (additive) JITTERS or duplicate/conflicting REMOVE or REPLACE values. Obviously I have to put some additional thought into this. Looking at your proposed solutions, the Looking at our current use case following the above patterns, we'd build the following recipe (ignoring the split point for the moment):
My concern with this approach really comes down to performance. In the above sample, we would be scanning ALL fields 9 times... and this is a real-world sample, this is the scanning rules that need to be applied for one of our projects. In this project, we have 10,000 CT studies to be deidentified (at around 300-500 images per study). At 300 images/study, that's 27,000,000 scans of all fields! :) As I was thinking about this last night, I did come up with a potential solution that you may find acceptable... or maybe a little crazy...
Entries within a scandefinition section would be of the format
Once these scandefinition entries could then be used within the standard
With this type of an approach, I would have to add functionality to parse and persist the scandefinition rules within the DeidRecipe object, but then in replace_identifiers, before processing actions, the scandefinition rules could be evaluated based on the image and processed into a single consolidated regex pattern ex: At this point, the standard recipe header action to If you think this is a reasonable approach, I'll get started on an implementation in my fork. Personally, I like this better than any of my prior suggestions. |
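The consolidation step can be sketched as follows: collect the literal values from the chosen source fields, escape them, and join them with any raw patterns into one alternation, so that every field is tested once instead of once per rule. `build_scan_regex`, its `min_len` parameter, and the sample data are hypothetical; the scandefinition syntax itself is not reproduced here.

```python
import re

def build_scan_regex(header, source_fields, raw_patterns=(), min_len=4):
    """Combine header-derived values and raw regex patterns into one regex."""
    alternatives = []
    for field in source_fields:
        for part in str(header.get(field, "")).split("^"):
            if len(part) >= min_len:
                # Escape literals so characters like '.' are matched verbatim.
                alternatives.append(re.escape(part))
    # Wrap raw patterns in non-capturing groups to keep their own '|' local.
    alternatives.extend("(?:%s)" % p for p in raw_patterns)
    if not alternatives:
        return None
    return re.compile("|".join(alternatives))


# Invented sample header, for illustration only.
header = {
    "PatientName": "SIMPSON^HOMER",
    "PatientID": "0012345",
    "StudyDescription": "Follow-up for H. SIMPSON",
    "Modality": "CT",
}
scan = build_scan_regex(header, ["PatientName", "PatientID"], [r"Q\d+"])
hits = [f for f, v in header.items() if scan.search(str(v))]
```

Under the numbers quoted earlier in this comment (10,000 studies at roughly 300 images each, 9 rules), a single combined pattern per image would mean about 3,000,000 passes over the fields rather than 27,000,000, assuming the regex is rebuilt once per image.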
Thank you for sharing your detailed ideas, and I appreciate your enthusiasm to develop a solution! There are actually two separate things here that I think are being tangled into 1:
You are saying in your latest comment that you think the
To be honest, my first implementation of this library was heavily tied to another library for the school of medicine, so the implementation was written with that use in mind. However I think the library has more use across different communities (and not as much the SOM) so I'd like to work on a larger refactor that will address point 2. I need to think more about how to go about this - I think we would want to parse an entire recipe first, build up a dependency graph of sorts (that also caches original values that might be needed) and then do the replacements the most efficient way possible. Possibly, instead of parsing over every action and checking for every field, we first create an assignment of specific actions to specific fields, and then do a much more informed replacement. Figuring out how to account for order will be the hardest part. I'll do more thinking about this and get back to you, and please let me know if you have any thoughts about ways to go about this. |
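One possible shape for that refactor, sketched under heavy assumptions: parse the recipe into (action, field expression, value) tuples, resolve each action to concrete fields up front, and then apply the resulting plan in a single pass that reads only from a snapshot of the original values. All names here are hypothetical, only exact field names and `ALL` are supported, and "the last matching recipe line wins" is just one way to handle order.

```python
def build_plan(actions, header):
    """Map each field to the single action that will be applied to it."""
    plan = {}
    for action, expression, value in actions:
        targets = list(header) if expression == "ALL" else [expression]
        for field in targets:
            if field in header:
                # Later recipe lines override earlier ones (an assumption).
                plan[field] = (action, value)
    return plan

def apply_plan(plan, header):
    """Apply the plan in one pass, reading only from the original snapshot."""
    original = dict(header)  # cache of original values
    result = {}
    for field, value in original.items():
        action, new_value = plan.get(field, ("KEEP", None))
        if action == "REMOVE":
            continue
        result[field] = new_value if action == "REPLACE" else value
    return result


# Invented sample header and recipe lines, for illustration only.
header = {"PatientName": "SIMPSON^HOMER", "PatientID": "123", "Modality": "CT"}
actions = [
    ("REPLACE", "ALL", "redacted"),
    ("REMOVE", "PatientID", None),
    ("KEEP", "Modality", None),
]
result = apply_plan(build_plan(actions, header), header)
```

Each field is visited once in `apply_plan`, regardless of how many recipe lines mention it, which is the efficiency gain the comment is after; how to treat actions that should stack (such as repeated JITTER lines) is left open.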
I agree... when I came back into the office this morning and re-read, I didn't like it either. Really, I think I merged two purposes (scanning the tag values and identifying the targeted tag identifiers). For the sake of clarity, I'm going to try again with an example. Which I hope will also illustrate where the additional section could be beneficial. At the moment, I'm also specifically ignoring a potential wholesale refactor of the replacement. I need to do more studying of the current process to make informed comments... I'll continue to think on that point. Sample Header
For the sample project, the goal would be to remove patient identifiers, but there's also a need to replace any references to the scanner's vendor, make, and model. In the prior suggestion, I discussed a new section called Proposed Recipe (new sections)
The goal of these new sections would be to create lists of strings (processed tag values) that could be referenced by name in the header section. At this point, the tagvaluelist sections would simply be creating two lists of strings, but at runtime, these would be evaluated to image-specific lists of tags:
The existing header section pattern would remain unchanged ( Proposed Recipe (%header)
Option B:
Implementation/Processing Replacement Within replace identifiers, before processing actions, we would need to convert the tagvaluelists into specific tags to be acted on. This would need to occur before actions were applied to the dicom file, and given our sample would create the following tag lists using the tagvaluelists. The recipe option A and B that I listed above would just drive what we think is most clear in the recipe - would we want an implicit or explicit conversion from tagvaluelists to tag values. Regardless of the option chosen, the tagvaluelists could be converted to actual tags:
At this point, the standard recipe actions could be performed in pretty much exactly the same way as a rule like The Results
|
After talking with some of the others on my team, I'm going to go ahead and start to work on the recipe changes that I described above while thinking about how to ultimately process the rules. This recipe pattern will work very nicely for our use cases. Should this pull request be closed? The only real value from it is the conversation... |
I think we are definitely going in a good direction! I work full time so I can't always respond immediately. I'd like to first move forward with adding func to |
See #115 linked above. I think we are going in a good direction - a few hairballs to sort through but we're making progress! Thanks for your patience in giving me a reasonable amount of time to respond. |
It was all cleanup of my fork. The changes in PR #112 were initially in my master branch. I stashed them in a separate branch for reference and then reverted my master to match pydicom/master. BTW... no worries on the response time, I've actually thought all of your responses were fast! |
haha, okay great! I can usually put in time to respond thoughtfully at least within 24 hours. I have a lot of projects I'm working on at the same time (this is my normal state, it's not stressful or bad, just my routine) so I typically cycle through for each working day, and I might not get to something immediately. I just wanted to let you know that I try to be responsive, and to typically expect about this amount of time. |
Description
This pull request serves two purposes: