Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Screening Private Tags #112

Closed
wants to merge 9 commits into from
Closed

Screening Private Tags #112

wants to merge 9 commits into from

Conversation

wetzelj
Copy link
Contributor

@wetzelj wetzelj commented Feb 27, 2020

Description

This pull request serves two purposes:

  1. The original code in header.py always removes all private tags, regardless of the flag. This change ensures that private tags are included when the remove_private option is false (this bug was initially identified by @howardpchen).
  2. Added functionality inspired by @howardpchen's fork to enable screening of private tag values based on rules passed in to the replace_identifiers method.

Related issues: # (issue)
None

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • My code follows the style guidelines of this project

… and values from specified tags. Changes inspired by modifications within howardpchen's fork of pydicom/deid.
Added sections describing the usage of the new private tag screening functionality which was added to the replace_identifiers method.
@vsoch
Copy link
Member

vsoch commented Feb 27, 2020

What exactly do you mean by screening (for example how is it distinguished from the remove private flags option?)

@wetzelj
Copy link
Contributor Author

wetzelj commented Feb 27, 2020

For our deidentification efforts, it is necessary for us to include private tags, yet still perform some level of deidentification on those included private tags. To accomplish this, the functionality was developed in this pull request to allow the caller to specify public tags that contain certain phi values and use the values from those tags to compare against the values from the private tags. In other words, if the PatientName tag contains "SIMPSON^HOMER", scan and remove private tags that contain the values of SIMPSON or HOMER. I have also added functionality to enable the scanning and removal of private tags given a specified regex pattern - the intent with this was to be able to find an remove private tags containing values that match patterns which would lead us to believe that the value was a patient identifier.

Changed default on screen_values  from [] to None.
@vsoch
Copy link
Member

vsoch commented Feb 27, 2020

The core of deid is having the ability to apply custom actions to particular header tags, and in this light, a private tag is no different than that, other than being private. I would be open to pursue an implementation that doesn't treat private tags as fundamentally different, as this leads to a lot of redundant code. Instead, the same functions / flow should be applied to the tags. Let me know if this makes sense.

@vsoch
Copy link
Member

vsoch commented Feb 27, 2020

I think you would want to add the private tag keys to the list of fields, and then run the actions as previously done. Then the same actions would be applied to the tags as if they were any other field. The user could still blindly remove all of them, of course. I don't think the screen variables would need to be added, the default would be to not just screen header fields, but all fields (including private).

@wetzelj
Copy link
Contributor Author

wetzelj commented Feb 27, 2020

I understand what you're saying, but I would also contend that private tags are fundamentally different. While public tags are named, standard elements, private tags are unnamed, vendor and device specific. If we were to add private tags to the list of fields and use recipe-defined actions to deidentify them, we would potentially need to have separate recipes for each individual make/model of devices used to acquire the images in an image set.

@vsoch
Copy link
Member

vsoch commented Feb 27, 2020

Okay, so regardless of the recipe, then does this really come down to adding a custom set of additional fields to be parsed? (I am guessing you are running everything from within Python and not using a recipe?). The solution that we figure out should be able to support both use cases:

  • someone writing a recipe and having private tags included in already existing filters
  • not using a recipe and using python, and providing those same extra fields to be parsed. The filters would still need to be provided - how are you currently handling this without using a recipe?

@wetzelj
Copy link
Contributor Author

wetzelj commented Feb 27, 2020

We're definitely using recipes. Recipe rules are still used for defining how to deidentify and replace values within the public tags.

We just need a way to enable the caller to define public tags that contain values that we want to use to scan values of private tags to determine if they have PHI.

Personally, I don't know how much benefit there would be to specifying specific private tag rules within a recipe. First, I think this would break one of the great things about the recipe file - it's human readable. Adding private tags into the recipe, you would be forced to specify rules by the tag number (0031, 1101). Second, the vendor/machine specific nature of these tags makes reuse almost impossible.

I am currently using the new private tag screening logic as an addition to the baseline deidentification that is occurring as defined in the recipe.

screen_values.append(ScreenValue(type='value', tag='AccessionNumber', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientID', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientBirthDate', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='OtherPatientIDs', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientTelephoneNumbers', split=False, separator=None, minvaluelen=0, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientAddress', split=True, separator=' ', minvaluelen=4, pattern=None))
screen_values.append(ScreenValue(type='value', tag='PatientName', split=True, separator='^', minvaluelen=4, pattern=None))
screen_values.append(ScreenValue(type='pattern', tag=None, split=False, separator=None, minvaluelen=None, pattern=r'^(Q\d+)|(\d{7,})'))
screen_values.append(ScreenValue(type='pattern', tag=None, split=False, separator=None, minvaluelen=None, pattern=r'.*\^+.*'))

cleaned_header = replace_identifiers(dicom_files=currentfile,
   ids=ids,
   deid=deidrecipe,
   overwrite=True,
   remove_private=False,
   output_folder=deiddir,
   screen_private=True,
   screen_values=screen_values)

As an aside - I'm having trouble tracking down the formatting that the CI job is failing (this is my first real python project, so I'm far from an expert in python standards and best practices).

@vsoch
Copy link
Member

vsoch commented Feb 27, 2020

Don't worry about formatting for now - it's likely just a mismatch for the version of black.

I probably need to think this through - I have to be honest that the implementation is a bit messy - it should be taking advantage of functions that already exist, instead of adding redundancy, and a lot of nesting of for loops and if statements (it's hard to follow). Let me know if you have other ideas, I can keep thinking as well.

@wetzelj
Copy link
Contributor Author

wetzelj commented Feb 28, 2020

I thought of another approach to this, and I’d like to get your perspective before proceeding down the path. I appreciate the time and thought that you've already put into this!

Really, when looking at this problem there are two issues that need to be solved:

  1. Determining the values to look for within the tags.
  2. Identifying the tags to be removed based upon what’s currently in the value
    a. Really, this could apply to both private tags and public tags. As an example, based on item 1, we’ve identified that the word SIMPSON is PHI. Standard recipe rules try to remove all PHI based on tag names (PatientName). But now, we want to scan all tag values to ensure we’re capturing an inappropriate patient name that was placed in the “Window Center & Width Explanation” tag.

For determining which values to look for within the tags, I would add two new action types which could be used in recipe files, unlike the rest of the rules in the recipe, these action types would not act on specific tags in the header, but would collect data to be used at a later point to identify additional tags for removal.

  • VALUEPATTERN – This would indicate that the value in the tag should be added to a dictionary of PHI values. Optionally, a function could be included on this action type. If a function was not specified, the entire string within the tag value would be added to the dictionary of PHI. If the function was specified, it would be used to process the value from the tag and add a list of values to the PHI dictionary.
    Examples:
VALUEPATTERN PatientName func:value_handler
VALUEPATTERN PatientID
VALUEPATTERN OtherPatientIDs
  • REGEXPATTERN – This would simply add a regex pattern to a list of regex patterns which should be used to identify PHI in values. This would not be based on any values/tags in the dicom header, but would simply be a pattern to match against the tag values.
    Examples:
REGEXPATTERN '.*\^+.*'
REGEXPATTERN '^(M\d+)|(\d{7,})'

Next, in replace_identifiers I would still need to rescan tags after recipe actions have been evaluated and processed. This would allow me to interrogate the values of each tag and determine if the value in the tag matches one of the patterns defined as VALUEPATTERN or REGEXPATTERN. For now, for our purpose, I would like to limit this to removing tags that match the patterns. While I could see this being potentially extended to replacements or jitters, I’m not sure – and haven’t thought about how replacement and jitter would play with the subcomponents of fields we’d be using to identify the tags on which to act.

@vsoch
Copy link
Member

vsoch commented Feb 28, 2020

To step back a bit, the pattern of the fitlers is the following:

[ACTION] [SUBSET] [CUSTOMIZATION]

For example, here is how I jitter any header fields that ends with "Date" based on the variable "jitter."

JITTER endswith:Date var:jitter`

And the reason actions go first is because you can read the recipe left to right - and it makes sense! I want to jitter stuff that ends with date using a variable "jitter." Or I want to remove tags that
start with Patient

REMOVE startswith:patient

So what seems logical to me is to not add something like VALUEPATTERN or REGEXPATTERN- what kind of an action is that? But rather to use the same action tags, assume that private tags are included too, and then do something like:

REMOVE contains:Patient

Or if you were looking for Simpson, you would do something like:

REPLACE ALL func:look_for_simpson

and note that it's not "contains" that is actually used, it's regular expressions (re)

elif expander.lower() == "contains":

See https://pydicom.github.io/deid/user-docs/recipe-headers/ for more examples.

If I'm not understanding you correctly, please provide a very simple case. E.g.:

  • I have a header with fields and values X:a, Y:b...
  • I want to... etc.

because what you are describing above could be accomplished with the current actions and filters and custom function, unless I'm misunderstanding something.

@howardpchen
Copy link

howardpchen commented Feb 28, 2020 via email

@vsoch
Copy link
Member

vsoch commented Feb 28, 2020

Also, I had trouble finding a way to manipulate private tags because much of the existing code addresses public tags only.

@howardpchen I think we would address this by simply (given that private tags aren't all flagged to be removed) adding them to the list of fields to be parsed over (at the beginning of the function, instead of just taking dir(dicom), we would add the private tags to that list.

private tags have no generally agreed upon names, so can't be reliably addressed with a
typical recipe

I think you'd want to use the ALL identifier to check all (including the weird names), and then have the logic for blanking in a custom function - the function would replace whatever content is there based on what it finds. Also the "contains" tag would accept a regular expression, which is possible to do even if you don't have an agreed upon name (but instead a pattern).

@howardpchen
Copy link

howardpchen commented Feb 28, 2020 via email

@vsoch
Copy link
Member

vsoch commented Feb 28, 2020

@howardpchen why the sad face? You are speaking about this code like it's not changeable. There is never a problem that can't be fixed, for both of those cases, we would have a field to return all fields (with include_private as True or False, possibly) and then have it include private tags.

Yes, private tags might take a little more work to get the group and numbers to be used as keys, but there is no reason we can't do that.

@howardpchen
Copy link

howardpchen commented Feb 28, 2020 via email

@vsoch
Copy link
Member

vsoch commented Feb 28, 2020

Agreed! If either of you can provide a dicom dataset with private fields and an example of the functionality that you want, I can give this a shot. I don't work in this domain anymore, but I'm happy to try and help.

@wetzelj
Copy link
Contributor Author

wetzelj commented Feb 28, 2020

While I can't provide it today, I'd be happy to offer up a sample dataset.

A lot of the discussion today has been about accessing the private tags, but there's another important piece to this - dynamically determining which private tags need to be acted on. Rather than forcing the builder of the recipe to write a rule:
REMOVE (0013, 0019) - because it contains the patient's name,

I'd like to do something that allows the recipe builder to specify a known tag (PatientName) and use the value to determine which other tags need to be removed. While I know it's not an "action" - this was my mindset with the REGEXPATTERN and VALUEPATTERN proposal.

Sure, this could be accomplished with a function, but I'm also trying to enable this type of functionality without the need for a custom function. pydicom/deid is serving as the core for an application which wraps the functionality to expose it to semi-technical end users. While these users can understand and will build recipe files, breaking into python code to build custom functions is beyond their expertise.

The discussion and exchange of ideas on this has been awesome! Thank you both!

@vsoch
Copy link
Member

vsoch commented Feb 28, 2020

Hmm, so if you want to achieve this:

I'd like to do something that allows the recipe builder to specify a known tag (PatientName) and use the value to determine which other tags need to be removed.

Couldn't you just dicom.get("PatientName") and then use that in a REPLACE filter, with a custom function (as I've shown previously?). The pattern matching is exactly what contains:[pattern] does.
To address not knowing these:

REMOVE (0013, 0019) - because it contains the patient's name,

We could consider adding a filter ANY, which would be like all, but instead provide a function to filter (and return true or false)

REMOVE ANY func:is_name

Where in the above, we would derive the name and then use it in the function is_name.

@wetzelj
Copy link
Contributor Author

wetzelj commented Mar 2, 2020

I've attached a dump of a sample dicom header which has private tags included, and contains PHI within those private tags (obviously for uploading purposes, this has all been thoroughly deidentified, but tags and content consistency have been maintained).
Please note the following:

  • Private tags are listed with "names" of PrivateCreator1xxx or Unknown Tag.
  • (0031,1020) contains a master patient index identifier.
  • (0033,1013) contains the patient's name
  • (0031,1101) contains a visit number. This isn't available anywhere else within the header (public or private tags). The only way this would be identified is by regex screening values.

DICOMHeaderDump.txt

I'd like to make sure that I understand your suggestion of using a custom function:

Couldn't you just dicom.get("PatientName") and then use that in a REPLACE filter, with a custom function (as I've shown previously?).

Assuming we have the recipe entry - and have added functionality for "ANY":

	REMOVE ANY func:is_name

The corresponding is_name function would look something like this:

    def is_name(self, dicom, field, value):

        currentvalue = dicom.get(value)

        name = dicom.get('PatientName')
        splitvalues = name.split('^')
        for phi in splitvalues:
            if len(phi) > 4 and phi in currentvalue:
                return True 

        return False

With this approach, we'd need to define "is_*****" for every piece of phi that we'd potentially want to scan other tags for containing:

  • is_address
  • is_mrn
  • is_telephonenumber

...or we could just group all of these together and have a single custom function for all phi components, which could get any finite list of phi components and scan for those values (is_phi).

Is my understanding of your suggestion correct?

If so, the the reason this breaks under our use case, is that no matter what we do with this type of a solution, the list of phi being interrogated must be defined within the custom function. We're trying to find a way to expose the list of elements to look for in tags without requiring our end users to write custom functions.

@vsoch
Copy link
Member

vsoch commented Mar 2, 2020

You could write any custom function with multiple searches, or just keep them modular (it's up to you). There are a few bugs in your function - the value is already the currentvalue (it's passed to the function) and value is also the second argument. So you don't need to get it again. There is also no class here, so no need for self. It's simply a function that you add to the dictionary of some item you want cleaned / changed using it.

    def is_name(dicom, value, field):
        name = dicom.get('PatientName')
        splitvalues = name.split('^')
        for phi in splitvalues:
            if len(phi) > 4 and phi in value:
                return True 
        return False

Currently, the func works to return a value - so it fits to work with replace or add (so you could return an empty value, for example). To get this working with REMOVE, we would add the check for "func" here (it's currently part of the function parse_value which isn't used for remove). Then we would add the requirement that a function intended to be used with REMOVE must return True/False. Does that make sense?

@vsoch
Copy link
Member

vsoch commented Mar 2, 2020

Here is a simple implementation of what I'm describing above - #113 along with documentation. What I haven't added to this yet is to include all private tags (this would need further testing).

@vsoch
Copy link
Member

vsoch commented Mar 3, 2020

Also @wetzelj to test adding private tags, It would be very helpful to have the actual dicom file (and not just the header dump). If you can ping me on Twitter (@vsoch) I can share my email to send to.

@wetzelj
Copy link
Contributor Author

wetzelj commented Mar 3, 2020

Sorry for the bugs... In addition to using javascript syntax for split, I reversed the parameter order in the definition. The self-reference is just a detail of my implementation/use of pydicom/deid. I've wrapped the custom functions in a class.

# Incorrect definition:
# def is_name(self, dicom, field, value):

# Correct definition:
def is_name(self, dicom, value, field):

Regarding your other comments:

...the value is already the currentvalue (it's passed into the function)

In the utils\actions.py parse_value function the "func:is_name" string is split and is_name(item, value, field) is only executed on the return. At this point, when parse_value calls is_name in the return statement, the value parameter is still "func:is_name".

Given this, we really do need to get the value from by calling dicom.get().

Regardless I think there's an underlying issue with taking this approach for the value scanning. Assuming that we're processing a recipe action like REMOVE ALL func:is_phi, and is_phi uses several dicom.get() calls to acquire PHI related fields, with the current processing flow, it will execute indepently on each field in the dicom header (deid\dicom\actions.py). The problem with this, is that as we're executing this loop, we're removing the values from dicom variable (populated from the original read_file() in header.py) and each time we do a removal, we're potentially removing a header tag that has a value that we're using in the custom function to determine what to look for in other tag values. This is one of the reasons why my first approach (and subsequent REGEXPATTERN and VALUEPATTERN action suggestions) took a two phase approach - first get the tag values we want to look for, second look for those values across the tags.

Obviously, the error situation above could be worked around by changing the is_name function to avoid removing the fields it is using as value sources, but doing something like below turns the ALL keyword into "ALL-except what I'm using in this function"... which would be far less than ideal.

def is_name(dicom, value, field):
        currentvalue = str(dicom.get(field))

        if field != 'PatientName':
            name = dicom.get('PatientName')
            splitvalues = str(name).split('^')
            for phi in splitvalues:
                if len(phi) > 4 and phi in currentvalue:
                    return True 

    return False

Ultimately, I really would love to get this to a point where these rules for searching and scanning could be defined within a recipe rather than custom functions. It would be great if we could get to a solution where functionality traditionally requiring a custom function could be defined in a recipe (and as a result, easily exposed to the deid client).

I want to make sure that I am respectful of your time... at any point if you feel that it would be best for me to just proceed in my own fork, don't hesitate to say so.

@vsoch
Copy link
Member

vsoch commented Mar 3, 2020

Yes you are correct- value is the original value (see here). Apologies for my oversight, I never actually used it, and after all these years I forgot. I'll update the examples accordingly.

The problem with this, is that as we're executing this loop, we're removing the values from dicom variable (populated from the original read_file() in header.py) and each time we do a removal, we're potentially removing a header tag that has a value that we're using in the custom function to determine what to look for in other tag values.

This would be an issue with any custom function + REMOVE regardless. This could originally be controlled by way of the order of actions listed in the deid recipe, and to some extent this would still be possible if you put all the REMOVE at the end. And I hope you see that even for other actions that change fields, this is still an issue (albeit a different issue).

What you really want to do is remove (or perform an action) based on using some unknown text from a field. What we really need is something like this:

REMOVE ALL equals-field:PatientID

And the equals-field parameter would then do a dicom.get() of the field, and remove whatever is found there if it matches a regex that includes the value of PatientID. What wouldn't be done is the split. We'd need something like:

REMOVE ALL equals-field:PatientID|split " "

but that's a little ugly. Do you have a suggestion that doesn't require a func? And either way, to get around needing some cache of values, we'd need some other cache of fields or similar.

@vsoch
Copy link
Member

vsoch commented Mar 3, 2020

So for this example:

REGEXPATTERN '^(M\d+)|(\d{7,})

This could fit into the current [ACTION] [FIELD] [VALUE] structure as the following:

REMOVE ALL re:^(M\d+)|(\d{7,})

And it reads nicely, I want to remove all the fields that match that regular expression. Func (as implemented now) would still work nicely. I still have a hard time wrapping my head around value pattern (at least reading it fresh every time) but if I understand correctly, why not create a different kind of setting that can extract and set some custom field, e.g.,

SET patient_name PatientName
REMOVE ALL var:patient_name

To say "Set field "patient_name" to be whatever we find in PatientName, and then remove that for all subsequent tags. The assumption with SET is that it wouldn't persist with the dicom (unlike ADD) so you wouldn't need to do this:

SET patient_name PatientName
REMOVE ALL var:patient_name
REMOVE patient_name

Thoughts?

@vsoch
Copy link
Member

vsoch commented Mar 3, 2020

And note that we are discussing design for the PR here #113. I've taken a look at the sample dicom you sent me with private tags, and this will be addressed as another PR following this one to keep them in scope. It's a bit more complex because we are trying to use two different kinds of things (a string that indicates a field vs. a tag element) to do the same thing. I need to look at newer versions of Pydicom and see if there is different representation of the tags. Likely we'll just need to do a check and then handle the setting of any updated value slightly differently.

@wetzelj
Copy link
Contributor Author

wetzelj commented Mar 4, 2020

This would be an issue with any custom function + REMOVE regardless. This could originally be controlled by way of the order of actions listed in the deid recipe, and to some extent this would still be possible if you put all the REMOVE at the end. And I hope you see that even for other actions that change fields, this is still an issue (albeit a different issue).

Yep, I do see that this is a problem across the board for removals and change fields. After I get the core functionality working as desired, my plan is to build a recipe builder/validator which would highlight recipe rules that are potentially in conflict (and potentially be in the library). I haven't thought through this other than at a high level, as an example, it would highlight the conflicts if a recipe contained duplicate (additive) JITTERS or duplicate/conflicting REMOVE or REPLACE values. Obviously I have to put some additional thought into this.

Looking at your proposed solutions, the REMOVE ALL re:^(M\d+)|(\d{7,}) combined with the SET patient_name PatientName could meet the need, although I would have think more through how this would work with the non-discrete way that fields can be stored within DICOM (example: Splitting SIMPSON^HOMER by the ^ to search for HOMER or SIMPSON).

Looking at our current use case following the above patterns, we'd build the following recipe (ignoring the split point for the moment):

SET accession AccessionNumber
SET patient_id PatientID
SET patient_dob PatientBirthDate
SET other_pat_id OtherPatientIDs
SET patient_tel PatientTelephoneNumbers
SET patient_addr PatientAddress
SET patient_name PatientName

REMOVE ALL var:accession
REMOVE ALL var:patient_id
REMOVE ALL var:patient_dob
REMOVE ALL var:other_pat_id
REMOVE ALL var:patient_tel
REMOVE ALL var:patient_addr
REMOVE ALL var:patient_name
REMOVE ALL re:^(M\d+)|(\d{7,})
REMOVE ALL re:.*\^+.*

My concern with this approach really comes down to performance. In the above sample, we would be scanning ALL fields 9 times... and this is a real-world sample, this is the scanning rules that need to be applied for one of our projects. In this project, we have 10,000 CT studies to be deidentified (at around 300-500 images per study). At 300 images/study, that's 27,000,000 scans of all fields! :)

As I was thinking about this last night, I did come up with a potential solution that you may find acceptable... or maybe a little crazy...
What do you think about introducing a new section into the recipe file? For the sake of this conversation, let's call it %scandefinition (I'm not thrilled with this name)

%scandefinition entries would be similar to filters, in that they could be named, but inside the section, they would contain groupings of rules. For example the definition below would define a scandefinition called patient_identifiers, which would add splits of the included values as well as the appropriate regex patterns. We could potentially add additional scandefinitions too (think: %scandefinition vendor_identifiers).

Entries within a scandefinition section would be of the format [TYPE] [FIELD] [OPTIONS]

%scandefinition patient_identifiers

FIELD AccessionNumber 
FIELD PatientID
FIELD PatientBirthDate
FIELD OtherPatientIDs
FIELD PatientTelephoneNumbers
SPLIT PatientAddress splitval=' ';minlength='4'
SPLIT PatientName splitval='^';minlength='4'
REGEX ^(M\d+)|(\d{7,})
REGEX .*\^+.*

# FIELD could even potentially be expanded to allow for custom funcs.
FIELD SomeFieldName func:process_field

Once these scandefinition entries could then be used within the standard %headers section:

REMOVE ALL scandefinition:patient_identifiers

With this type of an approach, I would have to add functionality to parse and persist the scandefinition rules within the DeidRecipe object, but then in replace_identifiers, before processing actions, the scandefinition rules could be evaluated based on the image and processed into a single consolidated regex pattern ex:(123456789|HOMER|SIMPSON|(M\d+)|(\d{7,})). I could see us using this scandefinition with any of the standard DEID actions [ADD, REPLACE, etc].

At this point, the standard recipe header action to REMOVE ALL scandefinition:patient_identifiers would initiate an all fields scan, and for each field, could compare all patterns defined by the patient_identifiers scandefinition at once.

If you think this is a reasonable approach, I'll get started on an implementation in my fork. Personally, I like this better than any of my prior suggestions.

@vsoch
Copy link
Member

vsoch commented Mar 4, 2020

Thank you for sharing your detailed ideas, and I appreciate your enthusiasm to develop a solution! There are actually two separate things here that I think are being tangled into 1:

  • the recipe format, which should be clean and easy to read / understand
  • the implementation, or how the recipe is parsed and then rules applied to the data

You are saying in your latest comment that you think the REMOVE ALL re:<pattern> reads well, but it's problematic because of the implementation, and then (since the recipe drives the header actions) you are proposing a different element for the recipe to handle this. I do think this is a cool approach, but I think the way you've described this new section is confusing - a scan definition is hard to wrap my head around. I think really what we want to do is deal with these things separately.

  1. decide on the syntax for the recipe that is cleanest, and most intuitive (probably not adding extra entire sections with new keys and values)
  2. Separate from the first, fix up the implementation of the replacement so that it is performant.

To be honest, my first implementation of this library was heavily tied to another library for the school of medicine, so the implementation was done sort of intended for that. However I think the library has more use across different communities (and not as much the SOM) so I'd like to work on a larger refactor that will address point 2.

I need to think more about how to go about this - I think we would want to parse an entire recipe first, build up a dependency graph of sorts (that also caches originally values that might be needed) and then do the replacements the most efficient way possible, possibly instead of parsing over every action and checking for every field, we first create an assignment of specific actions to specific fields, and then do a much more informed replacement. Figuring out how to account for order will be the hardest part. I'll do more thinking about this and get back to you, and please let me know if you have any thoughts about ways to go about this.

@wetzelj
Copy link
Contributor Author

wetzelj commented Mar 5, 2020

I think the way you've described this new section is confusing

I agree... when I came back into the office this morning and re-read, I didn't like it either. Really, I think I merged two purposes (scanning the tag values and identifying the targeted tag identifiers). For the sake of clarity, I'm going to try again with an example. Which I hope will also illustrate where the additional section could be beneficial. At the moment, I'm also specifically ignoring a potential wholesale refactor of the replacement. I need to do more studying of the current process to make informed comments... I'll continue to think on that point.

Sample Header

(0008,0050) : SH   Len: 10     AccessionNumber                Value: [999999999 ]
(0008,0070) : LO   Len: 8      Manufacturer                   Value: [SIEMENS ]
(0008,1090) : LO   Len: 22     ManufacturersModelName         Value: [SOMATOM Definition AS+]
(0009,0010) : LO   Len: 20     PrivateCreator10xx             Value: [SIEMENS CT VA1 DUMMY]
(0010,0010) : PN   Len: 14     PatientsName                   Value: [SIMPSON^HOMER^J^]
(0010,0020) : LO   Len: 12     PatientID                      Value: [000991991991 ]
(0010,1000) : LO   Len: 8      OtherPatientIDs                Value: [E123456]
(0010,1001) : PN   Len: 8      OtherPatientNames              Value: [E123456]
(0010,21B0) : LT   Len: 90     AdditionalPatientHistory       Value: [MR SIMPSON LIKES DUFF BEER]
(0019,1091) : DS   Len: 6      <Unknown Tag>                  Value: [E123456]
(0019,1092) : DS   Len: 6      <Unknown Tag>                  Value: [M123456]

For the sample project, the goal would be to remove patient identifiers, but there's also a need to replace any references to the scanner's vendor, make, and model. In the prior suggestion, I discussed a new section called %scandefinition. I think a more appropriate name for this section would be %tagvaluelist.

Proposed Recipe (new sections)

%tagvaluelist patient_info
SPLIT PatientsName splitval='^';minlength='4'
FIELD PatientID
FIELD OtherPatientIDs
REGEX (M\d+)

%tagvaluelist scanner_info
FIELD Manufacturer
SPLIT ManufacturersModelName splitval=' ',minlength='4'

The goal of these new sections would be to create lists of strings (processed tag values) that could be referenced by name in the header section. At this point, the tagvaluelist sections would simply be creating two lists of strings, but at runtime, these would be evaluated to image-specific lists of tags:

patient_info					scanner_info
-------------					--------------
HOMER						SIEMENS
SIMPSON						SOMATOM
000991991991					Definition
E123456
(M\d+)

The existing header section pattern would remain unchanged ([ACTION] [FIELD] [VALUE]), except now, whereas the FIELD must either be a header named field or the keyword ALL, tagvaluelist names could be specified in actions.

Propsed Recipe (%header)
I'm listing two options here... these are referenced below in the implementation/processing replacement section.
Option A:

%header
REMOVE patient_info
REPLACE scanner_info var:myvar

Option B:

%header
REMOVE func:getTags(patient_info)
REPLACE func:getTags(scanner_info) var:myvar

Implementation/Processing Replacement
(again, pushing off a refactor for performance for the time being)

Within replace identifiers, before processing actions, we would need to convert the tagvaluelists into specific tags to be acted on. This would need to occur before actions were applied to the dicom file, and given our sample would create the following tag lists using the tagvaluelists. The recipe option A and B that I listed above would just drive what we think is most clear in the recipe - would we want an implicit or explicit conversion from tagvaluelists to tag values. Regardless of the option chose, the tagvaluelists could be converted to actual tags:

patient_info tags
------------
(0010,0010) : PN   Len: 14     PatientsName
(0010,0020) : LO   Len: 12     PatientID
(0010,1000) : LO   Len: 8      OtherPatientIDs
(0010,1001) : PN   Len: 8      OtherPatientNames 
(0010,21B0) : LT   Len: 90     AdditionalPatientHistory 
(0019,1091) : DS   Len: 6      <Unknown Tag>
(0019,1092) : DS   Len: 6      <Unknown Tag>

scanner_info tags
------------
(0008,0070) : LO   Len: 8      Manufacturer 
(0008,1090) : LO   Len: 22     ManufacturersModelName 
(0009,0010) : LO   Len: 20     PrivateCreator10xx

At this point, standard the recipe actions could be performed in pretty much exactly the same way as a rule like REMOVE ALL or REPLACE ALL var:myvar works today. The only difference is that the ALL keyword would be a different value tying to the tagvaluelist which evaluates to a subset of the tags.

The Results
Given these adjusments, the final deidentified header would look like this:

(0008,0050) : SH   Len: 10     AccessionNumber                Value: [999999999 ]
(0008,0070) : LO   Len: 8      Manufacturer                   Value: [Replacement Variable Value]
(0008,1090) : LO   Len: 22     ManufacturersModelName         Value: [Replacement Variable Value]
(0009,0010) : LO   Len: 20     PrivateCreator10xx             Value: [Replacement Variable Value]

wetzelj added 3 commits March 5, 2020 15:05
…patterns and values from specified tags. Changes inspired by modifications within howardpchen's fork of pydicom/deid."

This reverts commit f839e2f.
@wetzelj
Copy link
Contributor Author

wetzelj commented Mar 5, 2020

After talking with some of the others on my team, I'm going to go ahead and start to work on the recipe changes that I described above while thinking about how to ultimately process the rules. This recipe pattern will work very nicely for our use cases.

Should this pull request be closed? The only real value from it is the conversation...

@vsoch
Copy link
Member

vsoch commented Mar 5, 2020

I think we are definitely going in a good direction! I work full time so I can't always respond immediately. I'd like to first move forward with adding func to REMOVE to finish up that work, if you can review it #113. Once that is done, I would like to first solidify and verify the changes that are to be done before you do more work, and very likely I'd like to take charge of the work. I'll open up an issue to continue our conversation. I'm not sure what happened to the PR here, it says it's empty.

@vsoch vsoch closed this Mar 5, 2020
@vsoch
Copy link
Member

vsoch commented Mar 5, 2020

See #115 linked above. I think we are going in a good direction - a few hairballs to sort through but we're making progress! Thanks for your patience in giving me a reasonable amount of time to respond.

@wetzelj
Copy link
Contributor Author

wetzelj commented Mar 5, 2020

I'm not sure what happened to the PR here, it says it's empty.

It was all cleanup of my fork. The changes in PR #112 were initially in my master branch. I stashed them in a separate branch for reference and then reverted my master to match pydicom/master.

BTW... no worries on the response time, I've actually thought all of your responses were fast!

@vsoch
Copy link
Member

vsoch commented Mar 5, 2020

haha, okay great! I can usually put in time to respond thoughtfully at least within 24 hours. I have a lot of projects I'm working on at the same time (this is my normal state, it's not stressful or bad, just my routine) so I typically cycle through for each working day, and I might not get to something immediately. I just wanted to let you know that I try to be responsive, and to typically expect about this amount of time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants