-
Notifications
You must be signed in to change notification settings - Fork 332
SQL:Multi field mixed deduplication followed by numbering
The SQL SERVER database table stores personnel records from multiple sources. If one of the Name, Phone, or Email fields in two records is duplicated, it indicates that the two records belong to the same person. Null means that the data is unknown. When both fields of two records are null, it means that they are not duplicated by default. Whether they are duplicated depends on other fields to determine. Note that if A and B are the same, and B and C are the same, then A and C are also the same.
RegisterId | Name | Phone | |
---|---|---|---|
XXX-00001 | John Strauss | 241567 | Null |
XXX-00023 | Rick Astley | 241567 | [email protected] |
XXX-00003 | John Strauss | NULL | NULL |
XXX-00099 | NULL | 241567 | [email protected] |
XXX-00085 | NULL | 256819 | [email protected] |
XXX-00016 | NULL | NULL | [email protected] |
XXX-00007 | John Deep | 280933 | NULL |
XXX-00008 | John Deep | 93484 | NULL |
XXX-00009 | Javier Estrada | 94578 | [email protected] |
XXX-00010 | NULL | 39939 | [email protected] |
Requirement: Add personnel number 'no' as a calculation column, find duplicate records, and assign an independent 'no' to each set of duplicate records.
RegisterId | Name | Phone | no | |
---|---|---|---|---|
XXX-00001 | John Strauss | 241567 | 1 | |
XXX-00023 | Rick Astley | 241567 | [email protected] | 1 |
XXX-00003 | John Strauss | 1 | ||
XXX-00099 | 241567 | [email protected] | 1 | |
XXX-00085 | 256819 | [email protected] | 1 | |
XXX-00016 | [email protected] | 1 | ||
XXX-00007 | John Deep | 280933 | 7 | |
XXX-00008 | John Deep | 93484 | 7 | |
XXX-00009 | Javier Estrada | 94578 | [email protected] | 9 |
XXX-00010 | 39939 | [email protected] | 10 |
SPL code
A | |
---|---|
1 | =mssql.query("select * from tb") |
2 | =T=A1.derive(#:no) |
3 | for T.count(T[1:].count( if(no!=T.no && ( (Name && T.Name && Name==T.Name) || (Phone && T.Phone && Phone==T.Phone) || (Email && T.Email && Email==T.Email)), T.no=no=min(no,T.no) ))>0)>0 |
4 | return T |
A1: Query the database through JDBC.
A2: Add a new number column 'no', which defaults to record number #.
A3: Use an infinite loop to traverse records and adjust the ‘no’ column. If there is a situation where ‘no’ is adjusted after a certain traversal, traverse again until all ‘no’ are no longer adjusted. Traverse records from top to bottom. When the current record is i-th, compare it with the (i+1) th till the last record in sequence. If it is considered a duplicate record, synchronize the 'no' of both records and take the smaller one between them. When the field value is null, it defaults to not being duplicated with other records. Note that null is false when performing logical AND operations on any value.
The count function returns the number of members that meet the criteria. During the comparison process between the i-th record and the i+1 to the last record, if there is an action where ‘no’ is adjusted, the count of the inner layer will be greater than 0, which will cause the count of the outer layer to also be greater than 0, satisfying the condition for continuing the loop.
We still need the condition of ‘no!=T.no’ to ensure that at least one 'no' value in the adjustment action becomes smaller, otherwise an endless loop may occur.
Problem source: https://stackoverflow.com/questions/78410258/group-rows-which-share-some-value-in-three-columns-and-assign-them-a-same-id
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code