Skip to content

SQL:Multi field mixed deduplication followed by numbering

esProcSPL edited this page Nov 20, 2024 · 1 revision

The SQL SERVER database table stores personnel records from multiple sources. If one of the Name, Phone, or Email fields in two records is duplicated, it indicates that the two records belong to the same person. Null means that the data is unknown. When both fields of two records are null, it means that they are not duplicated by default. Whether they are duplicated depends on other fields to determine. Note that if A and B are the same, and B and C are the same, then A and C are also the same.

RegisterId Name Phone Email
XXX-00001 John Strauss 241567 Null
XXX-00023 Rick Astley 241567 [email protected]
XXX-00003 John Strauss NULL NULL
XXX-00099 NULL 241567 [email protected]
XXX-00085 NULL 256819 [email protected]
XXX-00016 NULL NULL [email protected]
XXX-00007 John Deep 280933 NULL
XXX-00008 John Deep 93484 NULL
XXX-00009 Javier Estrada 94578 [email protected]
XXX-00010 NULL 39939 [email protected]

Requirement: Add personnel number 'no' as a calculation column, find duplicate records, and assign an independent 'no' to each set of duplicate records.

RegisterId Name Phone Email no
XXX-00001 John Strauss 241567 1
XXX-00023 Rick Astley 241567 [email protected] 1
XXX-00003 John Strauss 1
XXX-00099 241567 [email protected] 1
XXX-00085 256819 [email protected] 1
XXX-00016 [email protected] 1
XXX-00007 John Deep 280933 7
XXX-00008 John Deep 93484 7
XXX-00009 Javier Estrada 94578 [email protected] 9
XXX-00010 39939 [email protected] 10

SPL code

A
1 =mssql.query("select * from tb")
2 =T=A1.derive(#:no)
3 for T.count(T[1:].count( if(no!=T.no && ( (Name && T.Name && Name==T.Name) || (Phone && T.Phone && Phone==T.Phone) || (Email && T.Email && Email==T.Email)), T.no=no=min(no,T.no) ))>0)>0
4 return T

A1: Query the database through JDBC.

A2: Add a new number column 'no', which defaults to record number #.

A3: Use an infinite loop to traverse records and adjust the ‘no’ column. If there is a situation where ‘no’ is adjusted after a certain traversal, traverse again until all ‘no’ are no longer adjusted. Traverse records from top to bottom. When the current record is i-th, compare it with the (i+1) th till the last record in sequence. If it is considered a duplicate record, synchronize the 'no' of both records and take the smaller one between them. When the field value is null, it defaults to not being duplicated with other records. Note that null is false when performing logical AND operations on any value.

The count function returns the number of members that meet the criteria. During the comparison process between the i-th record and the i+1 to the last record, if there is an action where ‘no’ is adjusted, the count of the inner layer will be greater than 0, which will cause the count of the outer layer to also be greater than 0, satisfying the condition for continuing the loop.

We still need the condition of ‘no!=T.no’ to ensure that at least one 'no' value in the adjustment action becomes smaller, otherwise an endless loop may occur.

Problem source: https://stackoverflow.com/questions/78410258/group-rows-which-share-some-value-in-three-columns-and-assign-them-a-same-id

Clone this wiki locally