cancel
Showing results for 
Search instead for 
Did you mean: 

Finding duplicate documents and selecting their document handle

Steven_Zeltner4
Confirmed Champ
Confirmed Champ

I'm looking to get a list of document handles that have a duplicate.  A duplicate is defined by a 5 keyword match.  The approach I'm taking is something like:

SELECT id.itemnum,COUNT(*) As numoccurrences
FROM hsi.itemdata id
JOIN hsi.keyitem513 ki513 ON ki513.itemnum = id.itemnum
JOIN hsi.keyitem514 ki514 ON ki514.itemnum = id.itemnum
JOIN hsi.keyitem515 ki515 ON ki515.itemnum = id.itemnum
JOIN hsi.keyitem516 ki516 ON ki516.itemnum = id.itemnum
JOIN hsi.keyitem517 ki517 ON ki517.itemnum = id.itemnum
WHERE id.itemtypenum = 413
GROUP BY ki514.keyvaluebig,ki515.keyvaluesmall,ki516.keyvaluedate,ki517.keyvaluechar
--HAVING COUNT(*) > 1

That's a rough outline of a (broken) query.  I realize that this query isn't going to compile because I don't have id.itemnum in the GROUP BY clause but I stopped here to ask the Community about whether or not the approach is sound.

Is using a Group By clause wise here?  How would you find a document where 5 keywords matched?  It doesn't make sense to Group By itemnum but that is what I'm trying to get to (eventually).

I'm having some luck with this since I don't really care what doc handle is kept:

SELECT MAX(id.itemnum), COUNT(*) As numoccurrences
FROM hsi.itemdata id
JOIN hsi.keyitem513 ki513 ON ki513.itemnum = id.itemnum
JOIN hsi.keyitem514 ki514 ON ki514.itemnum = id.itemnum
JOIN hsi.keyitem515 ki515 ON ki515.itemnum = id.itemnum
JOIN hsi.keyitem516 ki516 ON ki516.itemnum = id.itemnum
JOIN hsi.keyitem517 ki517 ON ki517.itemnum = id.itemnum
WHERE id.itemtypenum = 413
GROUP BY ki514.keyvaluebig,ki515.keyvaluesmall,ki516.keyvaluedate,ki517.keyvaluechar
HAVING COUNT(*) > 1 

Any other thoughts out there for a better way?  I'd imagine this will be pretty resource intensive.

 

Thank you.

6 REPLIES 6

Thomas_Reu
Elite Collaborator
Elite Collaborator

The query isn't going to work because of the select and the group by.  Also, keep in mind that itemnum is always going to be unique, even if you have dups across your 4 keywords.  So, you are going to need 1 query to find dups that must exclude the itmenum.  Then you'll have to use the data from that dup query to join to another query to get the itemnums.  For example, see below.  Also, I'm only using a 2 part multi key for brevity and readability.  You could also break the 2 subqueries into views if you have a hard time understanding this.  Last, please make sure you do this in your test environment first; until you get a good handle on this...

Select itemnum

from

(select k154.keyvalue as id1, k155.keyvalue as id2, i.*

from hsi.itemdata as i

inner join hsi.keyitem154 as k154

on i.itemnum = k154. itemnum

inner join hsi.keyitem155 as k155

on i.itemnum = k155.itemnum) as v_all

Inner join

(select k154.keyvalue as id1, k155.keyvalue as id2, count(*) as dup_cnt

from hsi.itemdata as i

inner join hsi.keyitem154 as k154

on i.itemnum = k154. itemnum

inner join hsi.keyitem155 as k155

on i.itemnum = k155.itemnum

group by k154.keyvalue, k155.keyvalue

having count(*) > 1) as dups

on v_all.id1 = dups.id1

and v_all.id2 = dups.id2

Brad_Phillips
Confirmed Champ
Confirmed Champ

Steven,

 

I loved the question, and was looking for a way to do this myself with only one pass.  Here is what I came up with:

 

with c as

(

select id.itemnum as 'itemnum',

ki156.keyvaluechar as 'ki156',

ki167.keyvaluechar as 'ki167',

ki177.keyvaluechar as 'ki177',

lag(id.itemnum)   over (partition by ki156.keyvaluechar, ki167.keyvaluechar, ki177.keyvaluechar order by id.itemnum) as 'FirstVal',

lead(id.itemnum)  over (partition by ki156.keyvaluechar, ki167.keyvaluechar, ki177.keyvaluechar order by id.itemnum) as 'LastVal'

from hsi.itemdata (nolock) ID

inner join hsi.keyitem156 (nolock) ki156

on ID.itemnum = ki156.itemnum

inner join hsi.keyitem167 (nolock) ki167

on ID.itemnum = ki167.itemnum

inner join hsi.keyitem177 (nolock) ki177

on ID.itemnum = ki177.itemnum

)

select itemnum , ki156, ki167, ki177

from c

where firstval is not null

or lastval is not null

Ansley_Ingram
Elite Collaborator
Elite Collaborator

Hi Steven,

As has been noted, if you're looking for a list of the document handles, rather than an aggregate (i.e. count, max, etc.) then you'll need something slightly more complex than what you have now. Brad's approach is a Common Table Expression and will perform more efficiently than making multiple passes at the tables - which would be needed in a "standard" join query.

If this is going to be a recurring need for you, you might also consider the Exception Reports module which can identify duplicate documents in the manner you are describing them and produce an automated report of the affected documents.

Hope that helps!

Ansley

 

Thomas_Reu
Elite Collaborator
Elite Collaborator

 A caveat to anyone reading this thread - Brad's approach will require sql server 2012 (lag and lead were introduced there).  nolock goes back to 2005 and could be added to the previous query if you desire.

Getting started

Find what you came for

We want to make your experience in Hyland Connect as valuable as possible, so we put together some helpful links.