cancel
Showing results for 
Search instead for 
Did you mean: 

Finding duplicate documents and selecting their document handle

Steven_Zeltner4
Confirmed Champ
Confirmed Champ

I'm looking to get a list of document handles that have a duplicate.  A duplicate is defined by a 5 keyword match.  The approach I'm taking is something like:

SELECT id.itemnum,COUNT(*) As numoccurrences
FROM hsi.itemdata id
JOIN hsi.keyitem513 ki513 ON ki513.itemnum = id.itemnum
JOIN hsi.keyitem514 ki514 ON ki514.itemnum = id.itemnum
JOIN hsi.keyitem515 ki515 ON ki515.itemnum = id.itemnum
JOIN hsi.keyitem516 ki516 ON ki516.itemnum = id.itemnum
JOIN hsi.keyitem517 ki517 ON ki517.itemnum = id.itemnum
WHERE id.itemtypenum = 413
GROUP BY ki514.keyvaluebig,ki515.keyvaluesmall,ki516.keyvaluedate,ki517.keyvaluechar
--HAVING COUNT(*) > 1

That's a rough outline of a (broken) query.  I realize that this query isn't going to compile because I don't have id.itemnum in the GROUP BY clause but I stopped here to ask the Community about whether or not the approach is sound.

Is using a Group By clause wise here?  How would you find a document where 5 keywords matched?  It doesn't make sense to Group By itemnum but that is what I'm trying to get to (eventually).

I'm having some luck with this since I don't really care what doc handle is kept:

SELECT MAX(id.itemnum), COUNT(*) As numoccurrences
FROM hsi.itemdata id
JOIN hsi.keyitem513 ki513 ON ki513.itemnum = id.itemnum
JOIN hsi.keyitem514 ki514 ON ki514.itemnum = id.itemnum
JOIN hsi.keyitem515 ki515 ON ki515.itemnum = id.itemnum
JOIN hsi.keyitem516 ki516 ON ki516.itemnum = id.itemnum
JOIN hsi.keyitem517 ki517 ON ki517.itemnum = id.itemnum
WHERE id.itemtypenum = 413
GROUP BY ki514.keyvaluebig,ki515.keyvaluesmall,ki516.keyvaluedate,ki517.keyvaluechar
HAVING COUNT(*) > 1 

Any other thoughts out there for a better way?  I'd imagine this will be pretty resource intensive.

 

Thank you.

6 REPLIES 6

Steven_Zeltner4
Confirmed Champ
Confirmed Champ

Thank you for the queries and the general SQL information.  Tom, thanks for the heads up on the SQL 2012 functionality for Brad's query.  Using this thread, I was able to get back the duplicates.

Thomas_Reu
Elite Collaborator
Elite Collaborator

Steven, were you able to run both queries?  We're on sqlserver 2008 R2, so I can't gather any stats.  I'd love to have empirical information on the run times on both queries against the same data set - which you may have.  Plus a count of how many records you are processing.  If the times are that much better, then maybe I need to deal with the upgrade pain and push for a DB update to 2012 sooner versus later. 

Getting started

Find what you came for

We want to make your experience in Hyland Connect as valuable as possible, so we put together some helpful links.