Hyland Connect

Steven_Zeltner4 · ‎07-16-2014

I'm looking to get a list of document handles that have a duplicate. A duplicate is defined by a 5 keyword match. The approach I'm taking is something like:

SELECT id.itemnum,COUNT(*) As numoccurrences
FROM hsi.itemdata id
JOIN hsi.keyitem513 ki513 ON ki513.itemnum = id.itemnum
JOIN hsi.keyitem514 ki514 ON ki514.itemnum = id.itemnum
JOIN hsi.keyitem515 ki515 ON ki515.itemnum = id.itemnum
JOIN hsi.keyitem516 ki516 ON ki516.itemnum = id.itemnum
JOIN hsi.keyitem517 ki517 ON ki517.itemnum = id.itemnum
WHERE id.itemtypenum = 413
GROUP BY ki514.keyvaluebig,ki515.keyvaluesmall,ki516.keyvaluedate,ki517.keyvaluechar
--HAVING COUNT(*) > 1

That's a rough outline of a (broken) query. I realize that this query isn't going to compile because I don't have id.itemnum in the GROUP BY clause but I stopped here to ask the Community about whether or not the approach is sound.

Is using a Group By clause wise here? How would you find a document where 5 keywords matched? It doesn't make sense to Group By itemnum but that is what I'm trying to get to (eventually).

I'm having some luck with this since I don't really care what doc handle is kept:

SELECT MAX(id.itemnum), COUNT(*) As numoccurrences
FROM hsi.itemdata id
JOIN hsi.keyitem513 ki513 ON ki513.itemnum = id.itemnum
JOIN hsi.keyitem514 ki514 ON ki514.itemnum = id.itemnum
JOIN hsi.keyitem515 ki515 ON ki515.itemnum = id.itemnum
JOIN hsi.keyitem516 ki516 ON ki516.itemnum = id.itemnum
JOIN hsi.keyitem517 ki517 ON ki517.itemnum = id.itemnum
WHERE id.itemtypenum = 413
GROUP BY ki514.keyvaluebig,ki515.keyvaluesmall,ki516.keyvaluedate,ki517.keyvaluechar
HAVING COUNT(*) > 1

Any other thoughts out there for a better way? I'd imagine this will be pretty resource intensive.

Thank you.

Steven_Zeltner4 · ‎07-18-2014

Thank you for the queries and the general SQL information. Tom, thanks for the heads up on the SQL 2012 functionality for Brad's query. Using this thread, I was able to get back the duplicates.

Thomas_Reu · ‎07-18-2014

Steven, were you able to run both queries? We're on sqlserver 2008 R2, so I can't gather any stats. I'd love to have empirical information on the run times on both queries against the same data set - which you may have. Plus a count of how many records you are processing. If the times are that much better, then maybe I need to deal with the upgrade pain and push for a DB update to 2012 sooner versus later.

Hyland Connect

Finding duplicate documents and selecting their document handle