07-16-2014 01:21 PM
I'm looking to get a list of document handles that have a duplicate. A duplicate is defined by a 5 keyword match. The approach I'm taking is something like:
SELECT id.itemnum,COUNT(*) As numoccurrences
FROM hsi.itemdata id
JOIN hsi.keyitem513 ki513 ON ki513.itemnum = id.itemnum
JOIN hsi.keyitem514 ki514 ON ki514.itemnum = id.itemnum
JOIN hsi.keyitem515 ki515 ON ki515.itemnum = id.itemnum
JOIN hsi.keyitem516 ki516 ON ki516.itemnum = id.itemnum
JOIN hsi.keyitem517 ki517 ON ki517.itemnum = id.itemnum
WHERE id.itemtypenum = 413
GROUP BY ki514.keyvaluebig,ki515.keyvaluesmall,ki516.keyvaluedate,ki517.keyvaluechar
--HAVING COUNT(*) > 1
That's a rough outline of a (broken) query. I realize that this query isn't going to compile because I don't have id.itemnum in the GROUP BY clause but I stopped here to ask the Community about whether or not the approach is sound.
Is using a Group By clause wise here? How would you find a document where 5 keywords matched? It doesn't make sense to Group By itemnum but that is what I'm trying to get to (eventually).
I'm having some luck with this since I don't really care what doc handle is kept:
SELECT MAX(id.itemnum), COUNT(*) As numoccurrences
FROM hsi.itemdata id
JOIN hsi.keyitem513 ki513 ON ki513.itemnum = id.itemnum
JOIN hsi.keyitem514 ki514 ON ki514.itemnum = id.itemnum
JOIN hsi.keyitem515 ki515 ON ki515.itemnum = id.itemnum
JOIN hsi.keyitem516 ki516 ON ki516.itemnum = id.itemnum
JOIN hsi.keyitem517 ki517 ON ki517.itemnum = id.itemnum
WHERE id.itemtypenum = 413
GROUP BY ki514.keyvaluebig,ki515.keyvaluesmall,ki516.keyvaluedate,ki517.keyvaluechar
HAVING COUNT(*) > 1
Any other thoughts out there for a better way? I'd imagine this will be pretty resource intensive.
Thank you.
07-16-2014 02:47 PM
The query isn't going to work because of the select and the group by. Also, keep in mind that itemnum is always going to be unique, even if you have dups across your 4 keywords. So, you are going to need 1 query to find dups that must exclude the itmenum. Then you'll have to use the data from that dup query to join to another query to get the itemnums. For example, see below. Also, I'm only using a 2 part multi key for brevity and readability. You could also break the 2 subqueries into views if you have a hard time understanding this. Last, please make sure you do this in your test environment first; until you get a good handle on this...
Select itemnum
from
(select k154.keyvalue as id1, k155.keyvalue as id2, i.*
from hsi.itemdata as i
inner join hsi.keyitem154 as k154
on i.itemnum = k154. itemnum
inner join hsi.keyitem155 as k155
on i.itemnum = k155.itemnum) as v_all
Inner join
(select k154.keyvalue as id1, k155.keyvalue as id2, count(*) as dup_cnt
from hsi.itemdata as i
inner join hsi.keyitem154 as k154
on i.itemnum = k154. itemnum
inner join hsi.keyitem155 as k155
on i.itemnum = k155.itemnum
group by k154.keyvalue, k155.keyvalue
having count(*) > 1) as dups
on v_all.id1 = dups.id1
and v_all.id2 = dups.id2
07-17-2014 07:16 AM
Steven,
I loved the question, and was looking for a way to do this myself with only one pass. Here is what I came up with:
with c as
(
select id.itemnum as 'itemnum',
ki156.keyvaluechar as 'ki156',
ki167.keyvaluechar as 'ki167',
ki177.keyvaluechar as 'ki177',
lag(id.itemnum) over (partition by ki156.keyvaluechar, ki167.keyvaluechar, ki177.keyvaluechar order by id.itemnum) as 'FirstVal',
lead(id.itemnum) over (partition by ki156.keyvaluechar, ki167.keyvaluechar, ki177.keyvaluechar order by id.itemnum) as 'LastVal'
from hsi.itemdata (nolock) ID
inner join hsi.keyitem156 (nolock) ki156
on ID.itemnum = ki156.itemnum
inner join hsi.keyitem167 (nolock) ki167
on ID.itemnum = ki167.itemnum
inner join hsi.keyitem177 (nolock) ki177
on ID.itemnum = ki177.itemnum
)
select itemnum , ki156, ki167, ki177
from c
where firstval is not null
or lastval is not null
07-17-2014 08:40 AM
Hi Steven,
As has been noted, if you're looking for a list of the document handles, rather than an aggregate (i.e. count, max, etc.) then you'll need something slightly more complex than what you have now. Brad's approach is a Common Table Expression and will perform more efficiently than making multiple passes at the tables - which would be needed in a "standard" join query.
If this is going to be a recurring need for you, you might also consider the Exception Reports module which can identify duplicate documents in the manner you are describing them and produce an automated report of the affected documents.
Hope that helps!
Ansley
07-17-2014 10:44 AM
A caveat to anyone reading this thread - Brad's approach will require sql server 2012 (lag and lead were introduced there). nolock goes back to 2005 and could be added to the previous query if you desire.
Find what you came for
We want to make your experience in Hyland Connect as valuable as possible, so we put together some helpful links.