During one of our performance tuning exercise, we observed that while querying for all variables using
Activity was using following query defined in the xml
<select id="selectTaskWithVariablesByQueryCriteria" parameterType="org.activiti.engine.impl.TaskQueryImpl" resultMap="taskAndVariablesResultMap">
${limitBefore}
select distinct RES.*,
VAR.ID_ as VAR_ID_, VAR.NAME_ as VAR_NAME_, VAR.TYPE_ as VAR_TYPE_, VAR.REV_ as VAR_REV_,
VAR.PROC_INST_ID_ as VAR_PROC_INST_ID_, VAR.EXECUTION_ID_ as VAR_EXECUTION_ID_, VAR.TASK_ID_ as VAR_TASK_ID_,
VAR.BYTEARRAY_ID_ as VAR_BYTEARRAY_ID_, VAR.DOUBLE_ as VAR_DOUBLE_,
VAR.TEXT_ as VAR_TEXT_, VAR.TEXT2_ as VAR_TEXT2_, VAR.LONG_ as VAR_LONG_
${limitBetween}
<include refid="selectTaskWithVariablesByQueryCriteriaSql"/>
${orderBy}
${limitAfter}
</select>
Having a look at the query it seems that the distinct part is unnecessary as using distinct in left outer join with select including values also from table OTHER THAN Base table does not make sense in my opinion. The rows of resultset of this query without distinct would be distinct anyway.
Because of distinct, the database has to work extra. For e.g., using explain plan on this query for MySQL, it shows that this uses a temporary table to copy first all the results to temp area and then process distinct which is un necessary extra work. If the act_ru_variable table size is large, then this involves huge processing.
Removing the distinct returns the same number of rows only this time the processing done by database is not required.
Of course if the query contains only columns from base tables then using distinct makes sense.
Have I overlooked something here or not aware of the reason why distinct is being used ?