Later in the year we will be launching a new enterprise analytics product for Alfresco built on top of Search Services and Solr.
One of the capabilities it will bring is the ability to write SQL queries against the Solr datastore and so I wanted to share where we are with that SQL support. Below we have documented how the SQL works and also listed what isn't yet supported at the end.
I'd love to know what you think about the level of support so do comment or message with any feedback or questions.
The basic syntax of the SQL select is as follows:
Select DBID, cm_creator as creator, `cm_content.size` as size from alfresco where `cm_content.size` > 1000 order by `cm_content.size` desc limit 100
In the sections below we’ll explain and expand on the different aspects of this SQL statement.
For the EA release the only table that can be specified is the alfresco table. The alfresco table contains the documents and fields that have been indexed within the Alfresco Indexing Server’s main alfresco index.
Fields
Alfresco has a set of standard fields, which can be referred to by name in the SQL field list. The DBID field in the example SELECT statement is an example of a standard field.
The most useful ones are: PARENT, PATH, ANCESTOR, TYPE, ASPECT, PROPERTIES, NULLPROPERTIES, FIELDS, LID and DBID.
Fields from Alfresco’s out of-the-box content models as well as fields from custom content models can be referred to using the content model property qname, as in AFTS and the CMIS query language. The cm_creator field in the example SQL statement refers to the creator field in the out-of-the-box cm content model. Fields that have a unique local name over all prefixes do not need to use the prefix. Note that we use "_" to separate the prefix and the locale name as ":" would have to be escaped.
Fields that include reserved words or special characters will need to be escaped using the back tick character (`). The `cm_content.size` field in the example SQL statement is an example of back tick escaping. The only non-word character that can be used without escaping is the underscore “_”.
Under the covers the implementation uses Apache Calcite which has a list of reserved words that would also need to be escaped. You are most likely to hit reserved keywords picking aliases for fields.
SQL field aliases are supported in the field list. Field aliases cannot be used in the WHERE, ORDER BY or HAVING clauses. Field aliases that contain special characters or reserved words will need to be escaped with the back tick.
Alfresco’s SQL predicate is designed to take advantage of the rich search capabilities available in the Alfresco Search Service.
The basic predicate on a text field performs a phrase search. Below is the syntax of a basic predicate on a text field:
select DBID from alfresco where cm_content = ‘hello world’
This will search for the phrase ‘hello world’ in the cm_content field.
To gain full control of the search predicate for a specific field you can wrap the predicate in parenthesis and enter the query using Alfresco full text search syntax. For example to search for (hello OR world) in the cm_content field the following search predicate can be used:
select DBID from alfresco where cm_content = ‘(hello OR world)’
Predicates on string identifier fields will perform an exact match on the field. For example:
select DBID, LID from alfresco where LID = ‘value’
The SQL statement above will perform an exact match on the LID field.
Most fields from the content models will perform full text search matches unless the property is defined as tokenised false in the model.
This may not be what you expect.
The predicate on numeric fields can perform =, >=, <= and Alfresco Solr range queries.
Below are some examples using the =, >=, <=
select DBID, LID from alfresco where DBID = 123
select DBID, LID from alfresco where DBID >= 123
select DBID, LID from alfresco where DBID <= 123
select DBID, LID from alfresco where DBID >= 100 AND DBID <=123
Below are examples of Alfresco Solr range queries:
select DBID, LID from alfresco where DBID =’[* TO 123]’
This will select all DBIDs below 123, with inclusive ranges. The square brackets are inclusive ranges.
select DBID, LID from alfresco where DBID =’[* TO 123>’
This will select all DBIDs below 123, with an exclusive top range. < and > are exclusive ranges.
select DBID, LID from alfresco where DBID =’[123 TO *]’
This will select all DBIDs above 123, with inclusive ranges.
select DBID, LID from alfresco where DBID =’<123 TO *]’
This will select all DBIDs above 123, with an exclusive bottom range.
select DBID, LID from alfresco where DBID =’<100 TO 123>’
This will select all DBIDs above 100 and below 123, with an exclusive top and bottom range.
SQL predicates can be combined with Boolean operators AND, OR and NOT and nested with parenthesis.
select DBID from alfresco where (cm_content = ‘hello world’ AND (DBID >= 100 AND DBID < 200))
SQL SELECT statements can contain an ORDER BY clause with one or more order by fields. String identifiers and numeric fields can be specific in the ORDER BY clause.
Below is an example of an ORDER BY on a numeric field:
select cm_creator, cm_name, exif_manufacturer, audio_trackNumber from alfresco order by audio_trackNumber asc
SQL SELECT statements can contain a LIMIT clause. If no limit is specified a default limit of 1000 is set. Caution should be used when increasing the default limit as performance and memory consumption increase as the limit increases.
The basic syntax for SELECT DISTINCT syntax is as follows:
select distinct cm_name from alfresco where cm_content = 'world' order by cm_name asc
SELECT DISTINCT queries can also have multiple fields and multiple order by fields.
SQL aggregations without a GROUP BY clause return a single result tuple with the aggregation results. Below is an example:
select count(*) as count, avg(`cm_content.size`) as content_size from alfresco where cm_owner = 'xyz'
Alfresco SQL supports the following aggregation functions:
count(*)
sum(numeric_field)
avg(numeric_field)
min(numeric_field)
max(numeric_field)
Any numeric field can be used within the aggregations sum, avg, min and max. As with the basic SELECT statements, aggregation fields defined by content models can be referenced using the content model prefix. Fields that are reserved words or contain special characters will need to be escaped with the back tick character.
If a field alias is specified for an aggregate function then the field alias will appear in the result tuple. If field aliases are not used then the field name for the aggregate functions will appear as follows: EXPR$1, EXPR$2. These values refer to the function expression by the order they appear in the field list, starting from 1. For example the first function that appears in the query will be named EXPR$1 in the result tuples.
SQL aggregations with a GROUP BY clause are also supported and take the following form:
select `cm_content.mimetype`, count(*) as total_count from alfresco group by `cm_content.mimetype` having count(*) < 4 order by count(*) asc
Alfresco SQL supports the following aggregation functions: count(*), sum(numeric_field), avg(numeric_field), min(numeric_field), max(numeric_field).
Any numeric field can be used within the aggregations sum, avg, min and max. As with the basic SELECT statements fields defined by content models can be referenced using the content model prefix. Fields that are reserved words or contain special characters will need to be escaped with the back tick character.
One or more fields can be specified as group by fields. Fields that are designated as facetable in a content model will provide the best aggregation results.
If a field alias is specified for an aggregate function then the field alias will appear in the result tuple. If field aliases are not used then the field name for the aggregate functions will appear as follows: EXPR$1, EXPR$2. These values refer to the function expression by the order they appear in the query, starting from 1. For example the first function that appears in the query will be named EXPR$1 in the result tuples.
One or more fields may used in the ORDER BY clause. The ORDER BY clause can include both fields from the field list and aggregate functions. Field aliases cannot be used in the ORDER BY clause. When referring to an aggregate function in the ORDER BY clause the function call as it appears in the field list should be used.
The HAVING clause is supported for aggregation functions only. Boolean logic and nested HAVING clauses are supported. The following comparison operations are supported in the HAVING clause: =, >=, <=, !=.
A LIMIT clause can be used to limit the number of aggregations results. If no LIMIT clause is provided a default limit of 1000 is applied.
There is specific support for SQL time series reporting through the use of virtual time dimensions. The section below describes how virtual time dimensions are used.
Virtual time dimensions for every datetime field are automatically created and stored in the Alfresco Search Service. The three virtual time dimensions supported for the EA release are: _day, _month, _year. To use the virtual time dimensions you simply append the virtual time dimension to any datetime field and use it in the GROUP BY clause. Below is an example where the _day dimension is appended to the cm_created datetime field.
select cm_created_day, count(*) as total from alfresco where cm_created >= 'NOW/DAY' group by cm_created_day
The query above creates a daily time series report using the cm_created_day virtual time dimension.
A datetime predicate can be used in the WHERE clause to control the datetime range of the time series report. Notice in the example query the where clause:
where cm_created >= 'NOW/DAY'
This is a datetime predicate on the cm_created field. Its important to note that the virtual time dimension field is only used in the field list and GROUP BY clause. The predicate is applied to the non-virtual datetime field in the index.
The example above is using a date math expression to specify a lower boundary for the time series report. The sections below describe how to set lower and upper boundaries using both fixed date and date math predicates.
If no datetime predicate is supplied, the following default lower and upper boundaries for the different time dimensions will be used:
day
lower: current day minus 1 month
upper: current full day
month
lower: current month minus 24 months
upper: current full month
year
lower: current year minus 5 years
upper: current full year
Fixed datetime predicates are formatted according to a subset of ISO 8601. The SQL statement below shows an example of the fixed date predicate formatting:
select cm_created_day, count(*) from alfresco where cm_created >= '2010-02-01T01:01:01Z' and cm_created <= '2010-02-14T23:59:59Z' group by cm_created_day
Note that fixed datetime predicates require the full precision shown in the example above.
There is also support for a rich set of date math expressions. Below is an example of a time series aggregation using date math predicates:
select cm_created_month, count(*) from alfresco where cm_created >= 'NOW/MONTH-6MONTHS' and cm_created <= 'NOW' group by cm_created_month
In the example above NOW signifies the current point in time with milli-second precision. NOW/MONTH rounds the current point in time down to the current MONTH. The -6MONTHS subtracts 6 months from the current month.
See the Solr date math guide for more details on date math syntax.
Time series aggregations that group by virtual time dimensions are currently limited to using a single group by field.
By default time series aggregation results are sorted in datetime ascending order. An ORDER BY clause can be used to change the direction of the datatime sort or sort by an aggregate result.
A HAVING clause can be used to filter time series aggregations results.
The product includes a JDBC thin client that can be used with Apache Zeppelin and other SQL clients. The JDBC driver can also be used directly from a Java application. Below is sample code for using the JDBC driver:
String sql = "select DBID, LID from alfresco where cm_content = 'world' order by DBID limit 10 ";
String alfrescoJson = "{\"tenants\":[\"\"],\"locales\":[\"en_US\"],\"defaultNamespace\":\"http://www.alfresco.org/model/content/1.0\",\"textAttributes\":[],\"defaultFTSOperator\":\"OR\",\"defaultFTSFieldOperator\":\"OR\",\"anyDenyDenies\":true,\"query\":\"name:*\",\"templates\":[],\"allAttributes\":[],\"queryConsistency\":\"DEFAULT\",\"authorities\":[\"GROUP_EVERYONE\",\"ROLE_ADMINISTRATOR\",\"ROLE_AUTHENTICATED\",\"admin\"]}";
Properties props = new Properties();
props.put("alfresco.shards", "http://localhost:8983/solr/alfresco")
props.put("json", alfrescoJson);
String connectionString = "jdbc:alfresco://localhost:8983?collection=alfresco";
Connection con = null;
Statement stmt = null;
ResultSet rs = null;
try {
con = DriverManager.getConnection(connectionString, props);
stmt = con.createStatement();
rs = stmt.executeQuery(sql);
int i=0;
while (rs.next()) {
System.out.println(rs.getString("DBID"));
}
} finally {
try { rs.close(); } catch(Exception e) {}
try { stmt.close();} catch(Exception e) {}
try { con.close();} catch(Exception e) {}
}
}
Below are some commonly used SQL features that are not yet supported, some of these we are working at closing at the moment but we'd love to know which you would find most valuable. Leave a comment below or send me a message to give us your feedback.