
Search content without extension

mattjourdan
Champ in-the-making

Hello,

I use Alfresco Community 5.0.d and I would like to know whether it is possible to search Alfresco for all files that don't have an extension.

I do not know how to do the search.

Thanks,

Matthieu

12 REPLIES

mehe
Elite Collaborator

It depends on which search you want to use. If you use the "Aikau" search in Share or Alfresco FTS, the search string !=cm:name:*.??? should do it. It should find all nodes whose name does not end with a three-character extension.

afaust
Legendary Innovator

The question isn't necessarily which UI you use (Aikau faceted search or Node Browser, for instance), but whether the search services support this type of query. The problem with a wildcard-based approach in FTS is that, by design, it will only scale to a certain number of documents in the system. This is a result of how the query is translated to the underlying Lucene system in SOLR. Also, the pattern *.??? assumes that all extensions are exactly three letters long, which might have been the standard in the old DOS 8.3 world, but all modern MS Office extensions have four letters.
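The three-letter limitation can be checked locally with a minimal sketch. It assumes that Python's shell-style `fnmatch` wildcards (`*` for any run of characters, `?` for exactly one) behave like the FTS wildcards for this purpose; the file names and the helper `lacks_match` are made up for illustration and say nothing about how SOLR evaluates the query internally.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only for illustration.
names = ["notes.txt", "report.docx", "archive.tar.gz", "README"]


def lacks_match(name, pattern):
    """Return True when the name does NOT match the wildcard pattern."""
    return not fnmatchcase(name, pattern)


# Names that a negated *.??? pattern would report as "without extension".
three_char = [n for n in names if lacks_match(n, "*.???")]
print(three_char)  # ['report.docx', 'archive.tar.gz', 'README']
```

Only `README` really lacks an extension here; the `.docx` and `.gz` names are false positives of the three-character pattern.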

Without having done a similar query myself on a large document base (i.e. more than just a couple tens of thousands of documents), I would assume the best way to work with this is by doing a CMIS query using the LIKE operator on cmis:name. The reasoning behind this is that a CMIS query using LIKE can actually be applied against the database instead of the SOLR index, and thus is not limited by the index query rewrite restrictions. The only thing you need to ensure is that the additional indexes for transactional metadata queries have been applied on the database system.

mehe
Elite Collaborator

Hi Axel, I mentioned "Aikau" because it's the easiest way to test the FTS string. The query performs well on large document sets (tested with a repository of 1,000,000 documents), but paging through a large result set gets slower with each subsequent page.

It's true it finds only three character extensions, but is easy to adapt 🙂

I used ??? because I thought Solr would internally invert the query string (???.*), which would not be so expensive. Do you know if this is correct?

afaust
Legendary Innovator

I can't say how SOLR / Lucene handles this at a low level. I just remember running into maxBooleanClauses limits with Alfresco SOLR in the past, due to the way Alfresco rewrote wildcard queries before sending them off to the SOLR / Lucene layer. This may have changed in Alfresco 5.0 or later versions, though...

mehe
Elite Collaborator

Max boolean clauses should be no problem here; that limit hits you when using big "or" conjunctions. I hoped to eliminate that by using the '=' operator. (You see that I used 'hope'. What would we do without it 🙂)

cesarista
World-Class Innovator

Hi:

I tried this in Alfresco Search with a small set in a site: TYPE:"cm:content" AND !=cm:name:*.*

Then I played a little with the mimetype facet, considering the "Binary File (Octet Stream)", HTML and text mimetypes. I got a meaningful list, although not an exactly accurate one.

- What about a database query using LIKE?

- What about a recursive JavaScript function that checks the file names with a JS regex?

Regards.

--C.

afaust
Legendary Innovator

Recursive analysis via JavaScript is out of the question. It will be extremely slow, load too much data into memory (overwhelming the caches) and potentially lock rows/tables in the DB (some DB systems have lock escalation functionality that kicks in when too many rows are read in a transaction).

As I said, you could try CMIS SQL queries using LIKE against the DB. In that case you would basically only test for the presence of a dot, e.g. with a query whose WHERE clause is cmis:name NOT LIKE '%.%'...
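To make the NOT LIKE '%.%' idea concrete, here is a small sketch. The full query string is an assumption built around the WHERE clause from the post (the selected columns and the `cmis:document` base type are illustrative choices), and the helpers `like_to_regex` and `not_like` are hypothetical: they only emulate plain LIKE semantics (`%` for any run of characters, `_` for one character, no escape handling) on made-up names, without talking to a repository.

```python
import re

# Assumed full form of the suggested query; only the WHERE clause is from the thread.
CMIS_QUERY = (
    "SELECT cmis:objectId, cmis:name FROM cmis:document "
    "WHERE cmis:name NOT LIKE '%.%'"
)


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% and _ wildcards, no escapes) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def not_like(name, pattern):
    """Emulate `name NOT LIKE pattern` for a single value."""
    return like_to_regex(pattern).fullmatch(name) is None


# Hypothetical file names, chosen only for illustration.
names = ["notes.txt", "report.docx", "README", "Makefile", ".gitignore"]
no_dot = [n for n in names if not_like(n, "%.%")]
print(no_dot)  # ['README', 'Makefile']
```

Note that a name such as `.gitignore` contains a dot and is therefore treated as having an extension by this test.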

mehe
Elite Collaborator

I think you could use a recursive script, but only for admin purposes. The script would run for a long time, and the browser would probably hit a timeout error. There should be no lock problem, though, because the script only reads, which results in shared locks on the DB, and those do not cause lock escalation. BUT if there is any insert or update request on the dataset, it will block the script until the update/insert is completed.

But your CMIS variant is far better. Is there any restriction on the number of fetched results when executing a CMIS DB query, as there was in the early search service?

mehe
Elite Collaborator

Hi,

A last update for this one:

TYPE:"cm:content" !=cm:name:*.?* 

That did it for me. It found all files that lack an extension of at least one character. AND is implicit in the newer Alfresco versions.

The cm:content filter was a good idea; I missed it on the first shot (boaah... so many nodes without extension 🙂)

I used the slingshot search via an AngularJS SPA, so the count of documents without extension was available in milliseconds. That is a feature of Solr, which gives you the count of matches directly in the result header.
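The difference between the earlier *.* pattern and the *.?* pattern from this last update can be sketched locally. As an assumption, Python's shell-style `fnmatch` wildcards stand in for the FTS wildcards; the file names and the helper `unmatched` are invented for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only for illustration.
names = ["draft.", "notes.txt", "README", "report.docx", "v1.2 summary"]


def unmatched(pattern):
    """Return the names that do NOT match the wildcard pattern."""
    return [n for n in names if not fnmatchcase(n, pattern)]


# *.* only requires a dot somewhere, so a trailing dot counts as an extension.
print(unmatched("*.*"))   # ['README']

# *.?* requires at least one character after a dot.
print(unmatched("*.?*"))  # ['draft.', 'README']
```

Both patterns still treat a name like `v1.2 summary` as having an extension, since any dot followed by text satisfies them.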