solr4 folder size and WFSTInputIterator files are very large

mattjourdan
Champ in-the-making

Hello,

 

I use Alfresco Community 5.0.d and I find that /alfresco/alf_data/solr4 is very large. I also have very large WFSTInputIterator files (44 GB).

 

Is this normal? Is it possible to reduce the size?

 

Some details on folder sizes:
/alfresco/alf_data/contentstore/ : 210 GB
/alfresco/alf_data/solr4/index/workspace/ : 111 GB
/alfresco/alf_data/solr4/index/archive/ : 14 GB
/alfresco/alf_data/solr4/content/ : 29 GB

 

The following values come from the database :
number of nodes in the store workspace : 753185
number of nodes in the store archive : 220572
number of transactions in the repository : 490930
number of ACLs in the repository : 2553
number of ACL transactions : 43765

 

Thanks,

 

Matthieu

1 ACCEPTED ANSWER

cesarista
World-Class Innovator

Hi:

As Axel commented, the size of the indices depends on the number of documents and nodes and on their related metadata. If your content is mainly text-based (Office, PDF, HTML...), the indices can take up a substantial part of your storage compared to the size of the contentstore. This may not be a problem now, but it can become one as your repository grows. The contentstore may be located on an NFS mount point, while the SOLR indices are usually kept on local disk (or on faster disks for performance), which is generally more expensive. Besides, if the disk holding your indices is slow, you will have I/O problems when indexing and searching.

If your index contains lots of deleted documents, you can make it smaller with a full reindex (this is the case when maxDoc is much bigger than numDocs in your searchers). There are other indexing strategies for making your indices smaller:

  • disabling full text in SOLR (if this is possible for your use case)
  • disabling OCR processes (if any, and also, if possible)
  • disabling automatic metadata extractors (for example, EXIF metadata in images...)
  • controlling what gets indexed with the cm:indexControl aspect in Alfresco.

And then reindex. Disabling the archive searcher in SOLR may also help, because as your repository grows, your SOLR memory requirements grow too.
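
To make the maxDoc versus numDocs check concrete, here is a minimal Python sketch. It assumes you have read the two counters yourself from the core statistics of each searcher; the function names and the 30% threshold are only illustrative choices, not values recommended by Alfresco or SOLR.

```python
def deleted_fraction(num_docs: int, max_doc: int) -> float:
    """Share of index entries that belong to deleted documents (0.0 to 1.0).

    maxDoc still counts documents that were deleted but not yet merged away,
    numDocs counts only live documents, so the difference is the deleted part.
    """
    if num_docs < 0 or max_doc < num_docs:
        raise ValueError("expected 0 <= numDocs <= maxDoc")
    if max_doc == 0:
        return 0.0
    return (max_doc - num_docs) / max_doc


def reindex_looks_worthwhile(num_docs: int, max_doc: int,
                             threshold: float = 0.3) -> bool:
    # threshold is an arbitrary illustrative cut-off (assumption), tune it yourself
    return deleted_fraction(num_docs, max_doc) >= threshold


if __name__ == "__main__":
    # Hypothetical counters, not taken from the installation discussed here.
    print(deleted_fraction(750, 1000))
    print(reindex_looks_worthwhile(750, 1000))
```

If the fraction is small, a full reindex will probably not recover much space, and the other strategies in the list are more likely to help.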

Finally, regarding the WFSTInputIterator* files in /tomcat/temp, it is common to deactivate the SOLR suggester in <solrRootDir>/workspace-SpacesStore/conf/solrcore.properties (solr.suggester.enabled=false) to avoid these huge files in tomcat/temp. You can then clean tomcat/temp and restart Alfresco. In fact, this is recommended when migrating from Alfresco 4 to Alfresco 5.
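
If you prefer to script the change rather than edit the file by hand, something like the following Python sketch could do it. It only rewrites the text of a properties file; the helper name set_property is hypothetical, and it assumes the simple key=value form (it does not handle ":" separators or line continuations).

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties-file text with key set to value (appended if absent)."""
    out = []
    found = False
    for line in text.splitlines():
        stripped = line.strip()
        # Comments and blank lines are copied through unchanged.
        if stripped and not stripped.startswith(("#", "!")):
            name = stripped.split("=", 1)[0].strip()
            if name == key:
                if not found:
                    out.append(f"{key}={value}")
                    found = True
                continue  # drop any repeated definition of the same key
        out.append(line)
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Sample content for illustration only, not a real solrcore.properties.
    sample = "# sample\nsolr.suggester.enabled=true\n"
    print(set_property(sample, "solr.suggester.enabled", "false"), end="")
```

To apply it, read the solrcore.properties of the core (with Alfresco stopped and a backup taken), pass its content through set_property, and write the result back before cleaning tomcat/temp and restarting.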

Regards.

--C.


3 REPLIES

afaust
Legendary Innovator

The question "is this normal" cannot be answered universally. It depends heavily on the amount of metadata associated with your nodes as well as on the scope and size of the full text that is indexed. Your sizes are definitely not so high that I would consider them extreme or abnormal.

Typically the index may fragment over time, so doing a complete reindex might help reduce the size of the indices. Alfresco also provides various templates for SOLR cores, of which the "rerank" template is said to produce more efficient indices. Last but not least, you can technically reduce the amount of full text that is indexed or optimize the amount of metadata you maintain...

mattjourdan
Champ in-the-making

Hi,

Thanks a lot for your answers.

Matthieu