What is Best Practice for overriding Tika/PDFBox jar files?

devodl
Champ in-the-making
To parse PDF form fields I had to modify and replace the Tika parser and PDFBox jar files that ship with Alfresco (read more here: https://forums.alfresco.com/en/viewtopic.php?f=7&t=43406).

More specifically, the modifications have to override two files that are deployed to tomcat/webapps/alfresco/WEB-INF/lib: tika-parsers-1.1-20120208.jar and pdfbox-1.6.0.jar.

I initially placed the modified jar files in tomcat/shared/lib, but they were ignored by Tomcat/Alfresco. When I instead replace the files that Alfresco deploys in tomcat/webapps/alfresco/WEB-INF/lib with the modified jar files, the modifications are used and PDF form fields are parsed and mapped to the metadata fields in a content model.

Manually replacing the default jar files with the modified jar files is not the right approach. I really don't want to build an alfresco.war file with the modified jar files.

Question
Where can I place the modified jar files so that they override the default files deployed by Alfresco?
4 REPLIES 4

mrogers
Star Contributor
The tomcat lib dir is the correct place for it. Did you configure the class loader on Tomcat?

devodl
Champ in-the-making
Thanks for the response.
mrogers wrote: "The tomcat lib dir is the correct place for it. Did you configure the class loader on Tomcat?"
I suspect you mean tomcat/shared/lib and the shared.loader property in catalina.properties.

When I first tested with Community 4.0.c (upgraded to Tomcat 6.0.35) I had configured shared.loader properly.
I then installed a trial version of Enterprise 4.0.1 (with its default configuration), retested, and got the same results. When I place jar files from other projects (e.g. content models, workflows) into tomcat/shared/lib they are loaded just fine. Here are the contents of the catalina.properties file:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#
# List of comma-separated packages that start with or equal this string
# will cause a security exception to be thrown when
# passed to checkPackageAccess unless the
# corresponding RuntimePermission ("accessClassInPackage."+package) has
# been granted.
package.access=sun.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.,sun.beans.
#
# List of comma-separated packages that start with or equal this string
# will cause a security exception to be thrown when
# passed to checkPackageDefinition unless the
# corresponding RuntimePermission ("defineClassInPackage."+package) has
# been granted.
#
# by default, no packages are restricted for definition, and none of
# the class loaders supplied with the JDK call checkPackageDefinition.
#
package.definition=sun.,java.,org.apache.catalina.,org.apache.coyote.,org.apache.tomcat.,org.apache.jasper.

#
#
# List of comma-separated paths defining the contents of the "common"
# classloader. Prefixes should be used to define what is the repository type.
# Path may be relative to the CATALINA_HOME or CATALINA_BASE path or absolute.
# If left as blank,the JVM system loader will be used as Catalina's "common"
# loader.
# Examples:
#     "foo": Add this folder as a class repository
#     "foo/*.jar": Add all the JARs of the specified folder as class
#                  repositories
#     "foo/bar.jar": Add bar.jar as a class repository
common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar

#
# List of comma-separated paths defining the contents of the "server"
# classloader. Prefixes should be used to define what is the repository type.
# Path may be relative to the CATALINA_HOME or CATALINA_BASE path or absolute.
# If left as blank, the "common" loader will be used as Catalina's "server"
# loader.
# Examples:
#     "foo": Add this folder as a class repository
#     "foo/*.jar": Add all the JARs of the specified folder as class
#                  repositories
#     "foo/bar.jar": Add bar.jar as a class repository
server.loader=

#
# List of comma-separated paths defining the contents of the "shared"
# classloader. Prefixes should be used to define what is the repository type.
# Path may be relative to the CATALINA_BASE path or absolute. If left as blank,
# the "common" loader will be used as Catalina's "shared" loader.
# Examples:
#     "foo": Add this folder as a class repository
#     "foo/*.jar": Add all the JARs of the specified folder as class
#                  repositories
#     "foo/bar.jar": Add bar.jar as a class repository
# Please note that for single jars, e.g. bar.jar, you need the URL form
# starting with file:.
shared.loader=${catalina.base}/shared/classes,${catalina.base}/shared/lib/*.jar

#
# String cache configuration.
tomcat.util.buf.StringCache.byte.enabled=true
#tomcat.util.buf.StringCache.char.enabled=true
#tomcat.util.buf.StringCache.trainThreshold=500000
#tomcat.util.buf.StringCache.cacheSize=5000
I added debug statements to the modified org.apache.tika.parser.pdf.PDFParser class, and the messages appear in the log files only when the modified tika-parser-1.1.jar is placed into tomcat/webapps/alfresco/WEB-INF/lib.
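A debug statement can also report which jar a class was actually loaded from, which is more direct than watching for log messages. The following is a minimal sketch, not Alfresco or Tika code: the class name WhichJar and the method locationOf are made up for illustration, and it relies only on the standard Class.getProtectionDomain() and Class.getClassLoader() calls.

```java
import java.security.CodeSource;

final class WhichJar {
    /** Returns the jar or directory a class was loaded from, plus the loader that defined it. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes defined by the bootstrap loader have no code source and a null class loader.
        String where = (source == null || source.getLocation() == null)
                ? "bootstrap (no code source)"
                : source.getLocation().toString();
        return where + " via " + type.getClassLoader();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name as the first argument; defaults to a JDK class.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " <- " + locationOf(Class.forName(name)));
    }
}
```

Inside the webapp, the same helper could be called from the existing debug statement, for example with PDFParser.class as the argument, to see whether the copy in WEB-INF/lib or the one in shared/lib was used.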

The jar files I am working with are:
Original - deployed by alfresco.war into tomcat/webapps/alfresco/WEB-INF/lib
tika-parsers-1.1-20120208.jar
pdfbox-1.6.0.jar

Modified - deployed manually into tomcat/shared/lib
tika-parser-1.1.jar
pdfbox-1.7.0-SNAPSHOT.jar

If I rename the original jar files in WEB-INF/lib so that only the modified jar files in tomcat/shared/lib are available, Tomcat logs the following error at startup:
 2012-05-23 06:39:04,020  ERROR [web.context.ContextLoader] [main] Context initialization failed
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'extracter.PDFBox' defined in file [C:\apps_x64\Alfresco\tomcat\shared\classes\alfresco\extension\custom-metadata-extractors-context.xml]: Instantiation of bean failed; nested exception is java.lang.NoClassDefFoundError: org/apache/tika/parser/AbstractParser
   at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateBean(AbstractAutowireCapableBeanFactory.java:965)
I investigated the Tomcat 6 class loader further and found http://pragmaticjava.blogspot.com/2009/01/tomcat-6-and-class-loading.html, which is a clarification of http://tomcat.apache.org/tomcat-6.0-doc/class-loader-howto.html. It says:
Tomcat creates a class loader for every webapp that is deployed in its instance. This class loader loads classes under WEB-INF/classes and WEB-INF/lib folder. It is for these class loaders where the delegation model deviates, thanks to the Servlet Specification which states as follows: "It is recommended also that the [web] application class loader be implemented so that classes and resources packaged within the WAR are loaded in preference to classes and resources residing in container-wide library JARs."
However the above specification cannot override the Java standard delegation model of delegating to Bootstrap and System class loaders. It only is used to override the parent-child relationships that are introduced by Tomcat - ie. Common, Shared and WebappX class loaders.
So when an application requests a class, the class loading hierarchy is as follows:
  1. The bootstrap class loader looks in the core Java classes folders.

  2. The system class loader looks in the $CATALINA_HOME/bin/bootstrap.jar and $CATALINA_HOME/bin/tomcat-juli.jar

  3. The WebAppX class loader looks in WEB-INF/classes and then WEB-INF/lib

  4. The common class loader looks in $CATALINA_HOME/lib folder.

  5. The shared class loader looks in $CATALINA_HOME/shared/classes and $CATALINA_HOME/shared/lib if the shared.loader property is set in conf/catalina.properties file.
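The lookup order in the list above can be modelled with a toy example. This is only a sketch of the quoted order, not Tomcat code: the names LoaderOrderSketch, Step, resolve and quotedOrder are invented, and the jar contents are assumptions taken from the file names mentioned in this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class LoaderOrderSketch {
    static final String PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser";

    /** One step in the lookup order: a loader name and the classes its jars provide (class -> jar). */
    static final class Step {
        final String loader;
        final Map<String, String> classToJar;

        Step(String loader, Map<String, String> classToJar) {
            this.loader = loader;
            this.classToJar = classToJar;
        }
    }

    /** Walks the steps in order and returns "loader:jar" for the first hit, or null if none. */
    static String resolve(String className, List<Step> order) {
        for (Step step : order) {
            String jar = step.classToJar.get(className);
            if (jar != null) {
                return step.loader + ":" + jar;
            }
        }
        return null;
    }

    /** The five steps from the quoted list, with both copies of the Tika parser jar present. */
    static List<Step> quotedOrder() {
        Map<String, String> none = Collections.<String, String>emptyMap();
        return Arrays.asList(
                new Step("bootstrap", none),
                new Step("system", none),
                new Step("WebappX", Collections.singletonMap(PDF_PARSER, "tika-parsers-1.1-20120208.jar")),
                new Step("common", none),
                new Step("shared", Collections.singletonMap(PDF_PARSER, "tika-parser-1.1.jar")));
    }

    public static void main(String[] args) {
        // The WebappX step comes before the shared step, so the copy in WEB-INF/lib is found first.
        System.out.println(resolve(PDF_PARSER, quotedOrder()));
    }
}
```

In this model the copy in shared/lib is never reached while the original jar is still in WEB-INF/lib, which matches the behaviour described in this post.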
Apparently the Tomcat context (alfresco) cannot initialize unless all of the jar files are present in WEB-INF/lib, which is why I saw the error. However, once the context loads without error, the modified jar files in tomcat/shared/lib are ignored.
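One possible reading of the NoClassDefFoundError is a visibility problem rather than a missing file. In Tomcat 6 the shared loader is a parent of each webapp loader, and a class defined by a parent loader cannot link against classes that exist only in a child. The sketch below models that rule under two assumptions that the thread does not confirm: that AbstractParser comes from a jar (tika-core) that stayed in WEB-INF/lib, and that PDFParser was available only from shared/lib. The names VisibilitySketch, Loader and canSee are invented, and the model deliberately ignores the webapp-first search order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class VisibilitySketch {
    static final String ABSTRACT_PARSER = "org.apache.tika.parser.AbstractParser";
    static final String PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser";

    /** A toy class loader: a name, a parent, and the classes found in its own jars. */
    static final class Loader {
        final String name;
        final Loader parent;
        final Set<String> classes;

        Loader(String name, Loader parent, String... classes) {
            this.name = name;
            this.parent = parent;
            this.classes = new HashSet<String>(Arrays.asList(classes));
        }

        /** A class defined by this loader can link only against classes here or in an ancestor. */
        boolean canSee(String className) {
            for (Loader l = this; l != null; l = l.parent) {
                if (l.classes.contains(className)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Assumed scenario: PDFParser only in shared/lib, AbstractParser only in WEB-INF/lib. */
    static Loader[] scenario() {
        Loader common = new Loader("common", null);
        Loader shared = new Loader("shared", common, PDF_PARSER);
        Loader webapp = new Loader("WebappX", shared, ABSTRACT_PARSER);
        return new Loader[] {shared, webapp};
    }

    public static void main(String[] args) {
        Loader[] loaders = scenario();
        // The webapp can reach PDFParser through its parent chain.
        System.out.println("WebappX sees PDFParser: " + loaders[1].canSee(PDF_PARSER));
        // PDFParser is defined by the shared loader, which cannot look down into WEB-INF/lib.
        System.out.println("shared sees AbstractParser: " + loaders[0].canSee(ABSTRACT_PARSER));
    }
}
```

If this reading is right, every jar that the modified classes depend on would have to be visible from the same loader or one of its parents, not only the two modified jars.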

Sign me: puzzled

chrisokelly
Champ on-the-rise
Hi Steve,

Did you find a way around this? I read your initial post with some interest, as we need to extract PDF form data ourselves, and I was sad to see you hit a problem with overriding the OOTB jars.

Chris

devodl
Champ in-the-making
Chris,
Unfortunately no, the modified tika-parser and pdfbox jar files we placed in tomcat/shared/lib do not override the OOTB tika-parser and pdfbox jar files that reside in tomcat/webapps/alfresco/WEB-INF/lib.

Apparently the shared class loader is configured correctly because the jar files that define our custom content models are placed into tomcat/shared/lib and they work fine. However it is not clear if the modified tika-parser and pdfbox jar files are picked up by the shared class loader because they do not override the OOTB jar files.
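One way to check whether the shared class loader exposes the modified jars at all is to ask a class loader for every copy of a class file it can find. The sketch below uses only the standard ClassLoader.getResources call; the names DuplicateClassFinder, toResourcePath and findCopies are invented for illustration, and how Tomcat's webapp loader orders the results is not something this thread confirms.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

final class DuplicateClassFinder {
    /** Converts a fully qualified class name into the resource path of its class file. */
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Lists every location where the given loader (or one of its parents) can find the class file. */
    static List<URL> findCopies(String className, ClassLoader loader) {
        try {
            return Collections.list(loader.getResources(toResourcePath(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.tika.parser.pdf.PDFParser";
        // Inside a webapp the context class loader is normally the webapp loader.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        for (URL copy : findCopies(name, loader)) {
            System.out.println(copy);
        }
    }
}
```

Run from code inside the alfresco webapp, two URLs for PDFParser (one under WEB-INF/lib and one under shared/lib) would suggest that the shared loader does pick up the modified jar and that it is simply shadowed by the OOTB copy.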

The only solution we have found is to rename the OOTB jar files and drop the modified jar files into tomcat/webapps/alfresco/WEB-INF/lib.

Please post back here if you have different results in overriding the OOTB jar files.
Steve