03-05-2012 10:59 AM
03-28-2012 10:27 AM
It appears that Acrobat is capable of running JavaScript (http://partners.adobe.com/public/developer/en/acrobat/sdk/AcroJSGuide.pdf), so it got me thinking that perhaps Alfresco Webscripts could read the PDF forms.
function writeToProperty() {
var fld = this.getField("dswf_clientName");
this.info.kcms_clientName = fld.value;
}
writeToProperty(); // call my function
This is by no means a complete list of the steps required to extract PDF form data into Alfresco metadata, but it should provide some direction for other developers.
04-05-2012 01:14 AM
04-09-2012 08:19 AM
function writeToProperty() {
var fld = this.getField("sourceFieldName");
this.info.targetPropertyName = fld.value;
}
writeToProperty(); // call my function
Save PDF Form using: Save As => Reader Extended PDF => Enable Additional Features…
05-03-2012 03:17 PM
try {
// New - Use a temp file so it can be parsed twice
tstream = TikaInputStream.get(stream, tmp);
tsFile = tstream.getFile();
// PDFBox can process entirely in memory, or can use a temp file
// for unpacked / processed resources
// Decide which to do based on if we're reading from a file or not already
if (tstream != null && tstream.hasFile()) {
// File based, take that as a cue to use a temporary file
scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
pdfDocument = PDDocument.load(tsFile, scratchFile);
} else {
// Go for the normal, stream based in-memory parsing
pdfDocument = PDDocument.load(tsFile);
}
// …snip code to cope with encrypted files…
metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
extractMetadata(pdfDocument, metadata);
// New - Now parse again but non-sequentially to retrieve any form field data
pdfFormDoc = PDDocument.loadNonSeq(tsFile, scratchFile);
extractFormFieldData(pdfFormDoc, metadata);
PDF2XHTML.process(pdfDocument, handler, metadata,
extractAnnotationText, enableAutoSpace,
suppressDuplicateOverlappingText, sortByPosition);
In addition to changing the parse() method above, a new method was added to process the AcroForm fields as follows:
private void extractFormFieldData(PDDocument document, Metadata metadata)
throws TikaException, IOException {
PDDocumentCatalog docCatalog = document.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
if (acroForm != null) {
List fldList = acroForm.getFields();
Iterator fIter = fldList.iterator();
while(fIter.hasNext()){
PDField field = (PDField)fIter.next();
addMetadata(metadata, field.getFullyQualifiedName(), field.getValue());
if (logger.isDebugEnabled())
{
String logMsg = "extracting: " + field.getFullyQualifiedName();
logMsg += " value: " + field.getValue();
logger.debug(logMsg);
}
}
}
}
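The reason the modified parse() spools the upload to a temp file is that an InputStream can only be consumed once, while the approach above needs two independent passes over the same bytes. The following is a minimal sketch of that idea using only the Java standard library. It does not use Tika or PDFBox, and the names TwoPassRead, spoolToTempFile and readPass are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TwoPassRead {

    // Copy a one-shot stream into a temp file so it can be opened repeatedly.
    // (Hypothetical helper; in the real parser TikaInputStream does the spooling.)
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdf-spool-", ".tmp");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    // Stand-in for one parser pass: read the whole file from the start.
    static String readPass(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream upload = new ByteArrayInputStream(
                "%PDF-1.4 example bytes".getBytes(StandardCharsets.UTF_8));
        Path tmp = spoolToTempFile(upload);
        try {
            String firstPass = readPass(tmp);   // e.g. the sequential parse
            String secondPass = readPass(tmp);  // e.g. the non-sequential parse
            System.out.println(firstPass.equals(secondPass)); // prints "true"
        } finally {
            Files.deleteIfExists(tmp);          // always remove the temp file
        }
    }
}
```

The trade-off is extra disk I/O for every document, which may matter on a busy repository.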
I'm sure that there are better ways of doing this, but I chose to use a temp file just to get it working.
08-02-2012 06:47 PM
08-03-2012 07:13 AM
public void parse(
InputStream stream, ContentHandler handler,
Metadata metadata, ParseContext context)
throws IOException, SAXException, TikaException {
PDDocument pdfDocument = null;
PDDocument pdfFormDoc = null;
TikaInputStream tstream = null;
File tsFile = null;
TemporaryResources tmp = new TemporaryResources();
RandomAccess scratchFile = null;
try {
// SMD - Use a temp file so it can be parsed twice
tstream = TikaInputStream.get(stream, tmp);
tsFile = tstream.getFile();
// PDFBox can process entirely in memory, or can use a temp file
// for unpacked / processed resources
// Decide which to do based on if we're reading from a file or not already
if (tstream != null && tstream.hasFile()) {
// File based, take that as a cue to use a temporary file
scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
// pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
pdfDocument = PDDocument.load(tsFile, scratchFile);
} else {
// Go for the normal, stream based in-memory parsing
// pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
pdfDocument = PDDocument.load(tsFile);
}
if (pdfDocument.isEncrypted()) {
String password = null;
// Did they supply a new style Password Provider?
PasswordProvider passwordProvider = context.get(PasswordProvider.class);
if (passwordProvider != null) {
password = passwordProvider.getPassword(metadata);
}
// Fall back on the old style metadata if set
if (password == null && metadata.get(PASSWORD) != null) {
password = metadata.get(PASSWORD);
}
// If no password is given, use an empty string as the default
if (password == null) {
password = "";
}
try {
pdfDocument.decrypt(password);
} catch (Exception e) {
// Ignore
}
}
metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
extractMetadata(pdfDocument, metadata);
// SMD - Now parse non-sequentially to retrieve any form field data
pdfFormDoc = PDDocument.loadNonSeq(tsFile, scratchFile);
extractFormFieldData(pdfFormDoc, metadata);
PDF2XHTML.process(pdfDocument, handler, metadata,
extractAnnotationText, enableAutoSpace,
suppressDuplicateOverlappingText, sortByPosition);
} finally {
if (pdfDocument != null) {
pdfDocument.close();
}
// pdfFormDoc is still null if loading failed before the second parse
if (pdfFormDoc != null) {
pdfFormDoc.close();
}
tmp.dispose();
}
}
/**
* Steve Deal - Added to parse PDF Form fields
*
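* @param document the parsed PDF whose AcroForm fields are read
* @param metadata the Tika metadata object that receives one entry per form field
* @throws TikaException
* @throws IOException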
*/
private void extractFormFieldData(PDDocument document, Metadata metadata)
throws TikaException, IOException {
PDDocumentCatalog docCatalog = document.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
if (acroForm != null) {
List fldList = acroForm.getFields();
Iterator fIter = fldList.iterator();
while(fIter.hasNext()){
PDField field = (PDField)fIter.next();
addMetadata(metadata, field.getFullyQualifiedName(), field.getValue());
if (logger.isDebugEnabled())
{
String logMsg = "extracting: " + field.getFullyQualifiedName();
logMsg += " value: " + field.getValue();
logger.debug(logMsg);
}
}
}
}
08-06-2012 01:03 AM
08-06-2012 02:44 PM
The only solution we have found is to rename the out-of-the-box (OOTB) jar files and drop the modified jar files into tomcat/webapps/alfresco/WEB-INF/lib.
08-06-2012 06:40 PM
INFO: Adding 'file:/opt/alfresco-4.0.d/alf_data/solr/lib/tika-parsers-1.1-20111128.jar' to classloader
07/08/2012 8:18:22 AM org.apache.solr.core.SolrResourceLoader replaceClassLoader
Which, I know, is in the solr lib, not the alfresco lib. My question is: does this indicate a problem? Should SOLR be using the new jars too?
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
<bean id="extracter.PDFBox" class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter" parent="baseMetadataExtracter" >
<property name="inheritDefaultMapping">
<value>true</value>
</property>
<property name="mappingProperties">
<bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
<property name="location">
<value>classpath:alfresco/extension/custom-pdfbox-extractor-mappings.properties</value>
</property>
</bean>
</property>
</bean>
</beans>
and custom-pdfbox-extractor-mappings.properties:
# Namespace Definitions
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
namespace.prefix.my=my.companyName.root
#Mapping Definitions
testData=my:testData
<aspect name="my:testAspect">
<title>Test Aspect</title>
<properties>
<property name="my:testData">
<type>d:text</type>
</property>
</properties>
</aspect>
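To make the relationship between the mapping file and the aspect more concrete, here is a minimal sketch of how an entry such as testData=my:testData can be read: the prefix is looked up under namespace.prefix.* and combined with the local name. This is a simplified illustration and not Alfresco's actual implementation. It assumes each mapping names exactly one target property, and the class MappingSketch and method resolve are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class MappingSketch {

    static final String PREFIX_KEY = "namespace.prefix.";

    // Resolve an extracted field name to a "{namespaceUri}localName" string,
    // or return null when the properties hold no usable mapping for it.
    static String resolve(Properties props, String fieldName) {
        String target = props.getProperty(fieldName);   // e.g. "my:testData"
        if (target == null) {
            return null;
        }
        int colon = target.indexOf(':');
        if (colon < 0) {
            return null;                                // not in prefix:name form
        }
        String uri = props.getProperty(PREFIX_KEY + target.substring(0, colon));
        if (uri == null) {
            return null;                                // prefix was never declared
        }
        return "{" + uri + "}" + target.substring(colon + 1);
    }

    public static void main(String[] args) throws IOException {
        // Same entries as the mapping file shown above.
        Properties props = new Properties();
        props.load(new StringReader(
                "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0\n"
              + "namespace.prefix.my=my.companyName.root\n"
              + "testData=my:testData\n"));
        System.out.println(resolve(props, "testData"));
        // prints {my.companyName.root}testData
    }
}
```

If the extracted field name has no entry in the properties file, or its prefix is not declared, the sketch returns null, so a missing mapping or a misspelled prefix is a reasonable first thing to check when a value does not show up.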