topic Re: CDATA on xml extraction are skipped in Alfresco Archive

CDATA on xml extraction are skipped

jsc — Tue, 19 Aug 2008 14:45:37 GMT

Hi,

I extract metadata from XML like this:

<root> <text><![CDATA[Date de l'événement : 01/07/2008]]></text></root>

When I try to extract /root/text/text() in Alfresco, I expect to get <![CDATA[Date de l'événement : 01/07/2008]]>, whereas I get nothing. The co

Re: CDATA on xml extraction are skipped

pmonks — Wed, 20 Aug 2008 05:55:55 GMT

Actually, that XPath expression should return "Date de l'événement : 01/07/2008" (the text inside the CDATA section, but without the CDATA markers themselves). As described at http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#omitted, CDATA sections are not part of the XML infoset so are "invisible" to XPath (although the text within them is visible and should be accessible).

That said, it sounds like you're not getting the text inside the CDATA section either, which sounds like a bug. Are you able to reduce this to a small reproducible test case? If so it'd be worth raising in JIRA (http://issues.alfresco.com/).
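
To illustrate the point about CDATA markers, here is a minimal sketch in Python (not Alfresco's own extractor, which is Java). It uses the standard library's ElementTree on the single-line sample from the first post; the function name `extract_text` is made up for this example. The parser hands back the character data inside the CDATA section, without the markers.

```python
import xml.etree.ElementTree as ET

# Single-line sample from the first post in this thread.
SAMPLE = "<root> <text><![CDATA[Date de l'événement : 01/07/2008]]></text></root>"


def extract_text(xml_source: str) -> str:
    """Return the character data of /root/text (hypothetical helper).

    ElementTree has no text() step, so the element's .text attribute
    stands in for what /root/text/text() would select here.
    """
    root = ET.fromstring(xml_source)
    node = root.find("text")  # path is relative to the <root> element
    if node is None:
        return ""
    return node.text or ""


value = extract_text(SAMPLE)
print(value)             # the text inside the CDATA section
print("CDATA" in value)  # the markers themselves are not part of the value
```

This only shows what a conforming parser reports; it says nothing about how Alfresco evaluates the expression internally.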

Cheers,
Peter

Re: CDATA on xml extraction are skipped

jsc — Wed, 20 Aug 2008 07:53:54 GMT

You're right about CDATA sections.

I made a mistake: the problem occurs when there is a line break before the CDATA section. The example I provided works fine, whereas this one does not:


<root>
    <text>
<![CDATA[Date de l'événement : 01/07/2008]]></text>
</root>

Re: CDATA on xml extraction are skipped

pmonks — Wed, 20 Aug 2008 16:43:47 GMT

Just to clarify, if there's a newline you don't get any text at all? Or you get the text but without the leading newline (which is what I'd expect to happen)?

I'm not entirely sure what the Infoset is supposed to look like if there's a leading newline prior to a CDATA block - it would be worth verifying that that's well formed & valid XML (I assume it is, but don't know for sure).

Cheers,
Peter

Re: CDATA on xml extraction are skipped

jsc — Thu, 21 Aug 2008 10:26:36 GMT

If there's a newline, I get only the whitespace (spaces and the newline character), nothing more.
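
One possible explanation, which nobody in this thread has confirmed, is that the extraction works on a DOM tree and keeps only the first node matched by text(). In a DOM, the whitespace before a CDATA section and the CDATA section itself can be two separate child nodes. The sketch below uses Python's xml.dom.minidom purely as a stand-in to show that structure; the helper names are made up, and Alfresco's Java extractor may behave differently.

```python
from xml.dom import minidom, Node

# Multi-line sample from this thread: a newline sits between
# <text> and the CDATA section.
SAMPLE = """<root>
    <text>
<![CDATA[Date de l'événement : 01/07/2008]]></text>
</root>"""

NODE_NAMES = {Node.TEXT_NODE: "text", Node.CDATA_SECTION_NODE: "cdata"}


def text_children(xml_source):
    """Return (kind, data) for each text or CDATA child of <text>."""
    doc = minidom.parseString(xml_source)
    text_el = doc.getElementsByTagName("text")[0]
    return [
        (NODE_NAMES[n.nodeType], n.data)
        for n in text_el.childNodes
        if n.nodeType in NODE_NAMES
    ]


def first_node_value(xml_source):
    """Value of the first matching node only (the suspected behaviour)."""
    children = text_children(xml_source)
    return children[0][1] if children else ""


def joined_value(xml_source):
    """All text and CDATA children concatenated, then trimmed."""
    return "".join(data for _, data in text_children(xml_source)).strip()


for kind, data in text_children(SAMPLE):
    print(kind, repr(data))
print(repr(first_node_value(SAMPLE)))  # whitespace only
print(repr(joined_value(SAMPLE)))      # the text inside the CDATA section
```

If the extractor does behave this way, concatenating all matched nodes (or evaluating the expression as a string over the whole element) and trimming the result would recover the expected value.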

Re: CDATA on xml extraction are skipped

pmonks — Thu, 21 Aug 2008 16:23:22 GMT

OK, in that case I think the first step is to confirm that a leading newline is allowed prior to a CDATA section. If not, the XML is invalid; if so, then it sounds like a bug and should be raised in JIRA (http://issues.alfresco.com/).
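
A quick way to run that check is to feed the document to any conforming XML parser and see whether it is rejected. The sketch below uses Python's standard ElementTree for this; `is_well_formed` is a made-up helper name. It only tests well-formedness, since validity would additionally need a DTD or schema, which this thread does not mention.

```python
import xml.etree.ElementTree as ET

# Sample from this thread with a newline before the CDATA section.
WITH_NEWLINE = """<root>
    <text>
<![CDATA[Date de l'événement : 01/07/2008]]></text>
</root>"""

# Deliberately broken document for comparison: <text> is never closed.
BROKEN = "<root><text></root>"


def is_well_formed(xml_source: str) -> bool:
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(xml_source)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(WITH_NEWLINE))
print(is_well_formed(BROKEN))
```

The parser accepts the sample with the leading newline, because the XML 1.0 specification allows a CDATA section anywhere character data may occur, so the newline before it is ordinary character data.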

Cheers,
Peter

Re: CDATA on xml extraction are skipped

jsc — Fri, 22 Aug 2008 13:30:13 GMT

OK I raised a bug in JIRA.

Is there a workaround to reformat the uploaded XML so that leading and trailing whitespace in the content is removed? I know there is a content transformer, but I do not want to generate a new file; I just want to work on the uploaded file.