
How to 'index' a docx - File

tomtom
Champ in-the-making
I've got a problem getting docx files "indexed".
Some documents work when I upload them to a site in Alfresco, but other docx files do not.

Does anybody have an idea how the indexing of DOCX files or older DOC files works?

I found out that "normal" Word documents like *.doc are indexed by Alfresco almost instantly (current index time is 1 min), but with Word documents like *.docx it doesn't work nearly as well.

3 REPLIES

progress
Champ in-the-making
Ha! I had to sign up to post this. I noticed you did not get an answer.

I am noticing this on just about every website I go to where this problem is posted.

There is either no answer, or there is an answer where the techie comes on and spits out mumbo jumbo and has the OP'er do a bunch of useless digging and copying and pasting and submitting.

At the end of the day, there is no solution to index a .docx file unless you have a degree in Windows OS coding and like to spend a few weeks reconfiguring your OS hidden files.

I am in the same situation. I got onto this path because I wanted to create an inexpensive payments report using MS Word 2010.

I ignorantly thought I had a brilliant idea: put unique text in each document to indicate whether a payment was made or not, put all those docs in a folder, and simply index the folder. Then, when I searched within that folder, I would get all the results for payments and non-payments.

Well here's the bitter truth for you:

No one and no software that I have researched so far can index a .docx file, because it is basically a zipped (that's right: .zip) file.

Detailed .docx Format Information

A .docx file is a replacement format for the traditional Microsoft Word .doc file format.

.docx has been the default file format in Microsoft Word since Office 2007. All documents in the new Office family are based on an open, standardized specification called Office Open XML.

One .docx file is actually a collection of many files, stored in an archive (a zip file).

For example, if you rename your .docx file to .zip, you can use it like any other zip file. After you do that, you can look inside and see the different files that make up the .docx file. If you extract it to the current folder, a number of files and directories appear.

At the root level, we have 3 folders: "_rels", "docProps" and "word". In addition, we have a file called [Content_Types].xml. The [Content_Types].xml file describes the contents of the zip package and is used internally by Word as a table of contents for further processing. The "_rels" folder holds a map of all the relationships within the package: which files are in it and how they relate to each other.
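To see that layout without renaming anything, here is a minimal Python sketch using only the standard library. The function name `list_docx_parts` is just something I made up for illustration; `zipfile` does not care about the file extension, so a .docx can be opened directly.

```python
import zipfile


def list_docx_parts(source):
    """Return the sorted names of the files stored inside a .docx package.

    `source` may be a file path or a binary file-like object
    (hypothetical helper, not part of Word or Alfresco).
    """
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())
```

Calling `list_docx_parts("report.docx")` on a real document (the file name is only an example) should show entries such as `[Content_Types].xml`, something under `_rels/`, `docProps/` and `word/`.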

Now, moving on to the "word" folder, we get to the actual content of the Word document. Inside it you can see a number of XML files. The most important XML file in the entire zip package is document.xml. Why? Because this is where the main text content, as you know it, is stored.
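Since the text lives in word/document.xml, a rough sketch of pulling it out with plain Python might look like the following. This is only an assumption-laden illustration, not how Alfresco or any indexer actually does it: the name `extract_docx_text` is mine, it only reads the main document part (no headers, footers or footnotes), and text inside text boxes may show up twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(source):
    """Return the body text of a .docx, one line per paragraph.

    `source` may be a file path or a binary file-like object
    (hypothetical helper for illustration only).
    """
    with zipfile.ZipFile(source) as package:
        xml_bytes = package.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; <w:t> holds the text of each run inside it.
    for para in root.iter(f"{{{W_NS}}}p"):
        pieces = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Once the text is out as a plain string, any ordinary search (for example, checking whether a marker word appears in the result) works on it.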

I am still searching and researching.

I think you and I will have to find something better, because all the techies out there are beating around the bush instead of giving a clear-cut YES, we can index your .docx file, or NO, we cannot!

Unless you want to dive deep into iFilter packs and other mumbo jumbo root configuration on your OS, we must come up with a better solution.

The .doc files are indexable. That format stopped being the default with Word 2007 and later.

So, I am going to try some of these gadgets like Copernic, SeekFast, Mythicsoft FileLocator, etc. on a .doc file and see what happens.

Until there is a clear and simple solution for the .docx "zipped" file that contains many other compressed files within it, then I think you and I will have to move on to another solution.

I am also leaning towards trying PDF files, but I need to get the hang of Acrobat and LiveCycle. And if I remember correctly, Acrobat 9 presents problems as well.

Also, if you are going to add interactive or dynamic content to your files, then this will throw another wrench in the indexing solution.

Good luck

progress
Champ in-the-making
OK my friend

I think I just came upon a solution.

Save your .docx files as .doc files and your indexing should be all right.

Go to File > Save As, select "Word 97-2003 Document" from the "Save as type" drop-down menu, and save it.

It will be saved as an "indexable" .doc file. Make sure you do not use any dynamic functions such as drop-down lists.

Also, if you put your text in objects like charts and such, they will not be indexed.

Your text must be right on the template and not inside of any object.

So far so good.

mrogers
Star Contributor
OpenOffice 3, and therefore Alfresco, should be able to index docx files. What's the problem? Is your "indexing" problem even related to Alfresco's indexes?