Hi Benjamin!
At the moment I am developing an Alfresco extension to extract MAGE-ML metadata from .xml files coming from microarray experiments, and I think the problem is similar to yours. I am going to request the creation of a new Forge project, on which I will be supervised by an Alfresco developer.
I am finishing some documentation on the problem for the wiki, so if you are interested I can provide you with some specs on how to create the extractor.
However, I am not sure that the approach suggested to me is the only one or, above all, the best one for every possible extractor. In my case the extractor is very specialized and will work only for specific types of XML files.
From your short explanation, your needs seem very similar to mine, so we can talk about it a little if you want.
Please let me know.
All the best,