Dear Michael,
I once tried to read an XML file that had been generated from a PDF. The tags contained a lot of style attributes, so my Oracle database could not store the original XML file.
Since I only wanted to extract some table data from that file, I wrote a little bash script to strip all the unnecessary lines and attributes:
#!/bin/bash
set -x
if [ "$1" != '' ]; then
PDatei=$1
else
# parameter missing
echo "Parameter missing: file name of the XML file to simplify (without .xml)" && exit 1
fi
egrep -i "<Data|<Cell|<Row|</Data|</Cell|</Row" "$PDatei.xml" > "$PDatei-1.xml" || exit 2
sed 's/<\([A-Za-z]*\) [^\/>]*\(\/>\|>\)/<\1\2/g' "$PDatei-1.xml" > "$PDatei-erg.xml" || exit 3
The egrep line keeps only the lines containing Data, Cell, or Row tags. This only works if the data in the file contain no line feeds.
The sed line then strips all attributes from the remaining tags.
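To illustrate what the sed expression does, here is a run on a made-up sample line (the attribute names here are my invention; the real export may look different):

```shell
# Hypothetical input line, similar to what a spreadsheet-style XML export contains
in='<Cell ss:StyleID="s21"><Data ss:Type="String">Hello</Data></Cell>'
# Drop everything between the tag name and the closing ">" (or "/>")
echo "$in" | sed 's/<\([A-Za-z]*\) [^\/>]*\(\/>\|>\)/<\1\2/g'
# prints: <Cell><Data>Hello</Data></Cell>
```

Note that the \| alternation is a GNU sed extension; other sed implementations may need the pattern adjusted.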
After cleaning up the file with this script, I was able to load the XML into an XML VARA.
I do not use this approach any more, though, since I found a way to get the original data as CSV - that is much easier and more reliable to process!
Regards, Nicole
PS: I forgot the <Table> tag in the script's filter. I kept that one too, since it surrounds the rows.
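For completeness, the filter line with the <Table> tags added would presumably look like this (demonstrated here on a made-up sample file, since I no longer have the original):

```shell
# Sketch of the corrected filter line, now also keeping <Table> and </Table>
PDatei=example   # hypothetical file-name stem, as in the script above
printf '<Table>\n<Row>\n<Cell><Data>x</Data></Cell>\n<Style/>\n</Row>\n</Table>\n' > "$PDatei.xml"
egrep -i "<Table|<Data|<Cell|<Row|</Table|</Data|</Cell|</Row" "$PDatei.xml" > "$PDatei-1.xml" || exit 2
cat "$PDatei-1.xml"   # the <Style/> line is filtered out, the other five lines remain
rm "$PDatei.xml" "$PDatei-1.xml"
```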