Project: Hesla Jednoty bratrské/2010/los10odt-make
Scripts for extracting the text of the Hesla from ODT (for the year 2010)
los10odt-make
# Converts the file Losung.odt to plain text:
#--------------------------------------------
# First, we use the OpenOffice word processor to save
# the original MS Word file Losung.doc as an ODT file.
# An ODT file is a zip archive containing several files.
# We use only the file 'content.xml'.
yy=10 #year
echo Unzipping the ODT file...
unzip ../w01-Losungen/Losungen2010.odt content.xml
echo Inserting new-lines before every XML-tag...
perl -pe 's/</\n</g' content.xml > los${yy}-01.xml
echo Inserting style-names at the beginning of every line...
perl -w styles.pl los${yy}-01.xml > los${yy}-02.xml
echo Stripping all xml tags...
The script above calls:
strip.pl
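#! /usr/bin/perl -w
# Usage: perl -w strip.pl cont_nl_sty.xml > los07-10.txt
# Strips all tags and drops lines that end up empty
while (<>) {
    s/<.*?>//g;     # non-greedy: removes each tag separately
    next if /^$/;   # skip lines with nothing left
    print;
}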
styles.pl
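# Adjusts the Losung 2010 ODT XML by adding style names in curly braces.
# Note: (.*) is greedy, so this relies on one tag per line with
# text:style-name as the last attribute of the tag.
while (<>) {
    if (/text:style-name="(.*)">/) {
        print "$`";     # text before the match
        print "$&";     # the matched attribute and the closing '>'
        print "{$1}";   # the style name in curly braces
        print "$'";     # rest of the line
    }
    else { print; }
}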