Trimming HTML tags with regular expressions in Geany

Hey, Pingers!

Today I had a crazy amount of HTML generated by MS Word that I needed to post on our corporate website. The problem is that MS Word adds a dramatic amount of junk to your page when it exports HTML. Pasted into the website, the text showed up in multiple colors, with multiple language tags and all kinds of crap stuffed into attributes.

Editing this by hand was not a nice option, given the amount of content and the broken MS Word formatting.

So, take a look at my workaround…

I quickly looked through the text and saw several types of span tags. I needed to address only those junk tags which looked like:

<span lang="kk-KZ"> (the file used 3 different language tags, although the document only contained text in Russian (RU))
<span style="color: #000000;"> (many, many colors... we don't want them)
<span style="text-decoration: underline;"> (added by whoever composed the text, but we don't wanna have this underlined on a website)

I used the search and replace function with regular expressions enabled. I am a total noob in regex, so it was a hell of a journey for me to select them =)

In the end I came up with this single line that did the job of selecting them:

<span ([[:alpha:]|[:blank:]|[:digit:]]|-|=|#|:|;|")+>

Let me break down this line…

<span ()+>

We start by explicitly matching ‘<span ’ (with a trailing space), so a bare <span> tag with no attributes is not selected. Then comes a group in parentheses ‘()’ followed by ‘+’ and the closing ‘>’. The plus sign means that whatever the group matches must occur one or more times, and the whole match must end with ‘>’.

([[:alpha:]|[:blank:]|[:digit:]]|-|=|#|:|;|")+

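Inside the parentheses we have a list of alternatives: any alphabetic character ([:alpha:]), a space or tab ([:blank:]), a digit ([:digit:]), or one of ‘-’, ‘=’, ‘#’, ‘:’, ‘;’ or a double quote. (Inside the square brackets the ‘|’ is actually a literal pipe character, not ‘or’. It is harmless here, but [[:alpha:][:blank:][:digit:]] would have been enough.) I did not use the punctuation class ([:punct:]) because it would also match ‘>’, which we don’t want inside the parentheses: the expression would then run on to the last ‘>’ on the line instead of stopping at the end of the opening tag.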
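If you want to try the expression outside Geany, here is a rough Python sketch. Python's re module has no [:alpha:]-style classes, so I spelled them out as ASCII ranges; that translation, the sample string and the variable names are my own assumptions, not something Geany does. The second pattern adds ‘<’, ‘>’ and ‘/’ to the class only to show the over-matching problem described above.

```python
import re

# ASCII approximation of the expression from the post (assumption:
# [:alpha:], [:blank:] and [:digit:] are replaced by explicit ranges).
JUNK_SPAN = re.compile(r'<span [A-Za-z \t0-9\-=#:;"]+>')

# Illustration only: the same class with '<', '>' and '/' allowed,
# roughly what adding a punctuation class would do.
TOO_GREEDY = re.compile(r'<span [A-Za-z \t0-9\-=#:;"<>/]+>')

# Hypothetical sample line in the style of the MS Word export.
sample = (
    '<p><span lang="kk-KZ">one</span> '
    '<span style="color: #000000;">two</span> '
    '<span style="text-decoration: underline;">three</span> '
    '<span>four</span></p>'
)

# Each junk opening tag is matched on its own; the bare <span> is skipped.
matches = JUNK_SPAN.findall(sample)
for tag in matches:
    print(tag)

# With '>' allowed inside the tag, one match swallows the rest of the line.
greedy = TOO_GREEDY.findall(sample)
print(len(greedy))
```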

So in the end it selected all the tags above, and I was able to replace them with nothing, so they’re gone. After that I removed the closing </span> tags, but that’s a piece of cake.
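The same two-step cleanup can be sketched in Python, in case the file is too big to comfortably handle in an editor. This assumes the ASCII translation of the character classes, and strip_junk_spans is just a name I made up. Note that the second step removes every </span>, including ones that belong to spans the first step did not touch.

```python
import re

# ASCII approximation of the opening-tag expression (an assumption,
# since Python's re has no POSIX character classes).
OPENING = re.compile(r'<span [A-Za-z \t0-9\-=#:;"]+>')


def strip_junk_spans(html: str) -> str:
    """Remove the junk opening tags, then every closing </span>."""
    without_opening = OPENING.sub("", html)       # step 1: replace with nothing
    return without_opening.replace("</span>", "")  # step 2: plain text replace


print(strip_junk_spans('<p><span style="color: #000000;">Hello</span></p>'))
```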

The next time I face a loooooong text ruined by MS Word tags, I would extend this expression so that it selects not only the opening tag but also the enclosed text and the closing tag, and then replace the whole thing with just the enclosed text.
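A minimal sketch of that idea, again in Python and again with the ASCII translation of the character classes as an assumption: the enclosed text goes into a capture group, and the replacement is just that group. The name unwrap_spans is hypothetical, and the loop is my own addition so that nested spans get unwrapped one layer per pass.

```python
import re

# Opening tag, captured text (as short as possible), closing tag.
SPAN_WITH_TEXT = re.compile(
    r'<span [A-Za-z \t0-9\-=#:;"]+>(.*?)</span>',
    re.DOTALL,  # let the enclosed text span several lines
)


def unwrap_spans(html: str) -> str:
    """Replace each junk span with the text it encloses."""
    while True:
        # \1 is the captured text between the opening and closing tag.
        html, count = SPAN_WITH_TEXT.subn(r"\1", html)
        if count == 0:  # nothing left to unwrap
            return html


print(unwrap_spans('<span lang="ru-RU"><span style="color: #000000;">text</span></span>'))
```

In Geany the equivalent would presumably be the same expression with the extra (.*?)</span> part in the search field and \1 in the replace field, but I have not tried that on the real file.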

You may also check the documentation on how to use RegEx in Geany on their page.