Today I ran into a problem on my client's website while using the regular expression technique that I explained in my post here. The problem arose because his website is in Swedish, so the character pattern I usually match against did not work.
The pattern matching string used before was: <a [a-zA-Z0-9 /.?_=:"-]*>
But his website used characters like Õ, Ä, Û, é, and so on. To solve this problem I added a hex-based range, [\xC0-\xFF], to the character class. This range covers the accented Latin-1 letters from À to ÿ, both uppercase and lowercase (it also picks up the two symbols × and ÷, which sit in the same range).
So the new pattern matching rule for non-English websites that use these characters will be:
<a [a-zA-Z\xC0-\xFF0-9 /.?_=:"-]*>
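To see the difference between the two patterns, here is a minimal sketch in Python. The sample markup is made up for illustration and is not taken from my client's site, and it assumes the page has already been decoded to text, so that each accented letter is one character rather than a pair of UTF-8 bytes.

```python
import re

# Original ASCII-only pattern from this post.
ASCII_ANCHOR = re.compile(r'<a [a-zA-Z0-9 /.?_=:"-]*>')

# Extended pattern: \xC0-\xFF adds the accented Latin-1 letters.
LATIN1_ANCHOR = re.compile(r'<a [a-zA-Z\xC0-\xFF0-9 /.?_=:"-]*>')

# Hypothetical sample markup with a Swedish title attribute.
html = '<p><a href="/om-oss" title="Välkommen">Om oss</a></p>'

# The ASCII pattern stops at "ä" and never reaches the closing ">".
ascii_hits = ASCII_ANCHOR.findall(html)

# The extended pattern accepts "ä" and matches the whole opening tag.
latin1_hits = LATIN1_ANCHOR.findall(html)

print(ascii_hits)
print(latin1_hits)
```

With this sample, the first pattern finds nothing and the second one returns the complete opening tag.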
The tools that helped me today are:
- Webmonkey Special Characters Reference – This gave me the list of special characters I needed to focus on, along with their decimal values.
- My Scientific Calculator – To convert the decimal values to hex values.
- Free RegEx Testing Tool – A free, multi-platform Adobe AIR regular expression evaluation tool that helped me test my ideas.
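The decimal-to-hex step I did on the calculator can also be done in a few lines of Python. This is only a sketch: the list of Swedish letters below is my own choice for illustration, not the full list from the reference.

```python
# Swedish letters chosen for illustration (an assumption, not the full list).
letters = "ÅÄÖåäö"

# Map each letter to its decimal code and its two-digit hex code.
codes = {ch: (ord(ch), format(ord(ch), "02X")) for ch in letters}

# Check that every letter falls inside the \xC0-\xFF range.
in_range = all(0xC0 <= ord(ch) <= 0xFF for ch in letters)

for ch, (dec, hx) in codes.items():
    print(ch, dec, hx)
print(in_range)
```

If a letter ever falls outside the range, the last check comes back false, which is a hint that the character class needs another range.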
That is all for today. I will post experiences like this more often. If you have any comments, please let me know in the comments section below so I can improve this further.