You'd be best off converting your input to Unicode strings, using

    text = text.decode("cp1252")

doing all your conversions in terms of Unicode characters

    text = text.replace(u"\u2013", "–")

... and finally converting to UTF-8 on output:

    text = text.encode('utf-8')

    u'\u0096'.encode('utf-8')

Jeff
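
To make the decode / replace / encode order concrete, here is a minimal sketch. The helper name cp1252_to_utf8, its replacements argument, and the sample bytes are made up for illustration and are not from the original question. It assumes the input really is cp1252; a few byte values (0x81, for example) are undefined in that code page and make decode() raise UnicodeDecodeError.

```python
# -*- coding: utf-8 -*-

def cp1252_to_utf8(raw, replacements=None):
    """Hypothetical helper: decode cp1252 bytes, edit as Unicode, encode as UTF-8."""
    # 1. Bytes -> Unicode. In cp1252 the byte 0x96 is U+2013 (en dash).
    text = raw.decode("cp1252")
    # 2. Do every substitution on Unicode text, with Unicode on both sides.
    for old, new in (replacements or {}).items():
        text = text.replace(old, new)
    # 3. Unicode -> UTF-8 bytes, only at the output boundary.
    return text.encode("utf-8")


# Sample input (assumed): "10", an en dash stored as the cp1252 byte 0x96, "20".
raw = b"10\x9620"

as_utf8 = cp1252_to_utf8(raw)
with_hyphen = cp1252_to_utf8(raw, {u"\u2013": u"-"})

# Decoding the same byte as latin-1 instead yields the C1 control
# character U+0096, which is not an en dash.
wrong = b"\x96".decode("latin-1")
```

Here as_utf8 carries the en dash as the three UTF-8 bytes E2 80 93, while with_hyphen has a plain ASCII hyphen in its place. The last line shows where a stray u'\u0096' can come from: the byte 0x96 decoded with the wrong codec.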