[Tutor] Unicode question

Tue Sep 11 15:49:24 CEST 2007

János Juhász wrote:
> Dear All,
> 
> I would like to convert my DOS txt file into pdf with reportlab.
> The file can be seen correctly in Central European (DOS) encoding in 
> Explorer.
> 
> My winxp uses cp852 as default codepage.
> 
> When I open the txt file in notepad and set OEM/DOS script for terminal 
> fonts, it shows the file correctly.
> 
> I tried to convert the file in the following way:
> 
> from reportlab.platypus import *
> from reportlab.lib.styles import getSampleStyleSheet
> from reportlab.rl_config import defaultPageSize
> PAGE_HEIGHT=defaultPageSize[1]
> 
> styles = getSampleStyleSheet()
> 
> def MakePdfInvoice(InvoiceNum, page):
>     style = styles["Normal"]
>     PdfInv = [Spacer(0,0)]
>     PdfInv.append(Preformatted(page, styles['Normal']))
>     doc = SimpleDocTemplate(InvoiceNum)
>     doc.build(PdfInv)
> 
> if __name__ == '__main__':
>     content = open('invoice01_0707.txt').readlines()
>     page = ''.join(content[:92])
>     page = unicode(page, 'Latin-1')

Why latin-1? You said the file is in cp852, so decode it with that codec. Try
   page = unicode(page, 'cp852')
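
To see why the codec matters, here is a minimal sketch in Python 3 syntax (where byte strings are bytes and Unicode strings are str). The sample text is made up for illustration and is not taken from the invoice file. Decoding as latin-1 never raises an error because every byte maps to some character, which is why the result is "funny chars" rather than an exception.

```python
# The same bytes decoded with two different codecs.
text = "\u00c1FA sz\u00e1mla"        # illustrative Hungarian sample text (an assumption)
raw = text.encode("cp852")           # bytes as a DOS program would have written them

as_latin1 = raw.decode("latin-1")    # never fails, but picks the wrong characters
as_cp852 = raw.decode("cp852")       # the codec the bytes were actually written in

print(repr(as_latin1))               # accented letters come out as other symbols
print(repr(as_cp852))                # matches the original text
```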

>     MakePdfInvoice('test.pdf', page)
> 
> But it made funny chars somewhere.
> 
> I also tried it this way:
> 
> if __name__ == '__main__':
>     content = open('invoice01_0707.txt').readlines()
>     page = ''.join(content[:92])
>     page = page.encode('cp852')

Use decode() here, not encode().
  decode() goes towards Unicode (byte string -> unicode object)
  encode() goes away from Unicode (unicode object -> byte string)

As a mnemonic I think of Unicode as pure unencoded data. (This is *not* 
accurate, it is a memory aid!) Then it's easy to remember that decode() 
removes encoding == convert to Unicode, encode() adds encoding == 
convert from Unicode.
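
A small round trip makes the two directions concrete. This is a sketch in Python 3 syntax, where the split is enforced by the types: bytes has only decode() and str has only encode(). The sample bytes are an assumption chosen for illustration.

```python
raw = b"sz\xa0mla"                   # cp852 bytes; 0xa0 is "a with acute" in cp852

text = raw.decode("cp852")           # decode(): bytes -> text, towards Unicode
utf8 = text.encode("utf-8")          # encode(): text -> bytes, away from Unicode

# Converting between two byte encodings always passes through Unicode.
back_to_cp852 = utf8.decode("utf-8").encode("cp852")

print(repr(text), repr(utf8), repr(back_to_cp852))
```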

>     MakePdfInvoice('test.pdf', page)
> 
> But it raised exception:
>     debugger.run(codeObject, __main__.__dict__, start_stepping=0)
>   File "C:\Python24\Lib\site-packages\pythonwin\pywin\debugger\__init__.py", line 60, in run
>     _GetCurrentDebugger().run(cmd, globals,locals, start_stepping)
>   File "C:\Python24\Lib\site-packages\pythonwin\pywin\debugger\debugger.py", line 631, in run
>     exec cmd in globals, locals
>   File "D:\devel\reportlab\MakePdfInvoice.py", line 18, in ?
>     page = page.encode('cp852')
>   File "c:\Python24\lib\encodings\cp852.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 112: ordinal not in range(128)

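When you call encode() on a string (instead of a unicode object), the 
string is first implicitly decoded to Unicode using the ascii codec. That 
fails as soon as the string contains a byte above 0x7F, such as the 0xb5 
here, which is why you get a UnicodeDecodeError from an encode() call.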
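
The hidden step can be spelled out explicitly. The sketch below uses Python 3 syntax, and py2_style_encode is a hypothetical helper written only to imitate the implicit conversion; it is not a real library function. The byte value 0xb5 is the one from the traceback, and the rest of the sample is made up.

```python
def py2_style_encode(byte_string, encoding):
    """Imitate encode() on a Python 2 byte string: ASCII decode first, then encode."""
    return byte_string.decode("ascii").encode(encoding)

raw = b"\xb5FA"                      # starts with the non-ASCII byte 0xb5

try:
    py2_style_encode(raw, "cp852")
    error = None
except UnicodeDecodeError as exc:    # raised by the decode step, not the encode step
    error = exc

print(error)
```

Pure-ASCII input passes through the same helper unchanged, which is why this kind of code can appear to work until the first accented character shows up.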

Kent