Simple converter of files into their hex components... but i can't arrange utf-8 parts!

Sun Jun 9 19:07:18 EDT 2013

On Mon, Jun 10, 2013 at 7:06 AM,  <blatt447477 at gmail.com> wrote:
> Hi all,
> I developed a script, which, IMHO, is more useful than the well
> known bash "hexdump".
> Unfortunately i can't arrange very easily the utf-8 encoding,
> so in my output there is a loss of synchronization between the
> literal and the hex part...
> The script is not very long but is written not very well (no functions,
> no classes...) but I didn't succeed in formulating my doubts in
> a more concise way... so here you can find it!

Functions and classes are entirely optional in Python :) However,
there are a number of points about your code that are not Pythonic, so
I'll take the liberty of commenting on those. You're free to ignore my
comments, of course!

> 004 # qwerty: not   unicode but   ascii
>     2 7767773 667   7666666 677   676660
>     3 175249a ef4   5e93f45 254   13399a
>
> 005 # qwerty: non è unicode bensì ascii
>     2 7767773 666 ca 7666666 6667ca 676660
>     3 175249a efe 38 5e93f45 25e33c 13399a

I'm not 100% sure of what you're trying to accomplish here. You want
to produce a hex-dump output, but:
1) Your hex digits are directly underneath the character concerned (q
= 0x71 ergo line 2 has "7" and line 3 has "1");
2) Spaces are shown as spaces;
3) UTF-8 sequences get merged.

The one part I'm not sure about is what you intend to happen with
UTF-8 sequences. Currently, your line 005 gets offset by the c3 a8 and
then again by the c3 ac, which as you say is undesirable, but what
*do* you want? Are you trying to have the character on top get split
into its bytes? That would look like this:

005 # qwerty: non Ã¨ unicode bensÃ¬ ascii
    2 7767773 666 ca 7666666 6667ca 676660
    3 175249a efe 38 5e93f45 25e33c 13399a

You can do that by simply opening the file as Latin-1 (iso-8859-1)
instead of UTF-8. Each byte will be taken to represent its eight-bit
value, and then when you produce the output, it'll be encoded as
whatever your console requires.
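
To make that concrete, here is a minimal Python 3 sketch of the Latin-1 trick. The sample string is my own; for a real file you would pass encoding='latin-1' to open() instead of decoding by hand.

```python
# The bytes as they would sit in a UTF-8 encoded file (sample text is made up).
raw = "non è".encode('utf-8')

# Latin-1 maps every byte value 0-255 to exactly one character,
# so this decode can never fail and never merges bytes.
as_latin1 = raw.decode('latin-1')

print(len(raw), len(as_latin1))   # same count: one character per byte
print(as_latin1)                  # the two-byte sequence shows up as two characters
print(' '.join(format(ord(ch), '02x') for ch in as_latin1))
```

Since every character now stands for exactly one byte, the text row and the hex rows cannot drift apart.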

Alternatively, do you want to insert spaces, or some other placeholder?

005 # qwerty: non è  unicode bensì  ascii
    2 7767773 666 ca 7666666 6667ca 676660
    3 175249a efe 38 5e93f45 25e33c 13399a

Or perhaps it'd be better to string them out further vertically?

005 # qwerty: non è unicode bensì ascii
    2 7767773 666 c 7666666 6667c 676660
    3 175249a efe 3 5e93f45 25e33 13399a
                  a             a
                  8             c
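
If the vertical layout appeals, a rough Python 3 sketch could look like the following. The function name vertical_dump is just something I made up for illustration, and it assumes spaces should stay blank in every row, as in your sample.

```python
def vertical_dump(text):
    """Return the text row followed by rows of hex digits stacked under each character."""
    columns = []
    for char in text:
        if char == ' ':
            columns.append('')                        # spaces stay blank in every row
        else:
            columns.append(char.encode('utf-8').hex())  # e.g. 'è' -> 'c3a8'
    depth = max((len(col) for col in columns), default=0)
    rows = [text]
    for i in range(depth):
        # Take the i-th hex digit of each column, or a blank if that column is shorter.
        row = ''.join(col[i] if i < len(col) else ' ' for col in columns)
        rows.append(row.rstrip())
    return rows

for row in vertical_dump("non è"):
    print(row)
```

ASCII characters use two rows and a two-byte UTF-8 sequence uses four, so the text itself never has to change.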

Or maybe something else entirely? My understanding of your comment here
> # the insertion of one or more spaces after the unicode characters must be
> # done manually on the output (lP)
is that you want to insert spaces, but that means changing the text
itself. I'm not so sure that's a good thing, but if that really is
what you want, then sure.

> # -*- coding: utf-8 -*-
> # px.py          # python 2.6.6

I would advise, by-the-by, that you consider targeting Python 3. There
are heaps of extremely handy Unicode features in Python 3, most
notably that the default string type is Unicode characters, not bytes.
Also, Py3 has a future; Py2 will receive only bugfixes and security
patches.

> try:
>     sFN=sys.argv[1]
>     f=open(sFN)
>     lF=f.readlines()
>     f.close()
> except:
>     sHD=sys.stdin.read().replace('\n','~\n')
>     lF=sHD.split('\n')
>     for n in xrange(len(lF)):
>         lF[n]=lF[n].replace('~','\n')

This is reading the entire file in before producing any output. This
plays badly with other Unix tools (you can't, for instance, 'tail -f
some-file|your-script' to monitor a growing log), and causes extensive
memory usage. Since you then (as far as I can see) always work
line-by-line, you would probably do better to simply iterate over the
lines of input.

Also: I'm not sure what your replace calls are meant to do, but you're
turning all tildes into newlines. It should be possible to iterate
over the lines without this hassle.
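
Something along these lines would do it, as a sketch in Python 3. The helper name numbered_lines is hypothetical, and the StringIO object is only standing in for sys.stdin or an open file so the example is self-contained.

```python
import io

def numbered_lines(stream):
    """Yield (line_number, text) pairs one line at a time, never the whole input."""
    for number, line in enumerate(stream, start=1):
        yield number, line.rstrip('\n')   # drop the newline, leave everything else alone

# Stand-in for sys.stdin or an open file; note the tilde survives untouched.
demo = io.StringIO("first~line\nsecond\n")
for number, text in numbered_lines(demo):
    print('%03d # %s' % (number, text))
```

Pass sys.stdin (or a file object) in place of demo and each line gets handled as soon as it arrives.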

> lP=[]

I have no idea what this name is supposed to represent; longer, more
descriptive names are usually preferred. Also, all-uppercase names tend
to be reserved for constants, so using them elsewhere will confuse people.

>     for k in xrange(len(lNoSpaces)):
>         sHex=lNoSpaces[k].encode('hex')

You're iterating up to the length of something and then using the
index only to retrieve the current element. There's an easier way to
spell that:

for char in lNoSpaces:
    sHex=char.encode('hex')

>         sH=''
>         for c in xrange(0,len(sHexNT),2):
>             sH += sHexNT[c]
>         sHexH += sH+' '
>
>         sL=''
>         for c in xrange(1,len(sHexNT),2):
>             sL  += sHexNT[c]
>         sHexL +=  sL+' '

Here's a really fancy trick you can do: Slicing with a step.
Demonstrating with the interactive interpreter:

>>> s = "7177657274793a206e6f74202020756e69636f6465206275742020206173636969"
>>> hi = s[::2]
>>> lo = s[1::2]
>>> hi
'776777326672227666666267722267666'
>>> lo
'175249a0ef40005e93f45025400013399'

I love Python!

> for n in xrange(0,len(lP),3):
>     try:
>         lP[n].encode('utf-8')
>     except:
>         print lP[n],    # to be modified by hand in presence of utf-8 char
>         print lP[n+1],  #     to synchronize ascii and hex
>         print lP[n+2],

Okay... this is something I'm not understanding. I *think* that lP[n]
here is your original text, as a byte string. In that case, what you
want here is to decode it as UTF-8, I think. But I'm not sure.
Recommendation: Don't use a bare 'except', unless you're logging an
exception and moving on. You're masking an error here; when you
attempt to *encode* a byte string as UTF-8, what it actually does is
first try to *decode* it as ASCII (which produces a Unicode string),
then encode the result. Again using the interactive interpreter:

>>> '\xc2\xa2'.encode('utf-8')
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    '\xc2\xa2'.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> '\xc2\xa2'.decode('utf-8')
u'\xa2'

This is another area where Python 3 would help, because you would be
working with strings most of the way through.
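
For comparison, here is roughly how the same experiment goes in Python 3. This is a sketch only, and try_decode is a name I invented for the example. Bytes objects there have no encode method at all, so the confusing implicit ASCII decode cannot happen, and catching the specific exception keeps other errors visible.

```python
data = b'\xc2\xa2'                  # the UTF-8 bytes for the cent sign

print(hasattr(data, 'encode'))      # bytes cannot be "encoded" again in Python 3
print(data.decode('utf-8'))         # decoding is the operation you actually want

def try_decode(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:      # catch only the error we expect, not everything
        return None

print(try_decode(b'\xc3'))          # a truncated sequence is rejected, not silently hidden
```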

Here's how I'd put together a Py3 version of that code - this is
pseudo-code, but Python's pretty good at executing pseudo-code...

import fileinput
import binascii
for line in fileinput.input():  # handles all the details of args-or-stdin
    output=['','','']  # text row, high hex digits, low hex digits; add line headers here
    for char in line.rstrip('\n'):
        if char==' ':
            output=[row+' ' for row in output]  # spaces stay spaces in every row
            continue
        utf8=char.encode('utf-8')
        output[0]+=char+' '*(len(utf8)-1)  # pad so multi-byte chars stay aligned
        utf8=binascii.hexlify(utf8).decode()
        output[1]+=utf8[::2]; output[2]+=utf8[1::2]
    print('\n'.join(output))

That's actually very close to real code, feel free to flesh it out a
smidge and run it :) I seriously was going to start by writing just
pseudo-code, but it got closer and closer to actual working Python...

Oh, and since we have a current thread about copyright and license:
This is copyright 2013 Chris Angelico, MIT license. So go ahead, use
it. :)

ChrisA