translating Python to Assembler

Sun Jan 27 05:55:20 EST 2008

On Sat, 26 Jan 2008 14:47:50 +0100, Bjoern Schliessmann
<usenet-mail-0306.20.chr0n0ss at spamgourmet.com> wrote:

>over at thepond.com wrote:
>
>> Intel processors can only process machine language[...] There's no
>> way for a processor to understand any higher level language, even
>> assembler, since it is written with hexadecimal codes and basic
>> instructions like MOV, JMP, etc. The assembler compiler can
>> convert an assembler file to a binary executable, which the
>> processor can understand.
>
>This may be true, but I think it's not bad to assume that machine
>language and assembler are "almost the same" in this context, since
>the translation between them is non-ambiguous (It's
>just "recoding"; this is not the case with HLLs).

I have no problem with your explanation. It's nearly impossible to
program in machine code, which is all 1's and 0's. Assembler makes it
far easier by giving each instruction a mnemonic name, like PUSH,
MOV, CALL, etc., in place of the raw binary, which is usually shown
as its hexadecimal equivalent.

Still, the older machine-programmable processors used switches to set
the 1's and 0's. Or, the machine code was fed in on perforated cards
or tapes that were read. The computer read the switches, cards or
tapes, and set voltages according to what it scanned.

The difference is that the processor can execute machine code
directly, whereas assembler has to be assembled first in order to
convert the mnemonics to binary data.

>
>> Both Linux and Windows compile down to binary files, which are
>> essentially 1's and 0's arranged in codes that are meaningful to
>> the processor.
>
>(Not really -- object code files are composed of header data and
>different segments, data and code, and only the code segments are
>really meaningful to the processor.)

I agree that the code segments, and the data, are all that's
meaningful to the processor. There are a few others, like interrupts
that affect the processor directly. 

I understand what you're saying, but I'm referring to an executable
file ready to be loaded into memory. It's stored on disk as a series
of 1's and 0's. As you say, there are also control codes on disk to
separate each byte along with CRC codes, timing codes, etc. However,
that is all stripped off by the hard drive electronics.

The actual file on disk is in a certain format that only the operating
system understands. But once the code is read in, it goes into memory
locations which hold individual arrays of bits. Each memory location
holds a precise number of bits corresponding to the particular code it
represents. For example, the ret instruction you mention below is
represented by hex C3 (0xC3), which is the bit pattern 11000011.

That's a machine code: counting from 00000000 to 11111111, you have
256 different codes available in one byte. When those 1's and 0's are
converted to voltages, the processor decodes them and sets circuits
in action which bring about the desired operation. Since Linux is
written in C, it must be compiled down to machine code, just as
Windows must.
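
To check that arithmetic, here is a minimal Python sketch; the
variable names are only illustrative:

```python
# Bit pattern of the one-byte opcode 0xC3 (ret on x86).
ret_opcode = 0xC3
ret_bits = format(ret_opcode, "08b")

# Number of distinct values a single 8-bit byte can hold.
byte_values = 2 ** 8

print(ret_bits, byte_values)
```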

>
>> Once a python py file is compiled into a pyc file, I can
>> disassemble it into assembler. 
>
>But you _do_ know that pyc files are Python byte code, and you could
>only directly disassemble them to Python byte code directly?

That's the part I did not understand, so thanks for pointing that out.
What I disassembled did not make sense. I was looking for assembler
code, but I do understand a little bit about how the interpreter reads
pyc files.

For example, from os.py, here's part of the script:

# Note:  more names are added to __all__ later.
__all__ = ["altsep", "curdir", "pardir", "sep", "pathsep", "linesep",
           "defpath", "name", "path", "devnull"]

here's the disassembly from os.pyc:

00000C04  06 00 00 00           dd 6
00000C08  61 6C 74 73 65 70 74  db 'altsept'
00000C0F  06 00 00 00           dd 6
00000C13  63 75 72 64 69 72 74  db 'curdirt'
00000C1A  06 00 00 00           dd 6
00000C1E  70 61 72 64 69 72 74  db 'pardirt'
00000C25  03 00 00 00           dd 3
00000C29  73 65 70              db 'sep'
00000C2C  74 07 00 00           dd 774h
00000C30  00                    db    0
00000C31  70 61 74 68 73 65 70  db 'pathsep'
00000C38  74 07 00 00           dd 774h
00000C3C  00                    db    0
00000C3D  6C 69 6E 65 73 65 70  db 'linesep'
00000C44  74 07 00 00           dd 774h
00000C48  00                    db    0
00000C49  64 65 66 70 61 74 68  db 'defpath'
00000C50  74 04 00 00           dd offset unk_474
00000C54  00                    db    0
00000C55  6E 61 6D 65           db 'name'
00000C59  74 04 00 00           dd offset unk_474
00000C5D  00                    db    0
00000C5E  70 61 74 68           db 'path'
00000C62  74 07 00 00           dd 774h
00000C66  00                    db    0
00000C67  64 65 76 6E 75 6C 6C  db 'devnull'

you can see all the ASCII names in the disassembly like altsep,
curdir, etc. I'm not clear as to why they are all terminated with 0x74
= t, or if that's my poor interpretation. Some ASCII strings don't use
a 0 terminator. The point is that all the ASCII strings have numbers
between them which mean something to the interpreter. Also, they are
at a particular address. The interpreter has to know where to find
them. 
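
One possible reading of the 0x74 bytes, assuming this os.pyc was
written by a CPython 2.x marshal: 't' would not be a terminator but a
type tag that starts the next record, followed by a 4-byte
little-endian length and then the characters. The sketch below
re-parses the bytes from the dump under that assumption. The leading
't' (the byte at 0xC03, just before the dump starts) is assumed, and
the helper name read_tagged_strings is hypothetical.

```python
import struct

# Bytes copied from the dump, starting at 0xC04. The leading b"t" is an
# ASSUMPTION: it stands for the byte at 0xC03, which the dump does not show.
raw = b"t" + bytes.fromhex(
    "06000000 616C74736570 74"    # altsep, then tag of the next record
    "06000000 637572646972 74"    # curdir
    "06000000 706172646972 74"    # pardir
    "03000000 736570 74"          # sep
    "07000000 70617468736570 74"  # pathsep
    "07000000 6C696E65736570 74"  # linesep
    "07000000 64656670617468 74"  # defpath
    "04000000 6E616D65 74"        # name
    "04000000 70617468 74"        # path
    "07000000 6465766E756C6C"     # devnull
)


def read_tagged_strings(buf):
    """Parse records of the form: b't', uint32 little-endian length, bytes."""
    names = []
    pos = 0
    while pos < len(buf):
        if buf[pos:pos + 1] != b"t":
            raise ValueError("unexpected tag at offset %d" % pos)
        (length,) = struct.unpack_from("<I", buf, pos + 1)
        start = pos + 5  # skip the 1-byte tag and the 4-byte length
        names.append(buf[start:start + length].decode("ascii"))
        pos = start + length
    return names


names = read_tagged_strings(raw)
print(names)
```

Read this way, the "dd 774h" and "db 0" lines are the disassembler
splitting 't' plus the length 7 across two guesses, which would explain
why the numbers between the strings looked meaningless.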

The script is essentially gone. I'd like to know how to read the pyc
files, but that's getting away from my point that there is a link
between python scripts and assembler. At this point, I admit the code
above is NOT assembler, but sooner or later it will be converted to
machine code by the interpreter and the OS and that can be
disassembled as assembler.  
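
For reading what a pyc holds, the standard library's dis and marshal
modules are probably the place to start. A minimal sketch, assuming a
current CPython 3: it compiles a shortened stand-in for the os.py
line above in memory instead of opening a real .pyc, because the
on-disk header size varies between Python versions.

```python
import dis
import marshal

# A one-line stand-in for the os.py excerpt (shortened for illustration).
source = '__all__ = ["altsep", "curdir"]\n'

# compile() produces the same kind of code object that a .pyc stores.
code = compile(source, "<example>", "exec")

# The body of a .pyc is a marshalled code object; round-trip it in memory
# here rather than guessing at the version-specific file header.
restored = marshal.loads(marshal.dumps(code))

# Collect the names of the bytecode instructions in the code object.
opnames = [ins.opname for ins in dis.get_instructions(restored)]

# Human-readable bytecode listing; exact output differs between versions.
dis.dis(restored)
```

The listing this prints is Python byte code (LOAD_CONST, STORE_NAME
and so on), not x86 instructions.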

I realize this is a complicated process and I can understand people
thinking I'm full of beans. Python needs an OS like Windows or Linux
to interface it to the processor. And all a processor can understand
is machine code. 

>
>> Assembler is nothing but codes, which are combinations of 1's and
>> 0's. 
>
>No, assembly language source is readable text like this (gcc):
>
>.LCFI4:
>    movl    $0, %eax
>    popl    %ecx
>    popl    %ebp
>    leal    -4(%ecx), %esp
>    ret
>

Yes, the source is readable like that, but the compiled binary is not.
A disassembly shows both the source and the opcodes. The ret
statement above is a mnemonic for hex C3 in assembler. You have left
out the opcodes. Here's another example of assembler which is
disassembled from python.exe:

1D001250  FF 74 24 04     push    [esp+arg_0]
1D001254  E8 D1 FF FF FF  call    1D00122A
1D001259  F7 D8           neg     eax
1D00125B  1B C0           sbb     eax, eax
1D00125D  F7 D8           neg     eax
1D00125F  59              pop     ecx
1D001260  48              dec     eax
1D001261  C3              retn

The first column is obviously the address in memory. The second column
holds the opcode bytes, and the third column holds the mnemonics,
English words attached to the codes to give them meaning. The second
and third columns mean the same thing.

Single-byte instructions like 59 = pop ecx and 48 = dec eax are
self-explanatory. 59 is hexadecimal for binary 01011001, which is a
binary code. When a processor receives that binary as voltages, it is
wired to pop the value on top of the stack into the ecx register.
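
A rough sketch of that opcode-to-mnemonic mapping in Python, covering
only a handful of one-byte 32-bit x86 instructions. The table and the
function name decode_one_byte are made up for illustration; a real
disassembler handles far more cases, and in 64-bit mode the bytes
40-4F mean something else.

```python
# Register numbers 0-7 as encoded in the low three bits of the opcode.
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# "Base opcode + register number" families (32-bit mode only).
FAMILIES = {0x40: "inc", 0x48: "dec", 0x50: "push", 0x58: "pop"}


def decode_one_byte(opcode):
    """Return the mnemonic for a few one-byte opcodes, or None if unknown."""
    if opcode == 0xC3:
        return "ret"
    base = opcode & 0xF8           # upper five bits select the family
    if base in FAMILIES:
        reg = REGS32[opcode & 0x07]  # lower three bits select the register
        return FAMILIES[base] + " " + reg
    return None


for byte in (0x59, 0x48, 0xC3):
    print(format(byte, "02X"), "=", decode_one_byte(byte))
```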

The second instruction, call 1D00122A, is not as straightforward. It
is made up of two parts: E8 = the opcode for CALL, and the rest, 'D1
FF FF FF', is the operand, the data which the call is referencing. In
this case it's a 32-bit offset to the address being called, measured
from the instruction that follows the call. It is written backward,
however, because x86 stores multi-byte values least significant byte
first (little-endian). D1 FF FF FF actually means FF FF FF D1.

The leading F's mark the offset as negative (two's complement),
telling the processor to jump back. The signed number FFFFFFD1 = -2F.
A call counts from the address of the next instruction, which is
1D001259, and 1D001259 - 2F = 1D00122A, the address being called.
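
The same calculation as a short Python sketch, assuming the E8 form
of CALL takes a signed 32-bit little-endian offset relative to the
next instruction; the variable names are illustrative only.

```python
import struct

# Address and bytes of the call instruction, taken from the listing.
insn_addr = 0x1D001254
insn_bytes = bytes.fromhex("E8 D1 FF FF FF")

# Bytes after the E8 opcode: a signed 32-bit little-endian offset.
(rel32,) = struct.unpack("<i", insn_bytes[1:])

# The offset is applied to the address of the instruction after the call.
next_addr = insn_addr + len(insn_bytes)
target = (next_addr + rel32) & 0xFFFFFFFF

print(hex(rel32), hex(next_addr), hex(target))
```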

As you can see, it's all done with binary codes. The English
statements are purely for the convenience of the programmer. If you
look at the Intel definitions for assembler instructions, they list
both the opcodes and the mnemonics.

I would agree with what you said earlier, that there is a similarity
between machine code and assembler. You can actually write in machine
code, but it is often entered in hexadecimal, requiring a hex to
binary converter. In that case, the similarity to assembled code is
quite close.

>Machine language is binary codes, yes.
>
>> You can't read a pyc file in a hex editor, 
>
if I knew what the intervening numbers meant I could. :-)

>By definition, you can read every file in a hex editor ...
>
>> but you can read it in a disassembler. It doesn't make a lot of
>> sense to me right now, but if I was trying to trace through it
>> with a debugger, the debugger would disassemble it into 
>> assembler, not python.
>
>Not at all. Again: It's Python byte code. Try experimenting with
>pdb.

I will eventually...thanks for reply.