translating Python to Assembler

Bjoern Schliessmann usenet-mail-0306.20.chr0n0ss at spamgourmet.com
Mon Jan 28 09:35:20 EST 2008


over at thepond.com wrote:
> On Sat, 26 Jan 2008 14:47:50 +0100, Bjoern Schliessmann

>> This may be true, but I think it's not bad to assume that machine
>> language and assembler are "almost the same" in this context,
>> since the translation between them is non-ambiguous (It's
>> just "recoding"; this is not the case with HLLs).
> 
> I have no problem with your explanation. It's nearly impossible to
> program in machine code, which is all 1's and 0's. 

Not really; it's "voltage" or "no voltage" at different signal lines
in the processor. The binary system is just one representation you
could choose; more common (and practical) notations are hexadecimal
or octal.

> the difference is that machine code can be read directly, whereas
> assembler has to be compiled in order to convert the opcodes to
> binary data.

As I said before, IMHO this "compilation" is trivial compared to HLL
compilation, since it's just a translation of mnemonics to opcode
numbers and of labels to addresses, respectively (a small sketch
follows below).
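
For illustration, here's a minimal sketch in plain Python of what an
assembler essentially does: look up mnemonics in a table and resolve
labels to addresses. Only a few one-byte IA-32 instructions are
covered, and everything besides their opcode bytes is simplified.

# Toy "assembler": a mnemonic-to-opcode lookup plus label bookkeeping.
# The opcode bytes for these one-byte IA-32 instructions are real
# (0x59 = pop ecx, 0x48 = dec eax, 0xC3 = ret); the rest is a sketch.
OPCODES = {
    "pop ecx": 0x59,
    "dec eax": 0x48,
    "ret":     0xC3,
}

def assemble(lines):
    """Translate mnemonic lines into machine-code bytes."""
    code = bytearray()
    labels = {}
    for line in (l.strip() for l in lines):
        if line.endswith(":"):        # a label: remember its address
            labels[line[:-1]] = len(code)
        else:                         # an instruction: plain table lookup
            code.append(OPCODES[line])
    return bytes(code), labels

code, labels = assemble(["start:", "pop ecx", "dec eax", "ret"])
print(code.hex(), labels)             # 5948c3 {'start': 0}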

HLL compilers do much more: they translate high-level control
structures into one of many possible low-level implementations (the
mapping is ambiguous). Often optimisation is applied as well, which
may for example cause a loop to be unrolled, so that it vanishes
from the assembly entirely (illustrated below).
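
To make "unrolling" concrete, here's a hedged Python illustration of
the transformation; in a compiled language the rolled version's
counter, test and branch simply disappear from the generated assembly.

# What "loop unrolling" means, shown in Python for illustration only.
def rolled(data):
    total = 0
    for x in data:          # loop control: counter, test, backward branch
        total += x
    return total

def unrolled(a, b, c, d):
    # For a known, fixed trip count the optimiser may emit straight-line
    # code instead; the loop structure is gone.
    return a + b + c + d

assert rolled([1, 2, 3, 4]) == unrolled(1, 2, 3, 4) == 10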
 
>> (Not really -- object code files are composed of header data and
>> different segments, data and code, and only the code segments are
>> really meaningful to the processor.)
> 
> I agree that the code segments, and the data, are all that's
> meaningful to the processor. There are a few others, like
> interrupts that affect the processor directly.

Interrupts and segments are orthogonal, don't you think?
 
> I understand what you're saying but I'm refering to an executable
> file ready to be loaded into memory. 

Obviously not, since I was referring to such a file, too. Try
reading about "real" executable formats like ELF.
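
A few lines of Python are enough to see that such a file is more than
a flat stream of instructions; this sketch peeks at an ELF header (it
assumes a little-endian ELF file, and /bin/ls is just an example path):

# Minimal ELF header peek. The magic number and the e_type/e_machine
# fields sit right at the start of the file.
import struct

with open("/bin/ls", "rb") as f:
    ident = f.read(16)                # e_ident: magic, class, endianness
    assert ident[:4] == b"\x7fELF"
    e_type, e_machine = struct.unpack("<HH", f.read(4))

print("type:", e_type, "machine:", e_machine)
# None of this header data is ever fed to the CPU as instructions; it
# only tells the loader how to set the process up in memory.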

> It's stored on disk in a series of 1's and 0's. 

No, it's stored as a pattern of magnetic fields. You _can_ interpret
those as binary digits, yes, but that's impractical, and the choice
of representation is up to the viewer.

> The actual file on disk is in a certain format that only the
> operating system understands. But once the code is read in, it
> goes into memory locations which hold individual arrays of bits.

I agree. (Before, you wrote differently:

> Both Linux and Windows compile down to binary files, which are
> essentially 1's and 0's arranged in codes that are meaningful to
> the processor.

E.g. the ELF header and the data segments mean nothing to the
processor itself.)

> That's a machine code, since starting at 00000000 to 11111111, you
> have 256 different codes available. 

I'm afraid it's not that simple. IA-32 opcodes, for example, are
complex bit sequences and don't always have the same length.
Primary opcodes consist of up to three bytes in this architecture.

Some RISC CPUs limit machine instructions to a fixed length of e.g.
one word, but IA-32 has no such limitation (the toy decoder below
illustrates the varying lengths).
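
Here's a toy sketch of what "variable length" means in practice; it
only knows a handful of instructions (some of which come up again
further down), while a real decoder has to handle prefixes, ModRM/SIB
bytes, displacements and so on.

# Walk a byte string and split it into IA-32 instructions of differing
# lengths, using a hand-written table for just a few opcodes.
LENGTHS = {
    0x59: ("pop ecx", 1),      # one-byte instruction
    0x48: ("dec eax", 1),      # one-byte instruction
    0xE8: ("call rel32", 5),   # opcode plus 4-byte relative displacement
    0xC3: ("ret", 1),
}

def split_instructions(code):
    i = 0
    while i < len(code):
        name, length = LENGTHS[code[i]]
        yield code[i:i + length].hex(), name
        i += length

blob = bytes.fromhex("59 48 e8 d1 ff ff ff c3")
for raw, name in split_instructions(blob):
    print(raw, "->", name)
# 59 -> pop ecx
# 48 -> dec eax
# e8d1ffffff -> call rel32
# c3 -> ret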

>> But you _do_ know that pyc files are Python byte code, and you
>> could only disassemble them to Python byte code directly?
> 
> that's the part I did not understand, so thanks for pointing that
> out. What I disassembled did not make sense. I was looking for
> assembler code, but I do understand a little bit about how the
> interpreter reads them.
> 
> For example, from os.py, here's part of the script:
> 
> # Note:  more names are added to __all__ later.
> __all__ = ["altsep", "curdir", "pardir", "sep", "pathsep",
> "linesep",
>            "defpath", "name", "path", "devnull"]
> 
> here's the disassembly from os.pyc:

... which is completely pointless, because that is not an IA-32 code
segment the processor could execute, but a custom data file format.
I'd rather try something like this:

>>> import dis
>>> def increment(i):
...     i += 1
...     return i
... 
>>> dis.dis(increment)
  2           0 LOAD_FAST                0 (i)
              3 LOAD_CONST               1 (1)
              6 INPLACE_ADD         
              7 STORE_FAST               0 (i)

  3          10 LOAD_FAST                0 (i)
             13 RETURN_VALUE        
>>> 

The Python VM, though, is stack-based, unlike most CPUs, which are
register-based. That's why the opcodes look quite different.

> The script is essentially gone. I'd like to know how to read the
> pyc files, but that's getting away from my point that there is a
> link between python scripts and assembler. At this point, I admit
> the code above is NOT assembler, but sooner or later it will be
> converted to machine code by the interpreter and the OS and that
> can be disassembled as assembler.

Yes. But the interpreter doesn't convert the entire file to machine
language. It reads one byte code instruction after another and,
amongst other things, executes machine code (built into the
interpreter) which "does" what that byte code instruction intends.
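
A heavily simplified sketch of that dispatch idea, handling just the
byte code operations from the dis output above; the real CPython
evaluation loop is written in C and does far more. Instructions are
(opname, argument) pairs here instead of real encoded byte code.

# Toy stack-based dispatch loop for a few Python byte code operations.
def run(instructions, local_vars, constants):
    stack = []
    for op, arg in instructions:
        if op == "LOAD_FAST":
            stack.append(local_vars[arg])
        elif op == "LOAD_CONST":
            stack.append(constants[arg])
        elif op == "INPLACE_ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "STORE_FAST":
            local_vars[arg] = stack.pop()
        elif op == "RETURN_VALUE":
            return stack.pop()

program = [
    ("LOAD_FAST", "i"), ("LOAD_CONST", 0), ("INPLACE_ADD", None),
    ("STORE_FAST", "i"), ("LOAD_FAST", "i"), ("RETURN_VALUE", None),
]
print(run(program, {"i": 41}, constants=[1]))    # 42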
 
> Python needs an OS like Windows or Linux to interface it to the
> processor.  

Not really. The CPython executable contains machine code directly
executable by the host processor. The OS just 

* provides routines for accessing peripherals and allocating memory
  (the sketch below uses some of these),
* makes it possible that multiple programs can run side by side,
* and loads the executable and sets it up in memory for execution.
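
You can see the first point from within Python itself: the low-level
functions in the os module are thin wrappers around routines the OS
provides (on Linux they map to the open/read/close system calls; the
path below is just an example).

# Reading a file through the OS-provided routines, bypassing Python's
# high-level file objects.
import os

fd = os.open("/etc/hostname", os.O_RDONLY)   # ask the OS to open the file
data = os.read(fd, 64)                       # ask the kernel for up to 64 bytes
os.close(fd)
print(data)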

> Yes, the source is readable like that, but the compiled binary is
> not.

For a machine, it is. The translation is 1:1, trivial.

> A disassembly shows both the source and the opcodes.

The output I posted came directly from the GNU C compiler (compiled
from an empty "main" function). I got it by using the -S option,
which tells the compiler to stop before the final step of generating
machine code from assembly and to save the assembly source instead.

A "disassembly" goes the other way round. The hexadecimal
representation of the machine code in the leftmost columns is
completely redundant and practically irrelevant to a human reader.

> The second column are opcodes, 

Not only. It's machine code instructions, i.e. opcodes and
operands.

> and the third column are mneumonics, English words attached to the
> codes to give them meaning. 

They're mn_e_monics, and they're not really English (what kind of
English words would RET, JLE or CMP be?).

> The second and third column mean the same thing. 

Not at all! They're the operands and can be memory addresses,
registers or fixed values.
 
> A single opcode instruction like 59 = pop ecx and 48 = dec eax,
> are self-explanatory. 

It's a machine instruction which consists of the opcode POP and the
operand ECX.

> The second instruction, call 1D00122A is not as straight forward.
> it is made up of two parts: E8 = the opcode for CALL and the rest
> 'D1 FF FF FF' is the opcode operator

I'm afraid not -- it's the operand.
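
To underline that those four bytes are an operand, here's a hedged
sketch of how the call target is computed from them: E8 takes a
signed 32-bit little-endian displacement relative to the address of
the *next* instruction. The instruction address below is inferred
from the quoted target and is only an example.

import struct

operand = bytes.fromhex("d1ffffff")
(disp,) = struct.unpack("<i", operand)    # signed little-endian: -47
next_insn = 0x1D001254 + 5                # one opcode byte + four operand bytes
target = next_insn + disp
print(disp, hex(target))                  # -47 0x1d00122a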

> I would agree with what you said earlier, that there is a
> similarity between machine code and assembler. 

Is there, actually? :)

> You can actually write in machine code, but it is often entered in
> hexadecimal, requiring a hex to binary interpreter. 

IMHO, this makes no sense. For example, the memory contents
represented by binary 10000 and hexadecimal 0x10 are exactly the
same. Thus, it doesn't matter at all how you enter or view it, and
it's completely
up to the user. The CPU understands both *exactly* the same way,
since they are the same: voltage levels at signal lines.
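
Python makes the same point nicely; the literals below are just
different spellings of one and the same value:

# Different notations, one value: the stored bits are identical.
n = 0b10000
print(n == 0o20 == 0x10 == 16)     # True
print(bin(n), oct(n), hex(n))      # 0b10000 0o20 0x10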

>>> You can't read a pyc file in a hex editor,
>
> if I knew what the intervening numbers meant I could. :-)

(*You* wrote the above. Please don't drop quoting headers if you
quote this deep.)

Regards,


Björn

-- 
BOFH excuse #11:

magnetic interference from money/credit cards



