diff --git a/InternalDocs/README.md b/InternalDocs/README.md index 2ef6e653ac19d4..dbc858b276833c 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -24,9 +24,7 @@ Compiling Python Source Code Runtime Objects --- -- [Code Objects (coming soon)](code_objects.md) - -- [The Source Code Locations Table](locations.md) +- [Code Objects](code_objects.md) - [Generators (coming soon)](generators.md) diff --git a/InternalDocs/_code_objects.md b/InternalDocs/_code_objects.md new file mode 100644 index 00000000000000..6cd6098132fdfd --- /dev/null +++ b/InternalDocs/_code_objects.md @@ -0,0 +1,43 @@ + +Code objects +============ + +The interpreter uses a code object (``frame->f_code``) as its starting point. +Code objects contain many fields used by the interpreter, as well as some for use by debuggers and other tools. +In 3.11, the final field of a code object is an array of indeterminate length containing the bytecode, ``code->co_code_adaptive``. +(In previous versions the code object was a :class:`bytes` object, ``code->co_code``; it was changed to save an allocation and to allow it to be mutated.) + +Code objects are typically produced by the bytecode :ref:`compiler `, although they are often written to disk by one process and read back in by another. +The disk version of a code object is serialized using the :mod:`marshal` protocol. +Some code objects are pre-loaded into the interpreter using ``Tools/scripts/deepfreeze.py``, which writes ``Python/deepfreeze/deepfreeze.c``. + +Code objects are nominally immutable. +Some fields (including ``co_code_adaptive``) are mutable, but mutable fields are not included when code objects are hashed or compared. + +The locations table +------------------- + +Whenever an exception is raised, we add a traceback entry to the exception. +The ``tb_lineno`` field of a traceback entry is (lazily) set to the line number of the instruction that raised it. 
+This field is computed from the locations table, ``co_linetable`` (this name is an understatement), using :c:func:`PyCode_Addr2Line`. +This table has an entry for every instruction rather than for every ``try`` block, so a compact format is very important. + +The full design of the 3.11 locations table is written up in :cpy-file:`InternalDocs/locations.md`. +While there are rumors that this file is slightly out of date, it is still the best reference we have. +Don't be confused by :cpy-file:`Objects/lnotab_notes.txt`, which describes the 3.10 format. +For backwards compatibility this format is still supported by the ``co_lnotab`` property. + +The 3.11 location table format is different because it stores not just the starting line number for each instruction, but also the end line number, *and* the start and end column numbers. +Note that traceback objects don't store all this information -- they store the start line number, for backward compatibility, and the "last instruction" value. +The rest can be computed from the last instruction (``tb_lasti``) with the help of the locations table. +For Python code, a convenient method exists, :meth:`~codeobject.co_positions`, which returns an iterator of :samp:`({line}, {endline}, {column}, {endcolumn})` tuples, one per instruction. +There is also ``co_lines()`` which returns an iterator of :samp:`({start}, {end}, {line})` tuples, where :samp:`{start}` and :samp:`{end}` are bytecode offsets. +The latter is described by :pep:`626`; it is more compact, but doesn't return end line numbers or column offsets. +From C code, you have to call :c:func:`PyCode_Addr2Location`. + +Fortunately, the locations table is only consulted by exception handling (to set ``tb_lineno``) and by tracing (to pass the line number to the tracing function). +In order to reduce the overhead during tracing, the mapping from instruction offset to line number is cached in the ``_co_linearray`` field. 
+ + +TODO: +- co_consts, co_names, co_varnames, and their ilk diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md index 284a8b7aee5765..f81494bee0390e 100644 --- a/InternalDocs/code_objects.md +++ b/InternalDocs/code_objects.md @@ -1,5 +1,140 @@ -Code objects -============ +# Code objects -Coming soon. +A `CodeObject` is a builtin Python type that represents a compiled executable, +such as a compiled function or class. +It contains a sequence of bytecode instructions along with its associated +metadata: data which is necessary to execute the bytecode instructions (such +as the values of the constants they access) or context information such as +the source code location, which is useful for debuggers and other tools. + +Since 3.11, the final field of the `PyCodeObject` C struct is an array +of indeterminate length containing the bytecode, `code->co_code_adaptive`. +(In older versions the bytecode was a +[`bytes`](https://docs.python.org/dev/library/stdtypes.html#bytes) +object, `code->co_code`; this was changed to save an allocation and to +allow it to be mutated.) + +Code objects are typically produced by the bytecode [compiler](compiler.md), +although they are often written to disk by one process and read back in by another. +The disk version of a code object is serialized using the +[marshal](https://docs.python.org/dev/library/marshal.html) protocol. +Some code objects are pre-loaded into the interpreter using +[`Tools/build/deepfreeze.py`](../Tools/build/deepfreeze.py), +which writes +[`Python/deepfreeze/deepfreeze.c`](../Python/deepfreeze/deepfreeze.c). + +Code objects are nominally immutable. +Some fields (including `co_code_adaptive` and fields for runtime +information such as `_co_monitoring`) are mutable, but mutable fields are +not included when code objects are hashed or compared. 
+ +## Source code locations + +Whenever an exception occurs, the interpreter adds a traceback entry to +the exception for the current frame, as well as each frame on the stack that +it unwinds. +The `tb_lineno` field of a traceback entry is (lazily) set to the line +number of the instruction that was executing in the frame at the time of +the exception. +This field is computed from the locations table, `co_linetable`, by the function +[`PyCode_Addr2Line`](https://docs.python.org/dev/c-api/code.html#c.PyCode_Addr2Line). +Despite its name, `co_linetable` includes more than line numbers; it represents +a 4-number source location for every instruction, indicating the precise line +and column at which it begins and ends. This is a significant amount of data, +so a compact format is very important. + +Note that traceback objects don't store all this information -- they store the start line +number, for backward compatibility, and the "last instruction" value. +The rest can be computed from the last instruction (`tb_lasti`) with the help of the +locations table. For Python code, there is a convenience method +[`codeobject.co_positions`](https://docs.python.org/dev/reference/datamodel.html#codeobject.co_positions) +which returns an iterator of `({line}, {endline}, {column}, {endcolumn})` tuples, +one per instruction. +There is also `co_lines()` which returns an iterator of `({start}, {end}, {line})` tuples, +where `{start}` and `{end}` are bytecode offsets. +The latter is described by [`PEP 626`](https://peps.python.org/pep-0626/); it is more +compact, but doesn't return end line numbers or column offsets. +From C code, you need to call +[`PyCode_Addr2Location`](https://docs.python.org/dev/c-api/code.html#c.PyCode_Addr2Location). + +As the locations table is only consulted by exception handling (to set ``tb_lineno``) +and by tracing (to pass the line number to the tracing function), lookup is not +performance critical. 
+In order to reduce the overhead during tracing, the mapping from instruction offset to +line number is cached in the ``_co_linearray`` field. + +### Format of the locations table + +The `co_linetable` bytes object of code objects contains a compact +representation of the source code positions of instructions, which are +returned by the `co_positions()` iterator. + +> [!NOTE] +> Not to be confused with [`Objects/lnotab_notes.txt`](../Objects/lnotab_notes.txt), +> which describes the 3.10 format, which stores only the start line for each instruction. +> For backwards compatibility this format is still supported by the `co_lnotab` property. + +`co_linetable` consists of a sequence of location entries. +Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with the most significant bit unset. + +Each entry contains the following information: +* The number of code units covered by this entry (length) +* The start line +* The end line +* The start column +* The end column + +The first byte has the following format: + +Bit 7 | Bits 3-6 | Bits 0-2 + ---- | ---- | ---- + 1 | Code | Length (in code units) - 1 + +The codes are enumerated in the `_PyCodeLocationInfoKind` enum. + +## Variable length integer encodings + +Integers are often encoded using a variable length integer encoding. + +### Unsigned integers (varint) + +Unsigned integers are encoded in 6 bit chunks, least significant first. +Each chunk but the last has bit 6 set. 
+For example: + +* 63 is encoded as `0x3f` +* 200 is encoded as `0x48`, `0x03` + +### Signed integers (svarint) + +Signed integers are encoded by converting them to unsigned integers, using the following function: +```Python +def convert(s): + if s < 0: + return ((-s)<<1) | 1 + else: + return (s<<1) +``` + +*Location entries* + +The meaning of the codes and the following bytes are as follows: + +Code | Meaning | Start line | End line | Start column | End column + ---- | ---- | ---- | ---- | ---- | ---- + 0-9 | Short form | Δ 0 | Δ 0 | See below | See below + 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte + 13 | No column info | Δ svarint | Δ 0 | None | None + 14 | Long form | Δ svarint | Δ varint | varint | varint + 15 | No location | None | None | None | None + +The Δ means the value is encoded as a delta from another value: +* Start line: Delta from the previous start line, or `co_firstlineno` for the first entry. +* End line: Delta from the start line + +*The short forms* + +Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the start column divided by 8 (and rounded down). +* Start column: `(code*8) + ((second_byte>>4)&7)` +* End column: `start_column + (second_byte&15)` diff --git a/InternalDocs/compiler.md b/InternalDocs/compiler.md index 37964bd99428df..ed4cfb23ca51f7 100644 --- a/InternalDocs/compiler.md +++ b/InternalDocs/compiler.md @@ -443,14 +443,12 @@ reference to the source code (filename, etc). All of this is implemented by Code objects ============ -The result of `PyAST_CompileObject()` is a `PyCodeObject` which is defined in +The result of `_PyAST_Compile()` is a `PyCodeObject` which is defined in [Include/cpython/code.h](../Include/cpython/code.h). And with that you now have executable Python bytecode! -The code objects (byte code) are executed in [Python/ceval.c](../Python/ceval.c). 
-This file will also need a new case statement for the new opcode in the big switch -statement in `_PyEval_EvalFrameDefault()`. - +The code objects (byte code) are executed in `_PyEval_EvalFrameDefault()` +in [Python/ceval.c](../Python/ceval.c). Important files =============== diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md index dcfddc99370c0e..4c10cbbed37735 100644 --- a/InternalDocs/interpreter.md +++ b/InternalDocs/interpreter.md @@ -16,7 +16,7 @@ from the instruction definitions in [Python/bytecodes.c](../Python/bytecodes.c) which are written in [a DSL](../Tools/cases_generator/interpreter_definition.md) developed for this purpose. -Recall that the [Python Compiler](compiler.md) produces a [`CodeObject`](code_object.md), +Recall that the [Python Compiler](compiler.md) produces a [`CodeObject`](code_objects.md), which contains the bytecode instructions along with static data that is required to execute them, such as the consts list, variable names, [exception table](exception_handling.md#format-of-the-exception-table), and so on. diff --git a/InternalDocs/locations.md b/InternalDocs/locations.md deleted file mode 100644 index 91a7824e2a8e4d..00000000000000 --- a/InternalDocs/locations.md +++ /dev/null @@ -1,69 +0,0 @@ -# Locations table - -The `co_linetable` bytes object of code objects contains a compact -representation of the source code positions of instructions, which are -returned by the `co_positions()` iterator. - -`co_linetable` consists of a sequence of location entries. -Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with most significant bit unset. 
- -Each entry contains the following information: -* The number of code units covered by this entry (length) -* The start line -* The end line -* The start column -* The end column - -The first byte has the following format: - -Bit 7 | Bits 3-6 | Bits 0-2 - ---- | ---- | ---- - 1 | Code | Length (in code units) - 1 - -The codes are enumerated in the `_PyCodeLocationInfoKind` enum. - -## Variable length integer encodings - -Integers are often encoded using a variable length integer encoding - -### Unsigned integers (varint) - -Unsigned integers are encoded in 6 bit chunks, least significant first. -Each chunk but the last has bit 6 set. -For example: - -* 63 is encoded as `0x3f` -* 200 is encoded as `0x48`, `0x03` - -### Signed integers (svarint) - -Signed integers are encoded by converting them to unsigned integers, using the following function: -```Python -def convert(s): - if s < 0: - return ((-s)<<1) | 1 - else: - return (s<<1) -``` - -## Location entries - -The meaning of the codes and the following bytes are as follows: - -Code | Meaning | Start line | End line | Start column | End column - ---- | ---- | ---- | ---- | ---- | ---- - 0-9 | Short form | Δ 0 | Δ 0 | See below | See below - 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte - 13 | No column info | Δ svarint | Δ 0 | None | None - 14 | Long form | Δ svarint | Δ varint | varint | varint - 15 | No location | None | None | None | None - -The Δ means the value is encoded as a delta from another value: -* Start line: Delta from the previous start line, or `co_firstlineno` for the first entry. -* End line: Delta from the start line - -### The short forms - -Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the start column divided by 8 (and rounded down). 
-* Start column: `(code*8) + ((second_byte>>4)&7)` -* End column: `start_column + (second_byte&15)` diff --git a/Objects/lnotab_notes.txt b/Objects/lnotab_notes.txt index 0f3599340318f0..003f78acc32193 100644 --- a/Objects/lnotab_notes.txt +++ b/Objects/lnotab_notes.txt @@ -1,7 +1,7 @@ Description of the internal format of the line number table in Python 3.10 and earlier. -(For 3.11 onwards, see Objects/locations.md) +(For 3.11 onwards, see InternalDocs/code_objects.md) Conceptually, the line number table consists of a sequence of triples: start-offset (inclusive), end-offset (exclusive), line-number.
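
The variable-length integer encodings, first-byte layout, and short-form column arithmetic described in the `InternalDocs/code_objects.md` hunk of this patch can be sketched in Python. This is a minimal illustration under stated assumptions, not CPython's implementation: the helper names (`encode_varint`, `decode_varint`, `to_unsigned`, `from_unsigned`, `split_first_byte`, `short_form_columns`) are hypothetical and introduced only for this sketch; the bit layouts are taken from the documented text.

```python
def encode_varint(value):
    """Encode an unsigned int in 6-bit chunks, least significant first."""
    if value < 0:
        raise ValueError("varint encodes unsigned integers only")
    out = bytearray()
    while value >= 64:
        out.append(0x40 | (value & 0x3F))  # bit 6 set: more chunks follow
        value >>= 6
    out.append(value)  # final chunk: bit 6 clear
    return bytes(out)


def decode_varint(data, pos=0):
    """Return (value, next_pos) for the varint starting at data[pos]."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x3F) << shift
        shift += 6
        if not byte & 0x40:  # bit 6 clear marks the last chunk
            return value, pos


def to_unsigned(s):
    """Same mapping as the documented `convert`: the sign lives in bit 0."""
    return ((-s) << 1) | 1 if s < 0 else s << 1


def from_unsigned(u):
    """Inverse of to_unsigned (hypothetical helper, not in the source)."""
    return -(u >> 1) if u & 1 else u >> 1


def split_first_byte(first_byte):
    """Split an entry's first byte into (code, length in code units)."""
    if not first_byte & 0x80:
        raise ValueError("first byte of an entry must have bit 7 set")
    code = (first_byte >> 3) & 0x0F   # bits 3-6
    length = (first_byte & 0x07) + 1  # bits 0-2 store length - 1
    return code, length


def short_form_columns(code, second_byte):
    """Start and end column for a short-form entry (codes 0-9)."""
    start_column = code * 8 + ((second_byte >> 4) & 7)
    end_column = start_column + (second_byte & 15)
    return start_column, end_column
```

With these helpers, the two worked varint examples from the text can be checked directly: `encode_varint(63)` gives the single byte `0x3f`, and `encode_varint(200)` gives `0x48`, `0x03`.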